Hive or HBase for reporting?

  • I am trying to understand what the best big data solution for reporting purposes would be.

    Currently I narrowed it down to HBase vs Hive.

    The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:

    1. Show all users that logged in to the system in the last hour and whose origin is the US (see the sketch after this list).
    2. Show a graph of the games, ranked from most played to least played.
    3. Of all users in the system, show the percentage of paying vs. non-paying users.
    4. For a given user, show their entire history: how many games they played, what kinds of games they played, and their score in each and every game.
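
    For concreteness, report (1) is a single query. A minimal HiveQL sketch, assuming a hypothetical logins table with epoch-second timestamps:

        -- hypothetical table: logins(user_id STRING, country STRING, login_time BIGINT)
        SELECT user_id, country
        FROM logins
        WHERE country = 'US'
          AND login_time >= unix_timestamp() - 3600;  -- logged in within the last hour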

    The way I see it, there are 3 solutions:

    1. Store all the data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance: how will it perform when the data reaches 100 TB? Also, having Hadoop as the main database is probably not the best solution, since update operations will be hard to achieve, right?

    2. Store all the data in HBase and do the queries using Phoenix (see the index sketch after this list). This solution is nice, but HBase is a key/value store: if I join on a key that is not indexed, HBase will do a full scan, which will probably be even worse than Hive. I could put indexes on columns, but that would mean indexing almost every column, which I think is not recommended.

    3. Store all the data in HBase and do the queries in Hive, which communicates with HBase through its bridge (the HBase storage handler).
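
    A minimal sketch of what the indexing in option 2 would look like in Phoenix SQL (table and column names are made up for illustration):

        -- secondary index so that country lookups avoid a full table scan
        CREATE INDEX idx_users_country ON users (country);

        -- with the index in place, Phoenix can serve this from the index table
        SELECT user_id FROM users WHERE country = 'US';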

      August 28, 2021 1:04 PM IST
  • With the Hortonworks Data Platform (HDP), you can use the Hive-HBase integration to perform READ and WRITE operations on HBase tables. HBase is integrated with Hive through a StorageHandler, so the same data can be accessed from both Hive and HBase.
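
    For example, a minimal sketch of the storage-handler DDL (the table name and column mapping here are invented for illustration):

        -- Hive table backed by an existing HBase table "users"
        CREATE EXTERNAL TABLE hbase_users (rowkey STRING, country STRING, last_login BIGINT)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,profile:country,activity:last_login")
        TBLPROPERTIES ("hbase.table.name" = "users");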

      December 24, 2021 1:15 PM IST
  • Responses to each of your suggested solutions (based on my personal experience with a similar problem):
    1) You should not think of Hive as a regular RDBMS; it is best suited for immutable data. So it is like killing your box if you want to do updates through Hive.
    2) As Paul suggested in the comments, you can use Phoenix to create indexes, but we tried it and it was really slow at the data volume you describe (we saw slowness in HBase with ~100 GB of data).
    3) Hive over HBase is slower than Phoenix (we tried both, and Phoenix was faster for us).
    If you are going to do updates, then HBase is the best option you have, and you can use Phoenix with it. However, if you can make the updates in HBase, dump the data into Parquet, and then query it with Hive, it will be super fast (see the sketch below).
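
    A sketch of that Parquet dump in HiveQL, assuming the HBase data is already exposed to Hive as a table named hbase_game_events (a name invented here):

        -- materialize a Parquet copy optimized for scan-heavy reporting queries
        CREATE TABLE game_events_parquet STORED AS PARQUET AS
        SELECT * FROM hbase_game_events;

    The dump would be re-run (or appended per partition) on whatever schedule your reports can tolerate.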
      September 4, 2021 9:09 PM IST
  • You can use a lambda architecture: HBase along with a stream-compute tool such as Spark Streaming. You store the data in HBase, and when new data comes in, you update both the original data and the report through the stream-compute layer. When a new report is created, you can generate it from a full scan of HBase; after that, the report can be updated by the stream-compute layer. You can also use a MapReduce job to adjust the stream-compute result periodically, as sketched below.
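
    A hypothetical sketch of that periodic batch adjustment, written here as a Hive query (which can run as a MapReduce job); the table names are invented:

        -- rebuild the "most played games" report from the full history
        INSERT OVERWRITE TABLE games_played_report
        SELECT game_id, COUNT(*) AS plays
        FROM game_events
        GROUP BY game_id;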

     
      December 22, 2021 1:24 PM IST