Hive or HBase for reporting?

  • I am trying to understand what the best big data solution for reporting purposes would be.

    Currently I narrowed it down to HBase vs Hive.

    The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:

    1. Show all users that logged in to the system in the last hour and whose origin is the US (see the sketch after this list).
    2. Show a graph of the games, ranked from most played to least played.
    3. Of all users in the system, show the percentage of paying vs. non-paying users.
    4. For a given user, show their entire history: how many games they played, what kinds of games they played, and their score in each and every game.
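
    For concreteness, report (1) is a single query. A minimal HiveQL sketch, assuming a hypothetical logins table with epoch-second timestamps:

        -- hypothetical table: logins(user_id STRING, country STRING, login_time BIGINT)
        SELECT user_id, country
        FROM logins
        WHERE country = 'US'
          AND login_time >= unix_timestamp() - 3600;  -- logged in within the last hour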

    The way I see it, there are 3 solutions:

    1. Store all the data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance: how will it perform when the data reaches 100 TB? Also, having Hadoop as the main database is probably not the best solution, since update operations will be hard to achieve, right?

    2. Store all the data in HBase and do the queries using Phoenix (see the index sketch after this list). This solution is nice, but HBase is a key/value store: if I join on a key that is not indexed, HBase will do a full scan, which will probably be even worse than Hive. I could put indexes on columns, but that would mean indexing almost every column, which I think is not recommended.

    3. Store all the data in HBase and do the queries in Hive, which communicates with HBase through its bridge (the HBase storage handler).
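
    A minimal sketch of what the indexing in option 2 would look like in Phoenix SQL (table and column names are made up for illustration):

        -- secondary index so that country lookups avoid a full table scan
        CREATE INDEX idx_users_country ON users (country);

        -- with the index in place, Phoenix can serve this from the index table
        SELECT user_id FROM users WHERE country = 'US';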

      August 28, 2021 1:04 PM IST
  • With the Hortonworks Data Platform (HDP), you can use the Hive-HBase integration to perform READ and WRITE operations on HBase tables. HBase is integrated with Hive through a StorageHandler, so the same data can be accessed from both Hive and HBase.
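
    For example, a minimal sketch of the storage-handler DDL (the table name and column mapping here are invented for illustration):

        -- Hive table backed by an existing HBase table "users"
        CREATE EXTERNAL TABLE hbase_users (rowkey STRING, country STRING, last_login BIGINT)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,profile:country,activity:last_login")
        TBLPROPERTIES ("hbase.table.name" = "users");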

      December 24, 2021 1:15 PM IST
  • Responses to each of your suggested solutions (based on my personal experience with a similar problem):
    1) You should not think of Hive as a regular RDBMS; it is best suited for immutable data. So it is like killing your box if you want to do updates through Hive.
    2) As Paul suggested in the comments, you can use Phoenix to create indexes, but we tried it and it was really slow at the data volume you describe (we saw slowness in HBase with ~100 GB of data).
    3) Hive over HBase is slower than Phoenix (we tried both, and Phoenix was faster for us).
    If you are going to do updates, then HBase is the best option you have, and you can use Phoenix with it. However, if you can make the updates in HBase, dump the data into Parquet, and then query it with Hive, it will be super fast (see the sketch below).
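
    A sketch of that Parquet dump in HiveQL, assuming the HBase data is already exposed to Hive as a table named hbase_game_events (a name invented here):

        -- materialize a Parquet copy optimized for scan-heavy reporting queries
        CREATE TABLE game_events_parquet STORED AS PARQUET AS
        SELECT * FROM hbase_game_events;

    The dump would be re-run (or appended per partition) on whatever schedule your reports can tolerate.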
      September 4, 2021 9:09 PM IST
  • You can use a lambda architecture: HBase along with a stream-compute tool such as Spark Streaming. You store the data in HBase, and when new data comes in, you update both the original data and the report through the stream-compute layer. When a new report is created, you can generate it from a full scan of HBase; after that, the report can be updated by the stream-compute layer. You can also use a MapReduce job to adjust the stream-compute result periodically, as sketched below.
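
    A hypothetical sketch of that periodic batch adjustment, written here as a Hive query (which can run as a MapReduce job); the table names are invented:

        -- rebuild the "most played games" report from the full history
        INSERT OVERWRITE TABLE games_played_report
        SELECT game_id, COUNT(*) AS plays
        FROM game_events
        GROUP BY game_id;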

     
      December 22, 2021 1:24 PM IST