Difference between hive, impala and beeline

Tags : hive Impala beeline

Samar Patil

I am new to the Hadoop ecosystem tools. Can anyone help me understand the difference between Hive, Impala and Beeline?

Thanks in advance!

September 14, 2021 1:54 PM IST

Viaan Prakash

Beeline versus Hive CLI

HDP supports two Hive clients: the Hive CLI and Beeline. The primary difference between the two involves how the clients connect to Hive.

The Hive CLI connects directly to HDFS and the Hive Metastore, and can be used only on a host with access to those services.
Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar.

Hortonworks recommends using HiveServer2 and a JDBC client (such as Beeline) as the primary way to access Hive. This approach uses SQL standard-based authorization or Ranger-based authorization. However, some users may wish to access Hive data from other applications, such as Pig. For these use cases, use the Hive CLI and storage-based authorization.

Beeline Operating Modes and HiveServer2 Transport Modes

Beeline supports the following modes of operation:

Table 2.8. Beeline Modes of Operation

Operating Mode	Description
Embedded	The Beeline client and the Hive installation both reside on the same host machine. No TCP connectivity is required.
Remote	Use remote mode to support multiple, concurrent clients executing queries against the same remote Hive installation. Remote transport mode supports authentication with LDAP and Kerberos. It also supports encryption with SSL. TCP connectivity is required.

Administrators may start HiveServer2 in one of the following transport modes:

Table 2.9. HiveServer2 Transport Modes

Transport Mode	Description
TCP	HiveServer2 uses TCP transport for sending and receiving Thrift RPC messages.
HTTP	HiveServer2 uses HTTP transport for sending and receiving Thrift RPC messages.
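
As a rough illustration of how the transport mode shows up on the client side, the sketch below builds a Beeline JDBC connection string for each mode. The function name, the example host, the default ports (10000 for TCP, 10001 for HTTP) and the `cliservice` HTTP path are assumptions based on common Hive defaults, not values stated in this thread, so check them against your own hive-site.xml.

```python
# Assumed defaults; a real cluster may use different ports.
DEFAULT_PORTS = {"tcp": 10000, "http": 10001}


def build_hs2_jdbc_url(host, transport_mode="tcp", port=None,
                       database="default", http_path="cliservice"):
    """Return a jdbc:hive2:// URL for the given HiveServer2 transport mode.

    This is a hypothetical helper for illustration only.
    """
    if transport_mode not in DEFAULT_PORTS:
        raise ValueError(f"unknown transport mode: {transport_mode!r}")
    if port is None:
        port = DEFAULT_PORTS[transport_mode]

    url = f"jdbc:hive2://{host}:{port}/{database}"
    if transport_mode == "http":
        # HTTP mode is selected through session parameters on the URL.
        url += f";transportMode=http;httpPath={http_path}"
    return url


if __name__ == "__main__":
    print(build_hs2_jdbc_url("hs2.example.com"))
    print(build_hs2_jdbc_url("hs2.example.com", "http"))
```

The resulting string is what you would pass to `beeline -u` or to `!connect`.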

While running in TCP transport mode, HiveServer2 supports the following authentication schemes:

Table 2.10. Authentication Schemes with TCP Transport Mode

Authentication Scheme	Description
Kerberos	A network authentication protocol that uses the concept of 'tickets' to allow nodes in a network to securely identify themselves. Administrators must specify `hive.server2.authentication=kerberos`, `hive.server2.authentication.kerberos.principal = hive/_HOST@YOUR-REALM.COM`, and `hive.server2.authentication.kerberos.keytab = /etc/hive/conf/hive.keytab` in the hive-site.xml configuration file to use this authentication scheme.
LDAP	The Lightweight Directory Access Protocol, an application-layer protocol that uses the concept of 'directory services' to share information across a network. Administrators must specify hive.server2.authentication=ldap in the hive-site.xml configuration file to use this type of authentication.
PAM	Pluggable Authentication Modules, or PAM, allow administrators to integrate multiple authentication schemes into a single API. Administrators must specify hive.server2.authentication=pam in the hive-site.xml configuration file to use this authentication scheme.
Custom	Authentication provided by a custom implementation of the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The implementing class must be available in the classpath for HiveServer2, and its name must be provided as the value of the hive.server2.custom.authentication.class property in the hive-site.xml configuration file.
None	The Beeline client performs no authentication with HiveServer2. Administrators must specify hive.server2.authentication=none in the hive-site.xml configuration file to use this authentication scheme.
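
To make the table above concrete, here is a minimal sketch that turns an authentication scheme into the matching hive-site.xml properties. The helper names are hypothetical, the property names are the ones listed in the table, and the value `custom` for the Custom scheme is an assumption since the table does not spell it out. A real deployment may need further properties (for example an LDAP server URL) that this thread does not mention.

```python
import xml.etree.ElementTree as ET

# Extra properties the table above lists for each scheme.
REQUIRED_EXTRAS = {
    "kerberos": (
        "hive.server2.authentication.kerberos.principal",
        "hive.server2.authentication.kerberos.keytab",
    ),
    "ldap": (),
    "pam": (),
    "custom": ("hive.server2.custom.authentication.class",),  # value "custom" is assumed
    "none": (),
}


def auth_properties(scheme, extras=None):
    """Return the hive-site.xml properties for one authentication scheme."""
    scheme = scheme.lower()
    if scheme not in REQUIRED_EXTRAS:
        raise ValueError(f"unknown authentication scheme: {scheme!r}")
    extras = dict(extras or {})
    missing = [k for k in REQUIRED_EXTRAS[scheme] if k not in extras]
    if missing:
        raise ValueError(f"missing properties for {scheme}: {missing}")
    props = {"hive.server2.authentication": scheme}
    props.update(extras)
    return props


def to_hive_site_xml(props):
    """Render properties in the <configuration><property> layout of hive-site.xml."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    kerberos = auth_properties("kerberos", {
        "hive.server2.authentication.kerberos.principal": "hive/_HOST@YOUR-REALM.COM",
        "hive.server2.authentication.kerberos.keytab": "/etc/hive/conf/hive.keytab",
    })
    print(to_hive_site_xml(kerberos))
```

Validating the required extras up front mirrors the table: Kerberos and Custom need additional properties, while LDAP, PAM and None are shown with only the single authentication setting.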

November 2, 2021 2:45 PM IST

Vaibhav Mali

259

January 5, 2022 2:29 PM IST

Advika Banerjee

Apache Hive :

1] Apache Hive is a data warehouse infrastructure built on top of the Hadoop platform for data-intensive tasks such as querying, analysis, processing and visualization.
2] Hive generates query expressions at compile time.
3] Every Hive query suffers from the "cold start" problem.
4] Hive translates queries into MapReduce jobs under the hood, which adds overhead.
5] Hive is a more universal, versatile and pluggable language.
6] For an upgrade project where compatibility and speed are equally important, Hive is an ideal choice.

Cloudera Impala :

1] Impala is an excellent choice for programmers for running queries on HDFS and Apache HBase as it doesn't require data to be moved or transformed.
2] Impala does runtime code generation for "big loops" using LLVM.
3] Impala avoids startup overhead because its daemon processes are started at boot time and are always ready to process a query.
4] Impala responds quickly through massively parallel processing.
5] Impala is used to unleash its brute processing power and deliver lightning-fast analytic results.
6] Impala is an ideal choice when starting a new project.

Beeline :

1] Hive CLI connects directly to the Hive Driver and requires that Hive be installed on the same machine as the client.
2] However, Beeline connects to HiveServer2 and does not require the installation of Hive libraries on the same machine as the client.
3] Beeline is a thin client that also uses the Hive JDBC driver but instead executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.
4] Cloudera's Sentry security is enforced through HiveServer2, which the Hive CLI bypasses, so queries run from the Hive CLI will not follow the policies from Sentry. According to the Cloudera docs you should not use the Hive CLI or WebHCat. Use Beeline or impala-shell instead.
5] Connect with Beeline: url is a JDBC connection string pointing to the HiveServer2 host.
terminal> beeline -u url -n username -p password
OR terminal> beeline
beeline> !connect jdbc:hive2://HiveServer2Host:Port
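
If you want to launch Beeline from a script, the command above can be assembled as an argument list. This is only a sketch: the helper name, host, username and password are placeholders, and it uses just the `-u`, `-n` and `-p` flags shown in this answer.

```python
import shlex


def beeline_command(url, username=None, password=None):
    """Build the argv list for `beeline -u url -n username -p password`.

    Hypothetical helper; it only assembles the command and does not run it.
    """
    argv = ["beeline", "-u", url]
    if username is not None:
        argv += ["-n", username]
    if password is not None:
        argv += ["-p", password]
    return argv


if __name__ == "__main__":
    argv = beeline_command("jdbc:hive2://hs2.example.com:10000", "alice", "secret")
    # shlex.join quotes arguments where needed, e.g. URLs containing ';'.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start Beeline, provided the `beeline` executable is on the PATH. Building a list rather than a single string avoids shell quoting problems when the JDBC URL contains `;` session parameters.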

October 1, 2021 1:21 PM IST

Maryam Bains

Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine. Hortonworks and Amazon do not support Impala. Update: Hortonworks has since merged with Cloudera, and the merged company is named Cloudera. Amazon and MapR now also support Impala. Impala does not use MapReduce under the hood and works faster than Hive.

Apache Hive is a database built on top of Hadoop for providing data summarization, query, and analysis. It is supported by all Hadoop vendors. It is very reliable, scales almost without limit, works with very large data, and uses MapReduce framework primitives under the hood even when configured to run on the Tez execution engine. It can use the Tez or MR (deprecated in Hive 2.x) execution engines.

Beeline is a Hive client. See here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_dataintegration/content/beeline-vs-hive-cli.html

October 2, 2021 2:17 PM IST
