Difference between hive, impala and beeline

  • I am new to Hadoop ecosystem tools. Can anyone help me understand the difference between Hive, Impala, and Beeline?

    Thanks in advance!

     
      September 14, 2021 1:54 PM IST
  • Beeline versus Hive CLI

    HDP supports two Hive clients: the Hive CLI and Beeline. The primary difference between the two involves how the clients connect to Hive.

    • The Hive CLI connects directly to HDFS and the Hive Metastore, and can be used only on a host with access to those services.

    • Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar.

    Hortonworks recommends using HiveServer2 and a JDBC client (such as Beeline) as the primary way to access Hive. This approach uses SQL standard-based authorization or Ranger-based authorization. However, some users may wish to access Hive data from other applications, such as Pig. For these use cases, use the Hive CLI and storage-based authorization.

    Beeline Operating Modes and HiveServer2 Transport Modes

    Beeline supports the following modes of operation:

    Table 2.8. Beeline Modes of Operation

    Operating Mode: Description

    • Embedded: The Beeline client and the Hive installation both reside on the same host machine. No TCP connectivity is required.

    • Remote: Use remote mode to support multiple, concurrent clients executing queries against the same remote Hive installation. Remote transport mode supports authentication with LDAP and Kerberos. It also supports encryption with SSL. TCP connectivity is required.
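
    As a rough illustration of the two operating modes, the sketch below builds the JDBC URL that Beeline would be given in each case. The function name beeline_jdbc_url and the host hs2.example.com are made up for this example. It assumes the conventional HiveServer2 port 10000 and the "default" database, which your cluster may override.

```python
def beeline_jdbc_url(host=None, port=10000, database="default"):
    """Return a HiveServer2 JDBC URL suitable for Beeline.

    host=None  -> embedded mode (no host or port in the URL)
    host given -> remote mode (TCP connection to HiveServer2)
    """
    if host is None:
        # Embedded mode: Beeline and Hive live on the same machine,
        # so the URL carries no network endpoint.
        return "jdbc:hive2://"
    # Remote mode: point at a running HiveServer2 instance.
    # Port 10000 and database "default" are assumed defaults.
    return f"jdbc:hive2://{host}:{port}/{database}"


if __name__ == "__main__":
    print(beeline_jdbc_url())                   # embedded
    print(beeline_jdbc_url("hs2.example.com"))  # remote, hypothetical host
```

    The resulting string is what would follow beeline -u on the command line or !connect inside the Beeline shell.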


    Administrators may start HiveServer2 in one of the following transport modes:

    Table 2.9. HiveServer2 Transport Modes

    Transport Mode: Description

    • TCP: HiveServer2 uses TCP transport for sending and receiving Thrift RPC messages.

    • HTTP: HiveServer2 uses HTTP transport for sending and receiving Thrift RPC messages.
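
    On the client side, the transport mode mostly shows up in the JDBC URL. The sketch below is one way to build it. The helper name hs2_url and the host are hypothetical. The ports 10000 (TCP) and 10001 (HTTP) and the HTTP path "cliservice" are the usual HiveServer2 defaults and are assumed here, so check them against your own hive-site.xml. On the server side the corresponding property is hive.server2.transport.mode, where TCP mode is called "binary".

```python
def hs2_url(host, transport="tcp", port=None, database="default",
            http_path="cliservice"):
    """Build a HiveServer2 JDBC URL for the given transport mode.

    transport: "tcp" (Thrift over plain TCP) or "http" (Thrift over HTTP).
    Ports and http_path fall back to assumed HiveServer2 defaults.
    """
    if transport == "tcp":
        return f"jdbc:hive2://{host}:{port or 10000}/{database}"
    if transport == "http":
        # HTTP mode needs two extra session parameters in the URL.
        return (f"jdbc:hive2://{host}:{port or 10001}/{database}"
                f";transportMode=http;httpPath={http_path}")
    raise ValueError(f"unknown transport mode: {transport!r}")


if __name__ == "__main__":
    print(hs2_url("hs2.example.com"))                    # hypothetical host
    print(hs2_url("hs2.example.com", transport="http"))
```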


    While running in TCP transport mode, HiveServer2 supports the following authentication schemes:

    Table 2.10. Authentication Schemes with TCP Transport Mode

    Authentication Scheme: Description

    • Kerberos: A network authentication protocol that uses the concept of 'tickets' to allow nodes in a network to securely identify themselves. Administrators must specify hive.server2.authentication = kerberos, hive.server2.authentication.kerberos.principal = hive/_HOST@YOUR-REALM.COM, and hive.server2.authentication.kerberos.keytab = /etc/hive/conf/hive.keytab in the hive-site.xml configuration file to use this authentication scheme.

    • LDAP: The Lightweight Directory Access Protocol, an application-layer protocol that uses the concept of 'directory services' to share information across a network. Administrators must specify hive.server2.authentication=ldap in the hive-site.xml configuration file to use this type of authentication.

    • PAM: Pluggable Authentication Modules, or PAM, allow administrators to integrate multiple authentication schemes into a single API. Administrators must specify hive.server2.authentication=pam in the hive-site.xml configuration file to use this authentication scheme.

    • Custom: Authentication provided by a custom implementation of the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The implementing class must be available in the classpath for HiveServer2, and its name must be provided as the value of the hive.server2.custom.authentication.class property in the hive-site.xml configuration file.

    • None: The Beeline client performs no authentication with HiveServer2. Administrators must specify hive.server2.authentication=none in the hive-site.xml configuration file to use this authentication scheme.
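
    To make the property lists above concrete, here is a small sketch that renders the hive-site.xml entries for a chosen scheme. The property names and the Kerberos values come from the table above. The helper name hive_site_fragment and the class name com.example.MyAuthProvider are invented for illustration, and setting hive.server2.authentication to "custom" for the Custom scheme is an assumption, since the table does not state it. The output is a fragment to merge into an existing file, not a complete configuration.

```python
import xml.etree.ElementTree as ET

# Properties per authentication scheme, as listed in the table above.
AUTH_PROPERTIES = {
    "kerberos": {
        "hive.server2.authentication": "kerberos",
        "hive.server2.authentication.kerberos.principal": "hive/_HOST@YOUR-REALM.COM",
        "hive.server2.authentication.kerberos.keytab": "/etc/hive/conf/hive.keytab",
    },
    "ldap": {"hive.server2.authentication": "ldap"},
    "pam": {"hive.server2.authentication": "pam"},
    "custom": {
        # Assumption: the scheme itself is selected with the value "custom".
        "hive.server2.authentication": "custom",
        # Hypothetical class name; replace with your own implementation.
        "hive.server2.custom.authentication.class": "com.example.MyAuthProvider",
    },
    "none": {"hive.server2.authentication": "none"},
}


def hive_site_fragment(scheme):
    """Return the <configuration> XML for one authentication scheme."""
    root = ET.Element("configuration")
    for name, value in AUTH_PROPERTIES[scheme].items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hive_site_fragment("kerberos"))
```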

      November 2, 2021 2:45 PM IST
  • Impala vs Hive: Difference between SQL on Hadoop components
      January 5, 2022 2:29 PM IST
  • Apache Hive :

    1] Apache Hive is a data warehouse infrastructure built on the Hadoop platform for data-intensive tasks such as querying, analysis, processing, and visualization.
    2] Hive generates query expressions at compile time.
    3] Every Hive query suffers from the "cold start" problem.
    4] Hive translates queries into MapReduce jobs under the hood, which adds overhead.
    5] Hive is a more universal, versatile, and pluggable language.
    6] For an upgrade project where compatibility and speed are equally important, Hive is an ideal choice.

    Cloudera Impala :

    1] Impala is an excellent choice for programmers running queries on HDFS and Apache HBase, as it doesn't require data to be moved or transformed.
    2] Impala does runtime code generation for "big loops" using LLVM.
    3] Impala avoids startup overhead because its daemon processes are started at boot time and are always ready to process a query.
    4] Impala responds quickly through massively parallel processing.
    5] Impala is used to unleash brute processing power and give lightning-fast analytic results.
    6] Impala is an ideal choice when starting a new project.

    Beeline :

    1] The Hive CLI connects directly to the Hive Driver and requires that Hive be installed on the same machine as the client.
    2] Beeline, by contrast, connects to HiveServer2 and does not require Hive libraries to be installed on the same machine as the client.
    3] Beeline is a thin client that uses the Hive JDBC driver and executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.
    4] Cloudera's Sentry security works through HiveServer2, which the Hive CLI bypasses, so Hive run from the Hive CLI will not follow Sentry policy. According to the Cloudera docs, you should not use the Hive CLI or WebHCat. Use Beeline or impala-shell instead.
    5] Connect with Beeline: url is a JDBC connection string pointing to the HiveServer2 host.
    terminal> beeline -u url -n username -p password
    OR terminal> beeline
    beeline> !connect jdbc:hive2://HiveServer2Host:Port
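
    If you want to drive Beeline from a script rather than typing the command, something like the sketch below may help. It only assembles the argument list for the command shown above and does not run anything. The function name beeline_command, the host, and the user are placeholders. The optional -e flag, which passes a query string to Beeline, is an addition that is not part of the command in this answer.

```python
import shlex


def beeline_command(url, username=None, password=None, query=None):
    """Assemble the argv list for a non-interactive Beeline call."""
    argv = ["beeline", "-u", url]
    if username is not None:
        argv += ["-n", username]
    if password is not None:
        argv += ["-p", password]
    if query is not None:
        # -e runs the given query and exits (added for illustration).
        argv += ["-e", query]
    return argv


if __name__ == "__main__":
    argv = beeline_command(
        "jdbc:hive2://hs2.example.com:10000/default",  # hypothetical host
        username="alice",                              # hypothetical user
        query="SHOW TABLES;",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(a) for a in argv))
```

    Passing the list to subprocess.run(argv) would execute it on a machine where Beeline is installed. Keeping the arguments as a list avoids shell-quoting problems with the semicolons in JDBC URLs.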

      October 1, 2021 1:21 PM IST
  • Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine. Hortonworks and Amazon originally did not support Impala. Update: Hortonworks has merged with Cloudera, and the new company is named Cloudera. Amazon and MapR also support Impala. Impala does not use MapReduce under the hood and works faster than Hive.

    Apache Hive is a data warehouse built on top of Hadoop for data summarization, query, and analysis. It is supported by all Hadoop vendors. It is very reliable, scales virtually without limit, and works with very big data. It uses MapReduce framework primitives under the hood, even when configured to run on the Tez execution engine. It can use the Tez or MR (deprecated in Hive 2.x) execution engines.

    Beeline is a Hive client. See here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_dataintegration/content/beeline-vs-hive-cli.html

      October 2, 2021 2:17 PM IST