How does impala provide faster query response compared to hive

  •  

    I have recently started querying large sets of CSV data stored on HDFS using Hive and Impala. As expected, I get better response times with Impala than with Hive for the queries I have tried so far.

    I am wondering if there are some types of queries/use cases that still need Hive and where Impala is not a good fit.

    How does Impala provide faster query response compared to Hive for the same data on HDFS?

      May 23, 2019 2:53 PM IST

  • You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".

    In other words, Impala doesn't use Map/Reduce at all. It has long-running daemons on every node that read the data in HDFS directly and process it in memory, so these daemons can return results quickly without having to go through a whole Map/Reduce job.

    The reason this is faster is that every Map/Reduce job carries a fixed overhead (starting the job and its tasks, and writing intermediate results to disk), so by short-circuiting Map/Reduce altogether you can get a large reduction in runtime.
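
    The overhead argument above can be sketched as a toy cost model. This is only an illustration: the function names and all the numbers are made-up assumptions, not measurements of Hive or Impala.

```python
# Toy cost model. Every number here is a hypothetical assumption,
# not a measurement of Hive or Impala.

def mapreduce_query_seconds(scan_seconds: float, stages: int,
                            job_overhead_seconds: float) -> float:
    """Each Map/Reduce stage pays a fixed startup cost on top of the scan."""
    return stages * job_overhead_seconds + scan_seconds


def daemon_query_seconds(scan_seconds: float) -> float:
    """Long-running daemons are already up, so only the scan is counted."""
    return scan_seconds


SCAN_SECONDS = 5.0           # hypothetical time to read and process the data
JOB_OVERHEAD_SECONDS = 20.0  # hypothetical fixed startup cost per job
STAGES = 2                   # hypothetical number of Map/Reduce jobs in the plan

hive_like = mapreduce_query_seconds(SCAN_SECONDS, STAGES, JOB_OVERHEAD_SECONDS)
impala_like = daemon_query_seconds(SCAN_SECONDS)
print(f"Map/Reduce-style: {hive_like:.0f}s, daemon-style: {impala_like:.0f}s")
```

    In this model the fixed overhead dominates short queries and matters much less when the scan itself takes a long time.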

    That being said, Impala does not replace Hive; the two are good for very different use cases. Unlike Hive, Impala doesn't provide fault tolerance within a query, so if there is a problem during your query, the query is lost and has to be rerun from scratch. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive. Impala can be great for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look at some data and analyze it without building robust jobs.

    Also, from my personal experience, Impala is still not very mature, and I have seen some crashes when the amount of data is larger than the available memory.
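
    Since a failed query is simply lost, the caller has to rerun it as a whole. A minimal sketch of that, assuming a hypothetical `run_query` callable that stands in for whatever client call submits the query (it is not an Impala API):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(run_query: Callable[[], T], attempts: int = 3,
                   wait_seconds: float = 0.0) -> T:
    """Rerun the whole query from scratch on failure; no partial progress is kept."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise RuntimeError(f"query failed after {attempts} attempts") from last_error


# Hypothetical stand-in for a query that fails twice, then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated node failure")
    return [("row", 1)]


result = run_with_retry(flaky_query)
```

    Every retry pays the full cost of the query again, which is why this is tolerable for small ad-hoc queries but not for long ETL jobs.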
      May 23, 2019 2:54 PM IST