
The best way to filter large data sets

  • I have a query about how to filter relevant records from a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria.

    The relevant tables look roughly like this:

    trade_metadata             18m rows, 10 GB
    trade_economics            18m rows, 15 GB
    business_event             18m rows, 11 GB
    trade_business_event_link  18m rows, 3 GB

     

    One of our reports now takes ages to run (> 5 hours). The underlying proc has been optimized time and again, but new criteria keep getting added, so we start struggling again. The proc is pretty standard: join all the tables and apply a host of WHERE clauses (20 at the last count).

    I was wondering if I have a problem large enough to consider big data solutions and get out of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig, etc., but being very new to this space, I am a little confused about the best way to proceed.

    I imagine this is not a map-reduce problem. HBase does seem to offer Filters but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:

    1. Is the data set large enough for big data solutions? (Do I need entry into the billion club first?)
    2. If it is, would HBase be a good choice to implement this?
    3. We are not moving away from Oracle anytime soon, even though the volumes are growing steadily. Am I looking at populating HDFS every day with a dump from the relevant tables? Or are daily delta writes possible?

    Thanks very much!

      July 29, 2021 2:22 PM IST
    0
  • Use Spark or Flink for the ETL, and, if you aggregate the data after that (i.e. you offer OLAP querying on the filtered data), pick a column store such as Druid.

    I call it ETL because you extract the data, filter it (the transform step) and then load it into some storage, either a database or a filesystem.

    You can do this ETL in either batch or streaming mode. Both Spark and Flink provide streaming at performance close to batch, and Flink also provides full streaming semantics. Druid has ways to ingest data in real time too, so your OLAP queries will pick up events down to the second. If you have predefined reports and their performance is lacking, do pre-aggregations in the Flink/Spark transformation flow and load them into additional Druid datasources (i.e. raw events in one datasource, data aggregated by some dimensions in another, and more heavily aggregated data in yet another).
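
    To make the batch variant concrete, here is a minimal PySpark sketch of that transform step, assuming daily extracts already sit on HDFS as Parquet; the paths, column names and predicates are made-up placeholders, not your actual schema:

    # Minimal batch filter / pre-aggregation sketch in PySpark.
    # Paths, column names and predicates are placeholders only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("eod-trade-filter").getOrCreate()

    # Daily extracts already landed on HDFS as Parquet.
    trades = spark.read.parquet("/data/raw/trade_economics/2021-07-29")
    meta = spark.read.parquet("/data/raw/trade_metadata/2021-07-29")

    # Join once, then express the report criteria as column predicates
    # instead of a long WHERE clause in a stored proc.
    filtered = (
        trades.join(meta, "trade_id")
              .where((F.col("status") == "CONFIRMED") &
                     (F.col("notional") > 1_000_000))
    )

    # Raw filtered events feed one Druid datasource ...
    filtered.write.mode("overwrite").parquet("/data/filtered/eod/2021-07-29")

    # ... and a pre-aggregated view feeds another, for the predefined reports.
    (filtered.groupBy("book", "counterparty")
             .agg(F.sum("notional").alias("total_notional"))
             .write.mode("overwrite")
             .parquet("/data/agg/eod_by_book/2021-07-29"))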

      October 19, 2021 2:55 PM IST
    0
  • Welcome to the incredibly varied big data eco-system. If your dataset size is big enough that it is taxing your ability to analyze it using traditional tools, then it is big enough for big data technologies. As you have probably seen, there are a huge number of big data tools available with many of them having overlapping capabilities.

    First of all, you did not mention whether you have a cluster set up. If not, I would suggest looking into the products from Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools (HBase, Spark, Sqoop, etc.) and make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.

    Next you will need to get your data out of Oracle and into some format in the Hadoop cluster to analyze it. The tool most often used to move data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, or files on the Hadoop Distributed File System (HDFS), and it can do incremental imports for updates instead of whole-table loads. Which destination you choose affects which tools you can use in the next step: HDFS is the most flexible in that you can access it from Pig, MapReduce code you write, Hive, Cloudera Impala, and others. I have found HBase very easy to use, but others highly recommend Hive.
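
    To illustrate the incremental import, here is a rough sketch of a daily delta pull from Oracle into HDFS, driven from a small Python wrapper around the sqoop command line; the connection string, credentials file, and table and column names are placeholders for whatever your environment actually uses:

    # Sketch of a daily incremental Sqoop import driven from Python.
    # Connection string, credentials file, table and column names are placeholders.
    import subprocess

    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
        "--username", "report_user",
        "--password-file", "/user/etl/.oracle_pw",  # password read from a file, not the command line
        "--table", "TRADE_ECONOMICS",
        "--target-dir", "/data/raw/trade_economics",
        "--incremental", "lastmodified",            # only rows changed since the last run
        "--check-column", "LAST_UPDATED",
        "--last-value", "2021-07-28 18:00:00",
        "--merge-key", "TRADE_ID",                  # fold updated rows into the existing files
        "-m", "4",                                  # four parallel mappers
    ]

    subprocess.run(sqoop_cmd, check=True)

    Sqoop reports the new --last-value at the end of each incremental run; persist it (or let a saved sqoop job track it for you) and feed it into the next day's import.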

    An aside: there is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims up to a 100x speedup compared to traditional Hadoop MapReduce jobs. Many projects, including Hive, can run on Spark, giving you the ability to run SQL-like queries on big data and get results very quickly.

    Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, you can reuse a lot of your SQL knowledge instead of having to program in Java or learn Pig Latin (not that it's very hard). Pig translates Pig Latin to MapReduce jobs (as does Hive's query language, for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, the analytics, and the movement of results back out of the cluster (sqoop export for that last part). Oozie lets you schedule recurring workflows like yours so you can focus on the results, not the process; its full capabilities are covered in the Oozie documentation.
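
    To show how much of that SQL knowledge carries over, here is a sketch of one end-of-day report written as a plain SQL query over the imported tables and run through a Hive-enabled Spark session; only the table names come from your post, the column names and criteria are invented for illustration:

    # End-of-day report as a SQL query over Hive tables (illustrative schema).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("eod-report")
             .enableHiveSupport()   # lets spark.sql() see tables in the Hive metastore
             .getOrCreate())

    report = spark.sql("""
        SELECT m.trade_id, m.counterparty, e.notional, b.event_type
        FROM   trade_metadata m
        JOIN   trade_economics e           ON m.trade_id = e.trade_id
        JOIN   trade_business_event_link l ON m.trade_id = l.trade_id
        JOIN   business_event b            ON l.event_id = b.event_id
        WHERE  b.event_type = 'SETTLEMENT'
          AND  e.trade_date = '2021-07-29'
          -- ... the remaining report criteria go here ...
    """)

    report.write.mode("overwrite").option("header", True).csv("/data/reports/eod/2021-07-29")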

    There are a crazy number of tools at your disposal, and the speed of change in this eco-system can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of them (like Hive, Pig, Oozie, ...) to see what works best for your application.

    This post was edited by Viaan Prakash at August 10, 2021 2:53 PM IST
      August 10, 2021 2:53 PM IST
    0