What is the most mature library for building a Data Analytics Pipeline in Java/Scala for Hadoop?

  • I recently found many options, and I'm interested in how they compare, primarily in maturity and stability.

    1. Crunch - https://github.com/cloudera/crunch
    2. Scrunch - https://github.com/cloudera/crunch/tree/master/scrunch
    3. Cascading - http://www.cascading.org/
    4. Scalding - https://github.com/twitter/scalding
    5. FlumeJava
    6. Scoobi - https://github.com/NICTA/scoobi/
      December 25, 2020 12:22 PM IST
  • Scalding also has the advantage of significant open-source projects built atop it, such as the Matrix API and Algebird.

    Here are some examples: http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

    Cascalog was released almost two years before Scalding, and arguably has more advanced features for building robust workflows: https://github.com/nathanmarz/cascalog/wiki

      December 28, 2020 11:54 AM IST
  • As I'm a developer of Scoobi, don't expect an unbiased answer.

    First of all, FlumeJava is an internal Google project that provides an (awesomely productive) abstraction on top of MapReduce (not Hadoop, though). They released a paper about it, which is what projects like Scoobi and Crunch are based on.

    If your only criterion is maturity, I guess Cascading is your best bet.

    However, if you're looking for the (imho superior) FlumeJava style abstraction, you'll want to pick between (S)crunch and Scoobi.
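
    To give a feel for the shape of that abstraction, here is a minimal sketch in plain Java streams. It is not the API of FlumeJava, Crunch, or Scoobi, and it runs in memory rather than on Hadoop. The class and method names (WordCountSketch, countWords) are made up for illustration. The comments only map each step loosely onto the operation names used in the FlumeJava paper (parallelDo, groupByKey, combineValues).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain JDK streams, no Hadoop involved.
final class WordCountSketch {

    // Counts words across all lines by chaining a per-element transform
    // with a group-by-key and a per-key combine.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // roughly "parallelDo": split each line into words
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // drop the empty token produced by blank lines
                .filter(word -> !word.isEmpty())
                // roughly "groupByKey" plus "combineValues": group equal words, count each group
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("to be or not", "to be")));
    }
}
```

    The real libraries differ in that the pipeline is built lazily and then planned into one or more MapReduce jobs, which is the part this in-memory sketch leaves out.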

    The biggest difference, superficial as it may be, is that Crunch is written in Java with Scala bindings (Scrunch), while Scoobi is written in Scala with Java bindings (scoobij). They're both really solid choices, and you won't go wrong whichever you choose. I'm sure the story is similar for Crunch, but Scoobi is being used in real projects and is under continual development. We're very active in fixing bugs and implementing features.

    Anyway, they're both great projects with great people behind them, and they were released within days of each other. They provide the same abstraction (with a similar API), so switching between the two won't be an issue in the slightest. My recommendation is to give them both a try and see what works for you. There's no lock-in with either project, so you don't need to commit :)

    And if you have any feedback for either project, please be sure to provide it :)

      August 13, 2021 1:06 PM IST