Do I need to learn Hadoop to be a Data Scientist?

  • An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?

      June 11, 2019 4:53 PM IST
  • Speaking as a former Hadoop engineer: it is not needed, but it helps. Hadoop is just one system, though the most common one: a Java-based platform and an ecosystem of products around it, which apply a particular technique, MapReduce, to obtain results in a timely manner. Hadoop is not used at Google, though I assure you they do big data analytics; Google uses its own systems, developed in C++. In fact, Hadoop was created as a result of Google publishing its MapReduce and BigTable (HBase in Hadoop) white papers.
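
    To make the MapReduce idea concrete, here is a toy, single-machine sketch in Python (my own illustration, not Hadoop's actual Java API): mappers emit (word, 1) pairs and the reducer sums the counts per key, which is what the classic Hadoop word-count job does across a cluster.

        # Toy MapReduce word count -- illustrates the technique only;
        # real Hadoop distributes the map and reduce tasks over many machines.
        from collections import defaultdict

        def map_phase(lines):
            # Map step: emit a (key, value) pair for every word.
            for line in lines:
                for word in line.split():
                    yield (word.lower(), 1)

        def reduce_phase(pairs):
            # Reduce step: sum the values grouped by key.
            counts = defaultdict(int)
            for word, n in pairs:
                counts[word] += n
            return dict(counts)

        docs = ["big data big insights", "data science needs data"]
        print(reduce_phase(map_phase(docs)))
        # -> {'big': 2, 'data': 3, 'insights': 1, 'science': 1, 'needs': 1}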

    Data scientists will interface with Hadoop engineers, though at smaller places you may be required to wear both hats. If you are strictly a data scientist, then whatever you use for your analytics (R, Excel, Tableau, etc.) will operate on only a small subset of the data; the analysis then needs to be converted to run against the full data set on Hadoop, roughly as sketched below.
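
    As a concrete, purely hypothetical illustration of that hand-off: an aggregation prototyped in pandas on a sample file gets rewritten in PySpark so the same logic can run over the full data set on the cluster. The file paths and column names here are invented.

        # Prototype on a small sample (single machine, pandas):
        import pandas as pd
        sample = pd.read_csv("sales_sample.csv")            # hypothetical sample
        print(sample.groupby("region")["revenue"].mean())

        # The same aggregation against the full data set (cluster, PySpark):
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("full-run").getOrCreate()
        full = spark.read.csv("hdfs:///data/sales/*.csv",
                              header=True, inferSchema=True)
        full.groupBy("region").avg("revenue").show()        # distributed run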

      June 14, 2019 12:32 PM IST
  • Yes, you should learn a platform that is capable of dissecting your problem as a data-parallel problem, and Hadoop is one. For simple needs (design patterns like counting, aggregation, and filtering) Hadoop alone is enough; for more complex machine learning, such as Bayesian methods or SVMs, you need Mahout, which in turn runs on Hadoop (and nowadays Apache Spark) to solve your problem in a data-parallel way. A small sketch of those simple patterns follows below.
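
    For a sense of what those simple patterns look like on a data-parallel platform, here is a minimal, hypothetical PySpark snippet combining the filtering and counting patterns; the file path is invented, and the filter runs in parallel over every partition of the data.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("patterns").getOrCreate()
        logs = spark.sparkContext.textFile("hdfs:///logs/app.log")  # hypothetical
        errors = logs.filter(lambda line: "ERROR" in line)   # filtering pattern
        print(errors.count())                                # counting pattern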

    So Hadoop is a good platform to learn and really important for your batch-processing needs. It isn't only Hadoop, though: you also need to know Spark (Mahout runs its algorithms on Spark) and Twitter Storm (for your real-time analytics needs). This list will continue to evolve, but if you are good with the building blocks (distributed computing, data-parallel problems, and so on) and know how one such platform, say Hadoop, operates, you will fairly quickly be up to speed on the others.

      June 14, 2019 12:39 PM IST
  • Different people use different tools for different things. Terms like Data Science are generic for a reason: a data scientist could spend an entire career without having to learn a particular tool like Hadoop. Hadoop is widely used, but it is not the only platform capable of managing and manipulating data, even large-scale data.

    I would say that a data scientist should be familiar with concepts like MapReduce, distributed systems, distributed file systems, and the like, but I wouldn't judge someone for not knowing about such things.

    It's a big field. There is a sea of knowledge and most people are capable of learning and being an expert in a single drop. The key to being a scientist is having the desire to learn and the motivation to know that which you don't already know.

    As an example: I could hand the right person a hundred structured CSV files containing information about classroom performance in one particular class over a decade. A data scientist could spend a year gleaning insights from that data without ever needing to spread computation across multiple machines (see the small sketch below). You could apply machine learning algorithms, analyze it with visualizations, and combine it with external data about the region: ethnic makeup, changes to the environment over time, political information, weather patterns, and so on. All of that would be "data science" in my opinion. It might take something like Hadoop to test and apply what you learned to data covering an entire country of students rather than a single classroom, but taking that final step doesn't by itself make someone a data scientist, and not taking it doesn't disqualify someone either.
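
    To ground that example, here is a hypothetical pandas sketch of the single-machine workflow described above; the file pattern and column names are invented.

        import glob
        import pandas as pd

        # Load a decade of classroom CSVs into one in-memory frame.
        frames = [pd.read_csv(path) for path in glob.glob("classroom_*.csv")]
        scores = pd.concat(frames, ignore_index=True)

        # Plain descriptive analysis -- no cluster needed at this scale.
        print(scores.groupby("year")["grade"].describe())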

      June 11, 2019 4:55 PM IST