
advice for big data architecture: mongodb + spark

  • I need to implement a big data storage + processing system.

    The data grows daily (at most about 50 million rows/day), and each row is a very simple JSON document of about 10 fields (date, numbers, text, ids).

    The data should then be queryable online (if possible), with arbitrary groupings on some of the document's fields (date range queries, ids, etc.).

    I'm thinking of using a MongoDB cluster to store all this data, building indices for the fields I need to query on, and then processing the data in an Apache Spark cluster (mostly simple aggregations + sorting). Maybe use Spark-jobserver to build a REST API around it.
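
    To make it concrete, this is roughly the kind of query I want to run (just a sketch - `ts`, `accountId` and `amount` are placeholders for some of my ~10 fields, and `events` stands for the data however it ends up being loaded):

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // 'events' is the DataFrame holding the daily JSON documents, however it gets
    // loaded (MongoDB, Cassandra, ...); all field names below are placeholders.
    def monthlyReport(events: DataFrame): DataFrame =
      events
        .filter(col("ts").between("2021-07-01", "2021-07-31")) // date range query
        .groupBy(col("accountId"))                              // arbitrary grouping field
        .agg(count(lit(1)).as("rows"), sum(col("amount")).as("total"))
        .orderBy(desc("total"))                                 // simple aggregation + sorting
    ```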

    I have concerns about MongoDB's scalability (i.e. storing 10B+ rows), its throughput (quickly sending 1B+ rows to Spark for processing), and its ability to maintain indices on such a large database.

    In contrast, I'm considering Cassandra or HBase, which I believe are more suitable for storing large datasets, but offer less query performance, which I'd ultimately need if I am to provide online querying.

    1 - Is MongoDB + Spark a proven stack for this kind of use case?

    2 - Is MongoDB's scalability (both storage and query performance) effectively unbounded?

    Thanks in advance.

      July 23, 2021 1:29 PM IST
    0
    As mentioned previously, there are a number of NoSQL solutions that can fit your needs. I can recommend MongoDB for use with Spark*, especially if you have operational experience with large MongoDB clusters.

    There is a white paper from MongoDB about turning analytics into real-time queries. Perhaps more interesting is the blog post from Eastern Airlines about their use of MongoDB and Spark, and how it powers their 1.6 billion flight searches a day.

    Regarding the data size: managing a cluster with that much data in MongoDB is pretty normal. The hard performance part for any solution will be quickly sending 1B+ documents to Spark for processing. Parallelism and taking advantage of data locality are key here, and your Spark job will also need to be written to exploit that parallelism - shuffling lots of data across the cluster is expensive.
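
    As a rough illustration of the read-and-aggregate path (a sketch only - the format short name and partitioner settings shown here assume a 2.x/3.x connector, and the URI, database and field names are placeholders):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("mongo-aggregations")
      // Placeholder connection string pointing at the cluster's mongos router.
      .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/analytics.events")
      // With a sharded cluster, a sharded partitioner lets each Spark task read its
      // own chunk directly, which is where the parallelism / data locality comes from.
      .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
      .getOrCreate()

    // "mongo" is the data source short name registered by the Spark Connector.
    val events = spark.read.format("mongo").load()

    // Filter before grouping so as little data as possible gets shuffled.
    val grouped = events
      .filter(col("date") >= "2021-07-01" && col("date") < "2021-08-01")
      .groupBy("someId")
      .count()
      .orderBy(desc("count"))
    ```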

    • Disclaimer: I'm the author of the MongoDB Spark Connector and work for MongoDB.
      August 10, 2021 3:09 PM IST
    0
    Almost any NoSQL database can fit your needs for storing the data, and you are right that MongoDB offers some extras over HBase and Cassandra when it comes to querying the data. But Elasticsearch is also a proven solution for high-speed storage and retrieval/querying of data (metrics).

    Here is some more information on using Elasticsearch with Spark:

    https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
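
    For illustration, a minimal sketch of reading an index into a Spark DataFrame with the elasticsearch-spark connector (the node address and index name are placeholders; see the docs above for the options matching your Elasticsearch version):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-queries")
      // Placeholder address of an Elasticsearch node.
      .config("es.nodes", "es-node1:9200")
      .getOrCreate()

    // elasticsearch-spark exposes indices as DataFrames through this data source.
    val docs = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("events-2021.07") // placeholder index name

    docs
      .groupBy("someId") // placeholder field
      .count()
      .show()
    ```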

    I would actually use the complete ELK stack, since Kibana will let you explore the data easily with its visualization capabilities (charts, etc.).

    I bet you already have Spark, so I would recommend installing the ELK stack on the same machines/cluster to test whether it suits your needs.

      August 13, 2021 12:55 PM IST
    0