
Tune spark performance while writing big data to csv

  • Hi :) I have this code in Spark/Scala that partitions big data (more than 50 GB) by category into CSV files:

      df.write
        .mode(SaveMode.Overwrite)
        .partitionBy("CATEGORY_ID")
        .format("csv")
        .option("header", "true")
        .option("sep", "|")
        .option("quoteAll", true)
        .csv("output/inventory_backup")

    The DataFrame df is the result of aggregations on data imported from a CSV file:

       df.groupBy("PRODUCT_ID","LOC_ID","DAY_ID")
      .agg(
          functions.sum("ASSORTED_STOCK_UNIT").as("ASSORTED_STOCK_UNIT_sum"),
          functions.sum("SOLID_STOCK_UNIT").as("SOLID_STOCK_UNIT_sum")
      )


    I would like to tune the performance of this program. Through the Spark UI, I was able to see that the performance bottleneck occurs at the stage that exports the data to CSV files. [Spark UI screenshot]

    More details: I'm using a 16-core / 120 GB RAM instance.

    Do you have any ideas on how to tune the performance? (It currently takes more than 17 minutes.) Any help will be much appreciated. Thank you.


      July 24, 2021 2:43 PM IST
    0
  • Apache Spark is a fast, in-memory processing framework designed to support and process big data. Any data so large in size (GBs, TBs, PBs) that it cannot be processed on a personal computer with a standard configuration is referred to as big data. Spark offers several APIs to load data and perform analysis; the two commonly used ones are the DataFrame API and Spark SQL.

    A DataFrame is a dataset organized into named columns, equivalent to a table in a relational database. Spark SQL is Apache Spark's module for working with structured data. Spark combined with Python programming is known as PySpark, Python's API on Spark for writing programs in a Python style (similar to the pandas library, except that Spark follows lazy evaluation).
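    As a quick illustration of the two APIs on the questioner's aggregation (a minimal sketch, assuming a SparkSession named spark and the questioner's df are in scope; the view name "inventory" is made up):

      import org.apache.spark.sql.functions

      // DataFrame API: the aggregation from the question
      val aggDf = df
        .groupBy("PRODUCT_ID", "LOC_ID", "DAY_ID")
        .agg(
          functions.sum("ASSORTED_STOCK_UNIT").as("ASSORTED_STOCK_UNIT_sum"),
          functions.sum("SOLID_STOCK_UNIT").as("SOLID_STOCK_UNIT_sum")
        )

      // Spark SQL: the equivalent query on a temporary view
      df.createOrReplaceTempView("inventory")
      val aggSql = spark.sql(
        """SELECT PRODUCT_ID, LOC_ID, DAY_ID,
          |       SUM(ASSORTED_STOCK_UNIT) AS ASSORTED_STOCK_UNIT_sum,
          |       SUM(SOLID_STOCK_UNIT)    AS SOLID_STOCK_UNIT_sum
          |FROM inventory
          |GROUP BY PRODUCT_ID, LOC_ID, DAY_ID""".stripMargin)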

      October 18, 2021 2:03 PM IST
    0
  • The shuffle write size is pretty large compared to the data size, ~16 GB vs 50 GB. I suspect the performance may be suffering because of this, not because of saving to disk. E.g. try inserting df.persist(); df.count() before calling df.write to force the lazy aggregation to complete, and see how long the save takes after that (see the sketch below).
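
    A minimal sketch of that experiment, assuming the questioner's df is in scope and reusing the write options from the question:

      import org.apache.spark.sql.SaveMode

      // Materialize (and cache) the aggregation first, so the write step below
      // measures only the CSV export, not the shuffle-heavy groupBy.
      df.persist()
      df.count()   // action that forces the lazy aggregation to run now

      df.write
        .mode(SaveMode.Overwrite)
        .partitionBy("CATEGORY_ID")
        .option("header", "true")
        .option("sep", "|")
        .option("quoteAll", true)
        .csv("output/inventory_backup")

      df.unpersist()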
      August 10, 2021 3:07 PM IST
    0
  • The textFile() method reads an entire CSV record as a String and returns RDD[String]; hence, we need to write additional code in Spark to transform RDD[String] into RDD[Array[String]] by splitting each string record on a delimiter.

    The example below reads a file into an RDD object named "rddFromFile", where each element of the RDD is represented as a String.


    val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.txt")

    But we need every record in the CSV to be split on the comma delimiter and stored in the RDD as multiple columns. To achieve this, we use the map() transformation on the RDD to convert RDD[String] into RDD[Array[String]] by splitting every record on the comma delimiter. map() returns a new RDD instead of updating the existing one; see the sketch below.
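
    A minimal sketch of that transformation, assuming the same file as above and a comma delimiter (the -1 split limit, which keeps trailing empty columns, is my own addition):

    val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.txt")

    // map() returns a new RDD[Array[String]]; the original RDD[String] is left untouched
    val rddColumns = rddFromFile.map(record => record.split(",", -1))

    // e.g. inspect the first column of each record
    rddColumns.foreach(cols => println(cols(0)))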


      August 13, 2021 12:58 PM IST
    0