df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("CATEGORY_ID")
  .option("header", "true")
  .option("sep", "|")
  .option("quoteAll", "true")
  .csv("output/inventory_backup")
The dataframe df is the result of aggregations on data imported from a CSV file:
df.groupBy("PRODUCT_ID", "LOC_ID", "DAY_ID")
  .agg(
    functions.sum("ASSORTED_STOCK_UNIT").as("ASSORTED_STOCK_UNIT_sum"),
    functions.sum("SOLID_STOCK_UNIT").as("SOLID_STOCK_UNIT_sum")
  )
I would like to tune the performance of this program. Through the Spark UI, I was able to see that the performance bottleneck occurs at the stage that exports the data into CSV files.
More details: I'm using a 16-core/120 GB RAM instance.
Do you have any ideas on how to tune the performance? It currently takes more than 17 minutes. Any help would be much appreciated. Thank you.
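One thing worth checking is how many tasks actually write the output: with `partitionBy("CATEGORY_ID")`, a small number of distinct `CATEGORY_ID` values, or heavy skew toward one value, can leave most of the 16 cores idle during the export. Calling `df.repartition(col("CATEGORY_ID"))` before the write is another commonly tried change. Below is a hedged sketch of configuration settings to experiment with; the values are assumptions for illustration, not recommendations:

```shell
# spark-defaults.conf-style settings to experiment with (hypothetical values)

# Number of partitions used for shuffles (e.g. the groupBy aggregation);
# too few means too few parallel write tasks downstream.
spark.sql.shuffle.partitions        32

# Cap records per output file so one huge partition is split across
# several files instead of a single long-running write task.
spark.sql.files.maxRecordsPerFile   1000000
```

The right values depend on the data volume and the cardinality of `CATEGORY_ID`; the Spark UI's task timeline for the write stage should show whether the work is now spread across the cores.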
Apache Spark is a fast, in-memory processing framework designed to support and process big data. Big data refers to datasets so large (gigabytes, terabytes, petabytes) that they cannot be processed on a personal computer with a standard configuration. Spark provides several APIs to load data and perform analysis; two commonly used ones are the DataFrame API and Spark SQL.
A DataFrame is a dataset organized into named columns, equivalent to a table in a relational database. Spark SQL is Apache Spark's module for working with structured data. PySpark is Spark's Python API, which lets you write Spark programs in Python style (similar to the pandas library, except that Spark follows lazy evaluation, which I will discuss later in this article).
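To make the "named columns" idea concrete, here is a minimal local stand-in using a plain Scala collection (the data and names are hypothetical): a DataFrame is conceptually a table of rows with named columns, and a `GROUP BY` / `SUM` in Spark SQL corresponds to grouping and summing over those rows.

```scala
// A row with named columns, like one record of a relational table.
case class InventoryRow(productId: Int, locId: Int, stockUnits: Long)

val table = Seq(
  InventoryRow(1, 10, 5L),
  InventoryRow(1, 10, 7L),
  InventoryRow(2, 11, 3L)
)

// Equivalent in spirit to:
//   SELECT productId, SUM(stockUnits) FROM table GROUP BY productId
val stockSums: Map[Int, Long] =
  table.groupBy(_.productId)
       .map { case (id, rows) => id -> rows.map(_.stockUnits).sum }

println(stockSums)
```

The real DataFrame API adds what this sketch lacks: lazy, distributed execution across a cluster.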
The textFile() method reads each CSV record as a String and returns an RDD[String]; hence, we need to write additional Spark code to transform the RDD[String] into an RDD[Array[String]] by splitting each record on a delimiter.
The example below reads a file into the "rddFromFile" RDD object, where each element of the RDD is represented as a String.
val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.txt")
But we need every record in the CSV split by the comma delimiter and stored in the RDD as multiple columns. To achieve this, we use the map() transformation on the RDD to convert the RDD[String] into an RDD[Array[String]] by splitting every record on the comma delimiter. Note that map() returns a new RDD instead of updating the existing one.
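The split step above can be sketched on a plain Scala collection standing in for RDD[String]; the sample lines are hypothetical, but on a real RDD the map() call is written identically.

```scala
// On a real RDD the equivalent call is:
//   val rdd: RDD[Array[String]] = rddFromFile.map(f => f.split(","))
val lines: Seq[String] = Seq(   // hypothetical CSV records
  "1,100,2020-01-01",
  "2,200,2020-01-02"
)

// Each String record becomes an Array[String] of column values.
val columns: Seq[Array[String]] = lines.map(_.split(","))

columns.foreach(cols => println(cols.mkString(" | ")))
```

Because map() is a transformation, on a real RDD nothing executes until an action (such as collect() or saveAsTextFile()) is called, thanks to Spark's lazy evaluation.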