
Spark - repartition() vs coalesce()

  • According to Learning Spark

    Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

    One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

    If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?
      December 22, 2020 6:00 PM IST
    0
  • It avoids a full shuffle. If the number of partitions is known to be decreasing, then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that we kept.

    So, it would go something like this:

    Node 1 = 1,2,3
    Node 2 = 4,5,6
    Node 3 = 7,8,9
    Node 4 = 10,11,12

    Then coalesce down to 2 partitions:

    Node 1 = 1,2,3 + (10,11,12)
    Node 3 = 7,8,9 + (4,5,6)

    Notice that Node 1 and Node 3 did not require their original data to move.
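
    You can also see this in the RDD API itself: repartition(n) is just coalesce(n, shuffle = true). A minimal spark-shell sketch (assuming a running SparkContext sc; the variable names are only illustrative, using the numbers from the example above):

    // distribute 1..12 over 4 partitions, mirroring the four nodes above
    val rdd = sc.parallelize(1 to 12, 4)

    // coalesce(2) narrows to 2 partitions without a full shuffle:
    // existing partitions are merged rather than re-distributed record by record
    val narrowed = rdd.coalesce(2)
    narrowed.partitions.length   // Int = 2

    // repartition(2) is equivalent to coalesce(2, shuffle = true),
    // so every record may move across the network
    val shuffled = rdd.repartition(2)
    shuffled.partitions.length   // Int = 2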

      December 22, 2020 6:36 PM IST
    0
  • All the answers add some great knowledge to this very frequently asked question.

    So, going by the tradition of this question's timeline, here are my 2 cents.

    I found repartition to be faster than coalesce in one very specific case.

    In my application, when the number of files we estimate is lower than a certain threshold, repartition works faster.

    Here is what I mean:

    // needs: import org.apache.spark.sql.SaveMode
    if (numFiles > 20)   // enough output files: coalesce is cheaper (no shuffle)
        df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
    else                 // few output files: repartition turned out to be faster
        df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
    

    In the above snippet, when my file count was less than 20, coalesce was taking forever to finish while repartition was much faster, hence the check in the code above.

    Of course, this number (20) will depend on the number of workers and the amount of data.
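
    One way to see why this can happen (a sketch, not something I measured beyond the case above): coalesce is folded into the preceding stage, so a small target like coalesce(1) forces the upstream work to run with that little parallelism, while repartition inserts a shuffle and lets the upstream stage keep its full parallelism. Comparing the physical plans makes the difference visible (the exact plan text varies by Spark version):

    df.coalesce(1).explain()
    // == Physical Plan ==
    // Coalesce 1
    //   ... upstream operators run in the same stage, with only 1 task

    df.repartition(1).explain()
    // == Physical Plan ==
    // Exchange RoundRobinPartitioning(1)
    //   ... extra shuffle, but the work before it keeps its parallelism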

    Hope that helps.

      December 22, 2020 10:51 PM IST
    0
  • One additional point to note here is that the basic principle of Spark RDDs is immutability: repartition or coalesce creates a new RDD, while the base RDD continues to exist with its original number of partitions. If the use case demands persisting the RDD in cache, then the same has to be done for the newly created RDD.
    scala> pairMrkt.repartition(10)
    res16: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[11] at repartition at <console>:26
    
    scala> res16.partitions.length
    res17: Int = 10
    
    scala>  pairMrkt.partitions.length
    res20: Int = 2
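
    A small follow-up sketch (the pairMrkt10 name is just illustrative): keep a reference to the repartitioned RDD and cache that one, since pairMrkt itself is left untouched:

    // the new RDD is what you persist; pairMrkt still has its original 2 partitions
    val pairMrkt10 = pairMrkt.repartition(10)
    pairMrkt10.cache()                // cache the newly created RDD
    pairMrkt10.partitions.length      // 10
    pairMrkt.partitions.length        // still 2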
      December 23, 2020 12:58 PM IST
    0