Spark Repartition() vs Coalesce()
- February 21, 2022
- Posted by: Ankush Thavali
- Category: Hadoop

Spark repartition() vs coalesce() – repartition() can increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, and always performs a full shuffle. coalesce() can only decrease the number of partitions, and it does so more efficiently by merging existing partitions instead of shuffling every record.

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('SplitFile') \
    .getOrCreate()

# Read a pipe-delimited CSV file with a header row.
readDF = spark.read.format("csv") \
    .option("header", True) \
    .option("delimiter", "|") \
    .load(r"C:\Users\ankus\PycharmProjects\pythonProject2\venv\resources\empdata.csv")

print(readDF.count())
print(readDF.rdd.getNumPartitions())

# Increase the number of partitions to 10 (full shuffle).
readDF = readDF.repartition(10)
print(readDF.rdd.getNumPartitions())
readDF.printSchema()

# Repartition into 4 partitions hashed on the 'gender' column, then write.
readDF.repartition(4, 'gender') \
    .write.csv(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output', 'overwrite')

# readDF.coalesce(10).write.save(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output', 'csv', 'overwrite', None)
```
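The repartition(4, 'gender') call above hash-partitions rows by the gender column, so all rows with the same gender land in the same partition (and hence the same output file). A minimal pure-Python sketch of that idea — illustration only: `assign_partitions` is a hypothetical helper, not a Spark API, and Spark actually uses a Murmur3-based hash rather than Python's built-in `hash`:

```python
def assign_partitions(rows, key, num_partitions):
    # Place each row into a partition chosen by hashing the key column.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = hash(row[key]) % num_partitions
        partitions[idx].append(row)
    return partitions

rows = [{"name": "a", "gender": "M"},
        {"name": "b", "gender": "F"},
        {"name": "c", "gender": "M"}]
parts = assign_partitions(rows, "gender", 4)
# Every row with gender "M" ends up in the same partition.
```

Because the partition index depends only on the key, equal keys always collide into the same partition, which is what makes per-column repartitioning useful before a write or a join.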
Example 2
```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('SplitFile') \
    .getOrCreate()

# Create an RDD of 20 elements spread across 6 partitions.
rdd = spark.sparkContext.parallelize(range(0, 20), 6)
print('no of partitions is', rdd.getNumPartitions())
rdd.saveAsTextFile(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output2')
```
```python
# Writes 6 part files, one for each partition.
rdd.saveAsTextFile("/tmp/partition")
```

Partition 1 : 0 1 2
Partition 2 : 3 4 5
Partition 3 : 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15
Partition 6 : 16 17 18 19
RDD Coalesce
```python
rdd3 = rdd.coalesce(4)
print("Coalesce size : " + str(rdd3.getNumPartitions()))
rdd3.saveAsTextFile("/tmp/coalesce")
```

Partition 1 : 0 1 2
Partition 2 : 3 4 5 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15 16 17 18 19

Notice that whole partitions were merged (partition 3 into 2, and 6 into 5), so each element keeps its original neighbours — no records were shuffled.
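The key point in the output above is that coalesce(4) merges whole partitions rather than redistributing individual records. A rough pure-Python sketch of that merging behaviour — illustration only: `coalesce_partitions` is a hypothetical function, and Spark actually decides which partitions to combine based on data locality, not a simple modulo:

```python
def coalesce_partitions(partitions, n):
    # Merge whole partitions together instead of redistributing
    # individual elements: no shuffle, and the order inside each
    # source partition is preserved.
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

six = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9],
       [10, 11, 12], [13, 14, 15], [16, 17, 18, 19]]
four = coalesce_partitions(six, 4)
```

Because entire partitions move as units, coalesce avoids the network cost of a shuffle — which is also why it cannot increase the partition count.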
RDD Repartition
```python
rdd2 = rdd.repartition(6)
print(rdd2.getNumPartitions())
```

Partition 1 : 14 1 5
Partition 2 : 4 16 15
Partition 3 : 8 3 18
Partition 4 : 12 2 19
Partition 5 : 6 17 7 0
Partition 6 : 9 10 11 13
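By contrast, repartition(6) redistributes every record across the new partitions in a full shuffle, which is why the numbers above no longer appear in order. A rough pure-Python sketch — illustration only: `repartition_rr` is a hypothetical round-robin stand-in for Spark's hash-based shuffle:

```python
def repartition_rr(partitions, n):
    # Flatten everything and redistribute record by record:
    # every element can move to a new partition, unlike coalesce.
    flat = [x for part in partitions for x in part]
    out = [[] for _ in range(n)]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out

six = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9],
       [10, 11, 12], [13, 14, 15], [16, 17, 18, 19]]
shuffled = repartition_rr(six, 6)
```

The per-record movement is what makes repartition able to both grow and shrink the partition count, at the price of shuffling all the data over the network.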