
Spark repartition() vs coalesce() – repartition() can increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, but it always performs a full shuffle. coalesce() can only decrease the number of partitions, and it does so more efficiently by merging existing partitions and avoiding a full shuffle.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('SplitFile') \
    .getOrCreate()

readDF = spark.read.format("csv").option("header", True).option("delimiter", "|").load(
    r"C:\Users\ankus\PycharmProjects\pythonProject2\venv\resources\empdata.csv")
print(readDF.count())

print(readDF.rdd.getNumPartitions())
readDF = readDF.repartition(10)
print(readDF.rdd.getNumPartitions())
readDF.printSchema()

# Hash-partition on the gender column into 4 partitions before writing
readDF.repartition(4, 'gender').write.csv(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output', mode='overwrite')
# readDF.coalesce(10).write.save(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output', format='csv', mode='overwrite')

Example 2

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('SplitFile') \
    .getOrCreate()

rdd = spark.sparkContext.parallelize(range(0, 20), 6)
print('no of partitions is', rdd.getNumPartitions())
rdd.saveAsTextFile(r'C:\Users\ankus\PycharmProjects\pythonProject2\venv\Output2')
# Writes 6 part files, one for each partition:
Partition 1 : 0 1 2
Partition 2 : 3 4 5
Partition 3 : 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15
Partition 6 : 16 17 18 19

RDD coalesce()

rdd3 = rdd.coalesce(4)
print("Coalesce size:", rdd3.getNumPartitions())
rdd3.saveAsTextFile("/tmp/coalesce")


Partition 1 : 0 1 2
Partition 2 : 3 4 5 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15 16 17 18 19

Notice that partitions 3 and 6 are gone: coalesce() avoids a full shuffle by merging partition 3 into partition 2 and partition 6 into partition 5, leaving the remaining partitions untouched.

RDD repartition()

rdd2 = rdd3.repartition(6)
print("Repartition size:", rdd2.getNumPartitions())

Partition 1 : 14 1 5
Partition 2 : 4 16 15
Partition 3 : 8 3 18
Partition 4 : 12 2 19
Partition 5 : 6 17 7 0
Partition 6 : 9 10 11 13

Unlike coalesce(), repartition() performs a full shuffle, so the values end up redistributed across all six partitions rather than staying in contiguous ranges.

