Performance Tuning — Partitions, Repartition, and Coalesce in PySpark
At NeoMart, large-scale pipelines often process billions of rows.
Efficient partitioning is critical to:
- Reduce shuffle and disk I/O
- Improve join and aggregation performance
- Balance task distribution across executors
This chapter explains partitions, repartition, and coalesce with practical examples.
1. Understanding Partitions
- Spark divides data into partitions → processed in parallel
- Number of partitions affects parallelism and performance
- Default partition count depends on cluster configuration and source
print(df.rdd.getNumPartitions())
- Returns the number of partitions of df
2. Repartition — Increase or Shuffle Partitions
- repartition(n) → reshuffles data into n partitions
- Useful for parallelizing wide transformations or joins
# Repartition to 8 partitions
df_repart = df.repartition(8)
print(df_repart.rdd.getNumPartitions())
Story
NeoMart joins two large datasets.
- Original partitions = 2 → tasks underutilized
- Repartition to 8 → tasks distributed evenly → faster execution
3. Coalesce — Reduce Partitions Without Shuffle
- coalesce(n) → reduces partitions without a full shuffle
- Ideal after filtering or aggregation
# Reduce partitions to 2 without shuffle
df_coalesce = df_repart.coalesce(2)
print(df_coalesce.rdd.getNumPartitions())
- Avoids an expensive shuffle operation → faster than repartition when reducing partitions
4. Practical Example
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("NeoMart").getOrCreate()
data = [(i, f"product_{i}", i*10) for i in range(1, 101)]
df = spark.createDataFrame(data, ["id", "name", "price"])
# Check initial partitions
print("Initial partitions:", df.rdd.getNumPartitions())
# Increase partitions for join
df_repart = df.repartition(10)
print("After repartition:", df_repart.rdd.getNumPartitions())
# Filter down to expensive products
df_filtered = df_repart.filter(F.col("price") > 500)
# Reduce partitions after filtering
df_final = df_filtered.coalesce(3)
print("After coalesce:", df_final.rdd.getNumPartitions())
Output (the initial partition count varies with cluster configuration):
Initial partitions: 4
After repartition: 10
After coalesce: 3
- Only repartition triggers a shuffle; coalesce merges existing partitions in place (merged partitions can therefore be uneven)
- Balanced partitions → better task parallelism
5. Best Practices
✔ Repartition before wide transformations or joins → balance tasks
✔ Coalesce after filtering or aggregations → avoid small files and shuffle
✔ Avoid too many partitions → task-scheduling overhead
✔ Avoid too few partitions → underutilized cluster resources
✔ Combine with partitioning on columns for large datasets
6. Story Example
NeoMart runs nightly ETL on 500 million rows:
- Original partitions = 50 → join tasks uneven
- Repartition to 200 → full cluster utilization → faster joins
- After filtering high-value products → coalesce to 50 → fewer small files
- Result → 3x faster ETL runtime
Summary
- Partitions → control parallelism
- repartition(n) → increase or reshuffle partitions, with a full shuffle
- coalesce(n) → decrease partitions without a full shuffle
- Proper tuning → faster joins, reduced shuffle, optimized cluster utilization
NeoMart pipelines achieve high throughput and low latency by carefully repartitioning and coalescing DataFrames.
Next Topic → Introduction to Structured Streaming.