Partitioning & Bucketing in PySpark — Optimize Large Dataset Performance
At NeoMart, the data team works with billions of rows every day:
- Sales records by date and region
- Clickstream events by user and session
- Product catalogs across multiple categories
Reading all data every time is slow and expensive.
Partitioning and bucketing let Spark skip irrelevant data and reduce shuffle during joins and aggregations.
1. Understanding Partitioning
Partitioning physically splits data into directories based on column values.
- Each partition = a subdirectory
- Example: partition by `year` or `region`
- Spark can prune partitions during queries
Example Dataset
```python
data = [
    ("2025-01-01", "Electronics", 12000),
    ("2025-01-01", "Grocery", 4000),
    ("2025-02-01", "Electronics", 15000),
    ("2025-02-01", "Fashion", 3000),
]
df = spark.createDataFrame(data, ["date", "category", "revenue"])
```
2. Writing Partitioned Data
```python
df.write.mode("overwrite") \
    .partitionBy("category") \
    .parquet("/tmp/neo_partitioned")
```
- Creates a folder for each category:
  - `/tmp/neo_partitioned/category=Electronics/`
  - `/tmp/neo_partitioned/category=Grocery/`
- Querying with `category = 'Electronics'` reads only that partition.
3. Reading Partitioned Data
```python
# The basePath option keeps the partition column (category) in the schema
# even though we point at a single partition directory.
df_read = spark.read.option("basePath", "/tmp/neo_partitioned") \
    .parquet("/tmp/neo_partitioned/category=Electronics")
df_read.show()
```
Output
| date | category | revenue |
|---|---|---|
| 2025-01-01 | Electronics | 12000 |
| 2025-02-01 | Electronics | 15000 |
Partition pruning reduces I/O and speeds up queries.
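A quick way to confirm pruning is to read the base path, filter on the partition column, and inspect the physical plan; the file scan should list the predicate under PartitionFilters. A minimal sketch, assuming the partitioned path written above:

```python
# Read the base path and filter on the partition column;
# Spark prunes to the matching directory instead of scanning everything.
pruned_df = spark.read.parquet("/tmp/neo_partitioned") \
    .filter("category = 'Electronics'")

# The plan's file scan should show the predicate under PartitionFilters.
pruned_df.explain()
```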
4. Partitioning Best Practices
✔ Partition by low-to-moderate cardinality columns that queries frequently filter on (like date or region)
✔ Avoid partitioning by too many or very fine-grained columns → too many small files (see the sketch below)
✔ Keep fewer than ~1000 partitions per table for efficiency
✔ Combine with predicate filtering for query speed
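One common way to keep the partition count in check is to derive a coarser column before writing. A minimal sketch, assuming the `df` created earlier and a hypothetical `/tmp/neo_partitioned_by_month` output path:

```python
from pyspark.sql import functions as F

# Derive a coarser partition column: one directory per month
# instead of one per day keeps the partition count low.
df_month = df.withColumn(
    "sale_month", F.date_format(F.to_date("date"), "yyyy-MM")
)

df_month.write.mode("overwrite") \
    .partitionBy("sale_month") \
    .parquet("/tmp/neo_partitioned_by_month")
```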
5. Understanding Bucketing
Bucketing divides data into a fixed number of buckets using a hash function on a column (a small illustration follows the list below).
- Unlike partitioning, bucketed data is in files within the same folder
- Useful for joins, aggregations, sampling
- Bucketing enables co-located joins → no shuffle
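Conceptually, each row's bucket is its hashed bucket-column value modulo the number of buckets. The snippet below is an illustration only, using Spark's built-in `hash` and `pmod` SQL functions; Spark computes the actual bucket ids internally when it writes a bucketed table:

```python
from pyspark.sql import functions as F

# Illustration only: hash the bucket column and take it modulo the
# bucket count to see which bucket each row would land in.
df.withColumn("bucket_id", F.expr("pmod(hash(category), 4)")).show()
```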
6. Writing Bucketed Tables
```python
df.write.mode("overwrite") \
    .bucketBy(4, "category") \
    .sortBy("revenue") \
    .saveAsTable("bucketed_sales")
```
- 4 buckets for `category`
- Within each bucket, data is sorted by `revenue` (the snippet below shows how to verify the bucket spec)
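To confirm how the table was bucketed, you can inspect its metadata. A minimal sketch; the exact rows returned depend on your metastore:

```python
# The detailed table information includes "Num Buckets",
# "Bucket Columns", and "Sort Columns" for a bucketed table.
spark.sql("DESCRIBE TABLE EXTENDED bucketed_sales").show(truncate=False)
```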
7. Reading & Joining Bucketed Tables
```python
bucketed_df = spark.table("bucketed_sales")

# For illustration, join the bucketed table with itself on the bucket column;
# in practice the other side would be a second table bucketed the same way.
other_df = spark.table("bucketed_sales")
joined_df = bucketed_df.join(other_df, "category")
```
- Spark can perform bucketed joins → avoids a full shuffle (see the plan check below)
- Great for large fact-dimension joins
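One way to verify that the shuffle is avoided is to disable broadcast joins and inspect the physical plan; when both sides are bucketed identically on the join key, there should be no Exchange node on either side. A minimal sketch, assuming the `bucketed_sales` table created above:

```python
# Force a sort-merge join by disabling broadcast joins,
# then check the physical plan for the absence of Exchange nodes.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

left = spark.table("bucketed_sales")
right = spark.table("bucketed_sales")
left.join(right, "category").explain()
```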
8. Partitioning vs Bucketing — Quick Comparison
| Feature | Partitioning | Bucketing |
|---|---|---|
| Physical layout | Subdirectories | Files within a directory |
| Purpose | Query pruning | Optimized joins/aggregations |
| Column type | Low/moderate cardinality | High cardinality |
| Shuffle impact | Reduces I/O | Reduces shuffle during join |
| Sort inside | Optional | Can sort inside bucket |
9. Best Practices
✔ Partition by date for time-series data
✔ Bucket large dimension tables by join key
✔ Combine partitioning + bucketing for large fact tables (see the sketch after this list)
✔ Use sortBy() within buckets for faster aggregation
✔ Monitor number of files to avoid small-file problem
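As a sketch of the combined approach, a hypothetical `orders_df` fact table (assumed here to have `order_date`, `customer_id`, and `amount` columns) could be partitioned by derived year/month columns and bucketed by the join key:

```python
from pyspark.sql import functions as F

# Hypothetical fact table: partition by coarse time columns for pruning,
# bucket by the join key to reduce shuffle when joining to customers.
(orders_df
    .withColumn("year", F.year("order_date"))
    .withColumn("month", F.month("order_date"))
    .write.mode("overwrite")
    .partitionBy("year", "month")
    .bucketBy(8, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("orders_fact"))
```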
10. Story Example
NeoMart stores billions of orders.
- Partition by `year` and `month` → only read relevant months
- Bucket by `customer_id` → join with the customer master table without shuffle
- Result → queries 5x faster, cluster utilization optimized
Summary
With partitioning and bucketing, you can:
- Reduce I/O with partition pruning
- Minimize shuffle in large joins
- Sort data within buckets for fast aggregation
- Build scalable, production-ready pipelines
Mastering these techniques makes PySpark efficient at scale, just like NeoMart handles billions of daily transactions.
Next Topic → Caching, Persisting, and Memory Management