File Compaction & Delta File Management

✨ Story Time — “Why Are My Queries Slowing Down Every Week?”

Meet Arjun, a data engineer responsible for maintaining a busy Delta Lake table receiving:

  • CDC updates every 5 minutes
  • Batch data every hour
  • Streaming inserts all day

At first, everything is fast.
But after a few weeks:

  • Queries slow down
  • Dashboards lag
  • Costs increase
  • Data engineers keep asking: “Why is Delta so slow now?”

Arjun opens the Delta table storage…

He sees THOUSANDS of tiny files — the dreaded Small File Problem.

He smiles.
He knows exactly what’s needed:

File Compaction & Proper Delta File Management.


🧩 What Is File Compaction in Delta Lake?

File compaction is the process of merging many small Delta files into fewer, larger, optimized files.

Why small files happen:

  • Streaming writes produce small batches
  • Frequent micro-batch ingest
  • CDC jobs write small delta chunks
  • Over-partitioning causes tiny files per partition

Small files = slow queries + high compute cost + too much metadata.

Compaction solves this by:

  • Reducing file count
  • Increasing file size
  • Improving read performance
  • Reducing metadata overhead

🔍 Why Small Files Hurt Performance

❌ More files = More metadata

Each query has to read metadata for every file → slower planning.

❌ More files = More unnecessary reads

Even when only a handful of rows match the filter, the query may still have to open and scan a large number of files.

❌ More files = Higher storage cost

Thousands of tiny objects add up in storage, and the old versions they leave behind linger until VACUUM cleans them up.

❌ More files = Slower Z-ORDER & OPTIMIZE

The more files you have, the heavier maintenance operations become.

Solution → Compaction through OPTIMIZE.


⚙️ How Delta Performs File Compaction

The key command:

OPTIMIZE my_delta_table;

What it does:

  1. Scans small files
  2. Groups and merges them
  3. Writes larger Parquet files (typically 128–512MB)
  4. Updates Delta transaction log
  5. Marks the old small files as removed in the transaction log (they are physically deleted later by VACUUM)
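
On large tables you rarely need to compact everything in one pass. OPTIMIZE accepts a WHERE predicate on partition columns, so you can restrict compaction to the partitions that are still receiving writes. A minimal sketch, assuming the table is partitioned by an event_date column (that column name is just an illustration):

-- Compact only recent partitions; the predicate must reference partition columns
OPTIMIZE my_delta_table
WHERE event_date >= '2024-01-01';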

🔁 Automatic File Compaction (With Auto-Optimize)

Databricks also offers automated compaction:

ALTER TABLE my_delta_table
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

What these do:

  • optimizeWrite: writes fewer, larger files during ingest
  • autoCompact: merges files after small batch inserts

Perfect for streaming or frequent batches.
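
These properties can also be set when a table is created, so every writer benefits from day one. A short sketch; the events table name and schema are invented for illustration:

CREATE TABLE IF NOT EXISTS events (
  event_id STRING,
  event_ts TIMESTAMP,
  payload STRING
)
USING DELTA
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);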


🧪 Real-World Example — Before & After Compaction

Arjun’s table (before):

  • 8,200 files per partition
  • Avg file size: 40KB
  • Query runtime: 34 seconds

After running:

OPTIMIZE sales_data ZORDER BY (customer_id);
VACUUM sales_data RETAIN 168 HOURS;

The results:

  • 320 files per partition
  • Avg file size: 300MB
  • Query runtime: 5 seconds

Improved performance, reduced cost, and less pressure on the cluster.
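
To verify what a run like this actually did, check the table history: the OPTIMIZE commit records operation metrics such as the number of files removed and added.

-- The operationMetrics column of the OPTIMIZE row shows files removed and added
DESCRIBE HISTORY sales_data LIMIT 5;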


📦 Delta File Management — The Full Picture

Delta Lake automatically manages:

  • Transaction logs (_delta_log/)
  • Versioning
  • Compaction (when Auto-Optimize is enabled)
  • Data skipping
  • File pruning
  • Safe removal of obsolete files when you run VACUUM

But you must manage:

  • When to compact
  • How often to vacuum
  • How to structure partitions
  • How to avoid unnecessary file explosion

🎯 Best Practices for File Compaction

✅ 1. Compact high-ingestion tables regularly

Daily or weekly, depending on volume.

✅ 2. Enable Auto-Optimize for streaming workloads

Reduces small files during writes.
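
On Databricks, the equivalent session-level switches are the following Spark configs, set in the notebook or job that does the writing:

-- Session-level equivalents of the table properties shown earlier
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;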

✅ 3. Combine OPTIMIZE with Z-ORDER

Boosts data skipping for faster queries.

✅ 4. Avoid over-partitioning

Too many partitions → too many tiny files.

✅ 5. Use VACUUM after compaction

Clean old files and free storage:

VACUUM my_delta_table RETAIN 168 HOURS;
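
Not sure what a VACUUM will actually delete? Delta supports a dry run that only lists the candidate files without removing anything:

-- Preview the files that would be deleted, without deleting them
VACUUM my_delta_table RETAIN 168 HOURS DRY RUN;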

✅ 6. Monitor file count

As a rule of thumb, more than ~1,000 files per partition means it is time to compact.
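
A quick way to check is DESCRIBE DETAIL, which reports the total file count and size for the table (table-level rather than per partition, but a tiny average file size is usually signal enough):

-- numFiles and sizeInBytes in the output give you the average file size
DESCRIBE DETAIL my_delta_table;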


📘 Summary

  • File compaction merges small files into large, efficient ones.
  • Small files slow down queries, inflate compute cost, and bloat metadata.
  • OPTIMIZE + Auto-Optimize are the main tools for managing Delta Lake storage.
  • Use VACUUM to clear old files after compaction.
  • Proper file management makes your Lakehouse fast, clean, and cost-efficient.

👉 Next Topic

Caching in Databricks — Best Practices