File Compaction & Delta File Management
Story Time: "Why Are My Queries Slowing Down Every Week?"
Meet Arjun, a data engineer responsible for maintaining a busy Delta Lake table receiving:
- CDC updates every 5 minutes
- Batch data every hour
- Streaming inserts all day
At first, everything is fast.
But after a few weeks:
- Queries slow down
- Dashboards lag
- Costs increase
- Data engineers keep asking: "Why is Delta so slow now?"
Arjun opens the Delta table storage...
He sees THOUSANDS of tiny files: the dreaded Small File Problem.
He smiles again.
He knows exactly what's needed:
File Compaction & Proper Delta File Management.
What Is File Compaction in Delta Lake?
File compaction is the process of merging many small Delta files into fewer, larger, optimized files.
Why small files happen:
- Streaming writes produce small batches
- Frequent micro-batch ingest
- CDC jobs write small delta chunks
- Over-partitioning causes tiny files per partition
Small files = slow queries + high compute cost + too much metadata.
Compaction solves this by:
- Reducing file count
- Increasing file size
- Improving read performance
- Reducing metadata overhead
Why Small Files Hurt Performance
More files = More metadata
Each query has to read metadata for every file, which slows query planning.
More files = More unnecessary reads
Even if only one row matches the filter, Databricks may still have to open and scan many files.
More files = Higher storage cost
Frequent small writes create many table versions, and the files they supersede stay in storage until VACUUM removes them.
More files = Slower Z-ORDER & OPTIMIZE
The more files you have, the heavier maintenance operations become.
Solution: compaction through OPTIMIZE.
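To get a feel for the scale involved, the sketch below estimates how many files a partition needs at two average file sizes. The 10 GB partition and the 256 MB target are illustrative assumptions, not figures from Arjun's table, and `estimated_file_count` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def estimated_file_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Return how many files are needed to hold total_bytes at a given average file size."""
    if avg_file_bytes <= 0:
        raise ValueError("avg_file_bytes must be positive")
    # Integer ceiling division, so a partial last file still counts as one file.
    return -(-total_bytes // avg_file_bytes)


# Hypothetical 10 GB partition, chosen only for illustration.
partition_bytes = 10 * GB

tiny_files = estimated_file_count(partition_bytes, 40 * KB)
compacted_files = estimated_file_count(partition_bytes, 256 * MB)

print(f"40 KB files:  {tiny_files}")
print(f"256 MB files: {compacted_files}")
```

Every one of those files is an entry the query planner has to consider, which is why the same amount of data can behave so differently before and after compaction.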
How Delta Performs File Compaction
The key command:
OPTIMIZE my_delta_table;
What it does:
- Scans small files
- Groups and merges them
- Writes larger Parquet files (typically 128–512 MB)
- Updates the Delta transaction log
- Marks the old small files as removed in the log (they are physically deleted later, when you run VACUUM)
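The "group and merge" step above can be pictured as a bin-packing problem. The following is a simplified illustration using a first-fit-decreasing heuristic, not Delta Lake's actual implementation; `plan_compaction`, the sample file sizes, and the 128 MB target are all assumptions made for the example.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group small files into bins whose combined size stays at or below target_mb.

    Files already at or above the target are left out, since they do not
    need to be rewritten. Uses first-fit decreasing: largest files first,
    each placed into the first bin that still has room.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins: List[List[int]] = []
    totals: List[int] = []
    for size in small:
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
            totals.append(size)
    return bins


# Hypothetical file sizes in MB; the 300 MB file is already large enough.
plan = plan_compaction([40, 40, 40, 100, 300, 60], target_mb=128)
for group in plan:
    print(f"merge {group} -> one file of {sum(group)} MB")
```

Each inner list stands for one rewritten output file. The real command also has to record the rewrite in the transaction log, which this sketch leaves out.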
Automatic File Compaction (With Auto-Optimize)
Databricks also offers automated compaction:
ALTER TABLE my_delta_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
What these do:
| Property | Action |
|---|---|
| optimizeWrite | Writes fewer, larger files during ingest |
| autoCompact | Merges files after small batch inserts |
Perfect for streaming or frequent batches.
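If several tables need the same settings, it can help to generate the statement instead of typing it per table. This is a minimal sketch: `auto_optimize_sql` is a hypothetical helper that only builds the SQL text shown above, and running it (for example through a Spark session) is left to the caller. Table names are interpolated as-is, so it assumes trusted input.

```python
def auto_optimize_sql(table_name: str,
                      optimize_write: bool = True,
                      auto_compact: bool = True) -> str:
    """Build the ALTER TABLE statement that sets the two auto-optimize properties."""
    props = {
        "delta.autoOptimize.optimizeWrite": optimize_write,
        "delta.autoOptimize.autoCompact": auto_compact,
    }
    assignments = ",\n".join(
        f"'{key}' = '{str(value).lower()}'" for key, value in props.items()
    )
    return f"ALTER TABLE {table_name}\nSET TBLPROPERTIES (\n{assignments}\n);"


# Hypothetical list of streaming tables.
for table in ["my_delta_table", "sales_data"]:
    print(auto_optimize_sql(table))
    print()
```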
Real-World Example: Before & After Compaction
Arjun's table (before):
- 8,200 files per partition
- Avg file size: 40KB
- Query runtime: 34 seconds
After:
OPTIMIZE sales_data ZORDER BY (customer_id);
VACUUM sales_data RETAIN 168 HOURS;
- 320 files per partition
- Avg file size: 300MB
- Query runtime: 5 seconds
Improved performance, reduced cost, and less pressure on the cluster.
Delta File Management: The Full Picture
Delta Lake provides built-in support for:
- Transaction logs (_delta_log/)
- Versioning
- Compaction
- Data skipping
- File pruning
- Data removal with VACUUM
But you must manage:
- When to compact
- How often to vacuum
- How to structure partitions
- How to avoid unnecessary file explosion
Best Practices for File Compaction
1. Compact high-ingestion tables regularly
Daily or weekly, depending on volume.
2. Enable Auto-Optimize for streaming workloads
Reduces small files during writes.
3. Combine OPTIMIZE with Z-ORDER
Boosts data skipping for faster queries.
4. Avoid over-partitioning
Too many partitions lead to too many tiny files.
5. Use VACUUM after compaction
Clean old files and free storage:
VACUUM my_delta_table RETAIN 168 HOURS;
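Practices 3 and 5 are usually run back to back, so a scheduled job can generate both statements together. The sketch below is one possible shape for that: `maintenance_sql` is a hypothetical helper, and refusing a retention below 168 hours is a conservative choice made for this example, mirroring the 7-day window used in this article.

```python
from typing import List, Sequence


def maintenance_sql(table: str,
                    zorder_cols: Sequence[str] = (),
                    retain_hours: int = 168) -> List[str]:
    """Return the OPTIMIZE and VACUUM statements for one table, in run order."""
    if retain_hours < 168:
        # Assumption for this sketch: never go below the 7-day retention window.
        raise ValueError("retain_hours below 168 is not allowed in this sketch")
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    return [f"{optimize};", f"VACUUM {table} RETAIN {retain_hours} HOURS;"]


for statement in maintenance_sql("sales_data", zorder_cols=["customer_id"]):
    print(statement)
```

Compaction runs first so that VACUUM later has superseded small files to clean up once they age past the retention window.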
6. Monitor file count
If a partition holds more than 1,000 files, compaction is required.
Summary
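The 1,000-file rule of thumb is easy to automate once you have a listing of data file paths. How you obtain that listing depends on your environment and is outside this sketch; the function names and sample paths below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def files_per_partition(paths: Iterable[str]) -> Dict[str, int]:
    """Count Parquet data files per partition directory.

    The partition key is everything before the file name, so files in the
    table root are counted under the empty string.
    """
    counts: Counter = Counter()
    for path in paths:
        partition, _, filename = path.rpartition("/")
        if filename.endswith(".parquet"):
            counts[partition] += 1
    return dict(counts)


def partitions_needing_compaction(counts: Dict[str, int],
                                  threshold: int = 1000) -> List[str]:
    """Return partitions whose file count exceeds the threshold, sorted by name."""
    return sorted(part for part, n in counts.items() if n > threshold)


# Hypothetical listing: one busy partition, one quiet one, plus a log file.
listing = [f"date=2024-01-01/part-{i}.parquet" for i in range(1001)]
listing += [f"date=2024-01-02/part-{i}.parquet" for i in range(10)]
listing.append("_delta_log/00000000000000000000.json")

counts = files_per_partition(listing)
print(partitions_needing_compaction(counts))
```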
- File compaction merges small files into large, efficient ones.
- Small files slow down queries, inflate compute cost, and bloat metadata.
- OPTIMIZE + Auto-Optimize are the main tools for managing Delta Lake storage.
- Use VACUUM to clear old files after compaction.
- Proper file management makes your Lakehouse fast, clean, and cost-efficient.
Next Topic
Caching in Databricks: Best Practices