OPTIMIZE Command (OPTIMIZE, Z-ORDER) — The Secret to Fast Delta Lake Queries
✨ Story Time — “Why is My Query Slower Today?”
Meet Ray, a data engineer working with a large Delta Lake table that receives millions of updates daily.
One morning:
- Yesterday’s query ran in 6 seconds
- Today the same query takes over 35 seconds
- The dashboard team is already messaging him…
Ray checks the table and discovers:
- Thousands of small Delta files
- Poor clustering
- No data skipping
- And a warehouse that’s working harder than it should
He sighs…
Then smiles — because he knows the fix is simple:
➡ OPTIMIZE + Z-ORDER
The Databricks “performance boost button.”
🧩 What is OPTIMIZE in Databricks?
OPTIMIZE is a Delta Lake command that compacts small files into large, efficient Parquet files.
Why is this important?
Because writing too many small files leads to:
- Slow reads
- High metadata overhead
- Extra compute cost
- Poor parallelization
How OPTIMIZE works:
- Reads many small files
- Combines them into fewer, larger files (by default, up to roughly 1 GB each)
- Organizes partitions more efficiently
- Improves scan performance significantly
Example:
OPTIMIZE sales_delta;
Just one command — and read performance improves instantly.
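Conceptually, compaction is a bin-packing step: many tiny files get grouped until each group approaches a target size. Here is a minimal Python sketch of that idea (the 128 MB target and file sizes are made up for illustration; real compaction happens inside the Delta engine and rewrites Parquet data):

```python
# Toy model of file compaction: greedily pack small files into
# fewer "compacted files" near a target size.

TARGET_BYTES = 128 * 1024 * 1024  # illustrative target, not the engine default

def compact(file_sizes, target=TARGET_BYTES):
    """Greedily bin-pack file sizes into compacted groups near `target`."""
    bins = []               # each bin represents one compacted output file
    current, total = [], 0
    for size in sorted(file_sizes, reverse=True):
        if total + size > target and current:
            bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

# 1,000 tiny 1 MB files collapse into just a handful of large files
small_files = [1 * 1024 * 1024] * 1000
compacted = compact(small_files)
print(len(small_files), "->", len(compacted), "files")  # 1000 -> 8 files
```

Fewer files means fewer metadata entries to track and fewer file-open operations per scan, which is where the read speedup comes from.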
🔍 What is Z-ORDER?
Z-ORDER is a multi-dimensional clustering technique that groups related data together physically on disk.
This improves data skipping, meaning:
➡ Databricks reads only the files that matter, not the entire dataset
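Data skipping works because Delta Lake stores per-file min/max statistics for columns. A rough Python illustration of the pruning logic (the file names and stat values here are invented for the sketch):

```python
# Toy model of data skipping: each data file carries min/max stats for a
# column, and the engine prunes any file whose range cannot match the filter.

files = {
    "part-0001.parquet": (100, 4999),     # (min customer_id, max customer_id)
    "part-0002.parquet": (5000, 9999),
    "part-0003.parquet": (10000, 14999),
}

def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

print(files_to_scan(files, 7321))  # only part-0002 needs to be read
```

The tighter the value ranges within each file, the more files can be skipped, and clustering related values together (Z-ordering) is exactly what tightens those ranges.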
Perfect for speeding up queries with filters such as:
WHERE customer_id = ...
WHERE date BETWEEN ...
WHERE product_category = ...
Example:
OPTIMIZE sales_delta
ZORDER BY (customer_id, order_date);
This tells Databricks:
“Put rows with similar customer_id and order_date values closer together.”
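Under the hood, the Z in Z-ORDER comes from the Z-order (Morton) curve: the bits of the clustered columns are interleaved into a single sort key, so rows that are close in several dimensions end up close in the file layout. A minimal two-column sketch (the row values are invented for illustration):

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions <- x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions  <- y
    return z

# Sort rows of (customer_id, order_day) by their interleaved key:
rows = [(3, 7), (3, 6), (250, 7), (2, 7), (251, 6)]
rows.sort(key=lambda r: z_value(*r))
print(rows)  # small customer_ids cluster first, large ones last
```

Because similar key combinations land in the same files after sorting, each file's min/max ranges become narrow on *both* columns, which is what makes multi-column data skipping effective.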
🎯 When Should You Use OPTIMIZE?
Use it when:
✔ Your table receives lots of small batch writes
✔ You have many small files (file fragmentation)
✔ Query performance drops over time
✔ Dashboards require fast scans
✔ Streaming writes produce too many tiny files
Not ideal when:
✖ Data changes extremely frequently
✖ You’re running it on huge unpartitioned tables without Z-ORDER
✖ You run OPTIMIZE far too often (unnecessary compute cost)
🎯 When Should You Use Z-ORDER?
Use Z-ORDER when your queries frequently filter on specific columns:
- Customer-level queries
- Product or SKU-level queries
- Date or timestamp queries
- Geolocation or region filters
- IoT sensors filtered by device_id
Avoid Z-ORDER when:
- Your table already has perfect partitioning
- You rarely filter on the columns
- Your table is small (< 50 GB)
🧪 Real-World Example — 10× Faster Query
Ray’s company runs this query all day:
SELECT *
FROM sales_delta
WHERE customer_id = 99821;
Before Z-ORDER:
- Databricks scanned 1,200 files
- Query took 28 seconds
After:
OPTIMIZE sales_delta
ZORDER BY (customer_id);
Results:
- Scanned only 73 files
- Query took 2.1 seconds
- Dashboards loaded instantly
- Ray finally finished his coffee ☕
⚡ Benefits of OPTIMIZE + Z-ORDER
| Feature | Benefit |
|---|---|
| File Compaction | Faster reads & fewer metadata operations |
| Data Skipping | Databricks reads only the relevant files |
| Improved Clustering | Better filter performance |
| Lower Cost | Less compute + fewer scanned files |
| Faster Dashboards | BI tools feel “instant” |
🧠 Best Practices
- Run OPTIMIZE on large Delta tables weekly or daily (depending on volume).
- Use ZORDER BY on the columns most commonly used in WHERE filters.
- Don’t Z-ORDER too many columns at once: 1 to 3 is ideal.
- Schedule OPTIMIZE jobs in non-peak hours.
- Avoid running OPTIMIZE on very small tables (less than 10 GB).
📘 Summary
- OPTIMIZE compacts small files into large, efficient ones.
- Z-ORDER clusters data to enable data skipping and faster filters.
- Together, they can deliver order-of-magnitude (10× or more) speedups on selective queries.
- Best for large, heavily updated Delta Lake tables.
- Essential for production workloads, dashboards, and BI pipelines.
👉 Next Topic
File Compaction & Delta File Management