
OPTIMIZE Command (OPTIMIZE, Z-ORDER) — The Secret to Fast Delta Lake Queries

✨ Story Time — “Why is My Query Slower Today?”

Meet Ray, a data engineer working with a large Delta Lake table that receives millions of updates daily.

One morning:

  • Yesterday’s query ran in 6 seconds
  • Today the same query takes over 35 seconds
  • The dashboard team is already messaging him…

Ray checks the table and discovers:

  • Thousands of small Delta files
  • Poor clustering
  • No data skipping
  • And a warehouse that’s working harder than it should

He sighs…
Then smiles — because he knows the fix is simple:

OPTIMIZE + Z-ORDER

The Databricks “performance boost button.”


🧩 What is OPTIMIZE in Databricks?

OPTIMIZE is a Delta Lake command that compacts small files into large, efficient Parquet files.

Why is this important?

Because writing too many small files leads to:

  • Slow reads
  • High metadata overhead
  • Extra compute cost
  • Poor parallelization

How OPTIMIZE works:

  • Reads many small files
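  • Combines them into fewer, larger files (on Databricks the default target is about 1 GB per file, and it is configurable)
  • Compacts files within each partition, without changing the partitioning scheme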
  • Improves scan performance significantly
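
The steps above amount to a bin-packing problem. The following is a minimal Python sketch of that idea, not the Databricks implementation: the function name `plan_compaction`, the 1 GiB target, and the 8 MiB input files are all assumptions chosen for illustration.

```python
from typing import List

TARGET_BYTES = 1024 * 1024 * 1024  # assumed target size (1 GiB); illustrative only


def plan_compaction(file_sizes: List[int], target: int = TARGET_BYTES) -> List[List[int]]:
    """Greedily group small files into bins of at most `target` bytes.

    Each returned bin stands for one rewritten output file.
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current bin when the next file would push it past the target.
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [8 * 1024 * 1024] * 1000  # 1,000 hypothetical files of 8 MiB each
plan = plan_compaction(small_files)
print(len(small_files), "input files ->", len(plan), "output files")
```

With these assumed sizes, 1,000 small files collapse into 8 larger ones, which is why a reader has far fewer files to open and far less metadata to process afterwards.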

Example:

OPTIMIZE sales_delta;

Just one command, and subsequent reads get faster once the compaction job finishes.


🔍 What is Z-ORDER?

Z-ORDER is a multi-dimensional clustering technique that groups related data together physically on disk.

This improves data skipping, meaning:

➡ Databricks reads only the files that matter
➡ Not the entire dataset

Perfect for speeding up queries with filters such as:

  • WHERE customer_id = ...
  • WHERE date BETWEEN ...
  • WHERE product_category = ...
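
Data skipping relies on per-file minimum and maximum values kept for each column. The sketch below shows the pruning decision in plain Python; the `FileStats` class, the file names, and the value ranges are hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files: List[FileStats], wanted: int) -> List[str]:
    """Keep only files whose [min, max] range could contain `wanted`."""
    return [f.path for f in files if f.min_value <= wanted <= f.max_value]


# Well-clustered layout: each file covers a narrow customer_id range.
clustered = [
    FileStats("part-0", 1, 25_000),
    FileStats("part-1", 25_001, 50_000),
    FileStats("part-2", 50_001, 75_000),
    FileStats("part-3", 75_001, 100_000),
]
# Poorly clustered layout: every file spans almost the whole range.
scattered = [
    FileStats("part-0", 3, 99_950),
    FileStats("part-1", 1, 99_990),
    FileStats("part-2", 7, 100_000),
    FileStats("part-3", 2, 99_900),
]

print(files_to_read(clustered, 99_821))  # only part-3 can match
print(files_to_read(scattered, 99_821))  # no file can be ruled out
```

The same filter touches one file in the clustered layout and all four in the scattered one, even though both layouts could hold identical rows.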

Example:

OPTIMIZE sales_delta
ZORDER BY (customer_id, order_date);

This tells Databricks:

“Put rows with similar customer_id and order_date closer together.”
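
To see why this helps with more than one column, here is a small Python sketch of a Z-order (Morton) key built by interleaving the bits of two values. It illustrates the general curve only and is not how Databricks computes its clustering internally; `z_value` and the 4 × 4 grid of bucket numbers are assumptions made for the example.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Hypothetical rows: (customer_bucket, day_bucket), both small integers.
rows = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(rows, key=lambda r: z_value(r[0], r[1]))

# Cut the sorted rows into "files" of 4 rows each and show each file's ranges.
for n in range(0, len(ordered), 4):
    chunk = ordered[n:n + 4]
    xs = [r[0] for r in chunk]
    ys = [r[1] for r in chunk]
    print(f"file {n // 4}: x in [{min(xs)}, {max(xs)}], y in [{min(ys)}, {max(ys)}]")
```

Each of the four "files" ends up covering a narrow range in both columns, so a filter on either one can rule out half of them. A plain sort by the first column would give narrow ranges for that column only.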


🎯 When Should You Use OPTIMIZE?

Use it when:

✔ Your table receives lots of small batch writes
✔ You have many small files (file fragmentation)
✔ Query performance drops over time
✔ Dashboards require fast scans
✔ Streaming writes produce too many tiny files

Not ideal when:

✖ Data changes extremely frequently
✖ You’re optimizing unpartitioned huge tables without Z-ORDER
✖ You run OPTIMIZE far too often (unnecessary compute cost)


🎯 When Should You Use Z-ORDER?

Use Z-ORDER when your queries filter on a specific column frequently:

  • Customer-level queries
  • Product or SKU-level queries
  • Date or timestamp queries
  • Geolocation or region filters
  • IoT sensors filtered by device_id

Avoid Z-ORDER when:

  • Your queries filter only on partition columns, so partition pruning already does the work
  • You rarely filter on the columns
  • Your table is small (< 50 GB)

🧪 Real-World Example — 10× Faster Query

Ray’s company runs this query all day:

SELECT *
FROM sales_delta
WHERE customer_id = 99821;

Before Z-ORDER:

  • Databricks scanned 1,200 files
  • Query took 28 seconds

After:

OPTIMIZE sales_delta
ZORDER BY (customer_id);

Results:

  • Scanned only 73 files
  • Query took 2.1 seconds
  • Dashboards loaded instantly
  • Ray finally finished his coffee ☕
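
The drop in scanned files can be reproduced in miniature. The Python sketch below uses synthetic data and a fixed random seed; it does not reproduce Ray's numbers, and the row counts, file sizes, and helper names are assumptions. It compares how many files a min/max check must keep before and after the rows are clustered by customer_id.

```python
import random
from typing import List, Tuple


def write_files(ids: List[int], rows_per_file: int) -> List[Tuple[int, int]]:
    """Split ids into files in the given order; return (min, max) stats per file."""
    stats = []
    for start in range(0, len(ids), rows_per_file):
        chunk = ids[start:start + rows_per_file]
        stats.append((min(chunk), max(chunk)))
    return stats


def files_scanned(stats: List[Tuple[int, int]], wanted: int) -> int:
    """Count files whose min/max range cannot rule out `wanted`."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)


rng = random.Random(42)  # fixed seed so the run is repeatable
ids = [rng.randint(1, 100_000) for _ in range(120_000)]  # synthetic customer_id values

before = write_files(ids, rows_per_file=1_000)         # arrival order: ids mixed in every file
after = write_files(sorted(ids), rows_per_file=1_000)  # clustered by customer_id

print("files before:", files_scanned(before, 99_821), "of", len(before))
print("files after: ", files_scanned(after, 99_821), "of", len(after))
```

In the unclustered layout most files have a range wide enough to include the target id, so few can be skipped. After sorting, at most two adjacent files can cover it.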

⚡ Benefits of OPTIMIZE + Z-ORDER

Feature             | Benefit
File Compaction     | Faster reads & fewer metadata operations
Data Skipping       | Databricks reads only the relevant files
Improved Clustering | Better filter performance
Lower Cost          | Less compute + fewer scanned files
Faster Dashboards   | BI tools feel “instant”

🧠 Best Practices

  • Run OPTIMIZE on large Delta tables weekly or daily (depending on volume).
  • Use ZORDER on the columns most commonly used in WHERE filters.
  • Don’t Z-ORDER too many columns at once — 1 to 3 is ideal.
  • Schedule OPTIMIZE jobs during off-peak hours.
  • Avoid running OPTIMIZE on very small tables (less than 10 GB).
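
These rules of thumb could be folded into a small helper that a scheduled job calls for each table. The sketch below is one possible shape, assuming the size thresholds quoted in this article; the function name, the `tiny_lookup` table, and the sizes passed in are hypothetical. It only builds the statement text and does not run anything.

```python
from typing import List, Optional

MIN_OPTIMIZE_GB = 10   # rule of thumb from this article
MIN_ZORDER_GB = 50     # rule of thumb from this article
MAX_ZORDER_COLS = 3    # "1 to 3 is ideal"


def maintenance_statement(table: str, size_gb: float,
                          filter_columns: List[str]) -> Optional[str]:
    """Return an OPTIMIZE statement for the table, or None if it is too small.

    `filter_columns` should be ordered from most to least frequently filtered.
    """
    if size_gb < MIN_OPTIMIZE_GB:
        return None
    statement = f"OPTIMIZE {table}"
    if size_gb >= MIN_ZORDER_GB and filter_columns:
        # Keep only the most important columns to avoid diluting the clustering.
        cols = ", ".join(filter_columns[:MAX_ZORDER_COLS])
        statement += f" ZORDER BY ({cols})"
    return statement + ";"


print(maintenance_statement("sales_delta", 800, ["customer_id", "order_date"]))
print(maintenance_statement("tiny_lookup", 2, ["id"]))
```

The helper does not validate table or column names, so it should only be fed identifiers from a trusted list.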

📘 Summary

  • OPTIMIZE compacts small files into large, efficient ones.
  • Z-ORDER clusters data to enable data skipping and faster filters.
  • Together, they can make selective queries many times faster (roughly 13× in Ray’s example); the actual gain depends on data layout and filter selectivity.
  • Best for large, heavily updated Delta Lake tables.
  • Essential for production workloads, dashboards, and BI pipelines.
