
Performance & Optimization (Speed + Cost) - part 2

3. Data Skipping

Data Skipping means Spark automatically avoids scanning files that don’t contain relevant data — by checking column stats (min/max) stored in each file.

Instead of blindly scanning everything, it asks:
“Which files even could contain the data I need?”

It’s a built-in feature in Delta Lake — you don’t have to enable it manually.

🧠 Real-World Analogy
Imagine you’re looking for a book published in 2025:

Library has 10 shelves, each labeled with a year range:
Shelf 1: 2020–2021
Shelf 2: 2022–2023
Shelf 3: 2025–2026 ✅

You check the labels and go only to Shelf 3.
That’s data skipping.

✅ How It Works (Behind the Scenes)

Delta stores metadata like:

| File Name | Min order_date | Max order_date |
| --- | --- | --- |
| File_1 | 2025-08-01 | 2025-08-02 |
| File_2 | 2025-08-03 | 2025-08-04 |
| File_3 | 2025-08-05 | 2025-08-06 |

Now, if you run:

SELECT * FROM sales  
WHERE order_date = '2025-08-05';

👉 Spark checks file stats first and sees:

Only File_3 can contain that date.
So it skips all other files.

Result: 🚀 Faster query with less I/O.
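
The same pruning happens when you read with the DataFrame API. A minimal PySpark sketch, assuming the sales table above is registered in the metastore and you're in a notebook session where spark is available:

# The order_date filter is pushed down to Delta, which compares it against
# each file's min/max stats and only scans File_3
df = spark.table("sales").filter("order_date = '2025-08-05'")
df.show()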

Why OPTIMIZE Helps

If your data has lots of small files, each file holds only a few rows, so its min/max ranges overlap heavily and tell Spark very little. Many small files also mean more metadata and file-open overhead per query.

✅ OPTIMIZE solves this:

OPTIMIZE gold.sales;

- Combines small files into larger ones
- Each larger file now carries more meaningful stats
- So data skipping becomes much more effective
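
To see the effect, you can compare the table's file count before and after compaction. A minimal sketch, assuming the gold.sales table above on a Databricks/Delta runtime (numFiles and sizeInBytes come from DESCRIBE DETAIL):

# How many files back the table right now?
spark.sql("DESCRIBE DETAIL gold.sales").select("numFiles", "sizeInBytes").show()

# Compact small files into larger ones
spark.sql("OPTIMIZE gold.sales")

# Same check again; numFiles should drop noticeably
spark.sql("DESCRIBE DETAIL gold.sales").select("numFiles", "sizeInBytes").show()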

4. Auto Optimize & Auto Compaction

The Problem:

When writing data in small batches (e.g., streaming, micro-batch, CDC), Delta tables end up with many tiny files.

This is called the "small file problem."

✅ The Solution: Auto Optimize + Auto Compaction

| Feature | What It Does |
| --- | --- |
| Auto Optimize (Optimize Write) | Writes larger, more efficient files right from the start |
| Auto Compaction | Merges leftover small files into larger ones automatically after writes complete |

Together, they keep your Delta tables clean and performant — without needing manual OPTIMIZE jobs.

🧪 Example SQL to Enable:

ALTER TABLE silver.sales
SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
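
If you want this behavior for everything a notebook or job writes, rather than setting it per table, there is also a session-level switch. A minimal sketch, assuming a Databricks runtime that supports these Delta confs:

# Apply optimized writes + auto compaction to all Delta writes in this session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")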

🛠️ When to Use

| Use Case | ✅ Use It? |
| --- | --- |
| Streaming ingestion | YES — perfect fit |
| Incremental loads | YES — prevents file bloat |
| Heavy ETL pipelines | YES — keeps tables performant |
| Static full-refresh datasets | ❌ Not needed — just run OPTIMIZE once after the load |

🔍 Tuning Notes

- Doesn't fully replace OPTIMIZE for big performance tuning, but reduces the need for frequent manual runs
- Works best in high-frequency write environments (like real-time or hourly data)
- Auto Compaction kicks in automatically after a write succeeds, so you don't have to schedule it yourself

🚘 5. Cluster Sizing & Photon

Databricks runs your jobs on clusters (groups of machines or VMs).

The cluster size determines:

- How fast your job runs
- How much you pay

| Workload | Recommended Cluster |
| --- | --- |
| 🧪 Small jobs | Light ETL, dev testing → 2–4 nodes |
| 🧱 Heavy batch jobs | Aggregations, joins, transformations → 8–16 nodes |
| 📊 BI dashboards / SQL | Auto-scaling clusters with Photon |
| ⚙️ Experiments / retry logic | Spot instances for cost savings |

❗ Problem:

- Too small → jobs spill to disk, slow performance
- Too big → wasted money, underutilized resources

Top Engineer Insight:

Monitor these to right-size:

- Shuffle spill: memory not enough for joins/sorts? Increase RAM.
- Executor memory/CPU: constantly maxed? You need bigger nodes.
- Skewed tasks: one task runs longer than the others? Time to tune the data layout.

Enter Photon: The Jetpack Engine


Photon is Databricks' vectorized execution engine, written in C++ (instead of Java/Scala like traditional Spark).

✅ Automatically speeds up:

- SQL queries
- Delta Lake operations
- BI dashboards (Databricks SQL)

🔥 Photon Benefits

  • 2–3x faster queries
  • Lower compute cost
  • Works with Delta Lake + Spark SQL
  • Easy to enable — just check a box

🛠️ How to Enable Photon

Via UI:

When creating a cluster, just check "Enable Photon"

Via JSON config:

{
  "cluster_name": "photon-cluster",
  "spark_version": "13.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
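
A quick way to confirm Photon is actually being used for a query is to look at the physical plan; on a Photon-enabled cluster the operators appear with Photon prefixes. A minimal sketch, reusing the sales table from earlier:

# On a Photon cluster, explain() shows Photon* operators (e.g. PhotonGroupingAgg)
# for the parts of the plan Photon can execute
spark.table("sales").groupBy("order_date").count().explain()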

📊 Real Example: Before vs After Photon

A retail company runs daily sales reports.

| Metric       | Before Photon | After Photon |
| ------------ | ------------- | ------------ |
| Cluster Size | 10 nodes      | 6 nodes      |
| Query Time   | 45 minutes    | 15 minutes   |

Result: ✅ 3x faster, 💰 cheaper

“The best engineers don’t just use big clusters — they use the right clusters, with Photon, at the right time.”

6. File Format Optimization

The Story: Grocery Store Files

Imagine you run a grocery store, and every day you get invoices from different suppliers:

Supplier A sends handwritten notes (slow to read).
Supplier B sends Excel files (a bit faster).
Supplier C sends organized PDFs (okay speed).
Supplier D sends barcode-scanned digital lists (super fast!).

If you want to quickly find the price of milk, which supplier’s file would you prefer?

Clearly, the barcode-scanned list wins!

💡 What This Means for Data Files

| Format | Like… | Speed & Efficiency |
| --- | --- | --- |
| CSV | Handwritten notes | Slow, bulky, no compression |
| JSON | Semi-structured Excel | Flexible, but heavy |
| Parquet | Barcode format | Columnar, compressed, fast |
| Delta | Barcode + history + security | Best performance & reliability |

Professional takeaway: choose the right format to speed up your data journey.

  • CSV & JSON are easy but slow and large—bad for big data analytics.

  • Parquet is columnar and compressed, speeding up queries and saving storage.

  • Delta Lake builds on Parquet and adds ACID transactions, schema control, and time travel — perfect for reliable, fast analytics.

🚀 Best Practice

  • Always convert raw CSV/JSON to Delta or Parquet ASAP.

  • Use Delta as your production data format for the best speed, compression, and reliability.

  • Partition large datasets (e.g., by date) for faster queries (see the partitioned write in the snippets below).

Code Snippets

# Not recommended for big data: row-based text, no compression
df.write.format("csv").option("header", True).mode("overwrite").save("/mnt/raw/sales_data")

# Better: Parquet (columnar and compressed)
df.write.format("parquet").mode("overwrite").save("/mnt/bronze/sales_data")

# Best practice: Delta (Parquet under the hood, plus ACID and time travel)
df.write.format("delta").mode("overwrite").save("/mnt/silver/sales_data")
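
And to apply the partitioning tip above, a minimal sketch (the order_date column name is an assumption about this dataset):

# Partition the Delta table by date so queries filtering on order_date
# read only the matching partition directories
(df.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("/mnt/silver/sales_data"))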

Quick Takeaway
Use Delta for your analytics data; it's the barcode scanner of data formats — fast, efficient, and reliable. Avoid raw CSV/JSON beyond ingestion.

🔑 1-Minute Summary: Performance & Optimization

| 🔧 Technique | Best For | ⚠️ Watch Out For |
| --- | --- | --- |
| Partitioning | Low-cardinality columns (e.g., date, region) | Too many partitions → small files → worse performance |
| Z-Ordering | High-cardinality filter columns (e.g., customer_id, product_id) | Requires compute-heavy OPTIMIZE; not real-time |
| Use Both Together | Partition by date, Z-Order by customer_id | Use when both filter types are common |
| Caching | Interactive queries, repeated dashboard reads | Needs enough memory; must manually trigger with .cache() |
| Materialized Views | Stable aggregated data (e.g., daily revenue) | Can become stale; needs manual/scheduled refresh |
| Data Skipping | Filtering by min/max-aware columns (e.g., order_date) | Less effective with many small files |
| Auto Optimize + Compaction | Streaming, frequent micro-batches, CDC | Not needed for static, full-refresh datasets |
| Cluster Sizing | Matching workload to the right compute size | Too small → slow; too big → expensive waste |
| Photon Engine | SQL, Delta, BI queries | Not all workloads benefit; needs enabling |
| File Format Optimization | Analytics workloads on large data volumes | CSV/JSON slow and heavy → use Delta/Parquet |