
Performance & Optimization (Speed + Cost) - part 2

3. Data Skipping

Data Skipping means Spark automatically avoids scanning files that don’t contain relevant data — by checking column stats (min/max) stored in each file.

Instead of blindly scanning everything, it asks:
“Which files even could contain the data I need?”

It’s a built-in feature in Delta Lake — you don’t have to enable it manually.

🧠 Real-World Analogy
Imagine you’re looking for a book published in 2025:

Library has 10 shelves, each labeled with a year range:
Shelf 1: 2020–2021
Shelf 2: 2022–2023
Shelf 3: 2025–2026 ✅

You check the labels and go only to Shelf 3.
That’s data skipping.

✅ How It Works (Behind the Scenes)

Delta stores metadata like:

| File Name | Min order_date | Max order_date |
| --- | --- | --- |
| File_1 | 2025-08-01 | 2025-08-02 |
| File_2 | 2025-08-03 | 2025-08-04 |
| File_3 | 2025-08-05 | 2025-08-06 |

Now, if you run:

SELECT * FROM sales  
WHERE order_date = '2025-08-05';

👉 Spark checks file stats first and sees:

Only File_3 can contain that date.
So it skips all other files.

Result: 🚀 Faster query with less I/O.
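
The same pruning happens when you read with the DataFrame API. A minimal PySpark sketch, assuming the sales table above is registered in the metastore and you're in a notebook session where spark is available:

# The order_date filter is pushed down to Delta, which compares it against
# each file's min/max stats and only scans File_3
df = spark.table("sales").filter("order_date = '2025-08-05'")
df.show()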

Why OPTIMIZE Helps

If your data has lots of small files, each file holds only a few rows, so its min/max ranges overlap heavily and tell Spark very little. Many small files also mean more metadata and file-open overhead per query.

✅ OPTIMIZE solves this:

OPTIMIZE gold.sales;

- Combines small files into larger ones
- Each larger file now carries more meaningful stats
- So data skipping becomes much more effective
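
To see the effect, you can compare the table's file count before and after compaction. A minimal sketch, assuming the gold.sales table above on a Databricks/Delta runtime (numFiles and sizeInBytes come from DESCRIBE DETAIL):

# How many files back the table right now?
spark.sql("DESCRIBE DETAIL gold.sales").select("numFiles", "sizeInBytes").show()

# Compact small files into larger ones
spark.sql("OPTIMIZE gold.sales")

# Same check again; numFiles should drop noticeably
spark.sql("DESCRIBE DETAIL gold.sales").select("numFiles", "sizeInBytes").show()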

4. Auto Optimize & Auto Compaction

The Problem:

When writing data in small batches (e.g., streaming, micro-batch, CDC), Delta tables end up with many tiny files.

This is called the "small file problem."

✅ The Solution: Auto Optimize + Auto Compaction

| Feature | What It Does |
| --- | --- |
| Auto Optimize (Optimize Write) | Writes larger, more efficient files right from the start |
| Auto Compaction | Merges leftover small files into larger ones automatically after writes complete |

Together, they keep your Delta tables clean and performant — without needing manual OPTIMIZE jobs.

🧪 Example SQL to Enable:

ALTER TABLE silver.sales
SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
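
If you want this behavior for everything a notebook or job writes, rather than setting it per table, there is also a session-level switch. A minimal sketch, assuming a Databricks runtime that supports these Delta confs:

# Apply optimized writes + auto compaction to all Delta writes in this session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")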

🛠️ When to Use

| Use Case | ✅ Use It? |
| --- | --- |
| Streaming ingestion | YES — perfect fit |
| Incremental loads | YES — prevents file bloat |
| Heavy ETL pipelines | YES — keeps tables performant |
| Static full-refresh datasets | ❌ Not needed — just run OPTIMIZE once after the load |

🔍 Tuning Notes

- Doesn't fully replace OPTIMIZE for big performance tuning, but reduces the need for frequent manual runs
- Works best in high-frequency write environments (like real-time or hourly data)
- Auto Compaction kicks in automatically after a write succeeds, so you don't have to schedule it yourself

🚘 5. Cluster Sizing & Photon

Databricks runs your jobs on clusters (groups of machines or VMs).

The cluster size determines:

- How fast your job runs
- How much you pay

| Workload | Recommended Cluster |
| --- | --- |
| 🧪 Small jobs | Light ETL, dev testing → 2–4 nodes |
| 🧱 Heavy batch jobs | Aggregations, joins, transformations → 8–16 nodes |
| 📊 BI dashboards / SQL | Auto-scaling clusters with Photon |
| ⚙️ Experiments / retry logic | Spot instances for cost savings |

❗ Problem:

- Too small → jobs spill to disk, slow performance
- Too big → wasted money, underutilized resources

Top Engineer Insight:

Monitor these to right-size:

- Shuffle spill: memory not enough for joins/sorts? Increase RAM.
- Executor memory/CPU: constantly maxed? You need bigger nodes.
- Skewed tasks: one task runs longer than the others? Time to tune the data layout.

Enter Photon: The Jetpack Engine


Photon is Databricks' vectorized execution engine, written in C++ (instead of Java/Scala like traditional Spark).

✅ Automatically speeds up:

- SQL queries
- Delta Lake operations
- BI dashboards (Databricks SQL)

🔥 Photon Benefits

  • 2–3x faster queries
  • Lower compute cost
  • Works with Delta Lake + Spark SQL
  • Easy to enable — just check a box

🛠️ How to Enable Photon

Via UI:

When creating a cluster, just check "Enable Photon"

Via JSON config:

{
  "cluster_name": "photon-cluster",
  "spark_version": "13.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
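
A quick way to confirm Photon is actually being used for a query is to look at the physical plan; on a Photon-enabled cluster the operators appear with Photon prefixes. A minimal sketch, reusing the sales table from earlier:

# On a Photon cluster, explain() shows Photon* operators (e.g. PhotonGroupingAgg)
# for the parts of the plan Photon can execute
spark.table("sales").groupBy("order_date").count().explain()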

📊 Real Example: Before vs After Photon

A retail company runs daily sales reports.

| Metric       | Before Photon | After Photon |
| ------------ | ------------- | ------------ |
| Cluster Size | 10 nodes      | 6 nodes      |
| Query Time   | 45 minutes    | 15 minutes   |

Result: ✅ 3x faster, 💰 cheaper

“The best engineers don’t just use big clusters — they use the right clusters, with Photon, at the right time.”

6. File Format Optimization

The Story: Grocery Store Files

Imagine you run a grocery store, and every day you get invoices from different suppliers:

Supplier A sends handwritten notes (slow to read).
Supplier B sends Excel files (a bit faster).
Supplier C sends organized PDFs (okay speed).
Supplier D sends barcode-scanned digital lists (super fast!).

If you want to quickly find the price of milk, which supplier’s file would you prefer?

Clearly, the barcode-scanned list wins!

💡 What This Means for Data Files

| Format | Like… | Speed & Efficiency |
| --- | --- | --- |
| CSV | Handwritten notes | Slow, bulky, no compression |
| JSON | Semi-structured Excel | Flexible, but heavy |
| Parquet | Barcode format | Columnar, compressed, fast |
| Delta | Barcode + history + security | Best performance & reliability |

Professional takeaway: choose the right format to speed up your data journey.

  • CSV & JSON are easy but slow and large—bad for big data analytics.

  • Parquet is columnar and compressed, speeding up queries and saving storage.

  • Delta Lake builds on Parquet and adds ACID transactions, schema control, and time travel — perfect for reliable, fast analytics.

🚀 Best Practice

  • Always convert raw CSV/JSON to Delta or Parquet ASAP.

  • Use Delta as your production data format for the best speed, compression, and reliability.

  • Partition large datasets (e.g., by date) for faster queries (see the partitioned write in the snippets below).

Code Snippets

# Not recommended for big data: row-based text, no compression
df.write.format("csv").option("header", True).mode("overwrite").save("/mnt/raw/sales_data")

# Better: Parquet (columnar and compressed)
df.write.format("parquet").mode("overwrite").save("/mnt/bronze/sales_data")

# Best practice: Delta (Parquet under the hood, plus ACID and time travel)
df.write.format("delta").mode("overwrite").save("/mnt/silver/sales_data")
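
And to apply the partitioning tip above, a minimal sketch (the order_date column name is an assumption about this dataset):

# Partition the Delta table by date so queries filtering on order_date
# read only the matching partition directories
(df.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("/mnt/silver/sales_data"))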

Quick Takeaway
Use Delta for your analytics data; it's the barcode scanner of data formats — fast, efficient, and reliable. Avoid raw CSV/JSON beyond ingestion.

🔑 1-Minute Summary: Performance & Optimization

| 🔧 Technique | Best For | ⚠️ Watch Out For |
| --- | --- | --- |
| Partitioning | Low-cardinality columns (e.g., date, region) | Too many partitions → small files → worse performance |
| Z-Ordering | High-cardinality filter columns (e.g., customer_id, product_id) | Requires compute-heavy OPTIMIZE; not real-time |
| Use Both Together | Partition by date, Z-Order by customer_id | Use when both filter types are common |
| Caching | Interactive queries, repeated dashboard reads | Needs enough memory; must manually trigger with .cache() |
| Materialized Views | Stable aggregated data (e.g., daily revenue) | Can become stale; needs manual/scheduled refresh |
| Data Skipping | Filtering by min/max-aware columns (e.g., order_date) | Less effective with many small files |
| Auto Optimize + Compaction | Streaming, frequent micro-batches, CDC | Not needed for static, full-refresh datasets |
| Cluster Sizing | Matching workload to the right compute size | Too small → slow; too big → expensive waste |
| Photon Engine | SQL, Delta, BI queries | Not all workloads benefit; needs enabling |
| File Format Optimization | Analytics workloads on large data volumes | CSV/JSON slow and heavy → use Delta/Parquet |