
Caching in Databricks: Best Practices

✨ Story Time: “My Query Is Fast… But Only the Second Time?”

Mila, a data analyst, runs a heavy analytical query:

SELECT region, SUM(total_sales)
FROM transactions
GROUP BY region;

The first time: 30 seconds. The second time: 4 seconds.

She wonders: “Why is it so much faster now?”

Her teammate smiles:

“That’s Databricks caching. Used right, it can speed up your entire Lakehouse.”

Let’s understand why caching can be your secret superpower.


🧩 What Is Caching in Databricks?

Caching means keeping frequently accessed data in memory or on local SSDs so that repeated queries skip the slow trip to cloud storage.

Databricks supports three types of caching:

  1. Delta Cache, now called disk cache (kept on local worker SSDs and managed by Databricks)
  2. Spark Memory Cache (held in executor memory via CACHE TABLE or DataFrame.cache())
  3. Databricks SQL Warehouse caching (query result caches, plus the same disk cache)

Each one accelerates repeated queries by avoiding costly reads from cloud storage (S3/ADLS/GCS) or, for cached results, by avoiding recomputation.
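To make the idea concrete, here is a toy read-through cache in plain Python. It only illustrates why a repeated read is cheaper than the first one; it is not how Databricks implements any of its caches, and `ReadThroughCache`, `slow_storage_read`, and the file path are hypothetical names made up for this sketch.

```python
class ReadThroughCache:
    """Toy read-through cache: fetch from slow storage once, then serve locally."""

    def __init__(self, loader):
        self._loader = loader  # function that performs the expensive read
        self._store = {}       # local copies, keyed by file path
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._store:
            self.hits += 1
            return self._store[path]
        self.misses += 1
        data = self._loader(path)
        self._store[path] = data
        return data


def slow_storage_read(path):
    # Stand-in for a cloud-storage read (hypothetical; no real I/O happens here).
    return f"contents of {path}"


cache = ReadThroughCache(slow_storage_read)
first = cache.read("transactions/part-0000.parquet")   # miss: goes to "storage"
second = cache.read("transactions/part-0000.parquet")  # hit: served from the local copy
print(cache.misses, cache.hits)
```

Mila's first run corresponds to the miss and her second run to the hit.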


🔍 Type 1: Disk Cache, Formerly Delta Cache (Databricks Runtime)

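Typically enabled by default on worker instance types that come with local SSDs, and used automatically when reading Parquet data files, which includes Delta tables.

How it works:

  • Cached on each worker node
  • Stored on local SSDs, not RAM
  • Persists for the lifetime of the cluster and is lost when the cluster terminates
  • Great for BI dashboards and repeated table scans

Enable (if disabled):

SET spark.databricks.io.cache.enabled = true;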
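From a Python notebook, the same setting can be checked and flipped through the session configuration. This is a minimal sketch that assumes `spark` is the SparkSession Databricks provides in notebooks; the helper name `ensure_disk_cache_enabled` is made up for illustration.

```python
DISK_CACHE_KEY = "spark.databricks.io.cache.enabled"


def ensure_disk_cache_enabled(spark):
    """Turn the disk cache on for this session if it is off.

    Returns True if the setting was changed, False if it was already enabled.
    """
    # The second argument is the fallback used when the key has never been set.
    current = spark.conf.get(DISK_CACHE_KEY, "false")
    if str(current).lower() == "true":
        return False
    spark.conf.set(DISK_CACHE_KEY, "true")
    return True
```

Calling `ensure_disk_cache_enabled(spark)` at the top of a notebook is a cheap way to rule out a disabled cache before looking for other causes of slow reads.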

🔍 Type 2: Spark Memory Cache

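Manually cache a table in executor memory:

CACHE TABLE transactions;

Note that CACHE SELECT * FROM transactions; is a different command: it preloads data into the disk-based Delta Cache described under Type 1, not into Spark memory.

Best for:

  • Heavy compute transformations
  • Machine learning workloads
  • Repeated DataFrame operations in notebooks

Limitations:

  • Uses cluster RAM that would otherwise be available for query execution
  • Cache is lost if the cluster goes down
  • Not ideal for very large tables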
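For DataFrame work, one way to respect the RAM limitation is to cache only for as long as the data is being reused. The context manager below is a sketch of that pattern, not an official Databricks helper; the name `cached` is hypothetical, while `cache()`, `count()`, and `unpersist()` are standard DataFrame methods.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache a DataFrame for the duration of a with-block, then release it."""
    df.cache()   # lazy: only marks the DataFrame for caching
    df.count()   # an action forces Spark to materialize the cache
    try:
        yield df
    finally:
        df.unpersist()  # free executor memory even if the block raises
```

In a notebook this might look like `with cached(spark.table("transactions")) as tx:` followed by the repeated aggregations on `tx`.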

🔍 Type 3: Databricks SQL Warehouse Cache

Used by BI tools and dashboards.

Benefits:

  • Fast interactive queries
  • Caches query results and metadata, so an identical query over unchanged data can return without being recomputed
  • Extremely efficient for dashboards that refresh frequently

Works automatically behind the scenes, with no setup required.


⚑ When Should You Use Caching?​

βœ… Use caching when:​

βœ” Running the same query multiple times βœ” Interactive analysis (SQL editor, notebooks, BI tools) βœ” Data fits into memory or SSD βœ” You want ultra-fast dashboard loads βœ” You run iterative ML transformations βœ” You have hot tables read frequently


❌ When NOT to Use Caching​

Avoid caching when:​

βœ– Your table updates very frequently (cache invalidation overhead) βœ– Data is too large to fit in RAM or SSD βœ– You are running one-time ETL jobs βœ– Queries are always unique (no repetition) βœ– You are using Job Clusters (cache resets every job)


🧪 Real-World Example: A 7× Faster Dashboard

Mila’s team has a Power BI dashboard querying a Delta table every 10 minutes.

Before caching:

  • 22-second load time
  • Frequent cluster spikes
  • Occasional timeouts

After enabling caching on the cluster:

CACHE TABLE sales_aggregated;

Results:

  • Dashboard loads in 3 seconds
  • Cluster CPU dropped by 45%
  • BI team finally stopped complaining 🎉

🔧 How to Check If Cache Is Being Used

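For Delta Cache, confirm at runtime that it is enabled:

spark.conf.get("spark.databricks.io.cache.enabled")

Then check the Storage tab of the Spark UI, which reports cache usage per executor. Note that DESCRIBE DETAIL delta.`/path/to/table`; only returns table metadata such as location, file count, and size. It says nothing about caching.

For Memory Cache:

spark.catalog.isCached("transactions")

Cached tables also show up in the Storage tab of the Spark UI. SHOW TABLES does not report cache status, and CLEAR CACHE; does not check anything: it evicts everything from the memory cache, so run it only when you want to free that memory.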
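To check several tables at once from a notebook, the Spark catalog API can be wrapped in a couple of small helpers. This is a sketch that assumes `spark` is the notebook's SparkSession; `cache_report` and `uncache_all_but` are hypothetical names, while `spark.catalog.isCached` and `spark.catalog.uncacheTable` are standard PySpark calls.

```python
def cache_report(spark, table_names):
    """Map each table name to True/False depending on whether Spark has it cached."""
    return {name: spark.catalog.isCached(name) for name in table_names}


def uncache_all_but(spark, table_names, keep):
    """Uncache every cached table in table_names that is not listed in keep.

    Returns the names that were released, in the order they were processed.
    """
    released = []
    for name in table_names:
        if name not in keep and spark.catalog.isCached(name):
            spark.catalog.uncacheTable(name)
            released.append(name)
    return released
```

Unlike CLEAR CACHE, which drops every cached table, `uncache_all_but` leaves the tables you name in `keep` untouched.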

🎯 Best Practices for Caching​

🟩 1. Cache only hot datasets​

Avoid caching huge cold data.

🟩 2. Use Delta Cache for most SQL workloads​

Lightweight and automatic.

🟩 3. Use MEMORY cache only for DataFrame-heavy notebooks​

Not for general SQL.

🟩 4. Don’t over-cache​

Caching useless data = wasted resources.

🟩 5. Combine caching with Z-ORDER + OPTIMIZE​

They complement each other for performance.
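As a rough sketch of how the two fit together, the helper below builds an OPTIMIZE ... ZORDER BY statement followed by a CACHE SELECT that preloads the hot columns into the Delta Cache. The function name and the column choices are assumptions for illustration; a real Z-ORDER column should be one your queries commonly filter on.

```python
def optimize_and_warm_statements(table, zorder_cols, hot_cols):
    """Build SQL to compact and cluster a Delta table, then preload its hot columns.

    Names are interpolated directly into the SQL, so pass only trusted identifiers.
    """
    if not zorder_cols or not hot_cols:
        raise ValueError("zorder_cols and hot_cols must both be non-empty")
    optimize = f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"
    warm = f"CACHE SELECT {', '.join(hot_cols)} FROM {table}"
    return [optimize, warm]


statements = optimize_and_warm_statements(
    "transactions",
    zorder_cols=["region"],               # illustrative choice only
    hot_cols=["region", "total_sales"],   # columns used by Mila's query
)
```

In a notebook, each statement could then be run with `spark.sql(stmt)`. Running OPTIMIZE first matters because it rewrites the data files, so anything cached from the old files would no longer be useful.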

🟩 6. Tune cache size for large clusters​

Use SSD-heavy clusters for maximum Delta Cache performance.
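The disk budget is controlled through Spark configuration set when the cluster is created. The helper below is a minimal sketch that assembles those entries as a dictionary, for example to paste into a cluster's Spark config or a cluster spec's `spark_conf` field. The function name and the default values are illustrative assumptions, not recommendations.

```python
def disk_cache_spark_conf(max_disk_usage="50g", max_metadata="1g", compress=False):
    """Return Spark conf entries that size the disk cache on each worker node."""
    return {
        "spark.databricks.io.cache.enabled": "true",
        # Disk space per node reserved for cached data
        "spark.databricks.io.cache.maxDiskUsage": max_disk_usage,
        # Disk space per node reserved for cached metadata
        "spark.databricks.io.cache.maxMetaDataCache": max_metadata,
        # Whether cached data is stored compressed
        "spark.databricks.io.cache.compression.enabled": str(compress).lower(),
    }


conf = disk_cache_spark_conf(max_disk_usage="100g")
```

Size `max_disk_usage` against the local SSD capacity of the chosen worker type, leaving room for shuffle and spill files.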


📘 Summary

  • Caching improves query speed dramatically by storing frequently accessed data in memory or SSD.
  • Databricks provides three caching layers: Delta Cache (now called disk cache), Spark Memory Cache, and SQL Warehouse Cache.
  • Best used for repeated queries, dashboards, ML workflows, and hot datasets.
  • Avoid caching massive or frequently updated data.
  • Use caching alongside OPTIMIZE, Z-ORDER, and file compaction for maximum Lakehouse performance.

Caching = Fast queries, low cost, happy analysts.

