Caching in Databricks: Best Practices
Story Time: "My Query Is Fast… But Only the Second Time?"
Mila, a data analyst, runs a heavy analytical query:
SELECT region, SUM(total_sales)
FROM transactions
GROUP BY region;
The first time: 30 seconds
The second time: 4 seconds
She wonders: "Why is it so much faster now?"
Her teammate smiles:
"That's Databricks caching. Used right, it can speed up your entire Lakehouse."
Let's understand why caching can be your secret superpower.
What Is Caching in Databricks?
Caching means storing frequently accessed data in memory or local SSDs so queries run MUCH faster.
Databricks supports three types of caching:
- Delta Cache (disk-based cache managed by Databricks; newer Databricks documentation calls it the disk cache)
- Spark Memory Cache (stored in RAM with CACHE TABLE)
- Disk Cache for Databricks SQL Warehouses
Each one accelerates repeated queries by avoiding costly reads from cloud storage (S3/ADLS/GCS).
Type 1: Delta Cache (Databricks Runtime)
On worker types with local SSDs it is usually enabled by default, and it caches copies of remote Parquet data files (including Delta tables) as they are read.
How it works:
- Cached on each worker node
- Stored on local SSDs, not RAM
- Persistent during the cluster lifetime
- Great for BI dashboards & repeated table scans
Enable (if disabled):
SET spark.databricks.io.cache.enabled = true;
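From a notebook, the same setting can be checked and switched on through the session config. The sketch below is a minimal helper, assuming a `spark` session object like the one Databricks notebooks provide; the helper name `ensure_disk_cache_enabled` is made up for illustration.

```python
DISK_CACHE_KEY = "spark.databricks.io.cache.enabled"


def ensure_disk_cache_enabled(spark):
    """Turn the Delta (disk) cache on if it is off.

    Returns True if the setting was changed, False if it was already on.
    `spark` is assumed to be an active SparkSession; the helper name is
    hypothetical and not part of any Databricks API.
    """
    # Spark conf values come back as strings, so compare case-insensitively.
    current = spark.conf.get(DISK_CACHE_KEY, "false")
    if str(current).lower() == "true":
        return False
    spark.conf.set(DISK_CACHE_KEY, "true")
    return True
```

Calling `ensure_disk_cache_enabled(spark)` at the top of a notebook makes the intent explicit without changing anything when the cache is already on.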
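Type 2: Spark Memory Cache
Manually cache a table with the Spark cache:
CACHE TABLE transactions;
Note that the similar-looking statement below does not use the Spark memory cache; it preloads the selected data into the Delta Cache on local SSDs:
CACHE SELECT * FROM transactions;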
Best for:
- Heavy compute transformations
- Machine learning workloads
- Repeated DataFrame operations in notebooks
Limitations:
- Uses cluster RAM
- Cache is lost if cluster goes down
- Not ideal for very large tables
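Because the memory cache competes with everything else for cluster RAM, it helps to release it as soon as the repeated work is done. The following is a sketch of one way to scope a cache to a block of work, assuming an active `spark` session; `cached_table` is a hypothetical helper name, not a Spark or Databricks function.

```python
from contextlib import contextmanager


@contextmanager
def cached_table(spark, table_name):
    """Cache `table_name` for the duration of a `with` block, then release it.

    Assumes `spark` is an active SparkSession and `table_name` is a trusted
    identifier (it is interpolated into the SQL text as-is).
    """
    # CACHE TABLE is eager unless LAZY is given, so the table is loaded here.
    spark.sql(f"CACHE TABLE {table_name}")
    try:
        yield table_name
    finally:
        # Free executor memory even if the work inside the block fails.
        spark.sql(f"UNCACHE TABLE IF EXISTS {table_name}")
```

Used as `with cached_table(spark, "transactions"): ...`, every query inside the block reads from the cache, and the memory is handed back afterwards.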
Type 3: Databricks SQL Warehouse Cache
Used by BI tools and dashboards.
Benefits:
- Fast interactive queries
- Stores results & metadata
- Extremely efficient for dashboards that refresh frequently
Works automatically behind the scenes, with no setup required.
When Should You Use Caching?
Use caching when:
- You run the same query multiple times
- You do interactive analysis (SQL editor, notebooks, BI tools)
- The data fits into memory or SSD
- You want ultra-fast dashboard loads
- You run iterative ML transformations
- You have hot tables that are read frequently
When NOT to Use Caching
Avoid caching when:
- Your table updates very frequently (cache invalidation overhead)
- Data is too large to fit in RAM or SSD
- You are running one-time ETL jobs
- Queries are always unique (no repetition)
- You are using Job Clusters (cache resets every job)
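The two checklists above can be folded into a small rule-of-thumb function. This is only a sketch: the `Workload` fields and the read-to-write threshold of 5 are illustrative assumptions, not Databricks guidance.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    reads_per_hour: float      # how often the same data is read
    writes_per_hour: float     # how often the table is updated
    size_gb: float             # size of the data you would cache
    cache_capacity_gb: float   # RAM or SSD available for caching
    on_job_cluster: bool       # job clusters start with an empty cache


def should_cache(w: Workload, min_reads_per_write: float = 5.0) -> bool:
    """Rule-of-thumb check; the default threshold is an illustrative assumption."""
    if w.on_job_cluster:
        return False  # cache resets every job
    if w.size_gb > w.cache_capacity_gb:
        return False  # data does not fit in RAM or SSD
    if w.reads_per_hour <= 1:
        return False  # one-off or always-unique queries
    if w.writes_per_hour > 0:
        # Frequent updates invalidate the cache before it pays off.
        if w.reads_per_hour / w.writes_per_hour < min_reads_per_write:
            return False
    return True
```

For example, a 2 GB table read six times an hour and never updated on an all-purpose cluster passes the check, while the same table on a job cluster does not.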
Real-World Example: Roughly 7× Faster Dashboard
Mila's team has a Power BI dashboard querying a Delta table every 10 minutes.
Before caching:
- 22-second load time
- Frequent cluster spikes
- Occasional timeouts
After enabling caching on the cluster:
CACHE TABLE sales_aggregated;
Results:
- Dashboard loads in 3 seconds
- Cluster CPU dropped by 45%
- BI team finally stopped complaining
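To put those numbers in perspective, the arithmetic can be worked through directly. The load times and refresh interval come from the example; the assumption that the dashboard refreshes around the clock is mine.

```python
SECONDS_BEFORE = 22       # load time before caching (from the example)
SECONDS_AFTER = 3         # load time after caching (from the example)
REFRESH_EVERY_MIN = 10    # dashboard refresh interval (from the example)

speedup = SECONDS_BEFORE / SECONDS_AFTER
# Assumes the dashboard refreshes 24 hours a day.
refreshes_per_day = 24 * 60 // REFRESH_EVERY_MIN
saved_minutes_per_day = refreshes_per_day * (SECONDS_BEFORE - SECONDS_AFTER) / 60

print(f"speedup: {speedup:.1f}x")                                    # speedup: 7.3x
print(f"refreshes per day: {refreshes_per_day}")                     # refreshes per day: 144
print(f"query time saved per day: {saved_minutes_per_day:.1f} min")  # query time saved per day: 45.6 min
```

Going from 22 seconds to 3 seconds is a speedup of about 7.3×, and over 144 daily refreshes it removes roughly 45 minutes of query time per day.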
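How to Check If Cache Is Being Used
For Delta Cache, confirm the setting at runtime:
spark.conf.get("spark.databricks.io.cache.enabled")
DESCRIBE DETAIL does not report cache hits, but its sizeInBytes column tells you whether the table can realistically fit in the cache:
DESCRIBE DETAIL delta.`/path/to/table`;
For Memory Cache, SHOW TABLES lists the tables in the current schema but not their cache status; check a specific table with spark.catalog.isCached("transactions"), or look at the Storage tab in the Spark UI:
SHOW TABLES;
To drop everything from the memory cache, for example before re-testing a cold run:
CLEAR CACHE;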
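A simple empirical check is to time the same query several times and compare the first (cold) run with the later (warm) ones, just as Mila noticed in the opening story. The helper below is a minimal sketch using only the standard library; `time_runs` and `warm_speedup` are hypothetical names, and `run_query` is any callable you supply.

```python
import time


def time_runs(run_query, runs=3):
    """Call `run_query` several times and return each wall-clock duration in seconds.

    `run_query` is any zero-argument callable that executes the query and
    forces the result, e.g. `lambda: spark.sql(QUERY).collect()`.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def warm_speedup(durations):
    """Ratio of the first (cold) run to the fastest later (warm) run."""
    if len(durations) < 2:
        raise ValueError("need at least one cold and one warm run")
    cold, warm = durations[0], min(durations[1:])
    return cold / warm if warm > 0 else float("inf")
```

A ratio well above 1 suggests a cache is serving the later runs; a ratio near 1 suggests the query is reading from cloud storage every time.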
Best Practices for Caching
1. Cache only hot datasets
Avoid caching huge cold data.
2. Use Delta Cache for most SQL workloads
Lightweight and automatic.
3. Use MEMORY cache only for DataFrame-heavy notebooks
Not for general SQL.
4. Don't over-cache
Caching useless data = wasted resources.
5. Combine caching with Z-ORDER + OPTIMIZE
They complement each other for performance.
6. Tune cache size for large clusters
Use SSD-heavy clusters for maximum Delta Cache performance.
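As a sketch of what tuning can look like, the helper below builds Spark config entries that cap how much local disk the Delta (disk) cache may use per node. To my knowledge `spark.databricks.io.cache.maxDiskUsage` is the relevant Databricks setting and is meant to be supplied in the cluster's Spark config when the cluster is created; the helper name and the 50% default are assumptions for illustration only.

```python
def disk_cache_spark_conf(local_ssd_gb, cache_fraction=0.5):
    """Build Spark config entries that cap Delta (disk) cache usage per node.

    `cache_fraction` (share of local SSD given to the cache) is an
    illustrative assumption, not a Databricks recommendation.
    """
    if not 0 < cache_fraction <= 1:
        raise ValueError("cache_fraction must be in (0, 1]")
    # Round down to whole gigabytes, but never below 1 GB.
    max_disk_gb = max(1, int(local_ssd_gb * cache_fraction))
    return {
        "spark.databricks.io.cache.enabled": "true",
        # Disk space per node reserved for cached data.
        "spark.databricks.io.cache.maxDiskUsage": f"{max_disk_gb}g",
    }
```

For a worker with 300 GB of local SSD, `disk_cache_spark_conf(300)` yields a 150 GB cap, leaving the rest of the disk for shuffle and spill.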
Summary
- Caching improves query speed dramatically by storing frequently accessed data in memory or SSD.
- Databricks provides three caching layers: Delta Cache, Spark Memory Cache, and SQL Warehouse Cache.
- Best used for repeated queries, dashboards, ML workflows, and hot datasets.
- Avoid caching massive or frequently updated data.
- Use caching alongside OPTIMIZE, Z-ORDER, and file compaction for maximum Lakehouse performance.
Caching = Fast queries, low cost, happy analysts.
Next Topic
Photon Execution Engine: When & Why to Use