Must-Know Databricks Interview Questions & Answers (Real Company Scenarios) – Part 4
36. How do you tune Spark configurations in Databricks for performance?
Story-Driven
Tuning Spark is like adjusting the speed and fuel of a race car. Too slow, and you waste time; too fast without control, and you crash. Proper tuning makes your data jobs fly efficiently.
Professional / Hands-On
- Common Spark configuration settings:
- spark.executor.memory → adjust memory per executor
- spark.executor.cores → set the number of cores per executor
- spark.sql.shuffle.partitions → control the number of shuffle partitions
- Techniques:
- Monitor cluster metrics.
- Use dynamic allocation.
- Tune parallelism based on data size.
spark.conf.set("spark.sql.shuffle.partitions", "200")
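Building on the snippet above, a minimal sketch of other session-level settings that can be changed at runtime (the values are illustrative, not recommendations):
spark.conf.set("spark.sql.shuffle.partitions", "64")  # fewer shuffle partitions for smaller data (illustrative value)
spark.conf.set("spark.sql.adaptive.enabled", "true")  # let Adaptive Query Execution tune shuffle partitions and joins
# spark.executor.memory and spark.executor.cores are normally set in the cluster's Spark config, not at runtime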
37. Explain Z-ordering in Delta Lake
Story-Driven
Z-ordering is like arranging books in a library so related books are close together. This makes searches lightning-fast without scanning the whole shelf.
Professional / Hands-On
- Z-ordering: Multi-dimensional clustering of data in Delta tables.
- Improves query performance on filtering columns.
OPTIMIZE sales_delta
ZORDER BY (customer_id, region)
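As a hedged illustration, the queries that benefit are those filtering on the Z-ordered columns, since Delta can skip unrelated files (the filter values below are made up):
result = spark.sql("""
    SELECT *
    FROM sales_delta
    WHERE customer_id = 'C-1001'  -- illustrative value
      AND region = 'EMEA'         -- illustrative value
""")
result.show()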
38. How does time travel work in Delta Lake?
Story-Driven
Time travel in Delta Lake is like a magical diary—you can go back and read exactly what your data looked like last week, last month, or even yesterday.
Professional / Hands-On
- Delta Lake keeps versioned data using a transaction log.
- Access historical data via:
SELECT * FROM sales_delta VERSION AS OF 3
SELECT * FROM sales_delta TIMESTAMP AS OF '2025-01-01'
- Useful for audit, recovery, and debugging.
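The same snapshots can be read from PySpark; a minimal sketch, assuming the table lives at an illustrative path /delta/sales:
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/delta/sales")
df_jan = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/delta/sales")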
39. How do you implement streaming pipelines in Databricks?
Story-Driven
A streaming pipeline is like a water pipeline delivering fresh water continuously. New data keeps flowing, and your system processes it automatically.
Professional / Hands-On
- Steps to implement:
- Read data using readStream.
- Transform using Spark operations.
- Write output using writeStream with checkpointing.
from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([StructField("value", LongType())])  # streaming JSON sources need an explicit schema
df = spark.readStream.format("json").schema(schema).load("/stream/input")
df_transformed = df.filter(df.value > 100)
df_transformed.writeStream.format("delta").option("checkpointLocation", "/checkpoint").start("/delta/output")
40. What are Delta Lake optimizations (OPTIMIZE, VACUUM)?
Story-Driven
- OPTIMIZE: Organizes your data for faster queries, like tidying a messy bookshelf.
- VACUUM: Removes outdated or unnecessary files, like clearing trash.
Professional / Hands-On
- OPTIMIZE → Compacts small files and can co-locate data with ZORDER.
- VACUUM → Deletes old files beyond the retention threshold (default 7 days).
OPTIMIZE sales_delta ZORDER BY (customer_id)
VACUUM sales_delta RETAIN 168 HOURS
41. Explain Databricks REST API usage
Story-Driven
The REST API is like a remote control for Databricks—you can start clusters, run jobs, and access notebooks programmatically without opening the UI.
Professional / Hands-On
- Use cases:
- Automate cluster creation and job scheduling.
- Fetch job status or logs.
- Example using Python requests:
import requests

response = requests.get(
    "https://<databricks-instance>/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"}
)
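For the job-scheduling use case, a minimal sketch that triggers an existing job through the Jobs API (the job_id is a placeholder):
import requests

response = requests.post(
    "https://<databricks-instance>/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123}  # placeholder job_id
)
print(response.json())  # contains the run_id of the triggered run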
42. How do you implement role-based access control (RBAC) in Databricks?
Story-Driven
RBAC is like giving keys to rooms only to the people who need them. Developers get dev keys, analysts get read-only keys, and admins get full access.
Professional / Hands-On
- RBAC in Databricks involves:
- Workspace access control (notebooks, jobs)
- Cluster access control
- Table & data access control with Unity Catalog
- Example: Assign the CAN_MANAGE permission to a group on a cluster.
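A minimal sketch of the Unity Catalog side, using SQL grants from a notebook (the catalog, schema, table, and group names are illustrative):
spark.sql("GRANT SELECT ON TABLE main.sales.sales_delta TO `data-analysts`")    # read-only access for analysts
spark.sql("REVOKE SELECT ON TABLE main.sales.sales_delta FROM `data-analysts`")  # remove it when no longer needed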
43. How is auto-scaling managed in Databricks clusters?
Story-Driven
Auto-scaling is like hiring extra chefs when orders pile up and sending them home when it’s quiet. Your kitchen stays efficient without manual intervention.
Professional / Hands-On
- Auto-scaling clusters:
- Minimum and maximum workers defined.
- Databricks automatically scales based on workload.
- Configurable at cluster creation:
Min Workers: 2, Max Workers: 10
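A hedged sketch of the corresponding Clusters API payload (the Spark version and node type are placeholders that depend on your workspace and cloud):
import requests

payload = {
    "cluster_name": "autoscaling-demo",   # illustrative name
    "spark_version": "<spark-version>",   # placeholder
    "node_type_id": "<node-type>",        # placeholder
    "autoscale": {"min_workers": 2, "max_workers": 10}
}
response = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload
)
print(response.json())  # contains the new cluster_id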
44. Explain checkpointing and write-ahead logs (WAL) in streaming
Story-Driven
Checkpointing and WAL are like saving your progress and keeping a backup diary of every move. If the stream fails, you can pick up exactly where you left off.
Professional / Hands-On
- Checkpointing: Stores streaming progress (source offsets) and state in a checkpoint directory.
- Write-ahead log (WAL): Offsets are durably recorded before each micro-batch is processed.
- Used together to guarantee fault tolerance and, with replayable sources and idempotent sinks, end-to-end exactly-once semantics.
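A minimal sketch, using the built-in rate source as a stand-in for real input; stopping and restarting the query with the same checkpointLocation makes it resume from the stored offsets (paths are illustrative):
query = (spark.readStream.format("rate").load()  # test source that emits rows continuously
         .writeStream.format("delta")
         .option("checkpointLocation", "/checkpoints/rate_demo")  # offsets and state live here
         .start("/delta/rate_demo"))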
45. How do you debug failed jobs in Databricks?
Story-Driven
Debugging failed jobs is like detective work—you follow clues (logs), check the crime scene (stages), and find what went wrong.
Professional / Hands-On
- Steps:
- Check cluster logs (driver & worker).
- Review Spark UI for failed stages or tasks.
- Look at notebook outputs or job logs.
- Retry with smaller dataset or isolated transformations.
- Use Databricks REST API to fetch detailed logs if automated debugging is needed.
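For that last step, a minimal sketch that pulls the output or error of a specific task run via the Jobs API (the run_id is a placeholder):
import requests

response = requests.get(
    "https://<databricks-instance>/api/2.1/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": 456}  # placeholder run_id of a task run
)
print(response.json().get("error"))  # error summary if the run failed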