
Must-Know Databricks Interview Questions & Answers (Real Company Scenarios) – Part 4

36. How do you tune Spark configurations in Databricks for performance?

Story-Driven

Tuning Spark is like adjusting the speed and fuel of a race car. Too slow, and you waste time; too fast without control, and you crash. Proper tuning makes your data jobs fly efficiently.

Professional / Hands-On

  • Common Spark configuration settings:
    • spark.executor.memory → Adjust executor memory
    • spark.executor.cores → Number of cores per executor
    • spark.sql.shuffle.partitions → Number of partitions used for shuffles (tune to data volume)
  • Techniques:
    • Monitor cluster metrics.
    • Use dynamic allocation.
    • Tune parallelism based on data size.
# Lower or raise shuffle partitions to match data volume
spark.conf.set("spark.sql.shuffle.partitions", "200")
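
As a further, hedged sketch, Adaptive Query Execution (AQE) can resize shuffle partitions at runtime; note that executor memory and cores are fixed in the cluster configuration and cannot be changed with spark.conf.set:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # let AQE tune shuffles at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions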

37. Explain Z-ordering in Delta Lake

Story-Driven

Z-ordering is like arranging books in a library so related books are close together. This makes searches lightning-fast without scanning the whole shelf.

Professional / Hands-On

  • Z-ordering: Multi-dimensional clustering of data in Delta tables.
  • Improves data skipping, so queries that filter on the Z-ordered columns read fewer files.
OPTIMIZE sales_delta
ZORDER BY (customer_id, region)
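
The same operation is available from Python through the delta-spark package; a minimal sketch, assuming the table is registered as sales_delta:

from delta.tables import DeltaTable

# Compact files and co-locate rows by the given columns
DeltaTable.forName(spark, "sales_delta").optimize().executeZOrderBy("customer_id", "region")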

38. How does time travel work in Delta Lake?

Story-Driven

Time travel in Delta Lake is like a magical diary—you can go back and read exactly what your data looked like last week, last month, or even yesterday.

Professional / Hands-On

  • Delta Lake keeps versioned data using a transaction log.
  • Access historical data via:
SELECT * FROM sales_delta VERSION AS OF 3
SELECT * FROM sales_delta TIMESTAMP AS OF '2025-01-01'
  • Useful for audit, recovery, and debugging.
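
In PySpark, the same snapshots can be read with reader options; a sketch, assuming the table lives at the hypothetical path /delta/sales_delta:

# Read an older snapshot by version or by timestamp
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/delta/sales_delta")
df_jan = spark.read.format("delta").option("timestampAsOf", "2025-01-01").load("/delta/sales_delta")
# Roll the table back to an earlier version if needed
spark.sql("RESTORE TABLE sales_delta TO VERSION AS OF 3")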

39. How do you implement streaming pipelines in Databricks?

Story-Driven

A streaming pipeline is like a water pipeline delivering fresh water continuously. New data keeps flowing, and your system processes it automatically.

Professional / Hands-On

  • Steps to implement:

    1. Read data using readStream.
    2. Transform using Spark operations.
    3. Write output using writeStream with checkpointing.
schema = "value INT"  # streaming JSON sources require an explicit schema
df = spark.readStream.format("json").schema(schema).load("/stream/input")
df_transformed = df.filter(df.value > 100)
df_transformed.writeStream.format("delta") \
    .option("checkpointLocation", "/checkpoint") \
    .start("/delta/output")

40. What are Delta Lake optimizations (OPTIMIZE, VACUUM)?

Story-Driven

  • OPTIMIZE: Organizes your data for faster queries, like tidying a messy bookshelf.
  • VACUUM: Removes outdated or unnecessary files, like clearing trash.

Professional / Hands-On

  • OPTIMIZE → Compacts many small files into fewer large ones; an optional ZORDER BY clause co-locates related rows.
  • VACUUM → Deletes data files no longer referenced by the table and older than the retention threshold (default: 7 days).
OPTIMIZE sales_delta ZORDER BY (customer_id)  -- compact files and cluster rows by customer_id
VACUUM sales_delta RETAIN 168 HOURS           -- 168 hours = the 7-day default retention

41. Explain Databricks REST API usage

Story-Driven

The REST API is like a remote control for Databricks—you can start clusters, run jobs, and access notebooks programmatically without opening the UI.

Professional / Hands-On

  • Use cases:

    • Automate cluster creation and job scheduling.
    • Fetch job status or logs.
  • Example using Python requests:

import requests

TOKEN = "<personal-access-token>"  # e.g., from a secret scope or environment variable
response = requests.get(
    "https://<databricks-instance>/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(response.json())

42. How do you implement role-based access control (RBAC) in Databricks?

Story-Driven

RBAC is like giving keys to rooms only to the people who need them. Developers get dev keys, analysts get read-only keys, and admins get full access.

Professional / Hands-On

  • RBAC in Databricks involves:

    • Workspace access control (notebooks, jobs)
    • Cluster access control
    • Table & data access control with Unity Catalog
  • Example: Assign the CAN_MANAGE permission on a cluster to a group, as sketched below.
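
A minimal sketch of that example using the Permissions REST API (the cluster ID and group name are placeholders):

import requests

response = requests.patch(
    "https://<databricks-instance>/api/2.0/permissions/clusters/<cluster-id>",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"access_control_list": [
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE"}
    ]},
)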


43. How is auto-scaling managed in Databricks clusters?

Story-Driven

Auto-scaling is like hiring extra chefs when orders pile up and sending them home when it’s quiet. Your kitchen stays efficient without manual intervention.

Professional / Hands-On

  • Auto-scaling clusters:

    • Minimum and maximum workers defined.
    • Databricks automatically scales based on workload.
  • Configurable at cluster creation:

Min Workers: 2, Max Workers: 10
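
The same range can be passed when creating a cluster through the REST API; a hedged sketch (cluster name, runtime version, and node type are illustrative):

import requests

payload = {
    "cluster_name": "autoscale-demo",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 10},
}
response = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)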

44. Explain checkpointing and write-ahead logs (WAL) in streaming

Story-Driven

Checkpointing and WAL are like saving your progress and keeping a backup diary of every move. If the stream fails, you can pick up exactly where you left off.

Professional / Hands-On

  • Checkpointing: Persists streaming progress (source offsets and state) to durable storage.
  • Write-ahead logs (WAL): Offsets are durably recorded before a micro-batch is processed, so failed batches can be replayed.
  • Together with a replayable source and an idempotent sink (e.g., Delta), they provide fault tolerance and exactly-once semantics.
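
A minimal sketch of recovery in practice, using the built-in rate test source (paths are hypothetical); restarting the query with the same checkpointLocation resumes from the last committed offsets:

query = (
    spark.readStream.format("rate").load()           # "rate" generates test rows continuously
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/rate_demo")  # offsets and state are persisted here
    .start("/delta/rate_out")
)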

45. How do you debug failed jobs in Databricks?

Story-Driven

Debugging failed jobs is like detective work—you follow clues (logs), check the crime scene (stages), and find what went wrong.

Professional / Hands-On

  • Steps:

    1. Check cluster logs (driver & worker).
    2. Review Spark UI for failed stages or tasks.
    3. Look at notebook outputs or job logs.
    4. Retry with a smaller dataset or isolated transformations.
    5. Use the Databricks REST API to fetch detailed logs when automated debugging is needed (see the sketch below).
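
For step 5, a hedged sketch using the Jobs API to pull a run's output (the run ID is a placeholder):

import requests

response = requests.get(
    "https://<databricks-instance>/api/2.1/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": 12345},
)
print(response.json())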