Databricks Jobs: Scheduling Batch Processing

🎬 Story Time: “The ETL That Never Slept…”

Nidhi, a data engineer at a logistics company, receives complaints from every direction.

Analytics team:

“Why is our daily ETL running manually?”

Finance:

“Why didn’t yesterday’s batch complete?”

Managers:

“Can’t Databricks run jobs automatically?”

Nidhi knows the truth:
Someone runs the ETL notebook manually every morning.

She smiles and opens Databricks.

“Time to put this into a Job and let it run like clockwork.”

That’s where Databricks Jobs come in: reliable, automated batch processing in the Lakehouse.


🚀 1. What Are Databricks Jobs?

Databricks Jobs allow you to schedule and automate:

  • Notebooks
  • Python scripts
  • Spark jobs
  • JAR files
  • Delta Live Tables
  • ML pipelines
  • SQL tasks

Jobs ensure processing happens on schedule, with retries, alerts, logging, and monitoring, all without human involvement.


🧱 2. Creating Your First Databricks Job

Nidhi starts with a simple daily ETL.

In the Databricks Workspace:

Workflows → Jobs → Create Job

She configures:

  • πŸ“˜ Task: notebook path (e.g., /ETL/clean_orders)
  • βš™οΈ Cluster: new job cluster (cost-optimized)
  • πŸ•’ Schedule: daily at 1:00 AM
  • πŸ” Retries: 3 attempts
  • πŸ”” Alert: email on failure

Within minutes, her ETL is automated.
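
The same configuration can also be expressed programmatically. Below is a minimal sketch using the Jobs API 2.1 and the requests library; the workspace URL, token, runtime version, node type, and email address are placeholders, not values from the story.

import requests

# Placeholders: substitute your workspace URL and a personal access token
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "daily-clean-orders",
    "tasks": [
        {
            "task_key": "clean_orders",
            "notebook_task": {"notebook_path": "/ETL/clean_orders"},
            "new_cluster": {  # job cluster: created per run, terminated afterwards
                "spark_version": "13.3.x-scala2.12",  # assumed LTS runtime
                "node_type_id": "i3.xlarge",          # assumed AWS node type
                "num_workers": 2,
            },
            "max_retries": 3,
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 1 * * ?",  # daily at 1:00 AM (Quartz syntax)
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["nidhi@example.com"]},  # placeholder address
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])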


🔧 3. Example: Notebook-Based ETL Job

The ETL notebook:

from pyspark.sql.functions import current_timestamp

# Read the raw orders from Delta
df = spark.read.format("delta").load("/mnt/raw/orders")

# Drop rows with a missing status and stamp the cleaning time
clean_df = (
    df.filter("order_status IS NOT NULL")
      .withColumn("cleaned_ts", current_timestamp())
)

# Overwrite the curated table
clean_df.write.format("delta").mode("overwrite").save("/mnt/clean/orders")

The Databricks Job runs this notebook nightly.
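
To reuse the same notebook across environments, the hard-coded paths can be replaced with job parameters. Here is a small sketch using notebook widgets; the parameter names source_path and target_path are illustrative:

# Hypothetical parameters; the defaults apply when running interactively
dbutils.widgets.text("source_path", "/mnt/raw/orders")
dbutils.widgets.text("target_path", "/mnt/clean/orders")

source_path = dbutils.widgets.get("source_path")
target_path = dbutils.widgets.get("target_path")

df = spark.read.format("delta").load(source_path)

When the notebook runs as a job task, the base_parameters field in the task settings supplies the values for these widgets.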


⏱️ 4. Scheduling Jobs

Databricks offers flexible scheduling:

🟦 Cron Schedule

0 1 * * *

Note: job schedules in Databricks use Quartz cron syntax, which adds a leading seconds field, so the equivalent expression in a job definition is 0 0 1 * * ? (daily at 1:00 AM).

🟩 UI-based Scheduling

  • Daily
  • Weekly
  • Hourly
  • Custom

🟧 Trigger on File Arrival (Auto Loader + Jobs)

Perfect for streaming-batch hybrid architectures.
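
A rough sketch of that pattern, assuming JSON files landing in a cloud storage path (all paths and the source format are assumptions). The availableNow trigger processes every pending file and then stops, which fits a scheduled batch job:

# Auto Loader: incrementally discover new files in cloud storage
stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # assumed source format
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
    .load("/mnt/landing/orders")
)

# Drain all pending files, then terminate, so the job can exit cleanly
(
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)
    .start("/mnt/raw/orders")
)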


πŸ—οΈ 5. Job Clusters vs All-Purpose Clusters​

Nidhi must choose between:

✔ Job Cluster

  • Auto-terminated after the job finishes
  • Cheaper
  • Clean environment per run
  • Best for production

✔ All-Purpose Cluster

  • Shared
  • Not ideal for scheduled jobs
  • More expensive

She selects job clusters to cut compute waste.
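
In a job definition, a job cluster is just a new_cluster block on the task. A minimal sketch; the runtime version and node type are illustrative and differ per cloud:

# Job cluster: provisioned for the run, terminated when the task finishes
new_cluster = {
    "spark_version": "13.3.x-scala2.12",  # assumed LTS runtime
    "node_type_id": "i3.xlarge",          # assumption: AWS; use the Azure/GCP equivalent
    "num_workers": 2,
}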


🔄 6. Multi-Step ETL With Dependent Tasks

A single Databricks Job can contain multiple tasks, such as:

  1. Extract
  2. Transform
  3. Validate
  4. Load into Delta
  5. Notify Slack

Example DAG:

extract → transform → validate → load → notify
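
Task ordering is declared with depends_on in the job definition. A trimmed sketch of the first two tasks; the notebook paths are hypothetical, and the remaining tasks follow the same pattern:

tasks = [
    {
        "task_key": "extract",
        "notebook_task": {"notebook_path": "/ETL/extract"},  # hypothetical path
    },
    {
        "task_key": "transform",
        "depends_on": [{"task_key": "extract"}],  # runs only after extract succeeds
        "notebook_task": {"notebook_path": "/ETL/transform"},  # hypothetical path
    },
]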

📌 7. Retry Policies

Batch jobs fail sometimes.

Nidhi configures:

  • 3 retries
  • 10-minute delay
  • Exponential backoff

Databricks handles failures automatically.
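
In the Jobs API, these settings map to per-task retry fields, sketched below. Note that the API exposes a fixed minimum interval between attempts, so true exponential backoff would have to be approximated inside the task itself:

retry_settings = {
    "max_retries": 3,                      # up to 3 automatic re-runs
    "min_retry_interval_millis": 600_000,  # wait at least 10 minutes between attempts
    "retry_on_timeout": True,              # also retry when the task times out
}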


📊 8. Logging & Monitoring

Databricks Jobs provide:

  • Run page logs
  • Driver and executor logs
  • Spark UI
  • Execution graphs
  • Cluster metrics

She can debug any failure easily.
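
Run output can also be pulled programmatically, for example to archive results or alert on errors. A sketch against the runs/get-output endpoint; the host, token, and run ID are placeholders as in the earlier sketch:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Fetch the output and error summary of a single task run
resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": 12345},  # placeholder run ID
)
resp.raise_for_status()
print(resp.json().get("error", "run succeeded"))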


📦 9. Real-World Enterprise Use Cases

⭐ E-commerce

Nightly ETL loading sales, product, and customer data.

⭐ Finance

Batch jobs calculating daily P&L and risk metrics.

⭐ Manufacturing

Daily IoT ingestion and device telemetry cleaning.

⭐ Logistics

Route optimization pipelines.

⭐ SaaS Platforms

Customer-level usage aggregation.


🧠 Best Practices

  1. Use job clusters for cost efficiency
  2. Keep each task modular
  3. Add alerts for failures
  4. Store run logs in Delta tables (and DLT event logs)
  5. Use retries for robustness
  6. Use version-controlled notebooks/scripts
  7. Document every pipeline task

🎉 Real-World Ending: “The Batch Runs Automatically Now”

With Databricks Jobs:

  • No more manual ETL runs
  • No more failures unnoticed
  • Costs reduced by 35% with job clusters
  • Alerts keep teams informed
  • Nidhi sleeps peacefully

Her manager says:

“This is production-grade analytics. Our pipelines finally look professional.”


📘 Summary

Databricks Jobs enable:

  • βœ” Automated scheduling

  • βœ” Reliable batch processing

  • βœ” Multi-task workflows

  • βœ” Alerts, retries, logging

  • βœ” Cost-effective orchestration

A fundamental building block for production data pipelines on Databricks.


👉 Next Topic

Multi-Task Job Workflows: Dependencies Across Tasks