Databricks Jobs – Scheduling Batch Processing
Story Time – "The ETL That Never Slept…"
Nidhi, a data engineer at a logistics company, receives complaints from every direction.
Analytics team:
"Why is our daily ETL running manually?"
Finance:
"Why didn't yesterday's batch complete?"
Managers:
"Can't Databricks run jobs automatically?"
Nidhi knows the truth:
Someone runs the ETL notebook manually every morning.
She smiles and opens Databricks.
"Time to put this into a Job and let it run like clockwork."
That's where Databricks Jobs come in: reliable, automated batch processing in the Lakehouse.
1. What Are Databricks Jobs?
Databricks Jobs allow you to schedule and automate:
- Notebooks
- Python scripts
- Spark jobs
- JAR files
- Delta Live Tables
- ML pipelines
- SQL tasks
Jobs ensure processing happens on schedule, with retries, alerts, logging, and monitoring – without human involvement.
2. Creating Your First Databricks Job
Nidhi starts with a simple daily ETL.
In the Databricks Workspace:
Workflows → Jobs → Create Job
She configures:
- Task: notebook path (e.g., /ETL/clean_orders)
- Cluster: new job cluster (cost-optimized)
- Schedule: daily at 1:00 AM
- Retries: 3 attempts
- Alert: email on failure
Within minutes, her ETL is automated.
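The same configuration can also be expressed in code. Below is a minimal sketch using the Databricks SDK for Python (`databricks-sdk`); the job name, runtime version, node type, and email address are illustrative placeholders, and the Jobs API expects a Quartz-style cron expression (with a leading seconds field) rather than classic five-field cron.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()  # picks up credentials from the environment or .databrickscfg

created = w.jobs.create(
    name="daily-clean-orders",                          # illustrative job name
    tasks=[
        jobs.Task(
            task_key="clean_orders",
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/clean_orders"),
            new_cluster=compute.ClusterSpec(            # job cluster, created per run
                spark_version="13.3.x-scala2.12",       # assumed LTS runtime
                node_type_id="i3.xlarge",               # assumed node type
                num_workers=2,
            ),
            max_retries=3,                              # 3 retry attempts on failure
            min_retry_interval_millis=10 * 60 * 1000,   # 10-minute delay between attempts
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 1 * * ?",           # every day at 1:00 AM (Quartz syntax)
        timezone_id="UTC",
    ),
    email_notifications=jobs.JobEmailNotifications(
        on_failure=["nidhi@example.com"]                # illustrative address
    ),
)
print(f"Created job {created.job_id}")
```

The UI and the SDK configure the same underlying job, so a team can start in the UI and later move the definition into version control.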
3. Example: Notebook-Based ETL Job
The ETL notebook:
```python
from pyspark.sql.functions import current_timestamp

# Read raw orders, drop rows without a status, and stamp the cleaning time
df = spark.read.format("delta").load("/mnt/raw/orders")
clean_df = (
    df.filter("order_status IS NOT NULL")
      .withColumn("cleaned_ts", current_timestamp())
)
clean_df.write.format("delta").mode("overwrite").save("/mnt/clean/orders")
```
The Databricks Job runs this notebook nightly.
4. Scheduling Jobs
Databricks offers flexible scheduling:
Cron Schedule
`0 1 * * *` (every day at 1:00 AM)
UI-based Scheduling
- Daily
- Weekly
- Hourly
- Custom
Trigger on File Arrival (Auto Loader + Jobs)
Perfect for streaming-batch hybrid architectures.
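A common way to pair Jobs with Auto Loader is a notebook that reads with `cloudFiles` and writes with an `availableNow` trigger, so each triggered run processes only the files that have arrived since the previous run and then stops. A minimal sketch, with illustrative paths:

```python
# Auto Loader incrementally discovers new files under the source path.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                          # assumed source format
    .option("cloudFiles.schemaLocation", "/mnt/_schemas/orders")   # illustrative path
    .load("/mnt/landing/orders")                                   # illustrative path
)

(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/_checkpoints/orders")      # illustrative path
    .trigger(availableNow=True)   # process everything available, then stop (batch-style)
    .start("/mnt/raw/orders")
)
```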
5. Job Clusters vs. All-Purpose Clusters
Nidhi must choose between:
Job Cluster (recommended)
- Auto-terminated after job finishes
- Cheaper
- Clean environment per run
- Best for production
All-Purpose Cluster
- Shared
- Not ideal for scheduled jobs
- More expensive
She selects job clusters to cut compute waste.
6. Multi-Step ETL With Dependent Tasks
A single Databricks Job can contain multiple tasks, such as:
- Extract
- Transform
- Validate
- Load into Delta
- Notify Slack
Example DAG:
extract → transform → validate → load → notify
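Expressed with the Databricks SDK for Python, this DAG is just a list of tasks whose `depends_on` fields point at upstream task keys. A sketch with hypothetical task keys and notebook paths (compute settings omitted for brevity):

```python
from databricks.sdk.service import jobs

# Each task waits for the tasks listed in depends_on before it starts.
tasks = [
    jobs.Task(task_key="extract",
              notebook_task=jobs.NotebookTask(notebook_path="/ETL/extract")),
    jobs.Task(task_key="transform",
              notebook_task=jobs.NotebookTask(notebook_path="/ETL/transform"),
              depends_on=[jobs.TaskDependency(task_key="extract")]),
    jobs.Task(task_key="validate",
              notebook_task=jobs.NotebookTask(notebook_path="/ETL/validate"),
              depends_on=[jobs.TaskDependency(task_key="transform")]),
    jobs.Task(task_key="load",
              notebook_task=jobs.NotebookTask(notebook_path="/ETL/load"),
              depends_on=[jobs.TaskDependency(task_key="validate")]),
    jobs.Task(task_key="notify",
              notebook_task=jobs.NotebookTask(notebook_path="/ETL/notify_slack"),
              depends_on=[jobs.TaskDependency(task_key="load")]),
]
```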
7. Retry Policies
Batch jobs fail sometimes.
Nidhi configures:
- 3 retries
- 10-minute delay
- Exponential backoff
Databricks handles failures automatically.
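These policies map to a few task-level fields in the Jobs API. A short sketch mirroring Nidhi's settings (same hypothetical notebook path as earlier; `retry_on_timeout` is optional):

```python
from databricks.sdk.service import jobs

# Retry-related task settings; values are illustrative.
task = jobs.Task(
    task_key="clean_orders",
    notebook_task=jobs.NotebookTask(notebook_path="/ETL/clean_orders"),
    max_retries=3,                              # retry up to 3 times
    min_retry_interval_millis=10 * 60 * 1000,   # wait at least 10 minutes between attempts
    retry_on_timeout=True,                      # also retry when a run times out
)
```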
8. Logging & Monitoring
Databricks Jobs provide:
- Run page logs
- Driver and executor logs
- Spark UI
- Execution graphs
- Cluster metrics
She can debug any failure easily.
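Run history is also available programmatically, which helps when storing run outcomes in Delta tables or building custom dashboards. A small sketch with the Databricks SDK for Python; the job ID is a placeholder:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

job_id = 123  # hypothetical job ID; use the ID returned when the job was created

# Print recent runs and their states for a quick health check.
for run in w.jobs.list_runs(job_id=job_id):
    print(run.run_id, run.state.life_cycle_state, run.state.result_state)
```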
9. Real-World Enterprise Use Cases
E-commerce
Nightly ETL loading sales, product, and customer data.
Finance
Batch jobs calculating daily P&L and risk metrics.
Manufacturing
Daily IoT ingestion and device telemetry cleaning.
Logistics
Route optimization pipelines.
SaaS Platforms
Customer-level usage aggregation.
Best Practices
- Use job clusters for cost efficiency
- Keep each task modular
- Add alerts for failures
- Store logs in DLT + Delta tables
- Use retries for robustness
- Use version-controlled notebooks/scripts
- Document every pipeline task
Real-World Ending – "The Batch Runs Automatically Now"
With Databricks Jobs:
- No more manual ETL runs
- No more unnoticed failures
- Costs reduced by 35% with job clusters
- Alerts keep teams informed
- Nidhi sleeps peacefully
Her manager says:
"This is production-grade analytics. Our pipelines finally look professional."
Summary
Databricks Jobs enable:
- Automated scheduling
- Reliable batch processing
- Multi-task workflows
- Alerts, retries, logging
- Cost-effective orchestration
A fundamental building block for production data pipelines on Databricks.
Next Topic
Multi-Task Job Workflows – Dependencies Across Tasks