Multi-Task Job Workflows – Dependencies Across Tasks
Story Time – "One Task Fails and Everything Breaks…"
Arjun, a senior data engineer, maintains a pipeline that:
- Extracts data from APIs
- Cleans & transforms it
- Loads it to Delta Lake
- Validates quality
- Sends success notifications
Unfortunately, these steps were split across five separate jobs.
When the extraction job fails, the transform job still runs.
When transformation fails, the notification job still says "pipeline completed."
Arjun sighs:
"I need something that ties everything together… with dependencies… and intelligence."
Enter Databricks Multi-Task Job Workflows – the Lakehouse-native orchestration layer.
1. What Are Multi-Task Job Workflows?
A workflow in Databricks is a single job that contains multiple tasks with:
- Task dependencies
- Conditional logic
- Modular execution
- Shared compute clusters
- Automatic DAG orchestration
Perfect for building end-to-end ETL pipelines in a single pane.
2. Creating a Multi-Task Workflow
Arjun opens:
Workflows → Jobs → Create Job
Then clicks "Add Task" multiple times.
Example workflow:
extract → transform → load → validate → notify
Each task can be one of the following (a minimal job JSON sketch follows this list):
- Notebook
- Python script
- SQL query
- JAR
- Delta Live Tables pipeline
- dbt task
- Or a combination of these within one job
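To make the structure concrete, here is a minimal sketch of such a job in the Jobs JSON format. The job name and notebook paths are placeholders, and per-task compute is omitted for brevity (a shared job cluster is shown in section 5):
{
  "name": "etl_pipeline",
  "tasks": [
    {
      "task_key": "extract",
      "notebook_task": {"notebook_path": "/Repos/etl/01_extract"}
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "extract"}],
      "notebook_task": {"notebook_path": "/Repos/etl/02_transform"}
    }
  ]
}
Each entry in tasks becomes a node in the job's DAG; depends_on draws the edges between them.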
3. Defining Task Dependencies
Databricks uses a clean dependency UI:
[extract] → [transform] → [load]
                            ↓
                        [validate]
                            ↓
                         [notify]
A task only runs after its upstream tasks succeed.
Example:
{
  "task_key": "transform",
  "depends_on": [{"task_key": "extract"}]
}
Dependencies can form (see the fan-out/fan-in sketch after this list):
- Linear DAGs
- Fan-in DAGs
- Fan-out DAGs
- Branching pipelines
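As an illustrative fragment (task keys are placeholders, and the notebook details are omitted for brevity), a fan-out/fan-in shape looks like this: two transforms run in parallel after extract, and load waits for both to succeed:
{"task_key": "transform_eu", "depends_on": [{"task_key": "extract"}]},
{"task_key": "transform_us", "depends_on": [{"task_key": "extract"}]},
{
  "task_key": "load",
  "depends_on": [
    {"task_key": "transform_eu"},
    {"task_key": "transform_us"}
  ]
}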
4. Example: Notebook-Based Multi-Task Pipeline
Step 1 – Extract
# Read raw API logs and stage them as a Delta table
df_raw = spark.read.format("json").load("/mnt/raw/api_logs/")
df_raw.write.format("delta").mode("overwrite").save("/mnt/stage/logs_raw")
Step 2 – Transform
# Drop records without an event name and write the cleaned table
df = spark.read.format("delta").load("/mnt/stage/logs_raw")
df_clean = df.filter("event IS NOT NULL")
df_clean.write.format("delta").mode("overwrite").save("/mnt/clean/logs_clean")
Step 3 – Validation
# Fail this task (and stop downstream tasks) if any null events slipped through
from pyspark.sql import functions as F

df = spark.read.format("delta").load("/mnt/clean/logs_clean")
if df.filter(F.col("event").isNull()).count() > 0:
    raise Exception("Data validation failed")
Step 4 – Notify
# Exiting with a value marks the task as succeeded and records the message in the run output
dbutils.notebook.exit("Success: ETL Pipeline Completed")
5. Shared Job Cluster
Arjun selects:
- A job cluster (cheaper than an all-purpose cluster)
- Shared across all tasks in the job
- Terminated automatically when the run finishes
This avoids a separate cluster spin-up for every task.
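A minimal sketch of the shared-cluster configuration in the job JSON; the cluster key, Spark version, node type, and worker count below are placeholder values, not recommendations. Every task that references the same job_cluster_key runs on this one cluster:
"job_clusters": [
  {
    "job_cluster_key": "shared_etl_cluster",
    "new_cluster": {
      "spark_version": "13.3.x-scala2.12",
      "node_type_id": "i3.xlarge",
      "num_workers": 2
    }
  }
],
"tasks": [
  {
    "task_key": "extract",
    "job_cluster_key": "shared_etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/etl/01_extract"}
  }
]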
6. Retry Logic Per Task
Instead of retrying the whole job, Arjun can retry only the failing task.
Task-level retry settings:
- Retry attempts
- Backoff
- Timeout
- Cluster retry vs task retry
This makes workflows extremely resilient.
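Retries are configured per task. A minimal sketch of the relevant fields on a single task (the numbers are illustrative only):
{
  "task_key": "transform",
  "depends_on": [{"task_key": "extract"}],
  "notebook_task": {"notebook_path": "/Repos/etl/02_transform"},
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true,
  "timeout_seconds": 3600
}
Here the task is retried up to twice, waits at least a minute between attempts, and fails an attempt that runs longer than an hour.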
7. Error Handling Across Tasks
Databricks supports:
- Stop the entire pipeline on failure
- Run downstream tasks only if upstream succeeds
- Add "failure notification" as a separate branch
- On-failure triggers for Slack/email
Example branch:
validate_failed → slack_alert
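The simplest on-failure trigger is an email notification, which can be set on the whole job or on an individual task; the address below is a placeholder. Slack is typically wired up the same way through a webhook notification destination:
"email_notifications": {
  "on_failure": ["data-team@example.com"]
}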
8. Branching Logic Inside Workflows
Arjun builds branching logic:
high_volume → process_big_data
else → process_small_data
Branches allow conditional processing depending on:
- Input size
- Date
- Event type
- External parameters
This is Databricks' version of lightweight if-else orchestration.
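One way to wire this up in the job JSON is an if/else condition task whose outcome gates the two branches. The sketch below is an assumption about how Arjun might configure it; the parameter name row_count, the threshold, and the notebook paths are placeholders:
{
  "task_key": "check_volume",
  "condition_task": {
    "op": "GREATER_THAN",
    "left": "{{job.parameters.row_count}}",
    "right": "1000000"
  }
},
{
  "task_key": "process_big_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "true"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_big"}
},
{
  "task_key": "process_small_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "false"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_small"}
}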
9. Real-World Enterprise Use Cases
Finance
Multi-step risk scoring → aggregation → validation → reporting.
Retail
Daily SKU extraction → price rules → promotions → BI delivery.
Healthcare
PHI ingestion → anonymization → validation → controlled-zone storage.
Logistics
GPS ingest → cleaning → route clustering → ML scoring → dashboard refresh.
Manufacturing
Sensor data → dedupe → QC → anomaly detection.
Best Practices
- Keep tasks modular (single purpose per task)
- Use job clusters for cost control
- Add alerts and Slack notifications
- Add a validation task before loading curated data
- Use task parameters instead of hardcoding values (see the sketch after this list)
- Run jobs as a service principal for security
- Store job configs in Repos for version control
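As a small sketch of the parameterization point: instead of hardcoding a date or path inside the notebook, pass it through base_parameters (the name run_date and its value are placeholders) and read it in the notebook with dbutils.widgets.get("run_date"):
"notebook_task": {
  "notebook_path": "/Repos/etl/01_extract",
  "base_parameters": {"run_date": "2024-01-01"}
}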
Real-World Ending – "The Pipeline Is Finally Smart"
Now Arjun's ETL:
- understands dependencies
- retries failures automatically
- alerts the team instantly
- uses clean DAG orchestration
- cuts compute cost with shared job clusters
His manager says:
"This is the pipeline architecture we should have done years ago."
And everyone finally stops blaming Arjun's pipelines.
Summary
Databricks Multi-Task Job Workflows provide:
- DAG orchestration
- Multiple task types
- Dependency management
- Shared job clusters
- Conditional branching
- Retry and alerting
- Production-grade pipeline automation
A core building block for enterprise-scale data workflows.
Next Topic
Databricks Workflows (New) – Production Orchestration