Multi-Task Job Workflows – Dependencies Across Tasks
Story Time – "One Task Fails and Everything Breaks…"
Arjun, a senior data engineer, maintains a pipeline that:
- Extracts data from APIs
- Cleans & transforms it
- Loads it to Delta Lake
- Validates quality
- Sends success notifications
Unfortunately, these steps were split across five separate jobs.
When the extraction job fails, the transform job still runs.
When transformation fails, the notification job still says "pipeline completed."
Arjun sighs:
"I need something that ties everything together… with dependencies… and intelligence."
Enter Databricks Multi-Task Job Workflows – the Lakehouse-native orchestration layer.
1. What Are Multi-Task Job Workflows?
A workflow in Databricks is a single job that contains multiple tasks with:
- Task dependencies
- Conditional logic
- Modular execution
- Shared compute clusters
- Automatic DAG orchestration
Perfect for building end-to-end ETL pipelines in a single pane.
2. Creating a Multi-Task Workflow
Arjun opens:
Workflows → Jobs → Create Job
Then clicks "Add Task" multiple times.
Example workflow:
extract → transform → load → validate → notify
Each task can be one of the following (a minimal job JSON sketch follows this list):
- Notebook
- Python script
- SQL query
- JAR
- Delta Live Tables pipeline
- dbt task
- Or a combination of these within one job
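To make the structure concrete, here is a minimal sketch of such a job in the Jobs JSON format. The job name and notebook paths are placeholders, and per-task compute is omitted for brevity (a shared job cluster is shown in section 5):
{
  "name": "etl_pipeline",
  "tasks": [
    {
      "task_key": "extract",
      "notebook_task": {"notebook_path": "/Repos/etl/01_extract"}
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "extract"}],
      "notebook_task": {"notebook_path": "/Repos/etl/02_transform"}
    }
  ]
}
Each entry in tasks becomes a node in the job's DAG; depends_on draws the edges between them.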
3. Defining Task Dependencies
Databricks uses a clean dependency UI:
[extract] → [transform] → [load]
                            ↓
                        [validate]
                            ↓
                         [notify]
A task only runs after its upstream tasks succeed.
Example:
{
  "task_key": "transform",
  "depends_on": [{"task_key": "extract"}]
}
Dependencies can form (see the fan-out/fan-in sketch after this list):
- Linear DAGs
- Fan-in DAGs
- Fan-out DAGs
- Branching pipelines
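As an illustrative fragment (task keys are placeholders, and the notebook details are omitted for brevity), a fan-out/fan-in shape looks like this: two transforms run in parallel after extract, and load waits for both to succeed:
{"task_key": "transform_eu", "depends_on": [{"task_key": "extract"}]},
{"task_key": "transform_us", "depends_on": [{"task_key": "extract"}]},
{
  "task_key": "load",
  "depends_on": [
    {"task_key": "transform_eu"},
    {"task_key": "transform_us"}
  ]
}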
4. Example: Notebook-Based Multi-Task Pipeline
Step 1 – Extract
# Read raw API logs and stage them as a Delta table
df_raw = spark.read.format("json").load("/mnt/raw/api_logs/")
df_raw.write.format("delta").mode("overwrite").save("/mnt/stage/logs_raw")
Step 2 – Transform
# Drop records without an event name and write the cleaned table
df = spark.read.format("delta").load("/mnt/stage/logs_raw")
df_clean = df.filter("event IS NOT NULL")
df_clean.write.format("delta").mode("overwrite").save("/mnt/clean/logs_clean")
Step 3 – Validation
# Fail this task (and stop downstream tasks) if any null events slipped through
from pyspark.sql import functions as F

df = spark.read.format("delta").load("/mnt/clean/logs_clean")
if df.filter(F.col("event").isNull()).count() > 0:
    raise Exception("Data validation failed")
Step 4 – Notify
# Exiting with a value marks the task as succeeded and records the message in the run output
dbutils.notebook.exit("Success: ETL Pipeline Completed")
5. Shared Job Cluster
Arjun selects:
- A job cluster (cheaper than an all-purpose cluster)
- Shared across all tasks in the job
- Terminated automatically when the run finishes
This avoids a separate cluster spin-up for every task.
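A minimal sketch of the shared-cluster configuration in the job JSON; the cluster key, Spark version, node type, and worker count below are placeholder values, not recommendations. Every task that references the same job_cluster_key runs on this one cluster:
"job_clusters": [
  {
    "job_cluster_key": "shared_etl_cluster",
    "new_cluster": {
      "spark_version": "13.3.x-scala2.12",
      "node_type_id": "i3.xlarge",
      "num_workers": 2
    }
  }
],
"tasks": [
  {
    "task_key": "extract",
    "job_cluster_key": "shared_etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/etl/01_extract"}
  }
]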
6. Retry Logic Per Task
Instead of retrying the whole job, Arjun can retry only the failing task.
Task-level retry settings:
- Retry attempts
- Backoff
- Timeout
- Cluster retry vs task retry
This makes workflows extremely resilient.
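Retries are configured per task. A minimal sketch of the relevant fields on a single task (the numbers are illustrative only):
{
  "task_key": "transform",
  "depends_on": [{"task_key": "extract"}],
  "notebook_task": {"notebook_path": "/Repos/etl/02_transform"},
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true,
  "timeout_seconds": 3600
}
Here the task is retried up to twice, waits at least a minute between attempts, and fails an attempt that runs longer than an hour.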
7. Error Handling Across Tasks
Databricks supports:
- Stop the entire pipeline on failure
- Run downstream tasks only if upstream succeeds
- Add "failure notification" as a separate branch
- On-failure triggers for Slack/email
Example branch:
validate_failed → slack_alert
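The simplest on-failure trigger is an email notification, which can be set on the whole job or on an individual task; the address below is a placeholder. Slack is typically wired up the same way through a webhook notification destination:
"email_notifications": {
  "on_failure": ["data-team@example.com"]
}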
8. Branching Logic Inside Workflows
Arjun builds branching logic:
high_volume → process_big_data
else → process_small_data
Branches allow conditional processing depending on:
- Input size
- Date
- Event type
- External parameters
This is Databricks' version of lightweight if-else orchestration.
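One way to wire this up in the job JSON is an if/else condition task whose outcome gates the two branches. The sketch below is an assumption about how Arjun might configure it; the parameter name row_count, the threshold, and the notebook paths are placeholders:
{
  "task_key": "check_volume",
  "condition_task": {
    "op": "GREATER_THAN",
    "left": "{{job.parameters.row_count}}",
    "right": "1000000"
  }
},
{
  "task_key": "process_big_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "true"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_big"}
},
{
  "task_key": "process_small_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "false"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_small"}
}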
9. Real-World Enterprise Use Cases
Finance
Multi-step risk scoring → aggregation → validation → reporting.
Retail
Daily SKU extraction → price rules → promotions → BI delivery.
Healthcare
PHI ingestion → anonymization → validation → controlled-zone storage.
Logistics
GPS ingest → cleaning → route clustering → ML scoring → dashboard refresh.
Manufacturing
Sensor data → dedupe → QC → anomaly detection.
Best Practices
- Keep tasks modular (single purpose per task)
- Use job clusters for cost control
- Add alerts and Slack notifications
- Add a validation task before loading curated data
- Use task parameters instead of hardcoding values (see the sketch after this list)
- Run jobs as a service principal for security
- Store job configs in Repos for version control
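As a small sketch of the parameterization point: instead of hardcoding a date or path inside the notebook, pass it through base_parameters (the name run_date and its value are placeholders) and read it in the notebook with dbutils.widgets.get("run_date"):
"notebook_task": {
  "notebook_path": "/Repos/etl/01_extract",
  "base_parameters": {"run_date": "2024-01-01"}
}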
Real-World Ending – "The Pipeline Is Finally Smart"
Now Arjun's ETL:
- understands dependencies
- retries failures automatically
- alerts the team instantly
- uses clean DAG orchestration
- cuts compute cost with shared job clusters
His manager says:
"This is the pipeline architecture we should have done years ago."
And everyone finally stops blaming Arjun's pipelines.
Summary
Databricks Multi-Task Job Workflows provide:
- DAG orchestration
- Multiple task types
- Dependency management
- Shared job clusters
- Conditional branching
- Retry and alerting
- Production-grade pipeline automation
A core building block for enterprise-scale data workflows.
Next Topic
Databricks Workflows (New) – Production Orchestration