Task Dependencies in Apache Airflow
You already know how to create tasks using operators.
But tasks alone are like actors without a script:
they need to know when to act and when to wait.
That's where task dependencies come in.
Why Task Dependencies Matter
In real data pipelines:
- You extract data before you transform
- You validate before you load
- You notify only after success
Airflow enforces this logic through dependencies.
Rule:
By default, a task will not run until all of its upstream tasks succeed.
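The rule can be sketched in a few lines of plain Python. This is a toy model, not Airflow's scheduler code; `is_ready`, `upstream`, and `states` are hypothetical names chosen for illustration.

```python
# Toy model: task states live in a plain dict (not Airflow's real state machine).
def is_ready(task_id, upstream, states):
    """Return True when every upstream task of task_id has succeeded."""
    return all(states.get(up) == "success" for up in upstream.get(task_id, []))


# Each task maps to the list of tasks it waits for.
upstream = {"transform": ["extract"], "load": ["transform", "validate"]}
states = {"extract": "success", "transform": "success", "validate": "running"}

print(is_ready("transform", upstream, states))  # True
print(is_ready("load", upstream, states))       # False: validate has not succeeded yet
```

A task with no upstream tasks is always ready in this model, which matches the intuition that the first task of a pipeline can start immediately.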
Real-World Story: Dependency as Trust
Imagine a relay race:
- Runner 2 cannot start until Runner 1 hands over the baton
- If Runner 1 fails, the race stops
This is exactly how Airflow treats dependencies.
Ways to Define Dependencies in Airflow
Airflow provides three main methods:
- set_upstream()
- set_downstream()
- Bitshift operators (>> and <<)
All do the same thing; readability is the difference.
1. Using set_downstream()
Syntax
task_a.set_downstream(task_b)
Meaning
task_b runs after task_a
Example
extract.set_downstream(transform)
extract → transform
2. Using set_upstream()
Syntax
task_b.set_upstream(task_a)
Meaning
task_b waits for task_a
Example
load.set_upstream(transform)
transform → load
3. Using Bitshift Operators (>> and <<)
This is the style the Airflow documentation recommends in most cases.
Right Shift (>>)
extract >> transform
Means:
extract runs before transform
Left Shift (<<)
load << transform
Means:
transform runs before load
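To see why the styles above are interchangeable, here is a toy model in plain Python. It is a sketch under stated assumptions, not Airflow's implementation: the `Task` class and the `edges` helper are hypothetical, and the class only records which task ids run after it.

```python
class Task:
    """Toy stand-in for an Airflow operator; it only records edges."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = set()  # ids of tasks that run after this one

    def set_downstream(self, other):
        self.downstream.add(other.task_id)

    def set_upstream(self, other):
        # "self waits for other" is the same edge as "other runs before self".
        other.set_downstream(self)

    def __rshift__(self, other):  # self >> other
        self.set_downstream(other)
        return other  # returning the right operand lets a >> b >> c chain

    def __lshift__(self, other):  # self << other
        self.set_upstream(other)
        return other


def edges(style):
    """Apply one dependency style to fresh tasks and return the resulting edges."""
    extract, transform = Task("extract"), Task("transform")
    style(extract, transform)
    return {(extract.task_id, d) for d in extract.downstream}


styles = [
    lambda e, t: e.set_downstream(t),
    lambda e, t: t.set_upstream(e),
    lambda e, t: e >> t,
    lambda e, t: t << e,
]
results = [edges(s) for s in styles]
print(results[0])                             # {('extract', 'transform')}
print(all(r == results[0] for r in results))  # True
```

All four spellings produce the same single edge, so the choice between them is purely about how the DAG file reads.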
Full DAG Example: Linear Dependencies
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="task_dependencies_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # named schedule_interval before Airflow 2.4
    catchup=False,
    tags=["dependencies", "airflow"],
) as dag:
    # Placeholder tasks: EmptyOperator does no work, so only the ordering matters.
    start = EmptyOperator(task_id="start")
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")
    end = EmptyOperator(task_id="end")

    # Linear chain: each task runs only after the one to its left succeeds.
    start >> extract >> transform >> load >> end
Input & Output Flow
Input
- DAG triggered manually or by schedule
Output
- Tasks execute in dependency order: start, extract, transform, load, end
- Each task waits for upstream success
- The DAG run completes successfully once every task has succeeded
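The execution order of a linear chain can be reproduced with Python's standard `graphlib` module. This is only an illustration of dependency ordering, not how Airflow schedules tasks internally; the `upstream` mapping is a hand-written copy of the chain from the example DAG.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it waits for (its upstream tasks).
upstream = {
    "extract": {"start"},
    "transform": {"extract"},
    "load": {"transform"},
    "end": {"load"},
}

# A linear chain has exactly one valid order.
order = list(TopologicalSorter(upstream).static_order())
print(order)  # ['start', 'extract', 'transform', 'load', 'end']
```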
Fan-Out Dependencies (One to Many)
extract >> [transform, validate]
extract must succeed before either task starts.
Fan-In Dependencies (Many to One)
[transform, validate] >> load
load waits until both tasks succeed.
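It may look surprising that a plain Python list can sit on either side of `>>`. The sketch below shows one way this can work, using Python's reflected operator `__rrshift__`. The `Task` class is a hypothetical toy that only stores upstream ids; it is not Airflow's code.

```python
class Task:
    """Toy stand-in for an Airflow operator; it only records upstream task ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # Handles task >> task and task >> [task, task] (fan-out).
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            t.upstream.add(self.task_id)
        return other

    def __rrshift__(self, other):
        # Handles [task, task] >> task (fan-in). A list has no __rshift__,
        # so Python falls back to the right operand's reflected method.
        for t in other:
            self.upstream.add(t.task_id)
        return self


extract, transform, validate, load = (
    Task(name) for name in ("extract", "transform", "validate", "load")
)
extract >> [transform, validate]  # fan-out
[transform, validate] >> load     # fan-in

print(sorted(transform.upstream))  # ['extract']
print(sorted(validate.upstream))   # ['extract']
print(sorted(load.upstream))       # ['transform', 'validate']
```

Note that a list on both sides, such as `[a, b] >> [c, d]`, is not covered by this mechanism, because neither operand is a task.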
Complex Dependency Example
start >> extract
extract >> [transform, validate]
[transform, validate] >> load >> end
This creates a diamond-shaped DAG, common in production.
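To make the diamond shape concrete, the sketch below groups the tasks into "waves" that could run at the same time. It uses Python's standard `graphlib` module as a stand-in for a scheduler; the `upstream` mapping and the `waves` list are illustrative names, and real Airflow parallelism also depends on executor and pool settings.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for.
upstream = {
    "extract": {"start"},
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
    "end": {"load"},
}

sorter = TopologicalSorter(upstream)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as finished

print(waves)
# [['start'], ['extract'], ['transform', 'validate'], ['load'], ['end']]
```

The middle wave is the wide part of the diamond: transform and validate have no dependency on each other, so nothing forces one to wait for the other.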
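What Happens on Failure?
- Tasks downstream of the failed task do not run; with the default trigger rule Airflow marks them upstream_failed rather than skipped
- Branches that do not depend on the failed task keep running, and the DAG run is marked failed once no more tasks can run
- The failed task's log shows the root cause
This prevents bad data from flowing downstream.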
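Failure propagation can be simulated on the diamond-shaped graph with a short script. This is a simplified model that assumes every task uses the default all_success rule and ignores retries; `simulate` and `failing` are hypothetical names. Under that default rule, Airflow labels a task whose upstream did not succeed as upstream_failed.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for.
upstream = {
    "extract": {"start"},
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
    "end": {"load"},
}


def simulate(upstream, failing):
    """Return a final state per task, assuming the default all_success rule."""
    states = {}
    # static_order yields every task after all of its upstream tasks.
    for task in TopologicalSorter(upstream).static_order():
        parents = upstream.get(task, set())
        if any(states[p] != "success" for p in parents):
            states[task] = "upstream_failed"  # never started
        elif task in failing:
            states[task] = "failed"
        else:
            states[task] = "success"
    return states


states = simulate(upstream, failing={"transform"})
for task in ("start", "extract", "transform", "validate", "load", "end"):
    print(f"{task}: {states[task]}")
# start: success
# extract: success
# transform: failed
# validate: success
# load: upstream_failed
# end: upstream_failed
```

In this model validate still succeeds, because it does not depend on transform; only load and end are blocked.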
Best Practices (Enterprise Grade)
- Prefer >> and << for readability
- Keep dependency chains simple
- Use fan-in/fan-out thoughtfully
- Avoid overly complex DAG graphs
- Visualize DAGs in Graph View
Common Mistakes
- Circular dependencies (Airflow blocks them)
- Mixing dependency styles in one DAG
- Overloading DAGs with logic
- Forgetting to define dependencies, which leaves tasks free to run in any order
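A circular dependency has no valid execution order, which is why Airflow refuses to load a DAG that contains one. The sketch below shows the same kind of check using Python's standard `graphlib` module; it is an analogy, not Airflow's own cycle detection, and the `has_cycle` flag is an illustrative name.

```python
from graphlib import CycleError, TopologicalSorter

# extract -> transform -> load -> extract forms a loop.
upstream = {
    "transform": {"extract"},
    "load": {"transform"},
    "extract": {"load"},
}

try:
    # Consuming static_order() forces the sort, which raises on a cycle.
    list(TopologicalSorter(upstream).static_order())
    has_cycle = False
except CycleError as err:
    has_cycle = True
    print("cycle detected:", err.args[1])  # the tasks that form the loop

print(has_cycle)  # True
```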
Key Takeaways
- Task dependencies control execution order
- set_upstream and set_downstream define relationships
- Bitshift operators are cleaner and preferred
- Proper dependencies prevent data corruption
- Clear DAG graphs improve maintainability
Summary
In this chapter, you learned:
- Why task dependencies exist
- Three ways to define dependencies
- Linear, fan-in, and fan-out patterns
- Failure behavior in Airflow
- Professional best practices
You now control the flow of execution in Airflow.
What's Next?
Scheduling & Cron Expressions
Learn how Airflow decides when your DAG runs:
- Presets (@daily, @hourly)
- Cron expressions
- Timezone handling