Exploring DAG Design Patterns in Apache Airflow
Software Engineer
Exploring DAG Design Patterns
What We'll Cover Today
● Motivation
● Introduction to DAGs in Airflow
● Task best practices
● Organize tasks
● DAG flexibility
● Parallelism
● My presentation, comments and opinions are provided in my personal capacity and not as a representative of Walmart. They do not reflect the views of Walmart and are not endorsed by Walmart.
Motivation
● Maintainability
● Efficiency - Resource/cost
● Flexibility - Reuse
Introduction to DAGs in Airflow
● DAG
● Key components
○ Tasks
■ Operators
■ Sensors
○ Dependencies
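A minimal sketch of how these pieces fit together (the DAG id, callables, and task names are illustrative, not from the slides): one DAG object, two tasks built from PythonOperator, and a single dependency. Sensors, the other task type listed above, are simply operators that wait for a condition before succeeding.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

with DAG(
    dag_id='intro_example',
    start_date=datetime(2023, 6, 1),
    schedule=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    load_task = PythonOperator(task_id='load', python_callable=load)
    extract_task >> load_task  # dependency: extract must finish before load starts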
Task Best Practices
Keep tasks small & focused
def process_and_load_data():
    # Extract data
    data = extract_data()
    # Clean data
    cleaned_data = clean_data(data)
    # Transform data
    transformed_data = transform_data(cleaned_data)
    # Load data
    load_data(transformed_data)

task = PythonOperator(
    task_id='process_and_load_data',
    python_callable=process_and_load_data,
    dag=dag,
)
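One possible refactor, sketched with the TaskFlow API so each step becomes its own small, independently retryable task and the intermediate results travel over XCom (extract_data, clean_data, transform_data, and load_data are the helpers the slide already assumes):

from airflow.decorators import task

@task
def extract():
    return extract_data()

@task
def clean(raw):
    return clean_data(raw)

@task
def transform(cleaned):
    return transform_data(cleaned)

@task
def load(transformed):
    load_data(transformed)

# Wiring the calls together declares the dependencies
load(transform(clean(extract())))

For large datasets, pass references (file paths, table names) between tasks rather than the data itself, since XCom is backed by the metadata database.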
Idempotency
● Tasks should produce the same results regardless of how many times they're run.

def upsert_data(**kwargs):
    data = get_data_from_source()
    for record in data:
        db.upsert(record)  # Updates if exists, inserts if not
Atomicity
● Tasks should complete entirely or not at all; treat each task like a transaction.

def atomic_task():
    try:
        print("Performing task operations...")
    except Exception as e:
        print(f"Error occurred: {e}")
        raise
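One common way to get atomic behaviour in practice is to write to a temporary location and only swap it into place at the end; a hedged sketch (the path and the rows argument are illustrative):

import os
import tempfile

def write_report_atomically(rows, final_path='/data/report.csv'):
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path))
    try:
        with os.fdopen(fd, 'w') as f:
            for row in rows:
                f.write(row + '\n')
        os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no half-written output behind
        raise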
Retries and error handling
● Proper error handling is vital for robust DAGs.
from datetime import timedelta

def unreliable_task():
    import random
    if random.choice([True, False]):
        raise Exception("Random failure!")

retry_task = PythonOperator(
    task_id='retry_task',
    python_callable=unreliable_task,
    retries=3,
    retry_delay=timedelta(minutes=5),
)
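Retries and alerting can also be set once for every task in the DAG through default_args; a sketch with a hypothetical notify_on_failure callback:

from datetime import datetime, timedelta
from airflow import DAG

def notify_on_failure(context):
    # Hypothetical alerting hook; Airflow passes the task context on failure
    print(f"Task {context['task_instance'].task_id} failed")

default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': notify_on_failure,
}

dag = DAG(
    dag_id='robust_pipeline',
    default_args=default_args,
    start_date=datetime(2023, 6, 1),
    schedule=None,
    catchup=False,
)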
Bad vs Good Example
Organize tasks
Trigger Rules
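Trigger rules decide when a task may run based on the states of its upstream tasks (the default, all_success, waits for every upstream task to succeed). A hedged sketch inside an assumed DAG definition, with a cleanup step that always runs and an alert that fires only when something upstream fails (the callables and upstream task names are assumptions):

from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

cleanup = PythonOperator(
    task_id='cleanup_temp_files',
    python_callable=cleanup_temp_files,   # assumed callable
    trigger_rule=TriggerRule.ALL_DONE,    # run even if upstream tasks failed
)

alert = PythonOperator(
    task_id='send_failure_alert',
    python_callable=send_failure_alert,   # assumed callable
    trigger_rule=TriggerRule.ONE_FAILED,  # run as soon as any upstream task fails
)

[extract_task, transform_task, load_task] >> cleanup
[extract_task, transform_task, load_task] >> alert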
Linear Workflow Pattern
● Tasks are executed sequentially, with each task depending on the previous one (see the sketch below).
● Pros:
○ Clear dependency chain
○ Easy to track progress and identify bottlenecks
● Cons:
○ Limited parallelism, potentially slower execution for complex workflows
○ If one task fails, the entire workflow stops
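Declaring the chain is just the dependency syntax; a sketch with illustrative task names, showing both the bit-shift form and the equivalent chain() helper:

from airflow.models.baseoperator import chain

extract_task >> transform_task >> load_task
chain(extract_task, transform_task, load_task)  # same dependencies as the line above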
Fan-Out/Fan-In Pattern
● Tasks fan out for parallel processing and then converge for final processing (sketch below).
● Useful for processing multiple datasets or data partitions in parallel.
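A hedged sketch of the shape, assuming it sits inside a DAG definition and process_partition is an existing callable:

from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator

start = EmptyOperator(task_id='start')
join = EmptyOperator(task_id='join')

partition_tasks = [
    PythonOperator(
        task_id=f'process_partition_{i}',
        python_callable=process_partition,  # assumed callable
        op_kwargs={'partition': i},
    )
    for i in range(3)
]

start >> partition_tasks   # fan out: the partition tasks can run in parallel
partition_tasks >> join    # fan in: join waits for all of them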
Branching and Conditional Execution
● Dynamically choose which tasks to execute based on runtime conditions.
● Pros:
○ Allows for dynamic and flexible workflows
○ Reduces the need for multiple similar DAGs
● Cons:
○ May require careful testing to ensure all branches work correctly
Branching and Conditional Execution
branch_task = BranchPythonOperator(
    task_id='data_quality_check',
    python_callable=data_quality_check,
)
● Trigger rules control how tasks downstream of a branch behave when one path is skipped
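A hedged sketch around the operator above: the callable returns the task_id of the branch to follow, and the downstream join uses a relaxed trigger rule so the skipped branch does not block it (run_quality_checks and the downstream task names are assumptions):

from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

def data_quality_check(**kwargs):
    # Return the task_id (or list of task_ids) to execute next
    if run_quality_checks():              # assumed helper
        return 'transform_data'
    return 'send_quality_alert'

transform = EmptyOperator(task_id='transform_data')
alert = EmptyOperator(task_id='send_quality_alert')
join = EmptyOperator(
    task_id='join',
    trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS,  # tolerate the skipped branch
)

branch_task >> [transform, alert]
[transform, alert] >> join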
Branching and Conditional Execution
ShortCircuitOperator(
    task_id='check_data_availability',
    python_callable=check_data_availability,
)
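The callable simply returns a truthy or falsy value; a minimal hedged sketch (source_has_new_files is an assumed helper):

def check_data_availability(**kwargs):
    # Returning False short-circuits the DAG: all downstream tasks are skipped
    return source_has_new_files()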
Dynamic Task Generation
● You need to create many similar tasks dynamically based on data or configuration.
● Dynamically generating tasks based on API results
● Parallelism
Dynamic Task Generation Example
# List of files to process
files_to_process = ['file_A.csv', 'file_B.csv', 'file_C.csv']
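A sketch of the loop that would follow inside a with DAG(...) block, creating one task per file (process_file is an assumed callable). On Airflow 2.3+ the same effect can also be achieved with dynamic task mapping via .expand().

for file_name in files_to_process:
    PythonOperator(
        task_id=f"process_{file_name.replace('.csv', '')}",
        python_callable=process_file,       # assumed callable
        op_kwargs={'file_name': file_name},
    )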
Task Groups
● Organize complex DAGs into logical groups
● Improve DAG readability and maintainability
● Simplify dependency management between groups of tasks
Task Groups
with TaskGroup('extract_and_transform') as extract_transform_group:
    extract >> clean >> transform

with TaskGroup('load_and_validate') as load_validate_group:
    load >> validate
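The groups then act as single nodes for dependency purposes. Note that tasks belong to the group in which they are instantiated, so the extract/clean/transform/load/validate operators above are assumed to be created inside those with blocks. A sketch of wiring the groups together (start and end are illustrative):

from airflow.operators.empty import EmptyOperator

start = EmptyOperator(task_id='start')
end = EmptyOperator(task_id='end')

start >> extract_transform_group >> load_validate_group >> end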
Combined dynamic tasks + task groups
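A hedged sketch of combining the two patterns: one nested TaskGroup per file, each holding a small extract >> load chain. It reuses files_to_process from the earlier slide; extract_file and load_file are assumed callables, and the whole block is assumed to sit inside a DAG definition.

from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

with TaskGroup('process_files') as process_files_group:
    for file_name in files_to_process:
        with TaskGroup(group_id=file_name.replace('.csv', '')) as file_group:
            extract = PythonOperator(
                task_id='extract',
                python_callable=extract_file,   # assumed callable
                op_kwargs={'file_name': file_name},
            )
            load = PythonOperator(
                task_id='load',
                python_callable=load_file,      # assumed callable
                op_kwargs={'file_name': file_name},
            )
            extract >> load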
Configure DAGs
Configuring DAGs for Flexibility and Scalability
● Leverage DAG parameters for dynamic execution
● Implement cross-DAG dependencies
● Generate DAGs dynamically for complex workflows
DAG Params
from datetime import datetime
from airflow.decorators import dag, task
from airflow.models.param import Param

@dag(
    start_date=datetime(2023, 6, 1),
    schedule=None,
    catchup=False,
    params={
        "greeting": "Hello!",
        "multiplier": Param(default=3, type="integer"),
        "repeat_count": Param(default=5, type="integer"),
    },
)
def params_demo():  # hypothetical DAG function name; the slide omits it
    @task
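    # Hypothetical continuation (the slide stops at @task): a task that reads
    # the DAG-level params from its runtime context; names here are illustrative
    def print_greeting(**context):
        params = context["params"]
        print(params["greeting"] * params["multiplier"])
        print("repeat_count =", params["repeat_count"])

    print_greeting()

params_demo()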
DAG Params
categories = {
    'electronics': 'https://api.example.com/electronics',
    'clothing': 'https://api.example.com/clothing',
    'books': 'https://api.example.com/books',
}

# One fetch task per category
for name, endpoint in categories.items():
    fetch_task = PythonOperator(
        task_id=f'fetch_{name}_data',
        python_callable=fetch_data,
        op_kwargs={'category': name},
    )
Cross DAG triggering
trigger_dag_b = TriggerDagRunOperator(
    task_id='trigger_dag_b',
    trigger_dag_id='dag_b_reporting',
    conf={'triggered_by': 'dag_a'},
)
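On the receiving side, a task in dag_b_reporting can read that payload from the run's conf; a minimal hedged sketch:

def read_trigger_conf(**context):
    conf = context['dag_run'].conf or {}
    print('Triggered by:', conf.get('triggered_by', 'schedule'))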
Cross DAG Sensor
wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_dag_a',
    external_dag_id='dag_a_generate',
    external_task_id='generate_data',
    timeout=3600,       # time out after 1 hour
    poke_interval=60,   # check every 60 seconds
    mode='poke',
)
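● Note: ExternalTaskSensor matches runs by logical (execution) date, so if the two DAGs are not on the same schedule, pass execution_delta or execution_date_fn (for example execution_delta=timedelta(hours=1)) so the sensor looks at the correct dag_a_generate run.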
Dynamic DAG Generation
● DAGs share common code
● Need to run at different schedules
● Generate using a common template
Dynamic DAG Generation
categories:
  - name: electronics
    schedule: '0 1 * * *'
    api_endpoint: 'https://api.example.com/electronics'
  - name: clothing
    schedule: '0 2 * * *'
    api_endpoint: 'https://api.example.com/clothing'
  - name: books
    schedule: '0 3 * * *'
    api_endpoint: 'https://api.example.com/books'
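A hedged sketch of the generator that could pair with this YAML: parse the file at module import time and register one DAG per category in globals() so the scheduler discovers them (the config file name and the fetch_data callable are assumptions carried over from the earlier fetch example):

import yaml
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with open('category_configs.yaml') as f:   # assumed config file name
    config = yaml.safe_load(f)

def make_category_dag(name, schedule, api_endpoint):
    with DAG(
        dag_id=f'fetch_{name}',
        schedule=schedule,
        start_date=datetime(2023, 6, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id=f'fetch_{name}_data',
            python_callable=fetch_data,    # assumed callable
            op_kwargs={'category': name, 'endpoint': api_endpoint},
        )
    return dag

for cat in config['categories']:
    # Each generated DAG must land in globals() for the scheduler to pick it up
    globals()[f'fetch_{cat["name"]}'] = make_category_dag(
        cat['name'], cat['schedule'], cat['api_endpoint']
    )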
DAG Concurrency
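Airflow exposes concurrency limits at several levels; a hedged sketch of the DAG-level and task-level knobs (the pool name is an assumption and would have to be created in the Airflow UI or CLI first):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='concurrency_example',
    start_date=datetime(2023, 6, 1),
    schedule='@daily',
    catchup=False,
    max_active_runs=1,       # at most one run of this DAG at a time
    max_active_tasks=8,      # at most 8 tasks of this DAG run concurrently
) as dag:
    heavy_query = PythonOperator(
        task_id='heavy_query',
        python_callable=run_heavy_query,  # assumed callable
        pool='database_pool',             # assumed pool capping DB connections
    )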
● Task best practices
● Organize tasks
● DAG flexibility
Questions?