0% found this document useful (0 votes)

148 views9 pages

Apache Airflow - A Python Hands-On Guide

The document is a hands-on guide for using Apache Airflow with Python, detailing how to write Directed Acyclic Graphs (DAGs) and utilize various operators like PythonOperator and BranchPythonOperator. It also covers integration with providers such as AWS, Google Cloud, PostgreSQL, and Slack, along with examples for executing tasks and managing dependencies. Additionally, it includes a section on orchestrating Apache Spark jobs within Airflow workflows.

Uploaded by

bisennikhil49

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

148 views9 pages

Apache Airflow - A Python Hands-On Guide

Uploaded by

bisennikhil49

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Apache Airflow - A Python Hands-On Guide

Available Provider
Available Provider

1. Writing a DAG in Python

A Directed Acyclic Graph (DAG) is the core abstraction in Airflow. It defines the
workflow and task dependencies.

Basic DAG Structure

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Define default arguments

default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
}

# Initialize DAG
with DAG(
dag_id='example_dag',
default_args=default_args,
description='A simple example DAG',
schedule_interval='@daily',
start_date=datetime(2024, 1, 1),
catchup=False,
) as dag:

def print_hello():
print("Hello, Airflow!")

task = PythonOperator(

1/9
Apache Airflow - A Python Hands-On Guide

task_id='print_hello',
python_callable=print_hello,
)

Key Notes:
DAG : The container for your workflow.
PythonOperator : Executes Python functions.
schedule_interval : Defines the schedule (e.g., @daily , @hourly ).
catchup : Prevents past executions from running.

2. Options for Python Operators

Airflow offers various operators that use Python extensively:

PythonOperator
Executes Python callables.

PythonOperator(
task_id='process_data',
python_callable=process_data_function,
op_kwargs={'param': 'value'}, # Pass arguments to the callable
)

BranchPythonOperator
Allows branching based on a condition.

from airflow.operators.python_operator import BranchPythonOperator

def choose_branch(**kwargs):
return 'branch_1' if kwargs['some_condition'] else 'branch_2'

branch_task = BranchPythonOperator(
task_id='branching',
python_callable=choose_branch,
provide_context=True,
)

2/9
Apache Airflow - A Python Hands-On Guide

PythonVirtualenvOperator
Executes Python code within a virtual environment.

from airflow.operators.python_operator import

PythonVirtualenvOperator

virtualenv_task = PythonVirtualenvOperator(
task_id='venv_task',
python_callable=lambda: print("Running in a virtualenv!"),
requirements=["numpy", "pandas"],
system_site_packages=False,
)

3. Working with Providers

Providers are integrations for various platforms. Here's how to use five popular ones
with Python:

1. AWS Provider
Install: pip install apache-airflow-providers-amazon

Example: S3 File Upload

from airflow.providers.amazon.aws.operators.s3 import

S3CreateObjectOperator

upload_task = S3CreateObjectOperator(
task_id='upload_to_s3',
aws_conn_id='my_aws_conn',
s3_bucket='my_bucket',
s3_key='path/to/file.txt',
data="Sample Data",
)

2. Google Cloud Provider

Install: pip install apache-airflow-providers-google

3/9
Apache Airflow - A Python Hands-On Guide

Example: BigQuery Query Execution

from airflow.providers.google.cloud.operators.bigquery import

BigQueryExecuteQueryOperator

bq_task = BigQueryExecuteQueryOperator(
task_id='bq_query',
sql='SELECT * FROM my_dataset.my_table',
gcp_conn_id='my_gcp_conn',
use_legacy_sql=False,
)

3. PostgreSQL Provider
Install: pip install apache-airflow-providers-postgres

Example: Run SQL on PostgreSQL

from airflow.providers.postgres.operators.postgres import

PostgresOperator

sql_task = PostgresOperator(
task_id='run_postgres_query',
postgres_conn_id='my_postgres_conn',
sql='SELECT * FROM my_table;',
)

4. Slack Provider
Install: pip install apache-airflow-providers-slack

Example: Send Slack Notification

from airflow.providers.slack.operators.slack_webhook import

SlackWebhookOperator

slack_task = SlackWebhookOperator(
task_id='send_slack_message',

4/9
Apache Airflow - A Python Hands-On Guide

http_conn_id='slack_conn',
message="Workflow completed successfully!",
channel="#alerts",
)

5. MySQL Provider
Install: pip install apache-airflow-providers-mysql

Example: Execute SQL in MySQL

from airflow.providers.mysql.operators.mysql import MySqlOperator

mysql_task = MySqlOperator(
task_id='mysql_query',
mysql_conn_id='my_mysql_conn',
sql='INSERT INTO my_table (id, value) VALUES (1, "test");',
)

5. Python Cheatsheet for Apache Airflow

Component Python Example

DAG DAG(dag_id='my_dag', schedule_interval='@daily', ...)
Initialization
PythonOperator PythonOperator(task_id='task',
python_callable=my_func)

Branching BranchPythonOperator(task_id='branch',
python_callable=my_func)

S3 Upload S3CreateObjectOperator(...,
s3_key='path/to/file.txt')

SQL Execution PostgresOperator(sql='SELECT * FROM table;')

BigQuery BigQueryExecuteQueryOperator(sql='SELECT * FROM

table')

Slack SlackWebhookOperator(message="Job done!")

Notification

5/9
Apache Airflow - A Python Hands-On Guide

Component Python Example

Virtualenv PythonVirtualenvOperator(python_callable=my_func,
...)

Working with Apache Spark in Airflow

Apache Spark is a distributed data processing framework widely used for big data
tasks. In Airflow, we can manage and orchestrate Spark jobs using operators such as:

SparkSubmitOperator : Submits a Spark job directly to a cluster.

EmrAddStepsOperator : Submits a Spark job to an Amazon EMR cluster.
DataprocSubmitJobOperator : Submits a Spark job to Google Dataproc.

These operators allow us to control Spark jobs programmatically within Airflow

workflows.

Complex Workflow: Conditional Spark Job Execution

Workflow Logic:

1. Execute spark_job_1 .
2. Execute spark_job_2 .
3. If spark_job_2 fails, run spark_job_3 .
4. If spark_job_2 succeeds, run spark_job_4 .

DAG Implementation
Step 1: Import Required Modules

from airflow import DAG

from airflow.operators.dummy_operator import DummyOperator
from airflow.providers.apache.spark.operators.spark_submit import
SparkSubmitOperator

6/9
Apache Airflow - A Python Hands-On Guide

from airflow.operators.python_operator import BranchPythonOperator

from datetime import datetime

Step 2: Define the DAG and Default Arguments

default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email_on_failure': False,
'retries': 1,
}

dag = DAG(
dag_id='spark_conditional_jobs',
default_args=default_args,
description='A DAG with conditional Spark job execution',
schedule_interval=None,
start_date=datetime(2023, 12, 1),
catchup=False,
)

Step 3: Define the SparkSubmitOperator Jobs

# Spark Job 1
spark_job_1 = SparkSubmitOperator(
task_id='spark_job_1',
application='/path/to/spark_job_1.py',
conn_id='spark_default', # Connection to your Spark cluster
application_args=['arg1', 'arg2'],
dag=dag,
)

# Spark Job 2
spark_job_2 = SparkSubmitOperator(
task_id='spark_job_2',
application='/path/to/spark_job_2.py',
conn_id='spark_default',
application_args=['arg1', 'arg2'],
dag=dag,
)

# Spark Job 3
spark_job_3 = SparkSubmitOperator(
task_id='spark_job_3',

7/9
Apache Airflow - A Python Hands-On Guide

application='/path/to/spark_job_3.py',
conn_id='spark_default',
application_args=['arg1', 'arg2'],
dag=dag,
)

# Spark Job 4
spark_job_4 = SparkSubmitOperator(
task_id='spark_job_4',
application='/path/to/spark_job_4.py',
conn_id='spark_default',
application_args=['arg1', 'arg2'],
dag=dag,
)

Step 4: Branch Logic Using BranchPythonOperator

def choose_next_task(**kwargs):
# Check the state of spark_job_2
task_instance = kwargs['ti']
spark_job_2_state =
task_instance.xcom_pull(task_ids='spark_job_2', key='return_value')

# Return the task to execute next

if spark_job_2_state == 'failed':
return 'spark_job_3'
return 'spark_job_4'

branch_task = BranchPythonOperator(
task_id='branch_task',
python_callable=choose_next_task,
provide_context=True,
dag=dag,
)

Step 5: Define the Task Dependencies

start = DummyOperator(task_id='start', dag=dag)

end = DummyOperator(task_id='end', dag=dag)

# Define dependencies
start >> spark_job_1
spark_job_1 >> spark_job_2
spark_job_2 >> branch_task

8/9
Apache Airflow - A Python Hands-On Guide

branch_task >> spark_job_3 >> end

branch_task >> spark_job_4 >> end

Explanation of Components
1. SparkSubmitOperator :
Used to submit Spark jobs to a cluster.
Specify the application path, connection ID, and arguments for the Spark job.
2. BranchPythonOperator :
Dynamically determines the next task based on the state of a previous task.
In this case, it checks if spark_job_2 succeeded or failed.
3. Dependencies:
The workflow ensures sequential execution from spark_job_1 to
spark_job_2 and conditional branching to spark_job_3 or spark_job_4 .

Python Cheatsheet for Spark in Airflow

Component Python Example

SparkSubmitOperator SparkSubmitOperator(application='/path/app.py',
...)

BranchPythonOperator BranchPythonOperator(python_callable=my_func,
...)

Conditional Execution branch_task >> task_1 >> end or branch_task >>

task_2

Task Dependency task_1 >> task_2 >> task_3

9/9

Apache Airflow Cookbook 2
No ratings yet
Apache Airflow Cookbook 2
55 pages
6869173cb375f164b6288668 - ## - Periodic Table Jwala Notes
No ratings yet
6869173cb375f164b6288668 - ## - Periodic Table Jwala Notes
210 pages
CSP Report
No ratings yet
CSP Report
34 pages
The Ultimate Guide To Apache Airflow DAGs
No ratings yet
The Ultimate Guide To Apache Airflow DAGs
135 pages
Apache Airflow 1741977651
No ratings yet
Apache Airflow 1741977651
83 pages
Apache Airflow
No ratings yet
Apache Airflow
24 pages
Brother Sewing Machine Guide
No ratings yet
Brother Sewing Machine Guide
1 page
Airflow Chapter4
No ratings yet
Airflow Chapter4
30 pages
Airflow Notes
No ratings yet
Airflow Notes
10 pages
Airflow 101
No ratings yet
Airflow 101
25 pages
Apache Airflow Fundamentals Study Guide
No ratings yet
Apache Airflow Fundamentals Study Guide
7 pages
Review of Literature
No ratings yet
Review of Literature
3 pages
Chapter 1-Financial Decision Making
No ratings yet
Chapter 1-Financial Decision Making
29 pages
Airflow Git CICD
No ratings yet
Airflow Git CICD
6 pages
Airflow
No ratings yet
Airflow
97 pages
Jacques Alain Miller Marginalia
100% (2)
Jacques Alain Miller Marginalia
22 pages
Airflow - Notes
No ratings yet
Airflow - Notes
82 pages
Airflow Notes
No ratings yet
Airflow Notes
5 pages
Apache Airflow 50
100% (1)
Apache Airflow 50
50 pages
2.airflow 2
No ratings yet
2.airflow 2
17 pages
Best Practices Apache Airflow
100% (1)
Best Practices Apache Airflow
28 pages
Apache Airflow For Data Science
No ratings yet
Apache Airflow For Data Science
23 pages
Study Guide For Apache Airflow Fundamentals Certification
No ratings yet
Study Guide For Apache Airflow Fundamentals Certification
6 pages
Ac Pac10p Pac15px
No ratings yet
Ac Pac10p Pac15px
11 pages
Airflow 101 Mobile
No ratings yet
Airflow 101 Mobile
48 pages
PySpark Airflow Interview Pack Shubham
No ratings yet
PySpark Airflow Interview Pack Shubham
3 pages
Group2 Buntal Hats
No ratings yet
Group2 Buntal Hats
8 pages
Acute GlomeruloNephritis - AGN
No ratings yet
Acute GlomeruloNephritis - AGN
36 pages
GuideToApacheAirflow PDF
100% (1)
GuideToApacheAirflow PDF
6 pages
The Air in The Kitchen Hung Thick and Heavy
No ratings yet
The Air in The Kitchen Hung Thick and Heavy
1 page
29 (Number) - Wikipedia
No ratings yet
29 (Number) - Wikipedia
1 page
Unit 1 TO Cyberpreneuship: Notes By: Farihan Elyana BT Zahari
No ratings yet
Unit 1 TO Cyberpreneuship: Notes By: Farihan Elyana BT Zahari
12 pages
Task - Fixing in Progress Report
No ratings yet
Task - Fixing in Progress Report
4 pages
Going To Exercise
No ratings yet
Going To Exercise
2 pages
Airflow Codespace Syntax
No ratings yet
Airflow Codespace Syntax
2 pages
Developing Elegant Workflows in Python Code With Apache Airflow
100% (1)
Developing Elegant Workflows in Python Code With Apache Airflow
35 pages
Airflow Chapter2
No ratings yet
Airflow Chapter2
35 pages
Sid Anand Qcon Ai 2018 v2 PDF
No ratings yet
Sid Anand Qcon Ai 2018 v2 PDF
35 pages
Accidentally Pulled A Remote Branch Into Different Local Branch - How To Undo The Pull?: Git
No ratings yet
Accidentally Pulled A Remote Branch Into Different Local Branch - How To Undo The Pull?: Git
4 pages
Chapter 2
No ratings yet
Chapter 2
35 pages
Paradigm Shift Taylor Peterson
No ratings yet
Paradigm Shift Taylor Peterson
7 pages
What Is Apache Airflow
No ratings yet
What Is Apache Airflow
22 pages
98 Exploring DAG Design Patterns in Apache Airflow
No ratings yet
98 Exploring DAG Design Patterns in Apache Airflow
32 pages
Aws Ques
No ratings yet
Aws Ques
62 pages
Science Room Rules
No ratings yet
Science Room Rules
5 pages
Suzette Resume Offical Rev 1-2020
No ratings yet
Suzette Resume Offical Rev 1-2020
2 pages
Apache Airflow Certification - Study Guide For DAG Authoring
No ratings yet
Apache Airflow Certification - Study Guide For DAG Authoring
17 pages
Nice To Meet You
No ratings yet
Nice To Meet You
2 pages
Introduction
No ratings yet
Introduction
6 pages
LV MT/MT Series Datasheet: Technical Data
No ratings yet
LV MT/MT Series Datasheet: Technical Data
2 pages
Pyspark For Etl
No ratings yet
Pyspark For Etl
4 pages
Deeplearning - Ai Deeplearning - Ai
No ratings yet
Deeplearning - Ai Deeplearning - Ai
93 pages
Tetris On Canvas - CodeProject
No ratings yet
Tetris On Canvas - CodeProject
8 pages
Week 6. Airflow Overview
No ratings yet
Week 6. Airflow Overview
71 pages
Christianity and Sikhism - Edited
No ratings yet
Christianity and Sikhism - Edited
5 pages
Setting Up Airflow With Docker From Installation To Data Processing
No ratings yet
Setting Up Airflow With Docker From Installation To Data Processing
10 pages
Etalab Talk Apache Airflow Embulk
No ratings yet
Etalab Talk Apache Airflow Embulk
29 pages
Airflow Web UI and CLI
No ratings yet
Airflow Web UI and CLI
51 pages
Porter's Five Forces
No ratings yet
Porter's Five Forces
3 pages
Apache Airflow TRAINING12532
No ratings yet
Apache Airflow TRAINING12532
3 pages
Apache Airflow For Data Engineering - The Ultimate Guide - by Vijay Gadhave - Mar, 2025 - Medium
No ratings yet
Apache Airflow For Data Engineering - The Ultimate Guide - by Vijay Gadhave - Mar, 2025 - Medium
18 pages
Overview - DAg Structure and Operators-1
No ratings yet
Overview - DAg Structure and Operators-1
6 pages
Building Data Pipelines - 4
No ratings yet
Building Data Pipelines - 4
38 pages
Airflow Best Practices
No ratings yet
Airflow Best Practices
34 pages
Day10 Airflow
No ratings yet
Day10 Airflow
5 pages
Airflowintroduction 190217155729
No ratings yet
Airflowintroduction 190217155729
21 pages
2.2 Create An Airflow DAG To Read and Write Files Using The PythonOperator
No ratings yet
2.2 Create An Airflow DAG To Read and Write Files Using The PythonOperator
3 pages
Airflow
No ratings yet
Airflow
7 pages
Apache Airflow
No ratings yet
Apache Airflow
10 pages
2022 NCE English 3rd Form
No ratings yet
2022 NCE English 3rd Form
4 pages
Food Pyramid
No ratings yet
Food Pyramid
21 pages
Notes Airflow MQTT
No ratings yet
Notes Airflow MQTT
6 pages
Airflow Dag Bash
No ratings yet
Airflow Dag Bash
6 pages
Data Engineering Assignment Report
No ratings yet
Data Engineering Assignment Report
9 pages
Scenario Based Airflow Interview Questions
No ratings yet
Scenario Based Airflow Interview Questions
4 pages
Airflow Introduction
No ratings yet
Airflow Introduction
9 pages
Airflow
No ratings yet
Airflow
7 pages
Data Engineering Assignment Report
No ratings yet
Data Engineering Assignment Report
9 pages
Precise Software Solutions - EPGP - 10 - 119 PDF
No ratings yet
Precise Software Solutions - EPGP - 10 - 119 PDF
4 pages
Appache Airflow
No ratings yet
Appache Airflow
5 pages
Apache Airflow Workflow
No ratings yet
Apache Airflow Workflow
4 pages
Material Safety Data Sheet: The Clorox Company
No ratings yet
Material Safety Data Sheet: The Clorox Company
1 page
AIRFLOW
No ratings yet
AIRFLOW
4 pages
ETL Pipeline, Class Notes
No ratings yet
ETL Pipeline, Class Notes
2 pages
Apache Airflow
50% (2)
Apache Airflow
8 pages
Direct Examination During Trials
No ratings yet
Direct Examination During Trials
3 pages
2 - Apache Airflow
No ratings yet
2 - Apache Airflow
5 pages
Namdeo Dhasal, A Poet of The Underworld
No ratings yet
Namdeo Dhasal, A Poet of The Underworld
5 pages

Apache Airflow - A Python Hands-On Guide

Uploaded by

Apache Airflow - A Python Hands-On Guide

Uploaded by

Apache Airflow - A Python Hands-On Guide