Airflow Notes
General commands
airflow info :Provides system and dependency information
airflow webserver :Starts the Airflow webserver
airflow worker :Starts a new Airflow worker
DAG commands
airflow dags list :Lists all the DAGs
airflow dags pause DAG_ID :Pauses a DAG
airflow dags unpause DAG_ID :Unpauses a DAG
airflow dags delete DAG_ID :Deletes a DAG
Task commands
airflow tasks test DAG_ID TASK_ID EXECUTION_DATE :Test a task instance. This will not produce any
database entries or trigger handlers
airflow tasks run DAG_ID TASK_ID EXECUTION_DATE :Run a single task instance
airflow tasks clear -t TASK_REGEX -s START_DATE -e END_DATE DAG_ID :Clears the task instances
Pool commands
airflow pools list :Lists all pools
airflow pools get POOL_NAME :Gets pool size and used slots count
airflow pools set POOL_NAME POOL_SLOT_COUNT POOL_DESCRIPTION :Creates a new pool with the given parameters
airflow pools delete POOL_NAME :Deletes a pool
Connection commands
airflow connections list :Lists all connections
airflow connections add CONN_ID :Adds a new connection (pass --conn-uri or --conn-type options)
airflow connections delete CONN_ID :Deletes a connection
Defining a DAG
from datetime import timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('example_dag', default_args=default_args, start_date=days_ago(2), schedule_interval='@daily') as dag:
    t1 = DummyOperator(task_id='task_1')
    t2 = DummyOperator(task_id='task_2')
    t1 >> t2
DAG arguments
owner :Owner of the DAG
start_date :The start date of the DAG
email :The email(s) to notify on task failure
retries :Number of times a failed task is retried
retry_delay :Time delay between retries
depends_on_past :If True, a task can't run unless the previous schedule succeeded
email_on_failure :If True, emails are sent on task failure
email_on_retry :If True, emails are sent on each task retry
description :Description of the DAG
schedule_interval :The interval of DAG execution
Ordering and Dependencies
t1 >> t2 :t1 precedes t2
t1 << t2 :t1 follows t2
t1.set_downstream(t2) :t1 precedes t2
t1.set_upstream(t2) :t1 follows t2
chain(t1, t2, t3) :Chains the tasks in sequence (t1 >> t2 >> t3).
cross_downstream([t1, t2], [t3]) :Creates dependencies t1 >> t3 and t2 >> t3
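A minimal sketch of these helpers; the import path shown is the Airflow 2.x location, and t1, t2, t3 are assumed to be tasks already defined in a DAG:
from airflow.models.baseoperator import chain, cross_downstream

chain(t1, t2, t3)                 # t1 >> t2 >> t3
cross_downstream([t1, t2], [t3])  # t1 >> t3 and t2 >> t3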
Using Operators
Basic Operator Usage
from airflow import DAG
DummyOperator :Does nothing; useful as a placeholder or for grouping tasks
from airflow.operators.dummy_operator import DummyOperator
task = DummyOperator(task_id='my_task')
PythonOperator :Executes a Python callable. Main parameters are task_id and python_callable
from airflow.operators.python_operator import PythonOperator
def my_func():
    print('Hello, World!')
python_task = PythonOperator(task_id='python_task', python_callable=my_func)
BashOperator :Executes a bash command. Main parameters are task_id and bash_command
from airflow.operators.bash_operator import BashOperator
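A minimal BashOperator task using the parameters named above; the echo command is just a placeholder:
bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo "Hello, World!"',
)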
SubDagOperator :Runs an entire DAG as a single task. Main parameters are task_id and subdag; see the sketch below.
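A sketch of how the SubDagOperator might be wired up; the parent DAG name 'parent_dag', the reuse of default_args, and the 'sub_task' task are assumptions. The subdag's dag_id must follow the '<parent_dag_id>.<task_id>' convention:
from airflow.operators.subdag_operator import SubDagOperator

def subdag(parent_dag_id, child_task_id, args):
    # The embedded DAG; its dag_id must be 'parent.child'
    sub = DAG(
        dag_id=f'{parent_dag_id}.{child_task_id}',
        default_args=args,
        start_date=days_ago(2),
        schedule_interval='@daily',
    )
    DummyOperator(task_id='sub_task', dag=sub)
    return sub

subdag_task = SubDagOperator(
    task_id='subdag_task',
    subdag=subdag('parent_dag', 'subdag_task', default_args),
    dag=dag,
)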
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class MyFirstOperator(BaseOperator):
    @apply_defaults
    def __init__(self, my_param, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_param = my_param
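To actually run, the operator also needs an execute method inside the class body above; a minimal sketch, where 'some_value' is just a placeholder argument:
    def execute(self, context):
        # Called when the task instance runs
        self.log.info('my_param is %s', self.my_param)

my_task = MyFirstOperator(task_id='custom_task', my_param='some_value', dag=dag)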
Tasks
Basic Task Declaration
A task is an instance of an operator. Here's a basic task declaration using the PythonOperator:
from airflow.operators.python_operator import PythonOperator
def hello_world():
    print('Hello, World!')
task = PythonOperator(task_id='hello_world', python_callable=hello_world)
Task Arguments
depends_on_past : If set to True, the task instance will only run if the previous task instance succeeded. Defaults to False
wait_for_downstream : If set to True, this task instance will wait for tasks downstream of the previous task instance to finish
before running.
retries : The number of times to retry failed tasks. Defaults to 0.
retry_delay : The time to wait before retrying the task. Defaults to five minutes. This must be a datetime.timedelta object.
queue : The name of the queue to use for task execution. This is used with the CeleryExecutor and KubernetesExecutor.
priority_weight : Defines the priority of the task in relation to other tasks. This can be used to prioritize certain tasks in the
scheduler.
pool : The name of the pool to use, if any. This can be used to limit parallelism for a set of tasks.
trigger_rule : Defines the rule by which the task gets triggered (all_success, all_failed, all_done, one_success, one_failed,
none_failed, none_skipped, etc.)
Example:-
task = PythonOperator(
task_id='hello_world_task',
python_callable=hello_world,
depends_on_past=True,
wait_for_downstream=True,
retries=3,
retry_delay=timedelta(minutes=1),
queue='my_queue',
priority_weight=10,
pool='my_pool'
)
TaskFlow API
The TaskFlow API lets you write tasks as plain Python functions decorated with @task; calling a decorated function inside a DAG returns an XComArg, so return values are passed between tasks automatically.
from airflow.decorators import task

@task
def my_task():
    return 'World'

@task
def greet(name):
    print(f'Hello, {name}!')

# my_task() returns an XComArg that greet() consumes
greet(my_task())
The BranchPythonOperator
The BranchPythonOperator is used when you want to follow different paths in your DAG based on a condition. It takes a Python callable that returns the task_id (or list of task_ids) of the branch to follow; all other branches are skipped.
from airflow.operators.python_operator import BranchPythonOperator

def choose_path():
    return 'task_a'  # task_id of the branch to follow

branch_task = BranchPythonOperator(
    task_id='branch_task',
    python_callable=choose_path,
)
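A minimal wiring sketch for the branches; task_a and task_b are hypothetical targets whose task_ids match what choose_path can return:
task_a = DummyOperator(task_id='task_a', dag=dag)
task_b = DummyOperator(task_id='task_b', dag=dag)
branch_task >> [task_a, task_b]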
The ShortCircuitOperator
The ShortCircuitOperator is used to bypass (short-circuit) a section of the DAG based on a condition. It is derived from the PythonOperator; its callable returns a truthy or falsy value, and when the value is falsy all downstream tasks are skipped.
from airflow.operators.python_operator import ShortCircuitOperator

def check_condition():
    return True  # return False to skip everything downstream

condition_task = ShortCircuitOperator(
    task_id='condition_task',
    python_callable=check_condition,
)
# Define further tasks here. They won't run if check_condition returns False
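A minimal wiring sketch; downstream_task is a hypothetical task that will be skipped whenever check_condition returns False:
downstream_task = DummyOperator(task_id='downstream_task', dag=dag)
condition_task >> downstream_task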
Dynamic DAGs
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

# Dynamically creating tasks inside one DAG
dag = DAG(
    dag_id='dynamic_tasks',
    default_args={'retries': 1},
    start_date=days_ago(2),
)
# Loop to create multiple tasks
for i in range(1, 6):
    DummyOperator(task_id=f'task_{i}', dag=dag)

# Dynamically creating DAGs with a factory function
def create_dag(dag_id, schedule):
    dag = DAG(
        dag_id=dag_id,
        default_args={'retries': 1},
        start_date=days_ago(2),
        schedule_interval=schedule,
    )
    with dag:
        DummyOperator(task_id='task_1')
        DummyOperator(task_id='task_2')
    return dag

# Register several DAGs in globals() so the scheduler picks them up
for i in range(1, 4):
    dag_id = f"my_dag_{i}"
    globals()[dag_id] = create_dag(dag_id, '@daily')
Error Handling
Callbacks
on_failure_callback : Function called on task failure. Use for custom failure behavior (e.g., notifications or cleanup).
on_retry_callback : Function called before task retry. Use for pre-retry adjustments or logging.
on_success_callback : Function called on task success. Use for post-success actions (e.g., notifications or downstream
triggers).
Example:-
# Define your callback functions
def task_fail_callback(context):
    print('Task failed')

def task_retry_callback(context):
    print('Task is up for retry')

def task_success_callback(context):
    print('Task succeeded')

# Assign callbacks to your task
task = PythonOperator(
    task_id='task',
    python_callable=my_function,
    on_failure_callback=task_fail_callback,
    on_retry_callback=task_retry_callback,
    on_success_callback=task_success_callback,
    dag=dag,
)
TriggerRule class for handling task dependencies
TriggerRule.ALL_SUCCESS : All parent tasks must have succeeded. This is the default rule.
Example:-
# Setting dependencies
from airflow.utils.trigger_rule import TriggerRule
task3 = DummyOperator(task_id='task3', trigger_rule=TriggerRule.ALL_SUCCESS, dag=dag)
[task1, task2] >> task3
Example for transfer operators
S3ToRedshiftOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
transfer_s3_redshift = S3ToRedshiftOperator(
task_id='transfer_s3_redshift',
schema='SCHEMA_NAME',
table='TABLE_NAME',
s3_bucket='S3_BUCKET',
s3_key='S3_KEY',
copy_options=['csv'],
dag=dag,
)
GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
transfer_gcs_bq = GoogleCloudStorageToBigQueryOperator(
task_id='gcs_to_bq_example',
bucket='BUCKET_NAME',
source_objects=['SOURCE_FILE_NAME.csv'],
destination_project_dataset_table='DESTINATION_DATASET.TABLE_NAME',
write_disposition='WRITE_TRUNCATE',
dag=dag,
)
Example for database operators
PostgresOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
t1 = PostgresOperator(
task_id='run_sql',
sql='sql_file.sql',
postgres_conn_id='postgres_default',
dag=dag,
)
MySqlOperator
from airflow.providers.mysql.operators.mysql import MySqlOperator
sql_op = MySqlOperator(
task_id='run_sql',
sql='sql_file.sql',
mysql_conn_id='mysql_default',
dag=dag,
)
TriggerRule Class
all_success : All parent tasks must have succeeded. This is the default trigger rule.
task3 = DummyOperator(task_id='task3', trigger_rule=TriggerRule.ALL_SUCCESS,dag=dag)
all_failed : All parent tasks must have failed
task3 = DummyOperator(task_id='task3',trigger_rule=TriggerRule.ALL_FAILED, dag=dag)
all_done : All parent tasks are done with their execution.
task3 = DummyOperator(task_id='task3',trigger_rule=TriggerRule.ALL_DONE, dag=dag)
one_success : Fires if at least one parent has succeeded; the task can be triggered as soon as one parent succeeds, without waiting for the others.
task3 = DummyOperator(task_id='task3',trigger_rule=TriggerRule.ONE_SUCCESS, dag=dag)
one_failed : Fires if at least one parent has failed; the task can be triggered as soon as one parent fails, without waiting for the others.
task3 = DummyOperator(task_id='task3',trigger_rule=TriggerRule.ONE_FAILED, dag=dag)
none_failed : Fires as long as no parents have failed, i.e. all parents have succeeded or been skipped.
task3 = DummyOperator(task_id='task3',trigger_rule=TriggerRule.NONE_FAILED, dag=dag)
How to use ChatGPT Effectively for Airflow Workflow
ChatGPT: "Explain the step-by-step process to setup a Directed Acyclic Graph (DAG) in Apache Airflow."
CodeGPT: "Generate a simple DAG in Apache Airflow with 3 tasks that run in sequence."
ChatGPT: "Detail the method for implementing conditional execution in Apache Airflow. What operators are
commonly used?"
CodeGPT: "Create an example DAG in Apache Airflow demonstrating conditional execution using the
BranchPythonOperator."
ChatGPT: " H ow should I handle errors in Apache Airflow tasks? What are the best practices?"
CodeGPT: "Generate a sample task in Apache Airflow with on _ failure _ callback , _ on retry _ callback , and
on _ success _ callback."
ChatGPT: "Explain the process of setting up a connection to a Postgre SQL database in Apache Airflow."
CodeGPT: "Generate a sample PostgresOperator task in an Airflow DAG that executes a simple SQL command."
Backfilling Data in Airflow
ChatGPT: "How to backfill data in Airflow? What considerations should be kept in mind when doing so?"
CodeGPT: "Show me an example of an Airflow command for backfilling a DAG from a specific date."
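For reference, the backfill CLI follows the same pattern as the other commands in these notes; DAG_ID and the dates are placeholders:
airflow dags backfill -s START_DATE -e END_DATE DAG_ID :Backfills the DAG over the given date range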
Scaling Airflow
ChatGPT: "What strategies can be applied to scale Airflow? H ow do I ensure my Airflow setup remains robust as
v
CodeGPT: "Pro ide an example of how to set up the L ocalExecutor in Airflow for better scalability."
ChatGPT: "Explain how to handle exceptions in Apache Airflow effectively. What are the best practices?"
CodeGPT: "Provide a Python snippet demonstrating the use of on_failure_callback in Apache Airflow for exception handling."