Apache Airflow - A Python Hands-On Guide
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Initialize DAG
with DAG(
    dag_id='example_dag',
    default_args=default_args,  # shared task defaults (owner, retries, ...)
    description='A simple example DAG',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    def print_hello():
        print("Hello, Airflow!")

    task = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello,
    )
Key Notes:
DAG : The container for your workflow.
PythonOperator : Executes Python functions.
schedule_interval : Defines how often the DAG runs (e.g., @daily , @hourly , or a cron expression).
catchup : When False, prevents Airflow from backfilling runs for past schedule intervals.
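When a DAG contains more than one task, execution order is declared with the >> / << bit-shift operators. A minimal sketch with three hypothetical tasks (dag_id and task names are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='dependency_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    def extract():
        print("extract")

    def transform():
        print("transform")

    def load():
        print("load")

    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Bit-shift operators declare execution order: extract -> transform -> load
    extract_task >> transform_task >> load_task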
PythonOperator
Executes Python callables.
PythonOperator(
    task_id='process_data',
    python_callable=process_data_function,
    op_kwargs={'param': 'value'},  # Pass arguments to the callable
)
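The callable referenced above is not shown; a minimal sketch of what process_data_function could look like (the param argument matches the op_kwargs key):

def process_data_function(param):
    # 'param' receives 'value' from op_kwargs at runtime
    print(f"Processing data with param={param}")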
BranchPythonOperator
Allows branching based on a condition.
def choose_branch(**kwargs):
    # Return the task_id of the branch that should run next;
    # 'some_condition' is supplied below via op_kwargs
    return 'branch_1' if kwargs['some_condition'] else 'branch_2'

branch_task = BranchPythonOperator(
    task_id='branching',
    python_callable=choose_branch,
    op_kwargs={'some_condition': True},  # example value; compute or template as needed
    provide_context=True,  # only needed on Airflow 1.10; ignored by Airflow 2+
)
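For the branch to resolve, tasks named branch_1 and branch_2 must exist downstream. A minimal sketch wiring them with EmptyOperator placeholders (EmptyOperator ships with Airflow 2.3+; older releases use DummyOperator):

from airflow.operators.empty import EmptyOperator

branch_1 = EmptyOperator(task_id='branch_1')
branch_2 = EmptyOperator(task_id='branch_2')

# Only the task whose task_id choose_branch returns will run; the other is skipped.
branch_task >> [branch_1, branch_2]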
PythonVirtualenvOperator
Executes Python code within a virtual environment.
def hello_from_venv():
    print("Running in a virtualenv!")

virtualenv_task = PythonVirtualenvOperator(
    task_id='venv_task',
    python_callable=hello_from_venv,  # must be a regular def, not a lambda
    requirements=["numpy", "pandas"],
    system_site_packages=False,
)
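Because the callable runs in a separate interpreter, any imports must happen inside the function body. A sketch (analyze_in_venv is an illustrative name) that uses one of the installed requirements:

def analyze_in_venv():
    # Imported here because the function is executed in the new virtualenv,
    # not in the worker's own environment.
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})
    print(df.describe())

analysis_task = PythonVirtualenvOperator(
    task_id='venv_analysis',
    python_callable=analyze_in_venv,
    requirements=["pandas"],
    system_site_packages=False,
)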
1. AWS Provider
Install: pip install apache-airflow-providers-amazon
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

upload_task = S3CreateObjectOperator(
    task_id='upload_to_s3',
    aws_conn_id='my_aws_conn',
    s3_bucket='my_bucket',
    s3_key='path/to/file.txt',
    data="Sample Data",
)
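To read the object back later, the same provider ships an S3Hook that can be called from a plain PythonOperator. A minimal sketch reusing the bucket, key, and connection id from above (read_task and read_from_s3 are illustrative names):

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def read_from_s3():
    hook = S3Hook(aws_conn_id='my_aws_conn')
    print(hook.read_key(key='path/to/file.txt', bucket_name='my_bucket'))

read_task = PythonOperator(
    task_id='read_from_s3',
    python_callable=read_from_s3,
)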
2. Google Cloud Provider
Install: pip install apache-airflow-providers-google
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

bq_task = BigQueryExecuteQueryOperator(
    task_id='bq_query',
    sql='SELECT * FROM my_dataset.my_table',
    gcp_conn_id='my_gcp_conn',
    use_legacy_sql=False,
)
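BigQueryExecuteQueryOperator is deprecated in recent Google provider releases in favor of BigQueryInsertJobOperator. A sketch of the equivalent query job:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_job_task = BigQueryInsertJobOperator(
    task_id='bq_insert_job',
    gcp_conn_id='my_gcp_conn',
    configuration={
        "query": {
            "query": "SELECT * FROM my_dataset.my_table",
            "useLegacySql": False,
        }
    },
)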
3. PostgreSQL Provider
Install: pip install apache-airflow-providers-postgres
from airflow.providers.postgres.operators.postgres import PostgresOperator

sql_task = PostgresOperator(
    task_id='run_postgres_query',
    postgres_conn_id='my_postgres_conn',
    sql='SELECT * FROM my_table;',
)
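Query values can be bound safely with the parameters argument rather than string formatting. A short sketch (filtered_task and the id value are illustrative):

filtered_task = PostgresOperator(
    task_id='run_filtered_query',
    postgres_conn_id='my_postgres_conn',
    sql='SELECT * FROM my_table WHERE id = %(id)s;',
    parameters={'id': 42},  # bound server-side, avoids manual quoting
)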
4. Slack Provider
Install: pip install apache-airflow-providers-slack
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

slack_task = SlackWebhookOperator(
    task_id='send_slack_message',
    http_conn_id='slack_conn',  # newer provider versions use slack_webhook_conn_id
    message="Workflow completed successfully!",
    channel="#alerts",
)
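A common variation is to use the Slack task purely as a failure alert, firing only when something upstream fails. A sketch using the standard trigger_rule argument and the same connection (failure_alert is an illustrative name):

from airflow.utils.trigger_rule import TriggerRule

failure_alert = SlackWebhookOperator(
    task_id='notify_on_failure',
    http_conn_id='slack_conn',
    message="A task in the workflow failed!",
    channel="#alerts",
    trigger_rule=TriggerRule.ONE_FAILED,  # run only if at least one upstream task failed
)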
5. MySQL Provider
Install: pip install apache-airflow-providers-mysql
from airflow.providers.mysql.operators.mysql import MySqlOperator

mysql_task = MySqlOperator(
    task_id='mysql_query',
    mysql_conn_id='my_mysql_conn',
    sql='INSERT INTO my_table (id, value) VALUES (1, "test");',
)
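Literal values can also be bound through the parameters argument instead of being embedded in the SQL string. A short sketch (param_insert and the inserted values are illustrative):

param_insert = MySqlOperator(
    task_id='mysql_param_insert',
    mysql_conn_id='my_mysql_conn',
    sql='INSERT INTO my_table (id, value) VALUES (%s, %s);',
    parameters=(2, 'parameterized'),  # bound by the driver, avoids quoting issues
)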
Quick Reference
Branching: BranchPythonOperator(task_id='branch', python_callable=my_func)
S3 Upload: S3CreateObjectOperator(..., s3_key='path/to/file.txt')
Hands-On Example: Conditional Spark Jobs
The goal is a DAG that will:
1. Execute spark_job_1 .
2. Execute spark_job_2 .
3. If spark_job_2 fails, run spark_job_3 .
4. If spark_job_2 succeeds, run spark_job_4 .
DAG Implementation
Step 1: Import Required Modules
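A minimal import block covering everything used in this example (module paths assume Airflow 2.x with apache-airflow-providers-apache-spark installed; EmptyOperator is available from Airflow 2.3, older releases use DummyOperator):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator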
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'retries': 1,
}

dag = DAG(
    dag_id='spark_conditional_jobs',
    default_args=default_args,
    description='A DAG with conditional Spark job execution',
    schedule_interval=None,
    start_date=datetime(2023, 12, 1),
    catchup=False,
)
# Spark Job 1
spark_job_1 = SparkSubmitOperator(
    task_id='spark_job_1',
    application='/path/to/spark_job_1.py',
    conn_id='spark_default',  # Connection to your Spark cluster
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 2
spark_job_2 = SparkSubmitOperator(
    task_id='spark_job_2',
    application='/path/to/spark_job_2.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 3
spark_job_3 = SparkSubmitOperator(
    task_id='spark_job_3',
    application='/path/to/spark_job_3.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 4
spark_job_4 = SparkSubmitOperator(
    task_id='spark_job_4',
    application='/path/to/spark_job_4.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)
def choose_next_task(**kwargs):
    # Check the state of spark_job_2 in the current DAG run
    dag_run = kwargs['dag_run']
    spark_job_2_state = dag_run.get_task_instance('spark_job_2').state
    # Return the task_id of the branch to follow
    return 'spark_job_3' if spark_job_2_state == 'failed' else 'spark_job_4'

branch_task = BranchPythonOperator(
    task_id='branch_task',
    python_callable=choose_next_task,
    provide_context=True,  # only needed on Airflow 1.10; ignored by Airflow 2+
    trigger_rule='all_done',  # run even if spark_job_2 failed
    dag=dag,
)
# Entry-point task for the workflow
start = EmptyOperator(task_id='start', dag=dag)

# Define dependencies
start >> spark_job_1
spark_job_1 >> spark_job_2
spark_job_2 >> branch_task
branch_task >> [spark_job_3, spark_job_4]  # the branch decides which job runs next
Explanation of Components
1. SparkSubmitOperator :
Used to submit Spark jobs to a cluster.
Specify the application path, connection ID, and arguments for the Spark job.
2. BranchPythonOperator :
Dynamically determines the next task based on the state of a previous task.
In this case, it checks if spark_job_2 succeeded or failed.
3. Dependencies:
The workflow ensures sequential execution from spark_job_1 to
spark_job_2 and conditional branching to spark_job_3 or spark_job_4 .