Overview: DAG Structure and Operators
Introduction
Apache Airflow is a Python framework for building workflows that span multiple technologies, and it can be managed through both a CLI and a user-friendly web UI. An Apache Airflow Directed Acyclic Graph (DAG) is a Python program in which you define the tasks and the pipeline, that is, the order in which the tasks will be executed.
Airflow comes with many built-in operators, including:
BranchDateTimeOperator - Branch into one of two execution paths depending on whether the current time falls within a specified range.
LatestOnlyOperator - Skip tasks that are not running during the most recent schedule interval.
Besides these, there are also many community-provided operators; a usage sketch follows this list. Some of the popular and useful ones are:
HttpOperator
MySqlOperator
PostgresOperator
MsSqlOperator
OracleOperator
JdbcOperator
DockerOperator
HiveOperator
S3FileTransformOperator
PrestoToMySqlOperator
SlackAPIOperator
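Once its provider package is installed, a community operator is used like any built-in one. A minimal sketch with the HTTP provider (the connection ID and endpoint here are hypothetical, and in older provider releases the class is named SimpleHttpOperator):

from airflow.providers.http.operators.http import HttpOperator

fetch_data = HttpOperator(
    task_id='fetch_data',
    http_conn_id='my_api',      # hypothetical connection configured in Airflow
    endpoint='api/v1/records',  # hypothetical endpoint
    method='GET',
    dag=dag,                    # assumes a DAG object like the one defined later in this reading
)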
In addition to operators, you also have sensors, which wait for a condition to be met before downstream tasks run, and decorators, which let you write tasks as plain Python functions and combine Bash and Python in one DAG.
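As a rough sketch of both features, the FileSensor below blocks downstream tasks until a file appears, and the @task decorator turns an ordinary Python function into a task (the file path and function are hypothetical):

from airflow.decorators import task
from airflow.sensors.filesystem import FileSensor

# wait for an input file, polling every 60 seconds
wait_for_input = FileSensor(
    task_id='wait_for_input',
    filepath='/tmp/input.csv',
    poke_interval=60,
    dag=dag,
)

@task
def summarize():
    # any plain Python here becomes an Airflow task
    print('summarizing the input file')

Calling summarize() inside a DAG context registers it as a task in that DAG.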
Anatomy of a DAG
A DAG consists of these logical blocks; a sketch of the Imports block follows the list:
Imports
DAG Arguments
DAG Definition
Task Definitions
Task Pipeline
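As a sketch of the first block, the imports for the DAG assembled in the rest of this reading would look roughly like this, assuming Airflow 2.x (days_ago is deprecated in newer releases):

from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.utils.dates import days_ago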
DAG Arguments
# DAG default arguments, applied to every task in the DAG
default_args = {
    'start_date': days_ago(0),
    'email': ['[email protected]'],
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
DAG arguments are like the initial settings for the DAG. The above settings specify:
the date the DAG starts running: days_ago(0), that is, today;
the email address where alerts are sent on failure or retry;
the number of times a failed task is retried: 1;
the delay between retries: 5 minutes.
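These defaults apply to every task in the DAG, and any individual task can override them at definition time. A minimal sketch (the task itself is hypothetical) in which one task retries three times instead of once:

critical_step = BashOperator(
    task_id='critical_step',
    bash_command='<some bash command>',
    retries=3,  # overrides the default of 1 for this task only
    dag=dag,    # the dag object is created in the next block
)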
DAG Definition
# define the DAG
dag = DAG(
    'unique_id_for_DAG',
    default_args=default_args,
    description='<description of the DAG>',
    schedule_interval=timedelta(days=1),
)
Here we are creating a variable named dag by instantiating the DAG class with the following parameters:
unique_id_for_DAG is the ID of the DAG. This is what you see on the web console, and it is what you can use to trigger the DAG with a TriggerDagRunOperator.
default_args is the dictionary in which all the default arguments are defined.
description helps us understand what this DAG does.
schedule_interval tells us how frequently this DAG runs: in this case, every day ( timedelta(days=1) ).
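For instance, a task in another DAG could start this one by referencing its ID. A minimal sketch with the built-in TriggerDagRunOperator (the surrounding DAG, some_other_dag, is hypothetical):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger = TriggerDagRunOperator(
    task_id='trigger_downstream',
    trigger_dag_id='unique_id_for_DAG',  # the ID of the DAG defined above
    dag=some_other_dag,
)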
Task Definitions
# define the tasks
task1 = BashOperator(
    task_id='task1',
    bash_command='<some bashcommand>',
    dag=dag,
)

task2 = BashOperator(
    task_id='task2',
    bash_command='<another bash command>',
    dag=dag,
)

task3 = EmailOperator(
    task_id='task3',
    to='[email protected]',
    subject='<email subject>',
    html_content='<email body>',
    dag=dag,
)
Each task is defined using a unique task_id, the parameters specific to its operator (such as bash_command for a BashOperator, or to, subject, and html_content for an EmailOperator), and the dag the task belongs to.
Task Pipeline
# task pipeline
task1 >> task2 >> task3
You can also use the set_upstream and set_downstream methods to define the pipeline. For example:
task1.set_downstream(task2)
task3.set_upstream(task2)
The task pipeline organizes the order in which the tasks run. In this example, task1 must run first, followed by task2, and then task3; both snippets above express the same ordering.
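Pipelines need not be linear: passing a list of tasks creates parallel branches. A minimal sketch, assuming a fourth task task4 has been defined:

# task2 and task3 run in parallel once task1 finishes,
# and task4 starts only after both have completed
task1 >> [task2, task3]
[task2, task3] >> task4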