Airflow
• Scheduler:
– This is the most important part of Airflow: it orchestrates the various DAGs and their tasks, takes care of
their interdependencies, and limits the number of active runs of each DAG so that no single DAG overwhelms the
entire system, making it easy for users to schedule and run DAGs on Airflow.
• Executor:
– While the Scheduler orchestrates the tasks, the executors are the components that actually execute tasks.
– Airflow ships with several executor types, such as SequentialExecutor, LocalExecutor, CeleryExecutor, and
KubernetesExecutor; the executor to use is selected in the Airflow configuration (see the sketch after this list).
• Metadata Database:
– Airflow uses SQLAlchemy, an Object Relational Mapping (ORM) library written in Python, to connect to the
metadata database. This means that any database supported by SQLAlchemy can be used to store all the
Airflow metadata; by default it uses a SQLite database.
– This database stores metadata about DAGs, their runs, and other Airflow configurations like users, roles,
and connections.
– The Web Server shows the DAGs’ states and their runs from the database. The Scheduler also updates this
information in the metadata database.
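As a rough illustration of how the executor and the metadata database are wired together, the snippet below sketches the relevant entries of an airflow.cfg file. The connection string, database name, and credentials are placeholders, and the exact section names ([core] vs. [database]) differ between Airflow versions, so treat this as an assumption-laden sketch rather than a definitive configuration.

  [core]
  # Which executor runs the tasks (SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor, ...)
  executor = LocalExecutor

  [database]
  # SQLAlchemy connection string for the metadata database; SQLite is the default,
  # but any SQLAlchemy-supported database (e.g. PostgreSQL) can be used instead.
  sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow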
Basic Airflow concepts
• Task: a defined unit of work (these are called operators in Airflow)
• Task instance: an individual run of a single task. Task instances also have
an indicative state, which could be “running”, “success”, “failed”,
“skipped”, “up for retry”, etc.
• DAG: Directed acyclic graph, a set of tasks (operators) with explicit
execution order, beginning, and end
• DAG run: individual execution/run of a DAG
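To make these concepts concrete, here is a minimal DAG sketch in Python (written against the Airflow 2.x API; the dag_id, task_ids, and bash commands are made up for illustration). Each BashOperator instance is a task, the >> operator defines the explicit execution order, and every scheduled or manually triggered execution of this DAG is a DAG run whose task instances move through states such as “running”, “success”, or “failed”.

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with DAG(
      dag_id="example_etl",               # hypothetical DAG name
      start_date=datetime(2024, 1, 1),
      schedule_interval="@daily",         # the Scheduler creates one DAG run per day
      max_active_runs=1,                  # cap concurrent runs so this DAG cannot overwhelm the system
      catchup=False,
  ) as dag:
      # Tasks: defined units of work, expressed as operators.
      extract = BashOperator(task_id="extract", bash_command="echo extracting")
      load = BashOperator(task_id="load", bash_command="echo loading")

      # Explicit execution order: extract must finish before load starts.
      extract >> load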