Apache Airflow Cookbook 2
COOKBOOK
M K PAVAN KUMAR
FIRST EDITION
Dedicated to my Dad
and
Preface
As you flip through these pages, you'll explore different aspects of Airflow's
capabilities. We'll start with the basics of Airflow's architecture and then dive into
the practical use of Airflow Operators, TaskFlow API, and the nitty-gritty of
configuration.
Whether you're just starting or you're an Airflow pro, this cookbook has
something for you. Each recipe is designed to give you the hands-on knowledge
and tips you need to become an Airflow expert and succeed in managing data
workflows.
So, get ready, let's dive into the world of Apache Airflow, where data flows
smoothly, work gets done, and automation becomes a breeze. Enjoy your
journey!
FOREWORD
Sashank Pappu
CEO, Antz.ai
Special Thanks to my family, friends and seniors who guided me in
this beautiful art of writing the book
TABLE OF CONTENTS
A brief introduction to Apache Airflow
Airflow Architecture
Airflow Operators
TaskFlow API
Catchup and Backfill
Scheduling, Connections and Hooks
Xcom and Sensors
Airflow Configuration Deep Dive
Airflow UI
Final Chapter: The Conclusion
CHAPTER-1
A BRIEF INTRODUCTION TO APACHE AIRFLOW
At its core, Apache Airflow is a workflow automation tool that empowers users to
define, schedule, and monitor workflows composed of a series of interrelated
tasks. What distinguishes Airflow is its focus on orchestrating intricate workflows
while offering a user-friendly interface for designing, executing, and managing
these tasks. Whether you need to perform Extract, Transform, and Load (ETL)
operations on data, automate data processing, or manage a complex sequence
of tasks, Apache Airflow can serve as your trusted companion.
As we progress through this textbook, you will gain the knowledge required to
monitor, troubleshoot, and optimize your workflows, ensuring the efficient and
reliable execution of your data pipelines. We also cover advanced topics such as
backfill, catchup, and extending Airflow's functionality through custom operators
and sensors.
By the end of this educational journey, you will be well-prepared to harness the
full potential of Apache Airflow, enabling you to streamline your data processes,
minimize manual intervention, and make data-driven decisions with confidence.
So, let's immerse ourselves in the world of Apache Airflow and uncover how it
can revolutionize the management of data and automation within your
organization.
Apache Airflow requires a designated home directory, which is, by default, set to
~/airflow. However, users have the flexibility to specify a different location if they
prefer. This is accomplished by setting the AIRFLOW_HOME environment
variable. It's important to note that this step should be completed before the
actual installation process begins. Configuring the environment variable informs
Apache Airflow of the desired location for storing essential files. For example, if
you want to set your AIRFLOW_HOME to a different location, you can do so as
follows:
Shell code/Terminal
export AIRFLOW_HOME=~/my_custom_airflow_location
• Specify the Airflow version you intend to install. In this case, we'll use
version 2.7.2:
Shell code/Terminal
AIRFLOW_VERSION=2.7.2
• Determine the Python version you have installed. If the installed Python
version is not supported by Airflow, you may need to set it manually. The
following command extracts the Python version:
Shell code/Terminal
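# Reconstructed along the lines of the official Airflow installation guide; adjust if your setup differs:
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"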
• Define the URL for the constraint file specific to your Airflow and Python
versions. This URL is used to guide the installation process:
Shell code/Terminal
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
• Finally, you can use pip to install Apache Airflow, ensuring that it adheres
to the specified constraints. Here's how you can do it:
Shell code/Terminal
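# Typical command for the constraint-based installation approach (reconstructed for completeness):
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"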
This installation method guarantees that you have a compatible and stable
configuration for your specific Airflow and Python versions.
The "airflow standalone" command is a pivotal step that sets up your Apache
Airflow environment. When executed, it performs the following actions:
• Initializes the database: The command sets up Airflow's metadata database (SQLite by default), in which DAG and task state is recorded.
• Creates a user: The command also establishes an initial user account, typically referred to as the admin account. This user account is essential for accessing the Airflow UI and managing the system.
• Starts all components: Apache Airflow is composed of various
components, such as the scheduler, web server, and worker processes.
The "airflow standalone" command ensures that all of these components
are up and running, ready to execute your workflows and tasks.
Shell code/Terminal
airflow standalone
After successfully setting up Apache Airflow, you can access the Airflow User Interface (UI) through a web browser. The UI provides a user-friendly environment for managing your workflows, tasks, and scheduling. Here's how to access it: open your browser, navigate to http://localhost:8080 (the web server's default port), and log in with the admin credentials printed to the console by the standalone command.
Upon executing the provided commands, Apache Airflow will create the
designated $AIRFLOW_HOME folder and generate the "airflow.cfg"
configuration file. This file includes default settings to expedite your setup. You
also have the option to customize these default configurations using
environment variables, as outlined in the Configuration Reference. Additionally,
you can review and adjust the "airflow.cfg" file either in
$AIRFLOW_HOME/airflow.cfg or through the UI by navigating to Admin -> Configuration.
It's important to note that, out of the box, Apache Airflow uses a SQLite
database, which is a lightweight database system. However, it may become
limiting as your requirements grow, particularly because it supports only
sequential task execution. For more complex scenarios that require
parallelization, you may need to explore other database backends and executors
provided by Apache Airflow. Nevertheless, starting with the default configuration
allows you to quickly explore the UI and command line utilities to gain a better
understanding of the platform's capabilities.
CHAPTER-2
AIRFLOW ARCHITECTURE
In the context of Apache Airflow, an Airflow setup typically encompasses several pivotal components, each playing a specific role in the orchestration and execution of data workflows:
1. Scheduler: The Scheduler monitors all DAGs and their tasks, triggering task instances once their dependencies are satisfied.
2. Executor: The Executor determines how and where task instances actually run, handing work to local processes or to dedicated worker nodes.
3. Webserver: The Webserver hosts the Airflow UI, through which you can inspect, trigger, and debug DAGs and tasks.
4. DAG Files: Directed Acyclic Graph (DAG) Files are foundational building
blocks of an Airflow setup. These files are organized within a designated folder
and contain the instructions and structure of the workflows. The Scheduler,
Executor, and worker nodes, when in use, rely on these DAG Files to
comprehend the arrangement of tasks and dependencies within the workflows.
They serve as the blueprints for task execution.
As you delve deeper into the world of Airflow, a
comprehensive knowledge of these components will empower you to harness
the full potential of this powerful platform for workflow automation.
Within the framework of Apache Airflow, Directed Acyclic Graphs (DAGs) serve
as the structural backbone of workflow design and orchestration. These DAGs
progress through a sequence of tasks, which can be conveniently grouped into
three prominent categories:
• Operators: predefined task templates that encapsulate a single unit of work, such as running a Bash command, calling a Python function, or issuing an HTTP request.
• Sensors: a special subclass of Operators geared toward monitoring and responding to external conditions or events. They stand vigilant, waiting for specific external occurrences to transpire before allowing the DAG's execution to proceed. Sensors provide a vital mechanism for ensuring that workflows remain responsive and adaptive to the ever-changing dynamics of the environment.
• TaskFlow-decorated tasks: custom Python functions packaged as tasks using the @task decorator, covered in the TaskFlow API chapter.
While the terms "Task" and "Operator" may seem interchangeable in the
underlying implementation, it proves beneficial to maintain a conceptual
distinction between them. Operators and Sensors may be perceived as
templates, and when integrated into a DAG file, they effectively create a Task,
specifying a particular operation or action within the workflow.
The Task Lifecycle
• None: This denotes that the task has not yet been queued for execution because its dependencies have not been met.
• Scheduled: The scheduler has ascertained that the task's dependencies
are met, rendering it ready for execution.
• Queued: The task has been allocated to an Executor and awaits
assignment to a worker for execution.
• Running: The task is actively in the process of execution, either on a
dedicated worker or within a local/synchronous executor.
• Success: The task has successfully completed its execution without
encountering any errors.
• Shutdown: External requests have induced the task to terminate while it
was in an active running state.
• Restarting: External requests have triggered the task to restart its
execution while it was in progress.
• Failed: The task has encountered an error during its execution, resulting
in an unsuccessful run.
• Skipped: This denotes that the task has been bypassed, typically due to
branching conditions, the utilization of the LatestOnly operator, or similar
factors.
• Upstream Failed: An upstream task has failed, and the task's Trigger Rule required that upstream task to succeed.
• Up For Retry: The task has failed but retains available retry attempts,
prompting it to be rescheduled for another execution attempt.
• Up For Reschedule: In this instance, the task serves as a Sensor
actively engaged in reschedule mode, signifying its continuous monitoring
for a specific event.
• Deferred: The task has been deferred as part of a trigger operation,
awaiting a subsequent execution opportunity.
• Removed: The task has been excluded from the current execution of the
DAG, signifying its detachment from the ongoing workflow.
CREATING OUR FIRST DAG:
Important imports
# Operators: we will check these in our upcoming chapters; they are the initial building blocks of the workflow mechanism
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'your_name',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG(
    "tutorial",
    default_args=default_args,
    schedule=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:
    task1 = BashOperator(
        task_id='task1',
        bash_command='echo "running task1"',  # placeholder command
    )
    task2 = BashOperator(
        task_id='task2',
        bash_command='echo "running task2"',  # placeholder command
    )
    task3 = BashOperator(
        task_id='task3',
        bash_command='echo "running task3"',  # placeholder command
    )

    # an assumed ordering for illustration: task1 runs first, then task2 and task3 in parallel
    task1 >> [task2, task3]
QUICK RECAP:
CHAPTER-3
AIRFLOW OPERATORS
In Apache Airflow, the concept of Operators holds a fundamental role, serving as the elemental building blocks that represent individual, atomic tasks within a workflow. Some commonly used Operators include:
• BashOperator: Executes a Bash command or script.
• PythonOperator: Calls an arbitrary Python function.
• DummyOperator: A non-operational Operator employed for specifying
dependencies within a DAG.
• SubDagOperator: Executes a sub-DAG as a singular task within the
parent DAG.
• SimpleHttpOperator: Facilitates the transmission of HTTP requests to
engage with web services, APIs, or other HTTP endpoints (Note: This
involves the use of the airflow connections concept).
BASHOPERATOR
Import Statements: Import necessary modules and classes from the Airflow
library. These include the DAG class, BashOperator, and timedelta for working
with time intervals.
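A sketch of what these imports typically look like (reconstructed, since the original listing is not shown; days_ago is included because the start date below is expressed as "2 days ago"):
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago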
• start_date: Defines the start date for the DAG. In this case, it's set to
2 days ago.
• email: An email address to send notifications to.
• retries: Number of times to retry a task if it fails.
• retry_delay: The delay between retries.
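A default_args dictionary matching these bullets might look like this (the email address and retry delay value are illustrative placeholders):
default_args = {
    'start_date': days_ago(2),
    'email': ['your_email@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}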
DAG Initialization: Create an instance of the DAG class with various properties:
Task Definitions:
One of the tasks uses Jinja templating to print date-related values and a parameter passed as an argument.
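A sketch of how the DAG and its tasks might be wired together, along the lines of the classic Airflow tutorial (the DAG id, commands, and templated string are illustrative, not the book's original listing):
with DAG(
    'bash_operator_demo',
    default_args=default_args,
    description='A simple BashOperator example',
    schedule=timedelta(days=1),
    catchup=False,
) as dag:
    task1 = BashOperator(
        task_id='task1',
        bash_command='date',  # prints the current date
    )

    # Jinja-templated command: prints the logical date, the date a week later,
    # and a parameter passed in via `params`
    templated_command = (
        "{% for i in range(3) %}"
        "echo '{{ ds }}' && echo '{{ macros.ds_add(ds, 7) }}' && echo '{{ params.my_param }}';"
        "{% endfor %}"
    )

    task2 = BashOperator(
        task_id='templated_task',
        bash_command=templated_command,
        params={'my_param': 'Parameter I passed in'},
    )

    task1 >> task2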
Documentation: Additional documentation is provided for ‘task1’ and the DAG
itself using the doc_md attribute. This documentation can include Markdown
content and is displayed in the Airflow web interface.
Note: for now, just remember that XCom is used for sharing data between operators or tasks; later we have a dedicated chapter in which we will discuss XCom and sensors in more detail.
Import Statements:
Import necessary modules and classes from the Airflow library and other Python
libraries.
Variable Initialization: Define several variables to store information needed for
the DAG’s tasks. These variables include the city name, limit for API requests,
and API key for accessing the OpenWeatherMap API.
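A sketch of what these imports and variables might look like (the city, limit, and API key values are illustrative placeholders):
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator

CITY = 'Hyderabad,Telangana'
LIMIT = 1
API_KEY = '<your_openweathermap_api_key>'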
DAG Initialization: Create an instance of the DAG class with various properties:
• default_args: Assign the previously defined args dictionary as default
arguments.
Task Definitions:
task_fetch_geocodes = SimpleHttpOperator(
    task_id='fetch_geo_codes',
    http_conn_id='open_weather_geo_codes',
    endpoint=f'direct?q=Hyderabad,Telangana&limit=1&appid={API_KEY}',
    method='GET',
    response_filter=lambda response: json.loads(response.content),
    log_response=True
)

task_fetch_weather = SimpleHttpOperator(
    task_id='fetch_weather',
    http_conn_id='open_weather_data',
    endpoint='weather?lat={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lat") }}'
             '&lon={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lon") }}'
             '&appid=f38c94b357eae713857038d2f1a912cc',
    method='GET',
    response_filter=lambda response: json.loads(response.content),
    log_response=True
)

task_convert_to_csv = PythonOperator(
    task_id='convert_to_csv',
    python_callable=create_csv_file
)

# execution order as described below: geocodes first, then weather, then CSV conversion
task_fetch_geocodes >> task_fetch_weather >> task_convert_to_csv
You can check individual tasks by executing the command below, without scheduling the entire DAG.
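For example (the logical date shown is illustrative):
Shell code/Terminal
airflow tasks test weather_api_dag fetch_geo_codes 2023-10-14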
import csv
def create_csv_file(ti):
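    # A minimal sketch of the callable's body (assumed; the original body is not shown).
    # Pull the weather JSON pushed to XCom by the 'fetch_weather' task; OpenWeatherMap
    # returns the main readings (temp, pressure, humidity, ...) under the 'main' key.
    data = ti.xcom_pull(task_ids=['fetch_weather'])[0]
    # The output path is illustrative.
    with open('weather_data.csv', mode='w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(list(data['main'].keys()))
        writer.writerow(list(data['main'].values()))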
This DAG fetches geolocation data, weather data, and converts it to CSV format.
It demonstrates the use of the SimpleHttpOperator for making HTTP requests
and the PythonOperator for executing custom Python code within an Airflow
DAG. The tasks are orchestrated to execute in the specified order, with
'fetch_weather' dependent on the output of 'fetch_geo_codes'.
The data between the tasks is communicated using XCom, and it can be seen visually in the Airflow UI as below.
QUICK RECAP:
CHAPTER-4
TASKFLOW API
The Apache Airflow TaskFlow API stands as a ground-breaking advancement within the realm of workflow automation and data orchestration. Its key benefits include:
1. Reduced Repetitive Code: The TaskFlow API minimizes the need for
redundant code, enhancing the efficiency of workflow development.
2. Efficient Data Transfer: Streamlined data transfer between Directed
Acyclic Graphs (DAGs) is facilitated by the TaskFlow API, simplifying the
movement of information within the workflow.
3. Dependency Chain Specification: Unnecessary code for specifying
dependency chains is eliminated, leading to a more streamlined and
comprehensible workflow structure.
4. Simplified Task Management: Task creation and initialization are
simplified with the TaskFlow API, offering a more straightforward and
intuitive process.
Import necessary libraries:
• The @dag decorator is used to define the DAG with the following
parameters:
• dag_id: This specifies the unique identifier for the DAG as
"housing_etl_workflow".
• schedule: The DAG is set to run on demand (schedule=None).
• start_date: The DAG is set to start on October 7, 2023.
• catchup: Catchup is disabled, meaning that the DAG won't run backfill
for missed schedules.
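A minimal sketch of what such a TaskFlow DAG might look like (the task bodies and names below are illustrative placeholders, not the book's original code):
from datetime import datetime

from airflow.decorators import dag, task


@dag(dag_id="housing_etl_workflow",
     schedule=None,
     start_date=datetime(2023, 10, 7),
     catchup=False)
def housing_etl_workflow():
    @task()
    def extract():
        # placeholder: read raw housing records from a source system
        return [{"price": 450000, "sqft": 1800}, {"price": 320000, "sqft": 1200}]

    @task()
    def transform(records):
        # placeholder: derive a price-per-square-foot metric
        return [{**r, "price_per_sqft": r["price"] / r["sqft"]} for r in records]

    @task()
    def load(records):
        # placeholder: persist the transformed records (here we simply log them)
        print(records)

    load(transform(extract()))


housing_etl_workflow()

Note how the TaskFlow API infers the dependency chain and the XCom hand-off from the nested function calls, instead of requiring explicit >> operators.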
Please note that this code is a basic example of how to create an ETL workflow using Airflow. Depending on your use case, you may need to customize the tasks and add more functionality to suit your specific data processing requirements. In the Airflow web console, the ETL flow looks as below.
QUICK RECAP:
CHAPTER-5
CATCHUP AND BACKFILL
Apache Airflow is a tool that orchestrates the past, present, and future facets of data pipelines. The
Catchup feature within Airflow serves as a remarkable tool, effortlessly
breathing life into historical data through a simple switch toggle. This
functionality metaphorically transports us to an era where we can
manipulate the hands of time within our workflow's calendar, ensuring the
execution of any previously missed runs and securing a thorough coverage of
data points. This temporal control is not only instrumental in preserving data
integrity but also adeptly addresses historical data processing, skilfully bridging
any temporal gaps within data timelines with grace and precision.
Using the start date and execution date, we can explain the DAG runs as below.
With the help of the above diagram, we can easily understand that a DAG run for a given interval is triggered only after that interval has ended; the execution date marks the beginning of the interval, not the moment the run actually starts.
CATCH-UP:
If you follow the code in our chapters closely, you will notice that while creating DAGs we explicitly pass a parameter called catchup=False. The reason is that, by default, Airflow will run any past scheduled intervals that have not yet been run; passing this parameter avoids that behaviour.
BACKFILL:
The Apache Airflow Backfill feature emerges as a strategic maneuver within the
landscape of workflow orchestration, presenting a distinctive capability for
retroactive task execution. Functioning akin to navigating through the chapters of
a narrative, the Backfill feature empowers users to seamlessly address historical
data processing from a specified start date, treating the data as if it unfolded in
real-time. This versatile tool becomes particularly valuable in instances of
system downtime, maintenance activities, or unforeseen circumstances, allowing
for the effortless filling of data gaps and ensuring the comprehensive and
forward-looking nature of the workflow. The Backfill feature, with its ability to
reconcile historical data seamlessly, stands as a fundamental and sophisticated
component in Apache Airflow's repertoire, enhancing the coherence and
completeness of data pipelines. If for some reason we want to re-run DAGs for certain schedule intervals manually, we can use the following CLI command to do so.
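For example (the DAG id and date range are illustrative):
Shell code/Terminal
airflow dags backfill --start-date 2023-10-01 --end-date 2023-10-07 tutorial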
QUICK RECAP:
CHAPTER-6
AIRFLOW SCHEDULING:
In the domain of Apache Airflow, the scheduler assumes a proactive role,
vigilantly overseeing tasks and Directed Acyclic Graphs (DAGs). Task instances
come to life under its watchful eye, awakening once their dependencies find
fulfillment. Functioning as a persistent presence in an Airflow production
environment, the scheduler is effortlessly set into motion with a simple
command, “airflow scheduler,” drawing configurations from the designated
“airflow.cfg” file.
Let’s have a practical example to understand the concept more clearly.
import os
import logging
import csv
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import PythonOperator

with DAG(dag_id='ecgdataprocessing',
         description='ecg data processing with schedule',
         schedule_interval='*/2 * * * *',  # "*/2 * * * *" means: at every 2nd minute
         start_date=datetime(2023, 10, 9),
         catchup=False) as dag:

    @task
    def copy_ecg_data():
        destination_path = f"/Users/pavanmantha/Pavans/PracticeExamples/apache_airflow_tutorial/data{datetime.now()}.csv"
        logging.info('Write Dir')
        os.makedirs(os.path.dirname(destination_path), exist_ok=True)
        logging.info(f'File={destination_path}')
        src_file = "/Users/pavanmantha/Pavans/PracticeExamples/DataScience_Practice/Advanced-ML/ECGCvdata.csv"
        with open(file=destination_path, mode='w') as fd2:
            writer = csv.writer(fd2)
            with open(file=src_file, mode='r') as src:
                reader = csv.reader(src)
                header = next(reader)
                writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
        logging.info(f'Data written to File={destination_path}')

    def log_info():
        logging.info('WRITING NOW')
        logging.info(datetime.utcnow().isoformat())
        logging.info('TIME WRITTEN')

    # wiring assumed for completeness: copy the data first, then log the timestamp
    log_task = PythonOperator(task_id='log_info', python_callable=log_info)
    copy_ecg_data() >> log_task
CONNECTIONS:
Airflow plays a crucial role in both retrieving and pushing data to external
systems, and it introduces a core concept called Connections to securely store
the credentials required for communication with these external systems.
You have the flexibility to manage Connections through either the user
interface (UI) or the command-line interface (CLI). To gain a deeper
understanding of creating, modifying, and handling Connections, refer to the
section on Managing Connections which provides comprehensive information.
Moreover, there are customizable options available for storing Connections and
choosing the backend that suits your requirements.
Connections can be directly utilized in your own code, harnessed via
Hooks, or incorporated within templates.
HOOKS:
Hooks are high-level interfaces to external platforms and services (HTTP APIs, PostgreSQL, MongoDB, and so on). They build on Connections to handle authentication and connection details, and many Operators use them under the hood. Consider the weather DAG from the Operators chapter again:
with DAG(
    dag_id='weather_api_dag',
    catchup=False,
    schedule_interval='@daily',
    default_args=args
) as dag:
    task_fetch_geocodes = SimpleHttpOperator(
        task_id='fetch_geo_codes',
        http_conn_id='open_weather_geo_codes',
        endpoint=f'direct?q=Hyderabad,Telangana&limit=1&appid={API_KEY}',
        method='GET',
        response_filter=lambda response: json.loads(response.content),
        log_response=True
    )

    task_fetch_weather = SimpleHttpOperator(
        task_id='fetch_weather',
        http_conn_id='open_weather_data',
        endpoint='weather?lat={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lat") }}'
                 '&lon={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lon") }}'
                 '&appid=f38c94b357eae713857038d2f1a912cc',
        method='GET',
        response_filter=lambda response: json.loads(response.content),
        log_response=True
    )

    task_convert_to_csv = PythonOperator(
        task_id='convert_to_csv',
        python_callable=create_csv_file
    )
Here we are simply using the connections created in the Admin section of the Airflow UI.
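If we wanted to call the same endpoint ourselves rather than through SimpleHttpOperator, a Hook could be used directly inside a Python callable. A minimal sketch (the function name is hypothetical, and it assumes the same 'open_weather_geo_codes' connection and API_KEY variable):
from airflow.providers.http.hooks.http import HttpHook

def fetch_geo_codes_with_hook():
    # the HttpHook reads the host and credentials from the stored connection
    hook = HttpHook(method='GET', http_conn_id='open_weather_geo_codes')
    response = hook.run(endpoint=f'direct?q=Hyderabad,Telangana&limit=1&appid={API_KEY}')
    return response.json()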
QUICK RECAP:
CHAPTER-7
XCOM AND SENSORS
Here’s why XCOMs are used in Apache Airflow:
Data Sharing: Airflow tasks often need to share information or data between
them. For example, one task may generate some data, and another task may
need to use that data as input. XCOMs provide a mechanism for these tasks to
share data efficiently.
Logging and Monitoring: XCOMs are also useful for logging and monitoring.
You can log and capture the values of XCOMs, which can be valuable for
debugging and tracking the progress of tasks in a workflow.
While XCOMs can be powerful, it’s essential to use them judiciously, as passing
too much data between tasks can lead to performance issues and make your
DAGs harder to manage. They are best suited for sharing small pieces of
metadata or data necessary for task coordination within your workflow.
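As a quick illustration of the mechanism (hypothetical callables wired to PythonOperator tasks; the task id and key are made up):
def producer(ti):
    # push a small piece of metadata for downstream tasks
    ti.xcom_push(key='row_count', value=77)

def consumer(ti):
    # pull the value pushed by the task with task_id 'producer_task'
    row_count = ti.xcom_pull(task_ids='producer_task', key='row_count')
    print(f'received row_count={row_count}')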
Let us code the full implementation of the above scenario. We will combine most of the topics that we have learnt so far, such as hooks, connections, the TaskFlow API, and operators, to implement it.
import json
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import PythonOperator
from airflow.providers.mongo.hooks.mongo import MongoHook
from airflow.providers.postgres.operators.postgres import PostgresOperator


def transform(ti):
    # illustrative transform callable (the original body is not shown): pull the rows
    # pushed to XCom by the PostgresOperator and convert them into a JSON string
    rows = ti.xcom_pull(task_ids='fetch_data_from_northwind')
    columns = ['product_id', 'product_name', 'quantity_per_unit', 'unit_price',
               'units_in_stock', 'units_on_order', 'category_id', 'category_name',
               'description']
    return json.dumps([dict(zip(columns, row)) for row in rows])


@dag(dag_id="products_etl_workflow",
     schedule=None,
     start_date=datetime(2023, 10, 14), catchup=False)
def products_etl_workflow():
    @task()
    def load_to_mongo(records):
        hook = MongoHook(conn_id="mongo_local")
        conn = hook.get_conn()
        db = conn.northwind
        collection = db.products
        print(records)
        records = json.loads(records)
        for record in records:
            collection.insert_one(record)

    extract = PostgresOperator(
        task_id="fetch_data_from_northwind",
        postgres_conn_id="postgresql-local-northwind",
        sql="select p.product_id,"
            "p.product_name,"
            "p.quantity_per_unit,"
            "p.unit_price,"
            "p.units_in_stock,"
            "p.units_on_order,"
            "c.category_id,"
            "c.category_name,"
            "c.description "
            "from products p, categories c "
            "where p.category_id = c.category_id",
        do_xcom_push=True
    )

    data_transform = PythonOperator(
        task_id='convert_to_json',
        python_callable=transform
    )

    # wiring assumed from the description below: extract, then transform, then load
    extract >> data_transform
    load_to_mongo(data_transform.output)


products_etl_workflow()
The above DAG contacts PostgreSQL using the PostgresOperator to fetch the data by executing the SQL. The results from that task are pushed to XCom and are available to our data_transform task; the transformed result is then passed to a TaskFlow-API-enabled method, which eventually pushes the data to MongoDB. The connections created from the Airflow UI for MongoDB and PostgreSQL respectively look as below.
Once the DAG is run, and if you face no issues, your run should look something like the one below, where all the tasks succeed.
Once the ETL is completed, this is what you can observe. Just see the similarity: we have a total of 77 records in both databases after running the query, so all the data has been transformed successfully and ingested into the destination database.
AIRFLOW SENSORS:
Airflow sensors are a crucial component of Apache Airflow. They are used to pause the execution of a workflow until a certain external condition or event is met, and they are designed to wait for specific conditions to be satisfied before allowing downstream tasks to proceed. Sensors play a vital role in ensuring that your workflows run smoothly and efficiently by waiting for the right conditions to be in place.
Sensors are used in a variety of scenarios, often when working with external
systems, services, or data sources. For example, here are a few common use
cases for Airflow sensors:
File Sensor: A File Sensor can be used to wait for the existence of a specific file
in a directory. This is useful when a task downstream in the workflow depends
on the availability of a file.
HTTP Sensor: An HTTP Sensor can wait for a specific URL to return a particular
status code or contain specific content. This is useful for monitoring external web
services or APIs.
Time Sensor: A Time Sensor can be used to pause a workflow until a specific
time or time interval is reached. It’s useful for scheduling tasks to run at precise
times.
External Task Sensor: This sensor waits for the successful completion of
another task in the same or a different DAG. It allows for inter-DAG
dependencies and is crucial for coordinating tasks across different workflows.
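A minimal sketch of the first use case, a FileSensor that waits for a file to land before a downstream task runs (the connection id, file path, and command are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id='file_sensor_example',
         start_date=datetime(2023, 10, 14),
         schedule=None,
         catchup=False) as dag:
    wait_for_file = FileSensor(
        task_id='wait_for_input_file',
        fs_conn_id='fs_default',        # filesystem connection defined in Admin -> Connections
        filepath='incoming/data.csv',   # path relative to the connection's base path
        poke_interval=60,               # check for the file every 60 seconds
        mode='reschedule',              # free the worker slot between checks
    )
    process_file = BashOperator(
        task_id='process_file',
        bash_command='echo "file is available, processing..."',
    )
    wait_for_file >> process_file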
QUICK RECAP:
Apache Airflow's sensors and XCOM mechanisms play pivotal roles in crafting
resilient and effective data workflows. Sensors empower workflows to gracefully
handle external dependencies, ensuring execution pauses until specific
conditions are met. This safeguards data integrity, mitigating the risk of errors
arising from incomplete or unavailable resources. On the other hand, XCOMs
facilitate seamless sharing of data and metadata among tasks, fostering
coordination, dynamic workflows, and efficient information transfer. Through
adept implementation of sensors and XCOMs in your Airflow Directed Acyclic
Graphs (DAGs), you gain the ability to orchestrate intricate data processes with
heightened reliability, responsiveness, and adaptability. This, in turn, streamlines
data pipelines and elevates your workflow management capabilities to new
heights.
CHAPTER-8
AIRFLOW CONFIGURATION DEEP DIVE
In this chapter, we'll explore advanced configurations to enhance the
experience and effectiveness of using Apache Airflow. The “airflow.cfg”
file is a critical configuration component in Apache Airflow, housing all
the settings and parameters that dictate the behavior and performance of an
Apache Airflow deployment. In this chapter, we'll dive into the “airflow.cfg” file,
covering its key sections and parameters, and explaining how you can
customize these configurations to suit your requirements.
"[CORE]" SECTION
"[SCHEDULER]" SECTION
"[WEBSERVER]" SECTION
• Web UI Port: Specify the port for the Airflow web interface.
• Web UI Host: Define the host that serves the web interface.
• Web Worker Timeout: Set the timeout for web server workers.
"[DATABASE]" SECTION
• Configuration for database that acts as major storage for actual data and
metadata for the entire airflow engine.
"[EMAIL]" SECTION
MODIFYING CONFIGURATIONS
Locate the Configuration File: Begin by finding the airflow.cfg file within
your Airflow installation directory, typically situated in /etc/airflow/ or
$AIRFLOW_HOME. Employ a text editor to access the file, and remember that
administrative privileges might be necessary depending on your system's
permissions.
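Once you have the file open, a trimmed-down airflow.cfg might look something like the following (the values shown are illustrative defaults rather than recommendations):
[core]
dags_folder = /path/to/airflow/dags
executor = SequentialExecutor
load_examples = True
parallelism = 32

[database]
sql_alchemy_conn = sqlite:////path/to/airflow/airflow.db

[webserver]
web_server_host = 0.0.0.0
web_server_port = 8080
web_server_worker_timeout = 120

[scheduler]
scheduler_heartbeat_sec = 5
dag_dir_list_interval = 300

[email]
email_backend = airflow.utils.email.send_email_smtp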
Covering every parameter in detail is out of scope for this book, which aims to stay crisp and focused on the most important aspects of each concept.
CHAPTER-9
AIRFLOW UI
The Apache Airflow UI is your portal to a world of streamlined workflow
management. It's not just a collection of features; it's your command center, your
mission control, your window into the choreography of your data pipelines.
But Airflow isn't just about pipelines; it's about resources. The Pools tab is your resource pool manager, letting you allocate worker slots like a seasoned general. Need to connect to external databases or cloud storage? The
Connections tab is your diplomatic envoy, forging secure pathways for data
exchange. Variables, like secret intel, are stored and accessed with ease,
keeping your workflows dynamic and adaptable.
And if you ever need to retrace your steps, the Audit Logs are your faithful
chronicler, documenting every action taken within this digital realm.
Log in with your username and password, and the landing page should look as below, with some example DAGs listed on the page. Let's start exploring all the menus from right to left.
DOCS MENU
In this section, you'll find a link to the official Airflow documentation, which
provides comprehensive information about Airflow's features and usage. You
can also access the official Airflow website directly from here. Additionally, there
is a link to the GitHub repository, allowing you to explore the Airflow source code
and contribute to the project. You'll also find links to the API documentation,
provided in both Redoc and Swagger formats, enabling you to interact with the
API and test its functionality.
ADMIN MENU
Pools: Pools are a way to control the concurrency and resource allocation
for different tasks within your DAGs. The "Pools" submenu allows you to
manage task pools, specifying the maximum number of concurrent task
executions for each pool. This can help prevent resource contention and
optimize the execution of tasks.
Variables: Within the "Variables" submenu, you can define and manage
global variables. These variables are accessible to your DAGs and can be used
for storing configuration parameters, constants, or any other data that needs to
be shared across multiple workflows. This section provides a centralized location
to create, edit, and delete variables.
BROWSE MENU
DAG Runs: In the first alcove of the "Browse" menu, the "DAG Runs"
submenu provides a panoramic view of the historical run instances of your
DAGs. Here, you can scrutinize the life story of each DAG run, from its inception
to completion, analyzing timestamps and identifying trends. It's your historical
compass, helping you track the trajectory of your workflows.
Triggers: Within the "Triggers" submenu, you are given the authority to
awaken your tasks when the timing is impeccable. This feature allows you to
nudge task instances into action at your discretion, affording you greater control
over the execution sequence and dependencies. It is your orchestral baton,
conducting the performance of your workflows.
Jobs: The "Jobs" submenu is your backstage pass to the laborious work
of Airflow's internal components. It uncovers the status and intricacies of jobs
responsible for executing tasks. This section provides insights into the
operational heart of your Airflow environment, helping you understand and
optimize the performance of your workflow engine.
SECURITY MENU
List Users: The "List Users" sub-menu acts as your registry of individuals
who interact with your Airflow environment. It provides an overview of all users
and their associated permissions, making it easy to manage access control.
Here, you can add, remove, or modify user accounts, ensuring that your Airflow
instance is secure and accessible only to authorized personnel.
DAG MENU
FINAL CHAPTER: THE CONCLUSION
In the pages of this comprehensive exploration into Apache Airflow, we
embarked on a journey spanning the fundamental pillars of its architecture and
functionality. The initial chapters provided a nuanced understanding of the
Airflow ecosystem, delving into the core concepts underpinning its architecture.
From unraveling the intricacies of Airflow Operators to navigating the dynamics
of the TaskFlow API, our exploration ventured deep into the mechanics of
workflow orchestration. The crucial aspects of Catchup and Backfill were
unwrapped, shedding light on their pivotal roles in ensuring the completeness
and integrity of data workflows.