
APACHE AIRFLOW COOKBOOK

Build your scalable data pipelines by understanding concepts quickly

M K PAVAN KUMAR

FIRST EDITION
Dedicated to my Dad

and

To my best friend Arun Sir

Preface

Welcome to the "Apache Airflow Cookbook." If you're navigating the world of


data organization and workflow automation, Apache Airflow is your trusted
guide. This book is your go-to resource for bootstrapping the art of managing
data workflows with Airflow.

Our cookbook is a practical companion designed for beginners of Airflow. Inside,


you'll discover a variety of recipes, and examples that will help you make the
most of Apache Airflow.

As you flip through these pages, you'll explore different aspects of Airflow's
capabilities. We'll start with the basics of Airflow's architecture and then dive into
the practical use of Airflow Operators, TaskFlow API, and the nitty-gritty of
configuration.

We've also whipped up special chapters dedicated to important Airflow


ingredients like Xcom, Sensors, Catchup and Backfill. You'll learn how to use
these elements to create and manage data pipelines efficiently. We'll also show
you the inner workings of Airflow's user interface (UI) and how it can become
your command center for monitoring workflows.

Whether you're just starting or you're an Airflow pro, this cookbook has
something for you. Each recipe is designed to give you the hands-on knowledge
and tips you need to become an Airflow expert and succeed in managing data
workflows.

So, get ready, let's dive into the world of Apache Airflow, where data flows
smoothly, work gets done, and automation becomes a breeze. Enjoy your
journey!

FOREWORD

"This unique book serves as a valuable resource, enabling developers and


project managers to seamlessly delve into the intricacies of Apache Airflow. Its
engaging approach provides a swift yet comprehensive journey, allowing
readers to grasp core fundamentals and gain hands-on expertise within a week."

Sashank Pappu

CEO, Antz.ai

Special Thanks to my family, friends and seniors who guided me in
this beautiful art of writing the book

TABLE OF CONTENTS
A brief introduction to Apache Airflow
Airflow Architecture
Airflow Operators
TaskFlow API
Catchup and Backfill
Scheduling, Connections and Hooks
Xcom and Sensors
Airflow Configuration Deep Dive
Airflow UI
Final Chapter: The Conclusion

CHAPTER-1

A BRIEF INTRODUCTION TO APACHE AIRFLOW

Welcome to the realm of Apache Airflow, a robust and versatile open-source platform that has transformed the way organizations manage their data workflows and automation processes. In this textbook, we embark on a journey to explore the intricacies of Apache Airflow and how it has become an indispensable tool for orchestrating complex data pipelines, automating repetitive tasks, and facilitating data-driven decision-making.

In today's data-driven business landscape, enterprises and organizations


constantly grapple with an ever-growing volume of data that requires processing,
transformation, and analysis. Apache Airflow was conceived as a response to
the need for a resilient, flexible, and scalable solution to address these
challenges. Originally developed at Airbnb, Apache Airflow has evolved into a
thriving open-source project that enjoys widespread adoption across various
industries.

At its core, Apache Airflow is a workflow automation tool that empowers users to
define, schedule, and monitor workflows composed of a series of interrelated
tasks. What distinguishes Airflow is its focus on orchestrating intricate workflows
while offering a user-friendly interface for designing, executing, and managing
these tasks. Whether you need to perform Extract, Transform, and Load (ETL)
operations on data, automate data processing, or manage a complex sequence
of tasks, Apache Airflow can serve as your trusted companion.

Throughout this textbook, we delve into the fundamental principles of Apache


Airflow, commencing with an examination of its architecture, components, and
essential functionalities. This exploration will provide you with a comprehensive
understanding of how Airflow handles task scheduling, dependencies, and
dynamic workflows, making it a versatile choice for a wide range of automation
requirements.

Furthermore, we guide you through the process of setting up and configuring


Apache Airflow to tailor it to your specific needs. Whether you are a data
engineer, data scientist, or a DevOps professional, you will discover practical
insights and best practices for creating and managing workflows that align with
the unique demands of your organization.

As we progress through this textbook, you will gain the knowledge required to
monitor, troubleshoot, and optimize your workflows, ensuring the efficient and
reliable execution of your data pipelines. We also cover advanced topics such as backfill, catchup, and extending Airflow's functionality through custom operators and sensors.

By the end of this educational journey, you will be well-prepared to harness the
full potential of Apache Airflow, enabling you to streamline your data processes,
minimize manual intervention, and make data-driven decisions with confidence.
So, let's immerse ourselves in the world of Apache Airflow and uncover how it
can revolutionize the management of data and automation within your
organization.

Download and Installation

The installation process for Apache Airflow is designed to be accessible and


efficient, ensuring that users can quickly set up their environment to leverage the
power of this workflow automation tool. Below, we'll walk through each
installation step in detail.

1. Define Airflow Home (optional):

Apache Airflow requires a designated home directory, which is, by default, set to
~/airflow. However, users have the flexibility to specify a different location if they
prefer. This is accomplished by setting the AIRFLOW_HOME environment
variable. It's important to note that this step should be completed before the
actual installation process begins. Configuring the environment variable informs
Apache Airflow of the desired location for storing essential files. For example, if
you want to set your AIRFLOW_HOME to a different location, you can do so as
follows:

Shell code/Terminal

export AIRFLOW_HOME=~/my_custom_airflow_location

2. Install Airflow using the constraints file:

To ensure a reproducible and stable installation, Apache Airflow relies on


constraint files. These files help define the dependencies and constraints
necessary for the installation process. The specific version of Airflow and Python
you're using is taken into account during this step. Here's a breakdown of how to
install Apache Airflow:

• Specify the Airflow version you intend to install. In this case, we'll use
version 2.7.2:

Shell code/Terminal

AIRFLOW_VERSION=2.7.2

• Determine the Python version you have installed. If the installed Python
version is not supported by Airflow, you may need to set it manually. The
following command extracts the Python version:

Shell code/Terminal

PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

• Define the URL for the constraint file specific to your Airflow and Python
versions. This URL is used to guide the installation process:

Shell code/Terminal

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

• Finally, you can use pip to install Apache Airflow, ensuring that it adheres
to the specified constraints. Here's how you can do it:

Shell code/Terminal

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

This installation method guarantees that you have a compatible and stable
configuration for your specific Airflow and Python versions.

3. Launch Airflow Standalone:

The "airflow standalone" command is a pivotal step that sets up your Apache
Airflow environment. When executed, it performs the following actions:

• Initializes the Airflow database: This process is crucial as it sets up the


database where Airflow will store critical information about your
workflows, tasks, and their statuses.

• Creates a user: The command also establishes an initial user account,
typically referred to as the admin account. This user account is essential
for accessing the Airflow UI and managing the system.
• Starts all components: Apache Airflow is composed of various
components, such as the scheduler, web server, and worker processes.
The "airflow standalone" command ensures that all of these components
are up and running, ready to execute your workflows and tasks.

To execute this command, simply run:

Shell code/Terminal

airflow standalone

4. Access the Airflow User Interface (UI):

After successfully setting up Apache Airflow, you can access the Airflow User
Interface (UI) through a web browser. The UI provides a user-friendly
environment for managing your workflows, tasks, and scheduling. Here's how to
access it:

• Open your web browser and navigate to "localhost:8080".


• You will be prompted to log in with the admin account details displayed in
the terminal. The admin account is created during the installation process,
and you can use these credentials to access the UI.
• Once you're logged in, you'll have access to the Airflow UI, where you can
manage your workflows and tasks. On the home page, you can enable
and configure the example_bash_operator DAG to get started.

Upon executing the provided commands, Apache Airflow will create the
designated $AIRFLOW_HOME folder and generate the "airflow.cfg"
configuration file. This file includes default settings to expedite your setup. You
also have the option to customize these default configurations using
environment variables, as outlined in the Configuration Reference. Additionally,
you can review and adjust the "airflow.cfg" file either in $AIRFLOW_HOME/airflow.cfg or through the UI by navigating to Admin -> Configuration.

It's important to note that, out of the box, Apache Airflow uses a SQLite
database, which is a lightweight database system. However, it may become
limiting as your requirements grow, particularly because it supports only
sequential task execution. For more complex scenarios that require
parallelization, you may need to explore other database backends and executors
provided by Apache Airflow. Nevertheless, starting with the default configuration
allows you to quickly explore the UI and command line utilities to gain a better
understanding of the platform's capabilities.

CHAPTER-2

AIRFLOW ARCHITECTURE
In the context of Apache Airflow, an Airflow setup typically encompasses several pivotal components, each playing a specific role in the orchestration and execution of data workflows. These core components are as follows:

1. Scheduler: The Scheduler serves as a central orchestrator within the


Apache Airflow architecture. Its key responsibilities include the scheduling of
workflows according to specified time intervals and the initiation of task
execution. It acts as the temporal guide for the entire system, ensuring that tasks
are dispatched to the appropriate execution environment.

2. Executor: The Executor is the workhorse responsible for executing tasks


within an Airflow setup. By default, a basic Airflow installation handles task
execution within the Scheduler itself. However, in more complex and production-
oriented configurations, the Executor frequently delegates the actual execution
of tasks to dedicated worker nodes. This separation of concerns enhances the
system's scalability and efficiency.

3. Webserver: The Webserver component provides a user-friendly and intuitive


interface for users to interact with Apache Airflow. Through this interface, users
can perform essential tasks such as inspecting, triggering, and troubleshooting
Directed Acyclic Graphs (DAGs) and individual tasks. The Webserver is a
valuable tool for visualizing and managing the workflow behavior.

4. DAG Files: Directed Acyclic Graph (DAG) Files are foundational building
blocks of an Airflow setup. These files are organized within a designated folder
and contain the instructions and structure of the workflows. The Scheduler,
Executor, and worker nodes, when in use, rely on these DAG Files to
comprehend the arrangement of tasks and dependencies within the workflows.
They serve as the blueprints for task execution.

5. Metadata Database: The Metadata Database is a critical component that


plays a central role in storing and managing the state of tasks and workflows. It
is utilized by the Scheduler, Executor, and Webserver to keep track of the
progress of ongoing tasks, monitor dependencies, and maintain an accurate
record of the system's current state. This database serves as the backbone for
effective workflow management and coordination.

Understanding these fundamental components is essential for grasping the inner


workings of Apache Airflow and its ability to streamline the orchestration and

11
execution of data workflows. As you delve deeper into the world of Airflow, a
comprehensive knowledge of these components will empower you to harness
the full potential of this powerful platform for workflow automation.

Within the framework of Apache Airflow, Directed Acyclic Graphs (DAGs) serve
as the structural backbone of workflow design and orchestration. These DAGs
progress through a sequence of tasks, which can be conveniently grouped into
three prominent categories:

1. Operators: Operators are foundational building blocks within the realm of


Airflow. They represent pre-defined tasks that offer a versatile toolkit for
constructing the various components of your DAGs. Operators essentially serve
as templates for specific actions or operations. Through a rich library of
Operators, users can encapsulate a wide array of actions, such as file transfers,
data processing, and more, thereby streamlining the creation of intricate
workflows.

2. Sensors: Sensors, while sharing a familial connection with Operators, exhibit


a specialized functionality tailored to a specific need. Sensors are primarily geared toward monitoring and responding to external conditions or events. They
stand vigilant, waiting for specific external occurrences to transpire before
allowing the DAG's execution to proceed. Sensors provide a vital mechanism for
ensuring that workflows remain responsive and adaptive to the ever-changing
dynamics of the environment.

3. TaskFlow-decorated @task: In contrast to predefined Operators, the


TaskFlow-decorated @task represents a customizable and Python-centric
approach to task creation. This unique category of task allows users to
encapsulate custom functionality within a Python function, which is then adorned
with the @task decorator. This method empowers users to tailor tasks to their
specific needs and leverage the full expressive power of Python for defining task
behavior. It provides a more flexible and programmatic means of integrating
custom functionality into your DAGs.

Understanding these distinct task types—Operators, Sensors, and TaskFlow-


decorated @task—is essential for proficiently designing and implementing
workflows within Apache Airflow. Each category caters to specific use cases and
offers a spectrum of choices for task creation, allowing you to craft workflows
that precisely align with your automation requirements. As you delve deeper into
the world of Airflow, a comprehensive knowledge of these task categories will
equip you with the necessary tools to efficiently design and execute complex
data workflows.

While the terms "Task" and "Operator" may seem interchangeable in the
underlying implementation, it proves beneficial to maintain a conceptual
distinction between them. Operators and Sensors may be perceived as
templates, and when integrated into a DAG file, they effectively create a Task,
specifying a particular operation or action within the workflow.

The Task Lifecycle

Analogous to the transformation of a Directed Acyclic Graph (DAG) into a DAG


Run upon execution, the individual tasks within a DAG are transmuted into Task
Instances. A Task Instance represents the execution of a specific task within a
particular DAG and, correspondingly, for a defined data interval. These
instances embody tasks with distinct states, signifying their current status in the
lifecycle.

The potential states for a Task Instance encompass the following:

• None: This denotes that the task has not yet been queued for execution because its dependencies have not been met.
• Scheduled: The scheduler has ascertained that the task's dependencies
are met, rendering it ready for execution.
• Queued: The task has been allocated to an Executor and awaits
assignment to a worker for execution.
• Running: The task is actively in the process of execution, either on a
dedicated worker or within a local/synchronous executor.
• Success: The task has successfully completed its execution without
encountering any errors.
• Shutdown: External requests have induced the task to terminate while it
was in an active running state.
• Restarting: External requests have triggered the task to restart its
execution while it was in progress.
• Failed: The task has encountered an error during its execution, resulting
in an unsuccessful run.
• Skipped: This denotes that the task has been bypassed, typically due to
branching conditions, the utilization of the LatestOnly operator, or similar
factors.
• Upstream Failed: An upstream task has failed, and the task's Trigger Rule required that upstream task to succeed.
• Up For Retry: The task has failed but retains available retry attempts,
prompting it to be rescheduled for another execution attempt.
• Up For Reschedule: In this instance, the task serves as a Sensor
actively engaged in reschedule mode, signifying its continuous monitoring
for a specific event.
• Deferred: The task has been deferred as part of a trigger operation,
awaiting a subsequent execution opportunity.
• Removed: The task has been excluded from the current execution of the
DAG, signifying its detachment from the ongoing workflow.

CREATING OUR FIRST DAG:

Important imports

from datetime import datetime, timedelta

# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG

# Operators: we will cover these in upcoming chapters; they are the basic
# building blocks of the workflow mechanism
from airflow.operators.bash import BashOperator

# helper used below for a relative start date in default_args
from airflow.utils.dates import days_ago

The DAG definition

# Define the default_args dictionary
default_args = {
    'owner': 'your_name',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG(
    "tutorial",
    # These args will get passed on to each operator
    # You can override them on a per-task basis during operator initialization
    default_args=default_args,
    description="A simple tutorial DAG",
    schedule=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:

Definition of tasks and their dependency for execution

    # t1, t2 and t3 are examples of tasks created by instantiating operators
    # Define three BashOperator tasks
    task1 = BashOperator(
        task_id='task1',
        bash_command='echo "Task 1"',
        dag=dag,
    )

    task2 = BashOperator(
        task_id='task2',
        bash_command='echo "Task 2"',
        dag=dag,
    )

    task3 = BashOperator(
        task_id='task3',
        bash_command='echo "Task 3"',
        dag=dag,
    )

    # sequence of tasks and their dependencies
    task1 >> [task2, task3]

QUICK RECAP:

In our previous discussion, we explored the architecture of Apache


Airflow, examining how its components interact and how Directed Acyclic
Graphs (DAGs) are executed when scheduled. This understanding lays the
foundation for grasping Airflow. Once we embark on creating our own DAGs, it is
essential to connect these concepts and develop the ability to visualize them.
Moving forward in the subsequent chapters, we will delve deeper into DAG
creation, continuing our exploration of the various aspects of Airflow.

CHAPTER-3

AIRFLOW OPERATORS
In Apache Airflow, the concept of Operators holds a fundamental role, serving as the elemental building blocks that represent individual, atomic tasks within a workflow. As an open-source platform designed for the programmatic
orchestration of workflows, Airflow employs Operators to delineate the
specific steps or operations within a given workflow. These tasks can span
from straightforward activities, such as executing a Python script or running a
SQL query, to more intricate processes like data transfer between systems or
the initiation of external processes.

Key aspects of Airflow Operators include:

Atomic Tasks: Each Operator encapsulates a singular unit of work,


undertaking a precise action or operation. For instance, a BashOperator
facilitates the execution of a Bash command, a PythonOperator handles the
execution of a Python function, and a MySQLOperator interfaces with a
MySQL database.

Extensibility: While Airflow furnishes an extensive collection of built-in


Operators catering to common use cases, the platform also supports the
creation of custom Operators. This feature empowers users to define tasks
that align with the unique requirements of their organization.

Reusability: Operators are intentionally crafted to be reusable, enabling their


application across multiple instances within a Directed Acyclic Graph (DAG)
or even across distinct DAGs. This design minimizes the necessity for code
duplication.

Parameterization: Operators can be parameterized, allowing the


incorporation of dynamic values, such as dates or configuration settings,
when defining a task. This adaptability facilitates the establishment of
dynamic and data-driven workflows.

Dependencies: Task dependencies are established through the


set_upstream() and set_downstream() methods, governing the sequence in
which tasks are executed within a DAG.
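For instance, assuming three already instantiated operator tasks named extract, transform, and load (hypothetical names, not from any listing in this book), the explicit methods and the bit-shift shorthand used throughout these chapters are equivalent:

# extract must complete before transform; transform before load
extract.set_downstream(transform)
load.set_upstream(transform)

# the same chain expressed with the bit-shift shorthand
extract >> transform >> load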

A selection of notable Operators includes:

• BashOperator: Executes Bash commands or scripts.


• PythonOperator: Executes Python functions.
• SQLOperator: Runs SQL queries on a database.

• DummyOperator: A non-operational Operator employed for specifying
dependencies within a DAG.
• SubDagOperator: Executes a sub-DAG as a singular task within the
parent DAG.
• SimpleHttpOperator: Facilitates the transmission of HTTP requests to
engage with web services, APIs, or other HTTP endpoints (Note: This
involves the use of the airflow connections concept).

BASHOPERATOR

The BashOperator in Apache Airflow is a fundamental component


designed for the integration of bash commands and scripts into workflow tasks. It
enables users to execute arbitrary shell commands or scripts as part of their
data orchestration pipelines. Configured with the desired bash command and
optional parameters, the BashOperator facilitates the seamless incorporation of
shell functionality within the Airflow framework. This operator is particularly
useful for scenarios where workflow tasks necessitate interactions with the
underlying system, file manipulations, or the execution of shell-based utilities. By
encapsulating shell commands, the BashOperator promotes the interoperability
of Airflow with existing shell scripts and commands, contributing to the
adaptability and extensibility of Airflow workflows in diverse data engineering and
automation contexts.

Import Statements: Import necessary modules and classes from the Airflow
library. These include the DAG class, BashOperator, and timedelta for working
with time intervals.

Default Arguments: Define a dictionary args that specifies default configuration


and settings for the DAG. These settings will be inherited by all tasks in the
DAG, but they can be overridden on a per-task basis during task initialization.
Here are some of the key settings in args.

• owner: Specifies the owner of the DAG.


• depends_on_past: If set to True, it would make tasks depend on the
success of their previous runs.

• start_date: Defines the start date for the DAG. In this case, it's set to
2 days ago.
• email: An email address to send notifications to.
• retries: Number of times to retry a task if it fails.
• retry_delay: The delay between retries.

DAG Initialization: Create an instance of the DAG class with various properties:

• dag_id: A unique identifier for the DAG.


• default_args: Assign the previously defined args dictionary as default
arguments.
• description: A description of the DAG.
• schedule_interval: Defines how often the DAG should run. In this
case, it's set to run once every day.

Task Definitions:

• task1: This is a BashOperator task named 'task_1'. It runs the 'date'


command in a Bash shell.
• task2: Another BashOperator task named 'task_2'. It sleeps for 5
seconds using the 'sleep' command.
• task3: A third BashOperator task named 'templated_task'. It runs a
Bash script defined by the templated_command variable. This script uses Jinja templating to print date-related values and a parameter
passed as an argument.

The templated_command is a multi-line string that contains a Bash script with


embedded Jinja2 templating expressions. This script will be executed when the
'templated_task' is run as part of the DAG.

Task Dependencies: Define the dependencies between tasks using


the >> operator. In this case, 'task1' should be executed before 'task2' and
'task3'.

Documentation: Additional documentation is provided for ‘task1’ and the DAG
itself using the doc_md attribute. This documentation can include Markdown
content and is displayed in the Airflow web interface.
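The listing described above appears only as screenshots in the original book; the following is a minimal sketch that matches the description, with the DAG id, email address, and exact dates being assumptions:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# Default settings inherited by every task in the DAG (overridable per task)
args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": days_ago(2),
    "email": ["alerts@example.com"],    # hypothetical address
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="bash_operator_demo",        # hypothetical DAG id
    default_args=args,
    description="Plain and templated BashOperator tasks",
    schedule_interval=timedelta(days=1),
) as dag:
    dag.doc_md = "DAG-level documentation shown in the Airflow web interface."

    # task_1: print the current date in a Bash shell
    task1 = BashOperator(task_id="task_1", bash_command="date")
    task1.doc_md = "#### Task 1\nPrints the current date (rendered as Markdown in the UI)."

    # task_2: sleep for 5 seconds
    task2 = BashOperator(task_id="task_2", bash_command="sleep 5")

    # templated_task: a Bash script rendered by Jinja before execution
    templated_command = """
    {% for i in range(3) %}
        echo "logical date: {{ ds }}"
        echo "seven days later: {{ macros.ds_add(ds, 7) }}"
        echo "parameter: {{ params.my_param }}"
    {% endfor %}
    """
    task3 = BashOperator(
        task_id="templated_task",
        bash_command=templated_command,
        params={"my_param": "a value passed as an argument"},
    )

    # task_1 runs before task_2 and templated_task
    task1 >> [task2, task3]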

SimpleHttpOperator & PythonOperator

The SimpleHttpOperator is a fundamental component in Apache Airflow,


designed to facilitate HTTP requests within workflows. It encapsulates the
interaction with HTTP endpoints, enabling users to seamlessly integrate web
services into their data pipeline orchestrations. To employ the
SimpleHttpOperator, one configures the task with essential parameters such as
the HTTP method (GET, POST, etc.), the endpoint URL, request headers, and
request payload. This operator streamlines the execution of HTTP requests,
offering simplicity and efficiency in handling RESTful APIs or other HTTP-based
services. By encapsulating the intricacies of HTTP communication, the
SimpleHttpOperator abstracts away the underlying complexities, ensuring a
standardized and straightforward interface for Airflow users. This operator is
particularly advantageous for scenarios where data extraction, transformation, or
loading involves interaction with external web services, contributing to the
versatility and extensibility of Airflow workflows.

The PythonOperator is a pivotal element in Apache Airflow, providing a


mechanism to seamlessly integrate custom Python functions or scripts into
workflow execution. This operator allows practitioners to encapsulate arbitrary
Python logic within their data workflows, fostering flexibility and extensibility.
Users can define a Python function or callable object and execute it as part of an
Airflow task by configuring a PythonOperator instance. The operator supports
passing parameters, templating, and managing dependencies, ensuring the
seamless integration of custom Python code into the broader orchestration. The
PythonOperator is instrumental in scenarios where workflow tasks demand
custom processing, data manipulation, or interactions with external systems
through Python scripts. By enabling the incorporation of user-defined Python
functionality, the PythonOperator enhances the expressive power of Apache
Airflow, making it a versatile tool for diverse data engineering and pipeline
automation scenarios.

Note: for now, just remember that XCom is used for sharing data between operators or tasks; a later dedicated chapter discusses XCom and Sensors in detail.

Import Statements: Import necessary modules and classes from the Airflow library and other Python libraries.

Variable Initialization: Define several variables to store information needed for
the DAG’s tasks. These variables include the city name, limit for API requests,
and API key for accessing the OpenWeatherMap API.

Default Arguments: Define a dictionary args that specifies default configuration


and settings for the DAG. These settings include the owner, start date, retries,
and retry delay.

DAG Initialization: Create an instance of the DAG class with various properties:

• dag_id: A unique identifier for the DAG.


• catchup: Set to False to prevent backfilling historical data.
• schedule_interval: Defines how often the DAG should run. In this
case, it's set to run daily.

• default_args: Assign the previously defined args dictionary as default
arguments.

Task Definitions:

• task_fetch_geocodes: This is a SimpleHttpOperator task named


'fetch_geo_codes'. It makes an HTTP GET request to retrieve
geolocation data for the specified city using the OpenWeatherMap
API. The response content is filtered as JSON, and the response is
logged.
• task_fetch_weather: Another SimpleHttpOperator task named
'fetch_weather'. It makes an HTTP GET request to retrieve weather
data for the specified latitude and longitude, which are obtained from
the output of the 'fetch_geo_codes' task. The response content is
filtered as JSON, and the response is logged.
• task_convert_to_csv: This is a PythonOperator task named
'convert_to_csv'. It calls the create_csv_file Python function, which
presumably converts the weather data obtained from 'fetch_weather'
into a CSV file and stores it.

task_fetch_geocodes = SimpleHttpOperator(
    task_id='fetch_geo_codes',
    http_conn_id='open_weather_geo_codes',
    endpoint=f'direct?q=Hyderabad,Telangana&limit=1&appid={API_KEY}',
    method='GET',
    response_filter=lambda response: json.loads(response.content),
    log_response=True
)

task_fetch_weather = SimpleHttpOperator(
    task_id='fetch_weather',
    http_conn_id='open_weather_data',
    endpoint='weather?lat={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lat") }}'
             '&lon={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lon") }}'
             f'&appid={API_KEY}',
    method='GET',
    response_filter=lambda response: json.loads(response.content),
    log_response=True
)

task_convert_to_csv = PythonOperator(
    task_id='convert_to_csv',
    python_callable=create_csv_file
)

You can check individual tasks by executing the command below without scheduling the entire DAG.
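(In the original book the command appears as a screenshot.) Airflow's task-test CLI runs a single task in isolation without recording state in the metadata database; assuming the DAG id weather_api_dag used for this example later in the Scheduling chapter, and an arbitrary logical date, a test of the first task looks like:

Shell code/Terminal

airflow tasks test weather_api_dag fetch_geo_codes 2023-10-14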

The utility method used utils/create_csv_file is as follows.

import csv


def create_csv_file(ti):
    # pull data from the xcom objects
    data = ti.xcom_pull(task_ids=["fetch_weather"])[0]

    # field names
    fields = ['Name', 'lat', 'lon', 'weather_desc', 'temperature', 'feels_like', 'temp_min',
              'temp_max', 'pressure', 'humidity', 'visibility', 'wind_speed', 'sunrise', 'sunset']

    # data rows of csv file
    row = [str(data.get("name")), str(data.get("coord").get("lat")), str(data.get("coord").get("lon")),
           str(data.get("weather")[0].get("description")), str(data.get("main").get("temp")),
           str(data.get("main").get("feels_like")), str(data.get("main").get("temp_min")),
           str(data.get("main").get("temp_max")), str(data.get("main").get("pressure")),
           str(data.get("main").get("humidity")), str(data.get("visibility")),
           str(data.get("wind").get("speed")), str(data.get("sys").get("sunrise")),
           str(data.get("sys").get("sunset"))]

    # name of csv file
    filename = "weather_records.csv"

    # save the data
    with open(filename, 'w', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)
        writer.writerow(row)
        csvfile.flush()

This DAG fetches geolocation data, weather data, and converts it to CSV format.
It demonstrates the use of the SimpleHttpOperator for making HTTP requests
and the PythonOperator for executing custom Python code within an Airflow
DAG. The tasks are orchestrated to execute in the specified order, with
'fetch_weather' dependent on the output of 'fetch_geo_codes'.

The SimpleHttpOperator picks up the relevant details from Airflow Connections; you can configure different connection types, such as HTTP and database connections, from the Admin menu in the UI, as shown below.

The data exchanged between the tasks is communicated using XCom and can be inspected visually in the Airflow UI.

QUICK RECAP:

Apache Airflow Operators serve as the building blocks for defining


individual tasks within workflows. They enable users to orchestrate a wide range
of operations, from simple scripts to complex data transformations, all while
providing flexibility, reusability, and error handling capabilities. Operators are
essential for creating dynamic, parameterized, and robust workflows in Airflow.

CHAPTER-4

TASKFLOW API
The Apache Airflow TaskFlow API stands as a ground-breaking advancement within the realm of workflow automation and data engineering. This exploration unveils the innovative paradigm of TaskFlow,
which significantly simplifies the creation of intricate data pipelines, offering
a refreshing approach to the design of workflows. By delving into the
foundational components of TaskFlow, one can discern a systematic
method for task definition, dependency management, and error handling, all
elegantly implemented through Pythonic code. This book not only caters to
seasoned developers but also provides newcomers with the tools to effectively
harness TaskFlow's capabilities, from intuitive dependency management to
robust error handling and advanced techniques. Throughout this journey, a novel
approach to orchestrating workflows is revealed, one that is both potent and
user-friendly, enhancing clarity, simplicity, and efficiency in the development
process of data pipelines.

In the preceding chapter, PythonOperators were utilized for task execution.


For those familiar with the PythonOperator in Airflow, an exploration into the
reasons for contemplating a shift is warranted. The TaskFlow API offers several
advantages over the traditional PythonOperator, some of which include:

1. Reduced Repetitive Code: The TaskFlow API minimizes the need for
redundant code, enhancing the efficiency of workflow development.
2. Efficient Data Transfer: Streamlined data transfer between Directed
Acyclic Graphs (DAGs) is facilitated by the TaskFlow API, simplifying the
movement of information within the workflow.
3. Dependency Chain Specification: Unnecessary code for specifying
dependency chains is eliminated, leading to a more streamlined and
comprehensible workflow structure.
4. Simplified Task Management: Task creation and initialization are
simplified with the TaskFlow API, offering a more straightforward and
intuitive process.

To illustrate the advantages of TaskFlow API, let's consider a basic


Extract, Transform, Load (ETL) process. In this scenario, a CSV file is read from
local storage, subjected to a transformation process, and the resulting output is
printed.

Import necessary libraries:

• pandas is imported to work with data in a tabular format.


• datetime is imported to specify the start date for the DAG.
• Import decorators and functions from the airflow.decorators module
for creating tasks and DAGs.

Define the DAG:

• The @dag decorator is used to define the DAG with the following
parameters:
• dag_id: This specifies the unique identifier for the DAG as
"housing_etl_workflow".
• schedule: The DAG is set to run on demand (schedule=None).

• start_date: The DAG is set to start on October 7, 2023.
• catchup: Catchup is disabled, meaning that the DAG won't run backfill
for missed schedules.

Define ETL tasks:

• Inside the housing_etl_workflow function, three tasks are defined


using the @task() decorator.
• extract: This task reads data from a CSV file located at
"/<path_to_file>/IndianHouses.csv" using Pandas and returns it as a
DataFrame.
• transform: This task takes the extracted data as input, converts it to a
DataFrame (although this step seems unnecessary), and then filters
the rows where the "Transaction" column has the value
"New_Property".
• load: This task simply prints the transformed data (you may want to
modify it to perform an actual data load operation).

Execute the ETL workflow:

• Outside the task definitions, the workflow begins with the df =


extract() line, which extracts data and stores it in the df variable.
• The extracted data (df) is then passed as input to the transform task,
and the result is stored in the new_properties variable.
• Finally, the load task is called with new_properties as its argument.

Trigger the DAG:

• The last line housing_etl_workflow() is used to execute the DAG.


When this script is run with Airflow, it will trigger the defined tasks
within the DAG based on their dependencies.
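The listing itself appears as screenshots in the original book; the following is a minimal sketch that follows the description above. Note that returning a pandas DataFrame between tasks relies on XCom being able to serialize it, which may require enabling pickling or a custom XCom backend for anything non-trivial; for larger data sets, exchange file paths instead.

from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task


@dag(
    dag_id="housing_etl_workflow",
    schedule=None,                      # run on demand
    start_date=datetime(2023, 10, 7),
    catchup=False,
)
def housing_etl_workflow():
    @task()
    def extract() -> pd.DataFrame:
        # read the source CSV into a DataFrame
        return pd.read_csv("/<path_to_file>/IndianHouses.csv")

    @task()
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # keep only rows describing new properties
        return df[df["Transaction"] == "New_Property"]

    @task()
    def load(df: pd.DataFrame):
        # placeholder for a real load step
        print(df)

    df = extract()
    new_properties = transform(df)
    load(new_properties)


housing_etl_workflow()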

Please note that this code is a basic example of how to create an ETL workflow using Airflow. Depending on your use case, you may need to customize the tasks and add more functionality to suit your specific data processing requirements. In the Airflow web console, the ETL flow appears as a simple three-task chain: extract, transform, and load.

QUICK RECAP:

In the domain of workflow automation and data engineering, the Apache


Airflow TaskFlow API stands as a catalyst for profound transformation. As we
conclude our exploration, it becomes apparent that TaskFlow transcends mere
utility; it embodies a philosophy that champions simplicity, clarity, and reliability
in orchestrating intricate data pipelines. This API serves as a potent tool,
empowering both seasoned engineers and novices alike, providing them with the
means to craft workflows that not only demonstrate efficiency and maintainability
but also stand as exemplars of the elegance inherent in Pythonic code. As we
take leave of these pages, it is crucial to recognize that TaskFlow represents a
portal to a new era of workflow design. In this realm, dependencies, errors, and
complex tasks yield to simplicity, and every data journey, no matter its intricacy,
can be shaped with grace and precision. TaskFlow beckons to all who aspire to
elevate their workflow design, infuse vitality into their data pipelines, and usher in
a realm of boundless possibilities in the realm of data automation.

CHAPTER-5

CATCHUP AND BACKFILL


Airflow's Catchup and Backfill functionalities emerge as unsung heroes in the realm of efficient workflow management, constituting a dynamic duo that orchestrates the past, present, and future facets of data pipelines. The
Catchup feature within Airflow serves as a remarkable tool, effortlessly
breathing life into historical data through a simple switch toggle. This
functionality metaphorically transports us to an era where we can
manipulate the hands of time within our workflow's calendar, ensuring the
execution of any previously missed runs and securing a thorough coverage of
data points. This temporal control is not only instrumental in preserving data
integrity but also adeptly addresses historical data processing, skilfully bridging
any temporal gaps within data timelines with grace and precision.

On the contrary, the concept of Backfill in Airflow presents itself as a strategic


maneuver, akin to navigating through the chapters of our workflow's narrative.
It bestows users with the power to retroactively execute tasks from a specified
start date, treating historical data as if it unfolded in real-time. The Backfill
capability serves as a versatile tool, effortlessly managing data gaps that may
arise due to system downtime, maintenance activities, or unforeseen
circumstances. This prowess empowers workflows to be both comprehensive
and forward-looking, transforming the data pipeline into a seamless narrative
where past, present, and future seamlessly coexist. The collaboration of
Catchup and Backfill breathes vitality into workflows, offering control, reliability,
and a holistic perspective on the data journey, making them indispensable
components in the tapestry of efficient workflow management.

Using the start date and the execution date, we can describe DAG runs with a simple relationship:

Trigger Point = start_date + schedule_interval

CATCH-UP:

The Apache Airflow Catchup feature stands as a foundational element in


the orchestration of workflows, providing a mechanism akin to turning back the
hands of time within the framework of data pipelines. In the context of Airflow,
Catchup operates as a toggle switch, offering the ability to seamlessly bring
historical data to life. This temporal control is indispensable, allowing for the
execution of missed runs and ensuring the comprehensive coverage of data
points. By effortlessly bridging any temporal gaps within data timelines, the
Catchup feature not only upholds data integrity but also affords practitioners the
invaluable capability to retrospectively address historical data processing with
precision and clarity. This feature is a cornerstone in the arsenal of tools that
Airflow offers, contributing significantly to the reliability and robustness of
workflow management.

If you follow the code in our chapters closely, you will notice that while creating DAGs we explicitly pass a parameter called catchup=False. The reason is that, by default, Airflow will run any past scheduled intervals that have not yet been run; passing this parameter avoids that behavior, as the sketch below illustrates.
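A minimal sketch (the DAG id and dates are illustrative): with catchup=True and a start_date in the past, the scheduler creates one DAG run for every missed daily interval between the start date and now; with catchup=False, only the most recent interval runs.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="catchup_demo",                # hypothetical DAG id
    start_date=datetime(2023, 10, 1),     # a date in the past
    schedule_interval=timedelta(days=1),
    catchup=True,                         # set to False to skip the missed intervals
) as dag:
    BashOperator(task_id="print_interval", bash_command='echo "run for {{ ds }}"')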

BACKFILL:

The Apache Airflow Backfill feature emerges as a strategic maneuver within the
landscape of workflow orchestration, presenting a distinctive capability for
retroactive task execution. Functioning akin to navigating through the chapters of
a narrative, the Backfill feature empowers users to seamlessly address historical
data processing from a specified start date, treating the data as if it unfolded in
real-time. This versatile tool becomes particularly valuable in instances of
system downtime, maintenance activities, or unforeseen circumstances, allowing
for the effortless filling of data gaps and ensuring the comprehensive and
forward-looking nature of the workflow. The Backfill feature, with its ability to
reconcile historical data seamlessly, stands as a fundamental and sophisticated
component in Apache Airflow's repertoire, enhancing the coherence and
completeness of data pipelines. If for some reason we want to re-run DAGs for certain schedule intervals manually, we can use the following CLI command to do so:

airflow dags backfill -s <START_DATE> -e <END_DATE> --rerun-failed-tasks -B <DAG_NAME>

QUICK RECAP:

Airflow’s TaskFlow API offers a streamlined approach to building complex


workflows, reducing boilerplate code and simplifying task dependencies.
Combined with the powerful features of backfill and catchup, it becomes a robust
tool for managing data pipelines and ETL processes. Backfilling allows you to
retroactively run and populate missed DAG Runs, ensuring data integrity, while
catchup provides a dynamic solution for handling scheduling gaps. Together,
these capabilities empower data engineers and operators to efficiently manage
and maintain their workflows, making Airflow a valuable asset in the world of
data orchestration.

CHAPTER-6

SCHEDULING, CONNECTIONS AND HOOKS

I hope you have been enjoying the content presented in the preceding chapters, delving into the intricacies of Apache Airflow. In this chapter, our focus shifts towards the pivotal aspects of Airflow's Scheduling, Connections,
and Hooks. It is assumed that you have diligently traversed the earlier
chapters, assimilating a comprehensive understanding of the concepts
covered thus far. Your familiarity with these foundational concepts is
paramount as we navigate through the nuanced discussions within this
chapter. Your dedication to mastering the preceding content will undoubtedly
enhance your comprehension of the forthcoming topics, providing a solid
foundation for delving into the scheduling mechanisms, connections, and hooks
within the Apache Airflow framework.

AIRFLOW SCHEDULING:
In the domain of Apache Airflow, the scheduler assumes a proactive role,
vigilantly overseeing tasks and Directed Acyclic Graphs (DAGs). Task instances
come to life under its watchful eye, awakening once their dependencies find
fulfillment. Functioning as a persistent presence in an Airflow production
environment, the scheduler is effortlessly set into motion with a simple
command, “airflow scheduler,” drawing configurations from the designated
“airflow.cfg” file.

It's imperative to grasp that when a DAG is scheduled with a "schedule_interval" of, say, one day, the task associated with the date 2023-01-01 takes its cue shortly after 2023-01-01T23:59. In essence, the job instance
commences precisely at the conclusion of the time span it represents. To
underscore, the scheduler triggers your job at one "schedule_interval" after the
commencement date, marking the finale of that specific time frame. This
temporal orchestration can be configured as either "preset" or "cron," affording
users flexibility in aligning their workflows with diverse scheduling needs.
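For reference, the schedule_interval parameter accepts a preset string, a cron expression, or a timedelta (from Python's datetime module); a few illustrative values, not tied to any particular DAG in this book:

schedule_interval="@daily"              # preset: once a day at midnight
schedule_interval="0 0 * * *"           # cron expression equivalent to @daily
schedule_interval="@hourly"             # preset: once an hour
schedule_interval=timedelta(days=1)     # run once per day, relative to start_date
schedule_interval=None                  # no schedule; trigger the DAG manually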

Let’s have a practical example to understand the concept more clearly.

import os
import logging
import csv
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import PythonOperator

with DAG(dag_id='ecgdataprocessing',
         description='ecg data processing with schedule',
         schedule_interval='*/2 * * * *',  # "*/2 * * * *" = at every 2nd minute
         start_date=datetime(2023, 10, 9), catchup=False) as dag:
    @task
    def copy_ecg_data():
        destination_path = ("/Users/pavanmantha/Pavans/PracticeExamples/apache_airflow_tutorial/"
                            f"data{datetime.now()}.csv")
        logging.info('Write Dir')
        os.makedirs(os.path.dirname(destination_path), exist_ok=True)
        logging.info(f'File={destination_path}')

        src_file = ("/Users/pavanmantha/Pavans/PracticeExamples/DataScience_Practice/"
                    "Advanced-ML/ECGCvdata.csv")
        with open(file=destination_path, mode='w') as fd2:
            writer = csv.writer(fd2)
            with open(file=src_file, mode='r') as src:
                reader = csv.reader(src)
                header = next(reader)
                writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
        logging.info(f'Data written to File={destination_path}')

    def log_info():
        logging.info('WRITING NOW')
        logging.info(datetime.utcnow().isoformat())
        logging.info('TIME WRITTEN')

    logger_task = PythonOperator(task_id='print_now', python_callable=log_info, dag=dag)

    logger_task >> copy_ecg_data()

DAG run of the above process looks as below.

CONNECTIONS:

Airflow plays a crucial role in both retrieving and pushing data to external
systems, and it introduces a core concept called Connections to securely store
the credentials required for communication with these external systems.

A Connection essentially comprises a set of essential parameters, such


as a username, password, and hostname, along with details about the specific
type of system it connects to. Each Connection is also assigned a unique name,
referred to as the conn_id.

You have the flexibility to manage Connections through either the user
interface (UI) or the command-line interface (CLI). To gain a deeper
understanding of creating, modifying, and handling Connections, refer to the
section on Managing Connections which provides comprehensive information.
Moreover, there are customizable options available for storing Connections and
choosing the backend that suits your requirements.

Connections can be directly utilized in your own code, harnessed via
Hooks, or incorporated within templates.
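Besides the UI, a Connection can also be created from the CLI; a minimal example follows, where the connection id, type, and host are placeholders rather than values used elsewhere in this book:

Shell code/Terminal

airflow connections add 'my_http_service' \
    --conn-type 'http' \
    --conn-host 'https://api.example.com'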

HOOKS:

A Hook emerges as a pivotal abstraction, offering a sophisticated


interface to facilitate seamless interactions with external platforms. Its primary
function is to alleviate the burden of crafting intricate, low-level code that would
otherwise be required for direct engagement with APIs or specialized libraries.
By employing Hooks, users can navigate external systems with ease,
abstracting away the complexities inherent in direct code interfacing. Moreover,
Hooks play a foundational role in the creation of Operators, serving as the
essential building blocks for constructing these higher-level components in the
Airflow ecosystem.

An inherent strength of Hooks lies in their seamless integration with


Connections, which are instrumental in acquiring the necessary credentials for
external interactions. This integration not only enhances security by centralizing
credential management but also promotes reusability and modularity across
workflows. It's noteworthy that Hooks often come equipped with a default
conn_id, simplifying the configuration process. As an illustrative example, the
PostgresHook, a specific implementation of a Hook tailored for PostgreSQL
databases, automatically seeks the Connection with a conn_id of
"postgres_default" in instances where an explicit identifier is not provided. This
nuanced integration of Hooks and Connections underscores the user-friendly
design philosophy of Apache Airflow, streamlining the development of robust
and modular data workflows.

with DAG(
    dag_id='weather_api_dag',
    catchup=False,
    schedule_interval='@daily',
    default_args=args
) as dag:

    task_fetch_geocodes = SimpleHttpOperator(
        task_id='fetch_geo_codes',
        http_conn_id='open_weather_geo_codes',
        endpoint=f'direct?q=Hyderabad,Telangana&limit=1&appid={API_KEY}',
        method='GET',
        response_filter=lambda response: json.loads(response.content),
        log_response=True
    )

    task_fetch_weather = SimpleHttpOperator(
        task_id='fetch_weather',
        http_conn_id='open_weather_data',
        endpoint='weather?lat={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lat") }}'
                 '&lon={{ ti.xcom_pull(task_ids=["fetch_geo_codes"])[0][0].get("lon") }}'
                 f'&appid={API_KEY}',
        method='GET',
        response_filter=lambda response: json.loads(response.content),
        log_response=True
    )

    task_convert_to_csv = PythonOperator(
        task_id='convert_to_csv',
        python_callable=create_csv_file
    )

Here we are simply using the connections created in the Admin section of the Airflow UI.
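As a complement to the connection-based operators above, the PostgresHook mentioned earlier can also be used directly inside a task. A minimal sketch follows; the postgres_default connection id is Airflow's default, while the table name is hypothetical, and the decorated function would be called inside a DAG definition:

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def count_products():
    # the hook resolves host and credentials from the "postgres_default" Connection
    hook = PostgresHook(postgres_conn_id="postgres_default")
    rows = hook.get_records("SELECT count(*) FROM products")  # hypothetical table
    return rows[0][0]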

QUICK RECAP:

Airflow schedules, connections, and hooks are essential components that


collectively enable efficient and reliable workflow automation. Schedules define
when and how often tasks should run, connections facilitate communication with
external systems and data sources, while hooks serve as the bridge for
executing tasks and exchanging data. These foundational elements empower
organizations to streamline processes, orchestrate complex workflows, and
leverage data seamlessly, making Airflow a valuable tool in the realm of data
engineering and workflow management.

CHAPTER-7

XCOM AND SENSORS


XCOM, or "cross-communication," functions as a vital element within Apache Airflow's framework, serving as the metaphorical conductor's baton that orchestrates the exchange of information
and data between tasks within a Directed Acyclic Graph (DAG).
Much like a bridge connecting musical notes, XCOM ensures a
seamless connection between tasks, allowing for the transfer of
small amounts of data and metadata. This mechanism empowers
tasks to influence one another's behavior, making workflows not only
interconnected but also dynamic, responsive, and adaptable. Throughout our
exploration, we will navigate the intricate landscape where XCOM becomes the
guiding thread, facilitating the graceful flow of data, orchestrating a harmonious
composition within workflows.

On the flip side, Sensors in Airflow's orchestral analogy act as vigilant


sentinels, embodying the ears that keenly listen for the opportune moment to
act. These sensors function by pausing the execution of a workflow until specific
external conditions are met, allowing for meticulous coordination with external
systems, services, and data sources. In our exploration, we will uncover the
pivotal role of Sensors in ensuring that workflows remain finely attuned to the
rhythms of external events and resources. With an acute sense of timing,
Sensors uphold the baton, skillfully orchestrating workflows and contributing to a
melodious and reliable composition, alongside the finesse of XCOM, within the
realm of data workflows.

Let us assume that we have a scenario to extract data from PostgreSQL, transform it, and load it into another database (MongoDB).

Here’s why XCOMs are used in Apache Airflow:

Data Sharing: Airflow tasks often need to share information or data between
them. For example, one task may generate some data, and another task may
need to use that data as input. XCOMs provide a mechanism for these tasks to
share data efficiently.

Dependency Management: Airflow allows you to define task dependencies


within a DAG. XCOMs help propagate the data produced by one task to another
as it follows the directed acyclic graph’s dependencies.

Logging and Monitoring: XCOMs are also useful for logging and monitoring.
You can log and capture the values of XCOMs, which can be valuable for
debugging and tracking the progress of tasks in a workflow.

Parameterization: XCOMs can be used to parameterize and customize tasks.


You can pass parameters and configuration data to tasks through XCOMs,
allowing you to make tasks more dynamic and adaptable.

Dynamic Workflows: XCOMs can be used to create dynamic workflows that


respond to the results of previous tasks. For example, you can use the output of
a task to determine which downstream tasks should be executed.

In Airflow, XCOMs are typically implemented as a key-value store where a


task can push a value, and another task can pull that value using the key. The
XCOM system ensures that data is stored and retrieved in a consistent and
thread-safe manner.
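A minimal sketch of that push/pull contract (the DAG, task ids, and key below are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def producer(ti):
    # push a small value under an explicit key
    ti.xcom_push(key="row_count", value=42)


def consumer(ti):
    # pull the value back in a downstream task
    count = ti.xcom_pull(task_ids="produce", key="row_count")
    print(f"upstream produced {count} rows")


with DAG(dag_id="xcom_demo", start_date=datetime(2023, 10, 14),
         schedule=None, catchup=False) as dag:
    produce = PythonOperator(task_id="produce", python_callable=producer)
    consume = PythonOperator(task_id="consume", python_callable=consumer)
    produce >> consume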

While XCOMs can be powerful, it’s essential to use them judiciously, as passing
too much data between tasks can lead to performance issues and make your
DAGs harder to manage. They are best suited for sharing small pieces of
metadata or data necessary for task coordination within your workflow.

Let us code the full implementation of the above scenario. We will combine most of the topics we have learnt so far, such as hooks, connections, the TaskFlow API, and operators, to implement it.

import json
from datetime import datetime

from airflow.operators.python import PythonOperator
from airflow.providers.mongo.hooks.mongo import MongoHook
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.decorators import task, dag

from utils.data_transformer import transform


@dag(dag_id="products_etl_workflow",
     schedule=None,
     start_date=datetime(2023, 10, 14), catchup=False)
def products_etl_workflow():
    @task()
    def load_to_mongo(records):
        hook = MongoHook(conn_id="mongo_local")
        conn = hook.get_conn()
        db = conn.northwind
        collection = db.products
        print(records)
        records = json.loads(records)
        for record in records:
            collection.insert_one(record)

    extract = PostgresOperator(
        task_id="fetch_data_from_northwind",
        postgres_conn_id="postgresql-local-northwind",
        sql="select p.product_id,"
            "p.product_name,"
            "p.quantity_per_unit,"
            "p.unit_price,"
            "p.units_in_stock,"
            "p.units_on_order,"
            "c.category_id,"
            "c.category_name,"
            "c.description "
            "from products p, categories c "
            "where p.category_id = c.category_id",
        do_xcom_push=True
    )

    data_transform = PythonOperator(
        task_id='convert_to_json',
        python_callable=transform
    )

    extract >> data_transform >> load_to_mongo(data_transform.output)


products_etl_workflow()

The above DAG queries PostgreSQL using the PostgresOperator by executing the SQL. The results from the task are pushed to XCom and are available to our data_transform task; the transformed result is then passed to a TaskFlow-enabled method, which eventually pushes the data to MongoDB.
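The transform callable imported from utils.data_transformer is not reproduced in the listing above. A minimal sketch of what it could look like is shown below; the column list and the use of ti.xcom_pull are assumptions based on the SQL query and on how load_to_mongo consumes the result, not the author's exact implementation.

# utils/data_transformer.py - a possible sketch, not the original implementation.
import json

COLUMNS = ["product_id", "product_name", "quantity_per_unit", "unit_price",
           "units_in_stock", "units_on_order", "category_id", "category_name",
           "description"]


def transform(ti=None):
    # Pull the rows that the extract task pushed to XCom as its return value.
    rows = ti.xcom_pull(task_ids="fetch_data_from_northwind")
    # Convert each row tuple into a dictionary keyed by column name.
    records = [dict(zip(COLUMNS, row)) for row in rows]
    # Return a JSON string; json.loads() in load_to_mongo expects exactly this shape.
    return json.dumps(records, default=str)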

The connections created from the Airflow UI for MongoDB and PostgreSQL, respectively, look as below.

Once the DAG is run and you face no issues, your run should look something like the below, where all the tasks succeed.

Once the ETL is completed, this is what you can observe.

Notice the similarity: we have a total of 77 records in both databases after running the query, so all the data has been transformed successfully and ingested into the destination database.

AIRFLOW SENSORS:

Airflow sensors are a crucial component of Apache Airflow. They are used to pause the execution of a workflow until a certain external condition or event is met, and they are designed to wait for specific conditions to be satisfied before allowing downstream tasks to proceed. Sensors play a vital role in ensuring that your workflows run smoothly and efficiently by waiting for the right conditions to be in place.

Sensors are used in a variety of scenarios, often when working with external systems, services, or data sources. For example, here are a few common use cases for Airflow sensors (a minimal sketch follows the list):

File Sensor: A File Sensor can be used to wait for the existence of a specific file
in a directory. This is useful when a task downstream in the workflow depends
on the availability of a file.

HTTP Sensor: An HTTP Sensor can wait for a specific URL to return a particular
status code or contain specific content. This is useful for monitoring external web
services or APIs.

Database Sensor: A Database Sensor can be used to wait for a particular condition in a database, such as a specific record being inserted or updated. This is helpful when tasks depend on database changes.

Time Sensor: A Time Sensor can be used to pause a workflow until a specific
time or time interval is reached. It’s useful for scheduling tasks to run at precise
times.

External Task Sensor: This sensor waits for the successful completion of
another task in the same or a different DAG. It allows for inter-DAG
dependencies and is crucial for coordinating tasks across different workflows.
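As promised, here is a minimal sketch of a sensor in action; the connection id and file path are illustrative, and the other sensor types listed above follow the same pattern of poking until their condition is satisfied.

from datetime import datetime

from airflow.decorators import dag
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def process_file():
    print("File arrived, processing it...")


@dag(dag_id="file_sensor_example", schedule=None,
     start_date=datetime(2023, 10, 14), catchup=False)
def file_sensor_example():
    # Pause the workflow until the expected file shows up.
    wait_for_file = FileSensor(
        task_id="wait_for_products_file",
        fs_conn_id="fs_default",           # assumed filesystem connection
        filepath="incoming/products.csv",  # illustrative path
        poke_interval=60,                  # check every 60 seconds
        timeout=60 * 60,                   # give up after one hour
        mode="reschedule",                 # free the worker slot between checks
    )
    process = PythonOperator(task_id="process_file",
                             python_callable=process_file)
    wait_for_file >> process


file_sensor_example()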

QUICK RECAP:

Apache Airflow's sensors and XCOM mechanisms play pivotal roles in crafting
resilient and effective data workflows. Sensors empower workflows to gracefully
handle external dependencies, ensuring execution pauses until specific
conditions are met. This safeguards data integrity, mitigating the risk of errors
arising from incomplete or unavailable resources. On the other hand, XCOMs
facilitate seamless sharing of data and metadata among tasks, fostering
coordination, dynamic workflows, and efficient information transfer. Through
adept implementation of sensors and XCOMs in your Airflow Directed Acyclic
Graphs (DAGs), you gain the ability to orchestrate intricate data processes with
heightened reliability, responsiveness, and adaptability. This, in turn, streamlines
data pipelines and elevates your workflow management capabilities to new
heights.

CHAPTER-8

AIRFLOW CONFIGURATION DEEP DIVE

In this chapter, we'll explore advanced configurations to enhance the experience and effectiveness of using Apache Airflow. The "airflow.cfg" file is a critical configuration component in Apache Airflow, housing all the settings and parameters that dictate the behavior and performance of an Apache Airflow deployment. We'll dive into the "airflow.cfg" file, covering its key sections and parameters, and explaining how you can customize these configurations to suit your requirements.

"[CORE]" SECTION

• TaskExecutor: Here, you specify the executor responsible for task execution, such as LocalExecutor or CeleryExecutor.
• DAGs Location: This parameter designates the directory where your DAG
files are stored.
• Example DAGs: Determine whether example DAGs should be loaded
during initialization.
• Database Connection: Configure the URL for the database connection.
• Concurrency: This sets the maximum number of concurrently running task
instances.

"[SCHEDULER]" SECTION

• Heartbeat Frequency: Control how often the scheduler sends heartbeats to the database.
• Scheduler-Worker Heartbeat: Configure the frequency of heartbeats
between the scheduler and workers.

"[WEBSERVER]" SECTION

• Web UI Port: Specify the port for the Airflow web interface.
• Web UI Host: Define the host that serves the web interface.
• Web Worker Timeout: Set the timeout for web server workers.

"[DATABASE]" SECTION

• Configuration for the database that acts as the major storage for the data and metadata of the entire Airflow engine.

"[EMAIL]" SECTION

• Configuration for sending email notifications.

MODIFYING CONFIGURATIONS

Modifying the airflow.cfg file is pivotal as it allows you to adapt Airflow to your precise needs. Here's a step-by-step guide to achieve this:

Locate the Configuration File: Begin by finding the airflow.cfg file within
your Airflow installation directory, typically situated in /etc/airflow/ or
$AIRFLOW_HOME. Employ a text editor to access the file, and remember that
administrative privileges might be necessary depending on your system's
permissions.

Modify Parameters: Inside the configuration file, navigate to the section and parameter that necessitate customization, and modify the parameter's value in line with your specific requirements. For instance, if you intend to switch the database from the default one, navigate to the [database] section and edit the parameter called "sql_alchemy_conn".

Save Your Modifications: After implementing your desired changes, save the airflow.cfg file to preserve your configurations.

Restart Airflow Services and Validate Changes: To activate the new configurations, it is imperative to restart the Airflow webserver, scheduler, and worker processes. To ensure the changes have taken effect, you can verify by accessing the Airflow web UI, scrutinizing logs, or executing Airflow commands.
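For instance, a quick way to read back the effective values is through Airflow's configuration API; the sketch below assumes Airflow is installed in the active Python environment and reads whatever airflow.cfg lives in $AIRFLOW_HOME.

# Inspect the effective configuration through Airflow's configuration API.
from airflow.configuration import conf

print(conf.get("core", "executor"))              # e.g. LocalExecutor or CeleryExecutor
print(conf.get("core", "dags_folder"))           # where DAG files are picked up from
print(conf.get("database", "sql_alchemy_conn"))  # the metadata database connection string
print(conf.getint("core", "parallelism"))        # maximum concurrently running task instances

On recent Airflow versions the same check can be done from the command line with "airflow config get-value <section> <key>".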

CHOOSING THE DATABASE

Choosing the right database for Airflow's data and metadata is crucial for its execution and performance. Airflow supports a wide variety of databases, such as PostgreSQL, MySQL, and SQLite.

Please check the airflow.cfg file for further settings, such as the number of executors, database pools, load balancing, monitoring and scaling, task optimizations, logging, etc.

Covering all of these is out of scope for this book, which aims to stay crisp and focused on the most important aspects of the concepts.

CHAPTER-9

AIRFLOW UI
The Apache Airflow UI is your portal to a world of streamlined workflow
management. It's not just a collection of features; it's your command center, your
mission control, your window into the choreography of your data pipelines.

Imagine a central hub where you can effortlessly orchestrate your workflows. The DAGs tab is your roster of carefully crafted plans, each ready to be triggered with a click. Need to see the bigger picture? The Tree tab unveils the interconnected tapestry of tasks, dependencies laid bare like a strategist's
be triggered with a click. Need to see the bigger picture? The Tree tab unveils
the interconnected tapestry of tasks, dependencies laid bare like a strategist's
map. The Graph, your detailed spyglass, zooms in on individual tasks, their
execution times laid out like a performance report.

But Airflow isn't just about pipelines; it's about resources. The Pools tab is your resource pool manager, letting you allocate execution slots like a seasoned general. Need to connect to external databases or cloud storage? The
Connections tab is your diplomatic envoy, forging secure pathways for data
exchange. Variables, like secret intel, are stored and accessed with ease,
keeping your workflows dynamic and adaptable.

And if you ever need to retrace your steps, the Audit Logs are your faithful
chronicler, documenting every action taken within this digital realm.

The Airflow UI is more than just an interface; it's a symphony conductor, a data whisperer, a silent guardian of your workflows. It's the power of Airflow
made tangible, waiting for your touch to orchestrate the dance of your data.

Log in with your username and password, and the landing page should look as below.

You should see some example DAGs listed on the page. Let's start exploring all the menus, from right to left.

DOCS MENU

In this section, you'll find a link to the official Airflow documentation, which
provides comprehensive information about Airflow's features and usage. You
can also access the official Airflow website directly from here. Additionally, there
is a link to the GitHub repository, allowing you to explore the Airflow source code
and contribute to the project. You'll also find links to the API documentation,
provided in both Redoc and Swagger formats, enabling you to interact with the
API and test its functionality.
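As a small illustration of interacting with that API, the sketch below lists the registered DAGs over the stable REST API; it assumes the webserver is reachable at localhost:8080, that the basic-auth API backend is enabled, and that the credentials are placeholders.

import requests

# List the DAGs registered in this Airflow instance via the stable REST API.
response = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("airflow_user", "airflow_password"),  # placeholder credentials
)
response.raise_for_status()
for dag in response.json()["dags"]:
    print(dag["dag_id"], "paused:", dag["is_paused"])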

ADMIN MENU

1. Connections: The "Connections" submenu is where you can define and manage connections to external systems or services. These connections
are crucial for tasks within your workflows to interact with databases,
cloud services, APIs, and other external resources. In this section, you
can create, edit, or delete connection configurations, ensuring seamless
integration with external systems.

2. Pools: Pools are a way to control the concurrency and resource allocation for different tasks within your DAGs. The "Pools" submenu allows you to manage task pools, specifying the maximum number of concurrent task executions for each pool. This can help prevent resource contention and optimize the execution of tasks (see the sketch after this list).

3. Variables: Within the "Variables" submenu, you can define and manage global variables. These variables are accessible to your DAGs and can be used for storing configuration parameters, constants, or any other data that needs to be shared across multiple workflows. This section provides a centralized location to create, edit, and delete variables (see the sketch after this list).

4. XComs: Cross-communication, or XCom, is the mechanism that allows tasks within a DAG to share small amounts of data or metadata. In the "XComs"
submenu, you can view and manage the data exchanged between tasks. This is
valuable for monitoring the flow of data within your workflows, troubleshooting
issues, and ensuring seamless coordination between tasks.
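As referenced in the Pools and Variables entries above, the brief sketch below shows how a DAG consumes what you configure in these submenus; the pool name and variable key are assumed to have been created beforehand under the Admin menu.

from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable


@dag(dag_id="admin_menu_usage_example", schedule=None,
     start_date=datetime(2023, 10, 14), catchup=False)
def admin_menu_usage_example():
    # pool="etl_pool" assumes a pool with that name exists under Admin -> Pools.
    @task(pool="etl_pool")
    def limited_task():
        # Variable.get reads a value defined under Admin -> Variables.
        batch_size = Variable.get("batch_size", default_var="100")
        print(f"Running with batch size {batch_size}")

    limited_task()


admin_menu_usage_example()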

BROWSE MENU

DAG Runs: In the first alcove of the "Browse" menu, the "DAG Runs"
submenu provides a panoramic view of the historical run instances of your
DAGs. Here, you can scrutinize the life story of each DAG run, from its inception
to completion, analyzing timestamps and identifying trends. It's your historical
compass, helping you track the trajectory of your workflows.

Task Instances: The "Task Instances" submenu, serving as a vigilant guardian, enables you to keep an eye on individual task instances. It presents
you with a real-time tableau of task statuses—those that have gracefully
concluded their journey and those that may need your attention. This tab is your
operational console for assessing the current state of your workflows.

Task Reschedule: The "Task Reschedule" submenu offers a portal to rectify and refine the execution of your tasks. When tasks encounter turbulence,
this tab empowers you to intervene with grace. It allows you to reschedule tasks
that might have hit a snag, ensuring the continuity and reliability of your data
pipeline.

Triggers: Within the "Triggers" submenu, you are given the authority to
awaken your tasks when the timing is impeccable. This feature allows you to
nudge task instances into action at your discretion, affording you greater control
over the execution sequence and dependencies. It is your orchestral baton,
conducting the performance of your workflows.

Jobs: The "Jobs" submenu is your backstage pass to the laborious work
of Airflow's internal components. It uncovers the status and intricacies of jobs
responsible for executing tasks. This section provides insights into the
operational heart of your Airflow environment, helping you understand and
optimize the performance of your workflow engine.

SECURITY MENU

List Users: The "List Users" sub-menu acts as your registry of individuals
who interact with your Airflow environment. It provides an overview of all users
and their associated permissions, making it easy to manage access control.
Here, you can add, remove, or modify user accounts, ensuring that your Airflow
instance is secure and accessible only to authorized personnel.

List Roles: Roles are the cornerstone of user permissions in Airflow. In the "List Roles" sub-menu, you have the ability to define and manage roles,
tailoring permissions and access levels to suit your organization's requirements.
You can create custom roles, assign users to them, and ensure that each user
operates within a well-defined scope of actions.

User Statistics: The "User Statistics" sub-menu offers valuable insights into user activity within your Airflow environment. It provides data on login times,
the frequency of interactions, and the tasks performed by each user. This
information can be invaluable for tracking user behavior, ensuring compliance,
and optimizing the user experience.

DAG MENU

The "DAG" menu in Apache Airflow is a pivotal portal for orchestrating,


monitoring, and managing Directed Acyclic Graphs (DAGs), which represent the
very core of your data workflows. This menu serves as the command center for
users to create, configure, and maintain their DAGs, offering a bird's-eye view of
the complete workflow structure. It facilitates actions such as triggering DAG
runs, pausing or unpausing workflows, and accessing task logs and visual
representations of task dependencies. The "DAG" menu is where the intricate
architecture of your data pipelines is brought to life, allowing for meticulous
control and oversight of your workflow orchestration processes.

FINAL CHAPTER: THE CONCLUSION
In the pages of this comprehensive exploration into Apache Airflow, we
embarked on a journey spanning the fundamental pillars of its architecture and
functionality. The initial chapters provided a nuanced understanding of the
Airflow ecosystem, delving into the core concepts underpinning its architecture.
From unraveling the intricacies of Airflow Operators to navigating the dynamics
of the TaskFlow API, our exploration ventured deep into the mechanics of
workflow orchestration. The crucial aspects of Catchup and Backfill were
unwrapped, shedding light on their pivotal roles in ensuring the completeness
and integrity of data workflows.

Moving forward, the exploration extended to the essential components of Scheduling, Connections, and Hooks, elucidating how these elements form the
backbone of Airflow's workflow management capabilities. The discussion then
shifted towards the transformative capabilities of Xcom and Sensors, revealing
their roles as conduits for seamless communication between tasks and vigilant
sentinels ensuring workflows remain attuned to external events.

As we delved into the intricacies of Airflow Configuration, a profound understanding of the system's internal mechanisms unfolded. This deep dive
served as a compass for readers to navigate the configuration landscape
effectively. Navigating through the Airflow UI, we witnessed how visual
representation enhances the management and monitoring of workflows, offering
a user-friendly interface.

In conclusion, this exploration has equipped readers with a holistic comprehension of Apache Airflow, covering its architecture, operators, API,
scheduling mechanisms, and crucial components like Xcom and Sensors. The
deep dive into configuration aspects and the user interface demystified the inner
workings, providing a robust foundation for orchestrating and managing data
workflows. This comprehensive journey ensures that readers are well-versed in
the intricacies of Apache Airflow, poised to apply this knowledge to streamline
and optimize their workflow management endeavors.
