
Apache Airflow 101

Essential concepts and tips for beginners

Powered by Astronomer
Table of Contents

What is Apache Airflow?
Why Apache Airflow?
Why Open Source Software?
Apache Airflow 101: Core Concepts & Components
    Core Concepts
    Core Components
Apache Airflow 2
    Major Features in Airflow 2.0
    Major Features in Airflow 2.2
    Major Features in Airflow 2.3
Useful Resources to Get Started with Airflow
    Airflow Essentials
    Integrations
About Astronomer

Editor’s Note

Welcome to the Apache Airflow 101 ebook, brought to you by Astronomer. We believe that Airflow is the best-in-class data orchestration technology for adapting to the ever-changing data landscape. In this ebook, we’ve gathered and explained key Airflow concepts and components to help you get started using this open source platform. We’ve also compiled some guides and tutorials for more advanced Airflow features. This ebook is everything you need to successfully kick off your Airflow journey.

Share this ebook with others!
1. What is Apache Airflow?

Apache Airflow is the world’s most popular data orchestration platform, a framework for programmatically authoring, scheduling, and monitoring data pipelines. It was created as an open-source project at Airbnb and later brought into the Incubator Program of the Apache Software Foundation, which named it a Top-Level Apache Project in 2019. Since then, it has evolved into an advanced data orchestration tool used by organizations spanning many industries, including software, healthcare, retail, banking, and fintech.

Today, with a thriving community of more than 2k contributors, 25k+ stars on GitHub, and thousands of users, including organizations ranging from early-stage startups to Fortune 500s and tech giants, Airflow is widely recognized as the industry’s leading data orchestration solution.

Apache Airflow’s popularity as a tool for data pipeline automation has grown for a few reasons:

• Proven core functionality for data pipelining. Airflow’s core capabilities are used by thousands of organizations in production, delivering value across scheduling, scalable task execution, and UI-based task management and monitoring.

• An extensible framework. Airflow was designed to make integrating existing data sources as simple as possible. Today, it supports over 80 providers, including AWS, GCP, Microsoft Azure, Salesforce, Slack, and Snowflake. Its ability to meet the needs of simple and complex use cases alike makes it both easy to adopt and scale.

• Scalability. From running a few pipelines to thousands every day, Airflow manages workflows using a reliable scheduler. Additionally, you can use parameters to fine-tune the performance of Airflow to fit any use case.

• A large, vibrant community. Airflow boasts thousands of users and more than 2k contributors who regularly submit features, plugins, content, and bug fixes to ensure continuous momentum and improvement. So far, Airflow has reached 15k+ commits and 25k+ GitHub stars.

Apache Airflow continues to grow thanks to its active and expanding community. And because it belongs to the Apache Software Foundation and is governed by a group of PMC members, it can live forever.

As a result, hundreds of companies, including Fortune 500s, tech giants, and early-stage startups, are adopting Airflow as their preferred tool to programmatically author, schedule, and monitor workflows. Among the biggest names, you’ll find Adobe, Bloomberg, Asana, Dropbox, Glassdoor, HBO, PayPal, Tesla, ThoughtWorks, WeTransfer, and more!
Apache Airflow: Built on a Strong and Growing Community

• 25k+ GitHub stars
• 21k+ Slack community members
• 8m+ monthly downloads
• 1.9k+ contributors

Apache Airflow Core Principles

Airflow is built on a set of core ideals that allow you to leverage the most popular open source workflow orchestrator on the market while maintaining enterprise-ready flexibility and reliability.

• Flexible: Fully programmatic workflow authoring allows you to maintain full control of the logic you wish to execute.
• Extensible: Leverage a robust ecosystem of open source integrations to connect natively to any third-party datastore or API.
• Open Source: Get the optionality of an open-source codebase while tapping into a buzzing and action-packed community.
• Scalable: Scale your Airflow environment to infinity with a modular and highly available architecture across a variety of execution frameworks.
• Secure: Integrate with your internal authentication systems and secrets managers for a platform ops experience that your security team will love.
• Modular: Plug into your internal logging and monitoring systems to keep all of the metrics you care about in one place.
2. Why Apache Airflow?

Our Field CTO Viraj Parekh and Lead Developer Advocate Kenten Danas gathered some of the most common questions data practitioners ask us on a daily basis. If you’d like to find out why Airflow is the best tool for data orchestration, here are their answers.

Q: Why would you run your pipelines as code?

At Astronomer, we believe using a code-based data pipeline tool like Airflow should be a standard. There are many benefits to that solution, but a few come to mind as high-level concepts:

Firstly, code-based pipelines are dynamic. If you can write it in code, then you can do it in your data pipeline. And that’s really powerful.

Secondly, code-based pipelines are highly extensible. If your external system has an API, there’s a way to integrate it with Airflow.

Finally, they are more manageable. Because everything is in code, these pipelines can integrate seamlessly into your source control, CI/CD, and general developer workflows.

Q: What about a company that is looking at a drag-and-drop tool or other low/no-code approach? Why should they consider Airflow?

We never think of it as “Airflow or”; instead, it’s always “Airflow and.” Because Airflow is so extensible, you can still use those drag-and-drop tools and then orchestrate and integrate them with other pipelines using Airflow. This approach allows you to have a one-stop shop for orchestration without necessarily having to make a giant migration effort upfront or convince everybody at one time.

As those internal teams get more comfortable with Airflow and see all the benefits, they’ll naturally start to transfer to Airflow.
Q: How can functional data pipelines improve my business?

Data pipelines help users manage data workflows that need to be run on a consistent, reliable schedule. You can use data pipelines to help you with simple use cases, such as running a report about yesterday’s sales. Data pipelines empower those users who are looking for baseline insights into the business.

You can also use data pipelines to help with more complex use cases, such as running machine learning models, training AI, and generating complex business intelligence reports that help you make critical decisions.

So, if you want something that’s flexible, standardized, and scalable, and that has the ongoing support of the open-source community, Airflow is perfect for you.

Q: Is Apache Airflow hard to learn?

Many resources exist for learning how to write code for Airflow, including Astronomer’s Airflow Guides. Airflow becomes more difficult to learn once you try to run it at business scale. Luckily, there are tools like Astro that allow for abstracting and managing the infrastructure easily and efficiently.

Q: What companies use Apache Airflow?

Walmart, Tinder, JP Morgan, Tesla, Revolut... just to name a few. Currently, hundreds of companies around the world, from various industries like healthcare, AI, fintech, or e-commerce, use Apache Airflow as part of their modern data stack to help them make sense of their data and drive better business decisions.

Get Apache Airflow Support

Schedule time with Astronomer experts to learn how we can help you address your Airflow issues.

Contact Us
3. Why Open Source Software?

The massive increase in open source software (OSS) projects is changing both the tech scene and the business scene on a global scale. So why do we love open source? Because it is:

Innovative

According to this 2016 article in Wired, Facebook decided to stop “treating data center design like Fight Club” by making its architecture open and accessible to everyone. Companies like Microsoft, HP, and even Google followed suit by open-sourcing many of their technologies that had been kept secret for years. By making code for their projects accessible to the public, companies could freely adapt golden standards in the open-source world and create better integrations between products, all while freeing up time to innovate elsewhere.

Reliable

We’ve all had “technical difficulties” when our technology operated unreliably. Professionally, that can cause a variety of outcomes, from mild annoyance to genuine detriment. When everybody’s using the same components, everybody’s optimizing the same components. As Linux creator Linus Torvalds said, “Given enough eyeballs, all bugs are shallow” (Linus’s Law). In other words, the more people who can see and test a set of code, the more likely any flaws will be caught and fixed quickly.

Diverse

It’s a huge benefit to have a large, diverse community of developers look at and contribute to the same system. Instead of a few people with the same view of the world building components, an entire community with varying perspectives and strengths contributes to the system (and has a stake in its success and adoption), which makes the system more resilient.

Transparent

When software is transparent, people aren’t tied down to a proprietary system protected in a black box. Users and builders of open source software can see everything; they can open GitHub, look at the source code, and trust that the code has been reviewed publicly. This gives decision-makers insight into what’s going on and where components can be swapped around, plus the control they need to do it.

Agile

When components live in a community-curated open source world, they are
built to “play nicely” with other components. This flexibility allows us to stay
on the cutting edge. With open-source, any organization can take a piece of
code and customize it to best fit their needs. When it comes to tech, we want
to live in a world where we’re free to explore, invent best-in-class technology
and use it in revolutionary ways. We can do it through
open source.

Apache Airflow
Fundamentals
The Astronomer Certification for Apache Airflow Fundamentals is
a course and exam focused on imparting an understanding of the
basics of the Airflow architecture, and the ability to create basic
data pipelines for scheduling and monitoring tasks.

Concepts Covered:
• User Interface
• Scheduling
• Backfill & Catchup
• DAG Structure
• Architectural Components
• Use Cases

Get Certified

4. Apache Airflow 101: Core Concepts & Components

Airflow Core Concepts

DAGs

In Airflow, a DAG is your data pipeline and represents a set of instructions that must be completed in a specific order. This is beneficial to data orchestration for a few reasons:

• DAG dependencies ensure that your data tasks are executed in the same order every time, making them reliable for your everyday data infrastructure.
• The graphing component of DAGs allows you to visualize dependencies in Airflow’s user interface.
• Because a DAG contains no cycles, every path through it flows in one direction, which makes it easy to develop and test your data pipelines against expected outcomes.

Learn more:
ARTICLE: What exactly is a DAG?
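To make this concrete, here is a minimal sketch of a DAG with two dependent tasks; the DAG id, schedule, and commands are illustrative placeholders rather than examples taken from this ebook:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    # Trivial placeholder logic for the second task.
    print("hello from Airflow")


with DAG(
    dag_id="example_hello_dag",          # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",          # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pretend this extracts data'",
    )

    greet = PythonOperator(
        task_id="greet",
        python_callable=say_hello,
    )

    # The bitshift operator defines the dependency: extract runs before greet.
    extract >> greet
```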
Operators

Operators are the building blocks of Airflow. They contain the logic of how data is processed in a pipeline. There are different operators for different types of work: some operators execute general types of code, while others are designed to complete very specific types of work.

[Figure: an operator depicted as a wrapper around a few lines of Python code that open a file and print its contents]

• Action Operators execute pieces of code. For example, a Python action operator will run a Python function, a bash operator will run a bash script, etc.
• Transfer Operators are more specialized, designed to move data from one place to another.
• Sensor Operators, frequently called “sensors,” are designed to wait for something to happen, such as a file landing on S3 or another DAG finishing its run.

Learn more:
TUTORIAL: Operators 101
TUTORIAL: Deferrable Operators
TUTORIAL: Sensors 101

Tasks

A task is an instance of an operator. In order for an operator to complete work within the context of a DAG, it must be instantiated through a task. Generally speaking, you can use tasks to configure important context for your work, including when it runs in your DAG.

When you create an instance of an operator in a DAG and provide it with its required parameters, it becomes a task.

Learn more:
GUIDE: Using Task Groups in Airflow
GUIDE: Passing Data Between Airflow Tasks
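As a sketch of how different operator types become tasks, the snippet below pairs a core FileSensor with a BashOperator; the connection id and file path are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="example_sensor_dag",            # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Sensor task: waits for a file to appear before anything else runs.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",             # placeholder filesystem connection
        filepath="/tmp/incoming/data.csv",   # placeholder path
        poke_interval=60,                    # check once a minute
    )

    # Action task: processes the file once the sensor succeeds.
    process_file = BashOperator(
        task_id="process_file",
        bash_command="wc -l /tmp/incoming/data.csv",
    )

    wait_for_file >> process_file
```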
A data pipeline

A “data pipeline” describes the general process by which data moves from one system into another. From an engineering perspective, it can also be represented by a DAG. Each task in a DAG is defined by an operator, and there are specific downstream or upstream dependencies set between tasks. A DAG run either extracts, transforms, or loads data, essentially becoming a data pipeline.

Learn more:
ARTICLE: Data Pipeline: Components, Types, and Best Practices

Providers

Airflow providers are Python packages that contain all of the relevant Airflow modules for interacting with external services. Airflow is designed to be an agnostic workflow orchestrator: you can do your work within Airflow, but you can also use other tools with it, like AWS, Snowflake, or Databricks. Most of these tools already have community-built Airflow modules, giving Airflow spectacular flexibility. Check out the Astronomer Registry to find all the providers.

The following diagram shows how these concepts work in practice. As you can see, by writing a single DAG file in Python using existing provider packages, you can begin to define complex relationships between data and actions.
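As a rough illustration of the provider model, assuming the apache-airflow-providers-http package is installed and an http_default connection is configured, an operator shipped by a provider is imported and used just like a core operator:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="example_provider_dag",     # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,            # trigger manually
    catchup=False,
) as dag:
    # This operator comes from the HTTP provider package, not Airflow core.
    call_api = SimpleHttpOperator(
        task_id="call_api",
        http_conn_id="http_default",   # connection pointing at the target API
        endpoint="health",             # placeholder endpoint
        method="GET",
        log_response=True,
    )
```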

Airflow Core Components

When working with Airflow, it is important to understand the underlying components of its infrastructure. Even if you mostly interact with Airflow as a DAG author, knowing which components are “under the hood” and why they are needed can be helpful for developing your DAGs, debugging, and running Airflow successfully. The first four components below run at all times, and the last two are situational components that are used only to run tasks or make use of certain features.

The infrastructure:

• Webserver: a Flask server running with Gunicorn that serves the UI
• Scheduler: a daemon responsible for scheduling jobs
• Metastore: a database where all metadata are stored
• Executor: defines how tasks are executed
• Worker: the process executing the tasks, as defined by the executor
• Triggerer: a process running asyncio to support deferrable operators
5. Apache Airflow 2

As Apache Airflow grew in adoption in recent years, a major release to expand on the project’s core strengths became long overdue. Astronomer was delighted to release Airflow 2.0 together with the community in December 2020.

Since then, various organizations and leaders within the Airflow community have worked in close collaboration, refining the scope of Airflow 2.0, enhancing existing functionality, and introducing changes to make Airflow faster, more reliable, and more performant at scale.

Major Features in Airflow 2.0

Airflow 2.0 includes hundreds of features and bug fixes, both large and small. Many of the significant improvements were influenced and inspired by feedback from Airflow’s 2019 Community Survey, which garnered over 300 responses.

A New Scheduler: Low-Latency + High-Availability

The Airflow scheduler has been key to the growth and success of the project since its creation in 2014. In fact, “Scheduler Performance” was the most asked-for improvement in the Community Survey. Airflow users have found that while the Celery and Kubernetes Executors allow for task execution at scale, the scheduler often limits the speed at which tasks are scheduled and queued for execution. While effects vary across use cases, it’s not unusual for users to grapple with induced downtime and a long recovery in the case of a failure, and to experience high latency between short-running tasks.

It is for that reason we introduced a new, refactored scheduler with the Airflow 2.0 release. The most impactful Airflow 2.0 change in this area is support for running multiple schedulers concurrently in an active/active model. Coupled with DAG serialization, Airflow’s refactored scheduler is now highly available, significantly faster, and infinitely scalable. Here’s a quick overview of the new functionality:

1. Horizontal Scalability. If the task load on one scheduler increases, a user can now launch additional “replicas” of the scheduler to increase the throughput of their Airflow Deployment.

2. Lowered Task Latency. In Airflow 2.0, even a single scheduler has proven to schedule tasks at much faster speeds with the same level of CPU and memory.
3. Zero Recovery Time. Users running 2+ schedulers see zero downtime and no recovery time in the case of a failure.

4. Easier Maintenance. The Airflow 2.0 model allows users to make changes to individual schedulers without impacting the rest and inducing downtime.

The scheduler’s now-zero recovery time and readiness for scale eliminate it as a single point of failure within Apache Airflow. Given the significance of this change, our team published “The Airflow 2.0 Scheduler,” a blog post that dives deeper into the story behind scheduler improvements alongside an architecture overview and benchmark metrics.

For more information on how to run more than one scheduler concurrently, refer to the official documentation on the Airflow Scheduler.

Full REST API

Data engineers have been using Airflow’s “Experimental API” for years, most often for triggering DAG runs programmatically. With that said, the API has historically remained narrow in scope and lacked critical elements of functionality, including a robust authorization and permissions framework.

Airflow 2.0 introduces a new, comprehensive REST API that sets a strong foundation for a new Airflow UI and CLI in the future. Additionally, the new API:

• Makes for easy access by third parties.
• Is based on the Swagger/OpenAPI spec.
• Implements CRUD (Create, Read, Update, Delete) operations on all Airflow resources.
• Includes authorization capabilities (parallel to those of the Airflow UI).

These capabilities enable a variety of use cases and create new opportunities for automation. For example, users now have the ability to programmatically set Connections and Variables, show import errors, create Pools, and monitor the status of the metadata database and scheduler.

For more information, reference Airflow’s REST API documentation.
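As a small sketch of programmatic access, assuming a local Airflow 2 deployment at http://localhost:8080 with the basic-auth API backend enabled (the credentials and DAG id below are placeholders), triggering a DAG run looks roughly like this:

```python
import requests

# Trigger a DAG run through the stable REST API.
response = requests.post(
    "http://localhost:8080/api/v1/dags/example_hello_dag/dagRuns",
    auth=("admin", "admin"),                # placeholder credentials
    json={"conf": {"source": "rest-api"}},  # optional run configuration
)
response.raise_for_status()

# The API returns the created DAG run, including its id and state.
print(response.json()["dag_run_id"])
```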
TaskFlow API

While Airflow has historically shined in scheduling and running idempotent tasks, it lacked a simple way to pass information between tasks. Let’s say you are writing a DAG to train some set of machine learning models. The first set of tasks in that DAG generates an identifier for each model, and the second set of tasks outputs the results generated by each of those models. In this scenario, what’s the best way to pass output from the first set of tasks to the latter?

Historically, XComs have been the standard way to pass information between tasks and would be the most appropriate method to tackle the use case above. As most users know, however, XComs are often cumbersome to use and require redundant boilerplate code to set return variables at the end of a task and retrieve them in downstream tasks.

With Airflow 2.0, we were excited to introduce the TaskFlow API and task decorator to address this challenge. The TaskFlow API implemented in 2.0 makes DAGs significantly easier to write by abstracting the task and dependency management layer from users. Here’s a breakdown of the functionality:

• A framework that automatically creates PythonOperator tasks from Python functions and handles variable passing. Now, variables such as Python dictionaries can simply be passed between tasks as return and input variables for cleaner and more efficient code.
• Task dependencies are abstracted and inferred as a result of the Python function invocation. This again makes for much cleaner and simpler DAG writing for all users.
• Support for Custom XCom Backends. Airflow 2.0 includes support for a new xcom_backend parameter that allows users to pass even more objects between tasks. Out-of-the-box support for S3, HDFS, and other tools is coming soon.

It’s worth noting that the underlying mechanism here is still XCom, and data is still stored in Airflow’s metadata database, but the XCom operation itself is hidden inside the PythonOperator and is completely abstracted from the DAG developer. Now, Airflow users can pass information and manage dependencies between tasks in a standardized Pythonic manner for cleaner and more efficient code.

To learn more, refer to Airflow documentation on the TaskFlow API and the accompanying tutorial.
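As a quick sketch of the decorator style described above (the DAG id, function names, and values are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_taskflow",         # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)
def example_taskflow():
    @task
    def extract():
        # The return value is passed downstream via XCom automatically.
        return {"model_id": 42}

    @task
    def train(payload: dict):
        print(f"training model {payload['model_id']}")

    # Calling the functions wires up both the data passing and the dependency;
    # no explicit bitshift operators are needed.
    train(extract())


example_taskflow()
```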
Task Groups

Airflow SubDAGs have long been limited in their ability to provide users with an easy way to manage a large number of tasks. The lack of parallelism, coupled with confusion around the fact that SubDAG tasks can only be executed by the Sequential Executor regardless of which executor is employed for all other tasks, made for a challenging and unreliable user experience.

Airflow 2.0 introduced Task Groups as a UI construct that doesn’t affect task execution behavior but fulfills the primary purpose of SubDAGs. Task Groups give a DAG author the management benefits of “grouping” a logical set of tasks with one another without having to look at or process those tasks any differently.

While Airflow 2.0 continues to support the SubDAG Operator, Task Groups are intended to replace it in the long term.
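Here is a minimal sketch of grouping tasks with the TaskGroup context manager; the group and task ids are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="example_task_groups",      # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")

    # Tasks inside the group are collapsed into one node in the UI,
    # but they still run as ordinary, independent tasks.
    with TaskGroup(group_id="transform") as transform:
        clean = BashOperator(task_id="clean", bash_command="echo clean")
        enrich = BashOperator(task_id="enrich", bash_command="echo enrich")
        clean >> enrich

    end = BashOperator(task_id="end", bash_command="echo end")

    start >> transform >> end
```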

Independent Providers

One of Airflow’s signature strengths is its sizeable collection of community-built operators, hooks, and sensors, all of which enable users to integrate with external systems like AWS, GCP, Microsoft Azure, Snowflake, Slack, and many more.

Providers have historically been bundled into the core Airflow distribution and versioned alongside every Apache Airflow release. As of Airflow 2.0, they are now split into their own airflow/providers directory such that they can be released and versioned independently from the core Apache Airflow distribution. Cloud service release schedules often don’t align with the Airflow release schedule and either result in incompatibility errors or prohibit users from being able to run the latest versions of certain providers. The separation in Airflow 2.0 allows the most up-to-date versions of provider packages to be made generally available and removes their dependency on core Airflow releases.

It’s worth noting that some operators, including the Bash and Python operators, remain in the core distribution given their widespread usage.

To learn more, refer to Airflow documentation on Provider Packages.
Simplified Kubernetes Executor

Airflow 2.0 includes a re-architecture of the Kubernetes Executor and KubernetesPodOperator, both of which allow users to dynamically launch tasks as individual Kubernetes pods to optimize overall resource consumption.

Given the known complexity users previously had to overcome to successfully leverage the executor and operator, we drove a concerted effort towards simplification that ultimately involved removing over 3,000 lines of code. The changes incorporated in Airflow 2.0 made the executor and operator easier to understand, faster to execute, and far more flexible to configure.

Data engineers now have access to the full Kubernetes API to create a YAML pod_template_file, instead of being restricted to a partial set of configurations through parameters defined in the airflow.cfg file. We’ve also replaced the executor_config dictionary with the pod_override parameter, which takes a Kubernetes V1Pod object for a clear 1:1 override setting.

For more information, we encourage you to follow the documentation on the new pod_template_file and pod_override functionality.
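A rough sketch of the pod_override pattern described above, assuming the Kubernetes Executor (or an equivalent Kubernetes setup) and the kubernetes Python client are available; the resource sizes are placeholders:

```python
from datetime import datetime

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.operators.python import PythonOperator


def crunch():
    print("crunching")


with DAG(
    dag_id="example_pod_override",     # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # executor_config now takes a full V1Pod under the "pod_override" key,
    # a clear 1:1 mapping to the Kubernetes API instead of bespoke settings.
    heavy_task = PythonOperator(
        task_id="heavy_task",
        python_callable=crunch,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # the container Airflow runs the task in
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "1", "memory": "2Gi"},  # placeholder sizes
                            ),
                        )
                    ]
                )
            )
        },
    )
```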
UI/UX Improvements

Perhaps one of the most welcomed sets of changes brought by Airflow 2.0 has been the visual refreshment of the Airflow UI. In an effort to give users a more sophisticated and intuitive front-end experience, we’ve made over 30 UX improvements.

Major Features in Airflow 2.2

Airflow 2.2 Big Features:
• Deferrable tasks: turn any task into a super-efficient asynchronous loop.
• Custom timetables: run DAGs exactly when you want.
• @task.docker: easily run Python functions in separate Docker containers.

Custom Timetables

Cron expressions got us as far as regular time intervals, which is only convenient for predictable data actions. Before Airflow 2.2.0, for example, you could not run a DAG at 9am and 12:30pm, because that schedule has different hours and minutes, which is impossible with cron. Now that custom timetables are finally here, the scheduling sky is the limit. With this feature, it’s also now possible to work with data for a specific period of time.

New concept: data_interval, the period of data that a task should operate on.

Since the concept of execution_date was confusing to every new user, a better version is now available. No more “why didn’t my DAG run?”, as this feature has been replaced with data_interval, which is the period of data that a task should operate on. It includes:

• logical_date (aka execution_date)
• data_interval_start (same value as execution_date for cron)
• data_interval_end (aka next_execution_date)

Key Benefits of Custom Timetables:
• Flexibility
• Run DAGs exactly when you want
• Introduces an explicit “data interval”
• Full backward compatibility maintained (schedule_interval is not going away)

Bonus for Astronomer Customers: NYSE Trading Timetable

The trading hours timetable allows users to run DAGs based on the start and end of trading hours for NYSE and Nasdaq. It includes historic trading hours as well as holidays and half-days where the markets have irregular hours.
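To make the data_interval concept concrete, here is a small sketch that templates the interval bounds into a task’s command; the DAG id and command are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_data_interval",    # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # data_interval_start / data_interval_end are provided by the scheduler
    # and describe the period of data this run should operate on.
    report = BashOperator(
        task_id="report",
        bash_command=(
            "echo processing data from "
            "{{ data_interval_start }} to {{ data_interval_end }}"
        ),
    )
```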

Deferrable Operators

Many longtime Airflow users are familiar with the frustration of tasks or sensors clogging up worker resources while waiting for external systems and events. Airbnb introduced smart sensors, the first attempt to tackle this issue. Deferrable operators go further than smart sensors: they are perfect for anything that submits a job to an external system and then polls for status. With this feature, operators or sensors can postpone themselves until a lightweight async check succeeds, at which time they can resume executing. This causes the worker slot, as well as any resources it uses, to be returned to Airflow. A deferred task does not consume a worker slot while in deferral mode; instead, one triggerer process (a new component, the daemon process that executes the asyncio event loop) can run hundreds of async deferred tasks concurrently. As a result, tasks like monitoring a job on an external system or watching for an event become far less expensive.

Key Benefits of Deferrable Tasks:
• Turn any task into a super-efficient asynchronous loop
• Doesn’t use worker resources when deferred
• Resilient against restarts
• Paves the way to event-based DAGs!
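As a small sketch of a deferrable sensor, the snippet below uses TimeDeltaSensorAsync, which ships with Airflow 2.2+ and requires a running triggerer; the delay and task ids are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="example_deferrable",       # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Waits until one hour after the end of the data interval.
    # While waiting, the task is handed to the triggerer and its worker slot is freed.
    wait_an_hour = TimeDeltaSensorAsync(
        task_id="wait_an_hour",
        delta=timedelta(hours=1),
    )

    run_report = BashOperator(task_id="run_report", bash_command="echo report")

    wait_an_hour >> run_report
```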
Custom @task Decorators and @task.docker

The @task.docker decorator allows for running a function inside a Docker container. Airflow handles putting the code into the container and returning the XCom. This is especially beneficial when there are competing dependencies between Airflow and the tasks that must run.
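A minimal sketch of the decorator, assuming the Docker provider is installed and a Docker daemon is reachable from the worker; the image and values are illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_task_docker",      # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
)
def example_task_docker():
    @task.docker(image="python:3.9-slim")  # this function runs inside the container
    def transform(x: int) -> int:
        # Only the function's code and return value cross the container boundary;
        # Airflow handles the XCom plumbing.
        return x * 2

    @task
    def report(result: int):
        print(f"result is {result}")

    report(transform(21))


example_task_docker()
```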
Major Features in Airflow 2.3

Note: Airflow 2.3 was released in April 2022.

Dynamic Task Mapping

Dynamic task mapping permits Airflow’s scheduler to trigger tasks based on context: given this input (for example, a set of files in an S3 bucket), run these tasks. The number of tasks can change at run time based on the number of files; you no longer have to configure a static number of tasks.

As of Airflow 2.3, you can use dynamic task mapping to hook into S3, list any new files, and run separate instances of the same task for each file. You don’t have to worry about keeping your Airflow workers busy, because Airflow’s scheduler automatically optimizes for available parallelism.

Dynamic task mapping not only gives you a means to easily parallelize these operations across your available Airflow workers, but also makes it easier to rerun individual tasks (e.g., in the case of failure) by giving you vastly improved visibility into the success or failure of these tasks. Imagine that you create a monolithic task to process all of the files in the S3 bucket, and that one or more steps in this task fail. In past versions of Airflow, you’d have to parse the Airflow log file generated by your monolithic task to determine which step failed and why. In 2.3, when a dynamically mapped task fails, it generates a discrete alert for that step in Airflow. This enables you to zero in on the specific task and troubleshoot from there. If appropriate, you can easily requeue the task and run it all over again.

Learn more:
GUIDE: Dynamic Tasks in Airflow
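Here is a condensed sketch of the pattern; the S3 listing is stubbed out with a plain Python function so the example stays self-contained:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="example_dynamic_mapping",  # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)
def example_dynamic_mapping():
    @task
    def list_files():
        # Stand-in for listing new files in an S3 bucket.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path):
        print(f"processing {path}")

    # expand() creates one mapped task instance per file at run time,
    # so a failure in one file can be retried on its own.
    process.expand(path=list_files())


example_dynamic_mapping()
```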

New Grid View

Airflow’s new grid view replaces the tree view, which was not ideal for representing DAGs and their topologies, since a tree cannot natively represent a DAG that has more than one path, such as a task with branching dependencies. The tree view could only represent these paths by displaying multiple, separate instances of the same task. So if a task had three paths, Airflow’s tree view would show three instances of the same task, which was confusing even for expert users.

The grid view offers first-class support for Airflow task groups. The tree view chained task group IDs together, resulting in repetitive text and, occasionally, broken views. In the new grid view, task groups display summary information and can be expanded or collapsed as needed. The new grid view also dynamically generates lines and hover effects based on the task you’re inspecting, and displays the durations of your DAG runs; this lets you quickly track performance and makes it easier to spot potential problems. These are just a few of the improvements the new grid view brings.
A New LocalKubernetesExecutor

Airflow 2.3’s new local K8s executor allows you to be selective about which tasks you send out to a new pod in your K8s cluster; you can either use a local Airflow executor to run your tasks within the scheduler service or send them out to a discrete K8s pod. Kubernetes is powerful, to be sure, but it’s overkill for many use cases. (It also adds layers of latency that the local executor does not.) The upshot is that for many types of tasks, especially lightweight ones, it’s faster (and, arguably, just as reliable) to run them in Airflow’s local executor, as against spinning up a new K8s pod.

Storing Airflow Connections in JSON Instead of URI Format

The new release’s ability to store Airflow connections in JSON (rather than in Airflow URI) format is a relatively simple feature that, for some users, could nevertheless be a hugely welcomed change. Earlier versions stored connection information in Airflow’s URI format, and there are cases in which that format can be tricky to work with. JSON provides a simple, human-readable alternative.
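As a small sketch, assuming the usual AIRFLOW_CONN_<CONN_ID> environment-variable convention (the host, login, and password are placeholders), the same connection that used to be a dense URI can now be expressed as JSON:

```python
import json

# Airflow 2.3 accepts connections serialized as JSON, for example supplied
# through an AIRFLOW_CONN_MY_WAREHOUSE environment variable.
my_warehouse = {
    "conn_type": "postgres",
    "host": "warehouse.example.com",   # placeholder
    "login": "analyst",                # placeholder
    "password": "change-me",           # placeholder
    "port": 5432,
    "schema": "analytics",
    "extra": {"sslmode": "require"},
}

print(json.dumps(my_warehouse))
```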
Altogether, Airflow 2.3 introduces more than a dozen new features, including, in addition to the above, a new command in the Airflow CLI for reserializing DAGs, a new listener plugin API that tracks TaskInstance state changes, a new REST API endpoint for bulk-pausing/resuming DAGs, and other ease-of-use (or fit-and-finish) features. Learn more about Airflow 2.3 on our blog.

Ready for more?

Discover our guides, covering everything from introductory content to advanced tutorials around Airflow.

Discover Guides
6. Useful Resources to Get Started with Airflow

Airflow Essentials

WEBINAR: Intro to Airflow
WEBINAR: Intro to Data Orchestration with Apache Airflow
VIDEO: Coding Your First DAG for Beginners
WEBINAR: Best Practices For Writing DAGs in Airflow 2
GUIDE: Understanding the Airflow UI
GUIDE: Airflow Executors Explained
GUIDE: Managing Your Connections in Apache Airflow
GUIDE: DAG Writing Best Practices
GUIDE: Using Airflow to Execute SQL
GUIDE: Using TaskGroups in Airflow
GUIDE: Debugging DAGs
GUIDE: Rerunning DAGs
GUIDE: Airflow Decorators
GUIDE: Deferrable Operators
WEBINAR: Scaling out Airflow
WEBINAR: Scheduling in Airflow

Integrations

Make the most of Airflow by connecting it to other tools.

Airflow + dbt
ARTICLE: Building a Scalable Analytics Architecture With Airflow and dbt
ARTICLE: Airflow and dbt, Hand in Hand
GUIDE: Integrating Airflow and dbt

Airflow + Great Expectations
GUIDE: Integrating Airflow and Great Expectations
Airflow + Azure
GUIDE: Executing Azure Data Factory Pipelines with Airflow
GUIDE: Executing Azure Data Explorer Queries with Airflow
GUIDE: Orchestrating Azure Container Instances with Airflow

Airflow + Databricks
GUIDE: Orchestrating Databricks Jobs with Airflow

Airflow + Sagemaker
GUIDE: Using Airflow with SageMaker

Airflow + Notebooks
GUIDE: Executing Notebooks with Airflow

Airflow + Kedro
GUIDE: Deploying Kedro Pipelines to Apache Airflow

Airflow + Redshift
GUIDE: Orchestrating Redshift Operations from Airflow
Airflow + Talend
GUIDE: Executing Talend Jobs with Airflow

Airflow + AWS Lambda
GUIDE: Best Practices Calling AWS Lambda from Airflow

Setting Up Error Notifications in Airflow Using Email, Slack, and SLAs
GUIDE: Error Notifications in Airflow

Can’t find the integration you’re looking for? Check out the Astronomer Registry, a go-to discovery and distribution hub for Apache Airflow integrations.

Browse Registry
7. About Astronomer

With its rapidly growing number of downloads and contributors, Airflow is at the center of the data management conversation, a conversation Astronomer is helping to drive.

Our commitment is evident in our people. We’ve got a robust Airflow Engineering team and sixteen active committers on board, including seven PMC members: Ash Berlin-Taylor, Kaxil Naik, Daniel Imberman, Jed Cunningham, Bolke de Bruin, Sumit Maheshwari, and Ephraim Anierobi.

A significant portion of the work all these people do revolves around Airflow releases: creating new features, testing, troubleshooting, and ensuring that the project continues to improve and grow. Their hard work, combined with that of the community at large, allowed us to deliver Airflow 2.0 in late 2020, Airflow 2.2 in 2021, and Airflow 2.3 in early 2022.

With every new release, Astronomer is unlocking more of Airflow’s potential and moving closer to the goal of democratizing the use of Airflow, so that all members of the data team can work with or otherwise benefit from it. And we’ve also built Astro, the essential data orchestration platform. With the strength and vibrancy of Airflow at its core, Astro offers a scalable, adaptable, and efficient blueprint for successful data orchestration.

If you’d like to learn more about how Astronomer drives the Apache Airflow project together with the community, check out this recent article.
Thank you
We hope you’ve enjoyed our guide to Airflow. Please follow us on
Twitter and LinkedIn, and share your feedback, if any.

Start building your next-generation data platform with Astro

Get Started

Experts behind this ebook:


Viraj Parekh | Field CTO at Astronomer
Kenten Danas | Lead Developer Advocate at Astronomer
Jake Witz | Technical Writer at Astronomer

Created by ©Astronomer 2022
