

Study Guide for Apache Airflow Fundamentals Certification

Orest Tokovenko – 22 December 2021

This study guide covers the Astronomer Certification for Apache Airflow Fundamentals.
Apache Airflow is the leading orchestrator for authoring, scheduling, and monitoring
data pipelines. The exam consists of 75 questions, and you have 60 minutes to complete it.
The study guide below covers everything you need to know for it. The exam includes
scenarios (both text and images of Python code) where you need to determine what
the output will be, if any at all. To study for this exam I watched the official
Astronomer preparation course, which I highly recommend.
According to Astronomer, this exam will test the following:

“To pass the Apache Airflow Fundamentals certification, you will need to demonstrate
an understanding of Airflow’s architecture, the task life cycle, and the scheduling
process. You should be comfortable recommending use cases, architectural needs,
settings and design choices for data pipelines. You should be able to trigger, debug
and retry tasks and look at the right views from the UI to monitor them.”

Study Guide

User Interface

• Graph View

◦ The Graph View shows a visualization of the tasks and dependencies in your DAG
and their current status for a specific DAG Run. This view is particularly useful
when reviewing and developing a DAG. When running the DAG, you can toggle
the Auto-refresh button on to see the status of the tasks update in real time.
• Tree View

◦ The Tree View shows a tree representation of the DAG and its tasks across time.
Each column represents a DAG Run and each square is a task instance in that DAG
Run. Task instances are color-coded according to their status. DAG Runs with a
black border represent scheduled runs, whereas DAG Runs with no border are
manually triggered.
• Code View

◦ The Code View shows the code that is used to generate the DAG. While your code
should live in source control, the Code View can be a useful way of gaining quick
insight into what is going on in the DAG. Note that code for the DAG cannot be
edited directly in the UI.
• Gantt View

◦ The Gantt View lets you analyze task duration and overlap, making it easy to
check whether tasks are running in parallel.

Core Concepts

• DAGs

◦ DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks
together, organized with dependencies and relationships to say how they should
run.
▪ Here's a basic example DAG (sketched in code at the end of this bullet):

▪ It defines four Tasks - A, B, C, and D - and dictates the order in which they have
to run, and which tasks depend on which others. It will also say how often to
run the DAG - maybe “every 5 minutes starting tomorrow”, or “every day since
January 1st, 2020”.

▪ The DAG itself doesn’t care about what is happening inside the tasks; it is
merely concerned with how to execute them - the order to run them in, how
many times to retry them, whether they have timeouts, and so on.
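The original post shows this example as an image, which is not reproduced here. A minimal sketch of such a DAG in Airflow 2.x syntax (the dag_id, schedule, and bash commands are illustrative, not taken from the original):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="basic_example_dag",        # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",        # "how often to run the DAG"
    catchup=False,
) as dag:
    a = BashOperator(task_id="A", bash_command="echo A")
    b = BashOperator(task_id="B", bash_command="echo B")
    c = BashOperator(task_id="C", bash_command="echo C")
    d = BashOperator(task_id="D", bash_command="echo D")

    # A runs first, B and C run after A (in parallel), and D runs last.
    a >> [b, c] >> d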
• DAG Runs

◦ DAGs will run in one of two ways:


▪ When they are triggered either manually or via the API

▪ On a defined schedule, which is specified as part of the DAG


• Tasks

◦ A Task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and
then have upstream and downstream dependencies set between them in order
to express the order they should run in.
▪ There are three basic kinds of Task (all three are sketched together in the example below):
▪ Operators, predefined task templates that you can string together quickly to
build most parts of your DAGs.

▪ Sensors, a special subclass of Operators which are entirely about waiting for
an external event to happen.

▪ A TaskFlow-decorated @task, which is a custom Python function packaged
up as a Task.
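A minimal sketch showing the three kinds side by side (the dag_id, file path, and commands are made up for illustration; Airflow 2.x imports assumed):

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="three_kinds_of_tasks",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # 1. Operator: a predefined task template.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")

    # 2. Sensor: waits for an external event (here, a file appearing on disk).
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/data.csv")

    # 3. TaskFlow: a plain Python function packaged up as a task.
    @task
    def transform():
        return "transformed"

    extract >> wait_for_file >> transform()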
• Operators

◦ An operator represents a single, ideally idempotent, task. Operators determine
what actually executes when your DAG runs.
▪ Types of Operators
▪ Sensor Operator
▪ A special kind of operator designed to wait for something to
happen.
• Backfill and Catchup

◦ Note: both are mechanisms for running DAG Runs for past intervals


▪ Catchup
▪ An Airflow DAG with a start_date, possibly an end_date, and a
schedule_interval defines a series of intervals which the scheduler turns
into individual DAG Runs and executes. By default, the scheduler will kick off a
DAG Run for any data interval that has not been run since the last data interval
(or has been cleared). This concept is called Catchup.
▪ If your DAG is not written to handle catchup (i.e., it does not operate on its
data interval but instead, for instance, always on the current moment), you will
want to turn catchup off. This can be done by setting catchup=False in the DAG
(see the sketch below) or catchup_by_default=False in the configuration file.
When turned off, the scheduler creates a DAG Run only for the latest interval.
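A sketch of turning catchup off in the DAG definition (the dag_id and dates are illustrative):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="no_catchup_example",
    start_date=datetime(2020, 1, 1),   # in the past
    schedule_interval="@daily",
    catchup=False,                     # only the latest interval is scheduled
) as dag:
    ...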

▪ Backfill
▪ Reruns past DAG Runs. There can be cases where you want to run the
DAG for a specified historical period, e.g., a data-filling DAG is created with
start_date 2019-11-21, but another user requires the output data from a
month earlier, i.e., 2019-10-21. This process is known as Backfill.
▪ You may want to backfill the data even when catchup is disabled.
This can be done through the CLI by running the command below.
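The command itself is missing from the extracted page; for the scenario above, a typical Airflow 2.x invocation would look roughly like this (the DAG id is illustrative):

airflow dags backfill --start-date 2019-10-21 --end-date 2019-11-21 data_filling_dag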
• Architectural Components

◦ Apache Airflow consists of four core components:

▪ Webserver: Airflow’s UI.
▪ At its core, this is just a Flask app that displays the status of your jobs,
provides an interface to interact with the metadata database, and reads logs from a
remote file store (S3, Google Cloud Storage, Azure Blob Storage, Elasticsearch, etc.).

▪ Scheduler: Responsible for scheduling jobs.


▪ This is a multi-threaded Python process that uses the DAG object with the
state of tasks in the metadata database to decide what tasks need to be run,
when they need to be run, and where they are run.

▪ The scheduler stores and updates task statuses, which the Webserver then
uses to display job information.
▪ Executor: how each task will be executed by Airflow
▪ There are several executors, each with strengths and weaknesses.

▪ Metadata Database: A database (usually Postgres or MySQL, but it can be
anything supported by SQLAlchemy) that determines how the other components
interact.
• Use Cases

◦ Airflow can be used for virtually any batch data pipeline, and there are a ton of
documented use cases in the community. Because of its extensibility, Airflow is
particularly powerful for orchestrating jobs with complex dependencies in multiple
external systems.
• Miscellaneous

◦ parallelism = 32 - default maximum number of task instances that can run concurrently across the whole Airflow installation

◦ dag_concurrency = 16 - default maximum number of task instances allowed to run concurrently within a single DAG

◦ max_active_runs = 16 - default maximum number of active DAG Runs per DAG

◦ Types of executors
▪ Local executor
▪ The go-to executor, can complete tasks in parallel

▪ Sequential executor
▪ Not many use cases; cannot run tasks in parallel

▪ Celery executor
▪ Built for horizontal scaling, requires multiple machines to run
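For reference, a sketch of the corresponding airflow.cfg entries (the values are the defaults listed above; LocalExecutor is just one possible choice, and newer Airflow versions rename some of these keys):

[core]
# Default maximum number of task instances running at once across the whole installation.
parallelism = 32
# Default maximum number of task instances running at once within a single DAG.
dag_concurrency = 16
# Default maximum number of active DAG Runs per DAG.
max_active_runs_per_dag = 16
# Which executor Airflow uses; LocalExecutor runs tasks in parallel on one machine.
executor = LocalExecutor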

◦ Airflow is an orchestrator
▪ It is Dynamic, Scalable, Interactive 
▪ It is NOT a streaming or data processing framework (like Spark, although it can
trigger Spark jobs)

◦ A DAG Run is triggered once its data interval has passed, i.e., at
start_date + schedule_interval for the first run. This means that a DAG with a
10-minute schedule_interval and a start_date of 10:00 will first run at 10:10,
with an execution_date of 10:00.
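A sketch of that 10-minute example (the dag_id and concrete date are illustrative):

from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="every_ten_minutes",
    # start_date 10:00, schedule_interval 10 minutes:
    # the first DAG Run is triggered at 10:10, with execution_date 10:00.
    start_date=datetime(2021, 1, 1, 10, 0),
    schedule_interval=timedelta(minutes=10),
) as dag:
    ...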

◦ By default (catchup=True), Airflow will run all non-triggered DAG Runs between
the start_date (if it is in the past) and the current date

◦ How to read schedule times in Airflow: e.g., schedule_interval="*/10 * * * 1-5" means
trigger the DAG every 10 minutes, Monday through Friday
◦ Airflow operates using UTC time

◦ Defining default_args once and passing them to the DAG avoids repeating common
arguments (owner, retries, etc.) on every task (see the sketch below)
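A sketch combining the cron schedule above with default_args shared by all tasks (the dag_id, owner, and retries values are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Arguments shared by every task in the DAG, defined once instead of being
# repeated on each operator.
default_args = {
    "owner": "airflow",
    "retries": 2,
}

with DAG(
    dag_id="cron_with_defaults",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/10 * * * 1-5",  # every 10 minutes, Monday-Friday (UTC)
    default_args=default_args,
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="say_hello", bash_command="echo hello")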

Once you pass, you’ll earn the beautiful badge below!

-Airflow Fundamentals Badge-

 
