Study Guide For Apache Airflow Fundamentals Certification
This study guide covers Astronomer Certification for Apache Airflow Fundamentals.
Apache Airflow is the leading orchestrator for authoring, scheduling, and monitoring
data pipelines. The exam consists of 75 questions, and you have 60 minutes to complete it.
The study guide below covers everything you need to know for it. The exam includes
scenarios (both text and images of Python code) where you need to determine what
the output will be, if any at all. To study for this exam I watched the official
Astronomer preparation course, which I highly recommend.
According to Astronomer, this exam will test the following:
“To pass the Apache Airflow Fundamentals certification, you will need to demonstrate
an understanding of Airflow’s architecture, the task life cycle, and the scheduling
process. You should be comfortable recommending use cases, architectural needs,
settings and design choices for data pipelines. You should be able to trigger, debug
and retry tasks and look at the right views from the UI to monitor them.”
Study Guide
User Interface
• Graph View
◦ The Graph View shows a visualization of the tasks and dependencies in your DAG
and their current status for a specific DAG Run. This view is particularly useful
when reviewing and developing a DAG. When running the DAG, you can toggle
the Auto-refresh button to on to see the status of the tasks update in real time.
• Tree View
◦ The Tree View shows a tree representation of the DAG and its tasks across time.
Each column represents a DAG Run and each square is a task instance in that DAG
Run. Task instances are color-coded according to their status. DAG Runs with a
black border represent scheduled runs, whereas DAG Runs with no border are
manually triggered.
• Code View
◦ The Code View shows the code that is used to generate the DAG. While your code
should live in source control, the Code View can be a useful way of gaining quick
insight into what is going on in the DAG. Note that code for the DAG cannot be
edited directly in the UI.
• Gantt View
◦ The Gantt View lets you analyze task durations and overlaps, making it easy to
check whether tasks are running in parallel.
Core Concepts
• DAGs
◦ DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks
together, organized with dependencies and relationships to say how they should
run.
▪ A basic example DAG is sketched in the code block after this list.
▪ It defines four Tasks - A, B, C, and D - and dictates the order in which they have
to run, and which tasks depend on which others. It will also specify how often to
run the DAG - maybe “every 5 minutes starting tomorrow”, or “every day since
January 1st, 2020”.
▪ The DAG itself doesn’t care about what is happening inside the tasks; it is
merely concerned with how to execute them - the order to run them in, how
many times to retry them, whether they have timeouts, and so on.
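Below is a minimal sketch of such a DAG, assuming placeholder tasks and an illustrative dag_id and schedule (DummyOperator is renamed EmptyOperator in newer Airflow releases):

```python
# A minimal sketch of the four-task example DAG described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # EmptyOperator in Airflow >= 2.2

with DAG(
    dag_id="example_dag",             # hypothetical name
    start_date=datetime(2020, 1, 1),  # "every day since January 1st, 2020"
    schedule_interval="@daily",
    catchup=False,
) as dag:
    a = DummyOperator(task_id="A")
    b = DummyOperator(task_id="B")
    c = DummyOperator(task_id="C")
    d = DummyOperator(task_id="D")

    # A runs first, then B and C in parallel, then D.
    a >> [b, c] >> d
```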
• Tasks
◦ A Task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and
then have upstream and downstream dependencies set between them in order
to express the order they should run in.
▪ There are three basic kinds of Task (illustrated in the sketch after this list):
▪ Operators, predefined task templates that you can string together quickly to
build most parts of your DAGs.
▪ Sensors, a special subclass of Operators which are entirely about waiting for
an external event to happen.
▪ TaskFlow-decorated @task functions, custom Python functions packaged up
as Tasks.
▪ The Scheduler stores and updates task statuses in the metadata database,
which the Webserver then reads to display job information.
▪ The Executor determines how and where each task will be run by Airflow.
▪ There are several executors, each with strengths and weaknesses.
◦ Airflow can be used for virtually any batch data pipeline, and there are a ton of
documented use cases in the community. Because of its extensibility, Airflow is
particularly powerful for orchestrating jobs with complex dependencies in multiple
external systems.
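To make the difference between an Operator and a Sensor concrete, here is a minimal sketch in which a FileSensor waits for a file to appear before a PythonOperator processes it. The connection id, file path, and callable are hypothetical:

```python
# A minimal sketch: a Sensor gating an Operator. The connection id and
# path are hypothetical and would need to exist in your environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def _process_file():
    print("processing the file")


with DAG(
    dag_id="sensor_example",          # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",      # hypothetical filesystem connection
        filepath="data/input.csv",    # hypothetical path
        poke_interval=30,             # re-check every 30 seconds
    )
    process = PythonOperator(task_id="process", python_callable=_process_file)

    wait_for_file >> process
```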
• Miscellaneous
◦ parallelism = 32 (default): the maximum number of task instances that can run
concurrently across the entire Airflow instance
◦ dag_concurrency = 16 (default): the maximum number of task instances allowed
to run concurrently within a single DAG
◦ max_active_runs_per_dag = 16 (default): the maximum number of DAG Runs that
can be active at once for a given DAG
◦ Types of executors
▪ Local executor
▪ The go-to single-machine executor; can run tasks in parallel (requires a
database such as PostgreSQL or MySQL)
▪ Sequential executor
▪ The install default; runs tasks one at a time and is the only executor that
works with SQLite, so it has few production use cases
▪ Celery executor
▪ Built for horizontal scaling; distributes tasks across multiple worker
machines
◦ Airflow is an orchestrator
▪ It is Dynamic, Scalable, Interactive
▪ It is NOT a streaming or data processing framework like Spark, although it
can trigger Spark jobs
◦ By default (catchup=True), if the start_date is in the past, Airflow will back-fill
all non-triggered DAG Runs between the start_date and the current date (see
the sketch below)
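A minimal sketch pulling the scheduling points above together: with a past start_date and catchup=True (the default), the scheduler back-fills every missed daily run when the DAG is enabled. The concurrency and max_active_runs arguments shown are the DAG-level counterparts of dag_concurrency and max_active_runs_per_dag; the dag_id is hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="catchup_example",         # hypothetical name
    start_date=datetime(2020, 1, 1),  # start_date in the past
    schedule_interval="@daily",
    catchup=True,        # default: back-fill all missed runs since start_date
    concurrency=16,      # per-DAG task cap (max_active_tasks in Airflow 2.2+)
    max_active_runs=16,  # per-DAG cap on concurrent DAG Runs
) as dag:
    DummyOperator(task_id="daily_task")
```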