Apache Airflow

Apache Airflow is an open-source tool for creating, scheduling, and monitoring workflows using Directed Acyclic Graphs (DAGs) to manage task dependencies. It allows users to define complex workflows in Python, with various deployment options available across cloud providers like AWS, Azure, and GCP. The document also covers the basics of data engineering, workflows, and the functionalities of Airflow, including its web interface and command line tools.


Orchestration | Automation | Monitoring | Scheduling | Pipeline | Python

Airflow Architecture (Astro)
Apache Airflow
Apache Airflow is an open-source tool for creating, scheduling, and monitoring workflows as code, using DAGs (Directed Acyclic Graphs) to define task dependencies and manage execution, retries, and failures.

Airflow Web UI (enable/disable DAGs, monitor, refresh)

Anil Patel
Architect - Data Engineering & Analytics
Career Transition Coach

Airflow DAG
DAGs (Directed Acyclic Graphs) -

A DAG represents the overall structure of a workflow in Airflow.

An Airflow DAG is a collection of tasks in a workflow, with specified execution order and dependencies. Written in Python, it ensures tasks run in sequence without loops, allowing users to define complex workflows, including scheduling and logic.

A DAG defines task dependencies, determining their execution order, while tasks specify the actions, such as fetching data, running analyses, or triggering other systems.
Airflow Architecture

Airflow can be deployed in different ways - gradually from the simple "one machine", single-person deployment, to a more complex deployment with separate components, separate user roles, and finally more isolated security perimeters.
Create a DAG
Design a Basic DAG -

It defines four Tasks - A, B, C, and D - and dictates the order in which they have to run, and which tasks depend on what others. It will also say how often to run the DAG - maybe "every 5 minutes starting tomorrow", or "every day since January 1st, 2020".
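A minimal sketch of such a DAG, assuming Airflow 2.x and the BashOperator; the dag_id, schedule, and echo commands are placeholders for illustration:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Hypothetical DAG with four tasks: A runs first, B and C run in parallel, D runs last
with DAG(
    dag_id='basic_dag',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',   # how often to run the DAG
    catchup=False,
) as dag:
    task_a = BashOperator(task_id='task_a', bash_command='echo "A"')
    task_b = BashOperator(task_id='task_b', bash_command='echo "B"')
    task_c = BashOperator(task_id='task_c', bash_command='echo "C"')
    task_d = BashOperator(task_id='task_d', bash_command='echo "D"')

    # Dependencies: A -> (B, C) -> D
    task_a >> [task_b, task_c]
    [task_b, task_c] >> task_d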

Python program DAG (Part 1/2)
Python program DAG (Part 2/2)

Then deploy the Python code to your Git or AWS S3-based repo (it's very simple).
The DAG will display in the Airflow UI once you place the DAG code.
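For example, with Amazon MWAA the DAG file is typically copied into the S3 bucket configured for the environment; the bucket name below is a placeholder:

aws s3 cp etl_pipeline.py s3://<your-mwaa-bucket>/dags/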

Ways to View and Manage DAGs in Airflow

Cloud providers that offer Airflow services

Amazon Web Services (AWS) -

Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed Airflow service on AWS.


Cloud providers that offer Airflow services

Microsoft Azure -

Apache Airflow integrates with Microsoft Azure services through the Microsoft Azure provider package. You can install any provider package by editing the Airflow environment from the Azure Data Factory UI.
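For instance, the Azure provider package can be added to an Airflow environment with pip (package name as published on PyPI):

pip install apache-airflow-providers-microsoft-azure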

Apache Airflow™ on Astro - An Azure Native ISV Service
Cloud providers that offer Airflow services

Google Cloud Platform (GCP) -

Cloud Composer is a fully managed workflow orchestration service that runs on GCP. It's built on the Apache Airflow open-source project.
Cloud providers that offer Airflow services

Astronomer -
A managed Airflow service that allows data teams to build, run, and manage data pipelines as code. Astronomer can run Airflow on AWS, GCP, Azure, or on-premises.
Introduction to Apache Airflow
INTRODUCTION TO APACHE AIRFLOW IN PYTHON

Mike Metzger, Data Engineer
What is data engineering?
Data engineering is:

Taking any action involving data and turning it into a reliable, repeatable, and maintainable
process.
What is a workflow?
A workflow is:

A set of steps to accomplish a given data engineering task
Such as: downloading files, copying data, filtering information, writing to a database, etc.

Of varying levels of complexity

A term with various meanings depending on context
What is Airflow?
Airflow is a platform to program workflows,
including:

Creation
Scheduling
Monitoring
Airflow continued...
Can implement programs from any
language, but workflows are written in
Python
Implements workflows as DAGs: Directed
Acyclic Graphs
Accessed via code, command-line, or via
web interface / REST API

1 https://airflow.apache.org/docs/stable/
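As a rough sketch of the REST API access, assuming an Airflow 2.x webserver at localhost:8080 with the stable REST API and basic-auth backend enabled (credentials are placeholders):

import requests

# List the DAGs known to this Airflow instance via the stable REST API
response = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("admin", "admin"),  # placeholder credentials
)
print(response.json())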
Other workflow tools
Other tools:

ADF
SSIS
Dagster
Prefect
Mage
Apache Oozie
Informatica

Quick introduction to DAGs
DAG stands for Directed Acyclic Graph

In Airflow, this represents the set of tasks that make up your workflow.
Consists of the tasks and the dependencies between tasks.
Created with various details about the DAG, including the name, start date, owner, etc.
Further depth in the next lesson.

DAG code example
Simple DAG definition:

from airflow import DAG

etl_dag = DAG(
    dag_id='etl_pipeline',
    default_args={"start_date": "2024-01-08"}
)

Running a workflow in Airflow
Running a simple Airflow task

airflow tasks test <dag_id> <task_id> [execution_date]

Using a DAG named example-etl, a task named download-file on 2024-01-10:

airflow tasks test example-etl download-file 2024-01-10

Let's practice!
Airflow DAGs
What is a DAG?
DAG, or Directed Acyclic Graph:

Directed, there is an inherent flow representing dependencies between components.
Acyclic, does not loop / cycle / repeat.
Graph, the actual set of components.
Seen in Airflow, Apache Spark, dbt

DAG in Airflow
Within Airflow, DAGs:

Are written in Python (but can use components written in other languages).
Are made up of components (typically tasks) to be executed, such as operators, sensors, etc.
Contain dependencies defined explicitly or implicitly.
i.e., copy the file to the server before trying to import it to the database service (a sketch of this dependency follows).
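A minimal sketch of that explicit dependency using Airflow's bitshift operators; the dag_id, task ids, and echo commands are placeholders:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG('copy_then_import', start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    copy_file = BashOperator(task_id='copy_file', bash_command='echo "copy file to server"')
    import_file = BashOperator(task_id='import_file', bash_command='echo "import into database"')

    # copy_file must complete before import_file starts
    copy_file >> import_file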

Define a DAG
Example DAG:

from airflow import DAG
from datetime import datetime

default_arguments = {
    'owner': 'jdoe',
    'email': '[email protected]',
    'start_date': datetime(2020, 1, 20)
}

with DAG('etl_workflow', default_args=default_arguments) as etl_dag:
    ...  # tasks defined inside this block are attached to etl_dag automatically

Define a DAG (before Airflow 2.x)
Example DAG:

from airflow import DAG
from datetime import datetime

default_arguments = {
    'owner': 'jdoe',
    'email': '[email protected]',
    'start_date': datetime(2020, 1, 20)
}

etl_dag = DAG('etl_workflow', default_args=default_arguments)
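A sketch of attaching a task to this DAG object explicitly, since there is no with-block here; the operator and its command are illustrative (in pre-2.0 installs the import path is airflow.operators.bash_operator):

from airflow.operators.bash import BashOperator

extract = BashOperator(
    task_id='extract_data',
    bash_command='echo "extracting"',
    dag=etl_dag,  # attach the task to the DAG explicitly
)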

DAGs on the command line
Using airflow:

The airflow command line program contains many subcommands.
airflow -h for descriptions.
Many are related to DAGs.
airflow dags list to show all recognized DAGs.
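A few other DAG-related subcommands, shown as illustrative usage against the example DAG ids used earlier:

airflow dags trigger etl_pipeline to manually queue a DAG run.
airflow dags pause etl_pipeline to stop scheduling a DAG.
airflow tasks list etl_pipeline to list the task ids within a DAG.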

Command line vs Python
Use the command line tool to:

Start Airflow processes
Manually run DAGs / Tasks
Get logging information from Airflow

Use Python to:

Create a DAG
Edit the individual properties of a DAG

Let's practice!
Airflow web interface

DAGs view

The DAGs view lists each DAG along with its owner, recent runs, schedule, last run, next run, and recent tasks (for example, example_dag).

DAG detail view
DAG graph view
DAG code view
Audit logs

Web UI vs command line
In most cases:

Equally powerful depending on needs
Web UI is easier
Command line tool may be easier to access depending on settings

Let's practice!
Thanks for reading!
Follow my profile for more coding-related content.
