Intro To Apache Airflow

Apache Airflow is an open-source platform for authoring, scheduling, and monitoring workflows or data pipelines. It allows users to define workflows as directed acyclic graphs (DAGs) of tasks and automates the process of creating, scheduling, and updating data pipelines. Airflow addresses the challenges of using cron jobs to run workflows by providing a rich web UI and APIs to programmatically author, schedule, monitor, debug, and scale workflows across multiple tasks and data sources.


Apache Airflow

Introduction
Agenda
● What is Airflow
● What is a workflow
● Example of an Airflow workflow
● Background and the world before Airflow
● Purpose
● Terminologies
● Core Components
● Usage
● Demo
What is Airflow
● Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows.

● It allows you to define and manage complex data pipelines as directed acyclic graphs (DAGs) of tasks and to automate the process of creating and updating those pipelines.

● It provides a rich web-based interface for setting up, monitoring, and managing workflow execution, and an API for triggering and monitoring workflows.
What is a Workflow?

● A sequence of tasks
● Started on a schedule or triggered by an event
● Frequently used to handle big data processing pipelines

Example of an Airflow workflow

1. Download data
2. Send data to processing
3. Monitor processing
4. Generate report
5. Send email
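
A minimal sketch of how these five steps might look as an Airflow DAG, written against the Airflow 2.x Python API. The dag id, schedule, and function bodies are illustrative assumptions, not part of the original slides:

# Sketch of the example workflow as a DAG; names and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def download_data():
    ...  # e.g. pull files from the source system

def send_to_processing():
    ...

def monitor_processing():
    ...

def generate_report():
    ...

def send_email():
    ...

with DAG(
    dag_id="example_report_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="download_data", python_callable=download_data)
    t2 = PythonOperator(task_id="send_to_processing", python_callable=send_to_processing)
    t3 = PythonOperator(task_id="monitor_processing", python_callable=monitor_processing)
    t4 = PythonOperator(task_id="generate_report", python_callable=generate_report)
    t5 = PythonOperator(task_id="send_email", python_callable=send_email)

    # Each step runs only after the previous one succeeds.
    t1 >> t2 >> t3 >> t4 >> t5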
Background

A developer wants to run a job on a schedule:


● Cron job (Job scheduling)
● Python or bash script

Start → Extract data from data source A to storage B → End
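
For illustration, the pre-Airflow setup might be a standalone script wired to cron. A hedged sketch; the file paths and schedule below are made up:

# extract_a_to_b.py -- a standalone extraction job run by cron.
# Hypothetical crontab entry (runs every day at 02:00):
#   0 2 * * * /usr/bin/python3 /opt/jobs/extract_a_to_b.py

def extract_from_source_a():
    ...  # read records from data source A

def load_into_storage_b(records):
    ...  # write records to storage B

if __name__ == "__main__":
    load_into_storage_b(extract_from_source_a())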
Background
Business demands more data extractions from various sources.

Solution: develop more cron jobs.

1. Start → Extract data from data source A to storage B → End
2. Start → Extract data from data source C to storage D → End
…
n. Start → Extract data from data source E to storage F → End
Challenges with cron jobs
● Hard to scale
● Hard to monitor
● Hard to maintain
● Hard to maintain dependencies
● Hard to manage job failures and timeouts
● Hard to manage deployments
Airflow advantages

Developers can programmatically:


● Author workflows
● Schedule workflows
● Monitor workflows
● Debug
● Scale easily
Airflow Terms
● Task: a single unit of work in a workflow. In the flows below, each extraction step between Start and End is one task:

1. Start → Extract data from data source A to storage B → End
2. Start → Extract data from data source A to storage B → End
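
In code, each task is an instance of an operator declared inside a DAG definition. A minimal sketch; the task id and the command it runs are hypothetical:

# One task = one operator instance (declared within a DAG definition).
from airflow.operators.bash import BashOperator

extract_a_to_b = BashOperator(
    task_id="extract_a_to_b",
    bash_command="python /opt/jobs/extract_a_to_b.py",  # hypothetical script
)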
Airflow DAG
● A DAG (Directed Acyclic Graph) is used to define a workflow as a series of
tasks and how they interact with each other.
● Each task in a DAG represents a single operation in your workflow, such as
running a query, sending an email, or uploading a file.
● The relationships between tasks are defined by dependencies, where one task can only run after another task has completed.
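
In Python, dependencies are typically declared with the >> operator. A short sketch using the operations named above; run_query, send_email, and upload_file are assumed to be tasks defined elsewhere in the DAG:

# A downstream task runs only after its upstream tasks complete.
run_query >> upload_file >> send_email   # a linear chain
# or fan-out: run_query >> [upload_file, send_email]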
Airflow Core components

(Diagram: the Webserver serves the Web UI; the Scheduler dispatches tasks to Workers; Workers produce task execution logs; Scheduler, Workers, and Webserver all read from and write to the Metadata database.)
Airflow usage
● Run and automate ETL pipelines
● Data ingestion pipelines
● Machine learning pipelines
● Predictive data pipelines
● General purpose scheduling
Airflow architecture
● Scheduler: Triggers scheduled workflows and submits tasks to executor to
run
● Executor: Manages tasks
● Worker: Runs the tasks
● Webserver: Serves the user interface
● Metadata database: Stores information about DAGs and tasks
