Air Flow Clustering HA

This document discusses Airflow clustering and high availability. It describes the different Airflow daemons, single and cluster deployments, scaling workers and masters, limitations, and how the Airflow Scheduler Failover Controller provides high availability for the scheduler process across multiple nodes. The Failover Controller ensures there is only one scheduler running at a time by monitoring the process, restarting it if it fails, and failing it over to another node if needed.

Uploaded by

Deepak Mane

Airflow Clustering and High Availability

By: Robert Sanders


Agenda

• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure

Page: 2
Airflow Daemons

• Web Server
• Daemon that runs the Airflow Webserver
• 1 to many gunicorn processes to accept and process requests in
parallel.
• Allows you to track job progress, run jobs, and more
• Scheduler
• Periodically runs (every X seconds) to determine whether a DAG or Task
needs to be run based on the DAG schedule
• Pushes messages to the Queuing Service to be executed
• Worker
• Daemon that runs only if you’re using the CeleryExecutor (as opposed to
the SequentialExecutor or LocalExecutor)
• 1 to many dedicated celeryd processes which execute functions
• Pulls messages from the Queuing Service to determine what
functions to execute

Page: 3
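The Scheduler-to-Worker hand-off described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for the Queuing Service, and the names `scheduler_pass`, `worker_drain`, and the task ids are hypothetical, not Airflow or Celery APIs.

```python
import queue


def scheduler_pass(due_tasks, task_queue):
    """One scheduling pass: push a message for every task that is due."""
    for task_id in due_tasks:
        task_queue.put(task_id)


def worker_drain(task_queue, execute):
    """Pull messages until the queue is empty and execute each one."""
    done = []
    while True:
        try:
            task_id = task_queue.get_nowait()
        except queue.Empty:
            return done
        execute(task_id)
        done.append(task_id)


# Hypothetical stand-in for the Queuing Service.
task_queue = queue.Queue()
scheduler_pass(["extract", "load"], task_queue)
executed = worker_drain(task_queue, execute=lambda task_id: None)
```

In a real deployment the queue is an external service, so the Scheduler and the Workers can live on different machines.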
Single Node Deployment

Page: 4
Cluster Deployment

Page: 5
Why set up a Cluster Deployment?

• Distributes heavy processes onto many machines for better use of resources
• More Highly Available Airflow environment
• If you have many Workflows with many Tasks, your executors
may not be able to keep up with all the messages in the queue.
Adding more executors addresses this.

Page: 6
Scaling Workers

• Horizontally
• Add more machines to the cluster
• No need to register the machines with the master. You
just need to start up the Airflow Worker daemon on the new
machine.
• Vertically
• Increase the number of executors (celeryd processes)
per node and restart the workers
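The effect of the two scaling directions on total capacity is simple arithmetic. The sketch below assumes each celeryd process handles one task at a time; the node counts and process counts are made-up numbers, and `total_task_slots` is a hypothetical helper.

```python
def total_task_slots(processes_per_node):
    """Sum the celeryd process counts across all worker nodes."""
    return sum(processes_per_node)


baseline = total_task_slots([4, 4])       # two nodes, 4 processes each
horizontal = total_task_slots([4, 4, 4])  # add a third machine
vertical = total_task_slots([8, 8])       # more processes per node
```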

Page: 7
Scaling Master

Page: 8
Limitations

• There can only be one Scheduler running at a time
• If you have multiple Scheduler processes running, multiple
instances of a single task may be scheduled to run.
• If the Scheduler Daemon, or the machine it runs on, goes
down, then no jobs will get scheduled
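The duplicate-scheduling risk can be shown with a deliberately simplified model. This is not Airflow's scheduler logic: `uncoordinated_pass` is a hypothetical function that enqueues due tasks without checking whether another scheduler has already done so.

```python
def uncoordinated_pass(due_tasks, task_queue):
    """Enqueue every due task without coordinating with other schedulers."""
    for task_id in due_tasks:
        task_queue.append(task_id)


due = ["daily_report"]  # hypothetical task id
task_queue = []

uncoordinated_pass(due, task_queue)  # scheduler A
uncoordinated_pass(due, task_queue)  # scheduler B sees the same due task

duplicates = task_queue.count("daily_report")
```

With two schedulers the same task is queued twice, which is the situation the single-scheduler rule is meant to prevent.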

Page: 9
Airflow Scheduler Failover Controller

• Dedicated Daemon that runs with Airflow on the Master Nodes
• Ensures that there is always one and only one Scheduler
running on the Master nodes at a time
• Developed Internally and Open Sourced
• https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High Level Steps
• Polls (every X seconds) to check if the Scheduler is
running
• If the Scheduler isn’t running, restart it
• If it still doesn’t start up, then try starting it up on the
other master nodes

Page: 10
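The high-level steps above can be sketched as a single polling step. This is a minimal illustration, not the controller's actual code: `failover_step`, `is_running`, `start`, and the node names are hypothetical, and the process checks are simulated with in-memory sets.

```python
def failover_step(nodes, current, is_running, start):
    """One polling step. Returns the node running the Scheduler, or None."""
    if is_running(current):
        return current
    # Step 2: try to restart the Scheduler where it was running.
    start(current)
    if is_running(current):
        return current
    # Step 3: still down, so try each of the other master nodes in turn.
    for node in nodes:
        if node == current:
            continue
        start(node)
        if is_running(node):
            return node
    return None


# Simulated cluster: only master2 can actually start a Scheduler.
healthy = {"master2"}
running = set()


def is_running(node):
    return node in running


def start(node):
    if node in healthy:
        running.add(node)


active_node = failover_step(["master1", "master2"], "master1", is_running, start)
```

A real controller would repeat this step every X seconds and record the resulting node in the Metastore.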
Failover Controller Diagram

Page: 11
Start Up Scenario

Page: 12
Failover Controller Process (Start Up)
On startup, both Failover Controller processes begin in STANDBY

[Diagram: Failover Controller (standby) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 13
Failover Controller Process (Start Up)
The first Failover Controller to enter data into the Metastore is elected
as the active controller.

[Diagram: Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]
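A "first writer wins" election of this kind can be sketched with a uniqueness constraint. The slides do not describe the Metastore schema, so the `active_controller` table, its columns, and `try_become_active` are assumptions made for illustration, with an in-memory SQLite database standing in for the Metastore.

```python
import sqlite3


def try_become_active(conn, node):
    """Return True if this node's row was the first one inserted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (id, node) VALUES (1, ?)",
                (node,),
            )
        return True
    except sqlite3.IntegrityError:
        # Another controller already holds the single allowed row.
        return False


conn = sqlite3.connect(":memory:")  # hypothetical stand-in for the Metastore
conn.execute(
    "CREATE TABLE active_controller (id INTEGER PRIMARY KEY, node TEXT)"
)

first = try_become_active(conn, "master1")
second = try_become_active(conn, "master2")
```

Because the primary key allows only one row, the second controller's insert fails and it remains in standby.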

Page: 14
Failover Controller Process (Start Up)
The Failover Controller checks whether the Scheduler is running, but it
isn’t.

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 15
Failover Controller Process (Start Up)
Failover Controller starts up the Scheduler

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 16
Scheduler Failure Scenario

Page: 17
Failover Controller Process (Process Failure)
Scheduler process has died

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 18
Failover Controller Process (Process Failure)
Failover Controller restarts the Scheduler

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 19
Scheduler Failure and Failed Restart Scenario

Page: 20
Failover Controller Process (Process Failure 2)
Scheduler process has died

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 21
Failover Controller Process (Process Failure 2)
Failover Controller tries to restart the Scheduler, but it’s still not running

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 22
Failover Controller Process (Process Failure 2)
Failover Controller tries to restart the Scheduler on a different node

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 23
Failover Controller Process (Process Failure 2)
Failover Controller succeeds in restarting the Scheduler and the cluster is
back to normal

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 24
Node Failure Scenario

Page: 25
Failover Controller Process (Node Failure)
Everything is running as expected

[Diagram: Scheduler; Failover Controller (active) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 26
Failover Controller Process (Node Failure)
Master Node 1 dies and all the processes running on it are gone

[Diagram: Scheduler; Failover Controller (dead) on Master Node 1; Failover Controller (standby) on Master Node 2]

Page: 27
Failover Controller Process (Node Failure)
Failover Controller on Master Node 2 becomes active because the one running
on Master Node 1 has stopped sending a heartbeat

[Diagram: Scheduler; Failover Controller (dead) on Master Node 1; Failover Controller (active) on Master Node 2]
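The heartbeat-based takeover can be sketched as a staleness check. The 30-second timeout, the timestamps, and the name `should_take_over` are assumptions for illustration; the slides do not state the controller's actual heartbeat interval or logic.

```python
from datetime import datetime, timedelta


def should_take_over(last_heartbeat, now, timeout):
    """A standby takes over once the active controller's heartbeat is stale."""
    return now - last_heartbeat > timeout


now = datetime(2024, 1, 1, 12, 0, 0)   # fixed clock for the example
timeout = timedelta(seconds=30)         # hypothetical threshold

fresh = should_take_over(now - timedelta(seconds=5), now, timeout)
stale = should_take_over(now - timedelta(seconds=90), now, timeout)
```

Here the standby stays passive while the heartbeat is 5 seconds old and takes over once it is 90 seconds old.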

Page: 28
Failover Controller Process (Node Failure)
The newly active Failover Controller tries to check in with and restart the
Scheduler on the node that the Metadata says it is running on, and fails.

[Diagram: Scheduler; Failover Controller (dead) on Master Node 1; Failover Controller (active) on Master Node 2]

Page: 29
Failover Controller Process (Node Failure)
The Failover Controller then starts it on another node and it succeeds

[Diagram: Scheduler and Failover Controller (dead) on Master Node 1; Scheduler and Failover Controller (active) on Master Node 2]

Page: 30
Failover Controller Process (Node Failure)
When Master Node 1 is brought back, the old Failover Controller goes
into STANDBY state

[Diagram: Scheduler; Failover Controller (standby) on Master Node 1; Failover Controller (active) on Master Node 2]

Page: 31
Q&A

Page: 32
