Airflow Clustering and HA
• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure
Airflow Daemons
• Web Server
• Daemon that runs the Airflow Webserver
• One or more gunicorn processes accept and process requests in parallel
• Lets you track job progress, trigger jobs, and more
• Scheduler
• Runs periodically (every X seconds) to determine whether a DAG or Task needs to be run based on the DAG schedule
• Pushes messages to the Queuing Service for execution
• Worker
• Daemon that runs only if you’re using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
• One or more dedicated celeryd processes that execute functions
• Pulls messages from the Queuing Service to determine which functions to execute
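The hand-off between the Scheduler, the Queuing Service, and the Workers can be pictured with a toy model. This is only a sketch using Python's standard library: the in-process `queue.Queue` stands in for the real Queuing Service, threads stand in for celeryd processes, and the names `scheduler`, `worker`, and `run_demo` are hypothetical, not Airflow APIs.

```python
import queue
import threading


def scheduler(task_queue, due_tasks):
    """Stand-in for the Scheduler: push one message per task that is due."""
    for task_id in due_tasks:
        task_queue.put(task_id)


def worker(task_queue, results):
    """Stand-in for a Worker process: pull messages until the queue is empty."""
    while True:
        try:
            task_id = task_queue.get_nowait()
        except queue.Empty:
            return
        results.append(f"ran {task_id}")  # "execute" the function
        task_queue.task_done()


def run_demo(due_tasks, worker_count=2):
    """Queue all due tasks, then let several workers drain the queue."""
    task_queue = queue.Queue()
    results = []
    scheduler(task_queue, due_tasks)
    threads = [
        threading.Thread(target=worker, args=(task_queue, results))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

The point of the model is that the Scheduler never talks to a Worker directly; any Worker that can reach the queue can pick up work.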
Single Node Deployment
Cluster Deployment
Why Set Up a Cluster Deployment?
Scaling Workers
• Horizontally
• Add more machines to the cluster
• No need to register the machines with the master. You just need to start the Airflow Worker daemon on the new machine.
• Vertically
• Increase the number of executors (celeryd processes)
per node and restart the workers
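As a rough way to compare the two scaling directions, the number of tasks the cluster can run at once can be estimated as nodes times worker processes per node. The sketch below assumes each celeryd process runs one task at a time; `cluster_task_slots` is a hypothetical helper, not part of Airflow.

```python
def cluster_task_slots(worker_nodes, processes_per_node):
    """Estimate concurrent task capacity, assuming one task per worker process."""
    if worker_nodes < 0 or processes_per_node < 0:
        raise ValueError("counts must be non-negative")
    return worker_nodes * processes_per_node


baseline = cluster_task_slots(worker_nodes=3, processes_per_node=4)
horizontal = cluster_task_slots(worker_nodes=4, processes_per_node=4)  # one more machine
vertical = cluster_task_slots(worker_nodes=3, processes_per_node=8)    # more processes per node
```

Under these assumptions the baseline is 12 slots, adding a machine gives 16, and doubling the processes per node gives 24, provided each node has the CPU and memory to support them.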
Scaling Master
Limitations
Airflow Scheduler Failover Controller
Start Up Scenario
Failover Controller Process (Start Up)
On startup, both Failover Controller processes begin in STANDBY.
[Diagram: Failover Controller (standby), Failover Controller (standby)]
Failover Controller Process (Start Up)
The first one to enter data into the Metastore is elected as the active
controller.
[Diagram: Failover Controller (active), Failover Controller (standby)]
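One common way to get "first writer wins" behavior is to let a database constraint arbitrate the race. The sketch below illustrates the idea with an in-memory SQLite database standing in for the Metastore. The table `active_controller`, its columns, and the function names are assumptions made for illustration; the deck does not describe the controller's real schema.

```python
import sqlite3


def create_metastore():
    """Create a stand-in Metastore with room for exactly one active controller."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_controller ("
        " slot INTEGER PRIMARY KEY CHECK (slot = 1),"
        " host TEXT NOT NULL)"
    )
    return conn


def try_become_active(conn, host):
    """Claim the single slot; the first insert wins, later ones stay on standby."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (slot, host) VALUES (1, ?)",
                (host,),
            )
        return "ACTIVE"
    except sqlite3.IntegrityError:
        # Another controller already holds the slot.
        return "STANDBY"
```

Because the primary key allows only one row, two controllers racing to insert cannot both succeed, so no extra coordination between them is needed.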
Failover Controller Process (Start Up)
The active Failover Controller checks whether the Scheduler is running, and finds that it is not.
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Start Up)
Failover Controller starts up the Scheduler
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Scheduler Failure Scenario
Failover Controller Process (Process Failure)
Scheduler process has died
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Process Failure)
Failover Controller restarts the Scheduler
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Scheduler Failure and Failed Restart Scenario
Failover Controller Process (Process Failure 2)
Scheduler process has died
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Process Failure 2)
Failover Controller tries to restart the Scheduler, but it’s still not running
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Process Failure 2)
Failover Controller tries to restart the Scheduler on a different node
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Process Failure 2)
The Failover Controller succeeds in restarting the Scheduler, and the cluster is back to normal
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
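The recovery sequence in this scenario (check, restart in place, then fall back to a different node) can be sketched as a small function. The callables `is_running` and `start_on` are hypothetical hooks supplied by the caller, for example thin wrappers around remote commands; how the real controller performs these checks is not shown in the deck.

```python
def recover_scheduler(current_node, all_nodes, is_running, start_on):
    """Return the node where the Scheduler ends up running, or None.

    Tries the node that last hosted the Scheduler first, then each other node.
    """
    candidates = [current_node] + [n for n in all_nodes if n != current_node]
    for node in candidates:
        if is_running(node):
            return node  # nothing to do
        start_on(node)  # attempt a (re)start on this node
        if is_running(node):
            return node  # the restart worked
        # Otherwise fall through and try the next node.
    return None
```

The same loop also covers the simpler cases: at startup nothing is running, so the first start attempt succeeds, and after a plain process failure the restart on the current node succeeds without moving the Scheduler.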
Node Failure Scenario
Failover Controller Process (Node Failure)
Everything is running as expected
[Diagram: Scheduler; Failover Controller (active), Failover Controller (standby)]
Failover Controller Process (Node Failure)
Master Node 1 dies and all the processes running on it are gone
[Diagram: Scheduler; Failover Controller (dead), Failover Controller (standby)]
Failover Controller Process (Node Failure)
The Failover Controller on Master Node 2 becomes active because the one on Master Node 1 has stopped sending a heartbeat
[Diagram: Scheduler; Failover Controller (dead), Failover Controller (active)]
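The standby's takeover decision comes down to comparing the age of the active controller's last heartbeat with a timeout. A minimal sketch follows, assuming heartbeats are recorded as timestamps in seconds; the function name `standby_step` and the 30-second default are illustrative assumptions, not values from the deck.

```python
def standby_step(state, last_active_heartbeat, now, timeout_seconds=30.0):
    """Return the controller's next state given the active peer's last heartbeat."""
    heartbeat_age = now - last_active_heartbeat
    if state == "STANDBY" and heartbeat_age > timeout_seconds:
        # The active controller has gone silent for too long: take over.
        return "ACTIVE"
    return state
```

A longer timeout reduces the chance of a false takeover during a brief network hiccup, at the cost of a longer window without an active controller after a real node failure.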
Failover Controller Process (Node Failure)
The newly active Failover Controller tries to check in with and restart the Scheduler on the node where the metadata says it is running, and fails.
[Diagram: Scheduler; Failover Controller (dead), Failover Controller (active)]
Failover Controller Process (Node Failure)
The Failover Controller then starts it on another node and it succeeds
[Diagram: Scheduler, Scheduler; Failover Controller (dead), Failover Controller (active)]
Failover Controller Process (Node Failure)
When Master Node 1 is brought back, the old Failover Controller goes
into STANDBY state
[Diagram: Scheduler; Failover Controller (standby), Failover Controller (active)]
Q&A