

Published in Towards Data Science

Binh Phan · Jun 20, 2020 · 7 min read

10 Minutes to Building a Machine Learning Pipeline with Apache Airflow

Why ML pipelines matter, and how to build a simple ML pipeline using Apache AirFlow

Photo by Jaimie Phillips on Unsplash


Often, when you think about Machine Learning, you tend to think about the great models that you can now create. But if you want to take these amazing models and make them available to the world, you will have to move beyond just training the model and incorporate data collection, feature engineering, training, evaluation, and serving.

On top of all that, you will also have to remember that you’re putting a software
application into production. That means you’ll have all the requirements that any
production software has, including scalability, consistency, modularity,
testability, and security.

ML code is only one piece of an ML system

An ML pipeline allows you to automatically run the steps of a Machine Learning system, from data collection to model serving (as shown in the photo above). It will also reduce the technical debt of a machine learning system, as this linked paper describes. This segues into MLOps, a fast-growing field that, similar to DevOps, aims to automate and monitor all steps of the ML system.

This tutorial will show you how to build a simple ML pipeline that automates the workflow of a deep learning image classifier for dandelions and grass, built using FastAI and served as a web app using Starlette. Out of the many workflow tools like Luigi, MLFlow, and KubeFlow, we’ll use Apache AirFlow because it provides an extensive set of features and a beautiful UI. AirFlow is open-source software that allows you to programmatically author and schedule your workflows using a directed acyclic graph (DAG) and monitor them via the built-in Airflow user interface. At the end of the tutorial, I’ll show you further steps you can take to make your pipeline production-ready.

Requirements: Since you will run this tutorial on a VM instance, all you will
need is a computer running any OS, and a Google account.

This tutorial will be broken down into the following steps:

1. Sign up for Google Cloud Platform and create a compute instance

2. Pull tutorial contents from Github

3. Overview of ML pipeline in AirFlow

4. Install Docker & set up virtual hosts using nginx

5. Build and run a Docker container

6. Open Airflow UI and run ML pipeline

7. Run deployed web app

1. Sign up for Google Cloud Platform and Create a Compute Instance


Signing up for GCP is free!

If you haven’t already, sign up for Google Cloud Platform through your
Google account. You’ll have to enter your credit card, but you won’t be
charged anything upon signing up. You’ll also get $300 worth of free credits
that last for 12 months! If you’ve run out of credits, don’t worry — running this
tutorial will cost pennies, provided you stop your VM instance afterward!

Once you’re in the console, go to Compute Engine and create an instance.


Then:

1. name the instance greenr-airflow

2. set the machine type to n1-standard-8

3. set the OS to Ubuntu 18.04

4. increase the boot disk size to 30 GB

5. allow access to Cloud APIs and allow HTTP/HTTPS traffic
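If you prefer the command line over the console, the same instance can be sketched with gcloud. This is an illustrative equivalent of the console steps above, not part of the original tutorial; the flag values mirror the settings listed:

gcloud compute instances create greenr-airflow \
    --machine-type=n1-standard-8 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=30GB \
    --tags=http-server,https-server \
    --scopes=cloud-platform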


Video by Binh Phan (@btphan95): how to create a VM instance

When your instance has been created and is running, SSH into your instance
by clicking on the SSH button located on the right side of the screen.

2. Pull the Tutorial Contents from Github


Let’s pull the tutorial contents from Github. Clone the repo by typing this into the VM instance:

git clone https://github.com/btphan95/greenr-airflow
cd greenr-airflow

3. Overview of ML Pipeline components


Let’s explore the contents of this Git repository. First, let’s explore the AirFlow configuration file, /config/airflow.cfg . It sets all of the configuration options for your AirFlow pipeline, including the location of your airflow pipelines (in this case, we set this folder to be /dags/ ) and the connection to our metadata database, sql_alchemy_conn . AirFlow uses a database to store metadata about DAGs, their runs, and other Airflow configurations like users, roles, and connections. Think of it like saving your progress in a video game.
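For illustration, the two settings called out above look roughly like this inside airflow.cfg. The paths here are indicative only; check the repo’s /config/airflow.cfg for the exact values:

[core]
# where AirFlow looks for DAG definition files like ml_pipeline.py
dags_folder = /usr/local/airflow/dags
# the metadata database; SQLite is the simple default, Postgres/MySQL for production
sql_alchemy_conn = sqlite:////usr/local/airflow/airflow.db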

Next, let’s explore the ML pipeline, which is defined in /dags/ml_pipeline.py :

ml_pipeline.py:

from datetime import timedelta
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'Binh Phan',
    'depends_on_past': False,
    'start_date': days_ago(31),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

# instantiate a directed acyclic graph
dag = DAG(
    'ml_pipeline',
    default_args=default_args,
    description='A simple Machine Learning pipeline',
    schedule_interval=timedelta(days=30),
)

# instantiate tasks using Operators.
# BashOperator defines tasks that execute bash scripts. In this case, we run Python scripts.
download_images = BashOperator(
    task_id='download_images',
    bash_command='python3 /usr/local/airflow/scripts/download_images.py',
    dag=dag,
)
train = BashOperator(
    task_id='train',
    depends_on_past=False,
    bash_command='python3 /usr/local/airflow/scripts/train.py',
    retries=3,
    dag=dag,
)

serve_commands = """
lsof -i tcp:8008 | awk 'NR!=1 {print $2}' | xargs kill;
python3 /usr/local/airflow/scripts/serve.py serve
"""
serve = BashOperator(
    task_id='serve',
    depends_on_past=False,
    bash_command=serve_commands,
    retries=3,
    dag=dag,
)

# set the ordering of the tasks
download_images >> train >> serve

What this Python script does is define our directed acyclic graph structure as code, including the tasks that we want to execute and their ordering.

Let’s break down the code in ml_pipeline.py:

1. default_args defines arguments that will be fed into our DAG function. This includes the owner name, owner ; whether or not the DAG should run again after a failed instance, depends_on_past ; and the start date, start_date . Note: AirFlow will start your DAG after start_date plus one instance of schedule_interval under the DAG definition. This means that in our case, we start at (31 days ago) + (30-day interval) = yesterday.

2. dag = DAG(...) instantiates a directed acyclic graph along with default_args and the schedule_interval .

3. download_images = BashOperator(...) Remember that Apache AirFlow defines our directed acyclic graph structure as code, and that DAGs consist of tasks. Tasks are defined using Operators, which execute code. Here, we use BashOperator to run the Python files for our tasks, located in scripts/ . For example, in our download_images task, where we download images from Google Images, the BashOperator calls python3 /usr/local/airflow/scripts/download_images.py .

4. download_images >> train >> serve sets the ordering of the tasks in the DAG. The >> directs the second task to run after the first has completed. That means in the pipeline, download_images will run before train, which will run before serve.

I encourage you to take a look at the Python scripts for each of the tasks!
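Before touching the UI, you can also sanity-check the DAG from the command line. A minimal sketch, assuming the Airflow 1.x CLI (which matches the bash_operator import above; Airflow 2.x renames these commands):

# confirm AirFlow has picked up the DAG from the dags/ folder
airflow list_dags

# run a single task in isolation, without the scheduler, for a given date
airflow test ml_pipeline download_images 2020-06-20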
4. Install Docker and Set up Virtual Hosts Using nginx

Docker allows you to run software in containers, which are virtual environments running their own OS, making it easy to bundle software and its required packages reliably and easily. That’s exactly why we’re going to run our tutorial using Docker. First, let’s install Docker on our VM by running a Bash script from the Docker website (or you can manually install Docker here):

bash <(curl -s https://get.docker.com/)
Then, you’ll need to set up virtual hosts using nginx, an open-source web server. Since we’re using two ports in the Docker container, one for the AirFlow UI and one for the final web app, but we only have one public IP address (the external address of our compute instance), we’ll use nginx to route incoming HTTP requests to the right port. Route AirFlow on port 8080 to our default port 80 using this script:

bash scripts/nginx-airflow.sh
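For the curious, the script boils down to a reverse-proxy rule. A minimal sketch of the kind of server block it sets up (illustrative only; the actual configuration lives in scripts/nginx-airflow.sh):

server {
    listen 80;
    location / {
        # forward requests on the default HTTP port to the AirFlow UI
        proxy_pass http://localhost:8080;
    }
}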

5. Build and Run Docker Container


Now, let’s build the Docker image using the following command:

sudo docker build -t greenr-airflow:latest .

This will build the image from the Dockerfile in our directory, downloading and installing the necessary libraries for our container.

Be patient — this will take around 10 minutes.

Then, run the Docker container:

sudo docker run -d -p 8080:8080 -p 8008:8008 greenr-airflow
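If the UI doesn’t come up in the next step, it’s worth confirming that the container is actually running. These standard Docker commands (not part of the original steps) show its status and logs:

# list running containers; the greenr-airflow image should appear here
sudo docker ps

# tail the container's logs, looking up its ID by image name
sudo docker logs $(sudo docker ps -q --filter ancestor=greenr-airflow)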


After a few seconds, visit the external IP address of your machine, which you can find on Compute Engine in the GCP console. Make sure that when you enter it in your browser, you prefix the IP address with http://, like this:
http://34.68.160.231/

You should now see the AirFlow user interface:

you should now see the AirFlow UI

Now, toggle the Off switch to On, and click the left-most button, Trigger DAG. The pipeline will now be running, and when you click ml_pipeline and go to the Tree section, you’ll see the directed acyclic graph and also the status of the pipeline:

After a few minutes, the last task, serve, will be running. Your ML pipeline
has been successfully executed! Now, let’s have nginx route our app from
port 8008 to our external IP address:


bash scripts/nginx-app.sh

At this point, go to your external IP address and add /app after, like this:
http://34.68.160.231/app

You will now see the final web app, greenr, running! If you are interested in
learning how to build and deploy greenr, check out my tutorial on deploying
a deep learning model on Google Cloud Platform!

the deployed web app! Upload an image to see if it’s a dandelion or grass


the final result!

In this tutorial, you learned how to build a simple Machine Learning pipeline in Apache AirFlow consisting of three tasks: download images, train, and serve. Certainly, this can be improved to be more production-ready and scalable. Here are some suggestions on how to take your pipeline further:

1. Consider using KubernetesOperator to create tasks that run in Kubernetes, allowing for much more scalability

2. Consider using another Executor like CeleryExecutor that allows for parallel and scaled-out workers (we used the rudimentary SequentialExecutor; a minimal configuration sketch follows this list)

3. Consider using KubeFlow, which allows for large-scale, production-ready ML pipelines that orchestrate tasks in Kubernetes, similar to 1.
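For suggestion 2, the executor is chosen in airflow.cfg. A minimal sketch, assuming the Airflow 1.x configuration layout; note that CeleryExecutor also requires a real metadata database and a message broker, and the broker URL below is a hypothetical placeholder:

[core]
# SequentialExecutor (the SQLite default) runs one task at a time;
# CeleryExecutor fans tasks out to a pool of workers
executor = CeleryExecutor

[celery]
# point this at your own Redis or RabbitMQ instance
broker_url = redis://localhost:6379/0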

I hope this gives you a gentle introduction to building ML pipelines on Apache AirFlow and that this can serve as a template for your work!

Don’t forget to turn off your compute instance to ensure that you don’t get charged
for usage by Google Cloud!
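One way to do that, assuming the instance name from earlier (you can equally use the Stop button in the Compute Engine console):

gcloud compute instances stop greenr-airflow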



