10 Minutes to Building a Machine Learning Pipeline with Apache Airflow
Binh Phan | Towards Data Science
Often, when you think about Machine Learning, you tend to think about the great
models that you can now create. If you want to take these amazing models and
make them available to the world, you will have to move beyond just training the
model and incorporate data collection, feature engineering, training,
evaluation, and serving.
On top of all that, you will also have to remember that you’re putting a software
application into production. That means you’ll have all the requirements that any
production software has, including scalability, consistency, modularity,
testability, and security.
This tutorial will show you how to build a simple ML pipeline that automates
the workflow of a deep learning image classifier for dandelions and grass,
built using FastAI and served as a web app using Starlette. Out of the many
workflow tools like Luigi, MLFlow, and KubeFlow, we'll use Apache AirFlow
because it provides an extensive set of features and a beautiful UI.
Requirements: Since you will run this tutorial on a VM instance, all you will
need is a computer running any OS, and a Google account.
If you haven’t already, sign up for Google Cloud Platform through your
Google account. You’ll have to enter your credit card, but you won’t be
charged anything upon signing up. You’ll also get $300 worth of free credits
that last for 12 months! If you’ve run out of credits, don’t worry — running this
tutorial will cost pennies, provided you stop your VM instance afterward!
When your instance has been created and is running, SSH into your instance
by clicking on the SSH button located on the right side of the screen.
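Alternatively, if you have the gcloud CLI installed locally, you can SSH in without the browser; the instance name and zone below are placeholders for your own:

gcloud compute ssh my-instance --zone=us-central1-a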
Then, clone the tutorial repository and move into its directory:
git clone https://github.com/btphan95/greenr-airflow
cd greenr-airflow
Apache AirFlow lets us define the entire workflow in a single Python file, ml_pipeline.py, as code, including the tasks that we want to execute and their ordering:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# default arguments that will be fed into our DAG
default_args = {
    'owner': 'airflow',            # owner name (value assumed)
    'depends_on_past': False,
    'start_date': days_ago(31),    # 31 days ago, per the note below
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

# instantiates a directed acyclic graph
dag = DAG(
    'ml_pipeline',
    default_args=default_args,
    description='A simple Machine Learning pipeline',
    schedule_interval=timedelta(days=30),
)

# instantiate tasks using Operators.
# BashOperator defines tasks that execute bash commands. In this case, we run Python scripts.
download_images = BashOperator(
    task_id='download_images',
    bash_command='python3 /usr/local/airflow/scripts/download_images.py',
    dag=dag,
)
train = BashOperator(
    task_id='train',
    depends_on_past=False,
    bash_command='python3 /usr/local/airflow/scripts/train.py',
    retries=3,
    dag=dag,
)
serve_commands = """
lsof -i tcp:8008 | awk 'NR!=1 {print $2}' | xargs kill;
python3 /usr/local/airflow/scripts/serve.py serve
"""
serve = BashOperator(
    task_id='serve',
    depends_on_past=False,
    bash_command=serve_commands,
    retries=3,
    dag=dag,
)

# set the ordering of the tasks
download_images >> train >> serve
Let’s break down the code in ml_pipeline.py:

1. default_args = … default_args defines arguments that will be fed into our DAG function. This includes the owner name, owner, whether or not the DAG should run again after a failed instance, depends_on_past, and the start date, start_date. Note: AirFlow will start your DAG only after start_date plus one instance of the schedule_interval set in the DAG definition has passed. This means that in our case, we start at (31 days ago) + (30-day interval) = yesterday.

2. dag = DAG… DAG instantiates a directed acyclic graph along with default_args and the schedule_interval.

3. download_images = BashOperator… Remember that Apache AirFlow defines our directed acyclic graph structure as code, and that DAGs consist of tasks. Tasks are defined using Operators, which execute code. Here, we use BashOperator to run the Python files for our tasks, located in scripts/. For example, in our download_images task, where we download images from Google Images, the BashOperator calls python3 /usr/local/airflow/scripts/download_images.py.
4. download_images >> train >> serve Here, we set the ordering of the tasks in the DAG. The >> operator directs the second task to run after the first has completed. That means that in the pipeline, download_images will run before train, which will run before serve.

I encourage you to take a look at the Python scripts for each of the tasks!
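As an aside, the >> syntax is shorthand for Airflow's explicit task-dependency methods, so the following three lines all declare the same edge in the graph:

# three equivalent ways to say "train runs after download_images" in Airflow
download_images >> train                # bitshift syntax, as in ml_pipeline.py
download_images.set_downstream(train)   # explicit method form
train.set_upstream(download_images)     # same edge, declared from the downstream side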
4. Install Docker and Set up Virtual Hosts Using nginx

Docker allows you to run software in containers: isolated environments that ship with their own filesystem and dependencies, making it easy to bundle software and its required packages reliably. That's exactly why we're going to run our tutorial using Docker. First, let's install Docker on our VM by running the convenience script from the Docker website (or you can install Docker manually):

bash <(curl -s https://get.docker.com/)
Then, you'll need to set up virtual hosts using nginx, an open-source web
server. We're using two ports in the Docker container, one for the AirFlow
UI and one for the final web app, but we only have one public IP address
(the external address of our compute instance), so we'll use nginx to route
HTTP requests arriving at that address to the right port. Route AirFlow's
port 8080 to the default HTTP port 80 using this script:
bash scripts/nginx-airflow.sh
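If you're curious what such a script does, it typically just writes a small reverse-proxy configuration and reloads nginx. A minimal sketch of what scripts/nginx-airflow.sh might amount to (the repo's actual script may differ):

# illustrative only; paths and config layout are assumptions
sudo tee /etc/nginx/sites-enabled/default > /dev/null <<'EOF'
server {
    listen 80;
    location / {
        proxy_pass http://localhost:8080;  # forward port-80 traffic to the Airflow UI
    }
}
EOF
sudo nginx -s reload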
Next, build and run the Docker container. Building runs the Dockerfile in our
directory and downloads and installs the necessary libraries for our container.
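A typical build-and-run, with an assumed image name and the two ports we'll need published, looks something like this (check the repo's README for the exact commands):

docker build -t greenr-airflow .
docker run -d -p 8080:8080 -p 8008:8008 greenr-airflow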
After a few seconds, visit the external IP address of your machine, which
you can find on the Compute Engine page in the GCP console. Make sure to
prefix the IP address with http:// when you enter it in your browser, like this:
http://34.68.160.231/
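If the page doesn't come up, one quick check from inside the VM is whether nginx is answering at all; an HTTP 200 or a redirect means the proxy is up:

curl -I http://localhost/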
Now, click the Off switch to turn the DAG on, and click the left-most button,
Trigger DAG. The pipeline will now be running, and when you click ml_pipeline
and go to the Tree View, you'll see the directed acyclic graph and the status
of the pipeline.
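If you prefer the terminal, the Airflow 1.x CLI exposes the same two actions; from a shell inside the container:

airflow unpause ml_pipeline      # equivalent to flipping the Off switch
airflow trigger_dag ml_pipeline  # equivalent to the Trigger DAG button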
After a few minutes, the last task, serve, will be running. Your ML pipeline
has been successfully executed! Now, let’s have nginx route our app from
port 8008 to our external IP address:
bash scripts/nginx-app.sh
At this point, go to your external IP address and add /app after, like this:
http://34.68.160.231/app
You will now see the final web app, greenr, running! If you are interested in
learning how to build and deploy greenr, check out my tutorial on deploying
a deep learning model on Google Cloud Platform!
The deployed web app: upload an image to see whether it's a dandelion or grass.
Don’t forget to turn off your compute instance to ensure that you don’t get charged
for usage by Google Cloud!
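You can stop it from the console, or with the gcloud CLI; the instance name and zone below are placeholders for your own:

gcloud compute instances stop my-instance --zone=us-central1-a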