Lecture notes - Automating Machine Learning Workflows
Automation has played a major role in our society. From the Second Industrial Revolution, when
factories were automated using machines, to the Green Revolution in India, where along with
other things, automation was adopted in farming to scale up food production, we can see plenty
of examples of automation bringing about revolutionary changes.
Apache Airflow
Apache Airflow is a tool for building complex pipelines in the form of directed acyclic graphs (DAGs).
Apache Airflow is a versatile tool with which you can:
● Create pipelines: Using Apache Airflow, you can create complex pipelines easily. You
can create pipelines using tasks and dependencies. These tasks and dependencies
when put together in an acyclic manner form directed acyclic graphs.
● Manage pipelines: Using Apache Airflow, you can easily manage the pipelines you have created. This means that you can schedule your pipeline or trigger it based on some events. Recall that in software engineering, two types of automation are predominantly used: schedule-based and event-based. Apache Airflow can perform both types of automation. Apache Airflow can also backfill. Let's understand backfill using the following example: there can be a case where you want to run the DAG for a specified historical period, e.g., a data-filling DAG is created with a start_date in the past, and Airflow runs the DAG once for every schedule interval between that start_date and the current date.
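To make backfill concrete, here is a minimal sketch (the dag_id, task and dates used here are illustrative and not part of our heart disease pipeline): a DAG whose start_date lies in the past and whose catchup behaviour is enabled is run once for every schedule interval it has missed so far.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG: start_date is in the past, so Airflow backfills
# one run for every daily interval between start_date and today.
backfill_demo = DAG(
    dag_id='backfill_demo',            # hypothetical dag_id, for illustration only
    start_date=datetime(2022, 7, 1),   # a past date
    schedule_interval='@daily',
    catchup=True                       # ask Airflow to run the missed intervals
)

echo_task = BashOperator(
    task_id='echo_date',
    bash_command='echo "run for {{ ds }}"',   # {{ ds }} is the logical date of each backfilled run
    dag=backfill_demo
)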
Any flowchart can be a DAG if it satisfies the following conditions:
● Tasks: A DAG is a collection of tasks/nodes. Each function in our modularised code will constitute a task. In some places, you will find that people call this a node, which is simply its mathematical name. For our use, we will call it a task.
In the following figure, you can see that each box represents a task. You have six tasks: A, B, C, D, E and F.
In this diagram, you can see that Task-1 and Task-2 depend on inputs from each other. This creates a circular dependency: we cannot run Task-1 until Task-2 is completed, and vice versa. As a result, we cannot decide the starting point of execution of the DAG. Hence, acyclicity is an important constraint on DAGs.
Tasks
● Tasks are the basic units of execution: Tasks in Apache Airflow are basic units of
execution. Usually, there are multiple tasks in a pipeline, which are executed based on
their dependencies once the pipeline is triggered.
● Tasks are independent: Any two tasks in Apache Airflow are independent of each other. A major implication of this is that we cannot pass a huge amount of information directly between two tasks. Practically, this means that we need to write the output of each task to a database or a file and read it from there in the next task, as sketched below. This is the standard way of passing a large amount of data between two tasks.
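As a minimal sketch of this pattern (the file name, functions and sample values here are illustrative, not the ones from our heart disease pipeline), one task writes its result to a CSV file and the next task reads that file back:

import pandas as pd

INTERMEDIATE_FILE = 'cleaned_data.csv'   # illustrative path used to pass data between tasks

def clean_data():
    # First task: produce some data and persist it to disk.
    df = pd.DataFrame({'age': [63, 45], 'cholesterol': [233, 180]})
    # index=False prevents pandas from writing the row index as an extra column.
    df.to_csv(INTERMEDIATE_FILE, index=False)

def score_data():
    # Second task: read the intermediate file written by the previous task.
    df = pd.read_csv(INTERMEDIATE_FILE)
    print(df.shape)

Each of these functions would then be wrapped in its own Airflow task (operators are discussed next), so data flows between tasks through the file rather than through memory.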
A sample DAG is given below.
● Operators act as templates for defining tasks: In order to define a task in Airflow, you use operators. Operators are essentially templates that take functions/commands as input. You can construct pipelines with tasks in different languages; this is a major implication of tasks being independent. Because the tasks are independent, a pipeline can have one task in, say, Python and another task in, say, Java.
bash_task = BashOperator(        # Defining that we will be using a Bash command for this task
    task_id="make_directory",    # We are giving our task an ID
    bash_command='mkdir "DIR"'   # The bash command that we want the task to execute
)
You can clearly see that defining a task in Airflow is simple, and this is facilitated by operators. Operators essentially port the code that you write in any language to Airflow.
Relations
● Upstream: An upstream dependency means that a given task must be completed before the tasks that depend on it can run. From the diagram given above, you can see that Task 2.1 needs to be executed before Task 3; in other words, Task 2.1 is an upstream task of Task 3.
● Downstream: A downstream dependency means that a given task can only be executed once the tasks it depends on have completed. From the diagram, you can see that Task 3 is downstream of Task 2.1 and Task 2.2, so it runs only after both of them have finished. A sketch of how such relations are declared in code is given below.
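As a minimal sketch of declaring these relations (the variable names mirror the diagram and are illustrative), Airflow uses the bit-shift operators >> and << to set upstream/downstream dependencies between already-defined tasks:

# task_2_1, task_2_2 and task_3 are assumed to be tasks defined with operators.
task_2_1 >> task_3                  # task_2_1 must finish before task_3 starts
task_2_2 >> task_3                  # task_3 is downstream of task_2_2 as well
# Equivalently: task_3 << [task_2_1, task_2_2]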
Organising Code
Note that we need to keep all our pipeline-related files in the 'dags' folder, which is present in the 'airflow' folder; otherwise, your DAG will not be recognised.
We first created a directory named 'heart_disease' inside the dags folder in airflow. We then uploaded the data files, saved models, utils and constants files, and organised them.
We did not upload the main file in this case, as the inference_pipeline.py would take care of all
the functionalities of the main file and add more functionalities of scheduling, logging and
exception handling.
● You modified the functions present in the utils.py file to write their output to CSV files and read their inputs from CSV files. Recall that each time we wrote output to a CSV file, we passed index=False; this argument stops the index from being written to the CSV file. This is important because, if we do not pass this argument while writing the CSV files, we will get a new index column each time we write the files.
● You removed the apply_pre_processing function, as you want to run all the functions
independently as tasks. Using this function would mean that a single Airflow task will
contain the entire pipeline, which is not desirable.
Recall that we created an "inference_pipeline.py" file. There, you define an Airflow pipeline that runs all the functions in utils.py in the same order, but in the form of Airflow tasks.
● You created a function predict_data that loads the model, passes the data, gets the
model’s prediction and writes it to a file.
● You modified the constants.py file to include all the newly defined constants.
Whenever you import functions from a custom file in Airflow, you must refer to them via their module path under the dags folder rather than by the bare file name. Only then will the code work with Airflow; otherwise, it will throw an import error saying that the module is not found. In normal Python, "from file_name import *" will work just fine, but that is not the case in Airflow. So, please ensure that you follow this convention whenever you import functions from a custom file in Airflow.
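As a minimal sketch of this convention (assuming the layout described above, i.e., utils.py and constants.py inside airflow/dags/heart_disease/, and that get_inference_data and predict_data are defined in utils.py), the imports at the top of inference_pipeline.py would look something like this:

# Airflow puts the dags folder on the Python path, so custom modules are
# imported via their path under dags/, not via bare file names.
from heart_disease.utils import get_inference_data, predict_data   # assumed to be defined in utils.py
from heart_disease.constants import *                              # constants defined in constants.py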
You defined some arguments that we will pass while creating an instance of DAG.

from datetime import datetime, timedelta   # needed for start_date and retry_delay

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 7, 30),
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}
Let’s understand all the default arguments that we have defined in detail:
● owner: This is the name of the owner of the pipeline. It can be the name of the company or organisation. We have arbitrarily set this argument to 'airflow'.
● start_date: This is the date from which the scheduling of the pipeline needs to start. If the start date is in the past, then Airflow will automatically run the pipeline for the previous dates. This is called catchup.
● retries: This is the number of times a particular task is to be retried once it fails. This
argument needs to be a whole number. In our case, we have instructed Airflow to retry
failed tasks just once. This is similar to exception handling that you learnt about in the
previous modules.
● retry_delay: This is the duration to wait before retrying a failed task. We have instructed Airflow to wait 30 seconds before retrying a failed task.
ML_inference_dag = DAG(
    dag_id = 'Heart_Disease_ML_dag',
    default_args = default_args,
    description = 'Dag to run inferences on predictions of heart disease patients',
    schedule_interval = '@hourly'
)
Let’s understand all the arguments that we have passed to the DAG class:
● dag_id: It is a unique identifier associated with every DAG in Airflow. It is essentially the name of the DAG. An important point to note is that there should not be any whitespace in the DAG's ID.
For example, 'Heart Disease ML Dag' is not a valid dag_id, but 'Heart_Disease_ML_dag' is. The key takeaway here is that you should use '_' instead of ' ' (space) in your dag_id.
● default_args: These are a set of arguments that configure a DAG. Here, we will pass
the arguments that we defined earlier in the default_args dictionary.
● description: This is a brief description of what your DAG does.
● schedule_interval: This specifies the interval at which your DAG will be triggered. We
can trigger our DAGs hourly, daily, weekly, monthly or yearly. In our case, we have
instructed Airflow to trigger our DAG hourly, as we are getting new inference data hourly.
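As a small illustration, the '@hourly' preset is shorthand for a cron expression, and either form can be passed to schedule_interval (the cron string below is the standard equivalent of the preset):

schedule_interval = '@hourly'     # preset: run at the start of every hour
# ...which is equivalent to the cron expression:
schedule_interval = '0 * * * *'   # minute 0 of every hour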
load_task = PythonOperator(
    task_id = 'load_task',
    python_callable = get_inference_data,
    dag = ML_inference_dag
)
Ensure that you pass the variable name (ML_inference_dag) in the dag argument, and not the dag_id.
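Putting these pieces together, here is a sketch of how the remaining tasks could be defined and wired (predict_task is an illustrative task built around the predict_data function described earlier; any intermediate pre-processing tasks are omitted here):

predict_task = PythonOperator(
    task_id = 'predict_task',          # illustrative task_id
    python_callable = predict_data,    # loads the model, predicts and writes the output to a file
    dag = ML_inference_dag
)

# Declare the order of execution: load the inference data first, then predict.
load_task >> predict_task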
You can verify your DAG from the command line using the following checks:
● Whether our DAG is showing up in the list of all DAGs in Airflow: airflow dags list
NOTE: When you run this command, many DAGs will be listed. These DAGs are sample DAGs provided by default in Airflow.
● Whether the DAG has all the tasks that we defined: airflow tasks list <dag_id>
● Whether the individual tasks are running properly: airflow tasks test <dag_id> <task_id> <date>
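For our pipeline, these checks would look something like the following (the date is illustrative and should fall on or after the DAG's start_date):

airflow dags list                                              # Heart_Disease_ML_dag should appear in the list
airflow tasks list Heart_Disease_ML_dag                        # lists load_task and the other tasks
airflow tasks test Heart_Disease_ML_dag load_task 2022-07-30   # runs a single task for the given date without recording state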
In order to initialise the Airflow server, we need to run two commands: airflow scheduler, which initiates the scheduler in Airflow that takes care of scheduling all the pipelines, and airflow webserver, which provides a GUI.
Since we are running Airflow on Jarvislabs, we need to ensure that Airflow runs on one of the available ports. On Jarvislabs, only ports 6006, 6007, 6008 and 6009 are available to us. Thus, we need to specify the port on which we want to run airflow webserver.
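A minimal sketch of starting the web server on one of these ports (6006 is just one of the allowed choices):

airflow webserver --port 6006   # serve the Airflow UI on port 6006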
Navigating Airflow's UI
You learnt which information about your DAGs and their runs you can access by navigating the different tabs of the Airflow UI.
Recall that we did not add any logging. This is because we know that Airflow automatically logs
our runs.
You learnt how the architecture of Airflow can be related to a factory/assembly line.
In the analogy, we related a scheduler to a manager who is assigned the task of ensuring that
things are running smoothly. You learnt that a scheduler is responsible for managing all the
tasks in Airflow.
For this, it keeps tabs on when to trigger which pipeline, the pipeline configuration, etc., in a database. This is our metadata database. It is similar to the task sheet managed by the manager of a factory. Now, in order to manufacture a part, we need workers. In the case of Airflow, workers are analogous to computational resources.
Our DAG directory is a directory where we store all the files used to create a DAG. Recall that
we created our heart_disease directory in a folder called dags in Airflow. This is Airflow’s DAG
directory. Now, a web server is essentially a way for users to interact with all the information
about the pipelines' schedules.
Recall that while executing the commands, we first ran airflow db init, then airflow scheduler and finally airflow webserver. This means that we first initialised our database, as all the information regarding scheduling is stored in it. Then, we initialised our scheduler. Initialising the scheduler before the database would throw an error, because that would be equivalent to asking a factory manager to do a set of tasks without telling them what those tasks are. Finally, after initialising the database and the scheduler, we initialised the webserver.
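In summary, the start-up sequence looks like this (the port is one of the Jarvislabs ports mentioned earlier):

airflow db init                 # 1. initialise the metadata database
airflow scheduler               # 2. start the scheduler, which reads its state from that database
airflow webserver --port 6006   # 3. start the web UI on an available port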