Data Engineering for Machine Learning Pipelines
From Python Libraries to ML Pipelines and Cloud Platforms
Pavan Kumar Narayanan
Table of Contents

Data Reshaping  75
  pivot( )  75
  pivot_table( )  76
  stack( )  78
  unstack( )  80
  melt( )  81
  crosstab( )  82
  factorize( )  85
  compare( )  87
  groupby( )  89
Conclusion  91
Chapter 5: Getting Started with Data Validation using Pydantic and Pandera  163
  Introduction  163
  Introduction to Data Validation  164
  Need for Good Data  164
  Definition  165
  Principles of Data Validation  166
  Introduction to Pydantic  168
  Type Annotations Refresher  168
  Setup and Installation  170
  Pydantic Models  171
  Fields  173
  JSON Schemas  174
  Constrained Types  176
  Validators in Pydantic  176
  Introduction to Pandera  178
  Setup and Installation  179
  DataFrame Schema in Pandera  180
  Data Coercion in Pandera  182
  Checks in Pandera  183
  Partitions  282
  Broker  282
  Replication  282
  Producers  283
  Consumers  283
  Schema Registry  284
  Kafka Connect  284
  Kafka Streams and ksqlDB  284
  Kafka Admin Client  286
  Setup and Development  287
  Kafka Application with the Schema Registry  296
  Protobuf Serializer  301
  Stream Processing  307
  Stateful vs. Stateless Processing  308
  Kafka Connect  320
  Best Practices  321
  Conclusion  322
Chapter 10: Engineering Machine Learning and Data REST APIs using FastAPI  323
  Introduction  323
  Introduction to Web Services and APIs  324
  OpenWeather API  324
  Types of APIs  324
  Typical Process of APIs  326
  Endpoints  326
  API Development Process  327
  REST API  328
  HTTP Status Codes  329
  FastAPI  330
  Setup and Installation  331
  Core Concepts  332
  Path Parameters and Query Parameters  332
Chapter 11: Getting Started with Workflow Management and Orchestration  361
  Introduction  361
  Introduction to Workflow Orchestration  362
  Workflow  362
  ETL and ELT Data Pipeline Workflow  365
  Workflow Configuration  368
  Workflow Orchestration  369
  Introduction to Cron Job Scheduler  371
  Concepts  371
  Crontab File  372
  Cron Logging  374
  Cron Job Usage  375
  Cron Scheduler Applications  380
  Cron Alternatives  381
  Conclusion  381
Chapter 12: Orchestrating Data Engineering Pipelines using Apache Airflow  383
  Introduction  383
  Introduction to Apache Airflow  384
  Setup and Installation  384
  Airflow Architecture  389
  Web Server  390
  Database  390
  Executor  390
  Scheduler  391
  Configuration Files  391
  A Simple Example  393
  Airflow DAGs  395
  Tasks  397
  Operators  397
  Sensors  399
  Task Flow  400
  Xcom  401
  Hooks  402
  Variables  403
  Params  406
  Templates  407
  Macros  407
  Controlling the DAG Workflow  408
  Triggers  411
  Conclusion  413
Chapter 14: Getting Started with Big Data and Cloud Computing  451
  Introduction  451
  Background of Cloud Computing  452
  Networking Concepts for Cloud Computing  453
  Introduction to Big Data  454
  Hadoop  454
  Spark  455
  Introduction to Cloud Computing  456
  Cloud Computing Deployment Models  456
  Cloud Architecture Concepts  458
  Cloud Computing Vendors  461
  Cloud Service Models  461
  Cloud Computing Services  463
  Conclusion  471
Chapter 15: Engineering Data Pipelines Using Amazon Web Services  473
  Introduction  473
  AWS Console Overview  474
  Setting Up an AWS Account  474
  Installing the AWS CLI  488
  AWS S3  489
  Uploading Files  492
  AWS Data Systems  493
  Amazon RDS  493
Chapter 16: Engineering Data Pipelines Using Google Cloud Platform  531
  Introduction  531
  Google Cloud Platform  532
  Set Up a GCP Account  533
  Google Cloud Storage  534
  Google Cloud CLI  535
  Google Compute Engine  537
  Cloud SQL  541
  Google Bigtable  547
  Google BigQuery  552
  Google Dataproc  554
  Google Vertex AI Workbench  559
  Google Vertex AI  562
  Conclusion  569
Index  617
About the Author
Pavan Kumar Narayanan has an extensive and diverse
career in the information technology industry, with a
primary focus on the data engineering and machine
learning domains. Throughout his professional journey,
he has consistently delivered solutions in environments
characterized by heterogeneity and complexity. His
experience spans a broad spectrum, encompassing
traditional data warehousing projects following waterfall
methodologies and extending to contemporary integrations
that involve application programming interfaces (APIs) and
message-based systems. Pavan has made substantial contributions to large-scale data
integrations for applications in data science and machine learning. At the forefront
of these endeavors, he has played a key role in delivering sophisticated data products
and solutions, employing a versatile mix of both traditional and agile approaches.
Pavan completed his Master of Science in Computational Mathematics at Buffalo
State University, Operations Research at the University of Edinburgh, and Information
Technology at Northern Arizona University. He can be contacted at
[email protected].
About the Technical Reviewer
Suvoraj Biswas is a leading expert and emerging
thought leader in enterprise generative AI, specializing
in architecture and governance. He authored the award-
winning book Enterprise GENERATIVE AI Well-Architected
Framework & Patterns, acclaimed by industry veterans
globally. Currently an enterprise solutions architect
at Ameriprise Financial, a 130-year-old Fortune 500
organization, Biswas has over 19 years of IT experience
across India, the United States, and Canada. He has held key
roles at Thomson Reuters and IBM, focusing on enterprise architecture, architectural
governance, DevOps, and DevSecOps.
Biswas excels in designing secure, scalable enterprise systems, facilitating multi-
cloud adoption, and driving digital transformation. His expertise spans Software as a
Service (SaaS) platform engineering, Amazon Web Services (AWS) cloud migration,
generative AI implementation, machine learning, and smart IoT platforms. He has
led large-scale, mission-critical enterprise projects, showcasing his ability to integrate
advanced cutting-edge technologies into practical business strategies.
Introduction
The role of data engineering has become increasingly crucial. Data engineering is
the foundation on which organizations build their reporting, analytics, and machine
learning capabilities. Data engineering as a discipline transforms raw data into reliable
and insightful informational resources that provide valuable insights to organizations.
However, data engineering is much more than that. It is a complex and iterative process.
If an enterprise project can be seen as constructing a multi-storied building, then
data engineering is where the execution happens—from architecting and designing
the structure of the data solution to laying the data storage foundation; installing the
data processing frameworks; developing pipelines that source, wrangle, transform,
and integrate data; and developing reports and dashboards while ensuring quality
and maintenance of the systems, pipelines, reports, and dashboards, among several
other things.
This book is designed to serve you as a desk reference. Whether you are a beginner
or a professional, this book will help you with the fundamentals and provide you with
job-ready skills to deliver high-value proposition to the business. This book is written
with an aim to bring back the knowledge gained from traditional textbook learning,
providing an organized exploration of key data engineering concepts and practices. I
encourage you to own a paper copy. This book will serve you well for the next decade in
terms of concepts, tools, and practices.
Though the content is extensive, the book covers around 60% of data engineering
topics. Data engineering is a complex and evolving discipline; there are many more
topics that are not touched on. As a reader, I encourage you to contact me and share your
remarks and things you witnessed in the data space that you wish to share. I am thankful
to those who have already written to me, and I wish to see more of these efforts from the
readers and enthusiasts.
Having said that, this book covers a wide range of topics that are practiced in
modern data engineering, with a focus on building data pipelines for machine learning
development. The chapters progress from core Python, git, and SQL foundations through
data wrangling, data validation, streaming, APIs, and workflow orchestration, to big data
and cloud platforms.
The book covers major cloud computing service providers. The reason is that
organizations are increasingly moving toward a multi-cloud approach. And so, being
equipped with knowledge on diverse cloud environments is crucial for your career
success. This book delivers that.
If you are a data engineer looking to expand your skillset and move up the ladder,
a data scientist who seeks to understand how things work under the hood, a software
engineer who is already doing the work, or a student who is looking to gain a position,
this book is essential for you. Even if you work in a functional domain or are the founder
of a start-up interested in setting up a team and good data engineering practices, this
book will help you from start to end. It is a guide for learners at every stage, from novice
to expert.
I highly encourage you not to seek solutions from chat bots during learning. At
the time of writing this (June 2024), many bot programs either provide legacy code or
steer you into their own direction and approach. It may seem better because of the way
information is presented, but it slows down your learning and professional development.
Even to this date, I find search engines to be more effective, which take me to Q&A sites
where one can witness how a solution has been arrived at and why other approaches
didn’t succeed. Furthermore, I find Google search to be effective in terms of locating
a certain method or a function in an API documentation that may be better than the
website itself.
At the end of this book, you will be equipped with skills, knowledge, and practices
that can be derived from a decade of experience in data engineering. Many vendors and
cloud computing service providers provide you with free credits if you register for the
first time. I encourage you to make use of these free credits, explore various tools and
services, and gain knowledge. And when you are done, please remember to clean up
your workspace, shut down any virtual machines and databases, take a backup of your
work, and delete your workspace in the cloud to prevent additional charges.
CHAPTER 1

Core Technologies in Data Engineering
Engineering data pipelines is not easy. There is a lot of complexity, and it demands
analytical thinking. The complexity includes, but is not limited to, varying data models,
intricate business logic, tapping into various data sources, and the many other tasks
and projects that make up the profession. Often, data engineers also build machine
learning models to identify anomalies and perform advanced imputations on the data.
In this chapter, I have selected topics from my experience that will serve the data
engineer or software developer who is engineering pipelines.
Python Programming
There are two major versions of the Python programming language, Python 2 and
Python 3. Python 2 has reached end of life for support. Python 3 is the de facto language
version for Python programming. The Python programming language (version 3.*) is still
the most preferred language for development, comes with extensive community-backed
data processing libraries, and is widely adopted. In this section, let us take some time
to look at some of the Python programming concepts that may help, in addition to your
basic programming knowledge. These are discussed at a high level; I encourage you to
investigate these classes and methods further, along with the arguments that can be
passed to them.
F Strings
Python f strings allow you to easily insert variables and expressions within a string.
When compared with traditional string formatting, f strings are a lot more readable and
easy to understand. You can create an f string by placing the character F in front of the
string, before the quotes. The character is case insensitive, meaning you can insert
either an F or an f. Here is an example:
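A minimal sketch of such an f string without any inserted values (the exact string used in the book's example is not reproduced here):

print(f"Let us buy some fruit today")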
Now you can insert variables and expressions within the f string. We have shown
the preceding f string without inserts for demonstration purposes. Here is how you can
perform that:
Fruit = "berries"
Cost = 1.5
Number = 3
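# The exact f string used in the book is not shown; this message is illustrative
print(f"{Number} {Fruit} cost {Cost * Number} dollars in total")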
will return
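3 berries cost 4.5 dollars in total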
Python Functions
In programming, functions are a set of instructions that can be executed together to
accomplish a specific task or operation. It may or may not accept inputs and may or may
not return output. Let us look at an example of a function:
def add(a,b):
    c = a+b
    return c

print(add(2,3))

it will return

5
Let us look at a bit more substantial function. Here, we have a function that takes
Celsius as input and calculates Fahrenheit as output:
def f_calc(celsius):
    return ((9/5 * celsius) + 32)
This is the same as the following function; both behave identically. We will
discuss this way of defining functions in a later section:
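The function referred to here is not reproduced in the extracted text; based on the description that follows (an integer input and a floating point return value), it is presumably the type-hinted form:

def f_calc(celsius: int) -> float:
    return ((9/5 * celsius) + 32)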
In the preceding code, we have passed a variable named celsius of type integer.
Inside the function we are calculating a value and returning that value in floating
point type. This is how functions can accept input parameters, process them, and
provide output.
def add(a,b):
    c = a+b
    return c

add(2,3)

it will return

5
What if we do not know how many input parameters we need? We can use
*args here:
def add_args(*args):
    total_sum = 0
    for i in args:
        total_sum += i
    return total_sum
add_args(3,4,5,6,7)
it will return
25
Note  You can also pass a list to the function and achieve the same functionality. But
with *args, you can also call the function without passing any arguments at all.
**kwargs
This is quite similar to *args; the difference is that a key is associated with each item
in the argument:
def add_kwargs(**kwargs):
    total_sum = 0
    for i in kwargs.values():
        total_sum += i
    return total_sum
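A quick usage sketch (the call and the keyword names are illustrative, not from the original text):

add_kwargs(apples=3, oranges=4, melons=5)

it will return

12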
Use of *args and **kwargs can be highly beneficial in enhancing the flexibility of a
function, complementing decorator implementations, and passing a varying number
of parameters as inputs.
Lambda Functions
In Python, lambdas are short for lambda expressions. They are inline, anonymous
functions built around a single expression, which is evaluated when the function is
called. Let us look at this with an illustration:
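The lambda definition itself is not shown in the extracted text; a sketch that matches the call and the output below (the same Celsius-to-Fahrenheit conversion used earlier):

calculate_fahrenheit = lambda celsius: (9/5 * celsius) + 32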
calculate_fahrenheit(27)
it will return
80.6
Decorators in Python
Decorators in Python are a way to add additional functionality to an existing function.
The decorator itself is a function that takes another function as its input parameter.
In calculus terms, it is akin to a function of a function. If math is not your cup of tea,
then let's look at a simple example.
Here is a test decorator:
def testdec(func):
    def wrapper():
        print("+++Beginning of Decorator Function+++")
        func()
        print("+++End of Decorator Function+++")
    return wrapper

@testdec
def hello():
    print("Hello!")

hello()
it will provide us

+++Beginning of Decorator Function+++
Hello!
+++End of Decorator Function+++
Type Hinting
Type hinting is a Python feature that lets you specify the data types of the variable
declarations, arguments in functions, and the variables that these functions return.
Type hints make it easier to understand what data types to expect and which function
returns what type of data. The idea behind type hinting is better readability, code
maintainability, and being able to debug and troubleshoot errors faster. Type hints are
not mandatory; you can still pass an argument without specifying the data type, and
Python itself does not enforce the hints at runtime (external tools such as mypy check them).
Here is a simple example demonstrating the type hinting functionality in Python:
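The example itself is not reproduced in the extracted text; a minimal, illustrative sketch:

def greet(name: str) -> str:
    message: str = "Hello, " + name
    return message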
Typing Module
The typing module supports type hints for advanced data structures with various
features. Here is an example for a dictionary:
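A sketch using typing.Dict (illustrative, not the book's original listing):

from typing import Dict

inventory: Dict[str, int] = {"apples": 3, "oranges": 5}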
Notice the usage of “Dict.” It serves the same purpose as “dict,” the dictionary type
offered by core Python. “Dict” does not provide additional runtime functionality; it
simply lets you specify the data types for both the keys and the values of a dictionary.
If you are using Python 3.9 or later, you can use either “Dict” or “dict” for this kind of
annotation.
Let us look at another example, Union, where a variable could be one of a few types:
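A sketch (illustrative):

from typing import Union

product_id: Union[int, str] = 1029
product_id = "H1029"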
In Python 3.10+, you can also use the | operator. Here are some examples:
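For instance (illustrative):

price: int | float = 4
price = 4.5

def lookup(key: int | str) -> str | None:
    ...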
Generators in Python
A generator looks like a regular function in Python but uses the yield keyword instead
of return to produce its results. The yield keyword returns an iterator that provides a
sequence of values when iterated over. The return keyword, on the other hand, returns
a one-time result from the function.
Let us look at a simple example:
def generator_square(n):
    x = 0
    while x <= n:
        yield x*x
        x += 1

for i in generator_square(10):
    print(i)
Enumerate Functions
The Oxford Dictionary of English defines enumerate as "to list or mention a number
of things, one by one." This is all you need to remember. The enumerate function, when
applied to a data type, returns an enumerate object. These enumerate objects return the
data type with an index that starts from 0 and the value corresponding to the index.
Here is a basic illustration:
a = ['apples','oranges','melons','bananas']
print(a)
would yield

['apples', 'oranges', 'melons', 'bananas']
However, if you enumerate the same list, you would obtain something like this:
for i in enumerate(a):
    print(i)
would yield
(0, 'apples')
(1, 'oranges')
(2, 'melons')
(3, 'bananas')
Here is an illustration for using a tuple with type hints for better readability:
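The definition of var10 is not included in the extracted text; one plausible construction that matches the printed value is:

from typing import List, Tuple

var10: Tuple[List, ...] = ([1, 'apple'], [2, 'orange'], [3, 'melon'], [4, 'banana'])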
print(var10)

would yield

([1, 'apple'], [2, 'orange'], [3, 'melon'], [4, 'banana'])
When it comes to a dictionary, the dictionary already has keys, so it is not necessary to
add an index. However, that is not a limitation. We can still iterate over a dictionary
using enumerate, but the iteration needs to be over each (key, value) tuple; we use the
"items()" method to obtain each tuple from the dictionary.
Here is an illustration iterating over a dictionary:
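The dictionary and the loop are not reproduced in the extracted text; a sketch consistent with the enumerate output shown further below:

sample_dict = {1: 'apple', 2: 'orange', 3: 'melon', 4: 'banana'}

for item in sample_dict.items():
    print(item)

would yield

(1, 'apple')
(2, 'orange')
(3, 'melon')
(4, 'banana')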
Let us use enumerate, where the variable “i” stands for index and variable “j” stands
for each key–value pair in a dictionary:
for i, j in enumerate(sample_dict.items()):
    print(i,j)
would yield
0 (1, 'apple')
1 (2, 'orange')
2 (3, 'melon')
3 (4, 'banana')
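The snippet that produces the next output is not shown in the extracted text; it presumably unpacks each key–value tuple as well, for example:

for i, (key, value) in enumerate(sample_dict.items()):
    print(i, key, value)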
would yield
0 1 apple
1 2 orange
2 3 melon
3 4 banana
name = "JohnDoe"
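# The loop is not shown in the extracted text; a sketch that matches the output below:
for i, ch in enumerate(name):
    print(i, "-->", ch)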
would yield
0 --> J
1 --> o
2 --> h
3 --> n
4 --> D
5 --> o
6 --> e
List Comprehension
List comprehension is a method of defining and creating a list through a single, concise
pythonic expression. A list comprehension returns a list, and it can often replace loops
or map/filter calls written with lambda expressions.
Let us look at a simple example:
readings_in_km = [10,12,13,9,15,21,24,27]
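# The comprehension itself is not shown in the extracted text; the standard
# kilometers-to-miles factor of 0.621371 is assumed here
readings_in_miles = [round(km * 0.621371, 2) for km in readings_in_km]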
print(readings_in_miles)
We were able to dynamically (and elegantly) create a list of readings that were in
miles from readings that were measured in kilometers.
Let us look at another illustration, where we square or cube a number based on
whether the number is odd or even. This requires an iteration loop and a conditional
check for each item in the list. Here is how the code may look without list
comprehensions:
a: list[int] = [1,2,3,4,5,6,7,8,9,10]
b: list[int] = []

for x in a:
    if x % 2 == 0:
        b.append(x**2)
    else:
        b.append(x**3)

print(b)
a: list[int] = [1,2,3,4,5,6,7,8,9,10]
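# The equivalent list comprehension is not shown in the original text; a sketch:
b: list[int] = [x**2 if x % 2 == 0 else x**3 for x in a]
print(b)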
Here is another example, creating tuples within a list. Here is the regular way of
performing this:

a_list: list[int] = [1, 2, 3]
b_list: list[int] = [1, 2, 3]
var13 = []

for x in a_list:
    for y in b_list:
        if x+y <= 4:
            var13.append((x,y))

print(var13)
would yield
[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)]
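The same result can be produced with a nested list comprehension (not shown in the original text; included here as a sketch):

var13 = [(x, y) for x in a_list for y in b_list if x + y <= 4]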
Random Module
The random module is Python's built-in module for generating random (strictly
speaking, pseudo-random) numbers, including integers, floats, and sequences. Let's
look at some methods that you may use often as part of building your data
pipelines.
random( )
This method returns a random float in the range between 0.0 and 1.0:
import random
var14 = random.random()
print(var14)
randint( )
This method returns an integer in the range between a minimum and a maximum
threshold supplied by the user:
import random
print(random.randint(1,33))
getrandbits( )
The getrandbits() method from the random library generates a random integer
in the range between 0 and 2^k – 1. This is beneficial when you need random integers
bigger than randint() conveniently allows, or in cases where you need to generate
random Boolean values.
Let us look at both:
import random
print(random.getrandbits(20))
print(random.getrandbits(20))
print(random.getrandbits(20))
would yield
619869
239874
945215
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
would yield
True
False
True
True
As you can see, you can either generate the value in k random bits, or you can cast
it to a Boolean with a parameter of 1 and you would get a random Boolean generator
as well.
choice( )
The choice function selects a random element from a user-supplied sequence (such as a
list, a tuple, etc.):
import random
fruit = ['apple','banana','melon','orange']
print(random.choice(fruit))
would return
'apple'
print(random.choice(fruit))
'orange'
shuffle( )
The shuffle() method shuffles a sequence in place, changing the order in which the
elements are currently placed. For instance, we already have

fruit = ['apple','banana','melon','orange']
print(fruit)

Now

random.shuffle(fruit)

And so

print(fruit)

would print the same elements in a different, random order.
sample( )
The sample method from the random library returns k unique elements chosen from a
population list or a set. This method generates samples without replacement, meaning
once an item is chosen, it cannot be chosen again:
import random
alpha = list("abcdefghijklmnopqrstuvwxyz")
print(alpha)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
print(random.sample(alpha, 7))
['n', 'v', 'x', 'u', 'p', 'i', 'k']
print(random.sample(alpha, 7))
['s', 'y', 'i', 't', 'k', 'w', 'u']
print(random.sample(alpha, 7))
['n', 'u', 'm', 'z', 'o', 'r', 'd']
print(random.sample(alpha, 7))
['e', 'y', 'a', 'i', 'o', 'c', 'z']
seed( )
The seed() method from the random library sets the seed for generating random
numbers. If you specify a certain integer as the seed, you will obtain the same sequence
of random numbers each time, so you can reproduce the same output on every run. This
helps when generating random samples for machine learning, simulation models, etc.
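A quick sketch (the seed value and the calls are illustrative):

import random

random.seed(42)
print(random.randint(1, 100))
print(random.randint(1, 100))

# Re-seeding with the same value reproduces the same two numbers
random.seed(42)
print(random.randint(1, 100))
print(random.randint(1, 100))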
Foundations of Git
Git is a distributed version control system, wherein there exists a local copy and a
server (remote) copy of the same code. The code database is called a repository, or
simply repo. You obtain a copy of a repo from the server and create your own line of
development to work on; that process is called branching. Then, you can submit a
proposal to merge your branch into the centralized copy (the master branch), which is
commonly referred to as a pull request. Upon submission it goes to a person who
administers the repo, and they review and approve the pull request.
GitHub
Let us look at GitHub, one of the leading git platforms. GitHub is considered the world's
largest host of git repositories, storing and tracking changes to software code while
offering a plethora of features for development.
Alternatively you may wish to visit the website, git-scm, and obtain the git package
under the Downloads section. Here is the website:
https://git-scm.com/downloads
Once the download and installation are complete, let us proceed to set up git on
our workstations. Here is how to get started.
Let us set up our username and our email, as this information will be used by git
every time you commit changes to files. Here are the commands:
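These are the standard git configuration commands; substitute your own GitHub username and email (the values below are placeholders):

git config --global user.name "your-github-username"
git config --global user.email "[email protected]"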
Remember to supply the unique username you had created during the registration
process and please use quotes for registering the username. For the email, you need
to provide the email you had utilized for the GitHub registration process. And make
sure to use the “--global” keyword option as git will always use these credentials for
any of your interactions with the git system. If you wish to override this information
with a different set of credentials, you may run these same commands but without the
“--global” option.
You can access and write files to your GitHub repository right from your own
terminal using Secure Shell protocol. You generate an SSH key and passphrase, sign in
to your GitHub account, and add the SSH key and passphrase you generated from your
terminal to the GitHub account. This way you have established a secured handshake
between your workstation and your GitHub account.
Let us generate a new SSH key:
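The command is not reproduced in the extracted text; the standard way to generate an ed25519 key (the key file id_ed25519 referenced later implies this type) is to pass the email tied to your GitHub account:

ssh-keygen -t ed25519 -C "[email protected]"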
For now, please skip (by pressing Enter) when asked for a passphrase among other
options. Let the default location be as is. Here is how that may look for you.
Let us start the SSH agent in your terminal:
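A common way to do this (an assumption; the original command is not shown):

eval "$(ssh-agent -s)"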
This command will start the ssh-agent process in the background. Now, let us
add the SSH private key to the SSH agent using the following command:

ssh-add ~/.ssh/id_ed25519
Now open a new tab on your web browser and log in to your GitHub (in case you
were signed off ); click the icon on the top-right corner of your screen and choose the
option called “Settings” from the list of options. Here is how that may look:
Once you are inside the GitHub settings, navigate and click the option called “SSH
and GPG Keys” and choose the option “New SSH Key” on the top-right side of the
browser.
You will get an option to add the SSH public key that was created in the previous
steps. To perform that you will have to copy the SSH public key first.
Open your terminal and obtain the SSH public key using the following command:
cat ~/.ssh/id_ed25519.pub
ssh-ed25519 DH37589D364RIUHFIUSDHsr4SFFsdfx4WFRoc9Ks/7AuOsNpYy4xkjH6X
[email protected]
Copy this public key and navigate to the web browser where you have GitHub
open—specifically, Settings ➤ SSH and GPG Keys—to add a new SSH key. Paste the key
that you just copied. Here is how that looks:
You can provide a title as you wish, something of your convenience, and leave the
key type as “Authentication Key.” Click Add SSH key.
Now that we have added the SSH public key, let us test our connection from our shell
to a remote GitHub server by using the following command:
ssh -T [email protected]
You may receive a warning. Choose yes. Here is how that may look for you:
You may ignore the last line “GitHub does not provide shell access.” Overall, if you
see the preceding message, it is certain that you have set up GitHub correctly.
Let us create a new repository and use GitHub to version control the same. To create
a new repository, open a terminal and change directory into a folder where you can test
your git functionalities. Or create a new folder and perform the git initialization:
mkdir firstrepo
cd firstrepo
Now, within the folder, initialize a git repository by using the following command:
git init
The preceding command will create a hidden folder and initialize a database. You
can change the directory into the hidden folder and notice the files. Here is how that may
look for you:
It may be useful to note that HEAD is a text file that points to the current
branch, whereas the config file, another text file, stores local configuration about your
repository. Without changing anything, exit the hidden folder.
If you want to know the history of the repository, then you may use this command:
git log
You may not obtain any logs just yet as you recently created this repo. Another useful
command is the git status that helps obtain the status of your repo:
git status
This will provide the status of the current git repo. At the moment, we do not have
any files added or changed, so let us try to add a new file.
Here is a Python script that you can use:
import random
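# The function body is not included in the extracted text; a minimal sketch that
# matches the description below (print five random integers between 0 and 100):
def somefunct(n):
    for _ in range(n):
        print(random.randint(0, 100))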
somefunct(5)
When you run the code, it will print five random integers between 0 and 100. Now let
us check to see if the status of our git repo has changed:
Now let us add the Python code to git so the git system can track the code’s progress:
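The file is referred to as mycode.py in the text that follows, so the command would be:

git add mycode.py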
The git add begins to enable git tracking features on the file “mycode.py”; from the
workflow perspective, the git system has picked this file from your local workstation and
moved it to a staging area, where the git tracking features are enabled.
Up until now, we have just moved the file from the local file system to a staging area
where git can identify this file and start tracking. This file is still not in the git repository
yet. If you type in “git log,” you would get a response something like this:
fatal: your current branch 'master' does not have any commits yet
It is true. The file has not been committed yet. So let us try to commit this file using
the following command:
This commit has moved the code into the actual git repository situated in the local
directory. The message in quotes is a commit message describing what changes have
been made in this commit.
If you would like to push your local git repository to a remote git repository, then you
would want the git push command. It is also called syncing, where you would upload the
local git repo with files to the remote git repo.
Note It is important to know that so far we have been working only on the local
workstation. This repository does not exist in the remote GitHub. Hence, we have to
push the current branch and set it as master. We will look at other examples where
we begin working with a remote branch.
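A sketch of the push commands, assuming a repository named firstrepo has already been created under your GitHub account (the remote URL here is a placeholder):

git remote add origin [email protected]:your-github-username/firstrepo.git
git push -u origin master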
Suppose a repository that already exists on the GitHub server contains the following code:

def hello():
    print("hello world")

You want to work on this project to add another function. Hence, you need to get a
copy of this existing code from the GitHub server to your local workstation. There
are several ways to accomplish this. Let us look at them in detail.
Cloning
Cloning in git refers to obtaining a copy of a repository from the remote GitHub server
to your local workstation. The idea is to obtain a local copy and work locally. This is
usually done at the beginning of a new project or when joining an existing project as a
new team member. Cloning is the mandatory first step. Here are some examples:
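The examples are not reproduced in the extracted text; the usual SSH and HTTPS forms, using a placeholder repository, are:

git clone [email protected]:your-github-username/firstrepo.git
git clone https://github.com/your-github-username/firstrepo.git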
Branching
A branch is a parallel version of the codebase, created from the main codebase, which
is called the main branch (or master branch). Branching, as the name indicates, enables
you to work on new features and troubleshoot code, and then propose the changes back
to the main branch through a pull request. Branching is for temporary changes and
collaboration.
Here’s how to create a new branch:
By default, the pointer will remain on the main branch. If you want to work on the
development branch, you need to switch to it manually, using either of the commands
shown below.
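Either of these standard commands switches to the branch assumed above:

git checkout development

or

git switch development

(git switch is available in git 2.23 and later.)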
Forking
A fork is also a copy of the repository except that you create your own main or master
branch within your git environment. Imagine the team has a master branch. When you
branch, you get a development branch (or whatever you want to name it). When you
fork, you get a master or main branch in your git account. Forking is done in special
cases, external projects, and independent work done in parallel.
In the GitHub web graphical user interface (GUI), you may open the repository that
needs to be forked and click the “Fork” button that is on the top-right corner. Here is how
that may look for you:
Note The Fork button is grayed, because I cannot “fork” my own repository into
my own account.
Pull request
Pull requests are a way of proposing changes to the current codebase. By opening a
pull request, you indicate that you have developed a feature or completed some work in
the code and are now proposing that your code be merged into the main repo.
To create a pull request
1. You already have a branch you are working with or create your
own branch.
2. You make your changes to the code in that branch.

3. You stage and commit those changes locally.
4. You would push these changes into the origin repository (your
branch or your fork):
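The push command itself is not shown in the extracted text; for the branch assumed earlier it would be:

git push origin development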
Now you are ready to perform a pull request. In the GitHub GUI, you would see
a banner pop up noting that the repository recently received a push. Simply click
"Compare & pull request."
Now you will have to enter the title and message for this pull request. GitHub will
show you the changes made to the file by highlighting the new code in green. You can
click Create pull request.
Here is how that looks:
Once you create a pull request, you now have to merge the pull request to
remote main.
Here is how that may look:
Once done, you may delete the branch. Here is how that may look:
Gitignore
Data pipelines sometimes involve sensitive information like passwords, as well as
temporary files and other artifacts that should not be committed to the repository. We
use the ".gitignore" feature to intentionally ignore certain files.
Create a ".gitignore" file in the root directory of your project, and specify the files
and patterns that git should ignore.
Here is an illustration of a simple .gitignore that you can incorporate in your data
pipelines as well:
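The listing itself is not reproduced in the extracted text; a typical sketch for a Python data pipeline project might look like this:

# secrets and environment configuration
.env
credentials.json

# Python artifacts
__pycache__/
*.pyc
.ipynb_checkpoints/

# temporary or local data files
*.tmp
data/tmp/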
Once created, let us stage this file and add to the local repository:
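The commands are not shown in the extracted text; they would be:

git add .gitignore
git commit -m "Add .gitignore"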
We have only scratched the surface on git here. I encourage you to visit
http://git-scm.org to learn more about git features.
SQL Programming
SQL is at the heart of the data engineering profession. Much of our historical data and
many enterprise systems are built on relational database systems, and SQL (Structured
Query Language) is the way to access, query, and work with data from these relational
database systems. Let us look at some basic syntax. We will be using Postgres database
syntax to demonstrate some concepts and functionalities that I feel are relevant in
everyday tasks.
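The first query is not reproduced in the extracted text; the simplest form selects everything from a table:

SELECT
    *
FROM
    TableName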
which would return the table with all the rows and columns.
To select specific columns, list them explicitly:

SELECT
    fieldname1,
    fieldname2
FROM
    TableName

To restrict the number of rows returned, add a LIMIT clause:

SELECT
    fieldname1,
    fieldname2
FROM
    TableName
LIMIT
    10
Here is a query that combines a pattern filter, a string function, and sorting:

SELECT
    product_id,
    LENGTH(product_id) AS charlength
FROM
    productTable
WHERE
    product_id LIKE 'H%' AND
    LENGTH(product_id) BETWEEN 2 AND 4
ORDER BY
    charlength
Joins combine rows from related tables. Here is a LEFT JOIN between a product table
and an inventory table:

SELECT
    productTable.fieldname1,
    productTable.fieldname2,
    inventory.fieldname1,
    inventory.fieldname4
FROM
    productTable
LEFT JOIN
    inventory
ON
    productTable.id = inventory.product_id
Self-Join in SQL
Another important concept that helps you in your development work is the self-join:
joining a given table with itself. Let's say you have an employee table with employee id,
name, and manager id, where every employee is assigned an id, their name, and the id
of their supervisor. We want to produce a table containing each employee's name
alongside their manager's name.
Here is how that may look:
SELECT
    Emp.id,
    Emp.name AS employee_name,
    Mang.name AS manager_name
FROM
    Employee Emp
INNER JOIN
    Employee Mang
ON
    Emp.manager_id = Mang.id
In the preceding code, we reference the same table under two different aliases and
join the rows on the manager relationship. There is also the concept of the common
table expression (CTE), where the result of a query is given a temporary name so that
another query can build on it.
WITH productRaw AS (
    SELECT
        p_name AS name,
        p_qty AS qty,
        p_price AS price
    FROM
        productsTable
),
calculateGT AS (
    SELECT
        name,
        qty,
        price,
        (qty*price) AS grand_total
    FROM
        productRaw
)
SELECT
    name,
    qty,
    price,
    grand_total
FROM
    calculateGT
Of course, there may be optimized ways of calculating the grand total given price and
quantity of each product. The idea is to show how you can leverage persisting results of
multiple queries temporarily and utilize the result in generating the final result table.
Views in SQL
In SQL, we can persist queries as named objects within the database, in the form of
views. There are two types of views: standard views and materialized views.
Standard View
Standard views are views that you can set up within the database to expose the results
of a specific query with multiple joins, filters, functions, etc. When you query a standard
view (instead of running a large block of complex SQL code), it will always return the
most recent results.
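The CREATE VIEW statement itself is not reproduced in the extracted text; a sketch based on the product/inventory join used earlier:

CREATE VIEW prod_inventory_report AS
SELECT
    productTable.fieldname1 AS product_field1,
    productTable.fieldname2 AS product_field2,
    inventory.fieldname1 AS inventory_field1,
    inventory.fieldname4 AS inventory_field4
FROM
    productTable
LEFT JOIN
    inventory
ON
    productTable.id = inventory.product_id;

Querying the view is then as simple as the statement that follows.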
SELECT
*
FROM
prod_inventory_report
Materialized View
Materialized views store the results of the view in a physical table (standard views do
not physically store the results, but rather run the underlying query every time). Because
a materialized view is backed by a physical table, read performance is significantly
better. However, you have to refresh the materialized view every so often, for example
with REFRESH MATERIALIZED VIEW driven by a scheduled job or trigger.
Here is how a materialized view may look (similar syntax to a standard view):

CREATE MATERIALIZED VIEW prod_inventory_report_mv
TABLESPACE pg_default
AS
SELECT
    productTable.fieldname1 AS product_field1,
    productTable.fieldname2 AS product_field2,
    inventory.fieldname1 AS inventory_field1,
    inventory.fieldname4 AS inventory_field4
FROM
    productTable
LEFT JOIN
    inventory
ON
    productTable.id = inventory.product_id
WITH DATA;
Temporary Tables

In the case of temporary tables, both the schema and the data (if you insert data after
creating the table) persist only for the duration of that database session. If you specify
ON COMMIT PRESERVE ROWS, the rows are kept across transactions and removed at the end
of the session; if you specify ON COMMIT DELETE ROWS, the rows are deleted as soon as each
transaction commits. A sketch of creating such a table follows.
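A minimal sketch (the table definition is illustrative; the book's own example is not reproduced here):

CREATE TEMPORARY TABLE temp_products (
    name  TEXT,
    qty   INTEGER,
    price NUMERIC
) ON COMMIT PRESERVE ROWS;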
Window Functions in SQL

Window functions perform a calculation across a set of rows related to the current row
while keeping every row in the result:
SELECT
    Department,
    EmpID,
    Salary,
    avg(Salary) OVER (PARTITION BY Department)
FROM
    Employees;
This example would obtain the fields and calculate the mean salary for each
department. The results may look like
As you can see, the window function computed the statistical average salary and
grouped by each department while retaining the table structure and adding a new
column for the computed average.
Window functions can also rank rows. Here, RANK() orders employees by salary within
each department:
SELECT
    Department,
    EmpID,
    Salary,
    RANK() OVER (PARTITION BY Department ORDER BY Salary DESC)
FROM
    Employees;
Query Optimization Tips

Indexes help the database locate rows without scanning the entire table, so create
indexes on the fields you filter and join on most often.
You can also create an index that involves more than one field, typically two or three
fields. It is called composite index and helps improve the query performance.
When you have a bigger query that involves joins and other operations, use the idea
of common table expressions to simplify the complex logic.
Minimize queries that trigger a full table scan by avoiding "SELECT *"; list the fields
you need explicitly whenever possible.
If you do have a complex query, consider using materialized views to precompute
and store its results. When the result of a complex query is read far more often than it
changes, a materialized view is a good fit.
Consider using temp tables, in cases where you have to write results to a table
temporarily but you do not want that table to persist in the database.
It may be good to minimize using a subquery in your SQL query; if you do end up
writing a subquery, always keep your sort operations within the master query, and try to
avoid specifying ORDER BY in a subquery.
When specifying a condition that includes an OR statement, here is an approach for
you to consider: write two separate queries and union the results of the queries.
Conclusion
So far, we have covered a lot of ground from refreshing Python programming to getting
hands-on with git and GitHub and reviewing the SQL essentials. The concepts and
techniques you have learned here will serve as a foundation to tackle complex data
engineering projects. Python’s data ecosystem will enable you to write effective data
pipelines for data transformation and analysis tasks. Git will allow you to collaborate
with your team, participate in collaborative software development, and maintain a
clean and documented history of your data pipeline as it evolves. SQL skills will prove
indispensable for working with the relational database systems that form the foundation
of almost all data architectures. I encourage you to practice, explore, and apply what you
have learned. Having said that, let us venture into the exciting topic of data wrangling, in
the next few chapters.
CHAPTER 2

Data Wrangling using Pandas

Among other topics, this chapter covers:
• The core data structures of Pandas 2.0, Series and DataFrame, and
how they can be used in data wrangling
41
© Pavan Kumar Narayanan 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines,
https://doi.org/10.1007/979-8-8688-0602-5_2
Chapter 2 Data Wrangling using Pandas
Pandas is a Python library that was built for data manipulation and transformation
purposes. Pandas is built on top of NumPy, another Python library for scientific and
numerical computations.
Pandas is open source, largely driven by its user community, and free to modify, distribute, and use for commercial purposes. First released in 2008, Pandas has evolved significantly over the past decade, with a major version released recently that features a rewritten codebase, additional functionality, and better interoperability with other libraries. With over 100 million downloads per month, the Pandas library is the de facto standard for data manipulation and transformation (source: https://libraries.io).
Even if one is not using Pandas and NumPy in production, knowledge of their data structures and methods will certainly help with getting familiar with other data manipulation libraries. Pandas sets the standard for various other Python libraries.
Data Structures
Pandas essentially provides two data structures: Series and DataFrame. A Series is a one-dimensional labeled array, whereas a DataFrame is a two-dimensional labeled data structure that supports a tabular layout, like a spreadsheet or the result set of a SQL table.
Series
The series data structure enables one to handle both labeled and unlabeled arrays
of data. The data can be integers, floating point numbers, strings of text, or any other
Python object.
Series is a class provided by the Pandas library. To create a Series object, call its constructor and supply the appropriate parameters. An empty initialization would look like
import pandas as pd
import numpy as np

pd.Series()
Here is a Series created with explicit data and an index:
a = [1,2,3,4,5]
b = pd.Series(data=np.random.rand(5), index=a)
One can also generate Series using a Python dictionary. Here is an example:
simple_dict1 = {
"library" : "pandas",
"version" : 2.0,
"year" : 2023
}
simple_series = pd.Series(simple_dict1)
One can also get and set values in a given series. Here’s how to get a value:
simple_series["year"]
2023
simple_series["year"] = 2024
simple_series
library pandas
version 2.0
year 2024
dtype: object
In this output, the dtype is NumPy’s data type object, which contains the type of the
data, in this case, an object.
Data Frame
A data frame is the most commonly used data structure in Pandas. A Pandas data frame
is a two-dimensional structured array of objects that can contain integers, floats, strings
of text, or other Python objects (Series, another data frame, etc.).
While writing production code, one may generate a data frame from an external
dataset stored in a different format. However, it may be essential to understand the key
attributes of a data frame. Some of them are as follows:
Data: The data points or the actual data. It could be one or more
Series or dictionary data types. The data may contain integers,
floats, strings, or any other Python object.
Index: The row labels of the data points and the index of the
DataFrame itself.
data = {
"i": [121, 144, 169],
"j": [11, 12, 13]
}
df_data = pd.DataFrame(data, index=["a", "b", "c"])
print(df_data)
i j
a 121 11
b 144 12
c 169 13
Indexing
Indexing is simply a row reference in the data frame. Indexing helps identify the
Pandas data structure. One can also say that indexing is a way of referring to a pointer
that identifies a specific row or a subset of rows in a Pandas data frame. It is a way of
improving the performance of data retrieval.
df = pd.read_csv(
"/your-path-to-the-file/winequality/winequality-red.csv"
,sep=";"
,usecols=['residual sugar', 'chlorides', 'total sulfur dioxide',
'density', 'pH', 'alcohol', 'quality']
)
df = df.rename({
'residual sugar': 'rSugar',
'total sulfur dioxide': 'total_so2'
},
axis=1
)
print(df.iloc[5])
will yield
rSugar 1.8000
chlorides 0.0750
total_so2 40.0000
density 0.9978
pH 3.5100
alcohol 9.4000
quality 5.0000
Name: 5, dtype: float64
print(df.loc[[5,9,22]])
will yield
rSugar chlorides total_so2 density pH alcohol quality
5 1.8 0.075 40.0 0.9978 3.51 9.4 5
9 6.1 0.071 102.0 0.9978 3.35 10.5 5
22 1.6 0.106 37.0 0.9966 3.17 9.5 5
pinot_noir = [1,4,7,22,39]
df.iloc[pinot_noir]
will yield
rSugar chlorides total_so2 density pH alcohol quality
1 2.6 0.098 67.0 0.9968 3.20 9.8 5
4 1.9 0.076 34.0 0.9978 3.51 9.4 5
7 1.2 0.065 21.0 0.9946 3.39 10.0 7
22 1.6 0.106 37.0 0.9966 3.17 9.5 5
39 5.9 0.074 87.0 0.9978 3.33 10.5 5
Furthermore, one can also pass a range as a parameter, in the following manner:
df.loc[5:25]
will yield
rSugar chlorides total_so2 density pH alcohol quality
5 1.8 0.075 40.0 0.9978 3.51 9.4 5
6 1.6 0.069 59.0 0.9964 3.30 9.4 5
7 1.2 0.065 21.0 0.9946 3.39 10.0 7
8 2.0 0.073 18.0 0.9968 3.36 9.5 7
9 6.1 0.071 102.0 0.9978 3.35 10.5 5
10 1.8 0.097 65.0 0.9959 3.28 9.2 5
11 6.1 0.071 102.0 0.9978 3.35 10.5 5
12 1.6 0.089 59.0 0.9943 3.58 9.9 5
13 1.6 0.114 29.0 0.9974 3.26 9.1 5
14 3.8 0.176 145.0 0.9986 3.16 9.2 5
15 3.9 0.170 148.0 0.9986 3.17 9.2 5
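Boolean indexing is another way to select rows, by supplying a condition on a column. The exact statement used here is not shown; a selection consistent with the output that follows (an assumption) would be:
print(df[df['quality'] == 8])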
would yield
rSugar chlorides total_so2 density pH alcohol quality
267 3.6 0.078 37.0 0.99730 3.35 12.8 8
278 6.4 0.073 13.0 0.99760 3.23 12.6 8
390 1.4 0.045 88.0 0.99240 3.56 12.9 8
440 2.2 0.072 29.0 0.99870 2.88 9.8 8
455 5.2 0.086 19.0 0.99880 3.22 13.4 8
481 2.8 0.080 17.0 0.99640 3.15 11.7 8
495 2.6 0.070 16.0 0.99720 3.15 11.0 8
498 2.6 0.070 16.0 0.99720 3.15 11.0 8
588 2.0 0.060 50.0 0.99170 3.72 14.0 8
828 2.3 0.065 45.0 0.99417 3.46 12.7 8
1061 1.8 0.071 16.0 0.99462 3.21 12.5 8
1090 1.9 0.083 74.0 0.99451 2.98 11.8 8
1120 2.5 0.076 17.0 0.99235 3.20 13.1 8
1202 1.8 0.068 12.0 0.99516 3.35 11.7 8
1269 1.8 0.044 87.0 0.99080 3.50 14.0 8
1403 1.7 0.061 13.0 0.99600 3.23 10.0 8
1449 2.0 0.056 29.0 0.99472 3.23 11.3 8
1549 1.8 0.074 24.0 0.99419 3.24 11.4 8
Conditional indexing is also possible within data frames using the “query” method.
Here is an example:
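The exact query string is not reproduced in the text; one condition consistent with the output below (an assumption) is:
print(df.query('quality == 8 and rSugar < 3'))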
would yield
rSugar chlorides total_so2 density pH alcohol quality
390 1.4 0.045 88.0 0.99240 3.56 12.9 8
440 2.2 0.072 29.0 0.99870 2.88 9.8 8
481 2.8 0.080 17.0 0.99640 3.15 11.7 8
495 2.6 0.070 16.0 0.99720 3.15 11.0 8
498 2.6 0.070 16.0 0.99720 3.15 11.0 8
588 2.0 0.060 50.0 0.99170 3.72 14.0 8
828 2.3 0.065 45.0 0.99417 3.46 12.7 8
1061 1.8 0.071 16.0 0.99462 3.21 12.5 8
1090 1.9 0.083 74.0 0.99451 2.98 11.8 8
1120 2.5 0.076 17.0 0.99235 3.20 13.1 8
1202 1.8 0.068 12.0 0.99516 3.35 11.7 8
1269 1.8 0.044 87.0 0.99080 3.50 14.0 8
1403 1.7 0.061 13.0 0.99600 3.23 10.0 8
1449 2.0 0.056 29.0 0.99472 3.23 11.3 8
1549 1.8 0.074 24.0 0.99419 3.24 11.4 8
One can use both .loc and .iloc methods to retrieve specific rows of the data frame.
The .iloc method is a positional index and attempts to access rows and columns by
referencing integer-based location to select a position, whereas the .loc is a label-based
index and accesses the rows and columns by labels.
Let's say we have a series whose index labels are not in sequential order, for instance the one constructed below:
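A minimal sketch of such a series; the values and labels are assumptions chosen to match the output shown next:
import pandas as pd

idx_series = pd.Series(
    ["peach", "caramel", "apples", "melon", "orange", "grapes"],
    index=[74, 75, 76, 1, 2, 3]
)
print(idx_series)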
74 peach
75 caramel
76 apples
1 melon
2 orange
3 grapes
dtype: object
When using the loc method, Pandas would attempt to locate the label that is passed
as a parameter. Here, the label “1” has the value melon:
idx_series.loc[1]
Out[107]: 'melon'
In contrast, the iloc method attempts to locate the integer position that is passed as the parameter. The integer positions start at 0 and end at the length of the series minus 1. And so, position 1 (the second element) would yield caramel:
idx_series.iloc[1]
Out[108]: 'caramel'
Multi-indexing
Multi-indexing enables you to select values from both rows and columns by referencing more than one index level. One can create a multi-index from arrays, tuples, and data frames using the appropriate methods. The simplest approach is to define a multi-index using arrays and pass it directly into a data frame or a series; Pandas will automatically integrate it into its data structure. Here's an example:
First, let’s create the multi-index:
import numpy as np
multi_index_arr = [
np.array(
['chunk1','chunk1','chunk1','chunk1','chunk1',
'chunk2','chunk2','chunk2','chunk2','chunk2']
),
np.array(
['a','b','c','d','e',
'f','g','h','i','j']
)
]
Now pass this array into the data frame. For convenience, a data frame with random
numbers has been initialized:
df = pd.DataFrame(
np.random.randn(10,5),
index=multi_index_arr
)
print(df)
0 1 2 3 4
chunk1 a 0.681923 -0.298217 1.269076 0.202422 0.419290
b -1.397923 -1.308753 1.624954 -0.968342 -2.025347
c 0.003117 -0.677547 -0.262260 0.684654 1.064416
d -0.811899 1.571434 0.169619 0.424022 0.108066
e -0.366923 0.365224 -0.173412 -1.852635 0.019401
chunk2 f 1.102141 1.069164 -0.446039 -1.744530 1.069642
g 1.335156 1.166403 -0.880927 0.031362 0.268140
h 0.381478 -2.620881 -0.669324 -0.892250 2.069332
i -0.063832 -1.076992 -0.944367 0.616954 1.183344
j -0.110984 0.858179 -0.899983 -0.829470 -1.078892
Similarly, one can define a multi-index for the columns:
multi_index_col = [
np.array(
['chunk1a','chunk1a','chunk1a',
'chunk2a','chunk2a','chunk2a']
),
np.array(
['k','l','m',
'n','o','p']
)
]
Now reference both the row and column multi-indexes to the data frame. For
simplicity, we are initializing a data frame with random numbers:
df = pd.DataFrame(
np.random.randn(10,6),
index=multi_index_arr,
columns=multi_index_col
)
print(df)
chunk1a chunk2a
k l m n o p
chunk1 a 0.933805 1.117525 -1.159911 -1.503417 1.863563 0.179210
b -1.690026 0.995953 -0.763078 1.580060 1.132970 -0.930042
c -0.325951 -0.148638 -1.481013 1.211841 -1.341596 0.582081
d -0.661698 0.838978 -0.734689 0.341315 -1.776477 -1.236906
e 0.460951 -2.022154 -0.236631 1.863633 0.466077 -1.603421
chunk2 f -0.122058 0.141682 1.123913 0.300724 -2.104929 -1.036110
g -0.332314 0.000606 0.239429 0.869499 -0.482440 -1.219407
h -1.035517 -0.698888 2.486068 -1.379440 -0.503729 1.170276
i -0.213047 2.714762 0.463097 -1.864837 -0.603789 0.351523
j -0.346246 0.720362 1.328618 -0.271097 1.034155 0.665097
print(df.loc['chunk1']['chunk1a'])
would yield
k l m
a 0.933805 1.117525 -1.159911
b -1.690026 0.995953 -0.763078
c -0.325951 -0.148638 -1.481013
d -0.661698 0.838978 -0.734689
e 0.460951 -2.022154 -0.236631
print(df.loc['chunk1','a']['chunk2a','p'])
would yield
0.1792095
Pandas also supports a time delta index, where the index labels are durations rather than integer or string labels. Let us initialize a series with a time delta index. Here is how the code may look:
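A minimal sketch, assuming an index of 50 steps of 36 seconds each (the names td_index and td_series are assumptions):
import pandas as pd
import numpy as np

# 50 timedelta labels, 36 seconds apart
td_index = pd.timedelta_range(start="0 seconds", periods=50, freq="36S")
td_series = pd.Series(np.arange(50), index=td_index)
print(td_series)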
would yield
0 days 00:00:00 0
0 days 00:00:36 1
0 days 00:01:12 2
0 days 00:01:48 3
0 days 00:02:24 4
0 days 00:03:00 5
0 days 00:03:36 6
0 days 00:04:12 7
0 days 00:04:48 8
0 days 00:05:24 9
0 days 00:06:00 10
0 days 00:06:36 11
0 days 00:07:12 12
0 days 00:07:48 13
0 days 00:08:24 14
0 days 00:09:00 15
0 days 00:09:36 16
0 days 00:10:12 17
0 days 00:10:48 18
0 days 00:11:24 19
0 days 00:12:00 20
0 days 00:12:36 21
0 days 00:13:12 22
0 days 00:13:48 23
0 days 00:14:24 24
0 days 00:15:00 25
0 days 00:15:36 26
0 days 00:16:12 27
0 days 00:16:48 28
0 days 00:17:24 29
0 days 00:18:00 30
0 days 00:18:36 31
0 days 00:19:12 32
0 days 00:19:48 33
0 days 00:20:24 34
0 days 00:21:00 35
0 days 00:21:36 36
0 days 00:22:12 37
0 days 00:22:48 38
0 days 00:23:24 39
0 days 00:24:00 40
0 days 00:24:36 41
0 days 00:25:12 42
0 days 00:25:48 43
0 days 00:26:24 44
0 days 00:27:00 45
0 days 00:27:36 46
0 days 00:28:12 47
0 days 00:28:48 48
0 days 00:29:24 49
Freq: 36S, dtype: int64
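A single element can also be looked up by its timedelta label. Using the td_series name from the sketch above (an assumption), the following lookup matches the result shown next:
print(td_series.loc[pd.Timedelta("0 days 00:27:36")])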
would yield
46
Here’s how to look at a subset of the data with respect to time range:
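A minimal sketch, again assuming the td_series name from above:
print(td_series.loc[pd.Timedelta("0 days 00:12:36"):pd.Timedelta("0 days 00:27:00")])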
would yield
0 days 00:12:36 21
0 days 00:13:12 22
0 days 00:13:48 23
0 days 00:14:24 24
0 days 00:15:00 25
0 days 00:15:36 26
0 days 00:16:12 27
0 days 00:16:48 28
0 days 00:17:24 29
0 days 00:18:00 30
0 days 00:18:36 31
0 days 00:19:12 32
0 days 00:19:48 33
0 days 00:20:24 34
0 days 00:21:00 35
0 days 00:21:36 36
0 days 00:22:12 37
0 days 00:22:48 38
0 days 00:23:24 39
0 days 00:24:00 40
0 days 00:24:36 41
0 days 00:25:12 42
0 days 00:25:48 43
0 days 00:26:24 44
0 days 00:27:00 45
Freq: 36S, dtype: int64
While the simplest form of index is referencing a specific row, Pandas library offers
much more sophisticated indexing such as multi-index, time delta index, period index,
and categorical index.
CSV
CSV stands for comma-separated values; the CSV format is one of the oldest and the
most common types of storing data. It is also known as tabular store, where data is stored
in rows and columns.
The data is stored in a text file. Each line of the file corresponds to a row. As the
title suggests, the data points may be separated using commas. There will be the same
number of data points per row. It is expected that the first line would contain the
column definitions also known as header or metadata. CSV data files are known for their
simplicity and ability to parse quickly.
Here is how one can import data in CSV format:
import pandas as pd
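A typical read might look like the following; the file path and column names are illustrative assumptions:
df = pd.read_csv(
    "path-to-file.csv",           # illustrative path
    usecols=["col_a", "col_b"]    # keep only the columns you need
)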
where the usecols parameter would help select only the columns required for the
project.
Here's how to export the data frame to a CSV file:
final_df.to_csv("path-to-file.csv", index=False)
JSON
JSON stands for JavaScript Object Notation. It is frequently used to transmit data between an app (web app, mobile app, etc.) and a web server. Furthermore, it is the de facto data format for API requests and responses.
The data is stored in key–value pairs, and data is organized in nested structure. JSON
supports data types like integers, strings, arrays, and Booleans, to name a few.
Here is how one can import a JSON document using Pandas:
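A minimal sketch, assuming an illustrative file path:
json_df = pd.read_json("path to file.json")
print(json_df)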
And here is how to export a data frame to a JSON document:
final_df.to_json('path to file.json')
HDF5
HDF5 stands for Hierarchical Data Format, version 5. Unlike CSV and JSON, HDF5 is a binary format that supports compression; data is optimized for efficient storage, and it provides faster read and write operations.
While working with huge amounts of data in HDF5, one can attempt to access just
the subset of the huge dataset without having to load the entire dataset in memory. And
so, HDF5 is physical memory friendly.
HDF5 supports large and complex heterogeneous data. As the name suggests, the
HDF5 data format has a hierarchical structure. The structure contains one or more
groups, wherein each group may contain various types of data along with their metadata:
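A minimal sketch of reading one group from an HDF5 file; the file path and key are illustrative assumptions, and the PyTables package must be installed:
import pandas as pd

new_df = pd.read_hdf("path-to-file.h5", key="dataset1")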
print(new_df)
If the entire file contains a single data frame, then the key parameter can be ignored.
Here is how to write a data frame to an existing HDF5 store:
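A minimal sketch, assuming a hypothetical store file name and key:
# open (or create) an HDF5 store and write the data frame under a key
exportstore = pd.HDFStore("export_store.h5")
exportstore.put("dataset1", new_df)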
print(exportstore.keys())
Feather
Originally developed as part of the Apache Arrow project for faster data exchange, Feather is a fast and efficient file format designed for data science and analysis. To work with the Feather format, one needs to install the PyArrow and feather-format Python packages as prerequisites.
Here is an example of reading a Feather file:
import pandas as pd

# the file path is illustrative; the pyarrow package must be installed
df = pd.read_feather("path-to-file.feather")
print(df)
Parquet
Parquet is developed by the Apache Parquet project and is an open source, columnar-
style file format. Parquet is designed for big data processing, data warehousing, and
analytics applications. Parquet has cross-language support and supports various
compression algorithms for data exchange and transport. Parquet provides faster performance when only a subset of a large dataset needs to be read for a given request or query. Parquet is considered a good format for read-heavy data pipelines.
Here is an example that uses the pyarrow engine:
import pandas as pd
import pyarrow

# the file path is illustrative
df = pd.read_parquet("path-to-file.parquet", engine="pyarrow")
print(df)
The same read can be performed with the fastparquet engine:
import pandas as pd
import fastparquet

# the file path is illustrative
df = pd.read_parquet("path-to-file.parquet", engine="fastparquet")
print(df)
There are options to choose the type of compression, like snappy, gzip, lz4, zstd, etc.
These are just various algorithms that attempt to reduce the file size.
ORC
ORC stands for Optimized Row Columnar. It is considered more efficient than flat file
formats. ORC is a columnar file format, where the column values are stored next to each
other. ORC file formats support various compression algorithms. Each ORC file contains
one or more stripes of data. Each stripe contains a header, data stored in rows, and a
footer. The footer contains metadata and summary statistics of the data points stored in
the given stripe. ORC is considered a good format for write-heavy data pipelines. Here is an example:
import pandas as pd
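The read call itself is not shown; a minimal sketch (the path is illustrative, and pyarrow must be installed) is:
df = pd.read_orc("path-to-file.orc")
print(df)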
Avro
Avro is a file format that is row based. It is safe to say that Avro evolved as part of
Apache Hadoop for big data processing. Avro schema is defined in JSON and supports
binary encoding for faster access and efficient storage. Binary encoding is a process
of serializing and deserializing data for processing and storing large amounts of data
efficiently. Avro also provides a container file, a special kind of a file that contains
multiple data inside it and enables storing various types of data together (image, audio,
text, etc.).
Pandas does not have native support for handling Avro files; however, third party
libraries can be used to load the Avro data. Here is an example:
import fastavro
import pandas as pd

# read Avro records into a list of dictionaries; the file path is illustrative
with open("path-to-file.avro", "rb") as f:
    avro_data = list(fastavro.reader(f))

new_df = pd.DataFrame(avro_data)
print(new_df)
Pickle
Just as one can serialize and deserialize data for efficient storage, the Python pickle module enables serializing and deserializing Python objects. When a Python object is pickled, the pickle module serializes the given object before writing it to a file. When the file is unpickled, it is deserialized and the object is reconstructed back to its original form. Unlike the other file formats, pickle is limited to the Python language alone.
Here’s how to export a data frame to pickle:
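A minimal sketch, assuming an illustrative file path:
df.to_pickle("path-to-file.pkl")
# and to read it back:
df_restored = pd.read_pickle("path-to-file.pkl")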
Chunk Loading
Chunk loading refers to importing a large dataset in relatively small chunks, as opposed to attempting to load the entire dataset in one go. By leveraging chunk loading, you can process several large datasets (even datasets larger than your available memory) without running into memory problems. Here is one way of leveraging chunk loading:
import pandas as pd
# read the file in chunks of 10,000 rows and concatenate them; the path is illustrative
chunks = pd.read_csv("path-to-large-file.csv", chunksize=10_000)
df = pd.concat(chunks)
Missing Values
Background
In data analysis and data processing tasks, missing data is a common phenomenon.
Missing data or missing values happen when there is no value available for a data
point in an event or an observation. You might witness an empty space with no value
populated in certain places in a dataset. Missing values can occur for a variety of reasons.
The most common idea is the error that may occur during the data entry stages.
During the process of manual data entry, it is not uncommon to forget to enter a value;
another case is when data values have been entered incorrectly and, during quality
checks, such data values are removed, therefore carrying a missing value label for a given
observation and a variable.
Such data entry errors can also occur in automated machines or sensor mechanisms.
There may be certain cases when the equipment performing reading may be under
maintenance or any other issues that could have led to not reading the value at a given
point in time, leading to missing value occurrences.
One of the common sources of data are surveys, whether it is done in person or
online. Surveys are not easy; a respondent may choose not to answer a question or may provide a value different from their true response (e.g., always picking the same point on a 1–5 scale). Besides, the relationships between various data points in a given survey also
matter. There may be cases where a missing value for one question may lead to missing
values for several downstream questions. For instance, imagine we have a survey about
motor vehicle insurance and the respondent does not own a motor vehicle; this may lead
to skipping most questions.
None
None is a common way of representing missing values in data analysis. None has its own
data type called “NoneType.” None specifically denotes no value at all, which is different
from an integer taking a 0; None is not the same as an empty string.
NaN
NaN stands for not a number. Not a number or NaN means the value is missing at a given
place. NaN has a float type. Here is an example.
Let us create a data frame:
import numpy as np

df = pd.DataFrame(np.random.randn(2,3), index=[1,2],
columns=["a","b","c"])
a b c
1 1.720131 -0.623429 0.196984
2 -1.093708 -0.190980 -1.063972
df2 = df.reindex([1,2,3,4,5])
print(df2)
would yield us
a b c
1 1.720131 -0.623429 0.196984
2 -1.093708 -0.190980 -1.063972
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
NaT
NaT means not a time. NaT represents missing values for the DateTime datatype.
NA
NA means not available. In the Pandas library, pd.NA is a single marker intended to denote missing values consistently across data types. Instead of using None, NaN, or NaT for different data types, NA may be employed to represent missing values in integer, Boolean, and string data types.
isna( ) Method
The Pandas isna() method returns a Boolean value for each element of the data frame, indicating whether that value is missing. The method returns True where a value is missing and False where it is not. It works with the NA and NaN types.
When we apply this method to the preceding example
df2.isna()
would yield us
a b c
1 False False False
2 False False False
3 True True True
4 True True True
5 True True True
Note You can also use “isnull()” in place of “isna()” for better readability.
notna( ) Method
The notna() can be seen as an inverse of the isna() method. Instead of identifying
missing values, this method detects non-missing values; and so, the method would
return true if the values are not missing and return false if the values are indeed missing.
Here is how that looks for our same example:
df2.notna()
would return
a b c
1 True True True
2 True True True
3 False False False
4 False False False
5 False False False
You can also count missing values (or total values), either across the whole frame or for a single column:
df2.isna().sum()
df2.isna().count()
df2["b"].isna().sum()
If you are looking to disregard the rows with missing values completely and keep only the data points that are not missing, then my suggestion is the notna() method (or, equivalently, dropna()).
To replace missing values with a constant, use fillna(); to estimate missing values from the neighboring observations, use interpolate():
df2.fillna(0)
df2.interpolate()
Note Consider scikit-learn's imputation utilities (SimpleImputer and the multivariate IterativeImputer) when you are reasonably confident the missing values stem from a clerical error whose true values can be estimated: https://scikit-learn.org/stable/modules/impute.html.
For instance, a study that includes test subjects in a certain age range, along with a few other known constraints, may allow the missing values to be backtracked with reasonable accuracy.
Data Transformation
Data transformation is one of the most crucial operations in building a quality dataset for
machine learning pipelines. In broad terms, it can be seen as the process of converting
data from one shape to another. The goal of the data transformation is to prepare data or
create a dataset for downstream consumption.
Data Exploration
When starting a new project or task, it is important to begin somewhere, and the first step is often the crucial one. The goal of that first step should be to learn as much as possible about the dataset. Fortunately, Pandas offers several statistical tools for data scientists and practitioners. Let's look at some of them in detail.
Here is a dataset for illustration:
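The dictionary behind this example is not shown in the text; the following reconstruction (an assumption consistent with the outputs shown below) is used here:
data = {
    'product': ['Apple', 'Orange', 'Melon', 'Peach', 'Pineapple'],
    'price': [22, 27, 25, 29, 35],
    'quantity': [5, 7, 4, 9, 3]
}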
df = pd.DataFrame(data)
Now
df['price'].describe()
would yield
count 5.000000
mean 27.600000
std 4.878524
min 22.000000
25% 25.000000
50% 27.000000
75% 29.000000
max 35.000000
Name: price, dtype: float64
df['product'].describe()
would yield
count 5
unique 5
top Apple
freq 1
Name: product, dtype: object
One can also obtain individual values for transformation purposes. Here are some
examples.
Calculating the average price of a fruit
df['price'].mean()
would yield
27.6
df['product'].unique()
would yield
array(['Apple', 'Orange', 'Melon', 'Peach', 'Pineapple'], dtype=object)
Creating a new column for the total cost of each product
df['total_cost'] = df['price']*df['quantity']
print(df)
would yield
product price quantity total_cost
0 Apple 22 5 110
1 Orange 27 7 189
2 Melon 25 4 100
3 Peach 29 9 261
4 Pineapple 35 3 105
Let’s add another column including the sale date. These dates are randomly
generated. And our dataset looks like this:
product price quantity total_cost sale_date
0 Apple 22 5 110 2023-12-09 22:59:06.997287
1 Orange 27 7 189 2024-01-03 11:03:12.484349
2 Melon 25 4 100 2023-12-10 01:59:22.598255
3 Peach 29 9 261 2023-12-21 05:28:01.462895
4 Pineapple 35 3 105 2023-12-17 23:38:42.783745
df['sale_date'].describe()
count 5
mean 2023-12-18 17:49:41.265306112
min 2023-12-09 22:59:06.997287
25% 2023-12-10 01:59:22.598255104
50% 2023-12-17 23:38:42.783745024
75% 2023-12-21 05:28:01.462895104
max 2024-01-03 11:03:12.484349
Name: sale_date, dtype: object
Furthermore, there is the option of sorting the dataset using the sort_values() method:
df.sort_values('price')
would yield
product price quantity total_cost sale_date
0 Apple 22 5 110 2023-12-09 22:59:06.997287
2 Melon 25 4 100 2023-12-10 01:59:22.598255
1 Orange 27 7 189 2024-01-03 11:03:12.484349
3 Peach 29 9 261 2023-12-21 05:28:01.462895
4 Pineapple 35 3 105 2023-12-17 23:38:42.783745
Left Join
Given two data frames, the left join of these data frames would mean to keep everything
on the data frame that is on the left and add any matching information from the data
frame that is on the right. Let us imagine we have two data frames, a and b, respectively.
The process starts off with keeping everything on the data frame a and looking for
potential matches from the data frame b. The matching is usually searched based on
the common column or a key in the data frame. When there is a match, then bring that
matched columns from b and merge them with a. If a row in a has no matching row in b, the columns coming from b are populated with NA.
Here is how that looks. Let us initialize two data frames:
left_df1 = pd.DataFrame({
'ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'John'],
'DeptID': [101, 102, 103, 104, 103]
})
would yield us
ID Name DeptID
0 1 Alice 101
1 2 Bob 102
2 3 Charlie 103
3 4 David 104
4 5 John 103
right_df1 = pd.DataFrame({
'DeptID': [101, 102, 103],
'DeptName': ['HR', 'IT', 'Marketing']
})
would yield us
DeptID DeptName
0 101 HR
1 102 IT
2 103 Marketing
leftjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="left"
)
print(leftjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 4 David 104 NaN
4 5 John 103 Marketing
In the preceding example, you can see that everything from left_df1 has been kept, while only the matching columns from right_df1 have been merged. There is exactly one row in left_df1 (DeptID 104) with no matching row in right_df1, and its DeptName is filled with NaN as a missing value.
Right Join
Right join can be seen as the inverse of left join.
Given two data frames, the right join of these data frames would mean to keep
everything on the data frame that is on the right and add any matching information from
the data frame that is on the left.
In our earlier example, we had two data frames, left_df1 and right_df1. The right join starts by keeping everything in right_df1 and looking for potential matches in left_df1. The matching is usually based on a common column or key in the data frames. When there is a match, the matched columns from left_df1 are brought over and merged with right_df1. If a row in right_df1 has no matching row in left_df1, the columns coming from left_df1 are populated with NA.
The data frame right_df has the following data:
DeptID DeptName
0 101 HR
1 102 IT
2 103 Marketing
ID Name DeptID
0 1 Alice 101
1 2 Bob 102
2 3 Charlie 103
3 4 David 104
4 5 John 103
rightjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="right"
)
print(rightjoin)
would yield
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
In the case of the right join, the table that is on the right remains unchanged. The
table that is on the left would be scanned to identify matching information based on
the department id column. The matched columns would be then brought over and be
annexed with the table on the right.
Outer Join
In the relational database world, these are called full outer joins. Outer join is where,
given two data frames, both of them are kept without losing any data. Outer joins have
the most number of missing values, but they preserve any and all data that is in both
data frames:
outerjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="outer"
)
print(outerjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
4 4 David 104 NaN
The outer join attempted to capture all data from both tables and assigned a missing
value (not a number) where a match is not available.
Inner Join
Inner join is the inverse of full outer join. These joins are where, given two data frames,
only the matched columns would be returned as the resultant data frame. The partial
matches from both the data frames would be excluded, and only complete matches will
be returned.
Here is how that looks for the same example:
innerjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="inner"
)
print(innerjoin)
would yield
Out[174]:
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
As you can see, the row without a match (DeptID 104) has been excluded from the resultant dataset.
Cross Join
Cross join is also referred to as a cartesian join. Every row in one data frame is paired with every row of the other data frame; no common key is required.
Here is how this looks.
Let us initialize two data frames.
df_one = pd.DataFrame({
'ID': [1, 2],
'Name': ['John', 'Doe']
})
would yield us
ID Name
0 1 John
1 2 Doe
df_two = pd.DataFrame({
'ItemID': [101, 102],
'Item': ['Laptop', 'Cable']
})
would yield us
ItemID Item
0 101 Laptop
1 102 Cable
cross_join = pd.merge(
df_one,
df_two,
how="cross"
)
print(cross_join)
would yield us
ID Name ItemID Item
0 1 John 101 Laptop
1 1 John 102 Cable
2 2 Doe 101 Laptop
3 2 Doe 102 Cable
Data Reshaping
Data reshaping is the process of transforming the structure of data while preserving the
values and observations contained in the dataset. The idea of data reshaping is to surface new information and insights by changing the dimensions of the data, without having to change the values themselves.
pivot( )
The most common idea of data reshaping is the concept of pivoting a table. Made
famous by Microsoft Excel, pivoting a table enables rearranging of rows and column
data. Common applications include preparing data for dashboards, further analysis, etc.
Here is an example of a pivot table. Let’s say we have the following dataset:
data = {
'Date':
['2023-06-01', '2023-06-01', '2023-06-01', '2023-06-02',
'2023-06-02', '2023-06-02'],
'City':
['Cupertino', 'Portland', 'Fort Worth', 'Cupertino', 'Portland',
'Fort Worth'],
'Temperature':
[56, 65, 69, 59, 61, 66]
}
df = pd.DataFrame(data)
would yield us
Date City Temperature
0 2023-06-01 Cupertino 56
1 2023-06-01 Portland 65
2 2023-06-01 Fort Worth 69
3 2023-06-02 Cupertino 59
4 2023-06-02 Portland 61
5 2023-06-02 Fort Worth 66
Now let us reshape the data so that the column values in the city become column
categories:
pivoted_df = df.pivot(
index='Date',
columns='City',
values='Temperature'
)
print(pivoted_df)
would yield us
City Cupertino Fort Worth Portland
Date
2023-06-01 56 69 65
2023-06-02 59 66 61
And so, the dates no longer repeat themselves, each city has become a column of its own, and the temperature data is displayed in a more presentable format. The pivot() function is very beneficial when you need to reshape the data without performing any type of aggregation.
pivot_table( )
The pivot_table() method is similar yet provides more functionality when compared with
the pivot() method. pivot_table() helps with aggregating numeric data while reshaping
the entire dataset. The pivot_table() method provides spreadsheet pivoting-like
functionality.
data = {
'Month':
['January', 'January', 'January', 'January',
'February', 'February', 'February', 'February'],
'Region':
['North', 'East', 'West', 'South',
'North', 'East', 'West', 'South' ],
'Product':
['Product A', 'Product A', 'Product A', 'Product A',
'Product B', 'Product B', 'Product B', 'Product B'],
'Sales':
[200, 150, 180, 210, 190, 175, 225, 250]
}
df = pd.DataFrame(data)
print(df)
would yield us
Month Region Product Sales
0 January North Product A 200
1 January East Product A 150
2 January West Product A 180
3 January South Product A 210
4 February North Product B 190
5 February East Product B 175
6 February West Product B 225
7 February South Product B 250
pivot_table_df = df.pivot_table(
index='Product',
columns='Region',
values='Sales',
aggfunc='mean'
)
print(pivot_table_df)
Region East North South West
Product
Product A 150 200 210 180
Product B 175 190 250 225
In the preceding output, the pivot_table() method converted the region categories into separate columns, converted the product categories into the index (separate rows), and, for each product and region combination, aggregated the sales by taking the statistical mean. This way, you can compare the sales performance of each product across regions.
stack( )
The stack() method is somewhat similar and closely related to the idea of pivoting
the table. The stack() method also reshapes the data by converting the column
categories into indexes. Let us look at an example to observe the functionality of the
stack() method.
Here is a dataset that contains quarterly sales of two products over two years:
data1 = {
'Product': ['Product A', 'Product B'],
'2023-Q1': [100, 80],
'2023-Q2': [150, 130],
'2023-Q3': [200, 180],
'2023-Q4': [250, 230],
'2024-Q1': [300, 280],
'2024-Q2': [350, 330],
'2024-Q3': [400, 380],
'2024-Q4': [450, 430]
}
df1 = pd.DataFrame(data1).set_index('Product')
print(df1)
would yield
2023-Q1 2023-Q2 2023-Q3 ... 2024-Q2 2024-Q3 2024-Q4
Product ...
Product A 100 150 200 ... 350 400 450
Product B 80 130 180 ... 330 380 430
quartSales = df1.stack()
print(quartSales)
would yield
Product
Product A 2023-Q1 100
2023-Q2 150
2023-Q3 200
2023-Q4 250
2024-Q1 300
2024-Q2 350
2024-Q3 400
2024-Q4 450
Product B 2023-Q1 80
2023-Q2 130
2023-Q3 180
2023-Q4 230
2024-Q1 280
2024-Q2 330
2024-Q3 380
2024-Q4 430
The resultant series has a multi-index built from what were previously column labels: one level is the product and the other is the sales quarter.
unstack( )
unstack() is the inverse operation of the stack() method. However, the two methods are independent of each other: you can call unstack() directly without having used stack() first. Let us look at an example:
data2 = {
'Product':
['Product A', 'Product A', 'Product B', 'Product B'],
'Month':
['January', 'February', 'January', 'February'],
'Sales':
[120, 110, 150, 140]
}
df3 = pd.DataFrame(data2).set_index(['Product','Month'])
print(df3)
would yield
Sales
Product Month
Product A January 120
February 110
Product B January 150
February 140
Here is how we can widen the narrow table using the unstack() method:
wideSales = df3.unstack()
print(wideSales)
Sales
Month February January
Product
Product A 110 120
Product B 140 150
melt( )
The melt() method would perform the inverse function of the pivot() function. The
melt() method would convert the wide data format into a narrow data format.
Let us look at the functionality of the melt() method with the following example:
melt_data = {
'Product':
['Product A', 'Product B', 'Product C'],
'January':
[300, 200, 400],
'February':
[350, 250, 450],
'March':
[400, 300, 500]
}
melt_df = pd.DataFrame(melt_data)
print(melt_df)
would yield
Product January February March
0 Product A 300 350 400
1 Product B 200 250 300
2 Product C 400 450 500
Now, let us apply the melt() method, where we would like to identify the product and
measure the product against the month and sales variables:
melted_df = melt_df.melt(
id_vars=['Product'],
var_name='Month',
value_name='Sales'
)
print(melted_df)
Product Month Sales
0 Product A January 300
1 Product B January 200
2 Product C January 400
3 Product A February 350
4 Product B February 250
5 Product C February 450
6 Product A March 400
7 Product B March 300
8 Product C March 500
The preceding resultant data frame has successfully converted the wide data frame
into a narrow data frame.
crosstab( )
crosstab() is a powerful statistical tool that is used to compute cross-tabulation of two or
more factors. The crosstab() function builds a frequency table for given columns.
Here is an example:
ctdata = {
'Person':
[1, 2, 3, 4,
5, 6, 7, 8, 9],
'State':
['NY', 'WA', 'CO', 'NY',
'WA', 'CO', 'NY', 'WA', 'CO'],
'Likes':
['Mountains', 'Mountains', 'Mountains', 'Mountains',
'Mountains', 'Oceans', 'StateParks', 'StateParks', 'StateParks']
}
ctdf = pd.DataFrame(ctdata)
print(ctdf)
would yield us
Person State Likes
0 1 NY Mountains
1 2 WA Mountains
2 3 CO Mountains
3 4 NY Mountains
4 5 WA Mountains
5 6 CO Oceans
6 7 NY StateParks
7 8 WA StateParks
8 9 CO StateParks
Now, let us build a frequency table with the columns “State” and “Likes,”
respectively.
crossstab_df = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"]
)
print(crossstab_df)
FavoriteActivity Mountains Oceans StateParks
Region
CO 1 1 1
NY 2 0 1
WA 2 0 1
The preceding data frame has provided the count of persons. You can also obtain the
percentage of persons who liked a certain activity by using the “normalize” argument:
normalActMatrix = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
normalize=True
)
print(normalActMatrix)
would yield us
FavoriteActivity Mountains Oceans StateParks
Region
CO 0.111111 0.111111 0.111111
NY 0.222222 0.000000 0.111111
WA 0.222222 0.000000 0.111111
You can also add the “margins” parameter to get the row and column sum of the
frequency table:
freqTable = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
margins=True
)
print(freqTable)
would yield
FavoriteActivity Mountains Oceans StateParks All
Region
CO 1 1 1 3
NY 2 0 1 3
WA 2 0 1 3
All 5 1 3 9
You can also combine the margins and normalize parameters to get row and column proportions:
normalFreqTable = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
margins=True,
normalize=True
)
print(normalFreqTable)
would yield us
FavoriteActivity Mountains Oceans StateParks All
Region
CO 0.111111 0.111111 0.111111 0.333333
NY 0.222222 0.000000 0.111111 0.333333
WA 0.222222 0.000000 0.111111 0.333333
All 0.555556 0.111111 0.333333 1.000000
factorize( )
The factorize() method is another important method to know for data reshaping. It numerically encodes a given categorical variable (which may be a string); some downstream analytic models require such a numerical encoding of categorical variables. Let us look at an example:
factdata = {
'Person':
[1, 2, 3, 4, 5],
'Likes':
['Mountains', 'Oceans', 'Oceans', 'Parks', 'Parks']
}
fact_df = pd.DataFrame(factdata)
print(fact_df)
would yield us
Person Likes
0 1 Mountains
1 2 Oceans
2 3 Oceans
3 4 Parks
4 5 Parks
The factorize() method returns two objects: the integer codes (labels) and the unique values. Here is how that looks in our example:
labels, uniques = pd.factorize(fact_df['Likes'])
print(labels)
print(uniques)
would yield us
[0 1 1 2 2]
Index(['Mountains', 'Oceans', 'Parks'], dtype='object')
The codes can then be attached as a new column:
fact_df['Likes_Coded'] = labels
print(fact_df)
would yield us
Person Likes Likes_Coded
0 1 Mountains 0
1 2 Oceans 1
2 3 Oceans 1
3 4 Parks 2
4 5 Parks 2
compare( )
The compare() method enables you to compare two data frames or series and identify
their differences. Let us take a simple example where we observe differences between
two data frames.
Here is a simple example:
ctdata = {
'Person':
[1, 2, 3, 4, 5, 6, 7, 8, 9],
'State':
['NY', 'WA', 'CO', 'NY',
'WA', 'CO', 'NY', 'WA', 'CO'],
'Likes':
['Mountains', 'Mountains', 'Mountains', 'Mountains',
'Mountains', 'Oceans', 'StateParks', 'StateParks', 'StateParks']
}
ct_df = pd.DataFrame(ctdata)
print(ct_df)
would yield
Person State Likes
0 1 NY Mountains
1 2 WA Mountains
2 3 CO Mountains
3 4 NY Mountains
4 5 WA Mountains
5 6 CO Oceans
6 7 NY StateParks
7 8 WA StateParks
8 9 CO StateParks
ct_df1 = ct_df.copy()
ct_df1.loc[3,'State'] = 'NC'
ct_df1.loc[5,'State'] = 'CA'
print(ct_df1)
would yield
Person State Likes
0 1 NY Mountains
1 2 WA Mountains
2 3 CO Mountains
3 4 NC Mountains
4 5 WA Mountains
5 6 CA Oceans
6 7 NY StateParks
7 8 WA StateParks
8 9 CO StateParks
ct_df.compare(ct_df1)
would yield us
State
self other
3 NY NC
5 CO CA
Let us try to keep the entire dataset while performing the comparison. This can be
achieved by using the “keep_shape” parameter:
ct_df.compare(ct_df1, keep_shape=True)
would result in
Person State Likes
self other self other self other
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NY NC NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN CO CA NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
Notice the use of missing values in the result. During comparison, if the values are identical in the corresponding rows and columns, NaN is shown; only the values that differ between the two data frames are displayed.
groupby( )
The groupby() method is an important function to know for data analysis with the Pandas library. It performs aggregations and summarizations by grouping data, and it usually involves one or more of the following three steps: splitting the data into groups, applying a function to each group, and combining the results. Here is an example:
groupbydata = {
'Person':
['Alice', 'Bob', 'Charlie', 'David'],
'Dept':
['Marketing', 'Marketing', 'DigitalM', 'DigitalM'],
'Sales':
[20, 12, 35, 42]
}
grp_df = pd.DataFrame(groupbydata)
print(grp_df)
would yield us
Person Dept Sales
0 Alice Marketing 20
1 Bob Marketing 12
2 Charlie DigitalM 35
3 David DigitalM 42
Now let us apply the groupby() to this dataset. Here we are looking to obtain total
sales performed by each department. This would mean grouping the dataset and
computing the sum for each of the categories.
Here is how that looks:
grp_df1 = grp_df.groupby('Dept')['Sales'].sum()
print(grp_df1)
would yield us
Dept
DigitalM 77
Marketing 32
Name: Sales, dtype: int64
Conclusion
So far, you have gained a solid understanding of the core concepts and techniques that
form the pillars of effective data manipulation using Pandas. Through this chapter you
have discovered the core data structures, indexing methods, data reshaping strategies,
advanced data wrangling methods, combining multiple Pandas objects, and working
with missing values. The concepts and ideas discussed here will serve you well even if you do not actively use the Pandas library; they lay the foundation for learning various other data wrangling libraries. You will notice in the upcoming chapters that some of these methods and patterns repeat themselves in some fashion.
CHAPTER 3

Data Wrangling using Rust's Polars

–– Write SQL-like statements in Polars and query your flat files on your
file system using Polars command line interface (CLI).
Introduction to Polars
Polars is a promising, up-and-coming Python library within the data processing
ecosystem of Python language. The data structures of Polars library are based on the
Apache Arrow project, meaning Polars stores its data internally using Apache Arrow
arrays, comparable to how Pandas library uses NumPy internally to store its data in the
form of NumPy arrays.
Polars library is written in Rust programming language. Rust is a recent
programming language that demonstrates C/C++-like efficiency. Rust provides strong
support for concurrency and parallel operations. Polars library on Python enables
multi-threading and parallelization on almost every command and every line of code.
No special instruction needs to be provided to enable parallelization—no need for
decorators or tweaking different settings. Parallel processing and multi-threading are
enabled by default.
It is possible to use the Polars library alongside other libraries within the Python data analysis ecosystem; keep in mind, however, that most Python libraries are single threaded. If a data processing task can be parallelized, one may leverage additional libraries such as multiprocessing to speed up the entire data pipeline.
The syntax of the Polars library in Python can appear very similar to that of the Pandas library. However, there are a few differences that will be discussed in the coming topics. To start, Polars does not support an index or multi-index for its data structures, unlike Pandas. And so, when writing Polars code, one cannot use methods like loc or iloc that are available in Pandas.
The Polars library supports both lazy evaluation and eager evaluation, whereas Pandas supports only eager evaluation. Polars automatically optimizes the query plan and finds opportunities to use memory optimally while accelerating the execution of a query.
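The function discussed next does not appear in the text; a minimal sketch of what it might look like (the name ap and the printed messages are assumptions) is:

def ap(a, b):
    print(f"first value: {a}")    # evaluated immediately
    print(f"second value: {b}")   # evaluated immediately
    c = a + b                     # under lazy evaluation, this line could be deferred...
    print(f"sum: {c}")            # ...until the sum is actually needed here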
ap(4,3)
Here, this program is a simple case of eager evaluation. Every line in that function is
processed. When the function is called with parameters, all the three lines are evaluated
and the statements are printed.
In contrast, in the case of lazy evaluation, the line that contains c=a+b will be deferred
from execution, until the time it is actually needed. Can you guess when c=a+b is actually
needed? It would be the last line where we are printing the sum. That is when the addition
operation is actually executed. The function requires the value of c to be computed.
Note To illustrate this even further, let’s take an example of visiting a restaurant
to get a meal. Eager evaluation is like visiting a buffet restaurant with standard
offerings and no customizations. Every dish is cooked and ready to be served,
regardless of how much the customer would consume or whether anyone would
even consume that dish at all.
Lazy evaluation is like visiting a restaurant, where the server person would ask
questions about which sides and items you like, how you like them, whether you
are allergic to something, and several other things. The chef would then prepare
the dish and include items based on the specific responses of the customer.
Eager evaluation computes all the results up front and consumes system resources to compute everything regardless of whether the results are ultimately needed; it may have slower initial performance, but subsequent access may be faster. Lazy evaluation computes only the results that are required and conserves system resources by executing only what is necessary; it may seem faster initially, but accessing subsequent results may be slower.
To install Polars, you can use pip:
pip install polars
or
pip install -U polars
or
conda install -c conda-forge polars
Then import the library:
import polars as pl
Polars offers data structures similar to those of Pandas library. They are Series,
DataFrames, and LazyFrames (the one with lazy evaluation). It may be safe to say that
data frames are the most commonly used data structure in the Python ecosystem.
Polars Series
Here is an example of Series:
a = pl.Series([1,4,7,9,16,25])
print(a)
b = pl.Series(
"firstSeries",
[1,4,7,9,16,25]
)
print(b)
would yield
shape: (6,)
Series: 'firstSeries' [i64]
[
1
4
7
9
16
25
]
b1 = pl.Series(
"secondSeries",
[36,47]
)
b = b.append(b1)
print(b)
would yield
shape: (8,)
Series: 'firstSeries' [i64]
[
1
4
7
9
16
25
36
47
]
Series support several other methods as well; here is an example of obtaining the
total number of unique values:
a1 = pl.Series(
[1,4,7,9,16,25,5,25,9,5]
)
print(a1.n_unique())
would yield
7
Polars DataFrames
Here is an example of creating a Polars data frame:
df = pl.DataFrame(
{
"fruits":
["Apple","Peach","Melon","Pineapple","Orange"],
"price":
[1.5, 0.75, 3.0, 3.25, 1.33],
"quantity":
[4, 3, 1, 2, 6]
}
)
print(df)
would yield
shape: (5, 3)
┌────────┬─────┬───────┐
│ fruits ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═════╪═══════╡
│ Apple ┆ 1.5 ┆ 4 │
│ Peach ┆ 0.75 ┆ 3 │
│ Melon ┆ 3.0 ┆ 1 │
│ Pineapple ┆ 3.25 ┆ 2 │
│ Orange ┆ 1.33 ┆ 6 │
└────────┴─────┴───────┘
By default, Polars prints data frames with aligned rows and columns and with borders, resembling the output of a pretty-print library in Python. Like Pandas, one can view the head and tail of the data frame simply by calling head(n) and tail(n), respectively.
The describe() method returns summary statistics for the data frame. Here is how the syntax looks:
print(df.describe())
would yield
shape: (9, 4)
┌─────────┬────────┬───────┬───────┐
│ describe ┆ fruits ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞═════════╪════════╪═══════╪═══════╡
│ count ┆ 5 ┆ 5.0 ┆ 5.0 │
│ null_count ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 1.966 ┆ 3.2 │
│ std ┆ null ┆ 1.097511 ┆ 1.923538 │
│ min ┆ Apple ┆ 0.75 ┆ 1.0 │
│ 25% ┆ null ┆ 1.33 ┆ 2.0 │
│ 50% ┆ null ┆ 1.5 ┆ 3.0 │
│ 75% ┆ null ┆ 3.0 ┆ 4.0 │
│ max ┆ Pineapple ┆ 3.25 ┆ 6.0 │
└─────────┴────────┴───────┴───────┘
LazyFrames
Here is how to create a lazy frame by appending lazy() to a data frame definition:
plf1 = pl.DataFrame(
{
"fruits":
["Apple","Peach","Melon","Pineapple","Orange"],
"price":
[1.5, 0.75, 3.0, 3.25, 1.33],
"quantity":
[4, 3, 1, 2, 6]
}
).lazy()
By appending lazy(), the data frame is now set up as a lazy frame.
Here is how to convert an existing Pandas data frame into a Polars lazy frame.
Let’s initialize a Pandas data frame to begin with:
import pandas as pd
pandas_df1 = pd.DataFrame(
{
"a":
["alpha","beta","gamma","delta","eta"],
"b":
[1, 2, 3, 4, 5]
}
)
print(pandas_df1)
would yield
a b
0 alpha 1
1 beta 2
2 gamma 3
3 delta 4
4 eta 5
pandas_lazyframe = pl.LazyFrame(pandas_df1)
print(pandas_lazyframe)
would yield a LazyFrame object; the query plan is not executed until you call collect().
One can also convert an existing Pandas data frame into a Polars lazy frame in this manner.
CSV
Here is how to write a Polars data frame to a CSV file:
polars_df.write_csv("destination to file.csv")
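For completeness, here is a minimal sketch of reading a CSV file with Polars; the file path is an illustrative assumption:
polars_df = pl.read_csv("path to file.csv")
# or, to defer execution with lazy evaluation:
lazy_df = pl.scan_csv("path to file.csv")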
JSON
Polars supports read and write operations for both JSON and ndJSON.
ndJSON stands for newline-delimited JSON and enables faster inserts and reads
from files.
Here’s how to read JSON:
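A minimal sketch, assuming an illustrative file path:
json_df = pl.read_json("path to file.json")
# and to write a data frame back to JSON:
json_df.write_json("path to output.json")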
To enable lazy evaluation, Polars offers a scan option, which is limited to ndJSON files:
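A minimal sketch, assuming an illustrative file path:
lazy_json = pl.scan_ndjson("path to file.ndjson")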
Parquet
As noted earlier, Parquet is a columnar data store, and it is considered faster than many other data formats. Here is how one can read Parquet files with Polars:
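A minimal sketch, assuming an illustrative file path:
parquet_df = pl.read_parquet("path to file.parquet")
# or, lazily:
lazy_parquet = pl.scan_parquet("path to file.parquet")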
Polars also can provide lazy evaluation and read and write data in various other
formats like Avro, ORC, Feather, etc.
Polars Context
A context is the setting in which expressions are evaluated and executed. There are three main contexts in Polars: selection contexts, filtering contexts, and group-by contexts. Let's look at them in detail.
Selection Context
A selection context is where the idea of selection applies to a given Polars data frame. The main operations performed in a selection context are dataframe.select(...) and dataframe.with_columns(...). Here is an example:
import polars as pl
df = pl.DataFrame(
{
"A": [1, 2, 3],
"B": [4, 5, 6]
}
)
print(df)
shape: (3, 2)
┌────┬────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════╪════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└────┴────┘
print(df.select(['A']))
shape: (3, 1)
┌────┐
│ A │
│ --- │
│ i64 │
╞════╡
│ 1 │
│ 2 │
│ 3 │
└────┘
However, this statement has very limited scope. If you want to perform further
transformations on the selected column, you need to include an additional expression.
Here is how that looks:
timesTwo = df.select(
( pl.col(['A']) * 2 )
.alias("NewCol_A")
)
print(timesTwo)
would yield
shape: (3, 1)
┌───────┐
│ NewCol_A │
│ --- │
│ i64 │
╞═══════╡
│ 2 │
│ 4 │
│ 6 │
└───────┘
Filter Context
A filter context is where a given expression filters a dataset and returns a data frame that matches the user-supplied conditions. The main operation performed in this context is dataframe.filter(...). Here is an example:
conFilter = df.filter(pl.col('A')>1)
print(conFilter)
would yield us
shape: (2, 2)
┌────┬────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════╪════╡
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└────┴────┘
Group-By Context
A group-by context enables performing aggregation and summaries on the existing
Polars data frame while also performing computations on top of the aggregated data
frame. The operation that can be performed in this context is
dataframe.group_by([...]).agg([...])
df1 = pl.DataFrame(
{
"Region":
["CityBranch", "CampusBranch", "CityBranch",
"SuburbBranch", "CampusBranch", "SuburbBranch",
"CityBranch", "CampusBranch"],
"Product":
["A", "B", "B",
"A", "B", "A",
"A", "B"],
"Sales":
[100, 200, 150, 300, 250, 400, 350, 180]
}
)
print(df1)
would yield us
shape: (8, 3)
┌──────────┬───────┬─────┐
│ Region ┆ Product ┆ Sales │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════════╪═══════╪═════╡
│ CityBranch ┆ A ┆ 100 │
│ CampusBranch ┆ B ┆ 200 │
│ CityBranch ┆ B ┆ 150 │
│ SuburbBranch ┆ A ┆ 300 │
│ CampusBranch ┆ B ┆ 250 │
│ SuburbBranch ┆ A ┆ 400 │
│ CityBranch ┆ A ┆ 350 │
│ CampusBranch ┆ B ┆ 180 │
└──────────┴───────┴─────┘
groupbydf = df1\
.group_by(["Region", "Product"])\
.agg(
[
pl.col("Sales")
.sum()
.alias("Total"),
pl.col("Sales")
.max()
.alias("Max Sales"),
pl.count("Sales")
.alias("TransactionsCount")\
]
)
print(groupbydf)
would yield us
shape: (4, 5)
┌──────────┬───────┬─────┬────────┬──────────────┐
│ Region ┆ Product ┆ Total ┆ Max Sales ┆ TransactionsCount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ u32 │
╞══════════╪═══════╪═════╪════════╪══════════════╡
│ CityBranch ┆ B ┆ 150 ┆ 150 ┆ 1 │
│ CityBranch ┆ A ┆ 450 ┆ 350 ┆ 2 │
│ SuburbBranch ┆ A ┆ 700 ┆ 400 ┆ 2 │
│ CampusBranch ┆ B ┆ 630 ┆ 250 ┆ 3 │
└──────────┴───────┴─────┴────────┴──────────────┘
The same data can also be defined directly as a lazy frame:
df1_lazy = pl.LazyFrame(
{
"Region":
["CityBranch", "CampusBranch", "CityBranch",
"SuburbBranch", "CampusBranch", "SuburbBranch",
"CityBranch", "CampusBranch"],
"Product":
["A", "B", "B",
"A", "B", "A",
"A", "B"],
"Sales":
[100, 200, 150,
300, 250, 400,
350, 180]
}
)
print(df1_lazy)
would yield a LazyFrame representation (not the data itself); call collect() to execute the plan and materialize the result.
Basic Operations
Let us look at performing basic operations while working with expressions in Polars.
df3 = pl.DataFrame(
{
"A": [1, 2, 3],
"B": [4, 5, 6]
}
)
print(df3)
would yield us
shape: (3, 2)
┌────┬────┐
│ A ┆ B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════╪════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└────┴────┘
modified_df3 = df3.select(
(pl.col("A") + 12)
.alias("A+12"),
(pl.col("B") - 3)
.alias("B-3")
)
print(modified_df3)
would yield us
shape: (3, 2)
┌─────┬────┐
│ A+12 ┆ B-3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪════╡
│ 13 ┆ 1 │
│ 14 ┆ 2 │
│ 15 ┆ 3 │
└─────┴────┘
To keep the original columns and append the new ones, use the with_columns() method:
modified_df3_wc = df3.with_columns(
(pl.col("A") + 12)
.alias("A+12"),
(pl.col("B") - 3)
.alias("B-3")
)
print(modified_df3_wc)
would return
shape: (3, 4)
┌────┬────┬─────┬────┐
│ A ┆ B ┆ A+12 ┆ B-3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════╪════╪═════╪════╡
│ 1 ┆ 4 ┆ 13 ┆ 1 │
│ 2 ┆ 5 ┆ 14 ┆ 2 │
│ 3 ┆ 6 ┆ 15 ┆ 3 │
└────┴────┴─────┴────┘
Logical and comparison operations work the same way:
logical_df3 = df3.select(
(pl.col('A')>1)
.alias("Is A > 1?")
)
print(logical_df3)
would yield us
shape: (3, 1)
┌────────┐
│ Is A > 1? │
│ --- │
│ bool │
╞═════════╡
│ false │
│ true │
│ true │
└────────┘
You can also create a new column by using the with_columns() method:
logical_df3_wc = df3.with_columns(
(pl.col('A')>1)
.alias("Is A > 1?")
)
print(logical_df3_wc)
would return
shape: (3, 3)
┌────┬────┬────────┐
│ A ┆ B ┆ Is A > 1? │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ bool │
╞════╪════╪════════╡
│ 1 ┆ 4 ┆ false │
│ 2 ┆ 5 ┆ true │
│ 3 ┆ 6 ┆ true │
└────┴────┴─────────┘
String Operations
Polars uses Apache Arrow as its backend, which enables a contiguous memory model
for all its operations. Contiguous memory is where the elements are stored sequentially
so that the CPU can retrieve these elements faster and have more efficient operations.
String processing can get expensive as the data grows over time.
polars_dataset = pl.read_csv(
"/path-to-the-folder/polars_dataset.csv",
separator=",",
infer_schema_length=0
) \
.with_columns(
pl.col("birthdate")
.str.to_date(strict=False)
)
print(polars_dataset)
would yield
shape: (1_000, 7)
┌───────┬───────┬─────┬────────┬────────┬──────────┬──────────┐
│ firstname ┆ lastname ┆ gender ┆ birthdate ┆ type ┆ state ┆ occupation │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ date ┆ str ┆ str ┆ str │
╞════════╪═══════╪═════╪════════╪═══════╪═════════╪══════════╡
│ Rosa ┆ Houseago ┆ F ┆ null ┆ Stateparks┆ California ┆ Accountant │
│ Reese ┆ Medford ┆ M ┆ null ┆ Stateparks┆ District of ┆ Electrician │
│ ┆ ┆ ┆ ┆ ┆ Columbia ┆ │
│ Gianna ┆ Rylands ┆ F ┆ 1973-07-02 ┆ Mountains ┆ Kentucky ┆ Doctor │
│ Abran ┆ McGeown ┆ M ┆ 1962-12-08 ┆ Baseball ┆ Virginia ┆ Accountant │
│ Lanita ┆ Yantsev ┆ F ┆ 1976-06-07 ┆ Stateparks┆ Illinois ┆ Electrician │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ Selina ┆ Heyworth ┆ F ┆ 1951-04-10 ┆ Baseball ┆ Ohio ┆ IT Professional│
│ Rubi ┆ Licari ┆ F ┆ null ┆ Oceans ┆ Michigan ┆ Lawyer │
│ Rubin ┆ Stanworth┆ M ┆ 1956-09-08 ┆ Baseball ┆ Indiana ┆ Electrician │
│ Benjie ┆ Amort ┆ M ┆ 1957-07-07 ┆ Museums ┆ Kentucky ┆ Doctor │
│ Gallard ┆ Samuels ┆ M ┆ 1961-07-09 ┆ Oceans ┆ California ┆ Doctor │
└───────┴───────┴──────┴────────┴───────┴─────────┴───────────┘
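Beyond parsing dates, the str namespace exposes many vectorized string expressions. A short sketch, assuming the column names shown above (note that str.len_chars is named str.lengths in older Polars releases):
string_ops = polars_dataset.with_columns(
    pl.col("firstname").str.to_uppercase().alias("firstname_upper"),   # upper-case a column
    pl.col("occupation").str.contains("IT").alias("is_it_role"),       # substring match
    pl.col("lastname").str.len_chars().alias("lastname_length"),       # string length
)
print(string_ops.head())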
Left Join
A left join returns all rows from the left table, scans the right table for matching rows, and appends the matching columns to the left table. Let us look at an example.
Here we initialize two data frames:
import polars as pl
df1 = pl.DataFrame({
'ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'John'],
'DeptID': [101, 102, 103, 104, 103]
})
df2 = pl.DataFrame({
'DeptID': [101, 102, 103],
'DeptName': ['HR', 'IT', 'Marketing']
})
Now, we perform the left join by joining using the department id:
leftjoin = df1.join(
df2,
on = "DeptID",
how = "left",
coalesce=True
)
print(leftjoin)
If you specify coalesce=True, the join key columns from the two frames are merged into a single column, so a duplicated DeptID_right column does not appear in the final result. This would yield us
shape: (5, 4)
┌────┬───────┬──────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═══════╪══════╪═════════╡
│ 1 ┆ Alice ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ Marketing │
│ 4 ┆ David ┆ 104 ┆ null │
│ 5 ┆ John ┆ 103 ┆ Marketing │
└────┴───────┴──────┴────────┘
If coalesce is set to false, here is how that may change our dataset:
leftjoin = df1.join(
df2,
on = "DeptID",
how = "left",
coalesce=False
)
print(leftjoin)
would yield us
shape: (5, 5)
┌────┬───────┬──────┬──────────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptID_right ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ str │
╞════╪═══════╪══════╪══════════╪════════╡
│ 1 ┆ Alice ┆ 101 ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ 103 ┆ Marketing │
│ 4 ┆ David ┆ 104 ┆ null ┆ null │
│ 5 ┆ John ┆ 103 ┆ 103 ┆ Marketing │
└────┴───────┴──────┴──────────┴────────┘
Outer Join
An outer (full) join returns all rows from both tables, matching rows where possible and filling the columns that have no match with nulls:
outerjoin = df1.join(
df2,
on = "DeptID",
how = "full",
coalesce=True
)
print(outerjoin)
would yield us
shape: (5, 4)
┌────┬───────┬──────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═══════╪══════╪════════╡
│ 1 ┆ Alice ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ Marketing │
│ 4 ┆ David ┆ 104 ┆ null │
│ 5 ┆ John ┆ 103 ┆ Marketing │
└────┴───────┴──────┴────────┘
Inner Join
An inner join returns only the rows whose join keys have matching values in both tables:
innerjoin = df1.join(
df2,
on = "DeptID",
how = "inner",
coalesce=True)
print(innerjoin)
would yield us
shape: (4, 4)
┌────┬───────┬──────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═══════╪══════╪════════╡
│ 1 ┆ Alice ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ Marketing │
│ 5 ┆ John ┆ 103 ┆ Marketing │
└────┴───────┴──────┴────────┘
Semi Join
A semi join returns the rows of the left table that have a match in the right table, without adding any columns from the right table:
semijoin = df1.join(
df2,
on = "DeptID",
how = "semi",
coalesce=True)
print(semijoin)
would yield us
shape: (4, 3)
┌────┬───────┬──────┐
│ ID ┆ Name ┆ DeptID │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════╪═══════╪══════╡
│ 1 ┆ Alice ┆ 101 │
│ 2 ┆ Bob ┆ 102 │
│ 3 ┆ Charlie ┆ 103 │
│ 5 ┆ John ┆ 103 │
└────┴───────┴──────┘
Anti Join
An anti join returns the rows of the left table that have no match in the right table:
antijoin = df1.join(
df2,
on = "DeptID",
how = "anti",
coalesce=True)
print(antijoin)
would yield us
shape: (1, 3)
┌────┬─────┬──────┐
│ ID ┆ Name ┆ DeptID │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════╪═════╪══════╡
│ 4 ┆ David ┆ 104 │
└────┴─────┴──────┘
Cross Join
A cross join produces the Cartesian product of two tables: every row of the left table is paired with every row of the right table.
Here is how that would look. Let us initialize two data frames:
df_one = pl.DataFrame(
{
'ID': [1, 2],
'Name': ['John', 'Doe']
}
)
df_two = pl.DataFrame(
{
'ItemID': [101, 102],
'Item': ['Laptop', 'Cable']
}
)
Now let us perform the cartesian product between these two data frames:
crossjoin = df_one.join(
df_two,
how = "cross",
coalesce=True
)
print(crossjoin)
would yield us
shape: (4, 4)
┌────┬─────┬──────┬──────┐
│ ID ┆ Name ┆ ItemID ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═════╪══════╪══════╡
│ 1 ┆ John ┆ 101 ┆ Laptop │
│ 1 ┆ John ┆ 102 ┆ Cable │
│ 2 ┆ Doe ┆ 101 ┆ Laptop │
│ 2 ┆ Doe ┆ 102 ┆ Cable │
└────┴─────┴──────┴──────┘
Advanced Operations
Let us look at some advanced methods in the Polars data analysis library. For our analysis purposes, we will leverage our earlier dataset:
import polars as pl
polars_dataset = pl.read_csv(
"/Users/pk/Documents/polars_dataset.csv",
separator=",",
infer_schema_length=0
) \
.with_columns(
pl.col("birthdate")
.str.to_date(strict=False)
)
print(polars_dataset)
Let us first isolate the rows where the birthdate is missing:
missingdata = polars_dataset.filter(
pl.col("birthdate").is_null()
)
print(missingdata)
would yield us
shape: (588, 7)
(table body not reproduced here: the 588 rows whose birthdate is null, with the same seven columns as above)
Next, let us count how many records fall under each occupation:
various_occupations = polars_dataset.group_by('occupation') \
.agg(
pl.count('occupation') \
.alias('various_occupations')
)
print(various_occupations)
would yield us
shape: (5, 2)
┌────────────┬───────────────┐
│ occupation ┆ various_occupations │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═══════════════╡
│ Doctor ┆ 225 │
│ IT Professional ┆ 196 │
│ Accountant ┆ 178 │
│ Lawyer ┆ 200 │
│ Electrician ┆ 201 │
└────────────┴───────────────┘
Now let us pivot the dataset, producing one row per state and one column per occupation:
pivotdata = polars_dataset.pivot(
index="state",
columns="occupation",
values="firstname",
aggregate_function="first"
)
print(pivotdata)
would yield us
shape: (51, 6)
┌────────────────┬─────────┬──────────┬──────┬────────────┬──────┐
│ state ┆ Accountant ┆ Electrician ┆ Doctor ┆ IT Professional ┆ Lawyer │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞════════════════╪═════════╪═════════╪═══════════════════╪══════╡
│ California ┆ Rosa ┆ Margret ┆ Devin ┆ Joana ┆ Yetta │
│ District of Columbia ┆ Ginger ┆ Reese ┆ Giles ┆ Zorina ┆ Nancey │
│ Kentucky ┆ Neila ┆ Aurea ┆ Gianna ┆ Cesare ┆ Orren │
│ Virginia ┆ Abran ┆ Claudelle ┆ Kalil ┆ Gwendolen ┆ Rog │
│ Illinois ┆ Gareth ┆ Lanita ┆ Axel ┆ Darelle ┆ Nancee │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ Hawaii ┆ Barr ┆ Lauretta ┆ null ┆ null ┆ Kiel │
│ Kansas ┆ null ┆ Sarena ┆ null ┆ Cherye ┆ Helyn │
│ Vermont ┆ null ┆ null ┆ null ┆ Kerrie ┆ null │
│ South Dakota ┆ Thorvald ┆ null ┆ null ┆ null ┆ null │
│ Maine ┆ null ┆ null ┆ Rustin ┆ null ┆ null │
└────────────────┴─────────┴──────────┴──────┴────────────┴──────┘
Polars/SQL Interaction
While the Polars library does not have an underlying SQL engine, it does support interacting with data objects using SQL-like statements. When you issue a SQL-like statement against a Polars data object, Polars converts it into the appropriate expressions and continues to provide fast results (as it always does). You can execute "SQL queries" on flat files, JSON files, and even Pandas data objects. Let us look at an illustration involving a dataset that contains random health data.
import polars as pl
df = pl.read_csv(
"path-to-your-folder/random_health_data.csv"
)
print(df.head())
would yield
shape: (5, 7)
┌────┬──────────────┬────┬─────┬──────┬────────────┬─────────┐
│ id ┆ name ┆ age ┆ state ┆ gender ┆ exercise_daily? ┆ copay_paid │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str ┆ bool ┆ str │
╞════╪═════════════╪═════╪═════╪══════╪════════════╪═════════╡
│ 1 ┆ Allayne Moffett ┆ null ┆ SC ┆ Male ┆ null ┆ $206.76 │
│ 2 ┆ Kerby Benjafield ┆ null ┆ NM ┆ Male ┆ true ┆ $21.58 │
│ 3 ┆ Raina Vallentin ┆ null ┆ MI ┆ Female ┆ true ┆ $125.18 │
│ 4 ┆ Kaela Trodden ┆ null ┆ OH ┆ Female ┆ false ┆ $86.20 │
│ 5 ┆ Faber Kloisner ┆ null ┆ MN ┆ Male ┆ false ┆ $219.38 │
└────┴──────────────┴────┴─────┴──────┴────────────┴─────────┘
This fictitious dataset contains the name of the patient, demographic data points like gender and age, the state where they reside, and a column called copay_paid. Notice that copay_paid is a string and begins with a special character, the dollar symbol. To convert this column into a numeric data type, we need to remove the dollar symbol and convert the remainder to a decimal.
Here is how we can remove the special character from the dataset:
df = df.with_columns(
pl.col("copay_paid") \
.str \
.slice(1) \
.alias("copay_paid")
)
print(df.head())
would yield
shape: (5, 7)
┌────┬──────────────┬────┬─────┬──────┬────────────┬─────────┐
│ id ┆ name ┆ age ┆ state ┆ gender ┆ exercise_daily? ┆ copay_paid │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str ┆ bool ┆ str │
╞════╪═════════════╪═════╪═════╪══════╪════════════╪═════════╡
│ 1 ┆ Allayne Moffett ┆ null ┆ SC ┆ Male ┆ null ┆ 206.76 │
│ 2 ┆ Kerby Benjafield ┆ null ┆ NM ┆ Male ┆ true ┆ 21.58 │
│ 3 ┆ Raina Vallentin ┆ null ┆ MI ┆ Female ┆ true ┆ 125.18 │
│ 4 ┆ Kaela Trodden ┆ null ┆ OH ┆ Female ┆ false ┆ 86.20 │
│ 5 ┆ Faber Kloisner ┆ null ┆ MN ┆ Male ┆ false ┆ 219.38 │
└────┴──────────────┴────┴─────┴──────┴────────────┴─────────┘
Now, we can convert the string data type to a decimal data type. Here is how we can
do that:
df = df.with_columns(
pl.col("copay_paid")\
.str \
.to_decimal() \
.alias("copay"))
print(df.head())
would yield
shape: (5, 8)
(table body not reproduced here: the frame now carries an eighth column, copay, with a decimal data type)
Let us now initialize the SQL context of Polars. The SQL context is where Polars stores the mapping between the table names used in your SQL-like statements and the underlying data frames, so this step is essential.
Here is how we can do that:
pl_sql = pl.SQLContext(
random_health_data = df
)
You would notice the Polars data frame has been registered as "random_health_data." This will be our table name when writing SQL-like statements. Let us define our query:
query1 = """
select
state,
count(name) as total_Patients,
max(copay) as most_Expensive
from
random_health_data
where
state is not null
and copay is not null
group by
state
126
Chapter 3 Data Wrangling using Rust’s Polars
order by
mostExpensive desc
limit
10
"""
Now, we can execute our SQL query using our SQL context. Here is how we can
do that:
sqlresult = pl_sql.execute(
query1,
eager=True
)
print(sqlresult)
would yield
shape: (10, 3)
┌─────┬──────────┬───────────┐
│ state ┆ totalPersons ┆ mostExpensive │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ decimal[*,2] │
╞═════╪══════════╪═══════════╡
│ CA ┆ 95 ┆ 274.51 │
│ HI ┆ 8 ┆ 274.03 │
│ VA ┆ 26 ┆ 273.82 │
│ AZ ┆ 22 ┆ 273.68 │
│ WI ┆ 11 ┆ 273.41 │
│ NY ┆ 35 ┆ 272.23 │
│ FL ┆ 62 ┆ 271.62 │
│ IA ┆ 16 ┆ 270.97 │
│ TN ┆ 16 ┆ 270.54 │
│ MA ┆ 16 ┆ 269.18 │
└─────┴──────────┴───────────┘
Note You may have noticed “eager=True” during the execution of the SQL
query within the SQL context. You can also set the option to False to enable lazy
execution. This will return a Polars lazy frame. You can continue to use that lazy
frame on downstream data manipulation tasks.
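A minimal sketch of the lazy route (not from the book), reusing pl_sql and query1 from above:
lazy_sql = pl_sql.execute(query1, eager=False)   # returns a LazyFrame
top_states = (
    lazy_sql
    .filter(pl.col("totalPersons") > 20)         # keep chaining lazy transformations
    .collect()                                   # execute everything at once
)
print(top_states)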
Polars CLI
Polars also has an option to access the library through a command line interface. One
can execute Polars commands and SQL-like commands right from the command line.
To get started, install the Polars CLI (it is distributed separately from the Python package) and type "polars" in your terminal to launch the interactive interface. From the Polars prompt, you can execute a SQL query directly against a flat file in your file system, as shown in Figure 3-2.
Figure 3-2. Writing SQL queries on flat files using Polars CLI
You can even write the SQL query directly in your shell and pipe it into the polars command, which executes it and prints the result, for example:
┌──────────────┬────────────┐
│ fullname ┆ occupation │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪════════════╡
│ Cynthia Paley ┆ Doctor │
│ Adam Seaborn ┆ Doctor │
│ Lilah Mercik ┆ IT Professional │
│ Alidia Wreiford ┆ Lawyer │
│ Rosita Guswell ┆ Accountant │
Conclusion
Congratulations on reaching this far! So far, you have covered a lot of ground with
Python, SQL, git, Pandas, and Polars from these chapters. There is immense potential in
Polars. So far, we looked at how to extract and load data in various formats, perform data
transformation operations like filtering and aggregation, and combine multiple Polars
objects using join operations. We also explored handling missing values, identifying
unique values, and reshaping data with pivot operations. I hope you will adopt Polars in your data pipelines and demonstrate greater efficiency and value to your organization. While Polars may be the perfect solution for CPU-based data wrangling tasks, in the next chapter we will look at GPU-based data wrangling libraries and how one can process massive amounts of data using GPUs. I hope you stay excited for the upcoming chapter.
CHAPTER 4
GPU Driven Data Wrangling Using CuDF
Note The CPU also has a cache, a small memory unit that stores data required for a given iteration of a computation. The cache is faster than main memory, and modern CPUs typically have three levels of it (L1, L2, and L3). There is also a memory management unit that controls data movement between the CPU and RAM during their communication and coordination for processing various tasks.
Introduction to CUDA
CUDA stands for Compute Unified Device Architecture. CUDA is a software framework
developed by NVIDIA for writing code specifically for NVIDIA GPUs. It enables the
execution of kernels on NVIDIA GPUs, harnessing their parallel computing power for
complex data processing. Although CUDA code is primarily written in C/C++, it also supports Python programming through libraries such as CuPy and PyCUDA. However, it's important to note that programs developed using the CUDA software framework are exclusive to NVIDIA GPUs and are not compatible with GPUs from other manufacturers.
Note Several companies currently manufacture GPUs; NVIDIA is one of the best-known manufacturers. There is also OpenCL, a framework similar in purpose to CUDA but vendor neutral, so it can be used with non-NVIDIA GPUs as well.
Kernels
Kernels are functions written by developers for execution on GPUs; they are packaged as code containing specific functions and compiled into machine code by the CUDA compiler.
The process begins with the CPU (the host) launching the kernel function. Along with the launch, the host supplies the execution configuration, which determines how many parallel threads are required for the kernel execution. The GPU executes the kernel using a grid of blocks, each block containing a set of threads. The GPU's scheduler dispatches these blocks of threads to streaming multiprocessors. Once the execution is complete, the host retrieves the results from the GPU's memory. This completes one full execution cycle.
Memory Management
During execution of kernel functions on GPUs, the threads that are executing the (same) kernel function may read and write data at various memory locations. These threads have access to local memory, shared memory, and global memory. Local memory is scoped to a single thread. Shared memory is scoped to a single block and is accessible to all the threads within that block. Global memory is accessible to both the GPU and the CPU. All threads in a given block may synchronize their execution for better memory access.
Introduction to CuDF
CuDF is a Python library that offers various data manipulation and transformation
methods and functions exclusively to be executed on NVIDIA GPUs. CuDF leverages
Apache Arrow for columnar format data structures. All CuDF’s functions and methods
are based on libcudf, a CUDA library meant for C++ programming.
CuDF is similar to Pandas, with the exception that it is exclusively designed to run
on NVIDIA GPUs. This GPU-accelerated data manipulation and transformation library
provides more efficient data processing (when compared with Pandas) on large datasets.
Iterating over a data structure is not considered optimal in CuDF. You can still write
code that would loop through a given column in a data frame and perform some kind of
computation; however, it might provide poor performance. A better option is to either
use an inbuilt CuDF method to perform the transformation or convert the CuDF data
frame into either Pandas or Arrow data frame, perform the necessary operation, and
convert the resultant data frame into a CuDF data frame.
When performing joins or group-by operations, Pandas sorts the result by the group keys by default, whereas CuDF does not perform any sorting or ordering of elements. If you require ordered groups, enable sort=True, as sketched below.
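A minimal sketch of the difference (not from the book), using a tiny hypothetical frame:
import cudf

toy = cudf.DataFrame({"key": ["b", "a", "b", "a"], "val": [1, 2, 3, 4]})
print(toy.groupby("key").sum())              # group order is not guaranteed
print(toy.groupby("key", sort=True).sum())   # groups are ordered by key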
Setup
Although the installation page provides more options for installing and working with
CuDF, I recommend the following to be able to get up and running with CuDF.
You need a NVIDIA GPU with compute capability of 6.0 and above. First of all, the
CuDF library is built on top of CUDA, a parallel processing framework exclusive for
NVIDIA GPUs. NVIDIA offers various GPU products and you may be able to view them
on their product catalog.
Second, compute capability is essentially a version number for a GPU's feature set: the higher the number, the newer the architecture, the more complex the operations the GPU can perform, and the more efficiently it can perform them.
Note Think of it like an iPhone, where the more recent versions have more
capabilities when compared with the older iPhones. Compute capability is like
saying a newer iPhone model.
You also need a Linux distribution. Though you may be able to work with Windows via WSL, it is beneficial to have a Linux instance with gcc/g++ 9.0 or above available.
Note If you have a Windows operating system, then you may need to have a GPU
that has compute capability of 7.0 or higher (an even more recent model).
You need appropriate NVIDIA drivers and CUDA drivers, where it may be required
to have a CUDA driver version 11.2 or newer.
You also need the latest version of Python 3 installed on your system.
It may be better to uninstall any existing CUDA toolkits and CUDA packages that are
loaded by default.
I recommend using a virtual environment specifically for executing CuDF pipelines.
A better idea is to set up a conda or miniconda environment in your system and perform
a conda install.
Here is how that looks:
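The exact command shown in the original figure is not reproduced here; as a hedged sketch, the conda pattern documented by RAPIDS looks roughly like the following (check the RAPIDS release selector for the exact package and version pins for your CUDA and Python versions):
conda install -c rapidsai -c conda-forge -c nvidia cudf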
For those who do not have access to one of the abovementioned, it may be beneficial
to look at some cloud-based options. All the major cloud providers offer GPU-based
computing options. While this will be covered in later chapters in detail, let us look at
one such cloud provider example, so we can get our hands on with CuDF. I am choosing
Google Cloud Platform-based Google Colab (arbitrarily).
To get started with Google Colab, simply visit the following portal and sign in with
your Google account (or create an account if you do not possess one yet):
https://colab.google/
And click the New Notebook option:
Google Colab offers a free option and a pro option. For our purposes, the free option
may suffice.
At the time of learning, it may be wise to turn off the generate with AI option. If you
do wish to turn it off, click “Tools” on the menu and choose the “Settings” option.
Under the settings, locate “Colab AI” and check “Hide generative AI features.”
Now, let us change the runtime of our Jupyter notebook from CPU to GPU. Let us
navigate to the main page within Google Colab. From there, select “Runtime” on the
menu and choose “Change runtime type.”
Here is how that may look:
Figure 4-4. Changing the runtime environment of the Google Colab notebook
Choose “Python 3” for the runtime type and choose an option that contains the word
“GPU” for the hardware accelerator. In this case, select the T4 GPU option. Note that the
other GPU options are also NVIDIA offerings.
Now, it is time to install CuDF. Copy and paste the following command in your code
box. This might take some time so it may be better to wait for a little while:
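The command itself appears as a figure in the original. As a hedged sketch, the pip-based route documented by RAPIDS looks like the following, where the package suffix depends on your CUDA version (for example, cudf-cu11 or cudf-cu12):
!pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com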
In some cases, the installation may fail for reasons including conflict between
package versions. If and when installation and setup appears to be challenging, do
consider installing an older and stable version.
import cudf
a = cudf.Series([1, 2, 3, 4, 5])
print(a)
would yield
0 1
1 2
2 3
3 4
4 5
dtype: int64
File IO Operations
CuDF supports most of the commonly utilized data formats in big data and machine
learning. Here is how the syntax may look:
csv_file = "/path-to-folder/path-to-file.csv"
cudf_df = cudf.read_csv(
filepath
,sep=","
,header="infer")
CSV
If you plan to export a given CuDF data frame into a CSV, here's how to do it:
cudf_df.to_csv(
    "/path-to-file.csv",
    sep=",",
    header=True,
    index=True
)
Parquet
Parquet is one of the more efficient data stores, as we have discussed earlier. CuDF
library provides methods to read a Parquet file as well as export a current data frame into
a Parquet file, where GPU acceleration is enabled for these operations.
Here’s how to read an external Parquet file:
import cudf
filename = "/some-file-path/some-file-name.parquet"
df = cudf.read_parquet(filename)
To export a CuDF data frame to Parquet, the syntax may look like
df.to_parquet("/some-file-path/some-file-name.parquet")
JSON
CuDF supports JSON file formats; however, many of the attributes for reading a JSON file
do not appear to be GPU accelerated.
Here is an example:
import cudf
filename = "/filepath/file.json"
cudf_dataframe = cudf.read_json(filename)
cudf_dataframe.to_json("/filepath/filename.json")
Similarly, CuDF supports other data file formats like ORC, Avro, etc.
Basic Operations
Let us look at some basic operations with CuDF, by importing an example dataset. Here
is how the code looks:
import cudf
cudf_df = cudf.read_csv("/content/sample_data/polars_dataset.csv")
print(cudf_df.head())
Figure 4-8. Reading an external CSV file and examining first few rows
Column Filtering
Let us try to select a subset of the data. Here’s how to obtain only certain columns:
name_and_occupation = cudf_df[
["firstname","occupation"]
].head()
print(name_and_occupation)
would yield
firstname occupation
0 Rosa Accountant
1 Reese Electrician
2 Gianna Doctor
3 Abran Accountant
4 Lanita Electrician
Row Filtering
Here’s how to obtain only certain rows:
data_subset = cudf_df.loc[
775:898,
["firstname","birthdate", "state", "occupation"]
].head()
print(data_subset)
data_subset_sort = cudf_df.loc[
775:898,
["firstname","birthdate", "state", "occupation"]
] \
.sort_values(
"state",ascending=False
).head()
print(data_subset_sort)
would yield
Figure 4-10. Sorting a subset of data on a specific column and returning first
few rows
Left Join
A left join keeps every row of the left table as is while looking for potential matches in the right table. We use the merge() method to perform joins on GPU data frames.
import cudf
left_df1 = cudf.DataFrame(
{
'ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'John'],
'DeptID': [101, 102, 103, 104, 103]
}
)
right_df1 = cudf.DataFrame(
{
'DeptID': [101, 102, 103],
'DeptName': ['HR', 'IT', 'Marketing']
}
)
leftjoin = left_df1.merge(
right_df1,
on=["DeptID"],
how="left"
)
print(leftjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
4 4 David 104 <NA>
Outer Join
An outer join takes the union of the join keys from both data frames; rows without a match on the other side are kept and filled with nulls. Here is how that looks:
outerjoin = left_df1.merge(
right_df1,
on=["DeptID"],
how="outer"
)
print(outerjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
4 4 David 104 <NA>
Inner Join
An inner join keeps only the rows whose join keys appear in both data frames, that is, the intersection of the keys.
Let us look at an example:
innerjoin = left_df1.merge(
right_df1,
on=["DeptID"],
how="inner"
)
print(innerjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
Left Semi Join
A left semi join returns only the rows (and only the columns) of the left data frame that have a match in the right data frame:
leftsemi = left_df1.merge(
right_df1,
on=["DeptID"],
how="leftsemi"
)
print(leftsemi)
would yield us
ID Name DeptID
0 1 Alice 101
1 2 Bob 102
2 3 Charlie 103
3 5 John 103
Left Anti Join
A left anti join returns only the rows of the left data frame that have no match in the right data frame:
leftanti = left_df1.merge(
right_df1,
on=["DeptID"],
how="leftanti"
)
print(leftanti)
would yield us
ID Name DeptID
0 4 David 104
Advanced Operations
Let us look at some advanced data wrangling operations using CuDF.
Group-By Function
A group-by function groups the values in a series or a data frame. The grouping can be performed on a single column or on multiple columns. CuDF supports most of the commonly used Pandas group-by functions. Let us look at some of them in detail:
import cudf
groupbydata = {
'Person': ['Alice', 'Bob', 'Charlie', 'David'],
'Dept': ['Marketing', 'Marketing', 'DigitalM', 'DigitalM'],
'Sales': [20, 12, 35, 42]
}
gb_df = cudf.DataFrame(groupbydata)
print(gb_df)
would yield us
Person Dept Sales
0 Alice Marketing 20
1 Bob Marketing 12
2 Charlie DigitalM 35
3 David DigitalM 42
Now let us try to get total sales by department. To accomplish this, we need to group
the dataset by the department and obtain the sum of the sales numbers within that
respective department.
Here is how one may achieve that:
totsale_dept = gb_df \
.groupby('Dept')\
.agg('sum')
print(totsale_dept)
would yield us
Sales
Dept
Marketing 32
DigitalM 77
Transform Function
The transform() function is an important function in the context of data transformation. It performs an aggregation but returns a result with the same length as the original data frame or series, so the output can be assigned back to the data frame as a new column.
Here is how that looks for the same dataset:
dept_sum = gb_df \
.groupby('Dept', sort=True)\
.transform('sum')
print(dept_sum)
would yield us
Sales
0 32
1 32
2 77
3 77
Notice that the result has the same number of rows as the original dataset, which makes it easy to attach back to the data frame:
gb_df['totalDeptSales'] = dept_sum
print(gb_df)
would yield us
Person Dept Sales dept_sales totalDeptSales
0 Alice Marketing 20 32 32
1 Bob Marketing 12 32 32
2 Charlie DigitalM 35 77 77
3 David DigitalM 42 77 77
apply( )
Though CuDF provides many of the commonly used methods and functions in GPU-accelerated mode, there may well be cases where you have to write a custom function, also referred to as a user-defined function. These user-defined functions are often applied to a series or a data frame object, on a given column or a set of columns. This is where the apply() method is beneficial: it lets you run a user-defined function over a given column.
Here is an illustration of the apply() method. Let us define a series object:
ap1 = cudf.Series([1, None, 3, 5, 6, 8])
print(ap1)
would yield us
0 1
1 <NA>
2 3
3 5
4 6
5 8
dtype: int64
We now create a custom method that is required to be applied on the CuDF series object:
def transformData(x):
if x is cudf.NA:
return 42
else:
return x + 100
Now, we are ready to apply this user-defined function in our series object:
ap1.apply(transformData)
would yield us
0 101
1 42
2 103
3 105
4 106
5 108
dtype: int64
The apply() method works on data frame columns as well. Recall our group-by data frame from earlier:
print(gb_df)
Person Dept Sales dept_sales totalDeptSales
0 Alice Marketing 20 32 32
1 Bob Marketing 12 32 32
2 Charlie DigitalM 35 77 77
3 David DigitalM 42 77 77
Now, let us create a user-defined function for this dataset that assigns a rank based on whether a certain sales threshold is crossed:
def salesRank(y):
if y > 25:
return 1
else:
return 2
Now, let us apply this function and add a new column to hold the ranks:
gb_df['SalesRank'] = gb_df['Sales'].apply(salesRank)
print(gb_df)
would yield us
Person Dept Sales dept_sales totalDeptSales SalesRank
0 Alice Marketing 20 32 32 2
1 Bob Marketing 12 32 32 2
2 Charlie DigitalM 35 77 77 1
3 David DigitalM 42 77 77 1
Cross Tabulation
Let us look at some advanced functions in CuDF. First, we will use the "crosstab" function. Crosstab stands for cross tabulation: a table that displays the frequency distribution of two variables, where the rows and columns correspond to the categories of each variable and each cell denotes how often that combination occurs. A cross tabulation view is useful for exploring data, analyzing relationships, or simply looking at the distribution of a variable across various categories.
Let us create a crosstab for two data points in CuDF:
data = cudf.DataFrame(
{
'A': [1, 2, 1, 2, 3, 2, 3, 3, 1, 2],
'B': ['x', 'y', 'x', 'y', 'z', 'z', 'x', 'y', 'x', 'y']
}
)
crosstab = cudf.crosstab(data['A'], data['B'])
print(crosstab)
would yield us
B x y z
A
1 3 0 0
2 0 3 1
3 1 1 1
Another common feature engineering step is binning a continuous variable into categories. Let us create a small data frame with ages:
data = cudf.DataFrame(
{
'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 86, 96, 97]
}
)
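The binning step itself is not shown in the text; a minimal sketch using cudf.cut, with bin edges and labels inferred from the output below (the upper edge of 120 is an assumption), might look like:
data["AgeGroup"] = cudf.cut(
    data["Age"],
    bins=[0, 30, 50, 70, 120],                  # right-closed bins: (0, 30], (30, 50], ...
    labels=["<30", "30-50", "50-70", ">70"],
)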
print(data)
would yield us
Age AgeGroup
0 25 <30
1 30 <30
2 35 30-50
3 40 30-50
4 45 30-50
5 50 30-50
6 55 50-70
7 60 50-70
8 65 50-70
9 86 >70
10 96 >70
11 97 >70
Factorize Function
The factorize function is another key operation in feature engineering. It encodes categorical variables as numerical codes (not continuous values), making them suitable for machine learning algorithms that require numerical inputs. It works by assigning a unique integer code to each category, producing a vector of codes along with the list of unique categories. Here is how that works:
fact_data = cudf.Series(
[
'apple', 'banana', 'apple',
'orange', 'banana', 'apple',
'banana', 'banana', 'orange'
]
)
fact_labels = cudf.factorize(fact_data)
print(fact_labels)
would yield us
(
array([0, 1, 0, 2, 1, 0, 1, 1, 2], dtype=int8),
StringIndex(['apple' 'banana' 'orange'], dtype='object')
)
Window Functions
A window function is another key analytical function, where an operation is performed over a sliding window of rows. A good example is calculating a moving average over certain data. Let us look at an example:
ts_data = cudf.Series(
[10, 20, 30, 40, 50, 55, 60, 65, 70, 72, 74, 76, 78, 80]
)
moving_avg = ts_data.rolling(window=3).mean()
print("Moving average:")
print(moving_avg)
would yield
Moving average:
0 <NA>
1 <NA>
2 20.0
3 30.0
4 40.0
5 48.33333333
6 55.0
7 60.0
8 65.0
9 69.0
10 72.0
11 74.0
12 76.0
13 78.0
dtype: float64
CuDF Pandas
CuDF provides a Pandas accelerator mode that supports the Pandas API's functions and methods. This means cudf.pandas can accelerate existing Pandas code on the GPU without the code having to be rewritten. In cases where an operation cannot leverage the GPU, the accelerator mode automatically falls back to executing the Pandas code on the CPU.
It may be a good idea to recall that CuDF supports most of the commonly utilized
Pandas functions and methods. If possible, it may be a better idea to rewrite the Pandas
code into CuDF code. Moreover, if you are working with a large dataset with varying
complex operations, it may be better to test the cudf.pandas with a subset and compare
the same with Pandas.
cudf.pandas works well with Pandas, NumPy, scipy, TensorFlow, scikit-learn, and
XGBoost packages so one can utilize GPU accelerator mode not just with data analysis
but also with machine learning pipelines.
To enable cudf.pandas, simply add the following line in your Pandas code for the
GPU accelerator mode:
%load_ext cudf.pandas
Note GPU acceleration means carrying out processing tasks on GPUs instead
of CPUs.
%load_ext cudf.pandas
import pandas as pd
df = pd.DataFrame(
{
'A': [1, 2, 3],
'B': [4, 5, 6]
}
)
def transform(df):
    # a transformation consistent with the output shown below:
    # add a derived column and keep only the rows where A is greater than 1
    df['C'] = df['A'] + df['B']
    return df[df['A'] > 1]

result = transform(df)
print(result)
would yield
A B C
1 2 5 7
2 3 6 9
The preceding code is written as ordinary Pandas Python. You may wish to replace the values in the data frame with something large enough that the CPU takes a minute to compute, and rerun it; you will then notice the cudf.pandas accelerator mode in action. You can also enable GPU accelerator mode by executing a Python script in the following manner:
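RAPIDS documents this as a module invocation; a minimal sketch, assuming your code lives in a file named your_script.py:
python -m cudf.pandas your_script.py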
Conclusion
So far, we have gained an appreciation toward data processing in traditional CPU
environments and GPU-accelerated environments. We looked at key concepts of
GPU programming like kernels, threads, and memory management, which form the
foundation of efficient GPU utilization. We also gained hands-on experience with CuDF programming, performing advanced data manipulation operations, feature engineering, and complex joins. This chapter can serve as a foundation for setting up cost-effective and robust machine learning pipelines. We will see more of these in upcoming chapters.
CHAPTER 5
Getting Started with Data Validation using Pydantic and Pandera
–– Our data is current, that is, believed to represent the reality of the
situation.
–– If the model exceeds the accuracy threshold, then the model is good;
otherwise, models and algorithms need to be retrained.
–– Incomplete data
–– Inconsistent data
–– Outdated data
Definition
Data validation is the process of ensuring the data collected or entered meets certain
predefined standards and rules.
The objective is to ensure accuracy, reliability, and integrity of the data coming from
source systems.
Data validation processes usually operate by identifying and preventing errors,
inconsistencies, or inaccuracies.
The data validation process may occur during data entry, data import, or data
processing and involves checking data against predefined rules to ensure it meets the
expected criteria.
Data Accuracy
Data accuracy is the process of verifying that data is free from errors. The goal is to transform and sanitize values so that they are acceptable and correct within the context of the project or business. A lack of accuracy may not break downstream data pipelines, as long as the values are of the data type the downstream processes expect; however, it can have a major impact on the quality of the results and intelligence those processes generate. Hence, it is important to keep the data accurate and current.
Data Uniqueness
Data uniqueness means that a row, or a data point within a row or column, is not a copy of another data point and is not a duplicate in any form. In many cases a value may legitimately repeat because of its nature and context (e.g., categorical bins in numerical format). However, when a data point is the output of a transformation involving more than one input, it can be beneficial to test and validate its uniqueness. Likewise, when a data point is a floating point number with a mandatory number of digits after the decimal, it certainly helps to validate its uniqueness.
Data Completeness
Data completeness is the process of validating that the data required as part of the dataset is actually present. In other words, it means checking for missing or partially available values where a field or data point is deemed mandatory or critical. A good example is the validation of email addresses: an email address is a combination of letters, numbers, and acceptable special characters, followed by the "@" symbol, a domain name, and a top-level domain such as .com. If any of these elements is missing, the data is incomplete. Phone numbers are another good example: a relevant area code followed by the phone digits makes the data complete, and it should total ten digits (in the case of the United States).
Data Range
Data range validation checks that a numerical data point falls between acceptable boundaries, or that a categorical data point falls within a set of predefined categories. For instance, one cannot have an age much above 120 (the oldest verified person lived to 122). Similarly, one cannot have negative values for credit scores, which start at 300.
Data Consistency
Data consistency is the process of verifying and validating that a value aligns with other data within the same dataset. For example, even when a field has well-defined lower and upper bounds, business rules may require the data to fall within a narrower subrange inside those bounds.
Data Format
Data format validation verifies that the data follows the expected encoding and that the data types are in the appropriate format. This is an important part of validation: when data is transported between systems, there is a good chance that one system does not support the native encoding of the other, and so data types are often converted or imposed on fields. A common issue is a date being read as a string. Another is an identification number stored as a string being auto-converted into an integer; it may then lose its leading zeroes, which is the very reason such identifiers are stored as strings in the first place.
Referential Integrity
Referential integrity is the process of verifying and validating the data relationships that
exist between tables. To illustrate this further, consider an example where a pipeline is
writing relational tables from source to a target database system. It may be important to
ensure that the referential integrity is maintained such that when in case of joining these
tables with specific join conditions, they continue to produce similar results.
Data validation can be performed during the development phase, or you can set up
an automated process of checking various data points. In this chapter, we are going to
look at Pydantic library.
Introduction to Pydantic
Pydantic is the most popular Python library for performing data validations. The way it
works is that Pydantic enforces data types and constraints while loading data into the
Pydantic models. Its most effective applications range from validating incoming records and API payloads to settings management and schema generation.
Pydantic provides a relatively simple way to declare data schemas for incoming data and to validate them. Please note that validation happens automatically
when you load the data into a Pydantic model. Pydantic automatically catches errors and
enforces constraints on the incoming data. Before we dive deep into the exciting world of
Pydantic, let us refresh our memory on the concept of type annotations in Python.
Type annotations declare the expected data types of functions, variables, methods, and so on; Python itself does not enforce them. Hence, type annotations alone cannot be used for type checking by default. They do, however, improve code readability and documentation.
Let’s look at a simple example of annotating a variable in Python:
myVariable: int = 0
Now, this Python variable is mutable. And so, let’s look at modifying this value:
myVariable = 1
print(myVariable)
myVariable = 9768.3491
print(myVariable)
The type of myVariable is now effectively float and not int, and its value has changed as well. The two print statements give the following output:
1
9768.3491
Here is a similar example with a function whose return type is annotated as float even though it returns an integer:
def addition(num1: int, num2: int) -> float:
    return num1 + num2
Calling
addition(33,44)
we would get
77
Please note that it is not good programming practice to declare that this addition function returns a float while it actually returns an integer; the idea is simply to show that Python does not enforce the annotated types.
Here is a better example:
def greet(name: str) -> str:
    return "Hello, " + name
greet("your name")
it would return
'Hello, your name'
Pydantic is the most widely used library for data validation in Python. Big tech
like Apple and others along with large governmental organizations use Pydantic
(and contribute too). Pydantic recently released a major upgrade called Pydantic V2.
According to the release notes, Pydantic’s validation and serialization logic has been
written in Rust programming language, so the V2 will be tens of times faster than the
earlier version.
Given Pydantic's ability to generate JSON schemas, it is a natural choice for integrating with other tools that support JSON Schema. Pydantic also offers a strict mode, in which data coercion is disabled.
Note Data coercion is the process of converting data from one type to another,
so the data is usable in a given context.
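A minimal sketch of strict mode (not from the book), assuming Pydantic V2 where ConfigDict is available:
from pydantic import BaseModel, ConfigDict

class StrictPerson(BaseModel):
    model_config = ConfigDict(strict=True)
    age: int

StrictPerson(age=30)     # passes
StrictPerson(age="30")   # raises ValidationError: no str-to-int coercion in strict mode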
Pydantic Models
Models are how you define schemas in Pydantic. A Pydantic model is simply a class that inherits from BaseModel. Within that class, you define the schema for the dataset. Once data is passed through the Pydantic model, Pydantic makes sure that every data field in the source data conforms to the schema defined in the class.
Here is an example. Let’s import a data file that is in comma-separated
values format:
import pandas as pd
from pydantic import BaseModel

data = pd.read_csv("/home/parallels/Downloads/MOCK_DATA.csv")
Let’s check the output to see if the data is loaded for our testing purposes:
print(data.head(2))
will yield
name age
0 Ermentrude Azema 22
1 Joey Morteo 56
Now let’s define the schema for the Pydantic model class:
class TestPerson(BaseModel):
name: str
age: int
Defining the class by itself produces no output; validation happens when data is loaded into the model.
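A minimal sketch of loading the rows into the model (not from the book): each row that conforms to the schema yields a validated TestPerson instance, and any row that does not raises a ValidationError.
for record in data.to_dict(orient="records"):
    person = TestPerson(**record)   # validation happens here
    print(person)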
Nested Models
Pydantic helps define complex hierarchical data structures using Pydantic models as
types. Here is an example.
Let’s import the required libraries and data:
with open("/home/parallels/Documents/data2.json") as f:
data = json.load(f)
print(data[0])
{
'first_name': 'Angil',
'last_name': 'Boshers',
'interests': {
'first': 'sports',
'second': 'mountains'
}
}
class Interests(pydantic.BaseModel):
first: str
second: str
class Stu(pydantic.BaseModel):
first_name: str
last_name: str
interests: List[Interests]
Both the parent and the immediate child get their own Pydantic model class. The parent model class references the child model class through a List type annotation.
Let’s create the model object and print the model output:
for i in data:
output = Stu(**i)
print(output)
would yield
Fields
The field function in Pydantic is used to specify a default value for a given field, in case it
remains missing from source documents.
Here is an example to illustrate the field function. To begin with, let’s import the
libraries and define the Pydantic model:
import pydantic
class Person(pydantic.BaseModel):
name: str = pydantic.Field(default='John Doe')
age: int = pydantic.Field(default=999)
As you can see, the default value is mentioned as John Doe. Now let’s define the
Pydantic model object and observe the output:
newPerson = Person()
print(newPerson)
would yield
name='John Doe' age=999
JSON Schemas
Pydantic allows us to create JSON schemas from the Pydantic model classes we create. As you may recall, a JSON schema is itself a JSON data structure; it describes what a conforming JSON document looks like.
Let us revisit the earlier nested models example. We import the necessary libraries,
create the Pydantic model class, create the Pydantic model object, and print the
JSON schema:
with open("/home/parallels/Documents/data2.json") as f:
data = json.load(f)
print(data[0])
class Interests(pydantic.BaseModel):
first: str
second: str
class Stu(pydantic.BaseModel):
first_name: str
last_name: str
interests: List[Interests]
print(Stu.schema_json(indent=3))
# Pydantic V2 equivalent: print(json.dumps(Stu.model_json_schema(), indent=3))
would yield
{
"title": "Stu",
"type": "object",
"properties": {
"first_name": {
"title": "First Name",
"type": "string"
},
"last_name": {
"title": "Last Name",
"type": "string"
},
"interests": {
"title": "Interests",
"type": "array",
"items": {
"$ref": "#/definitions/Interests"
}
}
},
"required": [
"first_name",
"last_name",
"interests"
],
"definitions": {
"Interests": {
"title": "Interests",
"type": "object",
"properties": {
"first": {
"title": "First",
"type": "string"
},
"second": {
"title": "Second",
"type": "string"
}
},
"required": [
"first",
"second"
]
}
}
}
Constrained Types
A constrained type is a method offered by Pydantic that restricts the values each field can take in the Pydantic model class. Pydantic offers the following types:
–– For int, there is the conint method; conint can be used to set upper and lower bounds for an integer field.
–– For str, there is the constr method; constr can enforce uppercase or
lowercase for a string, wherever required.
–– For floats, there is a confloat method; once again, it helps to set lower
and upper bounds for floats, decimals, etc.
from pydantic import BaseModel, conint, confloat

class Person(BaseModel):
name: str
age: conint(ge=30)
blood_sugar_level: confloat(gt=4, lt=10)
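A minimal usage sketch (the values below are hypothetical, not from the book): a record that satisfies the constraints validates silently, while a violating one raises a ValidationError.
ok = Person(name="Jane", age=45, blood_sugar_level=5.6)
print(ok)
Person(name="Joe", age=25, blood_sugar_level=5.6)   # fails: age must be >= 30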
Validators in Pydantic
Pydantic provides a way to validate field data using validator methods. You can even bind a validator to a type instead of a model. It is worth noting that validators are not executed for default field values. There are several types of validators:
–– After validators run after Pydantic's internal parsing.
A field validator enables validation of a specific field of the Pydantic model. Let us
look at an example on the same concept.
Here, let us import necessary libraries and the data file (in this example, a flat file):
import csv
from pydantic import BaseModel, field_validator

data = csv.DictReader(open("/home/parallels/Downloads/MOCK_DATA.csv"))
The dataset is about persons and contains two variables, name and age. Name is a
string and age is an integer. The idea is to reject any record where the person is less than 21 years old. Now, let us define the Pydantic model class with a validator:
class Person(BaseModel):
name: str
age: int
@field_validator('age')
def agelimit(cls, value: int) -> int:
if value < 21:
raise ValueError("Less than 21 years old")
return value
Now, iterate over the records and load each row into the model; as soon as a record with an age below 21 is encountered, the validator raises an error:
for person in data:
    pydantic_model = Person(**person)
This produces a traceback that ends at the failing line:
File ~/anaconda3/lib/python3.11/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
File ~/Documents/pyd_model.py:25
    pydantic_model = Person(**person)
Introduction to Pandera
Pandera is an MIT-licensed, open source Python library that performs data validation in
the context of data cleaning and preprocessing. Pandera works well with the Pandas DataFrame data structure. Pandera enables declarative schema definition: you can define the
expected data structure including data types, constraints, and other properties. Once a
schema is defined, Pandera can validate whether the incoming Pandas data frame object
adheres to the user-defined schema (including data types and constraints), checking
whether the fields are present among other things. Furthermore, Pandera supports
custom functions and user-defined validation checks, which may be very domain
specific for the incoming dataset.
Pandera also supports statistical validation like hypothesis testing. Note that
statistical validation is very different from data validation; statistical validation is where
one hypothesizes that data should have a certain property or meet a condition, which is
then validated by applying various statistical methods and formulas toward samples of
the population data. Pandera also supports lazy validation: when a data frame is validated against a long list of data validation rules, Pandera can run all the defined validations and report every failure together, instead of raising an error at the first one.
Here is a simple example of data validation using Pandera. I have deliberately not
used aliases in this code, as it may be confusing for someone who is looking at these for
the first time.
Let’s import the libraries:
import pandas
import pandera
panderaSchema = pandera.DataFrameSchema(
{
"id": pandera.Column(pandera.Int),
"name": pandera.Column(pandera.String)
}
)
data = {
"id": [123,2435,234],
"name": ["john","doe", None]
}
pandas_df = pandas.DataFrame(data)
Let us now validate the “pandas” data frame against the “pandera” defined schema:
try:
dataValidation = panderaSchema.validate(pandas_df)
print(dataValidation)
except pandera.errors.SchemaError as e:
print(e.failure_cases) # dataframe of schema errors
print("\n\ninvalid dataframe")
print(e.data) # invalid dataframe
index failure_case
0 2 None
invalid dataframe
id name
0 123 john
1 2435 doe
2 234 None
Pandera provides a class-based API, similar to Pydantic. This means in Pandera, you
can define a schema model like a Pydantic model class.
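A minimal sketch of that class-based API, assuming pandera 0.14 or newer where the base class is called DataFrameModel (older releases call it SchemaModel):
import pandera as pa
from pandera.typing import Series

class PersonSchema(pa.DataFrameModel):
    id: Series[int]
    name: Series[str] = pa.Field(nullable=False)

# PersonSchema.validate(pandas_df) then behaves like the DataFrameSchema example above.
The examples that follow, however, continue with the object-based DataFrameSchema API.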
import pandas as pd
from pandera import DataFrameSchema, Column, Check, errors

data = pd.DataFrame({"series": [1.0, 2.0, 4.0, 7.0, None]})

StrictlyNotNullSchema = DataFrameSchema(
    {
        "series": Column(float, Check(lambda x: x > 0))
    }
)
try:
strictNotNull = StrictlyNotNullSchema.validate(data)
print(strictNotNull)
except errors.SchemaError as e:
print(e.failure_cases) # dataframe of schema errors
print("\n\ninvalid dataframe")
print(e.data) # invalid dataframe
would yield
index failure_case
0 4 NaN
invalid dataframe
series
0 1.0
1 2.0
2 4.0
3 7.0
4 NaN
As you can see, the schema enforces itself strictly onto the data frame. To accept null
or missing values, one can specify an argument like nullable=True.
Here is how that looks:
NullAllowedSchema = DataFrameSchema(
{
"series": Column(float, Check(lambda x: x > 0),
nullable=True)
}
)
try:
nullAllowed = NullAllowedSchema.validate(data)
print(nullAllowed)
except errors.SchemaError as e:
print(e.failure_cases) # dataframe of schema errors
print("\n\ninvalid dataframe")
print(e.data) # invalid dataframe
it would yield
series
0 1.0
1 2.0
2 4.0
3 7.0
4 NaN
import pandas as pd
from pandera import DataFrameSchema, Column
df = pd.DataFrame({"values": [1,4,5,7,9]})
Now, let us define a schema where the column must be a string, and enable coerce=True so that incoming values are converted to that type:
schema = DataFrameSchema(
{
"values": Column(str, coerce=True)
})
dataValidated = schema.validate(df)
We perform the validation and it looks good. Now let’s examine the data type of the
validated data here:
print(dataValidated.iloc[0].map(type))
would yield
values <class 'str'>
Name: 0, dtype: object
And so, the integer value has been coerced into a string.
Checks in Pandera
Checks in Pandera are similar to constrained types and field conditions in Pydantic. Checks specify properties of objects such as data frames and series, which Pandera then validates. For readability, you can assign a check to a variable up front and reference that variable in the DataFrameSchema object.
Here is how that would look. In this example, let us check whether the given numbers are odd:
data = pd.DataFrame({"series": list(range(10))})

odd = Check.isin([1,3,5,7,9])
schema = DataFrameSchema({
    "series": Column(int, odd)
})
try:
schema.validate(data)
except errors.SchemaError as e:
print("\n\ndataframe of schema errors")
print(e.failure_cases) # dataframe of schema errors
print("\n\ninvalid dataframe")
print(e.data) # invalid dataframe
would yield
invalid dataframe
series
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
And so, both the index and element are given in the output.
Pandera, by default, performs these data validations by way of vectorized operations.
You can also specify whether you need element-wise check operations. For our
previous example, here is how that looks:
odd = Check(
lambda x: x % 2 != 0,
element_wise=True
)
data = pd.DataFrame(
{
"series": [1,3,5,7,9]
}
)
schema = DataFrameSchema(
{
"series": Column(int, odd)
}
)
try:
    print(schema.validate(data))
except errors.SchemaError as e:
    print("\n\ndataframe of schema errors")
    print(e.failure_cases)  # dataframe of schema errors
    print("\n\ninvalid dataframe")
    print(e.data)  # invalid dataframe
would yield
series
0 1
1 3
2 5
3 7
4 9
Pandera also supports statistical hypothesis testing on columns. Consider the following dataset of hours recorded per operating system:

   os       hours
0  ios      4.5
1  Android  5.1
2  ios      5.1
3  Android  5.1
4  ios      5.7
import pandas as pd
from pandera import DataFrameSchema, Column, Check, Hypothesis, errors

df = pd.read_csv("/home/parallels/data.csv")
df[['hours','os']] = df['float\tname'].str.split('\t', expand=True)
data = df[['os','hours']]
Please note that the data is randomly generated; I needed to split the single tab-separated column into two columns and then rearrange them.
My claim is that iOS has a higher score than Android. Once again, these are randomly generated floating-point numbers and do not represent anything real. Here is how the claim is expressed in statistical language:
Null hypothesis: iOS scores are not greater than Android scores (there is no significant difference between the two groups).
Alternate hypothesis: iOS scores are significantly greater than Android scores.
It is essential to note that failing to reject the null hypothesis does not mean the opposite of our claim is true. It simply means that we do not have enough evidence to conclude that our claim is true.
schema = DataFrameSchema(
{
"hours" : Column(
float, [
Hypothesis.two_sample_ttest(
sample1="ios",
sample2="Android",
groupby="os",
relationship="greater_than",
alpha=0.02,
equal_var=True
)
], coerce=True
),
"os" : Column(str)
})
try:
    schema.validate(data)
except errors.SchemaError as exc:
    print(exc)
We get a SchemaError indicating that the hypothesis check failed. As we discussed earlier, there is not enough evidence to conclude that our claim is correct at the chosen significance level. The parameter alpha (here 0.02) is the significance level of the test, which corresponds to a 98% confidence level.
Lazy Validation
So far we have seen that when we call the validate() method, a SchemaError is raised as soon as one of the assumptions specified in the schema is falsified, and the program stops executing the remaining checks. In many cases, it is more useful to see all the errors raised across the various columns and checks. Lazy validation is the option that enables just that. We have seen lazy evaluation of code in an earlier chapter: lazy evaluation is when code is executed only when it is actually needed.
Let’s look at an example.
Let us validate a simple table. First, import the libraries, define the individual checks (their exact definitions are not shown here, so the ones below are reasonable assumptions), and then define the Pandera schema:

import pandas as pd
import json
from pandera import (Column,
                     Check,
                     DataFrameSchema,
                     DateTime,
                     errors)

# assumed definitions of the individual checks referenced in the schema
check_ge_3 = Check(lambda s: s >= 3)     # at least 3 credits
check_le_15 = Check(lambda s: s <= 15)   # at most 15 credits
check_gpa_0 = Check(lambda s: s >= 0.0)  # GPA lower bound
check_gpa_4 = Check(lambda s: s <= 4.0)  # GPA upper bound

# schema
schema = DataFrameSchema(
    columns={
        "StudentName": Column(str, Check.equal_to("John")),
        "CreditsTaken": Column(int, [check_ge_3, check_le_15]),
        "GPA": Column(float, [check_gpa_0, check_gpa_4]),
        "Date": Column(DateTime),
    },
    strict=True
)
df = pd.DataFrame(
{
"StudentName" : ["JohnDoe", "JaneDoe","DoeJohn"],
"CreditsTaken" : [9,12,18],
"GPA" : [3.7, 4.01, 3.5],
"Date" : None,
}
)
try:
    schema.validate(df, lazy=True)
except errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=2))
    print("dataframe of schema errors")
    print(e.failure_cases)  # dataframe of schema errors
    print("invalid dataframe")
    print(e.data)  # invalid dataframe
would yield
{
"DATA": {
"DATAFRAME_CHECK": [
{
"schema": null,
"column": "StudentName",
"check": "equal_to(John)",
"error": "Column 'StudentName' failed element-wise validator number
0: equal_to(John) failure cases: JohnDoe, JaneDoe, DoeJohn"
},
{
"schema": null,
"column": "CreditsTaken",
"check": "<lambda>",
Please note that in the preceding example, SchemaErrors (note the plural) is raised from Pandera's errors module. The exhaustive output of the try..except block is also a good way to see detailed validation results.
Pandera Decorators
Pandera can be integrated into your current production pipelines through the use of
decorators.
Here is a quick refresher on how a Python decorator wraps another function:

def decoratortest(func):
    def anotherfunc():
        func()
        print("the decorator function is now executing this line")
    return anotherfunc

@decoratortest
def hello_world():
    print("this is the main hello world function")

hello_world()
import pandas as pd
data = pd.DataFrame({
"a": [1,4,7,9,5],
"b": [12,13,15,16,19],
})
def addition(dataframe):
    dataframe["c"] = dataframe["a"] + dataframe["b"]
    return dataframe

final_df = addition(data)
print(final_df)
It is a straightforward example where we add two columns and store their result in a
third column. Now let’s look at Pandera data validators along with decorator integration.
First off, we import the necessary functions, data, and methods:
import pandas as pd
from pandera import DataFrameSchema, Column, Check, check_input
data = pd.DataFrame({
"a": [1,4,7,9,5],
"b": [12,13,15,16,19],
})
Now we define the Pandera data validation through DataFrameSchema. Please note that the checks are defined separately and referenced within the DataFrameSchema for readability; you could also define them inline. The definitions of check_a and check_b are not shown in the text, so the ones below are reasonable assumptions:

check_a = Check(lambda s: s < 10)   # assumed: values in "a" stay below 10
check_b = Check(lambda s: s < 20)   # assumed: values in "b" stay below 20

validateSchema = DataFrameSchema({
    "a": Column(int, check_a),
    "b": Column(int, check_b),
})
Now, let's utilize the Pandera decorators. We are going to use @check_input to validate the input columns. The DataFrameSchema is passed as a parameter to the @check_input decorator:

@check_input(validateSchema)
def addition(dataframe):
    dataframe["c"] = dataframe["a"] + dataframe["b"]
    return dataframe
final_df = addition(data)
print(final_df)
a b c
0 1 12 13
1 4 13 17
2 7 15 22
3 9 16 25
4 5 19 24
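Pandera also ships a check_output decorator that validates the data frame a function returns. Here is a minimal sketch of my own (the outputSchema below is an assumption, not taken from the text):

import pandas as pd
from pandera import DataFrameSchema, Column, Check, check_output

data = pd.DataFrame({
    "a": [1, 4, 7, 9, 5],
    "b": [12, 13, 15, 16, 19],
})

# schema for the result of addition(): column "c" holds the sums,
# which for this particular data are all at least 13
outputSchema = DataFrameSchema({
    "c": Column(int, Check(lambda s: s >= 13)),
})

@check_output(outputSchema)
def addition(dataframe):
    dataframe["c"] = dataframe["a"] + dataframe["b"]
    return dataframe

print(addition(data))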
Now, let's say we change the validation condition of check_b so that the values in column "b" no longer satisfy it; in that case, the @check_input decorator raises a SchemaError before addition() even runs.
Conclusion
Data validation remains one of the key aspects of engineering data pipelines. We discussed various principles of data validation and the need for good data. We then looked at Pydantic, a data validation library that focuses on making Python objects conform to a defined data model. We then discussed Pandera, another data validation library that focuses on applying validation checks to data frame–like objects; it is very effective for data analysis and data preprocessing tasks. In the next chapter, we will discuss another data validation library that focuses heavily on data quality assurance and data validation in data pipelines.
CHAPTER 6
Data Validation using Great Expectations
Data Context
The data context is the starting point for the Great Expectations API; this component contains the settings, metadata, configuration parameters, and validation output for a given Great Expectations project. The metadata is usually stored in objects called stores. The Great Expectations API provides a way to configure and interact with these components using Python, through the Great Expectations public API. Great Expectations also automatically generates data documentation that records where the data comes from and the validations the dataset has been subjected to.
Data Sources
The data source component tells Great Expectations how to connect to a given dataset. The dataset may be a flat file, a JSON document, a SQL database, or a Pandas data frame located locally, on a remote machine, or in a cloud environment. The data source component contains one or more data assets. Data assets are collections of records from the source data. Data assets are accessed in batches, where a given batch contains a unique subset of records. Great Expectations internally submits a batch request to access the data and perform validation and testing. It is worth noting that a data asset in Great Expectations may contain data from more than one file or database table, whereas a given batch contains data from exactly one data asset.
Expectations
Expectations are the heart of this chapter. As mentioned earlier, an expectation is an assertion that is verified against the data. Expectations provide a declarative way of describing the expected properties of a given dataset, and they apply to the data directly. When Great Expectations evaluates an expectation, it returns a report stating whether the expectation was met. Great Expectations already supplies an exhaustive list of predefined expectations; however, you can also write custom expectations. Here is an example of an expectation:

my_custom_dataframe.expect_column_values_to_be_in_set(
    "Hobbies",
    ["mountain hiking", "city surfing", "long drives"]
)
Checkpoints
A checkpoint is how the Great Expectations framework validates data in production pipelines. Checkpoints provide a way to group together one or more batches of data and execute expectations against them. A checkpoint takes one or more batches and provides them to the validator module. The validator generates results, which are then passed to a list of validation actions. The validation results tell you how and where the data stands with respect to the defined expectations. The final step is the validation actions, which can be configured to your needs; actions include sending an email, updating data documentation, and so on.
source gework/bin/activate
If you are using a Windows PC, activate the virtual environment using the following command:
gework\Scripts\activate
We use pip to install Python packages. Let us update pip to its latest version before installing Great Expectations:
python -m pip install --upgrade pip
Then install Great Expectations itself:
pip install great_expectations
If you need to work with a SQL database as your data source, then you need the SQLAlchemy package as well. To install Great Expectations along with the optional SQL dependencies, use this command:
pip install great_expectations sqlalchemy
There are two ways of getting started with a new Great Expectations project. One
is the legacy way of interacting with the command line interface, and the other way is
through the modern fluent data source method.
great_expectations init
Great Expectations will create a project structure for your project. To proceed, type "y"; if not, type "n" and restart the process.
Once the project structure has been created, you will see the following screen:
Here is how the options look if we choose relational databases instead of files on a
file system:
Figure 6-4. Choosing the appropriate option for a new data source in Great
Expectations
This would then open up a Jupyter notebook, where you can complete the setup of
your new data source.
import great_expectations as ge
import pandas as pd
import random
We are creating a dataset from scratch, so we use the random library along with the following code:

gender = ["Male", "Female", "Others"]
random.seed(10)
blood_sugar_levels = [random.randint(70, 130) for i in range(20)]
gender = [random.choice(gender) for i in range(20)]

# assemble the Pandas data frame (the blood sugar column name is assumed here)
df = pd.DataFrame({"Gender": gender, "BloodSugarLevel": blood_sugar_levels})

Now, convert the Pandas data frame into a Great Expectations data frame:

df_ge = ge.from_pandas(df)

Great. Now, let us create the expectation. We expect that the dataset contains only the values "Male" and "Female". Here is how that looks:

gender_exp = df_ge.expect_column_values_to_be_in_set(
    "Gender",
    ["Male", "Female"]
)
print(gender_exp)

In the preceding code, we specify both the column name and the set of values that we expect to see. Here is the output:
{
"success": false,
"expectation_config": {
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "Gender",
"value_set": [
"Male",
"Female"
],
"result_format": "BASIC"
},
"meta": {}
},
"result": {
"element_count": 20,
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_count": 5,
"unexpected_percent": 25.0,
"unexpected_percent_total": 25.0,
"unexpected_percent_nonmissing": 25.0,
"partial_unexpected_list": [
"Others",
"Others",
"Others",
"Others",
"Others"
]
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
You can also explore the dataset’s various categories by issuing the following code:
print(df_ge.Gender.value_counts())
Gender
Female 10
Male 5
Others 5
Name: count, dtype: int64
Now, if we reprogram our expectation to include the “Others” category, we would get
the following.
Here is the code:
gender_exp = df_ge.expect_column_values_to_be_in_set(
"Gender",
["Male", "Female","Others"]
)
print(gender_exp)
{
"success": true,
"expectation_config": {
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "Gender",
"value_set": [
"Male",
"Female",
"Others"
],
"result_format": "BASIC"
},
"meta": {}
},
"result": {
"element_count": 20,
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"unexpected_percent_total": 0.0,
"unexpected_percent_nonmissing": 0.0,
"partial_unexpected_list": []
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
We have already seen how to install Great Expectations, prior to using the CLI, in the "Setup and Installation" section. Upon successful installation of Great Expectations, it is time to set up the data context. The data context contains configurations for expectations, expectation stores, checkpoints, and data docs. Initializing a data context is as straightforward as importing the library and running the get_context() method:
import great_expectations as ge
context = ge.get_context()
print(context)
{
"anonymous_usage_statistics": {
"explicit_url": false,
"explicit_id": true,
"data_context_id": "14552788-844b-4270-86d5-9ab238c18e75",
"usage_statistics_url": "https://stats.greatexpectations.io/great_
expectations/v1/usage_statistics",
"enabled": true
},
"checkpoint_store_name": "checkpoint_store",
"config_version": 3,
"data_docs_sites": {
"local_site": {
"class_name": "SiteBuilder",
"show_how_to_buttons": true,
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": "/tmp/tmp_oj70d5v"
},
"site_index_builder": {
"class_name": "DefaultSiteIndexBuilder"
}
}
},
"datasources": {},
"evaluation_parameter_store_name": "evaluation_parameter_store",
"expectations_store_name": "expectations_store",
"fluent_datasources": {},
"include_rendered_content": {
"expectation_validation_result": false,
"expectation_suite": false,
"globally": false
},
"profiler_store_name": "profiler_store",
"stores": {
"expectations_store": {
"class_name": "ExpectationsStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"validations_store": {
"class_name": "ValidationsStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"evaluation_parameter_store": {
"class_name": "EvaluationParameterStore"
},
"checkpoint_store": {
"class_name": "CheckpointStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"profiler_store": {
"class_name": "ProfilerStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
}
},
"validations_store_name": "validations_store"
}
import great_expectations as ge
context = ge.get_context()
validator = context.sources.pandas_default.read_csv(
"/home/parallels/Downloads/mockaroo_health_data.csv"
)
In the preceding code, we are reading a data asset directly into a validator. Alternatively, we can set things up step by step. First, create a data source:

datasource_name = "test2"
datasource = context.sources.add_pandas(datasource_name)
asset_name = "asset1"
path_to_data = "/home/parallels/Downloads/mockaroo_health_data.csv"

# create the CSV data asset on the data source; add_csv_asset is the fluent-API
# method, though its exact keyword names may differ between versions
asset = datasource.add_csv_asset(name=asset_name, filepath_or_buffer=path_to_data)

batch_request = asset.build_batch_request()

context.add_or_update_expectation_suite("my_expectation_suite")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite",
)
print(validator.head())
We will get
Now, let us run the validator to create and run an expectation. Let us look at the
missing values for one of the columns, say “age.” The expectation we use for this
operation is
expect_column_values_to_not_be_null
expectation_notnull = validator.expect_column_values_to_not_be_null(
column="age"
)
print(expectation_notnull)
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null
]
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
By running the validator, the expectation has been recorded in the expectation suite.
Now let us run the expectation related to missing values for another column called
“copay_paid.”
Here is how the code looks:
expectation_notnull = validator.expect_column_values_to_not_be_null(
column="copay_paid"
)
print(expectation_notnull)
null,
null,
null
]
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
As you can see, there are about 126 records where the copay_paid value is missing.
So far, we have run two expectations. While the expectations are saved in the expectation suite on the validator object, validators do not persist outside the current Python session. To reuse the current expectations, it is necessary to save them. Here is how that looks:

validator.save_expectation_suite(
    "/home/parallels/Documents/my_expectation_suite.json",
    discard_failed_expectations=False
)

Upon running the program, Great Expectations creates an expectation suite in the form of a JSON document. For our current program, here is how that JSON document looks:
{
"data_asset_type": null,
"expectation_suite_name": "my_expectation_suite",
"expectations": [
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "age"
},
"meta": {}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "copay_paid"
},
"meta": {}
}
],
"ge_cloud_id": null,
"meta": {
"great_expectations_version": "0.18.4"
}
}
Creating a Checkpoint
A checkpoint is a means of validating data in production pipelines using Great
Expectations. Checkpoints allow you to link an expectation suite and data; the idea is to
validate expectations against the data.
Here is how to create a checkpoint:
checkpoint = context.add_or_update_checkpoint(
name="my_checkpoint",
validations=[
{
"batch_request": batch_request,
"expectation_suite_name": "my_expectation_suite",
},
],
)
You can return the checkpoint results by using the following code:
checkpoint_result = checkpoint.run()
print(checkpoint_result)
"batch_markers": {
"ge_load_time": "20231215T081018.350419Z",
"pandas_data_fingerprint": "9c19ba3fc9d218a2c39f81794f4f37fe"
},
"active_batch_definition": {
"datasource_name": "test2",
"data_connector_name": "fluent",
"data_asset_name": "asset1",
"batch_identifiers": {}
},
"validation_time": "20231215T081018.363378Z",
"checkpoint_name": "my_checkpoint",
"validation_id": null,
"checkpoint_id": null
}
},
"actions_results": {
"store_validation_result": {
"class": "StoreValidationResultAction"
},
"store_evaluation_params": {
"class": "StoreEvaluationParametersAction"
},
"update_data_docs": {
"local_site": "file:///tmp/tmpy1si_l4d/validations/my_
expectation_suite/__none__/20231215T081018.348374Z/test2-
asset1.html",
"class": "UpdateDataDocsAction"
}
}
}
},
"checkpoint_config": {
"config_version": 1.0,
"notify_on": null,
"default_validation_id": null,
"evaluation_parameters": {},
"expectation_suite_ge_cloud_id": null,
"runtime_configuration": {},
"module_name": "great_expectations.checkpoint",
"template_name": null,
"slack_webhook": null,
"class_name": "Checkpoint",
"validations": [
{
"name": null,
"expectation_suite_name": "my_expectation_suite",
"expectation_suite_ge_cloud_id": null,
"id": null,
"batch_request": {
"datasource_name": "test2",
"data_asset_name": "asset1",
"options": {},
"batch_slice": null
}
}
],
"action_list": [
{
"name": "store_validation_result",
"action": {
"class_name": "StoreValidationResultAction"
}
},
{
"name": "store_evaluation_params",
"action": {
"class_name": "StoreEvaluationParametersAction"
}
},
{
"name": "update_data_docs",
"action": {
"class_name": "UpdateDataDocsAction"
}
}
],
"site_names": null,
"notify_with": null,
"batch_request": {},
"profilers": [],
"run_name_template": null,
"name": "my_checkpoint",
"expectation_suite_name": null,
"ge_cloud_id": null
},
"success": true
}
You can also create a checkpoint from a predefined list of validations; here, more_validations is assumed to be a list of dictionaries, each pairing a batch request with an expectation suite name:

checkpoint = context.add_or_update_checkpoint(
    name="my_test_checkpoint", validations=more_validations
)
Data Documentation
Data documentation, or simply data docs, provides interactive documentation of the data validation results that Great Expectations has processed.
Here is how we can generate the data documentation in Great Expectations:
context.build_data_docs()
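The data context also provides an open_data_docs() method that opens the generated site in your default browser; a one-line sketch:

context.open_data_docs()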
Great Expectations will generate a full web-based documentation along with other
metadata. You can also edit the documentation. Here is how the documentation may
look for you.
When you click the expectation suite, you will be able to see more metadata. Here is
how that may look for you:
Expectation Store
So far we have seen how to connect to data assets, build batch requests, create validators, run expectations on validators, and save the expectations to a JSON document. Now, let us see how to manage the expectations we have built in an expectation store. An expectation store is a connector that helps store and retrieve collections of expectations about data.
The process is relatively simple: create a folder within your Great Expectations project folder, move your JSON documents into that folder, and update your configuration file so that the expectations store points to it:

mkdir shared_expectations
mv path-to/my_expectation_suite.json destination-path/shared_expectations/

Then update the base_directory of the expectations store in the configuration file. Here is a screenshot of how that looks:
Conclusion
We have continued our discussion of data validation. Great Expectations is one of the more innovative data validation libraries and comes with a large set of predefined expectations. The number of predefined expectations continues to grow, which also makes it a good resource for exploring various ways of validating your data. You can also manage collections of predefined expectations in an expectation store. The Great Expectations project is evolving rapidly; during the process of authoring this book, I have worked through two different versions, and by the time you read this, a major 1.0 release will likely be available.
CHAPTER 7
Introduction to Concurrency Programming and Dask
Introduction
We have discussed earlier that processing large datasets effectively and efficiently is important. As datasets continue to grow over time, traditional single-threaded processing becomes a limitation. Python, by default, has a global interpreter lock (GIL) that allows only one thread to hold the interpreter at a given point in time. While this design ensures the integrity of the computations submitted, it would be much more effective to use the many cores of the multi-core CPUs that are common today. In this chapter, you will learn the concepts of parallel programming and concurrent programming, and Dask, a Python library that supports distributed processing and works around the global interpreter lock by using multiple processes. Dask also supports the various data processing libraries that we have seen in earlier chapters.
By the end of this chapter, you will learn
–– Dask
History
Parallel processing gained momentum when CPUs started to contain more than one processing unit within the microprocessor architecture. Although the electrical signals in a microprocessor travel very fast, there are physical upper limits on how quickly a single processor can operate. Such limitations, along with the need to minimize heat generation, among other things, paved the way for multi-core processors. Today, we have several processing units (we call them cores) in the microprocessors embedded in our laptops, phones, computers, etc.
A thread must acquire the GIL before it executes a Python byte code instruction. Once that instruction completes, the GIL is released so that the next Python byte code can acquire it for execution. While this ensures thread-safe computing, it also prevents parallel execution of Python byte code through multithreading. However, there are other ways to achieve parallel processing, and some file input/output operations do not fall under the GIL, which is good news.
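To make the effect of the GIL concrete, here is a small sketch of my own (not from the text) that runs the same CPU-bound function first with threads and then with processes; on a multi-core machine the process pool typically finishes much faster, because each process has its own interpreter and GIL:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    # pure-Python, CPU-bound loop; a thread holds the GIL while running it
    while n > 0:
        n -= 1

if __name__ == "__main__":
    work = [20_000_000] * 4

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(count_down, work))
    print("threads:  ", round(time.perf_counter() - start, 2), "seconds")

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(count_down, work))
    print("processes:", round(time.perf_counter() - start, 2), "seconds")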
from multiprocessing import Pool, cpu_count

numberOfCores = cpu_count()

would return the number of cores on your computer's microprocessor.
Here is a simple example of parallel processing using the "multiprocessing" library:

def cube(a):
    return a * a * a

inputs = [5, 12, 17, 31, 43, 59, 67]

# distribute the inputs across a pool of worker processes
# (on Windows/macOS, wrap this in an `if __name__ == "__main__":` block)
with Pool(processes=numberOfCores) as pool:
    cubevalue = pool.map(cube, inputs)

print(cubevalue)

The preceding code will display the cube values for the inputs supplied:

[125, 1728, 4913, 29791, 79507, 205379, 300763]

In the code, note the Pool object and its processes parameter, which controls how many worker processes are created.
Concurrent Processing
Concurrent processing is similar to the idea of parallel processing as far as dealing with simultaneous tasks is concerned. However, the method of execution differs from what we saw earlier. Concurrent processing refers to two or more tasks being in progress during the same period of time, even though they may not execute at the exact same instant.
For instance, within a single core, one task may start and perform operations for a bit of time. Let's say that this first task then needs to wait for some other process to complete another operation. While it waits, one or more other tasks might start and may even finish, depending upon their compute, memory, and storage requirements and what is available to them.
Rather than waiting for the full completion of a task, concurrency enables dynamic sharing and allocation of memory, compute, and storage during runtime. It is a way of using resources seamlessly.
Note Imagine you are a chef at a restaurant. There are five orders of food that
arrived at the exact same time. You know that not all items in the menu take equal
amounts of time. You have one stove and one microwave for five orders. For the
sake of simplicity, assume you have one deep fryer, grill, or something like that.
Now let us illustrate concurrency and parallelism.
Concurrency
Imagine starting off a slow-cooking dish. While that is cooking on simmer,
prep vegetables for another dish. As the vegetables are chopped and prepped,
periodically check the slow-cooking dish, and stir maybe a bit. Transfer the slow-
cooking dish to the oven while placing the prepped vegetables on the stove. As the
vegetables start cooking, prepare eggs for another dish. Check the vegetables and
the oven periodically. You are switching between tasks, making progress on three
orders even though you are working alone. This is concurrency.
Parallelism
Imagine five chefs for five orders. Each chef takes care of one order with their own
complete set of equipment (stove, oven, etc.) and tools (knife, spoon, etc.). They
work on their own dish, with their own equipment and focusing on their own task.
This is parallelism.
Concurrency is when one CPU core deals with multiple tasks, using shared
resources, and rapidly switches between tasks. Parallelism is when “n” CPU cores
deal with “n” tasks using dedicated resources to each core.
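To make the chef analogy concrete in code, here is a small single-threaded sketch of my own using Python's asyncio: while one "dish" waits, the event loop switches to another, so the three orders finish in roughly the time of the slowest one.

import asyncio

async def cook(order, minutes):
    # the await simulates waiting on the stove or the oven;
    # while this order waits, the event loop works on the others
    await asyncio.sleep(minutes)
    return f"{order} is ready"

async def kitchen():
    results = await asyncio.gather(
        cook("slow-cooked stew", 0.3),
        cook("grilled vegetables", 0.2),
        cook("scrambled eggs", 0.1),
    )
    print(results)

asyncio.run(kitchen())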
Introduction to Dask
Dask is a specialized Python library for enabling parallel and distributed computing for
data analysis and machine learning tasks in Python. Dask is a Python native library and
does not depend on non-Python programming languages for its implementation.
Dask can work well with major Python libraries like Pandas, Polars, CuDF, scikit-
learn, and XGBoost, to name a few. If you are working with data processing and machine
learning pipelines with data that exceeds the compute and memory requirements of a
single machine, then you may benefit from adopting Dask in your technology stack.
Dask has a relatively quick learning curve, and it is relatively easy to integrate Dask into your production data pipelines. Dask is used by large banks, major healthcare organizations, and national laboratories.
Dask supports many of the common data formats that are prevalent in big data
pipelines such as Avro, HDF5, Parquet, etc. while also supporting traditional data
formats like CSV, JSON, etc.
Dask primarily leverages CPUs for performing its computations and operations. However, Dask can also work with GPUs, and Dask and the CuDF Python library interoperate well with each other. We will dive into this topic in detail in the coming sections.
To get started, you can install Dask with all of its commonly used components using pip:
pip install "dask[complete]"
or install only the core library:
pip install dask
or use conda:
conda install dask
One can also install Dask in a distributed computing cluster. Dask works well with
Hadoop clusters; you can deploy Dask with YARN, a resource manager for Hadoop that
deploys various services within a given Hadoop cluster. You can also create a cloud
computing Dask cluster on AWS or GCP.
Here is a simple hello world–like implementation using Dask:
import dask
import dask.array as da
arr = da.from_array(
[1,2,3,4,5,6,7,8,9,10]
, chunks=2
)
squared = arr ** 2
result = squared.compute()
print(result)
In the preceding example, we started off by importing the necessary modules. We then created a Dask array with chunks=2, meaning each chunk holds two elements (so this ten-element array is split into five chunks). We then defined squared, a lazy expression that squares each value in the array; the actual computation is triggered only when we call the compute() method, which returns the result. This is due to the lazy evaluation nature of Dask.
When we print the result, it will return
[  1   4   9  16  25  36  49  64  81 100]
Features of Dask
Tasks and Graphs
In Dask, a computation is represented as a collection of tasks (nodes) connected by edges and organized as a directed acyclic graph (DAG). The tasks to be accomplished are represented as nodes, and the data dependencies between tasks (or the absence of a dependency) are represented as edges. The DAG representation enables Dask to execute the tasks and their underlying dependencies effectively across multiple threads and processes, or even across multiple CPUs.
Lazy Evaluation
We have seen the concept of lazy evaluation in previous chapters. To refresh your memory, lazy evaluation means that computation does not happen immediately: a given piece of code is executed only if it is actually needed. Until then, its execution may be delayed, and sometimes it may never be executed at all. As we already know, Dask creates a directed acyclic graph to represent its execution flow, but it never actually executes a node unless its result is needed. This way, Dask optimizes the computation.
Dask-CuDF
In a previous chapter, we have discussed GPU computing and CuDF in detail. As we
know already, CuDF is a GPU computation framework, which enables data frame
processing across one or more GPUs. CuDF provides various methods and functions that
can import, export, and perform complex data transformations, which can run strictly
on GPUs.
Dask, on the other hand, is a general-purpose parallel and distributed computing
library. We can use Dask on top of Pandas and other leading data analysis libraries to
scale out our workflow, across multiple CPUs.
Dask-CuDF is a specialized Python package that enables processing data frame
partitions using CuDF instead of Pandas. For instance, if you call Dask-CuDF to process
a data frame, it can efficiently call CuDF’s method to process the data (by sending the
same to GPU). If you have very large data for your data processing pipeline, larger than
the limits of your machine, then you may benefit from using Dask-CuDF.
Dask Architecture
Let us review the various components in Dask. The architecture of Dask is flexible in that
it allows single-instance computations, single-instance with multi-core computations,
and distributed computations.
Core Library
At the heart of the Dask system are the Dask collections, which consist of various data structures: Dask arrays, which resemble NumPy arrays; Dask data frames, which resemble Pandas data frames; and Dask bags, which resemble Python lists and iterators. These collections enable you to perform parallel and distributed computations natively: just by using Dask's collections, you can parallelize or asynchronously execute a computation.
Schedulers
The Dask scheduler is a major component of the Dask execution mechanism. The Dask
scheduler sits in the heart of the cluster, managing the computation, execution of tasks
among workers, and communication to workers and clients. The scheduler component
itself is asynchronous, as it can track progress of multiple workers and respond to various
requests from more than one client. The scheduler tracks work as a constantly changing
directed acyclic graph of tasks. In these directed acyclic graphs, each node can be a task
and each arc (edge between nodes) can be a dependency between tasks.
A task is a unit of computation in Dask that is basically a Python function or an
execution on a Python object. There are different types of schedulers available in Dask.
There is a single-thread scheduler that executes tasks on a single thread on a single
machine; there is also a multi-thread scheduler that executes tasks using multiple
threads of a single machine; there is also a multi-core scheduler that executes tasks
using multiple cores of a processor; and there is a distributed scheduler where tasks are
executed using multiple worker nodes in a given cluster.
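As an illustration of my own (not from the text), you can select which of these schedulers runs a computation by passing a scheduler name to compute():

import dask.array as da

arr = da.ones((1000, 1000), chunks=(100, 100))

# single-threaded scheduler, useful for debugging
print(arr.sum().compute(scheduler="synchronous"))

# thread-based scheduler on a single machine
print(arr.sum().compute(scheduler="threads"))

# process-based scheduler on a single machine
print(arr.sum().compute(scheduler="processes"))

# the distributed scheduler is used automatically once a Client is created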
Client
Dask's client is the primary mode of communication with Dask: you communicate with Dask using a client. Instantiating a client with no arguments creates a local scheduler and workers by default. From the computation point of view, the client manages communication with the scheduler process, and the scheduler coordinates and assigns work to worker processes; these worker processes perform the necessary computation and return the result back to the client. The client also collects results from the workers. Here is how you may instantiate the Dask client:
from dask.distributed import Client

client = Client(
    threads_per_worker=4,
    n_workers=2
)
client
Workers
Workers are end computation machines that perform the computation that is assigned
to them by the scheduler. A worker can be a single thread or a separate process or a
separate machine in a cluster; each worker has access to compute, memory, and storage.
Workers communicate with the schedulers to receive tasks from them, execute the tasks,
and report the results.
Task Graphs
As discussed in the earlier section, the computation in a task is represented as a
constantly changing directed acyclic graph of tasks where the edges may be dependency
between tasks. Task graphs define the dependencies between various tasks. When you
write Dask code, you are implicitly creating task graphs by performing operations on a
Dask array or Dask DataFrame or Dask bag. The Dask system automatically designs the
underlying task graph.
Dask Arrays
A Dask array is an implementation of NumPy’s ndarray. Dask arrays are made up of
many NumPy arrays, and they are split into smaller and uniform sizes of arrays called
chunks. Dask uses blocked algorithms to operate on these chunks. You can either specify the chunk sizes you would like Dask to use or let Dask split the array automatically. These chunks are then processed
independently, thereby allowing a distributed and parallel execution of data processing
across cores or workers within a cluster. A Dask array supports parallelism and multi-
core processing out of the box so that you do not have to write additional code or add a
decorator to enable concurrency.
import dask.array as array

var1 = array.random.random(
    (10, 10),
    chunks=(2, 2)
)
print(var1)
would return
Here, we have a random array of (10 × 10) dimensions that is broken into smaller
chunks of (2 × 2) dimensions each.
When we add “T.compute()” to our variable, we get a transposed random matrix.
Here is how that looks:
print(var1.T.compute())
The following is something that is not to be tried at home, but instead in a cloud
account. Assuming you have adequate memory for your dataset, you can try to persist
the data in memory.
Here is how we can do that:
var2 = var1.persist()
You can also re-chunk your Dask array. Here is how we can do that:
var1 = var1.rechunk((5,5))
print(var1)
would return
We can export this Dask array in various formats. Let us look at the HDF5 format, an open format designed to store large, complex data. To do so, we need to install the Python library "h5py" as a starting point:
import dask.array as da
import h5py
var3 = da.random.random(
(10000,10000),
chunks=(100,100)
)
da.to_hdf5(
'daskarray.hdf5',
'/dask',
var3
)
We have imported the necessary libraries and created the Dask array. We are using the "to_hdf5" method to export the Dask array as an HDF5 file; we specify the output file name, the path of the dataset within the file, and the array that we are exporting.
With the Dask array, you can perform many NumPy operations. Here is how you can
calculate the mean() function of the array:
dask_mean = var3.mean().compute()

print(dask_mean)
would yield
0.5000587201947724
Dask Bags
Dask bags are another type of data structure, designed to handle and process list-like objects. In a way they are quite similar to Python lists or iterators, with the added feature that they are capable of parallel and distributed computation. The word "bag" is actually a mathematical term. Let us quickly recall what a set is: an unordered collection with no repeated elements. Bags are basically sets whose elements can repeat; they are also known as multisets.
from dask import bag

new_bag = bag.from_sequence(
    [1,2,3,4,5,6,7,8,9,10],
    npartitions=2
)
print(new_bag)
print(new_bag.compute())
would yield
dask.bag<from_sequence, npartitions=2>
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
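Here is a small sketch of my own showing two common bag operations, filter and map, which build lazily and only run when compute() is called:

from dask import bag

numbers = bag.from_sequence(range(1, 11), npartitions=2)

evens_squared = (
    numbers
    .filter(lambda x: x % 2 == 0)   # keep even numbers
    .map(lambda x: x * x)           # square them
)

print(evens_squared.compute())      # [4, 16, 36, 64, 100]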
import json
from dask import bag
file = "/content/dask_data1.json"
bags = bag \
.read_text(file) \
.map(json.dumps) \
.map(json.loads)
print(bags.take(1))
would yield
('[{"id":1,"name":"Godard
Bloomer","language":"Greek","gender":"Male","occupation":"Financial
Analyst"},\n',)
import csv

file = "/your-folder-path-here/mockdata-0.csv"
bags = bag \
    .read_text(file) \
    .map(
        lambda x: next(csv.reader([x]))
    )
print(bags.take(5))
would yield
print(bags2.count().compute())
would yield
15
Dask DataFrames
Dask data frames are an implementation of Pandas' data frame objects, except that Dask's data frames can parallelize and perform distributed computations across multiple cores of the same processor or across multiple systems in a given cluster. Dask data frames also support several native Python functions that can be executed in a parallelized manner. Let us look at some illustrations of Dask data frames:
import dask
import dask.dataframe as dd
import pandas as pd
import numpy as np
dask.config.set(
{
'dataframe.query-planning': True
}
)
df = dd.from_pandas(
pd.DataFrame(
np.random.randn(10,3),
columns=['A','B','C']
),
npartitions=3)
print(df)
would yield
print(df.compute())
would yield
A B C
0 -0.580477 0.299928 1.396136
1 0.076835 -0.343578 -0.381932
2 -0.092484 -2.525399 1.520019
3 -0.631179 0.215883 0.409627
4 -0.655274 0.117466 -0.525950
5 -0.003654 -0.215435 -0.522669
6 0.947839 -1.978668 0.641037
7 -0.755312 0.247508 -1.881955
8 1.267564 1.342117 0.601445
9 -1.395537 0.413508 -0.896942
Internally, a Dask DataFrame manages several smaller Pandas data frames arranged along an index. You will see the true performance of a Dask DataFrame when you have a large dataset that requires execution on multiple cores or even multiple machines. You cannot truly appreciate the value of Dask if you are dealing with smaller data frames whose computational times are less than a few seconds on a single core.
By default, a Dask data frame is split into one partition. If you specify the number of partitions, then Dask will partition the dataset accordingly. In our preceding example, we set the value to 3, which makes Dask create three Pandas data frames, split along the index. The way we define our partitions heavily influences the performance of our data munging scripts. For instance, if we needed aggregations for a specific month, say January, and that month fell across two partitions, then Dask would have to process both of those partitions. On the other hand, if our partitions were made by month, then the queries would execute faster and more efficiently, as illustrated in the sketch that follows.
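Here is a small sketch of my own (the column names and values are made up) showing how a sorted datetime index can be repartitioned so that a month's worth of data lands in a single partition:

import pandas as pd
import dask.dataframe as dd

# hypothetical daily sales data for one year
pdf = pd.DataFrame({
    "date": pd.date_range("2023-01-01", "2023-12-31", freq="D"),
    "sales": range(365),
})

ddf = dd.from_pandas(pdf, npartitions=4)

# sort on the datetime column and repartition by calendar month
ddf = ddf.set_index("date").repartition(freq="1M")

# a January aggregation now touches a single partition
january_total = ddf.loc["2023-01-01":"2023-01-31"]["sales"].sum().compute()
print(january_total)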
As with loading data, you can load data from any number of sources. Let us look at
an example:
import dask
import dask.dataframe as dd
import pandas as pd
import numpy as np
dask.config.set(
{
'dataframe.query-planning': True
}
)
csvfile = "/content/mock_health_data.csv"
df1 = dd.read_csv(csvfile)
print(df1)
print(df1.compute())
would yield
Dask Delayed
There are cases where a problem does not easily fit into the abstraction of series, arrays, or data frames. This is where the Dask delayed feature comes in. Using Dask delayed, you can parallelize the execution of your own algorithms. Let us consider the following example:
def add(x):
    return x + 1

def square(x):
    return x * x

def square_add(a, b):
    return a + b
data = [1, 2, 3, 4, 5]
output = []

for x in data:
    a = add(x)
    b = square(x)
    c = square_add(a, b)
    output.append(c)
In the preceding example, we have functions that do not depend on each other, and so they can be executed in parallel. But this workload does not easily fit into a series, array, or data frame abstraction. Now let us use the Dask delayed feature on this problem:
import dask

output = []
for x in data:
    a = dask.delayed(add)(x)
    b = dask.delayed(square)(x)
    c = dask.delayed(square_add)(a, b)
    d = c.compute()
    output.append(d)
In this code, we have wrapped our functions with dask.delayed, which enables the parallel computation of these functions. Notice that the delayed calls do not execute immediately; they are simply added to the task graph that represents the computations to be performed in a specific order. To execute a delayed result, we call its compute() method. So when you run this code, you will see
@dask.delayed
def add(x):
    return x + 1

@dask.delayed
def square(x):
    return x * x

@dask.delayed
def square_add(a, b):
    return a + b

output = []
for x in data:
    a = add(x)
    b = square(x)
    c = square_add(a, b)
    output.append(c)

results = dask.compute(*output)
Dask Futures
A future, simply put, represents an operation that has not yet been executed. It is basically a placeholder for a result that has not yet been computed but may be available at some point. In Python, the concurrent.futures module provides asynchronous execution of functions; it offers a way to execute tasks and obtain their results concurrently. Dask supports this concurrent.futures interface so that you can scale your current data pipeline across multiple systems in a given cluster, with minimal code changes.
Dask futures form a part of Dask’s greater asynchronous execution model, which is
within Dask’s distributed scheduler. So when you submit a task to the Dask scheduler,
it returns a future representing that given task’s result. After that, the scheduler may
execute the task in the background by using the available worker (can be within the
system or another system in the cluster). It may be a good time for us to recall that the
future allows for concurrent execution and can enable parallel computations of code.
Let us also revisit the concept of eager and lazy evaluation and where futures stand
in this point of view. Eager evaluation executes code immediately regardless of whether
a given function or set of code is relevant to the result or not. Lazy evaluation does not
execute anything unless you explicitly specify to execute, upon which it will execute only
the necessary code for the result. You have already seen the “compute()” function that
performs lazy evaluation on Dask. Dask futures can be seen as somewhere in between.
Dask futures are not lazy evaluation; when you submit a future for processing, it will
begin the task in the background. Dask futures are not eager evaluation either, because
futures enable parallel and distributed computation of tasks and functions, natively.
One can naively (with caution exercised) view the Dask futures as eager evaluation with
added features of asynchronous execution and parallelism in a distributed environment,
error handling, and task scheduling.
Internally Dask futures are, by default, asynchronous. When you submit a
computation, it returns a future object immediately. Using the future, you can check the
status of the computation at a given point in time or just wait it out till it completes in the
background using the workers. Dask futures are beneficial in designing data pipelines,
where you may need finer-level control over task scheduling.
To work with Dask futures, you need to instantiate the Dask client; the client also gives you access to a dashboard where you can view results:

from dask.distributed import Client

client = Client()

When you instantiate the client without any arguments, it starts local workers as processes, and when you provide the argument processes=False, it starts local workers as threads. Processes offer better fault tolerance than threads. Here is how you can submit a task to the Dask scheduler:
def square(x):
    return x ** 2

a = client.submit(square, 10)
c = client.submit(square, 20)
In the preceding example, the submit method returns a future. The task may not have completed yet; it will be completed at some point by a worker process. The result of the function stays in that worker process unless you ask for it explicitly. Let's look at the contents of the variable "a":
print(a)
would return
You have to explicitly ask it to return the result. Here is how that looks:
print(a.result())
would return
100
There is also the cancel() method that would stop the execution of a future if it is
already scheduled to run; if it has already run, then the cancel() method would delete it.
Here is how that looks:
a.cancel()
You can wait on a future or a collection of futures using the wait() function, which blocks until the given futures have finished executing. Here is how that looks:

from dask.distributed import wait

wait(a)
In computing, we have this term called “fire and forget” to describe a process or a
task that is executed without waiting for the response or the outcome. Dask also provides
a “fire and forget” feature to tasks, where the tasks are submitted to be executed without
the need for checking their result or the tasks to even complete. Once the task is fired,
the system is told to forget about it, and so the system no longer monitors that task or
even waits for the task to finish:
from dask.distributed import fire_and_forget

fire_and_forget(a)
def square(x):
    return x ** 2

a = client.submit(square, 10)
c = client.submit(square, 20)

print(a.result())
print(c.result())
When we invoke client.submit, the Dask client sends this function to the Dask scheduler. As we already know, the client creates a future and returns the future object to the developer; this happens before the scheduler even receives the function. The scheduler then receives the function and updates its state on how to compute a or c (whichever comes in first), along with several other internal states. Once the scheduler has gathered all the required information in memory, it proceeds to assign the task to a worker. All of this happens in a few milliseconds.
The scheduler then follows a playbook that defines a set of criteria for identifying which worker to assign this work to. It first checks whether any workers already have the function available in their local memory. If none exist, it selects the worker that would need the fewest bytes transferred to process this function. If multiple workers share similarly low values, it selects the worker with the least workload.
The scheduler now places this computation at the top of its stack. A message is generated with the function, the assigned worker, and other metadata; the scheduler serializes this message and sends it to the worker over a TCP socket.
The worker receives the message, deserializes it, and determines what the task requires. The worker then starts executing the computation. It checks whether it requires values held by other workers by consulting the "who_has" dictionary. The who_has dictionary is basically a list of
all the metadata about workers and their tasks and contains other metadata that would
enable the workers to find each other. After this step, the worker launches the actual
computation and completes the task. The worker sends the output to the scheduler.
The scheduler receives this result and repeats the process by sending another task to the worker that is now available. The scheduler then informs the client that the computation is ready, so the future object that is in waiting mode wakes up and receives the result.
At this point, the Python garbage collection process gets triggered. If there are no further computations that depend on the current task, the scheduler removes the key of this computation from its state and adds it to a list that gets deleted periodically.
Without delving deeply into the lower-level details, this captures a typical task scheduling process. Optimizing the task scheduling process helps ensure efficient execution of all tasks within the DAG.
Data Locality
Moving data between workers and unnecessary data transfers can limit the performance of a data pipeline in a distributed environment. Data locality is an important term in the broader context of distributed computing. As we mentioned earlier, the scheduler intelligently assigns tasks to workers that either already have the data or are located close to it, minimizing data transfer over the network. Let's look at a scenario:

from dask.distributed import Client
import dask.dataframe as dd

client = Client()
df = dd.read_csv("/home/parallels/dask/mockdata-*.csv")
result = df["age"].mean()
final_result = result.compute()

In this example, we are loading multiple CSV files stored in a given folder. Dask partitions this dataset across the workers available in the cluster. The next line then computes the mean of the "age" column.
Dask will intelligently identify the workers that already have copies of the dataset available; in this case, those are the workers already holding partitions of the dataset. Dask uses these workers to minimize network transfers and speed up the computation.
As we discussed in earlier sections, in certain cases (like custom algorithms), you may need more control over which worker processes which data. Certain data might require GPU-heavy workers, for example, and you can specify that in the following manner.
For instance:
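The original snippet is not shown here, so the following is an assumed illustration using Dask's worker resources feature; the scheduler address and the "GPU" resource label are hypothetical, and the GPU workers are assumed to have been started with dask-worker <scheduler-address> --resources "GPU=1":

from dask.distributed import Client

# hypothetical cluster address
client = Client("tcp://scheduler-address:8786")

def square_on_gpu(x):
    # stand-in for a GPU-heavy computation
    return x ** 2

# ask the scheduler to place this task only on workers advertising a GPU resource
future = client.submit(square_on_gpu, 10, resources={"GPU": 1})
print(future.result())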
Prioritizing Work
When there are many items in the queue for the Dask cluster to process, Dask has to decide which tasks to process first and which to prioritize over others. Dask prefers to process the tasks that were submitted first; its default policy is first in, first out. You can also define priorities explicitly, where tasks with higher priority run earlier than tasks with lower priority:
a = client.submit(square, 5, priority=10)
b = client.submit(square, 7, priority=-1)
The operation with priority set to 10 carries the top priority, whereas the operation
with priority set to –1 carries relatively lower priority.
Work Stealing
In Dask distributed computing, work stealing is another strategy employed for performance optimization. A task may prefer to run on a specific worker because of the nature of the processing or task dependencies that live on that worker. Stealing is useful when the time taken to perform a computation is longer than the time taken to communicate with the scheduler. Other cases where stealing is a good choice are taking workload away from an overloaded worker and replicating data that is used often. In Dask, work stealing is enabled by default, with the idea of minimizing computational time while allowing for optimal usage of workers.
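If your workload benefits from strict task placement, work stealing can be turned off through Dask's configuration; a minimal sketch:

import dask

# disable work stealing on the distributed scheduler
dask.config.set({"distributed.scheduler.work-stealing": False})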
Conclusion
We have covered a lot of concepts in this chapter. We learned how breaking a large operation down into several smaller operations and running them in parallel or concurrently makes better use of resources than running everything on a single thread. We looked at Dask, a Python distributed computing library that can perform concurrent tasks and overcome traditional Python limitations. We also looked at Dask's various data structures and several optimization concepts. Dask also integrates well with the modern data stack. The tools and concepts you have learned in this chapter lay a good foundation for parallel and concurrent data processing.
CHAPTER 8
Engineering Machine
Learning Pipelines using
DaskML
Introduction
Machine learning is a branch of artificial intelligence and computer science. A machine learning model is an intelligent system that can learn from data in order to make decisions. Organizations of all kinds have gained significant value by leveraging machine learning models in their processes. Training a machine learning model is done using algorithms, which can be computationally expensive. As datasets grow larger, both in volume and in the complexity of the data points, scalable machine learning solutions are highly sought after. In this chapter, we will look at Dask-ML, a library that runs ML algorithms in a distributed computing environment and integrates well with existing modern data science libraries.
By the end of this chapter, you will learn
–– Dask-ML
Data Sourcing
Data sourcing is the determination and gathering of data from various sources. Even before that, one must identify those sources and how often new data is generated. The data sources can involve various data types, including but not limited to text, video, audio, pictures, and so on. The amount of data coming in, how often new data gets generated, and the various types of data required for processing set the stage for data sourcing and ingestion tools and strategies.
Data Exploration
Data exploration is all about getting to know a dataset. A dataset has its own nature and characteristics, depending upon the type and context of its data points. Data exploration involves understanding the distribution of each data point, understanding correlations between data points, looking at various patterns, identifying whether missing values exist, and gaining an initial assessment of how accurate, complete, and consistent the dataset is (even if that assessment is approximate). Data exploration helps identify critical insights about data points, minimize potentially biased results (for instance, when two data points represent the same thing, such as the length of a flower petal in millimeters and in centimeters), and, most importantly, reduce aimless data analytics work that leads nowhere.
Data Cleaning
Data cleaning is caring for the hygiene of the dataset; simply put, clean data enables effective data analysis. Data cleaning involves cleaning up errors and outliers in data points that may negatively affect and possibly bias the dataset. Data cleaning is also about handling the missing values that were previously “discovered,” deciding whether to impute each missing value or remove the row completely. Data cleaning can also involve handling duplicated data, which would most likely be removed. Clean data leads to accurate analysis and accurate regression models and makes the resulting judgments more reliable.
Data Wrangling
Data wrangling is a continuous process of transforming data so that it can be used for analytics and model building. The most common data wrangling technique is often referred to as data standardization, where the data is transformed into a range of values that can be measured on a common scale. Other data wrangling tools include feature scaling and variable encoding, where data is normalized and categorical variables are converted into numerical values in a specific format that can be easily analyzed by a mathematical model. While every step in this model building process is equally important, data wrangling directly deals with improving the accuracy of machine learning models.
Data Integration
Data integration is the idea of combining datasets from one or more data sources into a single dataset that can provide a comprehensive view. Data integration also involves identifying and removing duplicates that may arise from combining several data sources. Data integration involves schema alignment, where data from different sources is made compatible and aligned. Data integration also deals with data quality, where one would develop validations and checks to ensure data quality remains good after the data is integrated from several sources. Data integration helps expand the scope of the entire project and also helps improve the accuracy and integrity of the dataset.
Feature Engineering
Once we have the dataset, we can proceed to the next step, that is, feature engineering. By this time, the dataset may contain several data points. In addition to the existing data points in the dataset, feature engineering involves creating new data points based on the existing ones. Some of the simplest examples are calculating a person's age at a specific date from their date of birth, converting Celsius to Fahrenheit, or calculating the distance between two points based on a specific distance measure, to name a few. Sometimes feature engineering may involve complex computations like Fourier transformations or other methods that are computationally expensive. The idea is to come up with new data points that have a higher correlation with the data point that the model is looking to predict or classify.
Feature Selection
Now that you have the dataset you need for the project and engineered new features
based on the data points of the existing dataset, it is time to select which of these data
points enter into the model. There are many ways of selecting the features for a given
machine learning model. Tests like the chi-square test help determine the relationship
between categorical variables, while metrics like the Fisher score help generate ranks for
various data points. There are approaches like forward selection or backward selection
that iteratively generate various feature combinations and respective accuracy scores;
however, they can be expensive as data and features grow beyond a lab experiment.
Other techniques include dimensionality reduction using tools like principal component
analysis and autoencoders that encode data into lower dimensions for processing and
decode back to the original shape for output.
Data Splitting
Data splitting is the process of partitioning the entire dataset into smaller chunks for
training a machine learning model. Depending upon the training strategy, the approach
of splitting the data may differ. The most common approach of splitting the data is the
idea of holding out training, validation, and test datasets. The training set is usually
allotted for the model to learn from the data, the validation set is primarily designed
for hyperparameter tuning purposes, and the test set is to evaluate the performance of
the model with chosen hyperparameters. There is also k-fold cross validation where the
dataset is partitioned into k chunks and iteratively trained by keeping k – 1 as training
and the remaining as the test set. One can even go further by leveraging advanced
statistical methods like stratified sampling to obtain partitions of similar characteristics
and distributions.
Model Selection
So far we have built a good understanding of the dataset, performed various operations, integrated data from multiple sources, generated features, and looked at data distributions. The next step is to select the right machine learning model. This is also a good stage to decide how important it is to explain the target variable in terms of the variables entering the model. In the case of predicting customer churn, it may be relevant to tell a story based on a few variables like demographics, preferences, etc., whereas in the case of image classification, one is more focused on whether the picture is a cat or a dog and less concerned with how the picture came to be one or the other. In cases of supervised learning, one would usually start off with logistic regression as a baseline and move toward ensemble methods like bagging (bootstrap aggregation) or stacking.
Model Training
Based on the characteristics of data and nature and context of the problem we are trying
to solve using the machine learning model, the training strategy and algorithm vary.
The common training strategies are supervised and unsupervised learning. Supervised learning is the process of training a model with a labeled dataset, where each data point carries a known label, and usually the features with higher correlations with the decision variable are picked for training. Common examples include regression models and classification models.
Unsupervised learning is the process of training with an unlabeled dataset. A good
example is the clustering algorithm that groups similar data points based on features like
geographical proximity and so on. Several other strategies exist; for instance, there is the
incremental learning, where the model is trained continuously as new data is available.
There is also ensemble modeling, which involves combining multiple machine learning
models in order to improve the predictive performance of the model.
Model Evaluation
Once we have a machine learning model trained, the next step is model evaluation. Sometimes testing alone may suffice, and in other cases both testing and validation are required. The idea is to evaluate the model by having it make decisions on a completely new set of data and to measure its performance statistically using metrics like R-squared, F1-score, precision, recall, etc.
Hyperparameter Tuning
Hyperparameters are parameters that control the training process. These
hyperparameters are set prior to the model training stage. Hyperparameters are
technical parameters that guide an algorithm mathematically. In an n-dimensional
vector space, these hyperparameters dictate which direction to search for the best value
and how far to go in that direction at each iteration.
Imagine a scenario where you are tasked to find a football in a large football field.
You may have to walk in various directions and in each direction decide how many
steps to take and look around to see if you can spot the football in your region. This is
basically how the internals of training algorithms work. Hyperparameter tuning is the
process of finding the optimal parameters for a machine learning model. Imagine if I
say, “Turn 45 degrees toward your left, take 10 steps, turn right, and take 5 steps. You will
find the football.” Imagine a process that will guide you to figure out these steps. This is
hyperparameter tuning.
Final Testing
We have performed various checks and validations in the process of building a machine learning model, up to selecting the best hyperparameters for it. Final testing is basically testing the final chosen model on a held-out test set. This process exists in order to set realistic expectations for when the model is deployed in production. The main goal behind this step is to get an idea of how the model may perform when exposed to new data. Once we test the final model, we should establish an understanding of its strengths and weaknesses and document the results for future reference and for maintaining audit trails.
Model Deployment
Once we have trained the model, validated it against test sets, and tuned the hyperparameters, we have a machine learning model that is ready to be
deployed. The concept of deployment changes with respect to the nature and context
of the business and problem statement in hand. The conceptual definition is that
deployment means the usage of a model in everyday scenarios, where you are leveraging
the model to inform better decisions.
Model Monitoring
Once the model is deployed, it needs to be monitored at frequent intervals to see whether it is performing appropriately. The model also requires maintenance every so often. The maintenance activity could involve various strategies, ranging from simple steps like incremental training or adjusting the coefficients a tiny bit, to adding or modifying features, and even to a complete model overhaul. This is the focus of the next step.
Model Retraining
Model retraining is a crucial step in the machine learning model process. This
step involves sourcing the new data that has been collected since the last time and
performing various data wrangling and feature engineering processes. Based on the
dataset that is generated, it will then be decided as to whether incremental training
may suffice or a complete model overhaul may be required. Based on the appropriate
training strategy, the model is retrained, followed by testing and validation.
The model is then redeployed to the production environment. This process is repeated
subsequently to keep the model accurate and up to date.
Before we discuss the features and capabilities of Dask-ML in this chapter, let us briefly look at the various popular machine learning libraries that Dask-ML currently integrates with.
scikit-learn
scikit-learn is a popular machine learning and data processing library in Python. The scikit-learn library is widely adopted by both academia and industry for building machine learning applications. The library supports many supervised and unsupervised learning algorithms and also has prebuilt methods for feature engineering, model selection, and cross validation strategies.
Many of the algorithms in scikit-learn support parallel execution through the joblib library, typically exposed via the n_jobs parameter. Joblib provides thread-based and process-based parallelism for these machine learning algorithms. The Dask-ML library can scale these joblib-backed algorithms to execute over multiple machines in a given cluster.
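A minimal sketch of this integration, assuming a running Dask client and a scikit-learn estimator that accepts n_jobs (the dataset here is generated purely for illustration):

from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # connect to a local or remote Dask cluster

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# route joblib's parallelism through the Dask cluster
with joblib.parallel_backend("dask"):
    model.fit(X, y)

print(model.score(X, y))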
XGBoost
XGBoost is another popular library in the machine learning space for gradient-boosted trees. Gradient-boosted trees are a machine learning technique used for both continuous and categorical decision variables (regression and classification, respectively). It belongs to the ensemble learning paradigm, where multiple decision tree models are combined, each reducing the error of the previous ones, to make the final prediction model. Dask integrates well with XGBoost: Dask sets up the XGBoost master process on the Dask scheduler and the XGBoost worker processes on the Dask workers. It then hands the Dask data frames to XGBoost and lets it train. When XGBoost finishes the task, Dask cleans up XGBoost's infrastructure and returns to its original state.
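As a sketch of what this looks like in code, using XGBoost's built-in Dask module (the toy data and parameters are illustrative, not the book's example):

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()

# toy data; in practice these would be partitions of a real Dask dataset
X = da.random.random((1000, 5), chunks=(100, 5))
y = (da.random.random(1000, chunks=100) > 0.5).astype(int)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=10,
)
booster = result["booster"]  # the trained model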
PyTorch
PyTorch is an open source machine learning library primarily used for natural language processing, computer vision, and neural networks, among other models. PyTorch supports tensor computation, where tensors are GPU-efficient multidimensional arrays made for high-performance computing tasks. PyTorch integrates well with scikit-learn through a library called Skorch. The Skorch library provides a scikit-learn-compatible wrapper around PyTorch models. This means that Skorch-wrapped PyTorch models can be used with the Dask-ML API.
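A minimal sketch of the Skorch wrapper (the network architecture here is arbitrary and purely illustrative):

import torch.nn as nn
from skorch import NeuralNetClassifier

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # a tiny feed-forward network for four input features and two classes
        self.layers = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

    def forward(self, X):
        return self.layers(X)

# the wrapped model exposes fit/predict/score, so it plugs into
# scikit-learn style tooling, including Dask-ML's meta-estimators
net = NeuralNetClassifier(SimpleNet, criterion=nn.CrossEntropyLoss,
                          max_epochs=5, lr=0.1)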
Other Libraries
Dask-ML also supports the Keras and TensorFlow libraries. Keras is an open source Python library for developing deep learning models. TensorFlow is an end-to-end machine learning library developed by Google. Keras models can be wrapped with a scikit-learn-compatible API (for example, through the SciKeras package), which allows Dask-ML to be utilized with Keras models.
You can also install XGBoost and Dask-XGBoost using this command:
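The exact command is not shown here; assuming pip and the dask-xgboost integration package, it would typically be:

pip install xgboost dask-xgboost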
Similar to what we discussed earlier, Dask partitions the dataset into multiple smaller chunks and assigns these chunks to various workers in a given computing resource. The machine learning model is trained in parallel on these workers, and the results are aggregated after training. Dask-ML offers parallel processing of several tasks, including data preprocessing tasks like encoding and estimators like principal component analysis. Furthermore, Dask-ML also provides effective cross validation methods and hyperparameter tuning methods that use all the workers in the given computing environment to search for better parameters. Let us discuss these features.
RobustScaler( )
The robust scaler method makes the data resistant to outliers. It employs a two-step approach of centering and scaling, where it subtracts the median from each data point and divides by the interquartile range.
Let us look at this with an illustration:
import dask.dataframe as dd
from dask_ml.preprocessing import RobustScaler
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
df = dd.from_array(X)
RScaler = RobustScaler() \
.fit_transform(df) \
.compute()
print(RScaler.head())
would yield the first five rows of the scaled dataset.
MinMaxScaler( )
MinMaxScaler is another useful scaling function provided by the Dask-ML preprocessing methods. For each value in a given column, MinMaxScaler subtracts the minimum value of the feature and divides by the range (the maximum minus the minimum). The idea behind MinMaxScaler is to bring all the values within the range [0, 1].
It may be good to note that you can pass a different feature_range (for example, (–1, 1)) if that range suits your model better. MinMaxScaler is a good fit when you are attempting to fit a neural network with a Tanh or Sigmoid activation function. Please note that MinMaxScaler does not play well with outliers; either remove the outliers or use the RobustScaler function instead.
Let us look at an example:
import dask.dataframe as dd
from dask_ml.preprocessing import MinMaxScaler
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
df = dd.from_array(X)

MMScaler = MinMaxScaler() \
    .fit_transform(df) \
    .compute()
print(MMScaler.head())
would return
0 1 2 3
0 0.222222 0.625000 0.067797 0.041667
1 0.166667 0.416667 0.067797 0.041667
2 0.111111 0.500000 0.050847 0.041667
3 0.083333 0.458333 0.084746 0.041667
4 0.194444 0.666667 0.067797 0.041667
OneHotEncoder( )
OneHotEncoder converts categorical values into one-hot-encoded numeric columns. Let us look at an example:

import dask.array as da
import numpy as np
from dask_ml.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
var1 = da.from_array(
    np.array(
        [['Apples'], ['Melons'], ['Melons'], ['Oranges']]
    ),
    chunks=2
)
# encoder = encoder.fit(var1)
result = encoder.fit_transform(var1)
print(result)
would return the lazy Dask array representation of the encoded data, while

print(result.compute())

would return the materialized one-hot-encoded array.
Cross Validation
As we have seen in a previous section, cross validation is a machine learning technique that helps assess a recently trained machine learning model; using cross validation methods, you can understand how well the model generalizes to non-training data. The method usually partitions the dataset into “k” different chunks. The model is trained “k” times, where each time one of the k chunks acts as the test set and the rest are used for training/validation. This process is repeated k times, so that in each iteration a unique chunk serves as the test set.
For instance, let us say we have five partitions of a dataset, calling them d1, d2, d3,
d4, and d5:
–– During the first iteration, d1 will be the test set and d2, d3, d4, and d5
will be training sets.
–– During the second iteration, d2 will be the test set and d1, d3, d4, and
d5 will be training sets.
–– During the third iteration, d3 will be the test set and d1, d2, d4, and
d5 will be training sets.
And so on. At each iteration, the performance metrics are gathered; at the end of all “k” iterations, the metrics are aggregated and used to assess the model's performance. What we have seen here is k-fold cross validation.
Let us look at an example of performing fivefold cross validation using Dask-ML:
import dask.array as da
from dask_ml.model_selection import KFold
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression
from statistics import mean
# generate a regression dataset to cross-validate against (values illustrative)
X, y = make_regression(
    n_samples=200,
    n_features=5,
    random_state=0,
    chunks=20
)

model = LinearRegression()
# The Dask kFold method splits the data into k consecutive subsets of data
# Here we specify k to be 5, hence, 5-fold cross validation
kf = KFold(n_splits=5)
train_scores: list[float] = []
test_scores: list[float] = []
for i, j in kf.split(X):
X_train, X_test = X[i], X[j]
y_train, y_test = y[i], y[j]
model.fit(
X_train,
y_train
)
train_score = model.score(
X_train,
y_train
)
test_score = model.score(
X_test,
y_test
)
train_scores.append(train_score)
test_scores.append(test_score)
# aggregate the per-fold scores; this is where the 'mean' import is used
print(mean(train_scores), mean(test_scores))
You may wish to include your own example in the preceding code and work with
tenfold cross validation and examine the output as an exercise.
Dask-ML also provides a method to split one or more Dask arrays. Let us look at the
following code:
import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
X, y = make_regression(
n_samples=200,
n_features=5,
random_state=0,
chunks=20
)
# notice that this method from the Dask-ML library resembles the scikit-learn API
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X)
would return the lazy Dask array representation of X. After splitting,
print(X_train)
would return the Dask array for the training portion, and
print(X_train.compute()[:5])
would return the first five materialized rows:
array(
[[ 0.77523133, -0.92051055, -1.62842789, -0.11035289, 0.68137204],
[ 0.02295746, 1.05580299, -0.24068297, 0.33288785, 2.14109094],
[-0.40522438, -0.59160913, 0.0704159 , -1.35938371, -0.70304657],
Grid Search
In a grid search, you define a grid of hyperparameter values (a matrix of values). The grid search exhaustively searches through all the combinations of hyperparameters. It evaluates the performance of every possible combination within the grid using a cross validation method. While it will deterministically search and provide the best possible hyperparameters for your model, it gets expensive as the model grows, and a lot depends on the grid you have defined.
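A minimal sketch of a grid search with Dask-ML, assuming a scikit-learn-style estimator (the grid values here are illustrative):

from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

client = Client()

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    'alpha': [1e-4, 1e-3, 1e-2],   # illustrative grid values
    'penalty': ['l2', 'elasticnet']
}

search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)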
Random Search
In a random search, you define a distribution for each hyperparameter. Random search randomly samples hyperparameter values from these predefined distributions. Once again, the method evaluates the performance of the model using cross validation. While this might sound like a more efficient approach, there is a likelihood that the sampled values will not find the globally optimal values for these hyperparameters.
When you are facing memory constraints but your compute is sufficient to train the model, you can use an incremental search method.
Incremental Search
Incremental search starts off with predefined values for the hyperparameters and adjusts them based on performance at every iteration. It is an efficient method when incremental adjustments are feasible, and it may even converge to the optimal values for the hyperparameters. But the search strategy requires work and expects the hyperparameters to be continuous.
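A minimal sketch of an incremental search with Dask-ML, assuming an estimator that implements partial_fit (the parameter ranges are illustrative):

from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.model_selection import IncrementalSearchCV
from sklearn.linear_model import SGDClassifier
from scipy.stats import loguniform, uniform

client = Client()

X, y = make_classification(n_samples=1000, chunks=100, random_state=0)

params = {
    'alpha': loguniform(1e-4, 1e-1),   # illustrative ranges
    'l1_ratio': uniform(0, 1)
}

search = IncrementalSearchCV(
    SGDClassifier(penalty='elasticnet', random_state=0),
    params,
    n_initial_parameters=8,
    random_state=0
)
search.fit(X, y, classes=[0, 1])
print(search.best_params_)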
When compute and memory are both constrained, making it challenging to obtain good hyperparameters for your machine learning model, you may choose to explore the Hyperband algorithm for hyperparameter tuning.
The way I understand it, the Hyperband algorithm is essentially a random search with aggressive early stopping: if a combination takes too long to converge, the algorithm dynamically cuts off the compute and memory resources allocated to it and focuses on the combinations that are converging faster than the others. The early stopping criterion becomes more aggressive at each iteration. Once all predefined iterations are completed, the hyperparameters that converged with the least computing resources are returned.
Let us look at an example, where we create a stochastic gradient descent classifier and initialize the hyperparameter search spaces as log-uniform and uniform distributions, respectively. The Hyperband algorithm searches over the hyperparameters, and you can see the result at the end. Here is how it looks:
# imports restored for completeness; loguniform and uniform come from scipy.stats
from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.model_selection import HyperbandSearchCV, train_test_split
from sklearn.linear_model import SGDClassifier
from scipy.stats import loguniform, uniform

client = Client()

X, y = make_classification(
    chunks=10,
    random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(
    tol=1e-3,
    penalty='elasticnet',  # regularization term
    random_state=0
)
params = {
'alpha': loguniform(1e-2, 1e0),
'l1_ratio': uniform(0, 1)
}
search = HyperbandSearchCV(
clf,
params,
max_iter=80,
random_state=0
)
search.fit(
X_train,
y_train,
classes=[0, 1]
)
print(search.best_params_)
print(search.best_score_)
print(search.score(X_test, y_test))
would yield
HyperbandSearchCV(
estimator=SGDClassifier(penalty='elasticnet', random_state=0),
parameters={
'alpha': <scipy.stats._distn_infrastructure.rv_continuous_frozen
object at 0x7fb9e4c47d90>,
'l1_ratio': <scipy.stats._distn_infrastructure.rv_continuous_
frozen object at 0x7fb9e43cfb80>
},
random_state=0
)
print(search.best_params_)
{
'alpha': 0.057240478677855075,
'l1_ratio': 0.40506608504946273
}
print(search.best_score_)
0.6
print(search.score(X_test, y_test))
0.8
Let us now walk through a small end-to-end preprocessing example with Dask-ML. First, import the required classes and methods:
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import OneHotEncoder
from dask_ml.impute import SimpleImputer
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.metrics import accuracy_score
Now let us create a sample dataset to work with the Dask-ML methods. We keep it simple: a small dictionary that can be converted into a data frame for downstream processing:
data = {
'color':
['red', 'orange', 'green', None, 'yellow', 'green'],
'weight':
[150, 180, 200, 160, None, 220],
'taste':
['sweet', 'sweet', 'sour', 'sweet', 'sweet', 'sour'],
'fruit':
['apple', 'orange', 'apple', 'apple', 'orange', 'melon']
}
Now, let us create the Dask data frame for the preceding data with two partitions:
df = dd.from_pandas(
pd.DataFrame(data),
npartitions=2
)
Let us initialize the SimpleImputer. You can specify the strategy used to fill missing values. The most common approach for a categorical variable is to fill in the most frequently occurring category. For a continuous variable, you can choose to take the statistical average and fill in the missing value:
imputer1 = SimpleImputer(strategy='most_frequent')
df[['color']] = imputer1.fit_transform(df[['color']])

imputer2 = SimpleImputer(strategy='mean')
df[['weight']] = imputer2.fit_transform(df[['weight']])

print(df.compute())
would yield
color weight taste fruit
0 red 150.0 sweet apple
1 orange 180.0 sweet orange
2 green 200.0 sour apple
3 green 160.0 sweet apple
4 yellow 182.0 sweet orange
5 green 220.0 sour melon
Notice that the first imputer assigned the most commonly occurring category to the categorical variable, and the second imputer assigned the statistical mean of the column to the missing value. As it happens, the imputed average for the missing weight is close to the weight recorded for the other orange.
Let us look at another machine learning model in Dask-ML, where we fit a linear regression model to a set of data. Linear regression is primarily concerned with predicting the value of a continuous variable. For instance, forecasting the demand for a specific item given inventory levels at a given time can be a good fit for linear regression. Let us look at the code in detail.
First, let us import all the necessary classes and methods:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.linear_model import LinearRegression
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
Let us generate some sample data for our machine learning model:
number_of_samples = 1000
number_of_features = 4
X, y = make_regression(
n_samples= number_of_samples,
n_features= number_of_features,
noise=0.1,
chunks=4
)
y = np.exp(y / 200).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linear_model = LinearRegression()
linear_model.fit(
    X_train,
    y_train
)
# the score call below is assumed; the original listing shows only the resulting value
print(linear_model.score(X_test, y_test))
would yield
0.9999
print(linear_model.get_params())
Conclusion
The ability to run machine learning algorithms on distributed computing resources within the Python ecosystem is a significant milestone in applied research. Dask-ML integrates well with scikit-learn and other leading libraries. In this chapter, we looked at various preprocessing and hyperparameter tuning applications using Dask-ML. By knowing Dask-ML and being able to use it in a production environment, organizations can gain high-value insights and significant savings, enabling them to push data-driven decision making to the next level.
CHAPTER 9
Engineering Real-time
Data Pipelines using
Apache Kafka
Introduction
So far we have looked at batch processing pipelines. In this chapter, we will look at real-time stream processing pipelines and a library that is incredibly rich in functionality: Apache Kafka. Apache Kafka is a distributed and fault-tolerant streaming and messaging platform. Kafka helps build event streaming pipelines that can capture the creation of new data and the modification of existing data in real time and route it appropriately for various downstream consumption purposes. In this chapter, we will look at Kafka, its architecture, and how Kafka can help build real-time data pipelines.
By the end of this chapter, you will learn
–– ksqlDB
Introduction to Kafka
Apache Kafka is a distributed, scalable, fault-tolerant, elastic, and secure streaming and
messaging platform. Written in Java and Scala, it is used to build real-time data pipelines
and streaming pipelines as well. Event streaming pipelines help capture data in real time
(or near real time) from sources like databases, sensors from devices (like IoT devices
and others), and other systems. These captured data are stored in a database for further
analysis, data processing, and routing to other downstream software applications. As per the official documentation, Apache Kafka provides three core functionalities: publishing and subscribing to streams of events, storing streams of events durably and reliably, and processing streams of events as they occur or retrospectively.
“Fault tolerant” means the ability of the system to continue functioning in case of failure. Fault tolerance is usually achieved through replication. In the case of data processing in a distributed environment, data that is partitioned into shards is usually replicated; this means that every subset of the data has more than one copy in the system. These copies are spread across the cluster, so when a node containing one copy of a shard fails, the system automatically locates another copy of the same shard on a different node.
“Secure” means that the data, systems, and resources are protected from unauthorized access, damage, and attacks by external entities. A secure system may include access control, two-factor authentication, encryption in motion and at rest (disk encryption and transporting data over secure tunnels), and regular security updates to protect the system from potential threats and other vulnerabilities that may exist.
Note
Encryption at Rest
Encryption at rest refers to disk encryption. Disk encryption ensures that when an unauthorized entity gains access to the disk and attempts to read it, it cannot access the data. The data is encrypted and requires a “key” to decrypt.
Encryption in Motion
Encryption in motion refers to encrypting data in transit. Transport layer encryption secures data while it is in transit over a given network. Several industry-standard network protocols implement encryption in motion, among them HTTPS, TLS, and SSL. And so, when the data is intercepted in transit, the information does not make sense without the corresponding key to decrypt it.
Let us talk about the message queue architecture. Here, messages are kept in a queue to be processed, and processing starts by consuming these messages asynchronously. Each message is consumed by only one consumer, and once a message is consumed, it is removed from the queue.
Now, let us talk about the pub–sub messaging system. Pub–sub stands for publisher and subscriber, respectively. In this architecture, messages are “published” into what is called a topic. The topic can then be read by other programs, called “subscribers,” which receive messages from the topic. You can have multiple subscribers receiving messages from a topic, so messages are effectively broadcast to multiple subscribers.
Kafka Architecture
Kafka can be deployed on virtual machines, in cloud environments, in containers, and even on bare metal appliances. Kafka consists of servers and clients within its distributed system. Kafka servers provide broker and Kafka Connect services. The brokers form the storage layer, whereas Kafka Connect provides continuous import or export of data as event streams. The Kafka client provides you with libraries and APIs to write distributed applications and microservices that can process streams of events, run in parallel, scale up where needed, and be fault tolerant. Let us look at some of the components of Kafka.
Events
An event is simply something that is taking place or happened at a given point in time. Of
course, every day there are lots of events happening in our world; in the Kafka context, it
would specifically refer to the events for the use case. For instance, in a bank, you have
lots of events happening at the same time—customer withdrawing, customer depositing
money, etc. The anatomy of an event is such that it may contain key, value, timestamp,
and other metadata relating to that event, for instance:
Key: John
Value: John withdrew $100 cash
Timestamp: July 14, 2023, 10:10:33 am
Topics
Topics are where events are organized and stored securely. The concept is similar to
that of files and folders. A folder is where one or more files are stored; similarly, one or
more events are stored in topics. In the preceding example, John withdrawing a sum of
$100 is considered an event. If similar events were grouped, then it would be referred to
as the “Cash withdrawal” topic. Events can be read as many times as needed and do not
get deleted after consumption. You can define how long to retain these events. These
retention periods can be modified topic to topic.
Partitions
Partitioning is the process of dividing a large block of data into multiple smaller ones to make reading and access easier. Topics are partitioned, that is, spread over a number of buckets. Events with the same event key (in the preceding example, John is an event key) will be written to the same partition. Partitions can also be replicated, so that if a given partition fails, there is always another copy that contains the same topic and events. Replication is performed at the topic partition level. For instance, if the replication factor is 3 and you have four partitions for a given topic, then there will be 12 partition replicas in total (three copies of each partition).
Broker
Brokers in Kafka are the storage services provided by the Kafka server. The abovementioned topic partitions are stored across a number of brokers. This way, when one broker fails, Kafka will locate another broker that holds a replica of the same topic partition and continue with the processing. Brokers communicate with each other for data replication, and they ensure consistency of data across the entire Kafka cluster.
Replication
Kafka automatically replicates each topic partition, and each partition gets stored on more than one broker, depending upon the replication factor set by the Kafka administrator. This enables fault tolerance, so that even when a broker fails, the messages continue to remain available. The main partition is called the leader replica, and the copies of the topic partition are called follower replicas. Every partition that gets replicated has one leader and n – 1 follower replicas. When you read and write data, you always interact with the leader replica. The leader and followers work together to manage replication. The whole process is automated and happens in the background.
Producers
Producers are part of the Kafka client, enabling the process of publishing events to
Kafka topics. Typically Kafka producers send a message to the topic. These messages are
partitioned based on a mechanism (e.g., key hashing). For a message to be successfully
written in a topic, the producer must specify a level of acknowledgment (ack). Some
examples of a Kafka producer include web servers, IoT devices, monitoring and logging
applications, etc. Kafka may serialize the data that is coming in using any of Java’s built-
in serializers or using Apache Avro, Protobuf, etc.
You connect to the cluster using the Kafka producer, which is supplied with
configuration parameters including brokers’ addresses in the cluster, among
other things.
Consumers
Once producers have started sending events to the Kafka cluster, we need to get these events back out to other parts of the system. Consumers are the clients that subscribe to and read these events. They read events from topics and generate responses based on those events. Note that consumers only read the topics; they do not “consume” them in any destructive sense, and the events remain available in the Kafka cluster. The data that comes from producers needs to be correctly deserialized in order to be consumed. Consumers read topics from oldest to newest.
Schema Registry
Now that we have producers and consumers ready, the producers will start publishing
messages into topics, while the consumers will begin to subscribe to these same
messages that the producers publish. The consumers will start consuming only if they
identify and understand the structure of the data the producers publish. The schema
registry maintains a database of all schemas that producers have published into topics.
The schema registry is a standalone process that is external to Kafka brokers. By utilizing
the schema registry, you are ensuring data integrity among other great benefits.
With Avro
Avro is an open source framework for serializing structured data; schemas are defined in JSON, and the serialized data is compact and binary.
With Protobuf
Protobuf stands for Protocol Buffers. It is a way of serializing structured data and comes with its own schema definition language. Protobuf was developed by Google; its schema definition files are text based and more readable than JSON schemas, while the serialized data itself is binary, language neutral, and efficient in terms of speed and size.
Kafka Connect
Kafka Connect is a data integration mechanism offered by Kafka to connect and integrate with various enterprise systems. Kafka Connect provides the necessary methods and functions, through an API, to connect both from and to various relational databases, NoSQL databases, file systems, etc. You can use Kafka Connect both to source data from these systems and publish it as topics and to consume topics from the Kafka cluster and load them into these various databases and file systems. Kafka Connect can run in standalone mode for development and testing and in distributed mode for large-scale production deployments.
Unlike batch extraction, streaming applications receive data from a perennial flow of data coming from a given system. The easiest way to visualize this is a weather forecasting system: such systems receive atmospheric data from a satellite continuously. And so, there is no “extraction” happening here; instead, the application is configured to receive the data as it arrives.
Kafka Streams is scalable and fault tolerant and can be deployed on leading operating systems, bare metal appliances, containers, etc. Kafka Streams primarily caters to the Java programming language for writing streaming pipelines and for enriching existing Java applications with stream processing functionality. For other programming languages, there is another service, called ksqlDB, that exposes Kafka Streams functionality.
ksqlDB is a database that integrates with Kafka, enabling you to build streaming data pipelines much as you would build traditional data applications. More specifically, ksqlDB is a computational layer that uses SQL-like syntax to interact with the storage layer, which is Kafka. You can interact with ksqlDB through the web interface, the command line, and REST APIs. To use ksqlDB from Python over the REST API, you can install a client library.
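One option, assuming you want to call ksqlDB's REST API from Python, is the community ksql-python client (the package name is an assumption; check PyPI for the current client):

pip install ksql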
You can create a new table or stream within ksqlDB and stream your existing data from relational tables or other file systems into topics, for instance:

-- the opening of the statement and the leading column are reconstructed for illustration
CREATE TABLE campaign (
    CampaignID VARCHAR PRIMARY KEY,
    StartDate DATE,
    EndDate DATE,
    Budget DOUBLE
)
WITH (
    kafka_topic = 'Campaign',
    partitions = 3,
    value_format = 'PROTOBUF'
);
If you have existing Kafka topics, you can use ksqlDB to create a stream or table and
then begin streaming data into ksqlDB:
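For instance, a stream declared over an existing topic might look like this (the topic name and columns are illustrative):

CREATE STREAM campaign_events (
    CampaignID VARCHAR,
    Budget DOUBLE
) WITH (
    kafka_topic = 'Campaign',
    value_format = 'JSON'
);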
Just like relational SQL, you can perform lookups, joins, filtering, transformations,
and aggregations to derive real-time analytic insights from data.
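As a sketch, a continuous aggregation over such a stream could be expressed as a push query (the stream and column names are illustrative):

SELECT CampaignID, COUNT(*) AS events
FROM campaign_events
GROUP BY CampaignID
EMIT CHANGES;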
Choose the appropriate cluster type. For learning purposes, the basic option should
suffice. Here is how that may look:
You will have the option to specify where your cluster should be hosted. Please feel
free to select your favorite cloud service provider and a region closest to where you are
located. You will see the estimated costs at the bottom. Here is how this may look for you:
This will be followed by a payment page. If you are registering with Confluent for the
first time, you may get free credits; so you can skip the payments and proceed to launch
the cluster.
Here is how the cluster may look for you:
Once you have a Kafka cluster ready, proceed to create a topic within that cluster. A topic is an append-only log that records messages.
At this time, navigate to your cluster within the Kafka dashboard, locate the “Connect
to your systems” section, choose Python programming language, and create a cluster
API key and a schema registry API key.
Here is how that may look:
Figure 9-6. API key and secrets for the Kafka cluster
For development and sandbox purposes, create a config.py file within your
development environment and store the following:
config = {
'bootstrap.servers': '<bootstrap-server-endpoint>',
'security.protocol': 'SASL_SSL',
'sasl.mechanisms': 'PLAIN',
'sasl.username': '<CLUSTER_API_KEY>',
'sasl.password': '<CLUSTER_API_SECRET>'}
sr_config = {
'url': '<schema.registry.url>',
'basic.auth.user.info':'<SR_API_KEY>:<SR_API_SECRET>'
}
Make sure to replace the endpoints, key, and secret with appropriate data that is
obtained during the process. This is essential for your local Kafka code to interact with
your Kafka cluster in the cloud.
Now, let us set up the Kafka development environment on your local workstation. We will start off by creating and activating a Python virtual environment (the name kafkadev is just an example):

python3 -m venv kafkadev
source kafkadev/bin/activate

If you are on the MS Windows operating system, use the following command to activate your Python virtual environment:

kafkadev\Scripts\activate

When you are done working, you can leave the virtual environment with:

deactivate
Let us now set up the Kafka development environment, from the command line:
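At a minimum, the examples in this chapter need the Confluent Kafka client library:

pip install confluent-kafka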
Now let us create a simple Kafka publisher–subscriber program. To get started, create
a work directory:
mkdir Kafka-dev
cd Kafka-dev
Create a new Python script named “config.py”. Copy the earlier configuration into the script and comment out the schema registry section, which is not needed for this first program. With example values filled in, it would look like this (the key and secret shown are placeholders):

config = {
    'bootstrap.servers': '<bootstrap-server-endpoint>',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': 'MG2334H787GH097FRT',
    'sasl.password': 'vYGUYTGuyGVJHGVUYtt76&*^YUG+ArDmnoHxENzv**o#$%^ijkjkn+buNjCb0W5E'
}

"""
sr_config = {
    'url': '<schema.registry.url>',
    'basic.auth.user.info': '<SR_API_KEY>:<SR_API_SECRET>'
}
"""
Next, create a script named producer.py with the following code (the imports and the delivery callback are restored here for completeness):

from confluent_kafka import Producer
from config import config

def callback(err, msg):
    # delivery report: print the error or a confirmation for each message
    print(err if err is not None else f"Delivered {msg.key()} to {msg.topic()}")

p = Producer(config)
greetings = ['alice', 'bob', 'jenny', 'rhonda', 'steve', 'miller']
for g in greetings:
    p.produce("hello_topic",
              f"hello {g}",
              g,
              on_delivery=callback)
p.flush()
Now let us examine the code. We start off by importing the necessary libraries, namely, the Confluent distribution of Kafka and our config module. We then instantiate the Producer class with the config. The most important methods of the producer class are produce and flush. The produce method takes the topic name, the message value, the message key, and an optional delivery callback. When you run the program

python3 producer.py

the program will connect to the Kafka cluster in the cloud, locate the appropriate topic, and write out the messages to the topic. On the program side, it will print a delivery report for each message.
Let us now create a Kafka consumer that would read the messages from the topic
within the Kafka cluster. Please note that the Kafka consumer only reads the messages; it does not alter them or remove them from the topic. These messages will continue to persist in the topic within the Kafka cluster.
Create a new Python script named “consumer.py” and enter the following code:
from confluent_kafka import Consumer
from config import config

def consumer_config():
    config['group.id'] = 'hello_group'
    config['auto.offset.reset'] = 'earliest'
    config['enable.auto.commit'] = False  # type: ignore

consumer_config()
subscriber = Consumer(config)
subscriber.subscribe(["hello_topic"])

try:
    while True:
        event = subscriber.poll(1.0)
        if event is None:
            continue
        if event.error():
            print(event.error())
            continue
        val = event.value().decode('utf8')
        partition = event.partition()
        print(f'Received: {val} from partition {partition}')
except KeyboardInterrupt:
    pass
finally:
    subscriber.close()
We start off, once again, by importing the necessary classes. In addition to the configuration supplied by config.py, we also need to set the consumer group id and whether to start consuming from the earliest or the latest messages. A Consumer class is instantiated as the subscriber, which is then subscribed to the given topic. Finally, the subscriber runs an infinite while loop that reads messages from the topic and prints them to the console, until the user interrupts it with a key combination.
This consumer pairs with the producer we created earlier. Next, let us publish structured data using a JSON schema and the schema registry. Create a script named publisher.py, import the required classes (and make sure sr_config is no longer commented out in config.py), and define a Student data class along with a JSON schema:

from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer
from config import config, sr_config

class Student(object):
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course
schema_str = """{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Student",
"description": "Student course data",
"type": "object",
"properties": {
"student_id": {
"description": "id of the student",
"type": "number"
},
"student_name": {
"description": "name of the student",
"type": "string"
},
"course": {
"description": "course of the student",
"type": "string"
}
}
}
"""
student_data = [
Student(55052,"Johndoe","Chem101" ),
Student(55053,"TestStudent","Physics"),
Student(55054,"JaneDoe","Biology")
]
if __name__ == '__main__':
topic = 'student_data'
schema_registry_client = SchemaRegistryClient(sr_config)
json_serializer = JSONSerializer(schema_str,
schema_registry_client,
student_to_dict)
producer = Producer(config)
for i in student_data:
producer.produce(topic=topic,
key=str(i.student_id),
value=json_serializer(i,
SerializationContext(
topic,
MessageField.VALUE
)),
on_delivery=delivery_report)
producer.flush()
We start off with loading the various classes and methods required for this publisher
script. We also have loaded a data class and a JSON schema for demonstrating the
student data example. We have defined a function that converts student data to a
dictionary. The remaining portion of the program is quite familiar to you, as we are
instantiating the producer class, schema registry, and JSON serializer, supplying various
parameters required for the producer to serialize the data and iterate over the list to write
to the topic within the Kafka cluster.
The JSON serializer accepts the schema both as a string and as a schema instance. The student_to_dict function converts a Student object into a dictionary, and the “ctx” parameter provides metadata relevant to serialization.
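The two helper functions referenced above are not shown in the listing; a minimal sketch consistent with the surrounding description might look like this (treat the exact bodies as assumptions):

def student_to_dict(student, ctx):
    # ctx carries serialization metadata (topic and field) supplied by the serializer
    return {
        "student_id": student.student_id,
        "student_name": student.student_name,
        "course": student.course
    }

def delivery_report(err, msg):
    # print a confirmation or an error for each published record
    if err is not None:
        print(f"Delivery failed for record {msg.key()}: {err}")
    else:
        print(f"Record {msg.key()} delivered to {msg.topic()} [{msg.partition()}]")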
When this program is run
python3 publisher.py
we would get a delivery confirmation for each of the student records published to the topic.
Now let us focus on the consumer portion of this project. Create a file, called
consumer.py, and load the following code into the file:
from confluent_kafka import Consumer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry.json_schema import JSONDeserializer
from config import config

class Student(object):
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course
schema_str = """{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Student",
"description": "Student course data",
"type": "object",
"properties": {
"student_id": {
"description": "id of the student",
"type": "number"
},
"student_name": {
"description": "name of the student",
"type": "string"
},
"course": {
"description": "course of the student",
"type": "string"
}
}
}
"""
def set_consumer_configs():
config['group.id'] = 'student_group'
config['auto.offset.reset'] = 'earliest'
if __name__ == '__main__':
topic = 'student_data'
print(type(schema_str))
json_deserializer = JSONDeserializer(schema_str, from_dict=dict_to_student)
set_consumer_configs()
consumer = Consumer(config)
consumer.subscribe([topic])
while True:
try:
event = consumer.poll(1.0)
if event is None:
continue
student = json_deserializer(event.value(),
SerializationContext(topic, MessageField.VALUE))
if student is not None:
print(f'Student with id {student.student_id} is {student.student_name}.')
except KeyboardInterrupt:
break
consumer.close()
Once again, the program loads all the required classes and methods, followed by the data class for the student data and the JSON schema string defining the schema. We then add the consumer configuration required in addition to the existing configuration, along with a function that converts the deserialized dictionary back into a Student object. From here on, we proceed to instantiate our consumer class, mention the specific topic, instantiate the JSON deserializer, and create a loop that prints the student data unless interrupted from the keyboard.
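The dict_to_student helper referenced by the deserializer is not shown in the listing; a minimal sketch consistent with the description might be:

def dict_to_student(data, ctx):
    # rebuild a Student object from the deserialized dictionary
    if data is None:
        return None
    return Student(data["student_id"], data["student_name"], data["course"])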
When you run this program
python3 consumer.py
it would print one line for each student record read from the topic.
Protobuf Serializer
As we saw earlier, Protobuf stands for Protocol Buffers. Protobuf is developed by Google
and considered to be more efficient in terms of speed and size of serialized data. To
be able to use the Protobuf serializer, you need to follow these steps:
–– First off, define your Protobuf schema in a schema definition file. You need to define the schema and save the file with the “.proto” file extension. These proto schema files are very much human readable and contain information in a much simpler form when compared with JSON documents.
–– Next, compile the schema file with the Protobuf compiler (protoc), which generates language-specific bindings from the .proto file.
–– Finally, you have to import the generated file into your main code. In our case, we obtain a Python file. This Python file needs to be imported in our main publisher or subscriber program.
/*
Protobuf person model
---------------------
*/
syntax = "proto3";
package com.mykafka.personmodel;
message Person {
string name = 1;
int32 age = 2;
float temperature = 3;
bool needsParkingPass = 4;
optional string email = 5;
}
As you can see, the schema reads much more easily than JSON. You can define the data types and names of the fields and mark a field that does not have to be filled with the keyword “optional.” The numbers assigned at the end of each line are field numbers.
To investigate further, you can visit https://protobuf.dev. If you experience issues with installing protoc on your workstation, I suggest obtaining the binaries from the project's Git repository and copying the OS-appropriate executable into a separate folder. Then, open a terminal and run the following command within that folder.
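Assuming the schema file is named myschemafile.proto (matching the generated file name mentioned below), the command would look like this:

protoc --python_out=. myschemafile.proto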
It is essential to keep the proto file within the same folder. You will get a Python
script, carrying the same model name suffixed with “pb2” like the following:
myschemafile_pb2.py
This file will be imported in our Kafka code. Let us look at an illustrated example
using the Protobuf serializer. Here we shall utilize the same student model that we saw
previously. Here is how the Protobuf model may look like:
/*
Protobuf student model
----------------------
student model to interact with confluence kafka cluster
*/
syntax = "proto3";
package com.kafka.studentmodel;
message student {
int32 student_id = 1;
string student_name = 2;
string course = 3;
}
message student_data {
repeated student student_records = 1;
}
“Repeated” is a Protobuf syntax meaning that the student_data can contain multiple
“student” entries.
Now, let us compile this proto model using the Protobuf compiler.
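Assuming the schema above is saved as model.proto, the compile step would be:

protoc --python_out=. model.proto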
This produces a Python module named model_pb2.py, which looks like the following:
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0bmodel.
proto\x12\x16\x63om.kafka.studentmodel\"C\n\x07student\x12\x12\n\nstudent_
id\x18\x01 \x01(\x05\x12\x14\n\x0cstudent_name\x18\x02 \x01(\t\x12\x0e\n\
x06\x63ourse\x18\x03 \x01(\t\"H\n\x0cstudent_data\x12\x38\n\x0fstudent_
records\x18\x01 \x03(\x0b\x32\x1f.com.kafka.studentmodel.studentb\
x06proto3')
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'model_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
DESCRIPTOR._options = None
_globals['_STUDENT']._serialized_start=39
_globals['_STUDENT']._serialized_end=106
_globals['_STUDENT_DATA']._serialized_start=108
_globals['_STUDENT_DATA']._serialized_end=180
# @@protoc_insertion_point(module_scope)
With the generated module in place, the publisher script looks like this (the imports, the callback, and the final produce call are restored here for completeness):

from confluent_kafka import Producer

import model_pb2  # module generated by the Protobuf compiler
from config import config

class student:
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course

def callback(err, msg):
    # report success or failure of publishing each message
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered message to {msg.topic()} [{msg.partition()}]")

producer = Producer(config)
student_data = model_pb2.student_data()

data = [
    student(55052, "Johndoe", "Chem101"),
    student(55053, "TestStudent", "Physics"),
    student(55054, "JaneDoe", "Biology")
]

for i in data:
    student_record = student_data.student_records.add()
    student_record.student_id = i.student_id
    student_record.student_name = i.student_name
    student_record.course = i.course

serialized_data = student_data.SerializeToString()

# publish the serialized bytes to the topic and wait for delivery
producer.produce("student_data", value=serialized_data, on_delivery=callback)
producer.flush()
We have imported the necessary classes and methods, including the Python file generated by the Protobuf compiler. Then we create a student data class, followed by a callback function that writes messages to the console on success or failure of publishing messages to the topic within the Kafka cluster. We then instantiate a producer class and the student data. This is followed by iteratively adding the test bed of data. Upon serialization, the data is published to the "student_data" topic within the Kafka cluster.
The SerializeToString() method is provided by the Python Protobuf library. It is a direct way to serialize a Protobuf message into a binary string. It does not integrate well with the schema registry.
Here’s how to install the Protobuf library:
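A minimal installation with pip; the confluent-kafka extra shown here is the one that pulls in Protobuf serializer support for schema-registry-aware pipelines:

pip install protobuf
pip install "confluent-kafka[protobuf]"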
Now let us look at the consumer or subscriber portion of this Kafka project:
from confluent_kafka import Consumer
import model_pb2

# config is the same connection settings dict used for the producer
def set_consumer_configs():
    config['group.id'] = 'student_group'
    config['auto.offset.reset'] = 'earliest'

set_consumer_configs()
consumer = Consumer(config)
consumer.subscribe(['student_data'])

try:
    while True:
        event = consumer.poll(1.0)
        if event is None:
            continue
        if event.error():
            print(f"Error occurred at {event.error()}")
            continue
        student_data = model_pb2.student_data()
        student_data.ParseFromString(event.value())
        print(student_data)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
In this example, we have imported the necessary classes and methods, set the additional required configurations for the consumer class, and started listening on the given topic. We then write an infinite loop that prints the messages read from the topic within the Kafka cluster. The ParseFromString() method is from the Protobuf Python library and enables you to deserialize the binary object back into structured data. I hope these illustrations help strengthen your grasp of the foundational thinking and the most commonly used classes and methods in Kafka programming.
Stream Processing
So far we have seen how Kafka can publish and subscribe to events. We also saw that Kafka stores events in a Kafka cluster. In this section, we will look at how Kafka can process and analyze the events that are stored in the Kafka cluster. This is what we call stream processing in Kafka. Let us define an event. It is simply something that occurred at a given point in time. It could be anything from a student signing up for a course in an online system to a television viewer voting for their favorite performer in a reality TV show. Events are stored in immutable commit logs, where only append operations are permitted. You cannot undo an event; in the case of the student example, you can withdraw from the course that you have signed up for, but remember that the withdrawal becomes another event.
Stream processing can be performed with two approaches: one is Kafka Streams and the other is KSQL. Kafka Streams is a Java library that provides various methods for building stream processing applications, whereas KSQL is a module that enables building stream processing pipelines in a SQL dialect. ksqlDB is a database that is purpose-built for storing, processing, and analyzing events from stream processing applications.
ksqlDB is not a relational database, and it does not replace relational databases at all. There is no concept of indexes in ksqlDB. The tables in ksqlDB are different from tables in relational database management systems; a ksqlDB table can be seen more as a materialized view than a table. Streams can also be defined as sets of events that you can derive a table from, for instance, the price of a company's stock listed/traded on a stock exchange in a given time interval. The sequence of stock prices over the time period is a stream. The same stream can be seen as a table describing the most recent stock price of that company.
–– Creating branches, where the input streams are branched into multiple output streams based on criteria applied to the message
ksqlDB separates its work into two services: ksqlDB, which performs the computation, and Kafka, which constitutes the storage. Within ksqlDB, you can stream data both from other systems into Kafka and from Kafka to other downstream systems.
It may be good to reiterate that ksqlDB is a separate package from Apache Kafka. And so, we will continue to use our existing clusters and may create new streams.
Let us now set up a ksqlDB cluster so we can write streaming data pipelines. To get started, log in to Confluent Cloud, select the appropriate environment, choose the right cluster, and click ksqlDB.
You may choose to create a cluster on your own or start with the tutorial. The goal
is to set up a development instance to get started with the stream data processing
pipelines. You may choose to enable global access for your ksqlDB cluster and ensure
appropriate privileges to the schema registry.
You may choose the lowest number of Confluent streaming units and launch the cluster. Here is how that looks:
Do wait a few minutes for your cluster to be provisioned. You'll know the cluster is ready when the status changes from "Provisioning" to "Up." Once we have ksqlDB up and running, we can set up the command line interface for both Confluent and ksqlDB.
The easiest and most straightforward method is to spin up a Docker container containing the Confluent platform. You can visit the appropriate page, copy the Docker Compose file, and run those commands to get up and running. However, it is worth installing these packages from the terminal, since we also get hands-on experience installing a package that does not ship with an installer. Let us start by looking at installing the Confluent CLI on a Debian-based operating system.
The first step is to install the prerequisites. Let's check whether you have Java installed; observe the output of the following command:
java --version
Let us import the public key of the ksqlDB package. This is to verify the authenticity
of the package, part of a security mechanism known as cryptographic signing and
verification:
Let us add the ksqlDB into our apt repository and run an update:
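A sketch following Confluent's documented apt setup (the archive download shown next can be used as an alternative); verify the exact repository path and key URL against the current Confluent documentation:

wget -qO - https://packages.confluent.io/deb/7.5/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/7.5 stable main"
sudo apt-get update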
curl -O https://packages.confluent.io/archive/7.5/confluent-7.5.3.tar.gz
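After the download completes, extract the archive and point CONFLUENT_HOME at the extracted folder (the path below is an assumption; use wherever you extracted it):

tar -xzf confluent-7.5.3.tar.gz
export CONFLUENT_HOME=/home/parallels/confluent-7.5.3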
Copy the file path to set the environment variable and path:
export PATH=$PATH:$CONFLUENT_HOME/bin
Enter the following help command on the terminal to check whether the installation succeeded:
confluent --help
Figure 9-10. Help screen for the Confluent command line interface
If you get something similar to this screen, then you have succeeded in installing the Confluent platform.
Now it is time to set up the environment so you can interact with the ksqlDB cluster
in the cloud. Within the dashboard, click ksqlDB and choose CLI instructions. You will
see instructions on logging in to Confluent Cloud CLI and setting the context to your
environment and cluster:
confluent login
confluent environment use env-0jvd89
confluent kafka cluster use lkc-555vmz
Now, enter the following so you can access the ksqlDB command line interface;
don’t forget to enter the appropriate key and secret:
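A sketch of the documented pattern for the ksql CLI, with placeholders for your endpoint and credentials:

ksql -u <KSQLDB_API_KEY> -p <KSQLDB_API_SECRET> <KSQLDB_ENDPOINT_URL>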
Now let us start with some basic stream applications. A stream associates a schema
with an underlying Kafka topic. Let us begin by creating a topic in the Kafka cluster. Visit
the cluster’s dashboard and click “Topics.” Here is how that looks:
Select "Add topic" and name the topic "students." Leave the other fields as is and select "Create with defaults." Here is how that looks:
Figure 9-14. Naming and specifying partitions for the new topic
Select the option of creating a schema for message values. The option for creating
a schema for message keys may help in the production environment when it comes
to achieving consistency on structure of keys and routing to appropriate partitions
(messages with the same key are sent to the same partition).
syntax = "proto3";
message students {
int32 id = 1;
string name = 2;
string course = 3;
}
There is also the option of creating a stream from the ksqlDB command line
interface. Let us look at that approach in detail. Using CLI, we will create a new stream
called students. Locate the Editor tab within the ksqlDB and enter the following code to
create a new students stream. Ensure you set the “auto.offset.reset” property to earliest.
Here is how that looks:
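A sketch of what the statement might look like, assuming the students topic and Protobuf value schema created above:

SET 'auto.offset.reset' = 'earliest';

CREATE STREAM students (id INT, name VARCHAR, course VARCHAR)
  WITH (KAFKA_TOPIC = 'students', VALUE_FORMAT = 'PROTOBUF');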
INSERT INTO students (id, name, course) VALUES (5, 'John', 'Physics');
INSERT INTO students (id, name, course) VALUES (6, 'Joe', 'Math');
INSERT INTO students (id, name, course) VALUES (7, 'Jane', 'Physics');
INSERT INTO students (id, name, course) VALUES (8, 'Jan', 'Physics');
INSERT INTO students (id, name, course) VALUES (10, 'Alice', 'Math');
INSERT INTO students (id, name, course) VALUES (12, 'Al', 'Chemistry');
INSERT INTO students (id, name, course) VALUES (4, 'Moe', 'Chemistry');
Now we can try to run a query on this stream. Here is a sample query for you to get
started:
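A minimal push query to get started; the EMIT CHANGES clause keeps the query running and streams new rows as they arrive:

SELECT * FROM students EMIT CHANGES;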
Notice how you can view the results on the dashboard within your browser. Let us also try to access the Kafka cluster through the command line interface. Log in with the Confluent CLI, which we installed earlier, and access ksqlDB using the same commands as before.
In the ksqlDB prompt, enter the same push query and keep it running.
Back in the browser editor, click "Run query." Notice the terminal where the SQL results stay persistent. You will likely see the following:
We have just witnessed how these queries are not quite relational database queries but rather continuous queries over streams. As soon as we insert data into the stream, the same data pops up in our persistent stream output.
Kafka Connect
We saw in our earlier discussion how Kafka Connect helps us scale and stream data reliably between Apache Kafka and other data systems. Kafka Connect offers components called connectors that connect with other data systems to enable and coordinate data streaming. There are two types of connectors, namely, source connectors and sink connectors. Source connectors ingest data from external data systems and stream it to Kafka topics, whereas sink connectors deliver data from Kafka topics to external data systems.
The Kafka Connect service can run in standalone mode or be deployed in distributed mode for production purposes. Kafka Connect is an important aspect of streaming data/ETL (extract, transform, and load) pipelines, especially when you have Kafka in the architecture. You can connect to an external data system using the Kafka Connect service through ksqlDB. Please see the example Oracle database connector below; the fields with values enclosed in angle brackets need to be populated with the respective database credentials:
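A sketch of a source connector definition in ksqlDB; it assumes the Confluent JDBC source connector class, and the bracketed values are placeholders:

CREATE SOURCE CONNECTOR oracle_source WITH (
  'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'           = 'jdbc:oracle:thin:@<host>:<port>/<service-name>',
  'connection.user'          = '<username>',
  'connection.password'      = '<password>',
  'topic.prefix'             = 'oracle-',
  'mode'                     = 'incrementing',
  'incrementing.column.name' = '<id-column>'
);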
Best Practices
As you can see, we have illustrated Kafka streaming data pipelines using a commercial distribution, Confluent. At the time of writing this chapter, the vendor provided several hundred dollars of cloud credits that can be applied toward their services. This way, anyone who is looking to get their feet wet with Kafka can use those credits to run sample clusters. At the same time, it is also good practice to be mindful of unused Kafka clusters running in the background. Here are some commands that can be used toward cleaning up unused Kafka resources:
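A few cleanup commands, assuming the resources created in this chapter; the ksqlDB cluster id placeholder is yours to fill in:

confluent kafka topic delete students
confluent ksql cluster delete <ksqldb-cluster-id>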
You can also asynchronously delete the respective topic backing the table or stream:
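For example, dropping the stream and asynchronously deleting its backing topic in one statement:

DROP STREAM students DELETE TOPIC;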
Conclusion
It is safe to say that Apache Kafka represents a paradigm shift in how we approach building real-time data pipelines and stream analytics solutions. The architecture is scalable, distributed, and fault tolerant, which provides a reliable foundation for building and managing complex data pipelines that can handle modern data streams.
We discussed the Kafka architecture, its various components, and how they all work together. We looked at publisher–subscriber models and how events are durably stored in topics. We then looked at streaming data pipelines using ksqlDB and how the data can be analyzed to gain business insights. I hope the skills and knowledge you gained from this chapter serve you well in building real-time data pipelines.
CHAPTER 10
Engineering Machine
Learning and Data REST
APIs using FastAPI
Introduction
In the complex world of data engineering and machine learning modeling, where there are several moving parts (as you have seen so far), it is essential to deliver data services and deploy machine learning models for consumption. In this chapter, we will discuss FastAPI. FastAPI is a Python library that primarily enables web application and microservice development. We will look at FastAPI with the sole intention of using the library for delivering data services and machine learning models as services. Though Django has won many hearts in the world of Python-based application development, FastAPI is simple, strong, and very powerful. It is important for an ML engineer or data engineer to understand the concept of application programming interfaces.
By the end of this chapter, you will learn
By the end of this chapter, you will learn
OpenWeather API
OpenWeather API is a data service provided by OpenWeather, a company that provides weather data to individuals and organizations through REST API calls. The OpenWeather API enables engineers to access a wide range of weather data programmatically, including current weather data, forecasts, and historical data for various locations across the world. They provide various API endpoints for current weather data, forecast weather data, daily aggregations, historical data, and so on. You can register for the OpenWeather API at the following website to get an access key. You can then obtain various data through the use of the access key:
https://openweathermap.org/api
Types of APIs
Not all APIs are made the same way; neither do they serve the same constituents. The most common class of APIs is the open (or public) API. These APIs are provided by various organizations, and they are free to access and utilize. Engineers can access and utilize the services without any restrictions. The other class of APIs is internal APIs. These APIs once again deliver services; however, they are internal to an organization. They are not exposed to the public. Instead, they are accessed and utilized to fetch data and functionality by various teams within a given organization only. There is another variation of internal APIs, called partner APIs. These APIs also carry some level of confidentiality; however, they may cater to the organization and some of its partner organizations as well.
In addition to these classes and the purposes they serve, there are various styles of accessing APIs; some of them are as follows.
SOAP APIs
SOAP stands for Simple Object Access Protocol. SOAP APIs have long been in existence and are used in enterprise-grade applications. SOAP is primarily meant for exchanging structured data between two applications. SOAP uses either HTTP or SMTP to transport the message and an XML template for message formatting.
REST APIs
REST stands for representational state transfer. REST is an architectural style for building
lightweight and scalable APIs. They leverage HTTP methods for communication.
Some of the HTTP methods are GET, PUT, POST, etc. REST APIs return data in JSON
or XML format. These APIs are commonly used in web-based applications and mobile
applications.
GraphQL APIs
GraphQL stands for Graph Query Language. As the name suggests, GraphQL uses query-based requests for fetching data, making sure clients fetch just the data they need. The emphasis of GraphQL is on avoiding overfetching. While that information may be a bit far-fetched for you (depending upon your background), it is best to note that such an API style exists.
Webhooks
Webhooks are another type of API that enables real-time communication between systems. Like REST APIs, webhooks communicate using HTTP requests.
Endpoints
As mentioned, APIs are intermediaries; they serve as a bridge between two parties. APIs provide a mechanism for clients to initiate a conversation with a server. One such mechanism is the URL or URI, namely, uniform resource locator or uniform resource identifier. It contains the address of the server so that requests are made to the right address.
Let's take the following example:
Jane Doe,
Accounting Department,
ABC Company,
Say one sends physical mail to this address; it would go to the respective town, to the concerned organization (a hypothetical one, in this case), to the respective department, and to the person addressed in that department.
A REST API endpoint is similar to the address of the person just mentioned.
https://developer.MyFavoriteMovieTheatre.com/api/movie-theatre/
showtimes?movie_id=5&date=2023-06-06
Here,
https://developer.MyFavoriteMovieTheatre.com is the base URL of the API.
/api/movie-theatre/showtimes represents the path for retrieving showtimes.
movie_id=5 and date=2023-06-06 are query parameters. They allow you to specify which movie you want to see and on what date. The movie_id parameter corresponds to a movie shown on a certain day.
This whole URL is called an endpoint. Endpoints are addresses, just like the physical address mentioned previously. By stating the correct base URL, the correct path, and appropriate query parameters, the movie theatre server will be able to process your request and return the response.
APIs usually require authentication. This is done to ensure that only authorized users or applications access their functionality. There are several ways to authenticate with an API; some of the common ones are API keys, tokens, and other authentication mechanisms.
APIs move data from source to sink and provide an easier way to ingest data.
REST API
As discussed, REST expands to representational state transfer. It is a type of web service that complies with REST architectural standards. REST architectural standards leverage client–server communication over a given network. REST architecture is stateless, meaning that the server does not maintain session information for a request from a client. Instead, the request message itself contains just enough information for the server to process the request.
RESTful services listen for HTTP methods and retrieve the response from the server. HTTP stands for HyperText Transfer Protocol. The HTTP methods tell the server which operation to perform in order to retrieve the response for a request. Two useful properties of HTTP methods are safety and idempotence. Safe HTTP methods do not change the data on the server and return the same or similar response regardless of the number of times they have been called. Idempotent methods may change the data on the server, but calling them multiple times with the same request has the same effect as calling them once. Some of the more common HTTP methods (or verbs) include
–– GET: Reads data from the server and does not alter or change
resources in any way
GET /multiplexapi/movies/
–– Here is a HEAD request to see the last update time for movies:
HEAD /multiplexapi/theatreslist
POST /multiplexapi/bookings
–– PUT: Replaces current data with new data; can be seen as performing
an update operation
OPTIONS /multiplexapi/bookings
CONNECT app.multiplexapi.com:3690
Table 10-1. Some common HTTP status codes and their descriptions
HTTP Status Code Description
These web services expose their data or a service to the outside world
through an API.
To build APIs in Python, there are quite a few packages available. Flask is a Python library known for building simple, easy-to-set-up, smaller APIs. Other popular options are Django and FastAPI. Django has been around for a long time, has more functionality, and can be a better option than Flask for larger applications. In this chapter, we will take a deep dive into FastAPI.
FastAPI
FastAPI provides certain advantages over other Python libraries. First off, FastAPI supports Python's asynchronous programming through asyncio. FastAPI automatically generates documentation based on the OpenAPI standard, allowing the users of the API to explore the documentation and test the various endpoints. Most important of all, as the name suggests, FastAPI is high performance and faster than many other Python web frameworks.
FastAPI does not come with a web server; let us also install a web server:
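A minimal installation with pip, assuming a fresh environment; uvicorn is the ASGI server we will use to run the app:

pip install fastapi
pip install "uvicorn[standard]"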
Assuming you are using a virtual development environment, I also encourage you to
consider this command:
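This is the documented extra that pulls in all optional dependencies:

pip install "fastapi[all]"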
This will install all the optional dependencies used by the FastAPI, Pydantic, and Starlette Python libraries, which include the "uvicorn" web server and the asynchronous HTTP client "httpx."
Now let us consider a simple hello world application using FastAPI. Here is how that looks:

from fastapi import FastAPI, APIRouter

app = FastAPI()
api_router = APIRouter()

@api_router.get("/")
def root() -> dict:
    return {"msg": "Hello, World!"}

app.include_router(api_router)
Core Concepts
FastAPI is based on ASGI standards and Starlette. FastAPI utilizes ASGI servers to deliver its asynchronous capabilities. ASGI stands for Asynchronous Server Gateway Interface. ASGI is a relatively new web standard for servers to asynchronously communicate with clients. FastAPI supports the Hypercorn and Uvicorn web servers as they are compatible with the ASGI standard.
Starlette, on the other hand, is an ASGI framework that provides the core functionality of FastAPI. Starlette provides the core routing infrastructure, handling of requests and responses, database integration, and many other features that are leveraged by FastAPI. On top of Starlette and the ASGI standard, FastAPI natively provides data validation, documentation generation, etc., so engineers can focus on the business logic of things.
https://www.amazon.com/Sweetcrispy-Ergonomic-Computer-Comfortable-Armrests/dp/B0C61DLQNH/ref=mp_s_a_1_5?crid=1CE83JNY3DEDI&keywords=chair&qid=1703648878&refinements=p_36%3A1253524011&rnid=386465011&s=home-garden&sprefix=cha%2Caps%2C478&sr=1-5

/Sweetcrispy-Ergonomic-Computer-Comfortable-Armrests
/dp
/B0C61DLQNH
/ref=mp_s_a_1_5
?
crid=1CE83JNY3DEDI &
keywords=chair &
qid=1703648878 &
refinements=p_36%3A1253524011 &
rnid=386465011 &
s=home-garden &
sprefix=cha%2Caps%2C478 &
sr=1-5
When you are writing a path operation function, you can declare path parameters whose values are passed to the function when it is called. Here is how that looks:
Students = [
    {
        "Student_id": 1,
        "Student_name": "StudentPerson1",
        "Course": "BIO 101",
        "Department": "Biology"
    },
    {
        "Student_id": 2,
        "Student_name": "StudentPersonTwo",
        "Course": "CHEM 201",
        "Department": "Chemistry"
    },
    {
        "Student_id": 3,
        "Student_name": "StudentPerson3",
        "Course": "MATH 102",
        "Department": "Mathematics"
    },
]
api_router = APIRouter()

The main function defined earlier has not changed much in its structure. Let us add a new function called "get_student" to retrieve student records. Notice the use of the decorator that includes a variable. The value of that variable or parameter is passed at runtime:

@api_router.get("/")
def hello() -> dict:
    return {"msg": "Welcome to the root page"}

@api_router.get("/student/{Student_id}")
def get_student(Student_id: int) -> dict:
    for s in Students:
        if s["Student_id"] == Student_id:
            result = s
    return result

app.include_router(api_router)
Let's run this code in real time. Here is how we run it:
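A typical way to launch the app with uvicorn, assuming the code above lives in a file named main.py and the FastAPI instance is called app:

uvicorn main:app --reload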
Note You can even specify the data type of the path parameter. In the earlier example, where the get_student() function has a path parameter, the data type of the path parameter is also defined.
If you enter a value whose data type does not match the one defined in the code, FastAPI will throw an HTTP error, like the following one:
Here is an example:

from fastapi import FastAPI

app = FastAPI()

Students = [
    {
        "Student_id": 1,
        "Student_name": "StudentPerson1",
        "Course": "BIO 101",
        "Department": "Biology"
    },
    {
        "Student_id": 2,
        "Student_name": "StudentPersonTwo",
        "Course": "CHEM 201",
        "Department": "Chemistry"
    },
    {
        "Student_id": 3,
        "Student_name": "StudentPerson3",
        "Course": "MATH 102",
        "Department": "Mathematics"
    },
]

@app.get("/students/")
def read_item(skip: int = 0, limit: int = 10):
    return Students[skip : skip + limit]
This example would perform two filters, namely, skipping a number of records and
limiting the number of results.
Here is how the URL looks:
http://127.0.0.1:8000/students/?skip=1&limit=2
The query parameters are skip and limit; we intend to skip the first record and limit
the total results to two records only. So the output would look something like this:
Pydantic Integration
As we saw in a previous chapter, Pydantic is a simple and powerful data validation tool.
FastAPI integrates well with Pydantic.
Let us look at an example of Pydantic integration:
from typing import Sequence

from fastapi import FastAPI, APIRouter
from pydantic import BaseModel

app = FastAPI()
api_router = APIRouter()

Students = [
    {
        "Student_id": 1,
        "Student_name": "StudentPerson1",
        "Course": "BIO 101",
        "Department": "Biology"
    },
    {
        "Student_id": 2,
        "Student_name": "StudentPersonTwo",
        "Course": "CHEM 201",
        "Department": "Chemistry"
    },
    {
        "Student_id": 3,
        "Student_name": "StudentPerson3",
        "Course": "MATH 102",
        "Department": "Mathematics"
    },
]

class Student(BaseModel):
    Student_id: int
    Student_name: str
    Course: str
    Department: str

class retrieveStudent(BaseModel):
    Students: Sequence[Student]

class newStudent(BaseModel):
    Student_id: int
    Student_name: str
    Course: str
    Department: str

@app.get("/")
def hello():
    return {"welcome to the root"}

@api_router.get("/student/{Student_id}", response_model=Student)
def get_student(Student_id: int) -> dict:
    for s in Students:
        if s["Student_id"] == Student_id:
            return s

@api_router.post("/student", response_model=Student)
def create_student(incomingStudent: newStudent) -> dict:
    # build the stored entry from the incoming payload
    StudentEntry = Student(
        Student_id=incomingStudent.Student_id,
        Student_name=incomingStudent.Student_name,
        Course=incomingStudent.Course,
        Department=incomingStudent.Department,
    )
    Students.append(StudentEntry.dict())
    return StudentEntry

app.include_router(api_router)
The code follows the same pattern as the previous examples. The get method retrieves students by their Student_id. The post method creates a new student from the required parameters. Please note that this example is for illustrative purposes only; the posted data does not persist on disk.
Response Model
In the get and post functions we have defined, one can see the use of the "response_model" attribute. The response_model converts the returned data into the given Pydantic model while the function still returns whatever type it is designed to return. The response_model does not actually change the return type of the function. The response_model parameter tells FastAPI to convert the returned data into a Pydantic object and perform the necessary validations (wherever required). You can use the "response_model" parameter on the FastAPI decorator methods. Some of these methods are
app.get()
app.post()
app.delete()
… and so on.
There is also an option to set response_model to None. This enables FastAPI to skip the validation and filtering step that response_model is intended to perform. By setting the response model to None, you can still keep the return type annotation and continue to receive type checking support from your IDE or type checkers.
Open the interactive documentation window (served at the /docs path) and expand any of the HTTP methods to try out its functionality. Here is how that looks for "Get Student":
When you click "Try it out," the text box becomes live, so you can enter a given request for the server to process and retrieve the response. The page also shows the generated curl command for the request, the endpoint for the request, the status response message, and the response body. Here is how that looks:
FastAPI has a simple and intuitive dependency injection system. The method
“Depends” provides the dependency injection functionality for FastAPI. Let’s look at the
following example:
from fastapi import FastAPI, Depends

app = FastAPI()

def dbConnect():
    # database is assumed to be a connection object created elsewhere
    return database

@app.get("/users")
def get_users(db=Depends(dbConnect)):
    conn = db
    users = conn.fetch_all("SELECT * from users")
    return {"users": users}
In the preceding example, we have a simple API design where we have defined a GET request, which retrieves "users" when requested.
In order for the GET method to retrieve and respond with that piece of information, it requires a connection to a database. As you may notice, the database connection is defined as a separate function.
And so, that database connection gets passed as a parameter to the get_users function with the FastAPI method "Depends". This way, the get_users function calls the dbConnect() function, which establishes a connection to a database.
mysqld --version
mysql -u root -p
Enter the appropriate password and you will get the mysql prompt:
mysql>
show databases;
use your-database-name;
Let us create a database, create a table within that database, and insert data into
that table:
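A minimal sketch, using the database name (stu) and the students table columns that appear later in this section; AUTO_INCREMENT on id is an assumption so that inserts without an explicit id also work:

CREATE DATABASE stu;
USE stu;

CREATE TABLE students (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    course VARCHAR(100)
);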
show tables;
would yield
+---------------+
| Tables_in_stu |
+---------------+
| students |
+---------------+
insert into students (id, name, course) values (1, "test", "tescourse");
insert into students (id, name, course) values (2, "JohnDoe", "Physics101");
insert into students (id, name, course) values (3, "Joe", "Chemistry101");
Running select * from students; would then yield
+----+---------+--------------+
| id | name | course |
+----+---------+--------------+
| 1 | test | tescourse |
| 2 | JohnDoe | Physics101 |
| 3 | Joe | Chemistry101 |
+----+---------+--------------+
Let us also grant privileges to the new database user we have created:
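A sketch, using the username that appears in the connection code later (newscott); substitute your own password:

CREATE USER 'newscott'@'localhost' IDENTIFIED BY '<your-password>';
GRANT ALL PRIVILEGES ON stu.* TO 'newscott'@'localhost';
FLUSH PRIVILEGES;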
Note These commands are meant for simple database setup for development
purposes. You would have to use mysql_secure_installation and follow DBA best
practices for setting up production instances of databases.
Now create a new Python script called “db1.py” and enter the following code:
from fastapi import FastAPI
from mysql import connector

dbapp = FastAPI()

database_creds = connector.connect(
    host="localhost",
    user="newscott",
    password="Tiger@#@!!!9999",
    database="stu",
    auth_plugin='mysql_native_password')

@dbapp.post("/newstudent")
def newstudent(name: str, course: str) -> dict:
    cursor = database_creds.cursor()
    sql = "INSERT INTO students (name, course) VALUES (%s, %s)"
    val = (name, course)
    cursor.execute(sql, val)
    database_creds.commit()
    return {"message": "student added successfully"}
SQLAlchemy
SQLAlchemy consists of two major modules, Core and ORM. The Core module provides a SQL abstraction that enables smooth interaction with a variety of database application programming interfaces (also known as DBAPIs). The Core module does not require the ORM to be adopted to connect and interact with a database. The ORM module builds on top of Core and provides object-relational mapping, letting you work with database rows as Python objects.
Engine
To connect with any database using SQLAlchemy, you need to get acquainted with both the concepts of engine and session. In order to connect with a database, you need to create an engine. The engine is the starting point of a SQLAlchemy application that attempts to connect and communicate with a database. The create_engine() function creates the Engine object that manages new database connections.
Here is an illustration of connecting to a MySQL database:

from sqlalchemy import create_engine
from sqlalchemy.engine import URL

databaseURL = URL.create(
    "mysql+mysqlconnector",
    username="your_username_here",
    password="your_password_here",
    host="localhost",
    database="testdatabase",
)
engine = create_engine(databaseURL)
Session
The other concept is the session. As the name suggests, the session establishes conversations with the database after a connection is established. The session creates a workspace for all your queries and transactions (commits, rollbacks, etc.). If you are building a bigger application, it is recommended to utilize the sessionmaker class to create sessions. Here is an illustration of how to utilize "sessionmaker" to create a new session:

from sqlalchemy import create_engine
from sqlalchemy.engine import URL
from sqlalchemy.orm import sessionmaker

databaseURL = URL.create(
    "mysql+mysqlconnector",
    username="your_username_here",
    password="your_password_here",
    host="localhost",
    database="testdatabase",
)
engine = create_engine(databaseURL)

# create a session factory bound to the engine, then open a session from it
SessionFactory = sessionmaker(bind=engine, autoflush=True)
sampleSession = SessionFactory()

sampleSession.close()
The argument "bind" associates an engine with the session. The argument "autoflush" flushes pending changes to the database automatically when a query is issued or when you use certain methods of the Session class.
Query API
SQLAlchemy also provides a query API that enables performing various operations to
aid with data retrieval and manipulation. Some of these methods include but are not
limited to sorting, filtering, joining, fetching data from tables, etc. Let’s look at some SQL
to ORM examples.
Let us say we have a table called "students" in a relational database. Here is how a simple SQL query translates:

SELECT * FROM students;

would translate to

Students.select()
Let us look at a slightly more detailed query where we choose to retrieve students who have taken Chemistry 101 during Fall 2024:

Select
    *
from
    students
where
    term = 'Fall2024'
    and course = 'Chemistry101'

would translate to
students \
.select() \
.filter( \
students.has( \
term="Fall2024", \
course="Chemistry101", \
plan="something" \
)
).fetchmany(25)
Alembic
Furthermore, it is also important to know about database migrations and the Alembic
library. Database migration does not necessarily mean moving large amounts of data
between databases. Instead, database migration is about creating and maintaining both
the current and evolving schemas with the application. Alembic is a Python library
that helps manage and maintain these evolving schemas within the SQLAlchemy
application.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, declarative_base

SQLAlchemy_DB_URL = "mysql+mysqlconnector://newscott:Tiger$$$!!!9999@localhost/stu"

engine = create_engine(SQLAlchemy_DB_URL)

SessionLocal = sessionmaker(autocommit=False,
                            autoflush=False,
                            bind=engine)

Base = declarative_base()
Next, we will define the class that corresponds to the respective table in the database:
class Student(Base):
__tablename__ = "students"
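The column definitions are not shown here; a minimal sketch consistent with the students table used throughout this chapter might look like this (Column, Integer, and String come from sqlalchemy):

from sqlalchemy import Column, Integer, String

class Student(Base):
    __tablename__ = "students"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(100))
    course = Column(String(100))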
This will be followed by creating a data validation class using the Pydantic library. Here is how that looks:

import pydantic as pyd

class NewStudent(pyd.BaseModel):
    id: int
    name: str
    course: str
Let us now develop various functions that would become the functionality of our
API. In this case, we can obtain a list of all students, obtain details of a student by name
or id, or create a new student and commit the transaction as well:
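The function bodies are not shown; here is a minimal sketch of what a crud.py exposing the four functions imported below (get_AllStudents, get_student_by_name, get_student_by_id, new_student) might look like; treat the implementations as assumptions:

from sqlalchemy.orm import Session

from model import Student
from pydanticSchema import NewStudent


def get_AllStudents(db: Session):
    # return every row from the students table
    return db.query(Student).all()


def get_student_by_name(db: Session, name: str):
    return db.query(Student).filter(Student.name == name).first()


def get_student_by_id(db: Session, id: int):
    return db.query(Student).filter(Student.id == id).first()


def new_student(db: Session, student: NewStudent):
    # create, persist, and return the new student row
    db_student = Student(id=student.id, name=student.name, course=student.course)
    db.add(db_student)
    db.commit()
    db.refresh(db_student)
    return db_student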
Once we have all the functionalities and moving parts, let us now create the FastAPI
application:
from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session

# SessionLocal and engine come from the setup code shown earlier;
# here we assume it lives in a module named database.py
from database import SessionLocal, engine

import model
from crud import get_AllStudents, get_student_by_name, get_student_by_id, new_student
from pydanticSchema import NewStudent

model.Base.metadata.create_all(bind=engine)

app = FastAPI()

# Dependency
def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/student/{name}")
def get_stud_by_name(student: NewStudent,
                     db: Session = Depends(get_db)):
    try:
        byname = get_student_by_name(db, name=student.name)
        return byname
    except Exception:
        raise HTTPException(status_code=400)

@app.get("/students/")
def get_allstudents(db: Session = Depends(get_db)):
    try:
        allstudents = get_AllStudents(db)
        return allstudents
    except Exception:
        raise HTTPException(status_code=400)

@app.get("/student/{id}")
def get_stud_by_id(student: NewStudent,
                   db: Session = Depends(get_db)):
    try:
        by_id = get_student_by_id(db, id=student.id)
        return by_id
    except Exception:
        raise HTTPException(status_code=400)

@app.post("/students/")
def create_new_student(student: NewStudent,
                       db: Session = Depends(get_db)):
    # renamed so the route does not shadow the imported crud new_student function
    student_id = get_student_by_id(db, id=student.id)
    if student_id:
        raise HTTPException(status_code=400)
    return new_student(db=db, student=student)
Once we have the functionalities and code ready, let us test the RESTful data API
application. Simply run
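Assuming, as before, that the application file is named main.py and the FastAPI instance is named app:

uvicorn main:app --reload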
You can open a browser tab and enter http://127.0.0.1:8000/students/ to GET all the
students.
Here is how that may look like:
Middleware in FastAPI
A middleware is a small function or set of programs that is placed between the provider and the consumer, and it works with both the request and the response. With every request that is received, prior to being sent to the application for processing, the middleware can be configured to perform a specific operation. Likewise, it can take a response, perform certain operations, and send that response on to the consumer. Common examples of middleware are performing authentication and writing a metadata log entry on what has been requested and responded. Middleware also helps limit the number of requests from a specific consumer.
Let us look at a simple example of a middleware implementation. Imagine we have a machine learning model deployed as a service that is open to the public. We want to implement a rate-limiting middleware that caps the number of requests originating from a specific consumer per hour (five per hour in the code that follows). On a side note, this is a good example of an async/await implementation.
Note Please note that the focus is more on the FastAPI implementation for our machine learning app illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pickle
import numpy as np

df = pd.read_csv("/<your-path-to>/iris.csv")

x = df.iloc[:, :4]
y = df.iloc[:, 4]

# the train/test split is implied by the x_train/y_train variables below;
# the split ratio here is an assumption
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
#print(confusion_matrix(y_test,y_pred))
#print(accuracy_score(y_test, y_pred)*100)

# persist the trained model so the FastAPI app can load irismodel1.pkl later
pickle.dump(model, open('irismodel1.pkl', 'wb'))

# quick sanity check with the pickled model; input_data is a sample four-feature row
model1 = pickle.load(open('irismodel1.pkl', 'rb'))
input_data = (5.1, 3.5, 1.4, 0.2)
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
prediction = model1.predict(input_data_reshaped)
Once we have the model pickle file, we can proceed to build our application in FastAPI with a custom middleware that limits access to the API after a given number of requests. We start off by importing the necessary classes and methods, followed by instantiating FastAPI, loading the model, and creating a dictionary to store the IP addresses and their number of hits. Then we define the middleware function, where we obtain the IP address from the request object and increment the hits for each IP address. The middleware is coded to ensure that if the number of hits goes beyond 5 in a given hour, then it will throw an exception:
import json
import pickle
from datetime import datetime, timedelta

from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()
model = pickle.load(open('irismodel1.pkl', 'rb'))

# creating a new register for the IP addresses
registry: dict[str, tuple[int, datetime]] = {}

@app.middleware("http")
async def rateThreshold(request: Request, call_next):
    ip_address = request.client.host
    rightnow = datetime.now()
    if ip_address in registry:
        hits, timestamp = registry[ip_address]
        if rightnow - timestamp > timedelta(hours=1):
            hits = 0
    else:
        hits = 0
    hits += 1
    if hits > 5:
        raise Exception("too many requests")
    registry[ip_address] = (hits, rightnow)
    response = await call_next(request)
    return response

class ModelInput(BaseModel):
    SLength: float
    SWidth: float
    PLength: float
    PWidth: float

@app.post("/predict/")
def predict(modelparam: ModelInput):
    modelparameters = json.loads(modelparam.json())
    sl = modelparameters['SLength']
    sw = modelparameters['SWidth']
    pl = modelparameters['PLength']
    pw = modelparameters['PWidth']
    predictionclass = model.predict([[sl, sw, pl, pw]])
    return predictionclass[0]
Figure 10-9. A POST request for the machine learning model API
You can start inputting the values and let the system classify (predict) the iris plant.
Here is how that looks:
Figure 10-10. FastAPI computing the prediction for the input parameters and
returning the response body
Conclusion
APIs, or application programming interfaces, are one of the growing ways of consuming data analytics and machine learning services. API-based development and deployment is an ocean of its own; it is valuable for an ML engineer or a data engineer to be able to set up and build an API. In this chapter, we looked at the concepts and architecture of APIs. We went into greater detail on the FastAPI architecture and looked at a variety of illustrations. We moved on to database integration, building a data service REST API. We also looked at deploying a machine learning model as a REST API. Not only have you mastered the core concepts of FastAPI, but you have also acquired a skill set that bridges the gap between complex engineering and service delivery. Overall, I hope this chapter served you well in understanding how to engineer and deploy data pipelines using FastAPI.
CHAPTER 11
Getting Started with Workflow
Management and Orchestration
1. Ingredient mixing, where the dough, yeast, water, and salt are
mixed together in specific, measured proportions in giant mixers.
This is the most basic example of a workflow. In the steps listed earlier, one cannot interchange the steps; they have to be followed in exactly the same order. Furthermore, the output of a given step constitutes the input of the next step, and the output of that step becomes the input of the subsequent step, and so on.
A recent variant of ETL is referred to as ELT. ELT stands for extract, load, and transform. In this case, data from various sources is extracted and loaded into the target system. From there on, various transformations and business rules are applied directly to the data. There may or may not be a staging layer in an ELT workflow. ELT workflows are more common with cloud data stores. ELT is better suited when you are dealing with diverse data types. Most cloud providers offer serverless data lake/warehouse solutions that serve ELT pipelines, making them cost effective and easy to maintain.
Let us take an example of what an ELT data pipeline may look like. Let's say relational database data is being sourced from an inventory system for various products, along with JSON data of log sensor readings from the warehouse and a few other datasets. Obtaining this data at frequent intervals constitutes the extract stage. For the load stage, imagine we have Google BigQuery, where we dump all the data. The transform stage is where the data gets cleaned, wrangled, and either integrated into a unified dataset or loaded into dimensional tables for further analysis.
Here is how this may look:
Figure 11-3. A sample ELT architecture for inventory data processing pipelines
Workflow Configuration
Workflow configuration is another key concept to understand. Workflow configuration is basically a set of configuration parameters that drive the workflow processes. Workflow configuration is essential and crucial to the success of delivering a product or solution, in this case, bread.
In this example, workflow configuration might involve the following aspects:
Workflow Orchestration
Workflow orchestration refers to the centralized administration and management of the entire process. Workflow orchestration involves planning, scheduling, and controlling each step of the production process and its respective workflow configuration parameters, in order to ensure that all tasks are executed within the defined controls and in the appropriate order. Workflow orchestration optimizes the utilization of resources, helps plan the procurement of raw materials, and establishes high quality and consistency across each of the products developed, eventually leading to the seamless functioning of the organization. In this context, workflow orchestration would refer to
1. Defining the entire workflow: Where each step of the processes is
defined with configuration parameters baked in.
2. Scheduling: Where the defined workflow will be scheduled,
wherein a process will start upon completion of a process in the
assembly chain.
3. A central control room that oversees the entire shop floor from
ingredient mixing to packing bread in boxes where a forklift
operator would place these boxes in trucks.
Concepts
At the heart of the cron scheduler, we have something called the cron daemon process.
This process runs on system startup and in the background. The cron daemon executes
commands that are mentioned in the file called crontab. There are two files that specify
the users and their access to the cron scheduler environment.
These may be located at
/etc/cron.allow
/etc/cron.deny
/etc/cron.d/cron.allow
/etc/cron.d/cron.deny
Note In the case of Ubuntu and Debian operating systems, cron.allow and cron.
deny files do not exist, meaning all users of Ubuntu and Debian operating systems
are provided access to the cron scheduler by default.
The most basic unit of a cron scheduler environment is a cron job. A cron job
represents a task or job that is scheduled to be executed. A cron job is one line in a
crontab, representing one execution of a shell command.
Crontab File
Crontab is short for cron table. Crontab is a utility that maintains crontab files for users. A
crontab file is automatically generated for a given Linux user. The cron daemon process
runs this crontab file regardless of whether a task or a job is scheduled to run or not. A
crontab file can be seen as a registry that contains a list of jobs scheduled. A crontab file
belongs to root or a superuser. You can become a superuser to create a crontab file or
edit an existing crontab file. The default location for crontab files is
/var/spool/cron/crontabs/
To create a new crontab file, use “crontab -e” from the command line. Here is how
that looks:
You will be presented with the following template to write your own cron jobs:
To list all the current crontab files, you can try the following:
ls -l /var/spool/cron/crontabs/
The minute column represents the specific minute when the job should run. It can contain numeric values only, ranging between 0 and 59.
The hour column represents the hour at which the job should run and takes only numerical values, ranging between 0 and 23.
The day-of-month column represents the day in a given calendar month. As you know, months contain 28, 29, 30, or 31 days depending upon the month and whether it is a leap year. You can mention a numerical value between 1 and 31.
The month column represents a calendar month and can take numerical values between 1 and 12. These numerical values represent the calendar months from January to December, respectively.
The weekday column represents a given day of the week, starting from Sunday through Saturday. It can take any value between 0 and 6, where 0 represents Sunday and 6 represents Saturday.
The shell command column is where the command that needs to be run is issued.
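Putting the six fields together, a crontab entry has the following layout (the annotation lines are for reference only; crontab treats lines starting with # as comments):

# minute (0-59)  hour (0-23)  day-of-month (1-31)  month (1-12)  weekday (0-6, Sunday = 0)
# *              *            *                    *             *        <shell command>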
Note In the crontab file, we have an asterisk operator. The “*” operator can be
applied in the first five columns of the crontab entry. When a “*” is mentioned for
the appropriate column, it means that it can run on all the intervals of the given
allowable range. Here is an example:
15 6 * * * /user/parallels/bin/helloworld
The preceding crontab entry would run the hello world script every day at 6:15 AM. Specifically, the script would run on every day of the month, every month, and every weekday as well.
Cron Logging
The cron daemon process writes a log entry every time the system starts up, every time it executes a new job, and whenever it has trouble interpreting a command or faces an error during the execution of a job. Unfortunately, the log entry does not contain the source of the error. If you are working on a Debian-based operating system (Ubuntu and others), then you can find the logs here:
/var/log/syslog
/var/log/cron
You can obtain the most recent cron log entries by the following command:
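On a Debian-based system, one way to pull the most recent cron entries out of the syslog (a sketch; adjust the log path for your distribution):

grep CRON /var/log/syslog | tail -n 20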
Cron logging stores a lot of key information. A log entry starts off with a date timestamp of when the job was started, followed by the hostname of the workstation, the user account on that workstation, and the command that was executed. Here is an example of a log entry:
In the preceding example, the log entry begins with the date timestamp, followed
by the hostname (in this case, linux), followed by the user on the host (in this case,
parallels) and finally the command that is executed (in this case, executing a Python
script named helloworld.py).
nano cronexample.sh
#!/bin/bash
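The body of the script is not shown; a minimal sketch that simply appends a timestamp to a log file each time cron runs it (the log file name is an assumption):

echo "cronexample ran at $(date)" >> /home/parallels/Documents/cronexample.log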
Save the file and exit the editor. Let us provide execute permissions to this file:
chmod +x cronexample.sh
Now enter the crontab to schedule the execution of this shell script:
crontab -u parallels -e
Figure 11-7. Scheduling the execution of that previously written shell script
What we have is
* * * * * /home/parallels/Documents/cronexample.sh
This crontab line would execute the command at every minute of every hour, every day of every month.
Now let us introduce another operator through another example:
*/6 * * * * /home/parallels/Documents/cronexample.sh
Here we use the "/" operator, which specifies a step value. In this case we have applied the step value to the minute field. This entry runs the script every sixth minute, regardless of hour, day of month, month, or weekday.
The "/" step operator can be applied to other fields as well, enabling much more complex schedules.
Here is an example:
24 */4 * * * /home/parallels/Documents/cronexample.sh
This would enable the cron scheduler to run this script at the 24th minute of every 4th hour. Here is how a sample schedule looks:
In the preceding example, we are running cronexample.sh at every half hour past every 18th hour on Monday, Thursday, and Sunday. Here is how the schedule may look:
The first execution starts at 18:00, followed by another run 30 minutes later in the same hour. The subsequent execution happens at noon of the next day, providing the 18-hour time window.
In this example, we have scheduled this script to run at 10 PM, every day from
Monday through Friday, every six months.
Here is how the schedule may look like:
Now, let us try to combine the range operator and the step value operator:
36 9-17/2 * * * /home/parallels/Documents/cronexample.sh
In this example, we have scheduled the script to run at the 36th minute of every second hour between 9 AM and 5 PM.
Here is how that would look:
Let us look at an example that uses word strings instead of numbers. Here is an example:
This crontab entry would execute the shell script on the 12th day of the month, if that day falls on a Monday, in January, March, May, July, October, November, and December; on the day the script runs, it executes every third hour between 9 AM and 5 PM.
Here is an example schedule:
Notice the use of the character "L" and the hash symbol "#". The "L" character indicates the last of the values for a given field, and the "#" operator indicates the nth weekday of the month. In this case, the script is scheduled to be executed every 15 minutes between 3 AM and 3:59 AM, on the last day of the month if that day happens to be the fourth Friday of the month, in the months of June, August, and September.
Database Backup
Database backup is the process of creating a copy of an organization's master files and transactional tables. This is a critical operation, as data is among the most valuable assets and is required to run everyday business and operations. Various types of database backup methods exist, depending upon the type and frequency of backups: differential backups, incremental backups, and complete backups.
Let us look at a simple example where a shell script executing database backup
operation is scheduled:
0 2 * * * /home/parallels/mysql_backup.sh
This script would run every day at 2 AM carrying out instructions mentioned in the
shell script file.
Data Processing
Many organizations work with a variety of systems that generate data. The data team within the organization typically generates reports every day or performs file transfers of certain data. These data processing tasks that arise within certain source systems may be automated using the cron scheduler utility.
Let us look at scheduling a batch job that performs data processing or report generation every day:
45 3 * * * /home/parallels/Documents/data_processing.sh
The above cron entry would run the batch operation every day at 3:45 AM, thereby preparing the report or performing the file transfer to another location.
Email Notification
Email notification is the process of notifying one or more persons with a certain message. In the world of data systems and engineering, one can set up email notifications about the status of a processed task (backing up a database, preparing a dataset, data mining, etc.). Several libraries exist to configure an email server and send output messages to the appropriate recipients.
Let us look at a simple example:
45 7 * * MON-FRI /home/parallels/email_notification.sh
The preceding entry would send email notifications about various tasks at 7:45 AM, Monday through Friday.
Cron Alternatives
There are quite a few alternatives to the cron job scheduler; some of them are still based on earlier distributions of the cron utility. One alternative that is widely used and well respected in the system administration community is the systemd timer. Systemd is a software suite that provides a variety of components to manage system processes in Linux operating systems. The systemd timer is one of these components and provides a job scheduling mechanism. Jobs can be scheduled to run at a specific time or interval, or in response to a certain event.
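As an illustration, a timer unit that mirrors the earlier 2 AM backup schedule might look like the following; the unit name and path are placeholders, and the timer is assumed to trigger a matching mysql-backup.service unit:
# /etc/systemd/system/mysql-backup.timer
[Unit]
Description=Run the MySQL backup script daily at 2 AM
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
[Install]
WantedBy=timers.target
Once enabled with "systemctl enable --now mysql-backup.timer", systemd takes care of running the associated service on schedule.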
Conclusion
Workflow management and orchestration are essential components in data engineering.
In this chapter, we looked at crontab and cron scheduling and how we can orchestrate complex jobs and automate routine tasks using them. What appears to be a simple
concept is a powerful scheduling tool for automating recurring tasks. The foundations
and key concepts you have learned in this chapter may serve you well when looking at
modern data orchestration platforms, which we will see in the next couple of chapters.
Efficient workflow management, scheduling, and orchestration are key to unlocking the
complete potential of your data infrastructure.
CHAPTER 12
Orchestrating Data
Engineering Pipelines
using Apache Airflow
Introduction
Apache Airflow is an open source workflow orchestration platform. It enables engineers
to define, schedule, and orchestrate workflows as directed acyclic graphs (DAGs).
Airflow supports a wide range of integrations with popular data storage and processing
systems and is suitable for a variety of use cases, from ETL (extract, transform, load)
processes to sophisticated machine learning pipelines. With its scalable and extensible
architecture, great graphical user interface, and collection of plugins and extensions,
Airflow is widely used to improve productivity and ensure data reliability and quality. We
will have a deep dive into the architecture of Apache Airflow, various components, and
key concepts that make Apache Airflow a powerful workflow orchestrator.
By the end of this chapter, you will learn
Let us begin by setting up a dedicated Python virtual environment for Airflow. Activate the environment (on Linux/macOS):
source airflow-dev/bin/activate
If you are using the MS Windows operating system, activate it with the following command instead:
airflow-dev/Scripts/activate
Next, create a directory that will serve as the Airflow home:
mkdir airflow_sandbox
Now, copy the folder path and set the AIRFLOW_HOME environment variable. Ensure that you do this before installing Apache Airflow on your system:
export AIRFLOW_HOME=/home/parallels/development/airflow_sandbox
The following instructions are extracted from the Apache Airflow website. We shall
obtain the Python version using the following command:
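Depending on how Python is installed on your machine, this is typically:
python3 --version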
As mentioned earlier, we are looking to install Apache Airflow version 2.8.1. Here is
the command to perform this step:
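The documented approach is to pin the installation to a constraints file matching your Airflow and Python versions. Assuming Python 3.8, the command may look like this:
pip install "apache-airflow==2.8.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.8.txt"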
This command will kick-start installing Apache Airflow and its dependencies. Here is
how that may look:
Now, let us initialize the backend database, create an admin user, and start all
components by entering the following:
airflow standalone
Once this step is completed, let us initialize the Apache Airflow web server by the
following command:
airflow webserver -D
This command will start the Airflow web server as a background process.
Once you have the web server running, let us try to run the scheduler. The following
command will kick-start the Airflow scheduler:
airflow scheduler -D
You may open your browser and enter the following URL to access the Apache
Airflow graphical user interface.
Here is the URL:
http://localhost:8080
When you ran the "airflow standalone" command, it must have created an admin account, with the password stored in a text file within the AIRFLOW_HOME location. Obtain the password from the text file "standalone_admin_password.txt". Here is how that may look:
Figure 12-5. Obtaining the user credentials from the text file
Once your login is successful, you may see the following screen with prebuilt jobs:
Airflow Architecture
Airflow provides a platform to define, schedule, and orchestrate jobs and workflows. As discussed earlier, a workflow can be a series of tasks where the output of a given task serves as the input of the subsequent task. In Apache Airflow, a workflow is represented by a directed acyclic graph. A directed acyclic graph is a mathematical term for a series of nodes connected by edges such that no closed loop is formed anywhere in its structure. In the context of Apache Airflow, a directed acyclic graph represents a series of tasks that are connected and dependent on each other but do not in any way form a loop. They are simply called DAGs.
A DAG workflow has tasks represented as nodes; data sources and dependencies
are represented as edges. The DAG represents the order in which the tasks should
be executed. The tasks themselves contain programmatic instructions that describe
what to do. The concept of DAG is at the heart of Apache Airflow’s functionalities. In
addition to that, there are some core components that are essential to Apache Airflow
functionalities. Let us look at these core features in detail, starting with the components
we are already familiar with.
Web Server
The web server component provides the user interface functionality of the Apache
Airflow software. As we have seen in the previous section, the web server runs as
a daemon process. Our installation runs as a standalone process on a single local machine; you can also deploy Apache Airflow in a distributed fashion, where the web server runs on its own instance with heightened security parameters. The web server provides the user interface for a developer to look at the various jobs that are currently scheduled, the schedules they are configured to run on, the last time they were run, and the various tasks contained in the workflow of a given job.
Database
The database component of Apache Airflow contains metadata about various jobs. Some of this metadata includes the last run of a task or workflow, the status of previous runs, etc. The web server, which has a modern user interface, retrieves information from this database. Apache Airflow uses SQLite as its default metadata store, accessed through SQLAlchemy. For a production instance, it is recommended to set up a different database backend like Postgres or MySQL. The backend database setup involves creating a separate database for the Airflow instance, creating a separate database user, granting all privileges on the Airflow database to the Airflow user, and adding the newly created Airflow user to the database's access control list. Lastly, initialize the new backend database and use the appropriate database driver in your connection string within the Airflow configuration file (we will see more on this later).
Executor
The executor is how the tasks within a given workflow are run. The executor is actually part of the scheduler component. Airflow comes with several pluggable executors, which provide different mechanisms for running your workflows. There are two major types of pluggable executors, namely, local executors and remote executors. As their names suggest, local executors run tasks within your workstation (locally), whereas remote executors run tasks across multiple worker nodes (a distributed environment). For instance, the LocalExecutor runs tasks as parallel processes on your local machine, whereas remote executors such as the CeleryExecutor distribute tasks to a pool of worker machines.
Scheduler
This is the most important component of the Apache Airflow architecture. The scheduler
component monitors all the workflows and all the tasks within each workflow. The
scheduler triggers execution of a task once all its underlying dependencies are complete.
The scheduler is a separate service that needs to be started manually, as we have seen in
the “Setup and Installation” section. Apache Airflow supports more than one scheduler
component running concurrently to improve performance. The scheduler component
makes use of the executor to execute tasks that are ready to go.
Configuration Files
The Airflow configuration file is not part of the Apache Airflow architecture per se; however, it controls how Airflow works and behaves. Apache Airflow comes with an option to fine-tune various options through the "airflow.cfg" file. This file gets created when you install Apache Airflow on your workstation. Each configurable property is provided with a description and a default value. You can open the Airflow configuration file in any text editor. You can also get the list of options from the command line by entering the following command:
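In Airflow 2.x, the command is:
airflow config list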
If you want to change the location where you keep your DAGs, then here is where
you would specify the location of the folder:
[core]
dags_folder = /user/parallels/<location-to-your-DAGs-folder>/
If you want to specify or change the executor that Airflow intends to use, then here is
where you would do so:
[core]
executor = LocalExecutor
If you wish to change the default sqlite backend to either MySQL or Postgres
database instance, then you may change the database string from sqlite driver
and credentials to either mysqldriver or psycopg2 driver along with the respective
credentials. Here is where you would replace the database string:
[database]
sql_alchemy_conn = sqlite:////
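For example, a Postgres backend using the psycopg2 driver might be configured as follows; the user, password, and database names here are placeholders:
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db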
If you would like to gain visibility into your development or sandbox instance of Apache Airflow and enable the ability to view the configuration from the web UI, then set the following parameter to True:
[webserver]
expose_config = True
Once you have made the necessary changes, it is essential to restart the Airflow server. A good production practice is to create a systemd service to start, stop, and restart the Airflow server. You may find a systemd service configuration file within the GitHub repository of Apache Airflow. However, for development instances, it is easier to identify the process id (also known as the PID), kill the process, and start the Apache Airflow web server again using the following command:
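On a development machine, that might look like the following; the port number is just an example:
ps aux | grep "airflow webserver"
kill <pid-from-the-output-above>
airflow webserver -p 8080 -D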
Note The “-p” refers to specifying the port, whereas “-D” is to run the server as
a daemon process (commonly known as background process).
In addition, you can use the following command from the terminal to obtain a list of
all current DAGs loaded in the environment:
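In Airflow 2.x, that command is:
airflow dags list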
A Simple Example
Let us look at a simple example of Apache Airflow and examine various components in
the following sections. Here is a simple example to get started:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello, World!")

def print_time():
    now = datetime.today()
    print(now)
    return now

first_airflow_dag = DAG(
    'hello_world',
    description='A simple DAG that prints Hello, World!',
    schedule=None,
)

hello_task = PythonOperator(
    task_id='say_hello_task',
    python_callable=say_hello,
    dag=first_airflow_dag,
)

display_time = PythonOperator(
    task_id='print_time',
    python_callable=print_time,
    dag=first_airflow_dag,
)

# define the order of execution: say_hello_task runs before print_time
hello_task >> display_time
We import the required classes and methods, to begin with. We have two functions
in this code: the first function would print a string, and the other function would display
the current date and time. We then define the DAG where we specify various details like
name, description, and schedule. These DAGs can also take in much more complex
configurations. Then, we define the tasks for each of those functions and associate them
with our DAG. Finally, we define the order of the tasks to be executed. While this may
not print anything in the command line, this is helpful to understand the various parts
of workflow definition using Apache Airflow. To get the DAG displayed on your user
interface, make sure you place the Python script in the appropriate folder.
Note You have to place your Python script in the folder that is referenced in the
airflow.cfg configuration file in order to view and monitor your DAG. Make sure
your “dags_folder” property is pointing to the appropriate folder where you have
your scripts that need to be run. Moreover, locate the configuration parameter
named “expose_config” and set that parameter as “True”. This will expose
all the configuration parameters in the web server. In development or sandbox
instances, these may be helpful to troubleshoot.
Airflow DAGs
The directed acyclic graph, or simply the DAG, is the heart of a workflow management and orchestration system, in this case Apache Airflow. A typical DAG consists of various tasks that are connected to each other through their connections and dependencies, such that no loop is formed anywhere in the graph.
Here is a simple example of a DAG:
Connect to the database ➤ extract the table ➤ perform the transformations ➤ load
the dataset ➤ close the database ➤ email notify
These DAGs will define and control how these tasks will be executed; however, they
have no information about what these tasks actually do. In a way, these DAGs are task
agnostic.
There are three ways of declaring a DAG within a data pipeline. The most basic and
familiar way is using a constructor and passing the DAG into an operator that you plan to
use. Here is how that looks:
my_test_airflow_dag = DAG(
    dag_id='hello_world',
    description='A simple DAG that prints Hello, World!',
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
)

EmptyOperator(task_id="my_task", dag=my_test_airflow_dag)
You can also declare a DAG by using a Python context manager and adding the
operators within the context:
with DAG(
    dag_id='hello_world',
    description='A simple DAG that prints Hello, World!',
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
):
    PythonOperator(task_id="hello_world_task", python_callable=say_hello)
Lastly, you can define a function, where you specify all the operators within that
function; then you can use a decorator to turn that function into a DAG generator. Here
is how that looks:
@dag(
    dag_id='hello_world',
    description='A simple DAG that prints Hello, World!',
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
)
def some_random_function():
    EmptyOperator(task_id="hello_world_task")
    PythonOperator(task_id="print_current_date", python_callable=print_time)

some_random_function()
Whenever you run a DAG, Apache Airflow will create a new instance of a DAG and
run that instance. These are called DAG runs. Similarly, new instances of the tasks
defined within DAGs will also be created during runtime. These are called task instances.
Tasks
DAGs require tasks to run. Tasks are at the heart of any given DAG. They can be seen as
a unit of work. During a run, you will be creating a new instance of a task, namely, task
instance. These task instances have a begin date, an end date, and a status. Tasks have
states, representing the stage of the lifecycle. Here is a sample lifecycle:
There are three kinds of tasks, namely, operators, sensors, and task flows. As
discussed in previous sections, tasks are interconnected using relationships. In this
context, the dependencies between tasks are represented as relationships. Here is an
example of how relationships are represented in Apache Airflow. Let’s assume we have
three tasks, namely, extract, transform, and load. The output of an extract task serves as
the input of a transform task, and the output of a transform task serves as the input of a
load task. Here is how these tasks can be represented in relationships:
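In Airflow, such relationships are declared with the bitshift operators (>> and <<); for the three tasks just described, a sketch would be:
extract_task >> transform_task >> load_task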
Operators
An operator in Apache Airflow represents a template for performing certain tasks.
Apache Airflow has an extensive list of operators available that are already developed.
Some of the operators cater to specific cloud environments, databases, etc. Let us look
at a simple example of leveraging a bash operator to download a flat file and save it in a
specific directory.
Here is the code for the same:
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator

@dag(
    dag_id="extract_task",
    schedule=None,
    start_date=datetime(2024, 2, 2),
)
def extract_task():
    BashOperator(
        task_id="data_extract",
        bash_command="wget -c https://www.dol.gov/sites/dolgov/files/ETA/naws/pdfs/NAWS_F2Y197.csv -O /Users/<your-directory-here>/airflow/wash-electric-vehicles.csv",
    )

extract_task()
If you have not already initialized the Airflow metadata database, you can do so by using the following command:
airflow db init
Once the scheduler parses the DAGs folder, you can see the DAG listed in the user interface. Here is how that looks:
Sensors
A sensor is also a type of an operator, except it performs a very distinct function. The
sensor can be used to create an event where it would wait for something to happen.
Once the event happens, it would trigger a downstream job. If the event does not
happen, it will continue to wait. Sensors are commonly used to wait for a file to arrive
within a file system, monitor external APIs and databases, or simply wait for the
completion of a different task. Here are some of the commonly used sensors and how
they are utilized in data pipeline workflows:
Let us look at a simple example of a sensor. We have a file sensor that continues to
wait till a file arrives in a folder/file system for a specified amount of time. Whenever the
file arrives into the respective folder location, the DAG would run the downstream task,
which echoes that the file has arrived. In this context, it is assumed that the file will be
placed in the respective folder by another process:
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

@dag(
    dag_id="file_sensor_example",
    start_date=datetime(2024, 2, 3),
    schedule=None,
)
def wait_for_file():
    # wait (poking every 25 seconds, for up to 5 minutes) until the file shows up
    file_arrival = FileSensor(
        task_id="wait_for_file",
        filepath="/Users/<your-location-here>/airflow/testfile.txt",
        mode="poke",
        timeout=300,
        poke_interval=25,
    )
    # downstream task that runs only after the sensor succeeds
    file_notification = BashOperator(
        task_id="file_notification_task",
        bash_command='echo "file is available here"',
    )
    file_arrival >> file_notification

wait_for_file()
Task Flow
As you have seen in previous sections, you will have to start planning tasks and DAGs as
you develop your data pipeline. If you have data pipelines that are mostly written in pure
Python code without using Airflow operators, then task flow decorators will help create
DAGs without having to write extra operator code:
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 2, 2))
def taskflow_example():

    @task(task_id="extract", retries=5)
    def extract():
        """
        Extract data from the data source.
        If there is a connection issue, attempt to
        establish the connection again.
        """
        dataset = {"records": []}  # placeholder for the extracted data
        return dataset

    @task(task_id="transform")
    def transform(dataset):
        """
        Perform the necessary transformation for
        the extracted dataset.
        """
        processed_dataset = dataset  # placeholder for the transformed data
        return processed_dataset

    @task(task_id="loaddata", retries=5)
    def load(processed_dataset):
        """
        Obtain the dataset and transport it to the file system.
        Retry the task if there is an issue with the connection.
        """
        pass

    # wire the tasks so the output of each step feeds the next
    extracted_data = extract()
    processed_data = transform(extracted_data)
    load(processed_data)

taskflow_example()
Xcom
Tasks, by default, perform or execute the instructions they have been provided in
isolation and do not communicate with other tasks or processes. Xcom is an Apache
Airflow exclusive concept that enables cross-communications between tasks. With
Xcom, you can obtain information or value from one task instance and utilize that
in another task instance. While this can also be programmatically accomplished by
exporting the value to a file or a database and then reading the file or database to obtain
the value, Xcom functionality is native and can be considered as more secure.
From a development point of view, cross-communication is achieved in two steps: first you create (push) an XCom from one task, and then you pull that value from another task. To create an XCom, you can simply return a value from the task function, or you can use the xcom_push method. Here is how that may look:
@task
def some_function(ti=None):
    result = some_other_function()
    ti.xcom_push(key="Result_Func1", value=result)
The xcom_pull method will attempt to obtain values using the task id and key. Here is an example:
@task
def another_function(ti=None):
    value_from_some_function = ti.xcom_pull(key="Result_Func1",
                                            task_ids="some_function")
Hooks
Apache Airflow lets you connect with external data platforms and systems with ease through the concept of hooks. Hooks in Airflow let you connect with external systems without having to write the connection code manually. Hooks also have optional "retry" logic embedded in them. Airflow offers many built-in hooks, and there is also a community of developers building custom hooks for various applications. Let us look at a simple illustration.
Let's look at the following code that connects to a database:
import psycopg2

def querydb():
    conn_params = {
        "host": "localhost",
        "database": "testDB1",
        "port": 5432,
        "user": "username",
        "password": "password"
    }
    try:
        conn = psycopg2.connect(**conn_params)
        cur = conn.cursor()
        cur.execute("SELECT * FROM <your-table-here>")
        rows = cur.fetchall()
        for row in rows:
            print(row)
    except Exception:
        print("error connecting to database")
Now let us use an Airflow Hook to connect to the same database. Here is how that
code may look:
def querydb():
    dbhook = PostgresHook(postgres_conn_id='Postgres_TestDB1')
    rows = dbhook.get_records("SELECT * FROM <your-table-here>")
    print(rows)
Notice the use of the Postgres database hook while connecting and retrieving rows;
the database connection parameters are not required to be specified here.
■■ Note Prior to using hooks in Airflow, you must set up the credentials of the external data system in Airflow separately.
You can visit Airflow UI ➤ Admin ➤ Connections ➤ New Connection.
Once you have the external data system configured, you will get a connection id that you can use in the hook.
Variables
Apache Airflow variables are key–value pairs that can be defined and used within an Airflow DAG. These variables are scoped globally and can be accessed from any of the tasks. The tasks within a DAG can query these variables, and you can also use them within your templates. A good application is storing database or access credentials that remain static during the course of the project. When Airflow variables are used to store API secrets, keys, passwords, and other sensitive information, Airflow automatically masks the values of variables whose key names suggest sensitive content (for example, keys containing "secret" or "password"), and it can also encrypt these variables in the metadata database. And so, it is suggested to use variables for sensitive information over other options, which may lack masking and encryption features.
You can create Apache Airflow variables programmatically within the pipeline,
through the command line interface, or even through the user interface (web server).
Here is an example of setting up Airflow variables in the command line interface:
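For instance, to store a JSON string holding AWS credentials under a key named aws_key (the values here are placeholders):
airflow variables set aws_key '{"aws_access_key": "<your-access-key>", "aws_secret_key": "<your-secret-key>"}'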
You can also set up or bulk upload JSON from the Apache Airflow web server. To get
started, navigate to the Admin menu in the web server and click Variables:
You can click the “+” button to create a new variable. It will render a new screen
where you can enter the key, value, and description.
Figure 12-11. Populating the key, value, and description of the variable
Within a templated field, a variable can be referenced using the var.value syntax:
var.value.<key-portion-of-variable>
In Python code, variables are read through the Variable model, as in the following snippet:
from airflow.models import Variable

def get_aws_parameters():
    aws_access_key = Variable.get("aws_key", deserialize_json=True,
                                  default_var=None)["aws_access_key"]
    aws_secret_key = Variable.get("aws_key", deserialize_json=True,
                                  default_var=None)["aws_secret_key"]

PythonOperator(
    task_id="get_aws_parameters",
    python_callable=get_aws_parameters,
)
Params
Params are short for parameters. Similar to variables, params are arguments you can
pass to Airflow DAG runs or task instances during runtime. The difference is that values
passed to params are not encrypted and they should not be used for storing sensitive
information. You can pass params at the DAG level or at the individual task level. Task-
level params take precedence over DAG-level params.
Let us look at passing params at the task level:
BashOperator(
task_id="an_example_task",
bash_command="echo {{ params.my_param }}",
params={"my_param": "this is an example parameter"},
)
Here is an example of passing params at the DAG level and reading them from a task:

@dag(
    start_date=datetime(2024, 2, 2),
    schedule=None,
    catchup=False,
    params={
        "Person": "JohnDoe",
        "Value": 10
    },
)
def demo_dag():
    @task
    def demo_task(**context):
        print(context["params"]["Person"])
        print(context["params"]["Value"] + 10)
    demo_task()

demo_dag()
Params are more localized when compared with variables in Apache Airflow. If you
want to pass different values for a task at each run, params would be a good choice.
Templates
In Apache Airflow, templating uses a Python templating framework called Jinja, which allows you to pass data into task instances at runtime. These Jinja templates enable you to fill in values during runtime. The syntax of this templating is as follows:
{{ your-value-here }}
The most common application of templating is obtaining execution information in Apache Airflow, such as the date on which the task instance is running, the execution date, and the start and end of the data interval of the current DAG run. These templates are supported in the most commonly used operators in Apache Airflow. You can also extend the functionality of these templates by adding filters or functions. Here is how you can use templates in Airflow:
BashOperator(
    task_id="begin_and_end_date_interval",
    bash_command='echo "the DAG started at {{ data_interval_start }} and finished at {{ data_interval_end }}"',
)
Macros
Macros in Apache Airflow enable you to inject predefined functions to be used within
the templates. They live under the “macros” namespace in the templates. Currently,
Airflow macros support methods from Python libraries, namely, time, datetime, uuid,
random, and dateutil. Here is an example of utilizing macros:
BashOperator(
task_id="display_current_datetime",
bash_command='echo "It is currently {{ macros.datetime.now() }}"'
)
In the preceding example, we have called a method from the Python library with the
macros namespace. If you try to call it without the macros namespace, the command
won’t return the datetime.
Apache Airflow also lets you control which downstream tasks run, based on a condition, using the BranchPythonOperator. Here is an example:

def decision_loop(result):
    if result > 0.5:
        return ['third_task', 'fifth_task']
    return ['fourth_task']

branching = BranchPythonOperator(
    task_id='branch',
    python_callable=decision_loop,
)
This example utilizes the older syntax where you create an instance of the respective
operator class, whereas the latest syntax makes use of a decorator that handles the
instantiation and operation of the respective operator.
Let us look at an example with the new syntax:
from datetime import datetime
from random import randint

from airflow.decorators import dag, task

@dag(
    start_date=datetime(2024, 2, 2),
    schedule=None,
    catchup=False
)
def branch():
    @task
    def cur_task():
        return randint(1, 100)

    @task.branch
    def decision_loop(value_from_cur_task):
        if (value_from_cur_task > 0) and (value_from_cur_task <= 40):
            return ['i_am_next', 'next_to_that']
        elif (value_from_cur_task > 40) and (value_from_cur_task <= 80):
            return 'other_task'
        else:
            return 'hello_world'

    @task
    def i_am_next():
        pass

    @task
    def next_to_that():
        pass

    @task
    def other_task():
        pass

    @task
    def hello_world():
        pass

    value_from_cur_task = cur_task()
    # the candidate downstream tasks must be wired after the branching task
    decision_loop(value_from_cur_task) >> [i_am_next(), next_to_that(),
                                           other_task(), hello_world()]

branch()
This example uses task decorators. Depending upon the value a given task generates,
a condition is checked, and based on the evaluation, a respective task id is returned,
which would determine which downstream task to run. In the code, “@task.branch” is
the decorator version of the “BranchPythonOperator”.
The other operator that can be used to control the workflow is the ShortCircuitOperator. If its condition is not satisfied (that is, its callable returns a falsy value), it will skip all the downstream tasks. Let us look at an example:
from datetime import datetime
from random import getrandbits

from airflow.decorators import dag, task
from airflow.models.baseoperator import chain

@dag(
    start_date=datetime(2024, 2, 2),
    schedule='@daily',
    catchup=False,
)
def sc1():
    @task
    def begin():
        return bool(getrandbits(1))

    @task.short_circuit
    def decision_loop(value):
        return value

    @task
    def success():
        print('Not skipping the execution')

    chain(decision_loop(begin()), success())

sc1()
In this example, we have a simple function that returns a random true or false, which
is provided to the decision loop function. If the random function returns true, then the
downstream task success() gets executed; if the random function does not return true,
then the downstream task is skipped. And so, every time this DAG is run, you would
obtain different results.
Triggers
As we discussed in earlier sections, a given task will execute only after all its preceding
tasks have been successfully completed. We have seen ways in which we can control
execution based on conditions. Let us imagine a scenario where we have “n” number
of tasks linearly scheduled with one beginning to execute after its preceding task
successfully completed.
Let us say we have a requirement where, if a task fails, all the other succeeding tasks are ignored and only a conclusion task must be run. You cannot short-circuit the remaining workflow, as that would skip all the remaining tasks, including the conclusion task. This is where the concept of trigger rules is beneficial.
You can configure triggers within the task itself, by adding the “trigger_rule”
argument to the task. Let us look at an example:
import random
from datetime import datetime

from airflow.decorators import dag, task
from airflow.utils.trigger_rule import TriggerRule

@dag(
    start_date=datetime(2024, 2, 2),
    dag_id="trigger_illustration",
    schedule="@daily"
)
def dag_trigger():
    @task(task_id="begin")
    def begin():
        print("Task execution begins")

    @task(task_id="random_task")
    def random_task():
        if random.choice([True, False]):
            raise ValueError("Task 2nd in queue unsuccessful")
        else:
            print("Task 2nd in queue executed")

    @task(
        task_id="task3",
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS
    )
    def task3():
        print("Task 3rd in queue executed")

    @task(
        task_id="task4",
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS
    )
    def task4():
        print("Task 4th in queue executed")

    @task(
        task_id="conclusion_task",
        trigger_rule=TriggerRule.ONE_FAILED
    )
    def conclusion_task():
        print("One of the tasks unsuccessful")

    # one possible wiring: the conclusion task also watches the task that may fail,
    # so its ONE_FAILED rule fires whenever that task raises an error
    start = begin()
    risky = random_task()
    wrap_up = conclusion_task()
    start >> risky >> task3() >> task4()
    risky >> wrap_up

dag_trigger()
In the preceding example, we have five tasks, one of which may randomly fail. By specifying trigger rules, we are able to skip the remaining downstream tasks and execute only the conclusion task in the event of that failure.
Conclusion
As we explored throughout this chapter, Apache Airflow is a powerful workflow
orchestration tool and a good solution for orchestrating data engineering and machine
learning jobs. Apache Airflow has been around for close to ten years, enabling workflow orchestration and providing advanced features and customizations. All the major cloud vendors offer Apache Airflow as a managed service, which shows its wide adoption in the industry. We looked at setting up Airflow, the architecture of the Airflow system, its
components, and how they all work together. We then looked at DAGs, hooks, operators,
variables, macros, params, triggers, and other concepts and how you can leverage them
in orchestrating a data pipeline. In the next chapter, we will look at another workflow
orchestration tool that is more robust and enables you to work without a DAG. The
choice of workflow orchestration tool depends on the project requirements, current infrastructure, and team.
CHAPTER 13
Orchestrating Data
Engineering Pipelines
using Prefect
Introduction
Earlier, we looked at Apache Airflow, the most widely adopted workflow orchestration solution, which is well supported by the community. In this chapter, we will look at another workflow orchestration solution called Prefect. We
will look at its architecture and how you can create a workflow without DAGs. You can
still define DAGs in Prefect though. The features, components, and typical workflow
of Prefect appear different from what we have seen with Apache Airflow. As for which
tool is a better choice, it depends on many factors like team, requirements, and
infrastructure, among other things.
By the end of this chapter, you will learn
–– Task runners
Chapter 13 Orchestrating Data Engineering Pipelines using Prefect
Introduction to Prefect
Prefect is a second-generation, open source workflow orchestration platform, designed to
help automate, orchestrate, execute, and manage various Python-based tasks and processes.
Prefect can generate dynamic workflows at runtime. The learning curve for Prefect is shallow, meaning it takes relatively little time to start using Prefect in your workflows. Using Prefect,
one can transform Python functions into a unit of work that can be orchestrated, monitored,
and observed. Some of the highlighting features that Prefect offers are
Scheduling: Scheduling a job or a task is usually done through cron scheduling
or using a dedicated orchestration service. With Prefect, one can create, modify, and
view the schedules in a minimalistic web UI. Prefect offers cron scheduling, interval
scheduling, and recurrence rule scheduling for more complex event scheduling.
Retries: Data engineers often write code to retry a certain operation in case of a
failure. Often, this is used in cases where one needs to connect to a database, as part of
a data integration/engineering pipeline that runs periodically. With Prefect, one can
simply use a Python decorator to specify the number of retries, eliminating the need for
several lines of code and their associated maintainability.
Logging: Logging is a process of registering or recording information when an
event happens during an execution of a data pipeline. Logging can include high-level
information, like registering the status of connecting to a database, and lower-level
information, like capturing the value of a variable at a given point in time during
execution. Prefect automatically logs events for various operations without requiring additional configuration.
Caching: Caching is a method of storing copies of data in specific locations, often
referred to as cache. The idea is to efficiently retrieve such items to reuse the same for
other jobs. With Prefect, one can cache the results of a task and persist the same to a
specified location.
Async: Async is short for asynchronous and is a method of executing tasks outside the main flow. With Prefect, one can run tasks asynchronously, meaning that Prefect can start one task while waiting for another to finish, without stopping the execution of the program.
Notifications: Notifications are messages that are sent by the data pipelines about an
event or an update before, during, or after the execution of a data pipeline. Prefect Cloud
enables notifications about the data pipeline by providing various alerts by various
communication mechanisms like emails, text messages, and platforms like Slack, MS
Teams, etc.
Observability: Prefect Cloud offers a dashboard-like user interface, where users can
observe and monitor various tasks and flows that are scheduled, currently running, or
finished.
In a traditional data stack, one has to write several lines of code, and in some cases, it
would involve the utilization of another tool or software to enable such features. The fact
that Prefect can offer these features out of the box is considered to be appealing when
compared with similar orchestration platforms.
Prefect comes in two different editions, namely, Prefect Core and Prefect Cloud.
Prefect Cloud is a commercial offering of Prefect Technologies, Inc., that comes with a
dashboard-like user interface where one can observe and monitor the workflows along
with few other features like role-based access control and integrations with other tools
and technologies, to name a few.
Prefect Core is an open source framework that provides the core functionality. One
can run Prefect Core within their own infrastructure. Prefect Core is a Python package
that provides the core components of Prefect. One can obtain Prefect from the Python
package installer.
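To get started, create a virtual environment (the environment name prefect used below is just an example) and activate it:
python -m venv prefect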
source prefect/bin/activate
If you are using MS Windows operating system, then please use the following
command:
prefect/Scripts/activate
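With the environment active, install Prefect from the Python Package Index:
pip install -U prefect
You can then confirm the installation by checking the version: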
prefect version
Prefect Server
Prefect Server is an open source backend that helps monitor and execute the Prefect
flows. Prefect Server is part of the Prefect Core package. Prefect Server offers the
following options in its user interface:
• Flow runs
• Flows
• Deployments
• Work pools
• Blocks
• Variables
• Notifications
• Artifacts
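To start the self-hosted server and its user interface locally, run the following command:
prefect server start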
You would get the following screen if the web server successfully started:
Navigate to a browser and enter this URL to access the user interface:
http://127.0.0.1:4200
From a software design point of view, there appear to be three major components:
–– The Prefect core, which provides the core Prefect functionality and
workflow orchestration engine
You can reset the backend Prefect database by issuing the following command:
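In Prefect 2.x, that command is the following (it asks for confirmation before clearing the database):
prefect server database reset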
Prefect Development
Flows
Flows are the most basic unit of Prefect development. A flow is a container for workflow
logic and allows users to manage and understand the status of their workflows. It is
represented in Python as a single function. To convert a Python function into a Prefect
flow, one needs to add the following decorator to the function:
@flow
Once this decorator is added, the behavior of the function changes, rendering the
following features:
–– Changes in the state of flow are reported to the API, enabling real-
time observation of flow execution.
Flows can include calls to other flows or tasks, which Prefect refers to as “subflows.”
Flow Runs
Flow runs represent a single instance of the flow execution. One can create a flow run,
simply by running the Python script that contains the flow. A flow run can also be
initiated by setting up deployment on Prefect Cloud or self-hosted Prefect Server. If a
flow run consists of tasks or subflows, then Prefect will track the relationship of each
child flow run to the parent flow run.
from prefect import flow

@flow()
def hello_world():
    print("Hello World of Prefect!")

hello_world()
Let us look at an illustration of a Prefect flow that utilizes various parameters that can
be passed to the decorator:
@flow(name="MyFirstFlow",
      description="This flow prints Hello World",
      retries=3,
      retry_delay_seconds=10,
      flow_run_name="Hello World Flow",
      timeout_seconds=300
)
def hello_world():
    print("Hello World of Prefect!")

hello_world()
In addition to the name and description of the flow, we can also configure retries.
The retries parameter is the number of times to retry execution upon a flow run failure.
You can also specify the number of seconds to wait before retrying the flow after failure,
by setting the retry_delay_seconds parameter. Furthermore, there is the timeout_seconds parameter, which would mark the flow as failed if the flow exceeds the specified runtime.
To execute this Python program, simply execute the program from your terminal.
Upon executing you will see a log trace similar to what is shown in the following:
As you can see, Prefect has executed an instance of the flow, and this flow run has
been given a name. You can also see that the flow finished in the completed state. Let us
look at the user interface:
Recall that we had assigned a name for both the flow and the instance of the flow as
“MyFirstFlow” and “Hello World Flow,” respectively. You can obtain the same log trace
and a few other details when you click the flow run name. Here is how that looks:
While a flow is a fundamental idea of workflows in Prefect, a flow can include one or
more tasks.
Let us look at the concept of tasks.
Tasks
A task is a function that represents a single discrete unit of work in Prefect. It is entirely
possible to create a workflow, solely with flows, where the process is encapsulated as
a Python function with a flow decorator. However, tasks enable breaking down the
workflow logic into smaller packages that may be reused in various other flows and subflows. Tasks are similar to functions: they take inputs, perform work, and return outputs. Tasks can be defined within the same Python file where the flow is defined, or within a module and imported for use in flow definitions. It is essential that all tasks are called from within a flow.
Tasks receive metadata from upstream dependencies, including the state of the
dependencies prior to their execution, regardless of receiving data inputs from them.
This enables the developers to make a task wait prior to the completion of another task.
Usage of tasks is optional in Prefect. One can write all their code in a giant function and
use a @flow decorator for that function. Prefect can execute that function. However,
organizing the code into as many smaller tasks as possible may help in gaining visibility
into the runtime state and also in debugging the code. In the absence of tasks, if a job
runs in one flow and any line of code fails, then the entire flow will fail. If tasks are used
to break the flow into several tasks, then it can be debugged relatively easily.
Each Prefect workflow must contain one primary, entry point @flow function.
From that flow function, a number of tasks, subflows, and other Python functions can
be called. Tasks, by default, cannot be called in other tasks; special syntax must be
employed while calling tasks from other tasks:
from prefect import flow, task

@task
def task1():
    print("I am task1")

@task
def task2():
    task1.fn()
    print("I am task2")

@flow
def flow1():
    task1()
    task2()

flow1()
In the preceding example, we have seen two tasks that are incorporated within a
flow. While Prefect does not allow triggering tasks from other tasks, there is an option to
call the function of the task directly by using the “fn()” method. You can see the function
“task1” has been called within the “task2()” function. It will only be a one-time execution
of the function, and you cannot utilize retries among other Prefect functionalities when
calling this function from a task.
Figure 13-7. Observing the output of a Prefect flow in the command line
This Prefect flow has two tasks, namely, task1 and task2, where task2 calls the
function of task1 directly. And so, you will see the task1() execution and the task2()
execution, which includes the print function of task1.
Tasks provide parameters for further customization; some of them are shown in the following example of tasks with optional parameters included:
from prefect import flow, task

@task(
    name="Task1",
    description="This is the task1; it prints a string",
    tags=["test"],
    retries=2,
    retry_delay_seconds=10
)
def task1():
    print("I am task1")

@task(
    name="Task2",
    description="This is the task2; it prints a string and calls the task1 function directly",
    tags=["test"],
    retries=2,
    retry_delay_seconds=10
)
def task2():
    task1.fn()
    print("I am task2")

@flow(
    name="MainFlow",
    description="Main flow with tasks included",
    retries=1
)
def flow1():
    task1()
    task2()

flow1()
In addition to providing the name and description of the task, you can also specify
the number of times to retry the specific task upon the unsuccessful execution of the
task and the duration to wait before retrying the task. There is also a Boolean parameter
specifying whether to persist the task run results to storage.
Results
Results are quite straightforward. They represent data returned by a flow or a task. As
mentioned previously, tasks or flows are basically Python functions that may return
a value.
from prefect import flow, task

@task
def my_task():
    return 1

@flow
def my_flow():
    task_result = my_task()
    return task_result + 1

result = my_flow()
assert result == 2
In the preceding example, you can see that the output of a task (a function with the task decorator) is saved to a variable, and the output of a flow is also saved to a variable. While this program is able to save the results of tasks and flows to variables, it does not persist them.
Persisting Results
By default, Prefect does not store results unless a feature requires them; this reduces the overhead of reading and writing to result storage. When results are persisted, they are written to a storage location and Prefect stores a reference to the result.
The following features are dependent on results being persisted:
To persist the results in Prefect, you need to enable the result persistence parameter,
specify the location of storage, and also have the option to serialize the output either to
JSON or using pickle serializer. Depending upon the parameters used in the program,
results can either be persisted to a given task or all tasks or to a flow. Consider the
following example:
@task
def task1(a: int, b: int) -> int:
    c = a + b
    return c

@flow(retries=2)
def flow():
    ans = task1(2, 3)
    print(ans)

flow()
In this example, the flow has the retries parameter enabled, which requires the results of all tasks to be persisted. However, flow retries do not require the flow's own result to be persisted. Let us try to run this program:
Figure 13-8. Observing the output persistence of the Prefect flow in the
command line
We can observe that the program executed successfully. When you examine the results in the Prefect server, you may see that the task results were persisted, whereas the flow's result was not. Here is how that may look:
One can also manually toggle result persistence. Toggling persistence will override any decision that Prefect would make on its own. Here is an example of that:
@task(persist_result=False)
def my_task(a: int, b: int) -> int:
    c = a * b
    return c

@flow(persist_result=True)
def my_flow() -> int:
    result = my_task(5, 6)
    return result

my_flow()
In the preceding example, persist_result has been set to False for the task. This means the task will never persist its result, and any feature that depends on persistence may not work and may result in an error. However, for the flow, persist_result has been set to True. The flow will persist its result even if it is not necessary for a feature.
Let us run this code:
The code ran without errors. Recall that we have asked the task not to persist
the result and the flow to persist the result. Let us look at the flow run output in the
Prefect server:
Here, we see that while the task did not persist its result, the flow stored its result in local storage.
Upon further examination, here is the serialized output of that file:
{
"serializer": {
"type": "pickle",
"picklelib": "cloudpickle",
"picklelib_version": "2.2.1"
},
"data": "gAVLHi4=\n",
"prefect_version": "2.14.21"
}
You can choose to specify a specific location where you would like the results to
be stored.
Artifacts in Prefect
Artifacts, in Prefect, refer to persisted outputs like links, reports, tables, or files. They
are stored on Prefect Cloud or Prefect Server instances and are rendered in the user
interface. Artifacts enable easy tracking and monitoring of objects that flows produce
and update over time. Artifacts that are published may be associated with the task run,
task name, or flow run. Artifacts provide the ability to display tables, markdown reports,
and links to various external data.
Artifacts can help manage and share information with the team or with the
organization, providing helpful insights and context. Some of the common use cases of
artifacts are:
Documentation: Documentation is the most common use of Prefect artifacts. One
can publish various reports and documentation about specific processes, which may
include sample data to share information with your team and help keep track of one’s
work. One can also track artifacts over time, to help see the progress and updates on the
data. To enable that, the artifacts must share the same key.
Data quality reports: Artifacts can help publish data quality checks during a task. In
the case of training a large machine learning model, one may use artifacts to publish performance graphs. By specifying the same key for multiple artifacts, one can
also track something very specific, like irregularities on the data pipeline.
Debugging: Using artifacts, one can publish various information about when and
where the results were written. One can also add external or internal links to storage
locations.
Link Artifacts
Link artifacts can be created by calling the create_link_artifact() method. Each call to create_link_artifact() produces a distinct artifact and accepts several parameters.
Here is a simple example of an artifact containing a link:
from prefect import flow
from prefect.artifacts import create_link_artifact

@flow
def my_flow():
    create_link_artifact(
        key="artifactkey1",
        link="https://www.google.com/",
        link_text="Google",
        description="Here is a search engine"
    )

if __name__ == "__main__":
    my_flow()
When we run this code, it will generate an artifact for us. Let us try to locate the
same. In your Prefect server, click Flow Runs and navigate to the Artifacts column:
When you click the artifact, here is how that may render the link artifact:
Markdown Artifacts
Markdown is a lightweight markup language that can be used to add formatting
elements to plain text documents. The markdown syntax needs to be added to various
texts to indicate which words or phrases look different and how. To create a markdown
artifact, the create_markdown_artifact() function needs to be used:
from prefect import flow, task
from prefect.artifacts import create_markdown_artifact

@task(name="Manufacturing Report")
def markdown_task():
    markdown_report = f"""
# Manufacturing Report
## Summary
| Items       | Quantity |
|:------------|---------:|
| Portable PC |    60000 |
| Tablet      |    70000 |
| Phone       |    80000 |
## Conclusion
These numbers are very encouraging and exemplify our hardworking and
resourceful manufacturing team.
"""
    create_markdown_artifact(
        key="manufacturing-report-1a",
        markdown=markdown_report,
        description="Quarterly Manufacturing Report",
    )

@flow()
def my_flow():
    markdown_task()

if __name__ == "__main__":
    my_flow()
Upon running the Python code, we can access the markdown artifact in the same location. Here is how the artifact looks:
Table Artifacts
One can create a table artifact by using the function create_table_artifact(). Here is an
example:
from prefect import flow
from prefect.artifacts import create_table_artifact

@flow
def my_fn():
    highest_churn_possibility = [
        {'customer_id': '12345', 'name': 'John Doe',
         'churn_probability': 0.85},
        {'customer_id': '56789', 'name': 'Jane Doe',
         'churn_probability': 0.65}
    ]
    create_table_artifact(
        key="table-artifact-001",
        table=highest_churn_possibility,
        description="## please reach out to these customers today!"
    )

if __name__ == "__main__":
    my_fn()
Upon running this program, we can locate the artifact at a similar location within the Prefect server.
Here is how that may look:
Furthermore, one can list all the artifacts in the command line.
By entering “prefect artifact ls”, we can list all artifacts like the following:
In addition, you can inspect the contents of these artifacts using the artifact key in
the terminal.
The command is "prefect artifact inspect <artifact-key>".
States in Prefect
A state basically means the status or condition of an object at a given point in time in the
system. In Prefect, states contain information about the status of a particular task run
or a flow run. A reference to the state of a flow or a task means the state of a flow run or task run. For instance, saying that a flow is completed or running refers to the state of its flow run. Prefect states describe where the program
currently is, during its lifecycle of execution. There are terminal and non-terminal states
in Prefect.
SCHEDULED (non-terminal): The run will begin at a particular time in the future.
PENDING (non-terminal): The run has been submitted to run but is waiting on necessary preconditions to be satisfied.
RUNNING (non-terminal): The run code is currently executing.
PAUSED (non-terminal): The run code has stopped executing until it receives manual approval to proceed.
CANCELING (non-terminal): The infrastructure on which the code was running is being cleaned up.
COMPLETED (terminal): The run completed successfully.
FAILED (terminal): The run did not complete because of a code issue and had no remaining retry attempts.
CRASHED (terminal): The run did not complete because of an infrastructure issue.
In Prefect, tasks and flows can return data objects, Prefect state objects, or Prefect futures (a combination of both data and state). Let us look at an illustration for data objects:
from prefect import flow, task

@task
def add_one(x):
    return x + 1

@flow
def my_flow():
    result = add_one(1)  # returns an int
A state object is a Prefect object that indicates the status of a given flow run or task
run. You can return a Prefect state by setting the “return_state” parameter to true. Here
is how that may look:
@task
def square(x):
    return x * x

@flow
def my_flow():
    state = square(5, return_state=True)
    result = state.result()
    return result

my_flow()
A Prefect future is a Prefect object that contains both data and state. You can return a
Prefect future by adding “.submit()” to the function:
@task
def square(x):
    return x * x

@flow
def my_flow():
    future = square.submit(5)
    data = future.result()
    pstate = future.wait()
    return data, pstate

print(my_flow())
@task(
    on_completion=[success_hook, conclusive_hook],
    on_failure=[failure_hook, conclusive_hook]
)
def my_task():
    print("this task is successful")

@flow(
    on_completion=[flow_completion_hook],
    on_cancellation=[flow_cancellation_hook, conclusive_hook],
    on_crashed=[flow_crashed_hook]
)
def my_flow():
    my_task()

my_flow()
You can see from the preceding code that, for a given state transition, more than one state change hook can be specified as a list. You can also attach the same state hook to multiple tasks and flows for the same state transition. This way, when you have several tasks and flows, you can easily control the workflow with respect to state changes in tasks and flows.
Blocks
Blocks are one of the important aspects of Prefect. Blocks store configuration and authentication information and provide an interface for interacting with other systems. With Prefect blocks, one can securely store authentication details and configuration information for various source systems, downstream systems, communication channels, and other systems, and pull values from environment variables. Using blocks, one can query data from a database, upload or download data from a cloud data store like AWS S3, send a message to a Microsoft Teams channel, or access source code management tools like GitHub. Blocks are available both in Prefect Server and Prefect Cloud.
It is important to note that blocks are meant for configuration that does not change during runtime; what is defined in a block is not altered while flows run. Blocks can be created, and existing blocks modified, using the Prefect UI.
Here is how it appears on the Prefect UI:
Prefect provides blocks for many cloud data stores and other external systems. It is also possible to create a custom block in Prefect. Prefect's custom blocks are built on top of the BaseModel class from the Pydantic library. While all blocks are encrypted before being stored, one can also obfuscate individual fields using Pydantic's SecretStr field type.
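As a rough sketch of what a custom block might look like (the class name, fields, and block name here are illustrative, and the exact imports may vary with your Prefect and Pydantic versions):

from prefect.blocks.core import Block
from pydantic import SecretStr

class WarehouseCredentials(Block):
    """Hypothetical custom block holding database connection details."""
    host: str
    port: int = 5432
    username: str
    password: SecretStr  # obfuscated in the UI and API responses

# Save an instance so it appears in the Prefect UI and can be loaded by flows
WarehouseCredentials(
    host="warehouse.example.com",
    username="etl_user",
    password="not-a-real-password",
).save("warehouse-creds", overwrite=True)

# Load it later, for example inside a flow or task
creds = WarehouseCredentials.load("warehouse-creds")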
Let us create a new block. On the Blocks page, click Add block; you will see a list of available block types. Here is an example of creating an AWS S3 block. S3 is an object storage cloud service offered by Amazon Web Services. To create a new S3 block, simply type "s3" in the search bar and you will get the S3 block. Enter the required details in the block and save it. Here is how that may look:
One can also register custom blocks from the command line, so that they become available in the Prefect UI, using the prefect block register command.
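For example, assuming the custom block class above lives in a local file named my_blocks.py, or that you have installed a collection such as prefect-aws:

prefect block register --file my_blocks.py
prefect block register -m prefect_aws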
Prefect Variables
In Prefect, variables are named, mutable string values. Variables can be created or modified at any time, and they may be cached for faster retrieval. Variables are meant for nonsensitive information, such as configuration settings. It is important to note that Prefect variables are not encrypted; to store sensitive information like credentials, use blocks rather than variables.
One can create, read, write, and delete variables via the Prefect user interface, Prefect
API, and command line interface. To list all Prefect variables from the command line interface, run:

prefect variable ls
To create Prefect variables via the user interface, go to the Prefect server, choose
Variables, and click New variable.
Here is how that may look for you:
Variables can also be read from Python code:

from prefect import variables

VariableFromPrefect = variables.get('prefect_var_1')
print(VariableFromPrefect)

Variables can also be referenced in a deployment definition (prefect.yaml), for example in a pull step:

pull:
  - prefect.deployments.steps.git_clone:
      repository: https://github.com/PrefectHQ/hello-projects.git
      branch: "{{ prefect.variables.deployment_branch }}"
The variable referenced above is evaluated at runtime, so one can change or modify it without having to update the deployment directly.
Task Runners
In Prefect, tasks can be executed in different ways: concurrently, in parallel, or distributed across machines. A task runner does not have to be specified for tasks to execute; a task simply runs as a Python function and produces the appropriate return value. With default settings, when a task is called within a flow, Prefect executes the function sequentially.
A sequential execution is a method where functions or tasks are executed one after
the other, only after the previous task or function is completed. This is the simplest
model of task execution and it performs tasks one at a time, in a specific order. Prefect
provides a built-in task runner, called the SequentialTaskRunner.
This SequentialTaskRunner works both with synchronous and asynchronous
functions in Python. Here is an example:
from prefect import flow, task
from prefect.task_runners import SequentialTaskRunner

@task
def task1():
    print("I am task1")

@task
def task2():
    task1.fn()
    print("I am task2")

# using submit creates a Prefect future, which gives access to both the
# state and the result of the task
@flow(task_runner=SequentialTaskRunner())
def flow1():
    task1.submit()
    task2.submit()

if __name__ == "__main__":
    flow1()
Prefect also provides the ConcurrentTaskRunner, which lets submitted tasks run concurrently. Here is an example:

import time
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

@task
def stop_at_floor(floor):
    print(f"elevator moving to floor {floor}")
    time.sleep(floor)
    print(f"elevator stops on floor {floor}")

@flow(task_runner=ConcurrentTaskRunner())
def elevator():
    for floor in range(10, 0, -1):
        stop_at_floor.submit(floor)

if __name__ == "__main__":
    elevator()
For parallel and distributed execution, Prefect integrates with Dask through the prefect-dask collection and its DaskTaskRunner. Here is an example that attaches a Dask resource annotation to a submitted task (for the annotation to take effect, the Dask workers must define a matching GPU resource):

import dask
from prefect import flow, task
from prefect_dask.task_runners import DaskTaskRunner

@task
def show(x):
    print(x)

@flow(task_runner=DaskTaskRunner())
def my_flow():
    with dask.annotate(resources={'GPU': 1}):
        future = show.submit(0)  # this task requires 1 GPU resource on a worker

if __name__ == "__main__":
    my_flow()
Conclusion
In this chapter, we looked at Prefect, another open source workflow management and orchestration tool. Prefect offers a flexible and convenient approach to workflow orchestration that addresses many of the challenges an engineering team faces. Prefect is Python native, and the concepts of flows and tasks, along with the idea of simply using a decorator to create a workflow, are convenient. We looked at the components of the workflow orchestrator, followed by setup and installation. We gained an appreciation for flows and flow runs, tasks and task runs. We further looked into results, persisting results, and various artifacts. We also discussed states, state change hooks, and variables, and we looked at blocks for managing configuration and secrets (integrations such as HashiCorp Vault are also available for secrets management).
a workflow orchestration solution may depend on several factors (current infrastructure,
team’s expertise, requirements of the project, etc.). Overall, you have gained higher-level
skills by learning Airflow and Prefect, enabling you to build scalable and more efficient
data pipelines.
CHAPTER 14
Getting Started with Big Data and Cloud Computing
IP address
IP address stands for Internet Protocol address. An IP address is how servers are identified on the Internet. IP version 4 defines the address as a 32-bit number, while the increasingly adopted IP version 6 defines it as a 128-bit number.
DNS
DNS stands for Domain Name System. It can be seen as an address book for the Internet. Instead of remembering a complex string of IP address digits, you use an easily remembered domain name to access a website. DNS is a distributed database that maps domain names to IP addresses.
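As a quick illustration, Python's standard library can perform a DNS lookup, resolving a domain name to an IP address (the domain below is just an example):

import socket

# Ask the system's DNS resolver for the IP address behind a domain name
ip_address = socket.gethostbyname("example.com")
print(ip_address)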
Ports
Ports are logical entities that serve as gates to a server. To enter a server, not only do you need the appropriate IP address but also the correct port. Port numbers range from 0 to 65535. Imagine this as an airport with 65,535 gates: to board your flight, you have to reach the correct airport and then the correct gate. For instance, port 1521 is used by the Oracle database and port 22 is used by SSH.
Firewalls
A firewall is like a gated system where only certain IP addresses are given access. Think of it as an invite-only launch event that admits only the members on the allowlist. This is different from a Virtual Private Network, which provides a secure, encrypted connection over private networks so you can access private services from remote locations.
Virtualization
Virtualization is the process of running a virtual instance of an operating system that is decoupled from the underlying physical hardware. The physical computer has certain hardware specifications; by adopting virtualization, we can provision only a subset of the capabilities that the bare-metal hardware possesses, and therefore run multiple operating systems on the same computer. A hypervisor, software that sits between the computer hardware and the operating systems, manages how many hardware resources to allocate to each system. An operating system instance that a hypervisor spins up is referred to as a virtual machine.
Hadoop
The most widely accepted system in the big data space is Hadoop. Hadoop is an
ecosystem of tools and technologies that supports processing and delivering insights
using big data. After its first release in the mid-2000s, Hadoop was open sourced and has received contributions from people all over the world. Hadoop has its roots in distributed computing.
Hadoop enables the creation of a distributed computing platform that runs on a large
number of consumer-grade machines. The Hadoop ecosystem consists of a cluster that
has a name node and one or more worker nodes, wherein the name node would be
responsible for assigning and coordinating tasks and the worker nodes would carry out
the tasks that are assigned to them.
The Hadoop ecosystem has its own file system called Hadoop Distributed File
System (HDFS). HDFS supports the MapReduce paradigm, which enables parallel
computation on big data. Hadoop enables writing quick and easy MapReduce jobs
using a scripting language called Pig Latin. Pig Latin is helpful in writing complex data
transformations and useful in exploratory data analysis with big datasets. Furthermore,
Hadoop also supports Hive, a data warehousing solution on top of Hadoop’s file system.
The Hive tool provides a SQL-like interface for querying data stored in Hadoop’s
file system. Hive translates this SQL-like syntax into MapReduce jobs and enables
distributed computation on the data.
Spark
The biggest leap in the evolution of big data systems happened with Apache Spark.
Apache Spark is an analytics engine that supports big data processing and possesses
the ability to run in-memory computations. Performing computations in memory enables Apache Spark to be up to 100 times faster than disk-based MapReduce processing. Furthermore, Spark supports multiple programming languages like Java, Scala, Python, and R, offering extensive analytics capabilities that include SQL processing, real-time stream processing, machine learning pipeline development, and graph processing.
Spark played a significant role in shaping the evolution of modern data platforms such as the Delta Lake architecture and the data lakehouse. Delta Lake is an open source storage layer that runs on top of data lakes and is tightly integrated with Spark; it provides capabilities such as ACID transactions on data lakes and unified streaming and batch data processing.
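As a rough illustration of Spark's programming model (the file name and column used here are hypothetical), a PySpark job reads data into a distributed DataFrame and aggregates it in memory:

from pyspark.sql import SparkSession

# Start a local Spark session; in production this would point at a cluster
spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV file into a distributed DataFrame and aggregate it in memory
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()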
The growth of big data, and the resulting need for big data systems and their closely associated tools and technologies, certainly contributed to the increased adoption of cloud computing. As organizations realized they could gather analytics from vast amounts of data to gain competitive intelligence, among other insights, migrating to the cloud became a relatively easy choice. Cloud computing systems, regardless of vendor, offer compute, memory, and storage at scale. It is much quicker to provision and set up a data processing cluster; even better, cloud computing vendors charge only for the time you consume their resources. In its purest form this is called serverless computing, and you will see more of it in later sections and chapters.
Public Cloud
Public cloud is the most common method of deploying cloud computing for a customer, and it is also considered the most economical option. In the public cloud, services are delivered over the Internet, so the customer just needs a secured workstation and sufficient Internet bandwidth to seamlessly access the services rendered by the cloud computing vendor. It is also the easiest model to set up and may take no more than a few minutes to create an account and begin using cloud services.
Private Cloud
Private cloud is where an organizational customer has the exclusive rights and control
of the IT infrastructure. This is similar to running a traditional IT infrastructure. You can
have a third-party cloud vendor provide and manage the exclusive IT infrastructure,
or it can be the case that the organization has deployed their own exclusive cloud that
is managed by the organization itself. You can still take advantage of more recent technology, automation, virtualization, etc. and reap the benefits of cloud computing. However, the cost may be higher when compared with public cloud offerings.
Hybrid Cloud
As the name suggests, it is a mix of both private and public cloud concepts. In hybrid
cloud, some resources are hosted on-premises, while some are hosted in the public
cloud. This is often seen where organizations keep certain applications on-premises and move others to the cloud. Organizations can thus keep their data within their own premises while procuring and utilizing certain services in the cloud.
Community Cloud
A community cloud is a variant of the public cloud, except that the IT infrastructure is
shared between several organizations from a specific community or industry. They can
either be managed internally or externally by a third-party vendor. Community cloud is a collaborative effort, and it provides better security when compared with the public cloud. Examples include banking and healthcare clients, which must meet specific regulations and compliance requirements set by their respective governing bodies.
Government Cloud
Government cloud is made for government bodies and entities; it is a sub-variant of community cloud, except that it is built to cater to specific compliance and regulatory requirements. This involves cloud computing vendors providing extra layers of security and policies that comply with the specific requirements government entities may have.
Multi-cloud
Multi-cloud is where an organization leverages more than one cloud vendor to host its applications and infrastructure. The advantage is that you are not tied to a single cloud vendor, so you can move between vendors more easily. It can also be advantageous from a cost point of view: an offering at one cloud vendor may be less costly than a comparable offering at another. The downside is higher operational complexity.
Scalability
Every data pipeline is meant to process and/or transport data, and the amount of computational resources consumed is directly proportional to the volume of data processed. As the data pipeline processes more data, the entire pipeline comes under stress. Scalability is the ability to handle this increase in usage while ensuring your data pipeline performs its functions smoothly and consistently. You should have a rough idea of the maximum workload your pipeline can process and whether you may encounter issues such as memory leaks or stack overflows. Scalability can be achieved by two approaches, namely, horizontal scaling and vertical scaling. Horizontal scaling is adding more compute nodes to a cluster to handle an increase in processing traffic. Vertical scaling is increasing the capacity of a given compute node to meet the traffic.
Elasticity
Elasticity refers to the ability to adapt to changes in the workload of a given data pipeline. Elasticity is the idea of dynamically resizing the underlying computational resources to meet data processing needs while minimizing resource utilization. This is different from scalability: scalability focuses on scaling up for peak consumption periods, whereas elasticity is the ability of the data processing system to dynamically scale up or down in order to minimize resource usage. A good example is the serverless computing offered by cloud vendors, where you do not pay for the underlying computer but only for the units of work completed within the execution time. In addition, elasticity also includes the ability to dynamically provision additional compute and storage during runtime.
High Availability
High availability is the ability of the data processing system to be available continuously, without intervention, for a specified amount of time. The idea is to eliminate single points of failure so the system is available whenever the need for data processing arises. High availability is typically measured as an uptime percentage over a given period. Even during system maintenance or failure of the underlying computing environment, the data processing system should remain available.
Fault Tolerance
Fault tolerance is the ability of the data processing system to remain available in the event of an actual failure of the underlying computing environment. It is slightly different from high availability, although they are two sides of the same coin: both aim to keep the system available without interruptions. One way to achieve fault tolerance is to run the same data processing system in a different region. For instance, if your data processing pipeline is served from one geographical region, you can run another instance of the system that performs the same task in a different geographical region. This way, if one region fails, the other region can continue serving customers until the primary region resumes. Many distributed computing clusters have the concept of a main node managing communications and worker nodes performing tasks; fault tolerance here also means having backup main nodes readily available.
Disaster Recovery
Disasters here mean macro-level events, such as earthquakes and floods that disrupt everyday services, as well as human-related events like pandemics or epidemics, or any other event that causes prolonged interruption to services; they can be intentional or unintentional. Disaster recovery is the practice of recovering data management, data processing, and machine learning systems and restoring their functionality after a disaster event. The idea is to maintain a backup site for these systems. In this context, cloud computing vendors offer disaster recovery as a service so that businesses can continue to function as usual.
Caching
Caching is not exactly an architectural principle; however, it is an important aspect of architectural design. Caching is a data storage concept where commonly used data is kept close at hand during computations. The idea is to reuse previously computed or processed data so we do not have to run the same computation again and again. The cached data is usually stored in RAM (random access memory), so fetching a result from the cache is much faster than performing the same operation again using compute and storage. From the CPU architecture point of view, there are several further caching layers, which are beyond the scope of cloud data architecture. There are several types of caching, such as API response caching and Content Delivery Network caching; our focus is on database caching. In a database there may be a few queries that are expensive and, even after optimization, require heavy resource utilization. By obtaining the results of such queries from a cache instead of the source, we minimize latency and increase the overall performance of the data processing system.
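The reuse principle behind caching can be sketched in plain Python with functools.lru_cache; dedicated database caches (for example, Redis or Memcached placed in front of a database) apply the same idea at a larger scale. The function, sleep duration, and return values below are purely illustrative:

import time
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_report(customer_id: str) -> dict:
    time.sleep(2)  # stand-in for an expensive, frequently repeated database query
    return {"customer_id": customer_id, "total_orders": 42}

expensive_report("12345")  # slow: the result is computed and stored in the cache
expensive_report("12345")  # fast: the result is served from the in-memory cache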
Infrastructure as a Service
This is where the cloud computing vendor provides computing infrastructure as a service. The cloud vendor provides you the infrastructure, and you can use it to run applications, perform computations, and so on. The infrastructure consists of hardware such as processors, memory, operating systems, networking, and storage. Typically the cloud vendor will provide you a server, storage, networking, and a virtual machine with an operating system; it is up to you to decide which software to install to perform the tasks you require. This model suits administrators who are looking to set up and manage the infrastructure themselves.
Examples of Infrastructure as a Service include Amazon EC2, Azure Virtual Machines, and Google Compute Engine, to name a few.
Platform as a Service
Platform as a Service is where the cloud computing vendor provides all the infrastructure and the environment needed to develop or work with an application. Imagine being able to rent a multi-core processor with memory and storage configured, an operating system installed with all updates and patches applied, and the software development environment installed and managed, so you can jump right into developing or deploying your code. Platform as a Service is very helpful to developers who want to build something from the ground up without having to worry about administrative effort.
Examples of Platform as a Service include Heroku, AWS Elastic Beanstalk, Google App Engine, and subscription-based database services such as a Postgres instance on AWS RDS.
Software as a Service
Software as a Service is something you encounter every day. You check the free email account provided by Google by entering its domain address, your user id, and your password, or you use a social media service like Instagram or Facebook free of charge. These companies provide everything, from hardware, operating system, memory, and storage to the software application itself (Gmail, Dropbox, etc.). As an end user, you access the service, consume it to your heart's content, and, perhaps, return to it another time.
Examples of Software as a Service include Zoom video conferencing, Microsoft Office 365 (Word, Excel, PowerPoint), Dropbox online storage, etc.
Compute
Compute means processing power (CPU cores), memory, storage, and the network used to connect to them. This is the most basic cloud service one can offer. In 2006, Amazon started Amazon Web Services and provided EC2 as one of its earliest offerings. You have the option to pick and choose the kind of processor (with various numbers of cores), the amount of physical memory you need, and the kind of network connectivity you want.
Storage
Cloud storage is basically storing your data (files, backups, pictures, videos, audio, etc.) with a third-party storage provider and accessing it over the network. Cloud storage offers security and cost-effective storage management. Cloud storage comes in multiple types of offerings depending upon the need and use case of the customer. These storage systems can autoscale as needed and are highly available over a network.
Object Storage
Object storage is a specialized type of storage that can be used to store any kind of file. Object storage manages data as objects, as opposed to file systems where data is managed as files. Each piece of data is an object and comes with metadata and a unique identifier. Object storage is used in a variety of applications, including cold storage (where data is infrequently accessed) and data archives.
Examples of object storage are Red Hat Ceph, MinIO, Cloudflare R2, Azure Blob Storage (if you are in the Microsoft ecosystem), etc.
Databases
One of the most common offerings is relational database management systems. Cloud vendors offer databases as a service with various options: you can rent computing space with a high-speed network from a cloud vendor and install a database yourself, or you can rent a preinstalled database of the latest version in a matter of a few clicks. The vendor may take care of some of the database administration and maintenance tasks if you choose the fully managed route. With databases, you can autoscale, interact programmatically, and build data pipelines, in addition to loading, querying, and generating SQL reports.
NoSQL
In addition to relational databases, cloud computing vendors also provide NoSQL database solutions. NoSQL stands for Not Only SQL, meaning that these are database management systems designed to handle a variety of data models and data structures. The idea is that NoSQL database systems complement relational databases and enable efficient storage of data that may not fit into cookie-cutter relational data models. NoSQL does not, however, replace traditional relational database systems, the data integrity they enforce through normalization, or the effectiveness relational databases deliver in transactional systems. NoSQL systems allow data to be inserted without having to predefine a schema.
The modern NoSQL movement has been around since the late 2000s. Today, major cloud vendors provide various NoSQL offerings. Let us look at each category briefly.
Document Databases
Document databases are data systems designed for managing semistructured data. A good example is JSON data, where data is captured in JSON documents. Document databases have flexible data models that are not defined prior to loading. Even though JSON documents may have complex and nested structures, documents that come from a single source tend to follow a consistent shape, and the document database accommodates the incoming structure of these documents. A famous example of a document database is MongoDB, a fast, lightweight, simple-to-use database that comes with its own query language for filtering and aggregating JSON documents.
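A minimal sketch of working with a document database through MongoDB's Python driver (the connection string, database, collection, and fields are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
customers = client["crm"]["customers"]

# Documents with different (and nested) structures can live in the same collection
customers.insert_one({
    "customer_id": "12345",
    "name": "John Doe",
    "addresses": [{"city": "Chicago", "zip": "60601"}],
})

# Query documents by field, including nested fields
print(customers.find_one({"addresses.city": "Chicago"}))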
Column-Oriented Databases
Another early form of NoSQL database system is the column-oriented database. Please note that column-oriented databases are different from column-store databases. Column stores are still relational database systems used for OLAP processing and data warehousing purposes; they are extremely fast and efficient for generating complex analytics reports, and they mostly use dimensional data models that are defined prior to loading data. Google BigQuery and Amazon Redshift are examples of column-store databases. Column-oriented databases, in contrast, are NoSQL systems where the schema evolves with the incoming data.
Column-oriented databases are also referred to as wide-column databases: columns for a given set of data can be added at any time. Google Bigtable and Apache Cassandra are two well-known wide-column offerings; Microsoft's Cosmos DB and Amazon's DynamoDB are equally competitive wide-column database services.
Key–Value Stores
Key–value stores are types of NoSQL database systems where the data is stored as a
collection of key–value pairs (very similar to dictionary data types in Python). The data
model is very simple and straightforward, where the keys are identifiers and can be used
to retrieve the corresponding value. Hence, these systems are incredibly fast for retrieval,
as lookups do not involve scanning tables or matching filters and conditions. Good examples are Amazon DynamoDB, Redis, Azure Cosmos DB, etc.
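A minimal sketch of key-value access using Amazon DynamoDB through boto3 (the table and attribute names are illustrative, and the table is assumed to already exist with session_id as its key):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user_sessions")

# Write a value under a key
table.put_item(Item={"session_id": "abc-123", "user": "jane", "cart_items": 3})

# Retrieve it directly by key, with no table scans or joins involved
response = table.get_item(Key={"session_id": "abc-123"})
print(response.get("Item"))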
Graph Databases
Graph databases are another type of NoSQL databases, where the underlying data model
is composed of nodes and edges. Nodes are entities that contain various information like
persons, names, or other discrete objects. Edges are connections between nodes that
describe the relationship between two connecting nodes.
Let us look at a graphical illustration of how two entities may share a relationship
and how there can be various types of relationships between entities.
Some good examples of graph databases are Neo4j, Amazon Neptune, Azure Cosmos
DB, etc.
Vector Databases
Vector databases are newer members of the NoSQL family, designed to store, index, and rapidly search vector data. Given a query vector, the vector database performs a fast search to retrieve the stored vectors that are most similar to it. This is also known as similarity search, which is very effective in image recognition and other artificial intelligence applications. Here, vectors are simply arrays of numbers; depending on the context, these arrays carry certain meaning. Given a vector, one can specify the similarity measure used to obtain its nearest neighbors, such as Euclidean, Tanimoto, or Manhattan distance. The vectors are indexed for faster similarity search and retrieval.
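Under the hood, a similarity search amounts to comparing the query vector against stored vectors. A brute-force sketch using Euclidean distance with NumPy (vector databases build indexes precisely to avoid this exhaustive comparison):

import numpy as np

query = np.array([0.1, 0.9, 0.3])
stored_vectors = np.array([
    [0.1, 0.8, 0.4],
    [0.9, 0.1, 0.2],
])

# Euclidean distance between the query and every stored vector
distances = np.linalg.norm(stored_vectors - query, axis=1)
nearest = int(np.argmin(distances))
print(f"nearest neighbor index: {nearest}, distance: {distances[nearest]:.3f}")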
Data Warehouses
Data warehouses are columnar-style relational database management systems that are
used for analytical processing. These data stores leverage dimensional data models
and column stores in order to retrieve data faster and be able to generate analytic
insights. Major cloud vendors provide the data warehouse software, access control, and some key functionality out of the box, so you can quickly stand up your own data warehouse instance. Provisioning a data warehouse and going live is relatively fast, though not instantaneous. Amazon Redshift, Google BigQuery, and Azure Synapse Analytics are good examples of relational data warehouse environments.
Data Lakes
Data lakes are a centralized repository for storing structured, semistructured, and
unstructured data. Data lake environments enable data storage similar to data
warehouses but without the need for defining predefined schemas. You can ingest data
from various sources, of multiple types, and can interact with and query data using
various analytical tools like Spark or distributed computing environments like Hadoop.
A data lake environment is ideal if you have big data or are attempting to integrate
relational and unstructured data to create analytical datasets. Data lakes are a good
option to look at for developing machine learning pipelines. Good examples of data lake environments are AWS Lake Formation, Azure Data Lake Storage Gen2, and Google Cloud Storage.
Streaming
Major cloud vendors also offer managed streaming services for building real-time data pipelines. Amazon offers AWS MSK, which is managed streaming for Apache Kafka, as well as Amazon Kinesis and a managed service for Apache Flink. Google offers Kafka- and Flink-based services through Google Cloud, along with the Google Pub/Sub service. Microsoft offers comparable capabilities under Azure Stream Analytics (together with Azure Event Hubs).
Serverless Functions
Serverless means that you consume a service and pay only for the resources that were actually utilized. In a major cloud, serverless means one can run functions that interact with one or more components and set up tasks to run without having to provision the underlying compute instance. Amazon offers a serverless compute service called AWS Lambda, Google Cloud Platform has Cloud Functions, and Microsoft has Azure Functions; all three are serverless compute services.
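A serverless function is typically just a handler with a well-known signature. A minimal AWS Lambda sketch (the event shape depends on the trigger, for example an S3 upload notification or an API Gateway request):

def lambda_handler(event, context):
    # Inspect whatever the triggering service passed in
    records = event.get("Records", [])
    print(f"received {len(records)} records")
    return {"statusCode": 200, "body": "processed"}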
Containerization
Containers mimic virtual machines in that they host an application or service, but unlike virtual machines they consume far fewer resources. Compared with serverless, containerization enables an app or a service to be always available. Docker is the best-known container software; it packages an application in the form of container images. A container image is an executable package that contains everything (code, dependencies, system tools, libraries, etc.) needed to run the application independently. A container registry is where container images are stored and distributed. Amazon provides Elastic Container Registry, Microsoft provides Azure Container Registry, and Google offers Container Registry (now Artifact Registry) as container registry services.
Data Governance
Data governance is the combination of tools, processes, and practices through which data is managed safely and in a compliant manner throughout its lifecycle. It is important to have a data governance practice in every organization, as data has become one of the most valuable assets an organization possesses. Major cloud vendors like Amazon, Google, and Microsoft all provide services that can be used to establish and run a data governance practice. Let us look at some of them in greater detail here.
Data Catalog
A data catalog is a centralized repository for storing and managing metadata about data stored in the cloud. It also serves as a system where users can discover various data sources and understand how they can be used. Amazon offers the AWS Glue Data Catalog, a centralized repository for storing and managing metadata. Microsoft has Azure Data Catalog, a fully managed cloud data cataloging service that serves as a knowledge hub for data assets in Microsoft Azure. Google Cloud Platform provides Data Catalog, a scalable metadata management service used to discover data sources in the cloud.
Compliance
Cloud vendors also offer services that help organizations meet regulatory and compliance requirements. For instance, in the healthcare industry, HIPAA compliance ensures that protected health information is safeguarded from fraud and theft. Amazon offers AWS Artifact, which generates audit and compliance reports relevant to various regulatory agencies and policies. Microsoft offers Azure Policy, which provides drill-down reports to assess compliance and enforce standards. Google offers Security Command Center, which helps monitor compliance and other security standards.
Machine Learning
All the major cloud vendors offer a machine learning platform as a service where you can build and train machine learning models and connect to other cloud services for deployment or for reporting results. Amazon offers Amazon SageMaker, Microsoft offers Azure Machine Learning, and Google offers Vertex AI. These services cover the complete lifecycle of machine learning models and give you the option to either program your own custom machine learning model or use low-code tooling to build one.
Conclusion
Cloud computing has revolutionized the technology landscape. The flexibility and cost-effective procurement of services have enabled engineering teams to deliver more value. We have seen how the rise of big data paved the way for the evolution of various tools and technologies; you can now provision a Hadoop cluster in a relatively short amount of time compared with a decade ago. It is nevertheless important to know how we got here, so we can appreciate the sophistication of the technology and data systems we have today. Though there may be reasons an organization may or may not choose to go to the cloud, it remains evident that cloud computing vendors offer ease of use and the latest technologies, tools, and processes for gaining insights from data.
CHAPTER 15
Engineering Data Pipelines Using Amazon Web Services
Introduction
Cloud computing is a concept of obtaining the necessary IT infrastructure for
development and deployment of data and software services. By providing compute
power, memory, storage, and various software installed on top of physical machines, an
organization can get up to speed faster than before. This way an organization eliminates the capital expense of buying physical equipment, since cloud computing providers offer pay-as-you-go pricing models for their services.
In this chapter we will look at Amazon Web Services in detail. Amazon is one of
the early adopters of cloud computing and one of the first few companies to provide IT
infrastructure services to businesses, as early as 2006. At the time, cloud computing was referred to as web services. Today, Amazon offers more than 200 cloud computing products,
including but not limited to compute, storage, analytics, developer tools, etc.
Let us look at the AWS service offerings in the context of developing data and
machine learning pipelines in detail. AWS groups its data center regions into three partitions: standard regions, China regions, and AWS GovCloud (US) regions. AWS suggests choosing a region that is geographically close to your physical presence to minimize latency.
As you embark on your journey exploring Amazon Web Services, please be mindful of managing your cloud resources, especially when you are done using them. With traditional infrastructure, you may simply switch off a computer when you are done. With cloud infrastructure, however, where you often procure a plan based on a pay-as-you-go model, you will be billed for resources even if they are not being used: signing out of AWS does not power off the services you have provisioned. This is particularly important for virtual machines, databases, and other compute-intensive resources. Always make it a habit to review your active resources and shut down the ones you are not using; if you are no longer planning to use a service, you may wish to delete it. That way, you pay only for the services you actively use and keep your costs under control. AWS also provides free credits for new users, which helps new learners keep their costs down.
By the end of the chapter, you will learn
–– How to build, train, and deploy machine learning models using AWS
SageMaker
It may be possible to provision and access services directly from your root account, but I recommend creating an admin account to provision and access AWS services. To accomplish that, let us use the AWS Organizations service. AWS Organizations is an account management service that lets you manage multiple accounts. Within your organization (the root account), you set up organizational units, which broadly correspond to departments within an organization. This is followed by creating service control policies and assigning them to accounts in the various organizational units. For our illustration purposes, we will create two organizational units, namely, "dev" and "prod," and invite team members into them.
Once you have completed the registration and have signed in, you would get the
AWS management console page. Here is how that looks:
Navigate your cursor to the search section, and enter “AWS Organizations” and click
“Create a new organization.” You will get the following screen:
The next step is to create an organizational unit. You will notice the organizational
structure mentioned in the center of the screen. Make sure to check the “Root” and click
Actions, followed by creating an organizational unit.
Now that we have service control policies enabled, let us create a new policy for our
organization by clicking “Create policy”:
You will have the option of selecting one of many AWS services and choosing the appropriate privileges for the policy, along with the resources it applies to. Once you have made your choices, choose "Save changes." AWS will automatically create a JSON policy document based on what we have chosen here. Here is how policy creation would look:
Once these policies are created, it is time to attach a policy with an account. To
accomplish that, go to the main policies page, choose the appropriate policy, and
click “Actions” followed by “Attach Policy,” attaching the policy to the appropriate
organizational unit.
Once the respective policies have been attached, you can sign in to the user’s
account to see if the policies have been enforced. Now visit the management console
and look for “IAM Identity Center.” Please ensure that you have logged in with the
account used to create the AWS organization. AWS IAM Identity Center provides a single
sign-on (SSO) service within AWS. You need to enable AWS IAM Identity Center. Here is
how the page looks:
SSO is considered secure and auditable, since it provides log trails, and it helps organizations meet regulatory compliance requirements. Common standards and providers include SAML, OAuth, and Okta.
Once you have enabled the Identity Center, you need to create permission sets. Permission sets are templates that define identity and access management policies; they enable granting access to various services for groups and users within the organization. Here is how that looks:
Now we will create a new permission set. Let us try to work with a permission
set that is predefined by Amazon. For our illustration, we have chosen the
“DatabaseAdministrator” permission set.
This will enable you to enter further details about the permission set. Once this step
is successful, you will obtain the following screen:
Now we will create a user. The idea is that we will create users and assign various
users with various permission sets. You can also assign users to groups and assign
permission sets to the groups as well. It is an optional step though. Here is how
that looks:
Now we need to connect the users, permission sets, and organizations: revisit AWS Organizations, select the appropriate organization, and assign the user and the permission sets to that account. Here is how AWS Organizations looks:
Click Assign users or groups, and choose the user we created. Upon clicking Next,
choose the permission sets for the respective user for the respective organization. Click
Submit and wait a few seconds for AWS to create the permission sets for the user for the
organization. Here is how that final screen looks:
Once you have completed the steps, click “Dashboard” in AWS IAM Identity Center
and obtain the AWS access portal for that respective user. Here is how that looks:
Figure 15-16. Obtaining the access portal for the respective user
You can use the AWS access portal URL to log in. Please use the user credentials we
have set up using AWS IAM Identity Center and not the root credentials. It may also be
beneficial to bookmark the link.
Once you have access, you can also work with AWS from the command line interface. After installing the AWS CLI, configure your credentials and then, for example, list your S3 buckets and their contents:

aws configure
aws s3 ls
aws s3 ls s3://<name-of-your-bucket>
AWS S3
AWS S3 stands for Simple Storage Service, an object storage service. S3 allows for retention of unstructured data and is considered to be optimized for use cases where you write once and read many times. S3 stores data in the form of objects, as compared with file systems where data is stored as files. It is a great option to store photos, videos, and similar content. AWS provides a programmatic interface that allows apps to perform create, read, update, and delete operations on the data stored in S3 object storage.
Amazon offers a variety of storage classes within the S3 service. Amazon S3 Standard is for data that has a high volume of reads and is an ideal option for applications that seek high throughput and demand low latency. Amazon S3 Standard-IA (Infrequent Access) is for data that is not accessed or read as often; ideal applications include business recovery and disaster planning, long-term backups, etc. Amazon S3 Glacier is for data that may never be accessed but must be retained; applications include end-of-lifecycle data and backups kept for regulatory and compliance purposes.
S3 bucket names are global: a bucket must have a name that is unique within a given partition. There is also an option to create a public bucket that is accessible to everyone. Here is how to create an AWS S3 bucket:
Click Create bucket and you will be presented with a bucket creation page. While it is generally fine to leave the preselected options as they are, there are a few options you may wish to review and adjust to suit your needs:
Uploading Files
Here is an example of uploading an object (file) to a given bucket using the boto3 Python SDK:
import boto3
import configparser

# read the AWS credentials from a local configuration file
config = configparser.ConfigParser()
path = '/Users/pk/Documents/de/aws/s3/.aws/credentials.cfg'
config.read(path)

aws_access_key_id = config.get('aws', 'access_key')
aws_secret_access_key = config.get('aws', 'access_secret')
aws_region = 'us-east-1'

session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=aws_region
)

# create an S3 client
s3_client = session.client('s3')

file_to_upload = '/Users/pk/Documents/abcd.csv'
bucket_name = 'bpro-dev-bucket-1'

try:
    s3_client.upload_file(
        file_to_upload,
        bucket_name,
        'uploaded_abcd.csv'
    )
    print("File uploaded successfully")
except Exception as e:
    print(f"Error uploading file to S3: {e}")
In the preceding example we store the AWS credentials in a configuration file and read the required access key and access secret from that file. A good security practice is to keep the credentials in a hidden folder. Hidden folders provide some level of protection, but keep in mind that anyone with root access to your development environment can still see the credentials.
When you work with a version control system like Git, note that hidden files and folders are still tracked unless you explicitly exclude them. It is therefore essential to have a .gitignore file within your Git repository that dictates which files to ignore.
Here is a very basic .gitignore file:
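The exact entries depend on your project; a minimal sketch that keeps the credentials file used earlier and common Python artifacts out of the repository:

# local credentials and environment files
credentials.cfg
.aws/
.env

# Python build and cache artifacts
__pycache__/
*.pyc
.venv/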
Don't forget to place the .gitignore file at the root of the repository and commit it so that others who fork the repository also benefit. Also make sure that the files you ignore are not files your code actually needs.
Amazon RDS
RDS stands for Relational Database Service. As the name suggests, Amazon RDS provides relational databases with several options for database engines, including but not limited to MySQL, PostgreSQL, Oracle, and MS SQL Server. AWS RDS can be set up within a few minutes, is easy to operate, and can scale up when necessary. AWS manages the common database administration tasks, so you may not have to worry about maintenance.
AWS RDS is a primary option for those looking to migrate their databases from an on-premises environment to the cloud. By choosing the appropriate database engine for the migration, one may achieve a fast and reliable transfer of relational database systems.
Let us look at creating a new database instance. You may initiate the process
by searching for “RDS” in the AWS console, opening the appropriate page, and
selecting “Create a database.” You have the option of creating a new database either by
customizing individual options for availability, security, maintenance, etc. or choosing
the AWS-recommended configurations for your use case.
You can also specify the nature of the instance: whether you need a highly available production database instance or whether a development sandbox with limited performance will do. Here is how this may look for you:
In addition, make sure to add an inbound network rule in the security configuration
by enabling all incoming IPs to talk through TCP/IP. You will find this option under the
VPC tab. Here is how that may look for you:
Figure 15-20. Enabling all incoming IPs to talk through TCP/IP (for learning and
development purposes only)
Here is a Python application that connects to an AWS RDS PostgreSQL instance and uploads data from a CSV file into a table. The connection details are read from a configuration file; the file paths, section name, and key names below are illustrative:
import csv
import psycopg2
import configparser

# CONFIG -- the path, section, and key names are illustrative
config = configparser.ConfigParser()
config.read('/Users/<your-path-here>/credentials.cfg')

conn = psycopg2.connect(
    host=config.get('rds', 'host'),
    port=config.get('rds', 'port'),
    dbname=config.get('rds', 'dbname'),
    user=config.get('rds', 'user'),
    password=config.get('rds', 'password')
)
cur = conn.cursor()

csv_file_path = '/Users/<your-path-here>/users.csv'

try:
    with open(csv_file_path, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        for row in reader:
            cur.execute(
                "INSERT INTO users (id, name, age) VALUES (%s, %s, %s)",
                row
            )
    conn.commit()
    print("Data uploaded successfully")
except Exception as e:
    print(f"error loading data into AWS RDS: {e}")
finally:
    cur.close()
    conn.close()
Amazon Redshift
Redshift is a massively scalable cloud data warehouse. It allows you to load petabytes
of data, execute complex analytical queries to retrieve data using SQL, and connect
business intelligence applications to render reports and dashboards. With the serverless option, you do not have to pay while the data warehouse is idle; charges accrue only while you are actively interacting with Redshift by running queries or loading data. Amazon Redshift is based on open standard PostgreSQL; however, it has been customized to support online analytical processing and other read-heavy applications.
Amazon Redshift is a column-store data warehouse that is scalable and fully managed by AWS. Storing data in a column store reduces the number of input-output operations on Redshift and optimizes the performance of analytical queries. Furthermore, you can specify that the data in Redshift be compressed, which further reduces storage requirements and can speed up queries; when you execute a query against compressed data, the data is decompressed as the query runs.
Amazon Redshift instances are spun off in clusters. A cluster has a leader node
and compute nodes. The leader node manages communications, both with client
applications and the compute nodes. For an analytical query supplied by the user, the
leader node creates an execution plan and lists the execution steps necessary to return
the result. The leader node compiles the query and assigns a portion of the execution to
the compute node. The leader node would assign a query to a compute node that has the
underlying table mentioned in the query.
The compute nodes are themselves partitioned into slices. A portion of physical
memory and storage is allocated to each slice within a compute node. So, when the
leader node distributes the work to compute nodes, the work really goes to a given slice
within a given compute node. The compute nodes then process their portions of the work in parallel within the cluster. A cluster can consist of one or more databases where the data is stored
in compute nodes.
Let us try to set up a new Amazon Redshift cluster. To begin, we have to create a
workgroup. A workgroup in Redshift is a collection of compute resources that also
include VPC configuration, subnet groups, security groups, and an endpoint. Let us visit
the AWS management console and search for Redshift. Once we are on the main page,
let us choose to create a workgroup. You will be directed to a page where you can specify
the RPUs for the workgroup. RPUs are Redshift Processing Units, a measure of available
computing resources. You may also choose the VPC configuration and subnet groups.
Here is how that may look for you:
Once you have specified the various parameters, you may wish to navigate to the
provisioned cluster dashboard to create a new cluster. When creating a new cluster, you
may specify the name of the cluster, the number of nodes you wish to have on the cluster,
and the type of compute, RAM, and storage you wish to obtain for your data warehouse
cluster.
You may also specify the database credentials when creating the Redshift cluster. If
you wish to get started on querying the Redshift cluster using sample data, you may click
Load sample data. In addition, you can also choose to create a new role or associate an
existing IAM role with your cluster.
Figure 15-23. Configuring the database credentials for the Redshift cluster
Once you have chosen the remaining options, you are ready to create the new Redshift cluster. After you click Create cluster, you may have to wait a few minutes for the cluster to be created. Once the cluster is ready and loaded with sample data, you may wish to open the query editor from the cluster dashboard page. We have the option of choosing query editor v2, which has a much cleaner interface. Here is how the query editor may look:
You can write queries, access various databases, and also explore AWS Glue Data
Catalog from your query editor. Furthermore, you can create views, stored procedures,
and functions within the query editor against the data in the Redshift cluster. In the top-
left corner, you have options to create a new database or table using the graphical user
interface or load data directly from one of your S3 buckets. Let us look at how we may
load data from an S3 bucket into Redshift using Python. Here is sample code that picks up a flat file from an S3 bucket and loads it into a Redshift table (the bucket, key, and table names are illustrative):
import boto3
import configparser

config = configparser.ConfigParser()
path = '/Users/<your-path-here>/credentials.cfg'
config.read(path)

aws_access_key_id = config.get('aws','access_key')
aws_secret_access_key = config.get('aws','access_secret')
aws_region = 'us-east-1'

s3_bucket_name = 'your-bucket'
s3_key = 'key'

# target Redshift table for the COPY command (it must already exist)
table_to_be_copied = 'nba_shots'

redshift_cluster_identifier = config.get('redshift','cluster_id')
redshift_database = config.get('redshift','r_dbname')
redshift_user = config.get('redshift','r_username')
redshift_password = config.get('redshift','r_password')
redshift_copy_role_arn = config.get('redshift','redshift_arn')

s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=aws_region
)

redshift = boto3.client(
    'redshift',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=aws_region
)

copy_query = f"""
COPY {table_to_be_copied}
FROM 's3://{s3_bucket_name}/{s3_key}'
IAM_ROLE '{redshift_copy_role_arn}'
FORMAT AS CSV;
"""

def execute_redshift_query(query):
    try:
        # the Redshift Data API lets us run SQL without a JDBC/ODBC connection
        redshift_data = boto3.client(
            'redshift-data',
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key,
            region_name=aws_region
        )
        response = redshift_data.execute_statement(
            ClusterIdentifier=redshift_cluster_identifier,
            Database=redshift_database,
            DbUser=redshift_user,
            Sql=query,
            WithEvent=True
        )
        return response
    except Exception as e:
        print(f"Error executing Redshift query: {e}")
        return None

response = execute_redshift_query(copy_query)
print(response)
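If you want to verify that the COPY statement completed, you can poll the Redshift Data API for the statement status. Here is a minimal sketch, reusing the credentials and the response returned above:
if response is not None:
    redshift_data = boto3.client(
        'redshift-data',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=aws_region
    )
    # describe_statement reports SUBMITTED, STARTED, FINISHED, FAILED, etc.
    status = redshift_data.describe_statement(Id=response['Id'])
    print(status['Status'])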
Amazon Athena
Amazon Athena is an interactive query service that lets you analyze data stored in the AWS S3 object storage directly using standard SQL syntax. With Amazon Athena, you also have the option to run data analytics using Apache Spark directly on the S3 buckets. There is no need to set up the underlying Spark infrastructure or a database connection driver; Amazon takes care of all of this for you. Amazon Athena is a serverless service, so you are charged only for what you use, based primarily on the amount of data scanned by your queries.
As you have the ability to query the buckets in your object storage directly, you can analyze the unstructured, semistructured, and structured data within the buckets. The data can be in many formats, including but not limited to CSV, JSON, Parquet, and Apache ORC; all of these can be analyzed using Athena. Because Amazon Athena can query various kinds of data within S3, you can use S3 as your data lake, with Athena as a query layer on top of it. Queries are automatically parallelized and executed, so performance scales without any infrastructure tuning on your part.
To get started with Amazon Athena, search for Athena in the AWS management
console and choose an option to query your S3 bucket data (either SQL or Spark). Here is
how that may look for you:
You can also use Spark SQL on top of the objects in your S3 bucket, by interacting
with Jupyter Notebook. Here is how that may look for you:
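You can also submit Athena SQL queries programmatically with boto3. Here is a minimal sketch; the database, table, and output location are placeholders:
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# submit the query; results land in the S3 output location you specify
response = athena.start_query_execution(
    QueryString="SELECT * FROM your_table LIMIT 10",
    QueryExecutionContext={"Database": "your_database"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"}
)
print(response["QueryExecutionId"])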
AWS Glue
AWS Glue is a serverless data integration tool. You can use Glue to discover data sources, integrate data, and prepare datasets for data analytics, machine learning, and application development. AWS Glue includes Glue Studio, which lets you build your data integration jobs visually by dragging and dropping components into a workspace. Once you have designed your jobs, you can run and monitor them directly from the Glue environment. You can trigger a job based on a scheduled time or an event.
Let us create two folders, one for the source data and the other for running
Glue-based reports within the data lake bucket. Here is how that may look for you:
Let us now configure the data lake. We will navigate to the AWS management console
and search for Lake Formation. We will create a database within AWS Lake Formation
and configure our data sources folder from the S3 bucket. Create a database, assign a
name, and choose the data source location for the database. Here is how that may look
for you:
Now let us configure AWS Glue so we can automatically catalog the tables in the data sources folder. Before that, we need to create an IAM role within AWS that AWS Glue will use to run a crawler. A crawler in AWS Glue is a program that scans one or more data sources, infers their schema, and populates the Glue Data Catalog with the resulting metadata. First, let us look for IAM within the AWS management console, navigate to the service, and click Roles to create a role. Here is how that may look:
For the trusted entity type, choose “AWS service,” and for the use case, click the
drop-down and select “Glue.” Here is how that may look:
Here are the permissions you would need for this IAM role; choose an appropriate name and create the IAM role:
Now, we will grant permission for this newly created role to create a catalog in our Lake Formation. Navigate to AWS Lake Formation, choose our data lake, and grant permission from Actions:
Figure 15-33. Grant permission for the newly created role in Lake Formation
Select the newly created IAM role and select the data lake under the named data
catalog resources. Within the database permissions, provide the create table access
option as well. Let us now locate the AWS Glue service.
You can search for Glue within the AWS management console or click the crawler
option in AWS Lake Formation to directly navigate to AWS Glue service.
Figure 15-34. Choosing the IAM user within the permissions page
We are going to create a new crawler, where we will specify the “data sources” folder
from the data lake bucket as the input, assign the newly created “dl-glue” IAM role, and
choose the “data lake” bucket for the target database.
Figure 15-35. Creating a new crawler with specified sources and IAM roles
Once you have created the crawler (may take a couple of minutes), you can see your
newly created crawler in AWS Glue.
Choose the crawler and click Run. You may have to wait a few more minutes for the
crawler to run.
During this process, the AWS Glue crawler would infer the structure of the input
dataset from the data sources folder and store the metadata in the data catalog.
Here is how that may look:
So far, we have successfully set up AWS Lake Formation, pointed the data sources at the AWS S3 bucket, created the necessary service role using AWS IAM, configured and run the AWS Glue crawler, and populated the data catalog. We will now use AWS Athena to query the data and extract insights. Visit the AWS management console and navigate to the AWS Athena service. Select the option of querying the data with Trino SQL and click Launch the query editor. Here is how that may look:
Let us now set up an output folder where we can store results of the query. At the
top, click the Settings option and then the Manage settings option. Click the Browse S3
option, where you navigate to the data lake S3 bucket and select the Output datasets
option. Save the settings.
Now, we are ready to write queries into AWS Athena and extract the insights from the
dataset. Let us write a simple query:
select distinct
    year, season, city
from events_csv
where year < 2020
order by year desc;
As we mentioned earlier, the results of this query are already stored in the S3 bucket
under the folder we specified earlier. Let us navigate to that folder to look at the query
results. You will see a complex hierarchy of folder structures created with date, month,
and year. Locate the CSV file type and click “Query with S3 Select.” You will have a simple
query already populated, and so just choose to run the query to obtain the results.
Here is how that may look:
AWS SageMaker
Amazon SageMaker is a fully managed machine learning service by AWS. It enables you
to build, train, test, and deploy machine learning models at scale. You have the option
of developing no-code or low-code machine learning models as well. SageMaker supports industry-leading machine learning frameworks like scikit-learn, TensorFlow, PyTorch, Hugging Face, and others.
AWS SageMaker comes in a few variants. AWS SageMaker Studio comes with a fully
functioning ML integrated development environment. It comes with a Jupyter notebook,
which gives you control over every step of the model being built. AWS SageMaker
Studio Lab is a user-friendly, simplified version of AWS SageMaker Studio, but it comes with limited hardware support: you get 16 gigabytes of memory and 15 gigabytes of storage, which cannot be extended, whereas in SageMaker Studio you can specify larger compute and more memory. There is also SageMaker Canvas, which is a low-code environment for developing predictive models.
To set up AWS SageMaker Studio, go to the service page and choose the option "Set up SageMaker Domain." Here is how that may look:
Once the SageMaker Domain is created for you, you can proceed to launch
SageMaker Studio from the Domain. Here is how that may look:
Let us try to build and deploy a machine learning model using Amazon SageMaker.
We will build a simple machine learning model, host it on an instance, create an API that
can be used to interact with the model, and create a lambda function that would access
the endpoint.
To get started, let us open JupyterLab and choose a new JupyterLab space. Provide a name for the JupyterLab instance and proceed to choose the underlying hardware. My suggestion is to use the "ml.t3.medium" instance, as you get 250 free hours of compute when you first start using SageMaker.
The Image option is basically a Docker image appropriately configured for machine learning tasks.
Here is how the settings may look, for instance:
Figure 15-44. Creating and configuring a new Jupyter space on AWS SageMaker
For our simple illustration, we will use the Titanic dataset, where the outcome is whether a passenger survived the sinking of the ship. For illustration purposes, we are working with a simplified version of the dataset and have removed a few columns as well. We import the necessary libraries, preprocess the dataset, train SageMaker's Linear Learner model, and create an endpoint that is hosted on a separate inference instance. We are using Amazon SageMaker's built-in linear algorithms to obtain the model.
Here is the code:
import sagemaker
from sagemaker import get_execution_role
import numpy as np
import pandas as pd

df = pd.read_csv("titanic.csv")
df = df.dropna(
    subset=["Age"]
)
df = df.drop(
    ['Name','Sex','Ticket','Cabin','Embarked'],
    axis=1
)
df.to_csv("titanic_subset.csv")

rawdata = np.genfromtxt(
    "titanic_subset.csv",
    delimiter=',',
    skip_header=1
)

# Assumption: after dropping the columns above, the CSV layout is
# index, PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
Ytr = rawdata[:, 2]   # Survived label
Xtr = rawdata[:, 3:]  # remaining numeric features

linear = sagemaker.LinearLearner(
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    # for the binary survival outcome, 'binary_classifier' may be a more
    # appropriate predictor_type than 'regressor'
    predictor_type='regressor',
    sagemaker_session=sagemaker.Session()
)

train_data_records = linear.record_set(
    Xtr.astype(np.float32),
    labels=Ytr.astype(np.float32),
    channel='train'
)

linear.fit(train_data_records)

predictor = linear.deploy(
    initial_instance_count=1,
    instance_type='ml.t3.medium'
)
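Once the deployment finishes, you can, for example, send a record to the endpoint from the same notebook. This is a minimal sketch that simply reuses the first training row, so the values are illustrative:
sample = Xtr[:1].astype(np.float32)
result = predictor.predict(sample)
print(result)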
Let us run this code in a JupyterLab notebook and make a note of the endpoint name that SageMaker generates; the deployed model is hosted on a separate SageMaker inference instance. We will then create a Lambda function that calls the hosted endpoint. Lambda functions need an IAM execution role associated with them in order to work with other services.
Let us create a new IAM role that can be used to interact with machine learning models. Find IAM in the AWS management console, locate the Identity and Access Management option on the left, and click Roles; create a new role and assign full access to the Lambda, SageMaker, and CloudWatch services. Here is how it may look for you:
Let us now create a lambda function. Navigate to the AWS management console
and search for “Lambda” service. Choose to create a function, select the Author from
scratch option, choose Python as the runtime configuration, and leave the architecture
selection as is.
Here is how the Create function page in AWS Lambda looks:
Ensure that you change the default execution role by using an existing role, and
select the IAM role you created from the drop-down option.
The AWS Lambda function comes with starter code, where the handler has two parameters: one (event) that carries the data of the event that triggers the function and another (context) containing runtime information.
We parse the JSON string to obtain the actual data that will be sent to SageMaker for prediction. We then invoke the SageMaker endpoint and read its response. This is followed by JSON parsing of the response and returning the prediction score. Here is the Lambda function, shown as a reconstructed, minimal version; the exact shape of the incoming event depends on how your API Gateway integration passes the request body:
import os
import json
import boto3

endpoint = os.environ['endpoint']
runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    # Assumption: the request body is a JSON string like {"data": "3,22.0,1,0,7.25"}
    payload = json.loads(event['body'])['data']
    # invoke the SageMaker endpoint with the CSV-formatted features
    response = runtime.invoke_endpoint(
        EndpointName=endpoint, ContentType='text/csv', Body=payload
    )
    # parse the JSON response and return the prediction score
    result = json.loads(response['Body'].read().decode())
    return int(result['predictions'][0]['score'])
In addition, you may wish to set the environment variable for this lambda function
by navigating to the Configuration option and choosing Environment variables. Here is
how that may look:
Figure 15-48. Setting the environment variables for the lambda function
Let us now expose this function through an API. From the AWS management console, navigate to the API Gateway service and choose to create an API. Choose to build a REST API, which gives you more control over the request and response. Enter a name for the new API, leave the endpoint type as is, and choose to create the API.
Since we will be sending data to the API to obtain a prediction score, we will create
a POST method that would post the data points to the machine learning API and return
a prediction score. And so, choose to create a method, select POST as method type, and
select the lambda function as the integration method.
From the Method type drop-down, please choose the POST method and not the
PUT method.
Once you have created this POST method, you can test the API functionality to see if
it is working as expected. On API Gateway, navigate to the Test option and enter the data
to be tested within the Request Body section and click Test. You will get the response
code and the response that the API returns.
Once everything looks good, we can proceed to deploy the API so that it is publicly accessible. In the top-right corner, choose the "Deploy API" option, choose "New stage" from the Stage drop-down, and enter a name for the stage, such as Development, Testing, Production, or Alpha, Beta, and so on.
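After deploying the stage, you can also exercise the public endpoint from Python. Here is a minimal sketch using the requests library; the invoke URL is a placeholder, and the payload shape is an assumption that must match how your API Gateway method passes the request body to the Lambda function above:
import requests

# hypothetical invoke URL of the deployed stage
url = "https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage>"
payload = {"data": "3,22.0,1,0,7.25"}   # hypothetical feature values

response = requests.post(url, json=payload)
print(response.status_code, response.text)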
Conclusion
So far, I hope you have gained a comprehensive view of Amazon Web Services, focusing
on its data engineering and machine learning capabilities. We have covered a wide
range of services starting from simple object storage to hosting your own machine
learning models on the cloud. We created IAM roles for secured access to services;
stood up databases with RDS, data warehouses with Amazon Redshift, data lakes with
Lake Formation, and machine learning models with AWS SageMaker; and also gained
experience in working with AWS Lambda, AWS Athena, and AWS Glue.
The examples provided here may seem to use comparatively generous IAM role
privileges. This is only for illustration purposes. In production, you should practice the
principle of least privilege, granting only the specific permissions required for a given
task. Overall, the topics we covered should serve you well in your projects.
CHAPTER 16
Engineering Data Pipelines Using Google Cloud Platform
Introduction
In this chapter, we will look at Google Cloud Platform, one of the early adopters of
cloud computing. We will explore various services of GCP, focusing on data
engineering and machine learning capabilities. We will start by understanding core
concepts of Google Cloud Platform and looking at key services. This is followed by a
detailed look at the various data system services offered by Google Cloud Platform. We
will finally look at Google Vertex AI, a fully managed machine learning platform for
building, training, and deploying machine learning models within one service.
By the end of the chapter, you will learn
• Fundamentals of Google Cloud Platform
• Google’s object storage and compute engine for virtual machines and
data systems
As you embark on your journey exploring Google Cloud Platform, please be mindful of managing your cloud resources, especially when you are done using them. With traditional infrastructure, you may simply switch off a computer when you are done. With cloud infrastructure, however, where you typically procure a plan based on a pay-as-you-go model, you will be billed for provisioned resources even if they are not being used. Signing out of GCP does not mean powering off the services you have provisioned. This is particularly important for virtual machines, databases, and any other compute-intensive resources. Please make it a habit to review your active resources and shut down the ones you are not using. If you are no longer planning to use a service, you may delete it. That way, you pay only for the services you actively use and keep your costs under control. I would also like to acknowledge that GCP provides free credits to new users, which actively supports new learners in keeping their costs down.
Each folder can contain one or more projects. The project represents the lowest level of the hierarchy in an organization; a project is where resources such as a virtual machine or a storage instance are provisioned.
For illustration purposes, let us set up folders as part of the GCP setup. You are going to need a Google account to begin with. I would suggest you start by creating a sandbox folder and creating your projects there. You may set up a development folder once you gain momentum and test your use cases there. Google Cloud can be accessed and interacted with through a graphical user interface, a command line interface, and client libraries that offer APIs to manage resources and services.
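If you prefer the command line, folders and projects can also be created with the gcloud CLI. Here is a sketch, assuming your account has the required organization-level permissions and that ORGANIZATION_ID and FOLDER_ID are replaced with your own values:
gcloud resource-manager folders create \
    --display-name="sandbox" \
    --organization=ORGANIZATION_ID
gcloud projects create my-sandbox-project --folder=FOLDER_ID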
1. First off, run the update command and install any dependencies your system may need (the commands below follow Google's documented apt-based installation; verify them against the current documentation):
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg curl
2. Import the Google Cloud CLI public key, add the gcloud CLI distribution as a package source, then install and initialize the CLI:
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get update && sudo apt-get install google-cloud-cli
gcloud init
You will have to authenticate in a browser by entering your credentials and allowing
the Google Cloud SDK to access your account.
If you are working on MS Windows operating system, then open a PowerShell tab
and enter the following:
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
& $env:Temp\GoogleCloudSDKInstaller.exe
Follow the prompts that the Google installer provides. Open Google Cloud CLI and
enter the following to get started with GCP CLI:
gcloud init
You are all set to utilize Google Cloud from your command line.
–– Memory optimized, which offers more memory per core, suitable for memory-intensive workloads
Let us get started by enabling the Compute Engine API. Navigate to the Google Cloud
console and look for a virtual machine option or Google compute option; we need to
enable the API service for consumption.
Here is how that may look:
Click Create a VM, which will lead us to the following page, where you would be
provided the option to enable the Google Compute Engine API.
Click ENABLE to enable the Compute Engine API.
Once enabled, you would be able to manage the API. It would appear like this:
With the Compute Engine API enabled, you may proceed to create a new virtual
machine.
Now, we are going to create a Linux virtual machine. We will begin by creating a new
compute instance, selecting the appropriate Linux distribution and the version, and
enabling access by allowing incoming HTTP traffic.
First, let us visit the menu on the left, locate the Compute Engine option, and click
VM Instances and create a new instance. You would be provided with a page, where you
can specify the name, location, and type of instance to begin with. You also have the
choice of either specifying a pre-packed machine type with a specific number of cores
and memory, or you can customize your configuration as well.
Based on the choices you make, Google will present the estimated costs that may be incurred every month. Here is how this may look for you:
You also have the option of specifying the access control, firewall rules, and other security configurations on the same page. Once you have chosen your desired configuration, you may click Create. You may also choose a minimal configuration and go with the bare essentials. Here is how the dashboard may look:
Please wait for a few minutes for the instance to be created. Once it is created,
Google will provide an SSH-in-browser session. Here is how that may look for you:
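As an alternative to the console, an equivalent virtual machine can be created from the gcloud CLI. Here is a sketch; the name, zone, machine type, and image family are illustrative values:
gcloud compute instances create my-first-vm \
    --zone=us-central1-a \
    --machine-type=e2-micro \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --tags=http-server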
Cloud SQL
Cloud SQL is a fully managed relational database service offered by Google. Because it is fully managed by the provider, it does not require you to handle system maintenance, administration, or operational support. Here is how this may look for you:
As we discussed earlier, Cloud SQL will provide you an option to choose the database engine. Once you have chosen the database engine, proceed to provide the compute instance name, a password for the root user, the Sandbox preset, a region close to your location, and a single zone, and choose to create the instance.
Please wait a few minutes for the instance to be created. When it is created, here is
how the screen would look:
Once you have provisioned the Cloud SQL service, you can proceed to create a
database and enable shell access to your Cloud SQL instance among other things. Let
us create a database here. Go to your instance, choose Databases in the left menu, and
click “Create a database.” Provide a database name, leave the character set to UTF-8, and
click Create.
Here is a simple application to load data programmatically from your local workstation into your newly created Cloud SQL database. The code below is a reconstructed sketch that uses the Cloud SQL Python Connector and SQLAlchemy; the credentials-file section name and keys are assumptions, so adjust them to match your own setup:
import configparser
import sqlalchemy
from google.cloud.sql.connector import Connector

config = configparser.ConfigParser()
path = '/Users/<your-folder-path-here>/credentials.cfg'
config.read(path)

connector = Connector()

# Assumption: the credentials file has a [cloudsql] section holding the
# instance connection name and database credentials
def getconn():
    conn = connector.connect(
        config.get('cloudsql','instance_connection_name'),
        "pymysql",
        user=config.get('cloudsql','db_user'),
        password=config.get('cloudsql','db_password'),
        db=config.get('cloudsql','db_name'),
    )
    return conn

pool = sqlalchemy.create_engine(
    "mysql+pymysql://",
    creator=getconn,
)

filepath = "/Users/<your-folder-path-here>/gcp/polars_dataset.csv"

with pool.connect() as db_conn:
    db_conn.execute(sqlalchemy.text("""
        CREATE TABLE onedata(
            firstname varchar(200),
            lastname varchar(200),
            gender varchar(10),
            birthdate varchar(100),
            type varchar(20),
            state varchar(100),
            occupation varchar(100)
        )
    """))
    db_conn.commit()
# the CSV at filepath can then be loaded into the new table, for example
# with pandas.DataFrame.to_sql using the same engine
Google Bigtable
Bigtable is a NoSQL offering of Google Cloud. This NoSQL data store is designed for high-velocity incoming data and low-latency database applications. Bigtable is used where data must be served rapidly, in applications such as stock trading, real-time gaming, and streaming media platforms. Bigtable is a denormalized database; there is no support for joins between tables.
Google Bigtable is a key–value store with no support for secondary indexes (other than the row key), and columns need not appear in the same order across rows. You still store the data in tables, and a given instance can contain several tables. In the absence of an enforced schema, it is good practice to store data of similar structure in the same table. Moreover, operations are atomic only at the row level: if an operation that writes multiple rows fails partway through, some rows may have been written while others may not have been written at all.
When designing Bigtable row keys, it is important to understand how the data will be queried and retrieved. You can incorporate multiple identifiers within the row key, separated by delimiters. Rows are sorted lexicographically by row key in Bigtable, so it is usually beneficial to design a row key that starts with the most common value and ends with the most granular value.
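For instance, a composite row key for sensor readings might place the most common identifier first and the most granular value last; the field names here are hypothetical:
# e.g., "us-east#device-42#2024-05-01T10:15:00"
row_key = f"{region}#{device_id}#{timestamp}"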
Google Bigtable defines a table as a logical organization of values indexed by row
key. Let us create a Bigtable instance. Navigate to the GCP console and type in “Bigtable.”
Once you are able to locate the service, choose to create an instance.
You can also install the Google Bigtable client for Python on your workstation. As per the documentation (https://cloud.google.com/python/docs/reference/bigtable/latest), you can install it with:
pip install google-cloud-bigtable
You always start by creating a Bigtable instance. An instance typically has one or more clusters, located in different zones. A given cluster has one or more compute nodes, which are responsible for managing your data and performing other tasks. To create an instance, you have to specify the storage type, where you have the choice of hard disk (HDD) or SSD.
Here is how this may look for you:
Each record in the sample JSON file (randomdata.json) that we will load into Bigtable has the following structure:
{
"id": 2,
"first_name": "Rube",
"last_name": "Mackett",
"email": "[email protected]",
"gender": "Male",
"ip_address": "153.211.80.48"
}
Here is a program that loads these records into Bigtable. It is reconstructed as a minimal sketch: the client setup, the "user#<id>" row key format, and storing every field as a string are assumptions.
import json
import configparser
from google.cloud import bigtable
from google.cloud.bigtable import column_family

config = configparser.ConfigParser()
path = '/Users/pk/Documents/de/gcp/credentials.cfg'
config.read(path)

# Configuration
project_id = config.get('bigtable','project_id')
instance_id = config.get('bigtable','instance_id')
table_id = config.get('bigtable','table_id')
json_file_path = "/Users/pk/Downloads/randomdata.json"

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)

# keep at most two versions of each cell in the "cf1" column family
max_versions_rule = column_family.MaxVersionsGCRule(2)
column_family_id = "cf1"
column_families = {column_family_id: max_versions_rule}
if not table.exists():
    table.create(column_families=column_families)

# write each JSON record as a row, keyed by its id
with open(json_file_path) as f:
    records = json.load(f)
for record in records:
    row = table.direct_row(f"user#{record['id']}")
    for key, value in record.items():
        row.set_cell(column_family_id, key, str(value))
    # commit
    row.commit()
You may observe the output in Bigtable Studio by running a simple query. Here is how that may look for you:
For reference, the [bigtable] section of the credentials.cfg file referenced in the code may look like this:
[bigtable]
project_id = value-here
instance_id = value-here
table_id = value-here
Google BigQuery
Google BigQuery is a fully managed analytical data warehousing solution by Google.
As it is a fully managed service, there is no need to set up, provision, or maintain the
underlying infrastructure. The storage for BigQuery is automatically provisioned and
also automatically scaled when there is more data coming in. BigQuery supports
high-throughput streaming ingestion in addition to high-throughput reads. In Google
BigQuery, the storage and compute are separated. Both the storage and compute can be
scaled independently, on demand.
Google BigQuery is a columnar store, where data is automatically replicated across
multiple availability zones to prevent data loss from machine failure or zone failure. In
BigQuery, data is stored mostly in tabular format. Tabular data includes standard tables, table clones, materialized views, and snapshots. A standard table is table data that has a schema, with an assigned data type for each column. Table clones are lightweight copies of standard tables, where only the delta between the table clone and the base table is stored (which is what makes them lightweight).
Table snapshots are point-in-time copies of tables, where you have the option to
restore a given table from a given point in time. The last one is materialized views, where
the results of a query are precomputed and cached and updated periodically. In Google
BigQuery, the cached results of a query are stored in temporary tables. Google does not
charge for cached query results.
Google BigQuery lets you manage the security and quality of data throughout its
lifecycle through access control, data stewardship, and data quality modules. You can
manage and control the access of BigQuery users and the level of access they have with
respect to tables, views, snapshots, and so on. There are also provisions to control and
manage access to certain rows and columns, enabling fine-grained access to sensitive
data. You can also obtain detailed records of user and system activity through audit logs.
In addition, you can safeguard personally identifiable information in the BigQuery
tables and views, by applying data masking to the PII fields. This will safeguard the
datasets against accidental data disclosure. BigQuery automatically encrypts all data,
both during the transit and at rest. You also have the option to check the data quality by
running statistical metrics (mean, median, unique values, etc.) on your dataset. You can
also validate your data against predefined rules and check data quality and troubleshoot
data issues.
To get started using Google BigQuery, we need to locate the service within Google
Cloud and enable it. Here is how that process may look:
Once you click ENABLE, you may need to wait a few moments until you are automatically redirected to Google BigQuery. There is no need to provision any infrastructure, compute, or even storage. When you start using BigQuery, you will be billed for storage and compute depending upon your usage.
Here is how that may look for you:
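Beyond the console, you can also query BigQuery programmatically with the google-cloud-bigquery client library. Here is a minimal sketch against a public dataset, assuming the library is installed and your environment is authenticated:
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
# run the query and iterate over the result rows
for row in client.query(query).result():
    print(row["name"], row["total"])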
Google Dataproc
Google Dataproc is a fully managed service that provides a Hadoop or Spark cluster for running big data processing in a distributed environment, letting you take advantage of the big data ecosystem. The Dataproc service also integrates well with various other Google Cloud services, such as BigQuery, enabling you to set up a complete data platform as well.
To get started with Google Dataproc, enable the API from the cloud console:
Google Dataproc also offers a serverless option for Spark, which means you pay only for what you consume, and the Hadoop cluster option is relatively inexpensive. Google Dataproc takes very little time to set up a new cluster. It also has an autoscaling feature, which scales the cluster up or down depending upon memory usage.
There are three types of clusters you can set up using Google Dataproc. They are
single-node cluster, standard cluster, and high-availability cluster. The single-node
cluster consists of one master and no worker nodes. The standard cluster consists of one
master and “n” workers, whereas the high-availability cluster consists of three master
nodes and “n” worker nodes. For a single-node cluster, the autoscaling feature is not
available.
Let us create a simple cluster in Google Dataproc with Compute Engine. To get
started, let us click Create a cluster and choose the Compute Engine option:
In our illustration, we will create a single-node cluster with 1 master and 0 workers.
Let us provide a cluster name, region, and type of cluster and choose a disk image for the
compute engine. Here is how that looks:
As you scroll down, make sure to check the Enable component gateway checkbox and also select the Jupyter Notebook component. You may wish to choose other options depending upon your needs. Here is how this may look:
Figure 16-20. Specifying the components for the new Dataproc cluster
Now, click Configure nodes, where you have the option to specify the type of
compute you wish to use. If you are creating a cluster for a development environment,
then you may be able to work comfortably with a less powerful compute.
In that case, you may wish to choose a general-purpose compute and select the
“N1 series” option if you see it. You may also select similar options for the worker node
as well.
You can also create a cluster from the Google Cloud command line interface. Here is an example command; the cluster name, region, and image version are placeholders that you should adjust to your setup:
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --single-node \
    --image-version=2.1-debian11 \
    --optional-components=JUPYTER \
    --enable-component-gateway
When the cluster is created, you may be able to see it in the dashboard. Here is how
this may look for you:
Let us create a Spark data pipeline and use Cloud Dataproc to execute the data
pipeline. We will use Vertex AI Workbench service to create a Jupyter notebook.
Vertex AI Workbench integrates well with Cloud Storage and BigQuery services. It
comes with various machine learning libraries preinstalled and supports both PyTorch
and TensorFlow frameworks. You have the option of using CPU-only instances or GPU-enabled instances.
Here is a sample code for you that you can run on Google Cloud Storage data:
from pyspark.sql import SparkSession

bucket = "your-bucket"
filepath = "gs://your-bucket/polars_dataset.csv"

spark = SparkSession.builder \
    .appName("PySpark-my-cluster") \
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest.jar") \
    .getOrCreate()

# read the CSV file from Cloud Storage into a Spark DataFrame
df = spark.read.csv(filepath, header=True, inferSchema=True)

occ_report = df.crosstab(col1="state", col2="occupation")
occ_report.show(5)
Note The jar referenced in "spark.jars" is the Spark BigQuery connector, which lets Spark read and write data in Google BigQuery. The connector is published in the public gs://spark-lib Cloud Storage bucket (the path used in the code above); you can also attach it when submitting a job through the "Submit a job" page in the Dataproc dashboard.
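You can also submit the PySpark script as a Dataproc job from the command line. Here is a sketch, with the script name, cluster name, and region as placeholders:
gcloud dataproc jobs submit pyspark my_pipeline.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar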
Google Vertex AI
Google's Vertex AI is a fully managed machine learning platform that lets you train and deploy machine learning models. Once you train a model, you can deploy it on Vertex AI, enabling the model's consumers to receive real-time intelligence and insights aimed at generating value for the business. The platform has many features and is comparable to the leading machine learning platforms offered by AWS and Microsoft.
Let us build a custom machine learning model from scratch. Vertex AI also offers an AutoML feature: AutoML stands for automated machine learning and refers to automating the machine learning workflow, including preprocessing, feature engineering, model selection, and hyperparameter tuning.
Google Vertex AI offers two development environments for building machine
learning models; they are Vertex AI Workbench and Colab notebooks. We have already
seen Vertex AI Workbench in the previous section and utilized the notebook to build
machine learning models. These workbench environments are provided through virtual
machines, and they are customizable with support for GPU as well.
Google Colab notebooks are a managed notebook environment that lets you
collaborate with others, while Google manages the underlying infrastructure for you.
You have the option of customizing the underlying compute and configuring runtimes.
The upload size for a tabular file cannot exceed 20 megabytes in Google Colab, whereas
in the case of Vertex AI Workbench, you can work with a tabular dataset with size up to
100 megabytes.
Let us look at building a machine learning model and hosting it as an endpoint using Google Vertex AI. We are using the iris flowers dataset for a multi-class classification problem. Here is the machine learning code for your reference. I encourage you to use this code as a starter and experiment with other classifiers:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pickle

training_dataset = pd.read_csv('iris.csv')
dataset = training_dataset.values

# split data into X and y
X = dataset[:, 0:4]
Y = dataset[:, 4]

# encode the string class labels as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(
    X,
    label_encoded_y,
    test_size=test_size,
    random_state=seed
)

xgmodel = XGBClassifier()
xgmodel.fit(
    X_train,
    y_train
)
print(xgmodel)

# evaluate predictions
predictions = xgmodel.predict(X_test)
accuracy = accuracy_score(
    y_test,
    predictions
)
print(accuracy)

filename = "model.pkl"
pickle.dump(xgmodel, open(filename, 'wb'))
Note Pickle files are binary files used to serialize Python objects such as dictionaries, lists, and trained models. The process of converting a Python object to a byte stream is called pickling or serialization; the inverse, deserializing or unpickling, reconstructs the object from a pickle file. Be aware that pickle files are not inherently secure: unpickling data from an untrusted source can execute arbitrary code, so only load pickle files you trust. Pickling is commonly used in machine learning modeling in the Python ecosystem.
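For instance, the saved model can later be loaded back, but only from a file you trust:
import pickle

with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
print(loaded_model)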
Once we have the model available in pickle file format, we can add this model to the Google Vertex AI model registry. To get started, visit Vertex AI from the Google console, navigate to the deployment section, and click Model Registry.
Here is how that may look for you:
You can either create a new machine learning model interactively or import the
pickle file you have generated. In this case, we will import our model as a pickle file. You
have the option of choosing the region for your model.
Figure 16-26. Importing the model, naming, and specifying the region
You may specify a name, description, and the region for this model. You can choose
to import as a new model. If by any chance you have retrained this same model (with
the iris dataset) and generated another model pickle file, then you may import the same
as a newer version of the existing model. We have trained our model using the XGBoost
model framework. If you have used scikit-learn or TensorFlow as a model framework,
then choose the appropriate model framework and model framework version from the
drop-down box. And for the model artifact location, provide the cloud storage folder
location, either by browsing visually or just by entering the URI.
For the model artifact location, please specify only the folder and not the exact model path. The field auto-populates the "gs://" prefix, so be mindful of the cloud storage path you enter.
Once you have imported the model pickle file into the model registry, it is time to create an endpoint and expose it for consumption. Let us navigate to the "DEPLOY AND USE" tab within Vertex AI and click Online prediction.
Here is how that may look:
Online prediction applies when you want to consume a machine learning model by submitting requests to an HTTP endpoint and receiving responses in real time. Let us deploy our iris machine learning model as an endpoint. You may choose a region where you wish to host this endpoint and click Create new endpoint.
Once you click to create an endpoint, provide the name for the endpoint and make a note of that name separately. To access the endpoint, we will choose the standard option here.
The traffic split specifies how much of the incoming traffic should be routed to this model on the endpoint. For our illustrative purposes, we can leave the value as is. This parameter can be changed at a later point, for instance, if we are hosting multiple models behind the same endpoint.
Furthermore, you can specify the number of compute nodes for this one endpoint.
Also, you can specify the maximum number of compute nodes you would like to have
should there be higher traffic. You can also specify custom machines with specific
compute and physical memory; there are also options to choose from specialized
machines like CPU intensive, memory intensive, etc.
For the remaining options, you may leave the defaults as they are and choose to deploy the model. The
process may take several minutes. Once the model is deployed, you may be able to see
the entry in your deployment dashboard.
Here is how that may look for you:
Figure 16-31. A deployed model that can be seen in the Google Vertex AI
dashboard
You can test the endpoint in various ways; a Postman client would be the simplest. Here is an illustration of using the Google Cloud command line interface, where the endpoint ID, region, and request file are placeholders and request.json contains a payload such as {"instances": [[5.1, 3.5, 1.4, 0.2]]}:
gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-central1 \
    --json-request=request.json
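Alternatively, here is a minimal Python sketch using the Vertex AI SDK; the project, region, and endpoint ID are placeholders, and the instance is a single set of iris measurements:
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")
endpoint = aiplatform.Endpoint("your-endpoint-id")

# send one instance of [sepal length, sepal width, petal length, petal width]
prediction = endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
print(prediction.predictions)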
Conclusion
So far, we have looked at Google Cloud Platform with a focus on data engineering
and machine learning projects. We explored a wide range of services from Cloud Storage
to BigQuery. GCP offers flexible compute options through the Compute Engine API and
managed database services like Bigtable and Cloud SQL. We looked at Dataproc and
BigQuery for running distributed computing workloads and also Vertex AI Workbench
for developing machine learning models. This was followed by Google Vertex AI, where
we looked at building a custom machine learning model, deploying the same to model
registry, and exposing it as an endpoint for real-time predictions. As GCP evolves and continues to add more services, it keeps offering managed services that reduce the
operational overhead for data and ML engineers, enabling them to build data solutions
that are stable and reliable.
CHAPTER 17
Engineering Data Pipelines Using Microsoft Azure
Introduction
Azure is Microsoft's cloud computing platform. The Azure concept was introduced in 2008 at a conference, and Microsoft launched the Windows Azure service in 2010, which was later rebranded as Microsoft Azure. Currently it
provides a comprehensive suite of computing, database, analytics, and other services
through various deployment modes. Microsoft has data centers all across the globe and
serves various industries. In this chapter, we will look at some of Azure’s key components
and services, with a focus on data engineering and machine learning.
By the end of this chapter, you will learn
–– How to set up a Microsoft Azure account and Azure’s resource
hierarchy
–– Microsoft’s object storage called Azure Blob Storage and its concepts
and setup
–– The relational database service called Azure SQL and its concepts
and installation
As you embark on your journey exploring the Microsoft Azure platform, please be mindful of managing your cloud resources, especially when you are done using them. With traditional infrastructure, you may simply switch off a computer when you are done. With cloud infrastructure, however, where you typically procure a plan based on a pay-as-you-go model, you will be billed for provisioned resources even if they are not being used. Signing out of Azure does not mean powering off the services you have provisioned. This is particularly important for virtual machines, databases, and any other compute-intensive resources. Please make it a habit to review your active resources and shut down the ones you are not using. If you are no longer planning to use a service, you may delete it. That way, you pay only for the services you actively use and keep your costs under control. I would also like to acknowledge that Azure provides free credits to new users, which actively supports new learners in keeping their costs down.
Introduction to Azure
Azure is Microsoft’s cloud computing platform offering several of the services that are
comparable to Google’s and Amazon’s cloud service offerings. Azure’s offerings are
particularly attractive to organizations who are heavily invested in Microsoft-based
technologies.
Microsoft has had a long-standing reputation in the data management field.
Microsoft was one of the early builders of ETL tools, which were instrumental in facilitating data movement between various sources and targets. To get started with Microsoft Azure, you need to visit Microsoft's website and sign up for a Microsoft account before attempting to obtain access to Azure. Once you have a Microsoft account, you may wish to sign up for the free Azure services. Like AWS and Google Cloud, you will provide various pieces of information to Microsoft to obtain access to Azure services.
Let us get familiar with the resource hierarchy of Microsoft Azure, which helps configure and govern the provisioning of services and workloads by a given account, especially an organizational account. The highest level is called the management group. The management group governs the utilization of various services by a given account.
Management groups enable an organization to apply various policies across all users
within a management group and apply access control to one or more Azure service
offerings. From the root management group, you can create one or more management
groups that may correspond to a business group or a certain type of billing model.
Simply search for the term “management group” from your Azure dashboard to get into
that section, and you may wish to create one for your organization. You can name your
management group as your organization or your department within the organization or
a specific project name.
If you visit a service offering page often, it may be convenient to pin that page to your dashboard. To do so, simply click the pin-like icon next to the title of your management group. Here is how that looks:
Within a given management group, you can have one or more subscriptions. You
can see the option of subscriptions in the Management groups section. A subscription
is a scheme that enables one or more users to access one or more Azure service
offerings. For instance, you can have two subscriptions within a management group for
a specific team, namely, development and production. You may allow more resources
on the production subscription and enable certain constraints on the development
subscription. Another example could be creating a subscription for every product or service that the organization offers, so that costs can be tracked per product or service.
Within any given subscription, we have a concept called resource group. This is a
project-based grouping of resources or one or more Azure services. For instance, if you
are looking to set up a data warehousing project and need databases, data warehouses,
data integration services, secrets management, data modeling tool services, and logging
and monitoring services, then you can create one resource group that can contain
these services. This can be beneficial in the context of governance and setting up access
policies.
Finally, the lowest level of abstraction in Azure is called a resource. Resources are the individual services that Azure offers; a resource can be a database or a tool that you pay to use, and it may consist of networking, storage, or compute entities. For instance, a serverless data integration service that uses compute to move data from source to sink is a resource.
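For reference, a resource group can also be created from the Azure CLI. Here is a sketch, assuming the CLI is installed and you are logged in, with the names as placeholders:
az group create --name my-data-rg --location eastus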
From a data products and services point of view, let us explore the storage systems,
database services, and data ingestion and integration services among others.
Note "Blob" is a database term that stands for binary large object. A blob can store any type of data as a binary object, which makes it well suited for images, audio, video, or PDF content; the contents of blob data are not indexable, though. There is also the clob, or character large object, which stores large amounts of text data such as documents and large code snippets.
To create a new blob storage, choose Storage from the main dashboard and create
a new storage. Provide the name of the instance and choose a closely located region. In
addition, choose standard performance and low-cost replication. You may be able to see
other options like choosing customer-managed encryption keys among other critically
important options. Microsoft will take a moment to run validation to review your choices
before enabling the “Create” button. Once you have created it, wait a few moments for
deployment of the resource in your account. Once deployed, here is how that looks:
You now have a storage account configured for your use. Let us now create a
container by clicking “Upload” and choosing “Create new.”
Here is how that looks:
Let us now programmatically upload a flat file into Azure Blob Storage using Python. To begin, we need to install the software development kit for Azure Blob Storage for the Python programming language. Open a terminal, create and activate a virtual environment (or a conda environment), and install the package:
python -m venv azure
source azure/bin/activate      # on macOS/Linux
azure\Scripts\activate         # on Windows
pip install azure-storage-blob
In order for us to programmatically interact with the cloud storage service, we need
the connection string, container name, and local file path inputs, respectively. We will
set the connection path to the file that needs to be uploaded and define the container
name and connection string, respectively. Then we will initialize the blob client and
the container client, respectively. A container is a logical grouping of objects in the blob
storage service. You can enforce container-level access control as well. Once we initialize the blob client, we can upload our file.
Here is how this all may look:
import os
from azure.storage.blob import BlobServiceClient

# connection string copied from the storage account's "Access keys" page;
# the account key below is a placeholder
connection_string = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=testblobforsampleusage;"
    "AccountKey=<your-account-key-here>;"
    "EndpointSuffix=core.windows.net"
)
container_name = "firstcontainer"
local_file_path = "/Users/username/hello.csv"

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)

# I am just using the file name from the local file path here
blob_name = os.path.basename(local_file_path)
blob_client = container_client.get_blob_client(blob_name)

# upload blob
with open(local_file_path, "rb") as data:
    blob_client.upload_blob(data)
Similar to AWS and Google Cloud, Microsoft Azure offers various storage tiers where you can organize data based on how frequently it is accessed. Currently Azure provides hot, cool, cold, and archive tiers. The hot tier is designed for the most frequent access (reading or writing) and is the most expensive of all. The cool tier is for data that is infrequently accessed; the data should be stored for a minimum of 30 days. The cold tier is similar in idea to the cool tier, except the data should be stored for a minimum of 90 days. The archive tier, intended for long-term backup, has a minimum storage period of 180 days.
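For example, an existing blob can be moved to a cooler tier programmatically, reusing the blob client from the upload example above:
# move the uploaded blob from the hot tier to the cool tier
blob_client.set_standard_blob_tier("Cool")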
Azure SQL
Azure SQL is a family of managed database services that can be consumed in three different deployment models. Recall our discussion on Infrastructure as a Service, Platform as a Service, and Software as a Service.
You can deploy SQL Server on an Azure virtual machine. Azure provides images of SQL Server preinstalled on an operating system. You can pick the one you prefer, though keep in mind that you will be responsible for operating system upgrades and patches. You also have the option to choose a memory-optimized or storage-optimized virtual machine. This option is known as SQL Server on Azure Virtual Machines, and this type of setup is the Infrastructure as a Service model.
You can also use Azure SQL Managed Instance, where the fully managed service automates most of the software updates, security updates, and patches. Here you just choose the appropriate database version and do not worry about the underlying operating system. You can still configure the underlying compute (and have the option to scale the number of virtual cores allocated to the managed instance up or down). This is referred to as a Platform as a Service model.
The other option is the service called Azure SQL Database, which operates closest to a Software as a Service model. In this option, you do not need to worry about the underlying infrastructure; Microsoft manages both the database engine and the underlying operating system. You have the option to go with the serverless model, where you pay only for what you consume; Azure bills you based on the number of seconds of usage. You can choose various tiers for the underlying infrastructure, and you have the flexibility to scale the compute up or down as well.
Let us create an Azure SQL database. In the process of creating a new SQL database,
we will also create an Azure SQL logical server. The Azure SQL logical server is like a
parent resource for the SQL database. With the Azure SQL logical server, you can enable
role-based access control, logins, threat protection, and firewalls to name a few. It also
provides an endpoint for accessing databases.
Let us search for “SQL database” and choose to create a new database instance. You
will have to mention the subscription and resource group details. You will also have to
enter a name for the database and choose a server. The server option is for the Azure
SQL logical server. If no server currently exists, you may create a new one. Here is how
this may look:
You can also choose whether to use an elastic SQL pool. An elastic SQL pool is
basically the underlying compute, storage, and IO for the database that can be shared
with other database instances as well. If you create “n” single database instances, then
you would have “n” underlying compute and other resources. If you use an elastic SQL
pool, then you can have less than “n” compute and storage resources and “share” them
among “n” database instances.
In addition to the option of a SQL elastic pool, we also have the choice of specifying whether the instance we are creating is for development or production purposes. SQL elastic pools are billed in eDTUs. An eDTU (elastic Database Transaction Unit) is a unit of measure that represents a blended amount of compute, storage, and IO resources. You purchase a certain number of eDTUs for the pool and can set limits on how many of these units each database may consume in a given amount of time.
Before we create a new SQL database, we may have to create a new Azure SQL
logical server. Let us click Create SQL database server, which would provide the page for
creating a new logical server. Here is how that may look:
You may specify a name and a location for the logical server; you also have the option
of specifying the authentication method. Once the information is supplied, you may
click OK to create a new logical server for your SQL database. For our purposes, we are
creating a single SQL database instance; hence, we choose not to use an elastic
SQL pool.
Once the logical server is ready, you have the option of choosing the compute and
storage for the SQL database. Here is where you choose the underlying hardware; you
may wish to explore the various hardware options that Azure offers along with their
pricing, which is calculated and shown instantly. In addition, you choose the compute
tier: either preallocated compute resources or serverless, where you are billed only for
the resources you have consumed.
Here is how this may look:
When it comes to choosing backup storage redundancy, you may wish to choose
either locally redundant backup storage or the zone-redundant option for your
development purposes. You may leave other options as is and choose to create the
database.
Here is a sample Python application that connects to a database, creates a table,
and inserts rows into that table. If you do not have the "pyodbc" package already,
issue the following command in your terminal:
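pip install pyodbc

If you load the connection settings from an environment file as shown below, you may also need the python-dotenv package (pip install python-dotenv). With the packages installed, the application may look as follows: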
import os
import pyodbc
from dotenv import load_dotenv  # assumes python-dotenv is used to read the environment file

load_dotenv()

server = os.getenv("server")
database = os.getenv("database")
username = os.getenv("username")
password = os.getenv("password")
driver = os.getenv("driver")

# Sample DDL and DML statements; adjust the table and values for your own schema
query1 = "CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(100))"
query2 = "INSERT INTO employees (id, name) VALUES (1, 'Alice'), (2, 'Bob')"

connection_string = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"

try:
    cnxn = pyodbc.connect(connection_string)
    cursor = cnxn.cursor()
    print('Connection established')
    try:
        cursor.execute(query1)
        cnxn.commit()
        print("Table created")
        try:
            cursor.execute(query2)
            cnxn.commit()
            print("Rows inserted")
        except pyodbc.Error as e:
            print(f"Unable to insert rows: {e}")
    except pyodbc.Error as e:
        print(f"Unable to create the table: {e}")
    finally:
        cursor.close()
        cnxn.close()
except pyodbc.Error as e:
    print(f"Unable to connect to the database: {e}")
Here is how the contents of the environment file may look:
server = "your-azure-database-server-instance"
database = "your-database"
username = "your-username"
password = "password"
driver = "{ODBC Driver 18 for SQL Server}"
Azure Cosmos DB
Azure Cosmos DB is a managed NoSQL database service provided by Microsoft. It is a
versatile offering that supports several data models, including relational data, document
stores, graph databases, and vector databases. Cosmos DB offers multiple APIs, each one
emulating a different database engine: the core SQL API for relational-style queries,
MongoDB for document databases, Cassandra for column-oriented databases, Gremlin
for graph databases, and Table for key-value databases. In other words, Cosmos DB is a
multimodel database service where you pick the API appropriate to your application
needs. In the case of a database migration, you may choose the API corresponding to the
relational or NoSQL database you intend to migrate.
You can consume Cosmos DB in two ways, namely, the provisioned throughput and
serverless models. In provisioned throughput, as the name suggests, you provision how
much throughput you expect the database to deliver, thereby defining the performance
level of your database application. The unit of measurement here is called a request unit.
The other way of consuming Cosmos DB is serverless. The serverless model charges
you only for the requests it has received and processed. As far as billing is concerned,
provisioned throughput computes the bill based on how many request units have been
provisioned, not how many were consumed, whereas serverless billing is computed
based on the number of request units consumed by database operations over an hour.
Note A request unit is a way of measuring how you have consumed the service.
Every operation that you perform with Cosmos DB will consume a certain amount
of compute, storage, and physical memory. The request unit is calculated based
on such consumption. For instance, a query that is sent to Cosmos DB will cost a
certain number of request units to retrieve the results. The exact number of request
units required to obtain the results may depend on the size of the table, query
compilation time, and number of rows, to name a few parameters.
Let us look at creating a Cosmos DB instance with the MongoDB API. Once you make
an API selection, it remains fixed for that instance; should you choose to work with
another API, you would create another instance with that API. In our case, we choose
the MongoDB API. If you are familiar with MongoDB, then working with Cosmos DB
will feel very similar. Here is where you may choose the API within Cosmos DB:
Figure 17-7. Choosing the right API for your Azure Cosmos DB instance
Next, you will be prompted to choose the resource model for your instance. There are
two categories: one is based on request units, while the other is based on virtual cores.
Both options are fully managed by Azure and scale as the workload increases. Let us
choose the virtual core option, where we can choose the underlying hardware.
Once you choose the virtual core option, you will be directed to configure your
Cosmos DB cluster. Here, you will have to specify the subscription or billing account
and the appropriate resource group (a resource group is a container for resources that
share the same policies and permissions). In addition, choose an administrator
username and password, and the name and location of the cluster, among other details.
You have the option of choosing the free tier for the underlying cluster. Let us look
at configuring the underlying cluster for this Cosmos DB instance. Under configuration,
you will find various development- and production-level hardware options, where you
can customize compute, storage, and physical memory. You also have the option to
enable high availability and database sharding.
Database sharding in MongoDB is a method of distributing data across multiple
worker nodes within a given cluster; you can also create replicas of your data to gain
speed in read operations. For instance, suppose a node holding a single shard can
process "n" operations per second; by distributing the data across "m" shards, you may
be able to process up to "m × n" operations per second in aggregate.
Here is how cluster configuration may look:
You also have the option of specifying the firewall rules and connectivity methods
within the networking option. Here is how that may look for you:
You may leave the networking settings and other settings as is, and click Create. It
might take several minutes for Azure to provision a new cluster. Let us look at connecting
and uploading a JSON document to Cosmos DB:
import json

from azure.cosmos import CosmosClient

import config  # assumes a local config.py module exposing a "settings" dictionary

url = config.settings['host']
key = config.settings['master_key']
database_name = config.settings['database_id']
container_name = config.settings['container_id']
path_to_json = "/Users/pk/Documents/mock_data.json"

def connect():
    container = None
    try:
        client = CosmosClient(url, credential=key)
        print("Established connection with the client")
        try:
            database = client.get_database_client(database_name)
            print("Database obtained")
            try:
                container = database.get_container_client(container_name)
                print("Container obtained")
            except Exception:
                print(f"unable to select the container {container_name}")
        except Exception:
            print(f"unable to select the database, {database_name}")
    except Exception:
        print("Unable to connect to the database")
    return container

def upsert(container):
    # Load a JSON array of documents and upsert them one by one
    with open(path_to_json, "rb") as file:
        json_doc = json.load(file)
    for js in json_doc:
        container.upsert_item(js)
        print(f"id uploaded is : {js['id']}")

def primaryfunc():
    c_name = connect()
    upsert(c_name)

primaryfunc()
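As a side note, you can inspect how many request units an operation consumed. Here is a minimal sketch that assumes a request unit-based (core SQL API) account and continues from the container client returned by connect() above; the header access pattern may vary slightly between SDK versions:
items = list(container.query_items(
    query="SELECT * FROM c",
    enable_cross_partition_query=True
))
request_charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"Query returned {len(items)} items and consumed {request_charge} request units")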
Azure Synapse Analytics
When creating a new Azure Synapse Analytics workspace, we have to define the name
and location of the workspace; in addition, we have to either specify an existing data lake
or create a new data lake as part of configuring the workspace. Once we have entered the
basics, let us specify a local user in the security settings.
Once Synapse Analytics is up and running, you can view it in your dashboard. Here
is how that may look:
Azure Data Factory
Azure Data Factory is Azure's managed data integration service for building and
orchestrating data pipelines.
Using Azure Data Factory, it is possible to compress the data during loading from one
system to another, thereby consuming less bandwidth. To get started with Azure Data
Factory, visit the link https://adf.azure.com/en/ or find the same link by searching for
Azure Data Factory in the console. A new tab will open where you can specify a name for
your new data factory and associate your subscription account as well.
Here is how that may look:
When you submit the information, you may wait a few minutes for Azure to prepare
the Azure Data Factory service for you. Here is how it may look when it is completed:
On the left side, you will see five options: Home is the main page, Author is your data
integration IDE, Monitor is where you can observe and troubleshoot pipeline runs,
Manage is where you control git integration and other settings, and Learning Center is
where you can find documentation and learning support for Azure Data Factory.
In Azure Data Factory, you author a data integration job, internally called a pipeline,
which is a group of activities logically arranged for execution. Each activity inside a
pipeline performs a single task. Activities are available for connecting source and sink
data systems, transporting data, and performing various transformations.
To connect to a data system, you have to define its connection information. To do so,
navigate to the Manage page and click "Linked services" under the Connections tab.
Let us add Azure Blob Storage as a new connection in Linked services. Click New,
search for Azure Blob Storage in the data store list, and click Continue. Enter a name and
description for the connection, choose Azure subscription under the account selection
method, and then select your subscription and storage account.
You can test the connection, and if the test succeeds, you have successfully added
Azure Blob Storage to your Azure Data Factory. Here is how that may look for you:
To get acquainted with Azure Data Factory, let us copy data from Azure Blob Storage
to an Azure SQL database. We will obtain a comma-delimited data file from a source and
upload it to Azure Blob Storage, and then use Azure Data Factory to run a copy pipeline
from Blob Storage to the Azure SQL database. To begin, navigate to the main page, click
the Ingest option, and choose Built-in copy task.
On the next page, we specify the source data system, which is Azure Blob Storage.
From the Source type drop-down menu, choose Azure Blob Storage, and from the
Connection drop-down, choose your blob storage; this should be available if you created
the linked service earlier (if not, the procedure for creating one here is the same).
Once your blob storage is linked, you can browse the object storage and choose the
appropriate flat file, or a folder that contains the flat files. Choose the "Recursively"
option if you plan to copy multiple flat files from a folder and its subfolders. The wizard
identifies the type of flat file selected and populates the file format settings automatically.
Figure 17-20. Specifying the source data store in Azure Data Factory
Now, let us specify the destination data system. Choose Azure SQL Database and
specify the connection, for which you may either create a new linked service or choose
one that has already been created. Since the destination is a relational database, we
have the option of loading the flat file directly into an existing table on the target
database; alternatively, we can let the tool automatically create the destination table
and populate it. Here is how that may look:
Figure 17-21. Specifying the destination data store in Azure Data Factory
Once this step is completed, let us name this data pipeline under Settings. Here is
how this may look for you:
Figure 17-22. Adding the name and description of this data pipeline
Once all these steps are completed, you will see a review screen that lists all the
parameters of this data pipeline. If there are any errors, or if you find something that
needs to be changed, you have the option of editing the respective fields. If everything
looks good, click Next, and Azure Data Factory will run the validation for the entire data
pipeline. Once the validation process finishes successfully, you may see the
following screen:
You may click Finish, which takes you to the main page. On the Azure Data Factory
dashboard, click the Monitor page, which lists the data pipelines currently in your
repository. You can click "Rerun" to run the data pipeline again. When the job completes
successfully, you may see something like the following:
Figure 17-24. Monitoring and rerunning of data pipelines in Azure Data Factory
Azure Functions
Azure Functions is a serverless compute option that lets you execute functions that
perform certain tasks without having to worry about the underlying hardware. Azure
automatically provides patched, up-to-date servers for these functions to run on and
scales them out if required; compute resources are allocated dynamically at runtime.
You are charged only for the code you execute (the duration of the execution and the
compute resources used). You can use Azure Functions to build APIs, respond to an
event happening on another service, and so on.
The highlight of Azure Functions is the use of triggers and bindings, which let you
create functions that interact with other services. A trigger is an event or a condition that
initiates the execution of a function, and you can have only one trigger per function. For
instance, an HTTP trigger runs the function when it receives a request and returns a
response; a service trigger activates an Azure function when an event happens on
another service, such as a new file being uploaded to Blob Storage; and a simple timer
trigger runs the function on a specific schedule.
Bindings in Azure Functions provide a declarative way to connect to data sources so
that the function can read and write data in other services. There are two types of
bindings, namely, input and output bindings. An input binding might read from a data
storage system like Blob Storage, a database, or a file system, while an output binding
might send data to another service. A function can have multiple input and output
bindings.
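To make triggers and bindings concrete, here is a minimal sketch using the Python v2 programming model; the container names ("raw-data" and "curated-data") and the "AzureWebJobsStorage" connection setting are assumptions for illustration:
import azure.functions as func

app = func.FunctionApp()

# Blob trigger (the single trigger) plus a blob output binding:
# whenever a file lands in "raw-data", copy its contents to "curated-data"
@app.blob_trigger(arg_name="inblob", path="raw-data/{name}",
                  connection="AzureWebJobsStorage")
@app.blob_output(arg_name="outblob", path="curated-data/{name}",
                 connection="AzureWebJobsStorage")
def copy_blob(inblob: func.InputStream, outblob: func.Out[str]):
    outblob.set(inblob.read().decode("utf-8"))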
To create a new Azure function, navigate to the Azure dashboard and search for
"Function App." Here is how that may look for you:
Once you are in the Function App service, create a new function app. You have to
enter the subscription details and define the instance details, including the name of the
function app, the runtime stack, the environment, and the operating system.
You can also define how you want to host the app. The most cost-effective option is
the serverless (Consumption) plan; however, you also have the option to reuse compute
from other Azure services.
Once you have created the function app instance, you can create your Azure
functions within it. Let us explore how to develop Azure functions locally using the
Azure Functions Core Tools package. Depending on your operating system, you may
wish to refer to the following instructions to install the Core Tools.
Here is the link: https://learn.microsoft.com/en-us/azure/azure-functions/
functions-run-local?tabs=macos%2Cisolated-process%2Cnode-v4%2Cpython-
v2%2Chttp-trigger%2Ccontainer-apps&pivots=programming-language-python
To check whether the installation is complete, open a terminal and type "func"; it will
render output relating to Azure Functions. To create a new Azure Functions project
locally, enter
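For instance, using a hypothetical project name of "myfunctionapp":

func init myfunctionapp --python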
It would set up the entire project structure with appropriate JSON elements. Here is
how that may look for you:
The folder structure on the left is created by Azure Functions core tools utility. You
may navigate to the function_app.py and proceed to write your Azure function. Here is a
sample Azure function for you:
import logging

import azure.functions as func

app = func.FunctionApp()

@app.function_name('helloworld')
@app.route(route="route66")
def helloworld(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('received request for helloworld function')
    user = req.params.get('username')
    if user:
        message = f"Hello, {user}"
    else:
        message = "welcome to Azure functions"
    return func.HttpResponse(
        message,
        status_code=200
    )
Once you have the Azure function ready, you may run it using "func start" from the
command line. Here is how that looks:
When you navigate to the appropriate URL, you can view the output of the function.
If you wish to publish your code to the function app in Azure (so far, the function has
only been developed and run locally), you can use the following command from your
terminal:
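A minimal sketch of that command, assuming your function app instance in Azure is named "my-function-app":

func azure functionapp publish my-function-app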
Azure Machine Learning
Azure Machine Learning is Microsoft's managed machine learning service, offering the
web-based Azure ML Studio and a Python SDK for machine learning. With Azure ML
Studio, you can write your own
code using Jupyter Notebook and execute the same using Azure ML services. There is
also a low-code option that Azure provides, which enables you to drag and drop ML data
assets and components to design, train, and deploy machine learning models.
To get started with Azure Machine Learning, navigate to the management console
and search for “Azure machine learning.” You will find Azure Machine Learning in the
marketplace option. Select the service and create a new machine learning workspace.
Provide appropriate details for your subscription and workspace.
Here is how this may look for you:
The workspace provides a centralized view of all the resources that Azure Machine
Learning uses. When you provision a new Azure Machine Learning workspace, Azure
automatically creates a set of supporting services dedicated to it: a storage account, a key
vault, a container registry, and Application Insights.
The storage account is newly created storage space for Azure Machine Learning data
assets, logs, and reports (if any), with the option to create blobs. The key vault is a secrets
management resource used to securely store sensitive information utilized in the
machine learning workspace. Application Insights provides operational insights and
diagnostic information while monitoring machine learning projects. The container
registry holds Docker containers for machine learning models, development
environments, and so on.
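As a side note, once the workspace exists, you can also connect to it from code. Here is a minimal sketch using the azure-ai-ml (v2) SDK; the subscription ID, resource group, and workspace names are placeholders:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Placeholder identifiers; replace with your own subscription, resource group, and workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
print(ml_client.workspace_name)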
Once the service is created and available, here is how this may look for you:
Once you are on this page, click Create. You can specify the name and description of
the dataset and choose either File or Tabular from the “Type” drop-down.
Azure ML Job
An Azure ML job is basically a set of machine learning tasks that runs on an underlying
compute target. You can build a training job, which may consist of a set of tasks that
trains a machine learning model; you can also build a sweep job, which may consist of a
set of tasks that executes a machine learning model with multiple hyperparameter
combinations and selects the model that best minimizes the cost function. These jobs
also generate log entries, and Azure ML Studio can track the metrics of each job.
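For illustration, here is a minimal sketch of submitting a training job with the azure-ai-ml SDK, continuing from the ml_client shown earlier; the script folder, curated environment name, and compute target name are assumptions:
from azure.ai.ml import command

job = command(
    code="./src",                       # assumed folder containing train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # placeholder curated environment
    compute="my-compute-instance",      # placeholder compute target
    display_name="diabetes-training-job",
)
returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)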
To get started with developing machine learning models, you need a machine
learning workspace and a compute instance. In previous sections, we have already
created a workspace. To create a new compute instance, you may either choose the
“Compute” option under “Manage” or simply click “Add compute instance” on the
dashboard.
You can specify a name for the compute instance and choose whether you need a
CPU-based or GPU-based instance. The choice depends on the nature of the machine
learning task you intend to perform.
GPUs are associated with parallelism and high-performance computing, so if your
machine learning workload benefits from massive parallelism, you can go for this
choice. Applications like training a deep learning model with several layers or a
computer vision (image processing) task are highly suited to GPUs. Workloads like text
modeling, large models, or recommender engines, where the underlying tasks benefit
from parallelism, can be trained on both CPU and GPU architectures (and distributed
across nodes with frameworks such as Dask or Ray). For tabular data, CPU-based
computation is usually sufficient.
Upon choosing the virtual machine, you may leave other options as is and proceed to
create the compute instance. Here is how this may look for you:
Once you have created the compute instance, you can start to utilize the compute for
developing your machine learning model. You may choose the Jupyter Notebook option
or the web-based VS Code editor as well. Here is how that may look for you:
Let us build a model that classifies whether a person is diabetic or not based on a
given set of observations. Depending on your preference, you may either work in a
Jupyter notebook or access Visual Studio Code. The goal is to train the model using
Azure ML Studio and deploy it using Azure Functions.
Here is the model code:
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.tsv", sep="\t")
#print("Top records of the dataset")
#print(df.head())
#print("Analytics on dataset")
#print(df.describe())

# Basic cleanup
df = df.drop_duplicates()
final_df = df.fillna(0)

x = final_df.drop(["Y"], axis=1)
y = final_df["Y"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Sample hyperparameters; tune these for your own experiments
params = {"n_estimators": 100, "max_depth": 5}

# RandomForestRegressor predicts a continuous score; for a strict
# diabetic / not-diabetic label you could swap in RandomForestClassifier
model = RandomForestRegressor(**params)
model.fit(X_train, y_train)

file = 'model.pkl'
pickle.dump(model, open(file, 'wb'))
You can use this code as a starter, build your own features, and try out other models
as well. Once you have exported the pickle file, we can proceed to create a new Azure
function to deploy the model. If you followed along with the "Azure Functions" section
earlier, navigate to the terminal and create a new Azure Functions project
named diabetes.
Here is how that may look for you:
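A minimal sketch of that step, assuming the Core Tools are installed as described earlier:

func init diabetes --python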
We are going to deploy the model as a serverless API using Azure Functions,
leveraging an HTTP trigger. Now, change directory into the diabetes folder and open the
function_app.py file. The function below runs only locally for now; once it works, you
can publish it to the function app in Azure using the publish command shown earlier.
import pickle

import azure.functions as func

app = func.FunctionApp()

# Hypothetical mapping from a 0/1 model output to a human-readable label
classifier = {0: "not diabetic", 1: "diabetic"}

@app.function_name('pima-indian-diabetes-model')
@app.route(route="pidm", auth_level=func.AuthLevel.ANONYMOUS)
def main(req: func.HttpRequest) -> func.HttpResponse:
    filepath = 'model.pkl'
    with open(filepath, 'rb') as file:
        model = pickle.load(file)
    data = req.get_json()["data"]
    y_pred = model.predict(data)[0]
    # Round the model output to a 0/1 label before looking it up
    pred_class = classifier[int(round(y_pred))]
    return func.HttpResponse(pred_class, status_code=200)
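Once the function is running locally with "func start", you can exercise the endpoint. Here is a minimal sketch using the requests package; the feature values in the payload are made-up placeholders and should match the columns the model was trained on:
import requests

# Hypothetical feature vector; replace with values matching your training columns
payload = {"data": [[59, 2, 32.1, 101.0, 157, 93.2, 38.0, 4.0, 4.8598, 87]]}

# The local Functions runtime serves HTTP routes at http://localhost:7071/api/<route>
response = requests.post("http://localhost:7071/api/pidm", json=payload)
print(response.status_code, response.text)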
Conclusion
So far, we have explored a wide range of services, from object storage to advanced
analytics, NoSQL solutions, data integration using Azure Data Factory, Azure Functions,
and Azure ML Studio. Azure Functions demonstrated how serverless computing can be
utilized to execute code. From creating object storage solutions to deploying machine
learning models, we looked at these services in detail along with code. While it is
difficult to cover the entire range of cloud computing offerings in a single chapter, I hope
the ideas, services, and practical implementations we've explored will serve as a solid
foundation for your cloud computing journey.