Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process, 1st Edition, Esppenchutz
Data Ingestion with Python
Cookbook
Gláucia Esppenchutz
BIRMINGHAM—MUMBAI
Data Ingestion with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
ISBN 978-1-83763-260-2
www.packtpub.com
This book represents a lot and wouldn’t be possible without my loving husband, Lincoln, and his
support and understanding during this challenging endeavor. I want to thank all my friends that
didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed,
helped, and said I would do big things one day. Finally, I want to thank my beloved and four-pawed
best friend, who is at peace, Minduim, for “helping” me to write this book.
– Gláucia Esppenchutz
Contributors
I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the
Python open source community and the DataBootCamp founders, who guided me at the beginning
of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
About the reviewers
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune
4 organization. He has a demonstrated history of working in the cloud, data and analytics industry
for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem
(Hadoop, Spark, etc.), and data warehousing on Teradata. He has worked in all phases of the SDLC
of DW/BI and big data projects with strong expertise in the USA healthcare, insurance and retail
domains. He actively helps new graduates with mentoring, resume reviews, and job hunting tips in
the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based
out of Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are
not only skilled in various domains, including AI, data warehouse architecture, and business analytics,
but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT.
Jagjeet is recognized for their significant contributions to the industry, particularly in complex proofs
of concept and integrating Microsoft products with ChatGPT. They are also an avid book reviewer
and have actively shared their extensive knowledge and expertise through presentations, blog articles,
and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in the area of software
development and delivery excellence across multiple domains and technology, ranging from C++ to
Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the
community by being part of GDG – Pune Volunteer Group, helping the team in organizing events.
When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction,
non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan by email at krishnan@gmail.com or via
LinkedIn: www.linkedin.com/in/krishnan-raghavan
I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to
review this book.
Table of Contents
Preface
2
Principles of Data Access – Accessing Your Data
  Technical requirements
  Implementing governance in a data access workflow
    Getting ready
    How to do it…
    How it works…
    See also
  Accessing databases and data warehouses
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Accessing SSH File Transfer Protocol (SFTP) files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Retrieving data using API authentication
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Managing encrypted files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Accessing data from AWS using S3
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Accessing data from GCP using Cloud Storage
    Getting ready
    How to do it…
    How it works…
    There’s more…
  Further reading
3
Data Discovery – Understanding Our Data before Ingesting It
  Technical requirements
  Documenting the data discovery process
    Getting ready
    How to do it…
    How it works…
  Configuring OpenMetadata
    Getting ready
4
Reading CSV and JSON Files and Solving Problems
  Technical requirements
  Reading a CSV file
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Reading a JSON file
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Creating a SparkSession for PySpark
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Using PySpark to read CSV files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Using PySpark to read JSON files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Further reading
5
Ingesting Data from Structured and Unstructured Databases
  Technical requirements
  Configuring a JDBC connection
  There’s more…
  See also
6
Using PySpark with Defined and Non-Defined Schemas
  Technical requirements
  Applying schemas to data ingestion
  How to do it…
  How it works…
7
Ingesting Analytical Data
  Technical requirements
  Ingesting Parquet files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Ingesting Avro files
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Applying schemas to analytical data
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Filtering data and handling common issues
    How it works…
    There’s more…
    See also
  Ingesting partitioned data
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Applying reverse ETL
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Selecting analytical data for reverse ETL
    Getting ready
    How to do it…
    How it works…
    See also
9
Putting Everything Together with Airflow
  Technical requirements
    Installing Airflow
  Configuring Airflow
    Getting ready
    How to do it…
    How it works…
    See also
  Creating DAGs
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Creating custom operators
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Configuring sensors
    Getting ready
    How to do it…
    How it works…
    See also
  Creating connectors in Airflow
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
10
Logging and Monitoring Your Data Ingest in Airflow
  Technical requirements
    Installing and running Airflow
  Creating basic logs in Airflow
    Getting ready
    How to do it…
    How it works…
    See also
  Storing log files in a remote location
    Getting ready
    How to do it…
    How it works…
    See also
  Configuring logs in airflow.cfg
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Designing advanced monitoring
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Using notification operators
    Getting ready
    How to do it…
    How it works…
    There’s more…
  Using SQL operators for data quality
    Getting ready
    How to do it…
    How it works…
    There’s more…
    See also
  Further reading
11
Automating Your Data Ingestion Pipelines
  Technical requirements
    Installing and running Airflow
  Scheduling daily ingestions
    Getting ready
12
Using Data Observability for Debugging, Error Handling, and Preventing Downtime
  Technical requirements
    Docker images
  Setting up StatsD for monitoring
    Getting ready
    How to do it…
    How it works…
    See also
  Setting up Prometheus for storing metrics
    Getting ready
    How to do it…
    How it works…
    There’s more…
    Getting ready
    How to do it…
    How it works…
    There’s more…
  Creating an observability dashboard
    Getting ready
    How to do it…
    How it works…
    There’s more…
  Setting custom alerts or notifications
    Getting ready
    How to do it…
    How it works…
Index
Chapter 2, Data Access Principles – Accessing your Data, explores data access concepts related to data
governance, covering workflows and management of familiar sources such as SFTP servers, APIs,
and cloud providers. It also provides examples of creating data access policies in databases, data
warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of
carrying out the data discovery process before data ingestion. It covers manual discovery, documentation,
and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON
files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures
while addressing common challenges and providing solutions.
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts
of relational and non-relational databases, including everyday use cases. You will learn how to read
and handle data from these models, understand vital considerations, and troubleshoot potential errors.
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark
use cases, focusing on handling defined and non-defined schemas. It also explores reading and
understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading
and writing. It explores reading partitioned data for improved performance and discusses Reverse
ETL theory with real-life application workflows and diagrams.
Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion,
facilitating error identification and debugging. Techniques such as monitoring file size, row count,
and object count enable improved monitoring through dashboards, alerts, and insights.
Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information
and guides you in building a real-life data ingestion application using Airflow. It covers essential
components, configuration, and issue resolution in the process.
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and
monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications,
and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered
to stay updated on the data ingestion process.
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using
previously learned best practices, enabling reader autonomy. It addresses common challenges with
schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime,
explores data observability concepts, popular monitoring tools such as Grafana, and best practices
for log storage and data lineage. It also covers creating visualization graphs to monitor data source
issues using Airflow configuration and data ingestion scripts.
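To give a flavor of the kind of ingestion described for Chapter 4, and of the row-count signal mentioned for Chapter 8, here is a minimal sketch that uses only the Python standard library. It is not taken from the book's recipes: the helper names read_csv_rows and read_json_records and the inlined sample data are hypothetical, chosen so the example runs without any files.

```python
import csv
import io
import json


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse JSON text that is expected to hold a list of objects."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


# Hypothetical sample data, inlined so the sketch needs no external files.
csv_text = "id,name\n1,Ana\n2,Bruno\n"
json_text = '[{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bruno"}]'

csv_rows = read_csv_rows(csv_text)
json_rows = read_json_records(json_text)

# Row counts are a simple signal to log and compare between runs.
print(len(csv_rows), len(json_rows))
```

Note that csv.DictReader returns every value as a string, while json.loads preserves numeric types, so the two sources may need type normalization before they are combined.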
For almost all recipes in this book, you can use a Jupyter Notebook to execute the code. Installing it
is not mandatory, but its friendly interface makes it easier to test the code and experiment with
it.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code via the GitHub repository (link available in the next section). Doing so will help you
avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then
we proceeded with the with open statement.”
A block of code is set as follows:
$ python3 --version
Python 3.8.10
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words
in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected
showString at NativeMethodAccessorImpl.java:0, which redirected us to the
Stages page.”
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How
it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any
preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous section.
There’s more…
This section consists of additional information about the recipe to deepen your understanding
of it.
See also
This section provides helpful links to other useful information for the recipe.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the
subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata, select your book, click on the Errata
Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
mental endowments just enumerated, not only to obviate disgust,
but to excite extraordinary admiration.
One of the most prominent and detestable vices indeed, in
Richard's character, his hypocrisy, connected, as it always is, in his
person, with the most profound skill and dissimulation, has, owing to
the various parts which it induces him to assume, most materially
contributed to the popularity of this play, both on the stage, and in
the closet. He is one who can
and accordingly appears, during the course of his career, under the
contrasted forms of a subject and a monarch, a politician and a wit,
a soldier and a suitor, a sinner and a saint; and in all with such
apparent ease and fidelity to nature, that while to the explorer of the
human mind he affords, by his penetration and address, a subject of
peculiar interest and delight, he offers to the practised performer a
study well calculated to call forth his fullest and finest exertions. He,
therefore, whose histrionic powers are adequate to the just
exhibition of this character, may be said to have attained the highest
honours of his profession; and, consequently, the popularity of
Richard the Third, notwithstanding the moral enormity of its hero,
may be readily accounted for, when we recollect, that the versatile
and consummate hypocrisy of the tyrant has been embodied by the
talents of such masterly performers as Garrick, Kemble, Cook, and
Kean.
So overwhelming and exclusive is the character of Richard, that
the comparative insignificancy of all the other persons of the drama
may be necessarily inferred; they are reflected to us, as it were,
from his mirror, and become more or less important, and more or
less developed, as he finds it necessary to act upon them; so that
our estimate of their character is entirely founded on his relative
conduct, through which we may very correctly appreciate their
strength or weakness.
The only exception to this remark is in the person of Queen
Margaret, who, apart from the agency of Richard, and dimly seen in
the darkest recesses of the picture, pours forth, in union with the
deep tone of this tragedy, the most dreadful curses and
imprecations; with such a wild and prophetic fury, indeed, as to
involve the whole scene in tenfold gloom and horror.
We have to add that the moral of this play is great and
impressive. Richard, having excited a general sense of indignation,
and a general desire of revenge, and, unaware of his danger from
having lost, through familiarity with guilt, all idea of moral obligation,
becomes at length the victim of his own enormous crimes; he falls
not unvisited by the terrors of conscience, for, on the eve of danger
and of death, the retribution of another world is placed before him;
the spirits of those whom he had murdered, reveal the awful
sentence of his fate, and his bosom heaves with the infliction of
eternal torture.
11. King Richard the Second: 1596. Our great poet having been
induced to improve and re-compose the Dramatic History of Henry
the Sixth, and to continue the character of Gloucester to the close of
his usurpation, in the drama of Richard the Third, very naturally,
from the success which had crowned these efforts, reverted to the
prior part of our national story for fresh subjects, and, led by a
common principle of association, selected for the commencement of
a new series of historical plays, which should form an unbroken
chain with those that he had previously written, the reign of Richard
the Second. On this account, therefore, and from the intimation of
time, noticed by Mr. Chalmers, towards the conclusion of the first
act, we are led to coincide with this gentleman in assigning
Now this pale cast of thought and its consequences, which, had
not Hamlet been interrupted by the entrance of Ophelia, he would
have himself applied to his own singular situation, form the very
essence, and give rise to the prominent defects of his character. It is
evident, therefore, that Shakspeare intended to represent him as
variable and indecisive in action, and that he has founded this want
of volition on one of those peculiar constitutions of the mental and
moral faculties which have been designated by the appellation of
genius, a combination of passions and associations which has led to
all the useful energies, and all the exalted eccentricities of human
life; and of which, in one of its most exquisite but speculative forms,
Hamlet presents us with perhaps the only instance on theatric
record.
To a frame of mind naturally strong and contemplative, but
rendered by extraordinary events sceptical and intensely thoughtful,
he unites an undeviating love of rectitude, a disposition of the
gentlest kind, feelings the most delicate and pure, and a sensibility
painfully alive to the smallest deviation from virtue or propriety of
conduct. Thus, while gifted to discern and to suffer from every moral
aberration in those who surround him, his powers of action are
paralysed in the first instance, by the unconquerable tendency of his
mind to explore, to their utmost ramification, all the bearings and
contingencies of the meditated deed; and in the second, by that
tenderness of his nature which leads him to shrink from the means
which are necessary to carry it into execution. Over this irresolution
and weakness, the result, in a great measure, of emotions highly
amiable, and which in a more congenial situation had contributed to
the delight of all who approached him, Shakspeare has thrown a veil
of melancholy so sublime and intellectual, as by this means to
constitute him as much the idol of the philosopher, and the man of
cultivated taste, as he confessedly is of those who feel their interest
excited principally through the medium of the sympathy and
compassion which his ineffective struggles to act up to his own
approved purpose naturally call forth.
It may be useful, however, in order to give more strength and
precision to this general outline, to enter into a few of the leading
particulars of Hamlet's conduct. He is represented at the opening of
the play as highly distressed by the sudden death of his father, and
the hurried and indecent nuptials of his mother, when the awful
appearance of the spectre overwhelms him with astonishment,
unhinges a mind already partially thrown off its bias, and fills it with
indelible apprehension, suspicion, and dismay. For though, on the
first communication of the murder, his bosom burns with the thirst of
vengeance, yet reflection and the gentleness of his disposition soon
induce him to regret that he has been chosen as the instrument of
effecting it,