Data Ingestion with Python
Cookbook

A practical guide to ingesting, monitoring, and identifying


errors in the data ingestion process

Gláucia Esppenchutz

BIRMINGHAM—MUMBAI
Data Ingestion with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.

Group Product Manager: Reshma Raman


Publishing Product Manager: Arindam Majumdar
Senior Editor: Tiksha Lad
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Jyoti Chauhan
Marketing Coordinator: Nivedita Singh

First published: May 2023

Production reference: 1300523

Published by Packt Publishing Ltd.


Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-83763-260-2

www.packtpub.com
This book represents a lot and wouldn’t be possible without my loving husband, Lincoln, and his
support and understanding during this challenging endeavor. I want to thank all my friends that
didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed,
helped, and said I would do big things one day. Finally, I want to thank my beloved and four-pawed
best friend, who is at peace, Minduim, for “helping” me to write this book.

– Gláucia Esppenchutz
Contributors

About the author


Gláucia Esppenchutz is a data engineer with expertise in managing data pipelines and vast amounts
of data using cloud and on-premises technologies. She worked in companies such as Globo.com,
BMW Group, and Cloudera. Currently, she works at AiFi, specializing in the field of data operations
for autonomous systems.
She comes from the biomedical field and shifted her career ten years ago to chase the dream of
working closely with technology and data. She is in constant contact with the open source community,
mentoring people and helping to manage projects, and has collaborated with the Apache, PyLadies
group, FreeCodeCamp, Udacity, and MentorColor communities.

I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the
Python open source community and the DataBootCamp founders, who guided me at the beginning
of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
About the reviewers
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune
4 organization. He has a demonstrated history of working in the cloud, data and analytics industry
for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem
(Hadoop, Spark, etc.), and data warehousing on Teradata. He has worked in all phases of the SDLC
of DW/BI and big data projects with strong expertise in the USA healthcare, insurance and retail
domains. He actively helps new graduates with mentoring, resume reviews, and job hunting tips in
the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based
out of Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are
skilled not only in various domains including AI, data warehouse architecture, and business analytics,
but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT.
Jagjeet is recognized for their significant contributions to the industry, particularly in complex proof
of concepts and integrating Microsoft products with ChatGPT. They are also an avid book reviewer
and have actively shared their extensive knowledge and expertise through presentations, blog articles,
and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in the area of software
development and delivery excellence across multiple domains and technology, ranging from C++ to
Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the
community by being part of GDG – Pune Volunteer Group, helping the team in organizing events.
When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction,
non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan at krishnan@gmail.com or via
LinkedIn: www.linkedin.com/in/krishnan-raghavan

I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to
review this book.
Table of Contents

Preface

Part 1: Fundamentals of Data Ingestion

Chapter 1: Introduction to Data Ingestion
Technical requirements
Setting up Python and its environment
Installing PySpark
Configuring Docker for MongoDB
Configuring Docker for Airflow
Creating schemas
Applying data governance in ingestion
Implementing data replication
Further reading

Chapter 2: Principals of Data Access – Accessing Your Data
Technical requirements
Implementing governance in a data access workflow
Accessing databases and data warehouses
Accessing SSH File Transfer Protocol (SFTP) files
Retrieving data using API authentication
Managing encrypted files
Accessing data from AWS using S3
Accessing data from GCP using Cloud Storage
Further reading

Chapter 3: Data Discovery – Understanding Our Data before Ingesting It
Technical requirements
Documenting the data discovery process
Configuring OpenMetadata
Connecting OpenMetadata to our database
Further reading
Other tools

Chapter 4: Reading CSV and JSON Files and Solving Problems
Technical requirements
Reading a CSV file
Reading a JSON file
Creating a SparkSession for PySpark
Using PySpark to read CSV files
Using PySpark to read JSON files
Further reading

Chapter 5: Ingesting Data from Structured and Unstructured Databases
Technical requirements
Configuring a JDBC connection
Ingesting data from a JDBC database using SQL
Connecting to a NoSQL database (MongoDB)
Creating our NoSQL table in MongoDB
Ingesting data from MongoDB using PySpark
Further reading

Chapter 6: Using PySpark with Defined and Non-Defined Schemas
Technical requirements
Applying schemas to data ingestion
Importing structured data using a well-defined schema
Importing unstructured data without a schema
Ingesting unstructured data with a well-defined schema and format
Inserting formatted SparkSession logs to facilitate your work
Further reading

Chapter 7: Ingesting Analytical Data
Technical requirements
Ingesting Parquet files
Ingesting Avro files
Applying schemas to analytical data
Filtering data and handling common issues
Ingesting partitioned data
Applying reverse ETL
Selecting analytical data for reverse ETL
Further reading

Part 2: Structuring the Ingestion Pipeline

Chapter 8: Designing Monitored Data Workflows
Technical requirements
Inserting logs
Using log-level types
Creating standardized logs
Monitoring our data ingest file size
Logging based on data
Retrieving SparkSession metrics
Further reading

Chapter 9: Putting Everything Together with Airflow
Technical requirements
Installing Airflow
Configuring Airflow
Creating DAGs
Creating custom operators
Configuring sensors
Creating connectors in Airflow
Creating parallel ingest tasks
Defining ingest-dependent DAGs
Further reading

Chapter 10: Logging and Monitoring Your Data Ingest in Airflow
Technical requirements
Installing and running Airflow
Creating basic logs in Airflow
Storing log files in a remote location
Configuring logs in airflow.cfg
Designing advanced monitoring
Using notification operators
Using SQL operators for data quality
Further reading

Chapter 11: Automating Your Data Ingestion Pipelines
Technical requirements
Installing and running Airflow
Scheduling daily ingestions
Scheduling historical data ingestion
Scheduling data replication
Setting up the schedule_interval parameter
Solving scheduling errors
Further reading

Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime
Technical requirements
Docker images
Setting up StatsD for monitoring
Setting up Prometheus for storing metrics
Setting up Grafana for monitoring
Creating an observability dashboard
Setting custom alerts or notifications
Further reading

Index

Other Books You May Enjoy


Preface
Welcome to Data Ingestion with Python Cookbook. I hope you are as excited as I am to enter the world
of data engineering.
Data Ingestion with Python Cookbook is a practical guide that will empower you to design and implement
efficient data ingestion pipelines. With real-world examples and renowned open-source tools, this
book addresses your queries and hurdles head-on.
Beginning with designing pipelines, you’ll explore working with and without data schemas, constructing
monitored workflows using Airflow, and embracing data observability principles while adhering
to best practices. You'll also tackle the challenges of reading diverse data sources and formats, gaining a
solid understanding of each along the way.
Our journey continues with essential insights into error logging, identification, and resolution, as well as
data orchestration and effective monitoring. You'll discover optimal approaches for storing logs, ensuring
they remain easy to access and reference in the future.
By the end of this book, you’ll possess a fully automated setup to initiate data ingestion and pipeline
monitoring. This streamlined process will seamlessly integrate into the subsequent stages of the Extract,
Transform, and Load (ETL) process, propelling your data integration capabilities to new heights. Get
ready to embark on an enlightening and transformative data ingestion journey.

Who this book is for


This comprehensive book is specifically designed for Data Engineers, Data Integration Specialists, and
passionate data enthusiasts seeking a deeper understanding of data ingestion processes, data flows,
and the typical challenges encountered along the way. It provides valuable insights, best practices, and
practical knowledge to enhance your skills and proficiency in handling data ingestion tasks effectively.
Whether you are a beginner in the data world or an experienced developer, this book will suit you.
It is recommended to know Python programming fundamentals and have a basic knowledge of
Docker to read and run this book's code.

What this book covers


Chapter 1, Introduction to Data Ingestion, introduces you to data ingestion best practices and the
challenges of working with diverse data sources. It explains the importance of the tools covered in
the book, presents them, and provides installation instructions.

Chapter 2, Principals of Data Access – Accessing Your Data, explores data access concepts related to data
governance, covering workflows and management of familiar sources such as SFTP servers, APIs,
and cloud providers. It also provides examples of creating data access policies in databases, data
warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of
carrying out the data discovery process before data ingestion. It covers manual discovery, documentation,
and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON
files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures
while addressing common challenges and providing solutions.
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts
of relational and non-relational databases, including everyday use cases. You will learn how to read
and handle data from these models, understand vital considerations, and troubleshoot potential errors.
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark
use cases, focusing on handling defined and non-defined schemas. It also explores reading and
understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading
and writing. It explores reading partitioned data for improved performance and discusses Reverse
ETL theory with real-life application workflows and diagrams.
Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion,
facilitating error identification, and debugging. Techniques such as monitoring file size, row count,
and object count enable improved monitoring of dashboards, alerts, and insights.
Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information
and guides you in building a real-life data ingestion application using Airflow. It covers essential
components, configuration, and issue resolution in the process.
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and
monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications,
and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered
to stay updated on the data ingestion process.
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using
previously learned best practices, enabling reader autonomy. It addresses common challenges with
schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime,
explores data observability concepts, popular monitoring tools such as Grafana, and best practices
for log storage and data lineage. It also covers creating visualization graphs to monitor data source
issues using Airflow configuration and data ingestion scripts.

To get the most out of this book


To execute the code in this book, you must have at least a basic knowledge of Python. We will use
Python as the core language to execute the code. The code examples have been tested using Python
3.8. However, they are expected to still work with future language versions.
Along with Python, this book uses Docker to emulate data systems and applications on our local
machine, such as PostgreSQL, MongoDB, and Airflow. Therefore, a basic knowledge of Docker is
recommended so that you can edit container image files and run and stop containers.
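For instance, a minimal command-line sketch of that workflow might look like the following; the container name, password, and port mapping are illustrative assumptions rather than values taken from this book's recipes:

$ docker run --name ingest-postgres -e POSTGRES_PASSWORD=example -p 5432:5432 -d postgres
$ docker ps
$ docker stop ingest-postgres
$ docker start ingest-postgres

Stopping a container this way keeps its data, so you can resume work on the same database later.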
Please remember that some command-line instructions may need adjustments depending on your local
settings or operating system. The commands in the code examples are based on Linux command-line
syntax and might need some adaptations to run in Windows PowerShell.

Software/Hardware covered in the book      | OS Requirements
Python 3.8 or higher                       | Windows, Mac OS X, and Linux (any)
Docker Engine 24.0 / Docker Desktop 4.19   | Windows, Mac OS X, and Linux (any)

For almost all recipes in this book, you can use a Jupyter Notebook to execute the code. Even though it
is not mandatory to install it, this tool can help you test the code and experiment with it, thanks to
its friendly interface.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code via the GitHub repository (link available in the next section). Doing so will help you
avoid any potential errors related to the copying and pasting of code.

Download the example code files


You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook. In case there's an update
to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images


We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You
can download it here: https://packt.link/xwl0U

Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then
we proceeded with the with open statement.”
A block of code is set as follows:

import logging

def gets_csv_first_line(csv_file):
    # Read and return the first line of the given CSV file, logging progress
    logging.info("Starting function to read first line")
    try:
        with open(csv_file, 'r') as file:
            logging.info("Reading file")
            return file.readline()
    except FileNotFoundError:
        logging.error("File not found: %s", csv_file)
        raise

Any command-line input or output is written as follows:

$ python3 --version
Python 3.8.10

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words
in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected
showString at NativeMethodAccessorImpl.java:0, which redirected us to the
Stages page.”

Tips or important notes


Appear like this.

Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How
it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any
preliminary settings required for the recipe.

How to do it…
This section contains the steps required to follow the recipe.

How it works…
This section usually consists of a detailed explanation of what happened in the previous section.

There’s more…
This section consists of additional information about the recipe in order to make you more knowledgeable
about it.

See also
This section provides helpful links to other useful information for the recipe.

Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the
subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata, select your book, click on the Errata
Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts


Once you’ve read Data Ingestion with Python Cookbook, we’d love to hear your thoughts! Please
click here to go straight to the Amazon review page for this book and share
your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering
excellent quality content.

Download a free PDF copy of this book


Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry; now, with every Packt book, you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical
books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content
in your inbox daily.
Follow these simple steps to get the benefits:

1. Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837632602

2. Submit your proof of purchase


3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1:
Fundamentals
of Data Ingestion

In this part, you will be introduced to the fundamentals of data ingestion and data engineering,
covering the basic definition of an ingestion pipeline, the common types of data sources, and
the technologies involved.
This part has the following chapters:

• Chapter 1, Introduction to Data Ingestion


• Chapter 2, Principals of Data Access – Accessing Your Data
• Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It
• Chapter 4, Reading CSV and JSON Files and Solving Problems
• Chapter 5, Ingesting Data from Structured and Unstructured Databases
• Chapter 6, Using PySpark with Defined and Non-Defined Schemas
• Chapter 7, Ingesting Analytical Data
1
Introduction to Data Ingestion
Welcome to the fantastic world of data! Are you ready to embark on a thrilling journey into data
ingestion? If so, this is the perfect book to start! Ingesting data is the first step into the big data world.
Data ingestion is a process that involves gathering, importing, and properly storing data so that the
subsequent extract, transform, and load (ETL) pipeline can utilize it. To make it
happen, we must be careful about the tools we will use and how to configure them properly.
In our book journey, we will use Python and PySpark to retrieve data from different data sources
and learn how to store them properly. To orchestrate all this, the basic concepts of Airflow will be
implemented, along with efficient monitoring to guarantee that our pipelines are covered.
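Before diving into the recipes, here is a minimal, illustrative sketch of a single ingestion step written in plain Python. It is not one of this book's recipes; the file names and the JSON output format are assumptions made only for this example:

import csv
import json
import logging

logging.basicConfig(level=logging.INFO)

def ingest_csv_to_json(source_path, target_path):
    # Gather: read the raw rows from the source CSV file
    with open(source_path, newline="") as source_file:
        rows = list(csv.DictReader(source_file))
    logging.info("Read %d rows from %s", len(rows), source_path)

    # Store: persist the rows in a format the next ETL stage can consume
    with open(target_path, "w") as target_file:
        json.dump(rows, target_file, indent=2)
    logging.info("Stored ingested data in %s", target_path)

if __name__ == "__main__":
    ingest_csv_to_json("raw_events.csv", "ingested_events.json")

The chapters that follow build on this idea with PySpark readers, explicit schemas, and Airflow orchestration.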
This chapter will introduce some basic concepts about data ingestion and how to set up your
environment to start the tasks.
In this chapter, you will build and learn the following recipes:

• Setting up Python and the environment


• Installing PySpark
• Configuring Docker for MongoDB
• Configuring Docker for Airflow
• Logging libraries
• Creating schemas
• Applying data governance in ingestion
• Implementing data replication

Technical requirements
The commands inside the recipes of this chapter use Linux syntax. If you don’t use a Linux-based
system, you may need to adapt the commands:

• Docker or Docker Desktop


• The SQL client of your choice (recommended); we recommend DBeaver, since it has a
free community version

You can find the code from this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Note
Windows users might get an error message such as Docker Desktop requires a newer WSL
kernel version. This can be fixed by following the steps here: https://docs.docker.com/desktop/windows/wsl/.

Setting up Python and its environment


In the data world, languages such as Java, Scala, or Python are commonly used. The first two languages
are used due to their compatibility with the big data tools environment, such as Hadoop and Spark,
the central core of which runs on a Java Virtual Machine (JVM). However, in the past few years, the
use of Python for data engineering and data science has increased significantly due to the language’s
versatility, ease of understanding, and many open source libraries built by the community.

Getting ready
Let’s create a folder for our project:

1. First, open your system command line. Since I use the Windows Subsystem for Linux (WSL),
I will open the WSL application.
2. Go to your home directory and create a folder as follows:
$ mkdir my-project

3. Go inside this folder:


$ cd my-project

4. Check your Python version on your operating system as follows:


$ python --version

Depending on your operating system, you might or might not have output here – for example,
WSL 20.04 users might have the following output:
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3

If your Python path is configured to use the python command, you will see output similar
to this:
Python 3.9.0

Sometimes, your Python path might be configured to be invoked using python3. You can
try it using the following command:
$ python3 --version

The output will be similar to the python command, as follows:


Python 3.9.0

5. Now, let’s check our pip version. This check is essential, since some operating systems have
more than one Python version installed:
$ pip --version

You should see similar output:


pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.9)

If your operating system (OS) uses a Python version below 3.8 or doesn’t have the language
installed, proceed to the How to do it… steps; otherwise, you are ready to start the following Installing
PySpark recipe.

How to do it…
We are going to use the official installer from Python.org. You can find the link for it here: https://www.python.org/downloads/:

Note
For Windows users, it is important to check your OS version, since Python 3.10 may not yet be
compatible with Windows 7, as well as your processor type (32-bit or 64-bit).

1. Download one of the stable versions.


At the time of writing, the stable recommended versions compatible with the tools and resources
presented here are 3.8, 3.9, and 3.10. I will use the 3.9 version and download it using the
following link: https://www.python.org/downloads/release/python-390/.
Scrolling down the page, you will find a list of links to Python installers according to OS, as shown
in the following screenshot.

Figure 1.1 – Python.org download files for version 3.9

2. After downloading the installation file, double-click it and follow the instructions in the wizard
window. To avoid complexity, choose the recommended settings displayed.
The following screenshot shows how it looks on Windows:

Figure 1.2 – The Python Installer for Windows

