100% found this document useful (2 votes)

344 views

Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process Esppenchutz download

The document is a promotional overview of the 'Data Ingestion with Python Cookbook' by Gláucia Esppenchutz, which serves as a practical guide for ingesting, monitoring, and identifying errors in data ingestion processes. It includes links to various related resources and books on data science and Python programming. The book is published by Packt Publishing and aims to provide comprehensive insights into data ingestion methodologies.

Uploaded by

dickodemard7s

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (2 votes)

344 views

Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process Esppenchutz download

Uploaded by

dickodemard7s

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

Data Ingestion with Python Cookbook: A practical

guide to ingesting, monitoring, and identifying

errors in the data ingestion process Esppenchutz
pdf download
https://textbookfull.com/product/data-ingestion-with-python-
cookbook-a-practical-guide-to-ingesting-monitoring-and-
identifying-errors-in-the-data-ingestion-process-esppenchutz/

Download more ebook from https://textbookfull.com

We believe these products will be a great fit for you. Click
the link to download now, or visit textbookfull.com
to discover even more!

Biota Grow 2C gather 2C cook Loucas

https://textbookfull.com/product/biota-grow-2c-gather-2c-cook-
loucas/

Practical Python Data Visualization: A Fast Track

Approach To Learning Data Visualization With Python
Ashwin Pajankar

https://textbookfull.com/product/practical-python-data-
visualization-a-fast-track-approach-to-learning-data-
visualization-with-python-ashwin-pajankar/

Hands-On Entity Resolution: A Practical Guide to Data

Matching With Python 1st Edition Michael Shearer

https://textbookfull.com/product/hands-on-entity-resolution-a-
practical-guide-to-data-matching-with-python-1st-edition-michael-
shearer/

Practical Data Science Cookbook Data pre processing

analysis and visualization using R and Python
Prabhanjan Tattar

https://textbookfull.com/product/practical-data-science-cookbook-
data-pre-processing-analysis-and-visualization-using-r-and-
python-prabhanjan-tattar/
Data monitoring committees in clinical trials a
practical perspective Demets

https://textbookfull.com/product/data-monitoring-committees-in-
clinical-trials-a-practical-perspective-demets/

Mastering Machine Learning with Python in Six Steps: A

Practical Implementation Guide to Predictive Data
Analytics Using Python 1st Edition Manohar Swamynathan
(Auth.)
https://textbookfull.com/product/mastering-machine-learning-with-
python-in-six-steps-a-practical-implementation-guide-to-
predictive-data-analytics-using-python-1st-edition-manohar-
swamynathan-auth/

A Python Data Analyst’s Toolkit: Learn Python and

Python-based Libraries with Applications in Data
Analysis and Statistics Gayathri Rajagopalan

https://textbookfull.com/product/a-python-data-analysts-toolkit-
learn-python-and-python-based-libraries-with-applications-in-
data-analysis-and-statistics-gayathri-rajagopalan/

Introduction to Machine Learning with Python A Guide

for Data Scientists Andreas C. Müller

https://textbookfull.com/product/introduction-to-machine-
learning-with-python-a-guide-for-data-scientists-andreas-c-
muller/

Azure Data Factory Cookbook: Data engineers guide to

build and manage ETL and ELT pipelines with data
integration , 2nd Edition Dmitry Foshin

https://textbookfull.com/product/azure-data-factory-cookbook-
data-engineers-guide-to-build-and-manage-etl-and-elt-pipelines-
with-data-integration-2nd-edition-dmitry-foshin/
Data Ingestion with Python
Cookbook

A practical guide to ingesting, monitoring, and identifying

errors in the data ingestion process

Gláucia Esppenchutz

BIRMINGHAM—MUMBAI
Data Ingestion with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Arindam Majumdar
Senior Editor: Tiksha Lad
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Jyoti Chauhan
Marketing Coordinator: Nivedita Singh

First published: May 2023

Production reference: 1300523

Published by Packt Publishing Ltd.

Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-83763-260-2

www.packtpub.com
This book represents a lot and wouldn’t be possible without my loving husband, Lincoln, and his
support and understanding during this challenging endeavor. I want to thank all my friends that
didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed,
helped, and said I would do big things one day. Finally, I want to thank my beloved and four-pawed
best friend, who is at peace, Minduim, for “helping” me to write this book.

– Gláucia Esppenchutz
Contributors

About the author

Gláucia Esppenchutz is a data engineer with expertise in managing data pipelines and vast amounts
of data using cloud and on-premises technologies. She worked in companies such as Globo.com,
BMW Group, and Cloudera. Currently, she works at AiFi, specializing in the field of data operations
for autonomous systems.
She comes from the biomedical field and shifted her career ten years ago to chase the dream of
working closely with technology and data. She is in constant contact with the open source community,
mentoring people and helping to manage projects, and has collaborated with the Apache, PyLadies
group, FreeCodeCamp, Udacity, and MentorColor communities.

I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the
Python open source community and the DataBootCamp founders, who guided me at the beginning
of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
About the reviewers
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune
4 organization. He has a demonstrated history of working in the cloud, data and analytics industry
for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem
(Hadoop, Spark, etc.), and data warehousing on Teradata. He has worked in all phases of the SDLC
of DW/BI and big data projects with strong expertise in the USA healthcare, insurance and retail
domains. He actively helps new graduates with mentoring, resume reviews, and job hunting tips in
the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based
out of Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are
skilled not only in various domains including AI, data warehouse architecture, and business analytics,
but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT.
Jagjeet is recognized for their significant contributions to the industry, particularly in complex proof
of concepts and integrating Microsoft products with ChatGPT. They are also an avid book reviewer
and have actively shared their extensive knowledge and expertise through presentations, blog articles,
and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in the area of software
development and delivery excellence across multiple domains and technology, ranging from C++ to
Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the
community by being part of GDG – Pune Volunteer Group, helping the team in organizing events.
When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction,
non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan at mail to: k r i s h n a n @ g m a i l . c o m or via
LinkedIn: www.linkedin.com/in/krishnan-raghavan

I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to
review this book.
Table of Contents

Prefacexv

Part 1: Fundamentals of Data Ingestion

1
Introduction to Data Ingestion 3
Technical requirements 4 How to do it… 17
Setting up Python and its environment 4 How it works… 21
See also 22
Getting ready 4
How to do it… 5 Creating schemas 22
How it works… 7 Getting ready 22
There’s more… 7 How to do it… 22
See also 7 How it works… 24
Installing PySpark 8 See also 25

Getting ready 8 Applying data governance in ingestion 25

How to do it… 10 Getting ready 25
How it works… 12 How to do it… 26
There’s more… 12 How it works… 28
See also 12 See also 28
Configuring Docker for MongoDB 13 Implementing data replication 29
Getting ready 13 Getting ready 29
How to do it… 13 How to do it… 29
How it works… 14 How it works… 29
There’s more… 16 There’s more… 30
See also 16
Further reading 30
Configuring Docker for Airflow 16
Getting ready 17
viii Table of Contents

2
Principals of Data Access – Accessing Your Data 31
Technical requirements 31 How to do it… 47
Implementing governance in a data How it works… 48
access workflow 32 There’s more… 49
Getting ready 32 See also 52
How to do it… 33 Managing encrypted ﬁles 52
How it works… 34 Getting ready 52
See also 34 How to do it… 53
Accessing databases and data How it works… 54
warehouses34 There’s more… 55
Getting ready 35 See also 56
How to do it… 35 Accessing data from AWS using S3 56
How it works… 37 Getting ready 56
There’s more… 38 How to do it… 59
See also 39 How it works… 62
Accessing SSH File Transfer Protocol There’s more… 63
(SFTP) ﬁles 39 See also 63
Getting ready 39 Accessing data from GCP using
How to do it… 41 Cloud Storage 64
How it works… 43 Getting ready 64
There’s more… 43 How to do it… 66
See also 44 How it works… 68
Retrieving data using API There’s more… 70
authentication44 Further reading 70
Getting ready 45

3
Data Discovery – Understanding Our Data before Ingesting It 71
Technical requirements 71 How to do it… 73
Documenting the data discovery How it works… 77
process71 Configuring OpenMetadata 77
Getting ready 72 Getting ready 77
Table of Contents ix

How to do it… 79 Getting ready 86

How it works… 84 How to do it… 88
There’s more… 86 How it works… 92
See also 86
Further reading 93
Connecting OpenMetadata Other tools 94
to our database 86

4
Reading CSV and JSON Files and Solving Problems 95
Technical requirements 95 How it works… 105
Reading a CSV file 96 There’s more… 106
See also 107
Getting ready 96
How to do it… 96 Using PySpark to read CSV files 108
How it works… 98 Getting ready 108
There’s more… 98 How to do it… 108
See also 99 How it works… 109
Reading a JSON file 99 There’s more… 110
See also 114
Getting ready 100
How to do it… 100 Using PySpark to read JSON files 114
How it works… 100 Getting ready 114
There’s more… 101 How to do it… 115
See also 103 How it works… 116
Creating a SparkSession for PySpark 103 There’s more… 117
See also 117
Getting ready 103
How to do it… 104 Further reading 117

5
Ingesting Data from Structured and Unstructured Databases 119
Technical requirements 119 There’s more… 127
Configuring a JDBC connection 120 See also 129

Getting ready 120 Ingesting data from a JDBC database

How to do it… 124 using SQL 129
How it works… 126 Getting ready 129
x Table of Contents

How to do it… 130 Getting ready 146

How it works… 132 How to do it… 146
There’s more… 133 How it works… 148
See also 133 There’s more… 149
See also 150
Connecting to a NoSQL
database (MongoDB) 134 Ingesting data from MongoDB
Getting ready 134 using PySpark 150
How to do it… 136 Getting ready 150
How it works… 142 How to do it… 151
There’s more… 144 How it works… 153
See also 145 There’s more… 155
See also 156
Creating our NoSQL table in
MongoDB145 Further reading 157

6
Using PySpark with Deﬁned and Non-Deﬁned Schemas 159
Technical requirements 159 How to do it… 169
Applying schemas to data ingestion 160 How it works… 170

Getting ready 160 Ingesting unstructured data with a

How to do it… 162 well-deﬁned schema and format 172
How it works… 163 Getting ready 172
There’s more… 165 How to do it… 172
See also 165 How it works… 174
Importing structured data using a There’s more… 176
well-deﬁned schema 165 See also 176

Getting ready 165 Inserting formatted SparkSession

How to do it… 165 logs to facilitate your work 176
How it works… 167 Getting ready 176
There’s more… 168 How to do it… 176
See also 169 How it works… 178
Importing unstructured data without There’s more… 179
a schema 169 See also 179

Getting ready… 169 Further reading 179

Table of Contents xi

7
Ingesting Analytical Data 181
Technical requirements 181 How it works… 197
Ingesting Parquet ﬁles 182 There’s more… 198
See also 200
Getting ready 182
How to do it… 183 Ingesting partitioned data 200
How it works… 184 Getting ready 200
There’s more… 185 How to do it… 201
See also 185 How it works… 201
Ingesting Avro files 185 There’s more… 203
See also 204
Getting ready 186
How to do it… 186 Applying reverse ETL 204
How it works… 188 Getting ready 204
There’s more… 190 How to do it… 205
See also 190 How it works… 206
Applying schemas to analytical data 191 There’s more… 207
See also 207
Getting ready 191
How to do it… 191 Selecting analytical data for reverse
How it works… 194 ETL207
There’s more… 194 Getting ready 207
See also 195 How to do it… 208
Filtering data and handling common How it works… 209
issues195 See also 210

Getting ready 195 Further reading 210

How to do it… 196

Part 2: Structuring the Ingestion Pipeline

8
Designing Monitored Data Workﬂows 213
Technical requirements 213 Getting ready 214
Inserting logs 214 How to do it… 214
How it works… 216
xii Table of Contents

Creating parallel ingest tasks 270 Getting ready 274

Getting ready 270 How to do it… 275
How to do it… 271 How it works… 277
How it works… 272 There's more… 278
There's more… 273 See also 279
See also 274
Further reading 279
Deﬁning ingest-dependent DAGs 274

10
Logging and Monitoring Your Data Ingest in Airflow 281
Technical requirements 281 Designing advanced monitoring 304
Installing and running Airflow 282 Getting ready 304
How to do it… 306
Creating basic logs in Airflow 283
How it works… 308
Getting ready 284
There’s more… 309
How to do it… 284
See also 309
How it works… 287
See also 289 Using notification operators 309
Getting ready 310
Storing log files in a remote location 289
How to do it… 312
Getting ready 289
How it works… 315
How to do it… 290
There’s more… 318
How it works… 298
See also 299 Using SQL operators for data quality 318
Getting ready 318
Configuring logs in airflow.cfg 299
How to do it… 320
Getting ready 299
How it works… 321
How to do it… 299
There’s more… 323
How it works… 301
See also 323
There’s more… 303
See also 304 Further reading 324

11
Automating Your Data Ingestion Pipelines 325
Technical requirements 325 Scheduling daily ingestions 326
Installing and running Airflow 326 Getting ready 327
xiv Table of Contents

How to do it… 327 There's more… 340

How it works… 330
Setting up the schedule_interval
There's more… 330
parameter340
See also 331
Getting ready 340
Scheduling historical data ingestion 331 How to do it… 341
Getting ready 331 How it works… 342
How to do it… 332 See also 342
How it works… 335
Solving scheduling errors 343
There's more… 336
Getting ready 343
Scheduling data replication 337 How to do it… 343
Getting ready 337 How it works… 346
How to do it… 338 There’s more… 347
How it works… 339
Further reading 347

12
Using Data Observability for Debugging, Error Handling,
and Preventing Downtime 349
Technical requirements 349 Getting ready 358
Docker images 350 How to do it… 358
How it works… 361
Setting up StatsD for monitoring 351 There’s more… 363
Getting ready 351
How to do it… 351 Creating an observability dashboard 363
How it works… 353 Getting ready 363
See also 354 How to do it… 363
How it works… 369
Setting up Prometheus for storing There’s more… 370
metrics354
Getting ready 354 Setting custom alerts or notiﬁcations 370
How to do it… 354 Getting ready 371
How it works… 356 How to do it… 371
There’s more… 357 How it works… 377

Setting up Grafana for monitoring 358 Further reading 378

Index379

Other Books You May Enjoy 388

Preface
Welcome to Data Ingestion with Python Cookbook. I hope you are excited as me to enter the world
of data engineering.
Data Ingestion with Python Cookbook is a practical guide that will empower you to design and implement
efficient data ingestion pipelines. With real-world examples and renowned open-source tools, this
book addresses your queries and hurdles head-on.
Beginning with designing pipelines, you’ll explore working with and without data schemas, constructing
monitored workflows using Airflow, and embracing data observability principles while adhering
to best practices. Tackling the challenges of reading diverse data sources and formats, you’ll gain a
comprehensive understanding of all these.
Our journey continues with essential insights into error logging, identification, resolution, data
orchestration, and effective monitoring. You’ll discover optimal approaches for storing logs, ensuring
easy access and references for them in the future.
By the end of this book, you’ll possess a fully automated setup to initiate data ingestion and pipeline
monitoring. This streamlined process will seamlessly integrate into the subsequent stages of the Extract,
Transform, and Load (ETL) process, propelling your data integration capabilities to new heights. Get
ready to embark on an enlightening and transformative data ingestion journey.

Who this book is for

This comprehensive book is specifically designed for Data Engineers, Data Integration Specialists, and
passionate data enthusiasts seeking a deeper understanding of data ingestion processes, data flows,
and the typical challenges encountered along the way. It provides valuable insights, best practices, and
practical knowledge to enhance your skills and proficiency in handling data ingestion tasks effectively.
Whether you are a beginner in the data world or an experienced developer, this book will suit you.
It is recommended to know the Python programming fundamentals and have basic knowledge of
Docker to read and run this book’s code.

What this book covers

Chapter 1, Introduction to Data Ingestion, introduces you to data ingestion best practices and the
challenges of working with diverse data sources. It explains the importance of the tools covered in
the book, presents them, and provides installation instructions.
xvi Preface

Chapter 2, Data Access Principals – Accessing your Data, explores data access concepts related to data
governance, covering workflows and management of familiar sources such as SFTP servers, APIs,
and cloud providers. It also provides examples of creating data access policies in databases, data
warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of
carrying out the data discovery process before data ingestion. It covers manual discovery, documentation,
and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON
files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures
while addressing common challenges and providing solutions.
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts
of relational and non-relational databases, including everyday use cases. You will learn how to read
and handle data from these models, understand vital considerations, and troubleshoot potential errors.
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark
use cases, focusing on handling defined and non-defined schemas. It also explores reading and
understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading
and writing. It explores reading partitioned data for improved performance and discusses Reverse
ETL theory with real-life application workflows and diagrams.
Chapter 8, Designing Monitored Data Workﬂows, covers logging best practices for data ingestion,
facilitating error identification, and debugging. Techniques such as monitoring file size, row count,
and object count enable improved monitoring of dashboards, alerts, and insights.
Chapter 9, Putting Everything Together with Airﬂow, consolidates the previously presented information
and guides you in building a real-life data ingestion application using Airflow. It covers essential
components, configuration, and issue resolution in the process.
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and
monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications,
and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered
to stay updated on the data ingestion process.
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using
previously learned best practices, enabling reader autonomy. It addresses common challenges with
schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime,
explores data observability concepts, popular monitoring tools such as Grafana, and best practices
for log storage and data lineage. It also covers creating visualization graphs to monitor data source
issues using Airflow configuration and data ingestion scripts.
Preface xvii

To get the most out of this book

To execute the code in this book, you must have at least a basic knowledge of Python. We will use
Python as the core language to execute the code. The code examples have been tested using Python
3.8. However, it is expected to still work with future language versions.
Along with Python, this book uses Docker to emulate data systems and applications in our local
machine, such as PostgreSQL, MongoDB, and Airflow. Therefore, a basic knowledge of Docker is
recommended to edit container image files and run and stop containers.
Please, remember that some command-line commands may need adjustments depending on your local
settings or operating system. The commands in the code examples are based on the Linux command-
line syntax and might need some adaptations to run on Windows PowerShell.

Software/Hardware covered in the book OS Requirements

Python 3.8 or higher Windows, Mac OS X, and Linux (any)

Docker Engine 24.0 / Docker Desktop 4.19 Windows, Mac OS X, and Linux (any)

For almost all recipes in this book, you can use a Jupyter Notebook to execute the code. Even though it
is not mandatory to install it, this tool can help you to test the code and try new things on the code due
to the friendly interface.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code via the GitHub repository (link available in the next section). Doing so will help you
avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/
PacktPublishing/Data-Ingestion-with-Python-Cookbook. In case there’s an update
to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://
github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You
can download it here: https://packt.link/xwl0U
xviii Preface

Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then
we proceeded with the with open statement.”
A block of code is set as follows:

def gets_csv_first_line (csv_file):

    logging.info(f"Starting function to read first line")
    try:
        with open(csv_file, 'r') as file:
            logging.info(f"Reading file")

Any command-line input or output is written as follows:

$ python3 –-version
Python 3.8.10

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words
in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected
showString at NativeMethodAccessorImpl.java:0, which redirected us to the
Stages page.”

Tips or important notes

Appear like this.

Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How
it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready
This section tells you what to expect in the recipe and describes how to set up any software or any
preliminary settings required for the recipe.

How to do it…
This section contains the steps required to follow the recipe.
Preface xix

How it works…
This section usually consists of a detailed explanation of what happened in the previous section.

There’s more…
This section consists of additional information about the recipe in order to make you more knowledgeable
about the recipe.

See also
This section provides helpful links to other useful information for the recipe.

Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the
subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata
Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.
xx Preface

Share Your Thoughts

Once you’ve read Data Ingestion with Python Cookbook, we’d love to hear your thoughts! Please
click here to go straight to the Amazon review page for this book and share
your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering
excellent quality content.
Preface xxi

Download a free PDF copy of this book

Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical
books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content
in your inbox daily
Follow these simple steps to get the benefits:

1. Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837632602

2. Submit your proof of purchase

3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1:
Fundamentals
of Data Ingestion

In this part, you will be introduced to the fundamentals of data ingestion and data engineering,
passing through the basic definition of an ingestion pipeline, the common types of data sources, and
the technologies involved.
This part has the following chapters:

• Chapter 1, Introduction to Data Ingestion

• Chapter 2, Principals of Data Access – Accessing Your Data
• Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It
• Chapter 4, Reading CSV and JSON Files and Solving Problems
• Chapter 5, Ingesting Data from Structured and Unstructured Databases
• Chapter 6, Using PySpark with Deﬁned and Non-Deﬁned Schemas
• Chapter 7, Ingesting Analytical Data
1
Introduction to Data Ingestion
Welcome to the fantastic world of data! Are you ready to embark on a thrilling journey into data
ingestion? If so, this is the perfect book to start! Ingesting data is the first step into the big data world.
Data ingestion is a process that involves gathering and importing data and also storing it properly
so that the subsequent extract, transform, and load (ETL) pipeline can utilize the data. To make it
happen, we must be cautious about the tools we will use and how to configure them properly.
In our book journey, we will use Python and PySpark to retrieve data from different data sources
and learn how to store them properly. To orchestrate all this, the basic concepts of Airflow will be
implemented, along with efficient monitoring to guarantee that our pipelines are covered.
This chapter will introduce some basic concepts about data ingestion and how to set up your
environment to start the tasks.
In this chapter, you will build and learn the following recipes:

• Setting up Python and the environment

• Installing PySpark
• Configuring Docker for MongoDB
• Configuring Docker for Airflow
• Logging libraries
• Creating schemas
• Applying data governance in ingestion
• Implementing data replication
4 Introduction to Data Ingestion

Technical requirements
The commands inside the recipes of this chapter use Linux syntax. If you don’t use a Linux-based
system, you may need to adapt the commands:

• Docker or Docker Desktop

• The SQL client of your choice (recommended); we recommend DBeaver, since it has a
community-free version

You can find the code from this chapter in this GitHub repository: https://github.com/
PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Note
Windows users might get an error message such as Docker Desktop requires a newer WSL
kernel version. This can be fixed by following the steps here: https://docs.docker.
com/desktop/windows/wsl/.

Setting up Python and its environment

In the data world, languages such as Java, Scala, or Python are commonly used. The first two languages
are used due to their compatibility with the big data tools environment, such as Hadoop and Spark,
the central core of which runs on a Java Virtual Machine (JVM). However, in the past few years, the
use of Python for data engineering and data science has increased significantly due to the language’s
versatility, ease of understanding, and many open source libraries built by the community.

Getting ready
Let’s create a folder for our project:

1. First, open your system command line. Since I use the Windows Subsystem for Linux (WSL),
I will open the WSL application.
2. Go to your home directory and create a folder as follows:
$ mkdir my-project

3. Go inside this folder:

$ cd my-project

4. Check your Python version on your operating system as follows:

$ python -–version
Setting up Python and its environment 5

Depending on your operational system, you might or might not have output here – for example,
WSL 20.04 users might have the following output:
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3

If your Python path is configured to use the python command, you will see output similar
to this:
Python 3.9.0

Sometimes, your Python path might be configured to be invoked using python3. You can
try it using the following command:
$ python3 --version

The output will be similar to the python command, as follows:

Python 3.9.0

5. Now, let’s check our pip version. This check is essential, since some operating systems have
more than one Python version installed:
$ pip --version

You should see similar output:

pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.9)

If your operating system (OS) uses a Python version below 3.8x or doesn’t have the language
installed, proceed to the How to do it steps; otherwise, you are ready to start the following Installing
PySpark recipe.

How to do it…
We are going to use the official installer from Python.org. You can find the link for it here: https://
www.python.org/downloads/:

Note
For Windows users, it is important to check your OS version, since Python 3.10 may not be
yet compatible with Windows 7, or your processor type (32-bits or 64-bits).

1. Download one of the stable versions.

At the time of writing, the stable recommended versions compatible with the tools and resources
presented here are 3.8, 3.9, and 3.10. I will use the 3.9 version and download it using the
following link: https://www.python.org/downloads/release/python-390/.
Scrolling down the page, you will find a list of links to Python installers according to OS, as shown
in the following screenshot.
6 Introduction to Data Ingestion

Figure 1.1 – Python.org download files for version 3.9

2. After downloading the installation file, double-click it and follow the instructions in the wizard
window. To avoid complexity, choose the recommended settings displayed.
The following screenshot shows how it looks on Windows:

Figure 1.2 – The Python Installer for Windows

Random documents with unrelated
content Scribd suggests to you:
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied

warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,

the trademark owner, any agent or employee of the Foundation,
anyone providing copies of Project Gutenberg™ electronic works in
accordance with this agreement, and any volunteers associated with
the production, promotion and distribution of Project Gutenberg™
electronic works, harmless from all liability, costs and expenses,
including legal fees, that arise directly or indirectly from any of the
following which you do or cause to occur: (a) distribution of this or
any Project Gutenberg™ work, (b) alteration, modification, or
additions or deletions to any Project Gutenberg™ work, and (c) any
Defect you cause.

Section 2. Information about the Mission

of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the

assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project

Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,

Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to

the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many
small donations ($1 to $5,000) are particularly important to
maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws regulating

charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where

we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make

any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About

Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.
Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,

including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.
Welcome to our website – the ideal destination for book lovers and
knowledge seekers. With a mission to inspire endlessly, we offer a
vast collection of books, ranging from classic literary works to
specialized publications, self-development books, and children's
literature. Each book is a new journey of discovery, expanding
knowledge and enriching the soul of the reade

Our website is not just a platform for buying books, but a bridge
connecting readers to the timeless values of culture and wisdom. With
an elegant, user-friendly interface and an intelligent search system,
we are committed to providing a quick and convenient shopping
experience. Additionally, our special promotions and home delivery
services ensure that you save time and fully enjoy the joy of reading.

Let us accompany you on the journey of exploring knowledge and

personal growth!

textbookfull.com

Instant Access to Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye ebook Full Chapters
No ratings yet
Instant Access to Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye ebook Full Chapters
41 pages
Proposal Ict200 (M5 Ice)
No ratings yet
Proposal Ict200 (M5 Ice)
14 pages
12th Computer Sc. Solved MCQs
67% (3)
12th Computer Sc. Solved MCQs
15 pages
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz instant download
No ratings yet
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz instant download
52 pages
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz download
100% (2)
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz download
80 pages
Ebooks File Data Ingestion With Python Cookbook: A Practical Guide To Ingesting, Monitoring, and Identifying Errors in The Data Ingestion Process 1st Edition Esppenchutz All Chapters
100% (4)
Ebooks File Data Ingestion With Python Cookbook: A Practical Guide To Ingesting, Monitoring, and Identifying Errors in The Data Ingestion Process 1st Edition Esppenchutz All Chapters
57 pages
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz - Instantly access the full ebook content in just a few seconds
100% (1)
Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz - Instantly access the full ebook content in just a few seconds
50 pages
Full download Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz pdf docx
No ratings yet
Full download Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz pdf docx
41 pages
PDF Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz download
100% (3)
PDF Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz download
41 pages
Download full (Ebook) Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process by Esppenchutz, Glaucia ISBN 9781837632602, 183763260X, 9781837699602, 183993260X ebook all chapters
100% (13)
Download full (Ebook) Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process by Esppenchutz, Glaucia ISBN 9781837632602, 183763260X, 9781837699602, 183993260X ebook all chapters
65 pages
Immediate download Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz ebooks 2024
100% (2)
Immediate download Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process 1st Edition Esppenchutz ebooks 2024
66 pages
Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye download
No ratings yet
Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye download
59 pages
PostgreSQL 10 Administration Cookbook Over 165 effective recipes for database management and maintenance in PostgreSQL 10 4th Edition Simon Riggs instant download
100% (2)
PostgreSQL 10 Administration Cookbook Over 165 effective recipes for database management and maintenance in PostgreSQL 10 4th Edition Simon Riggs instant download
61 pages
Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye - Download the ebook and explore the most detailed content
100% (1)
Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye - Download the ebook and explore the most detailed content
58 pages
Get PostgreSQL High Availability Cookbook 2nd edition Edition Thomas PDF ebook with Full Chapters Now
100% (1)
Get PostgreSQL High Availability Cookbook 2nd edition Edition Thomas PDF ebook with Full Chapters Now
67 pages
Instant ebooks textbook PostgreSQL High Availability Cookbook 2nd edition Edition Thomas download all chapters
100% (9)
Instant ebooks textbook PostgreSQL High Availability Cookbook 2nd edition Edition Thomas download all chapters
67 pages
PostgreSQL High Availability Cookbook 2nd edition Edition Thomas instant download
100% (1)
PostgreSQL High Availability Cookbook 2nd edition Edition Thomas instant download
55 pages
Download full (Ebook) Julia Cookbook by Jalem Raj Rohit ISBN 9781785882012, 1785882015 ebook all chapters
100% (2)
Download full (Ebook) Julia Cookbook by Jalem Raj Rohit ISBN 9781785882012, 1785882015 ebook all chapters
76 pages
Azure Data Factory Cookbook - Second Edition Dmitry Foshin download
100% (2)
Azure Data Factory Cookbook - Second Edition Dmitry Foshin download
58 pages
Full download Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye pdf docx
100% (3)
Full download Exploratory Data Analysis with Python Cookbook: Over 50 recipes to analyze, visualize, and extract insights from structured and unstructured data Oluleye pdf docx
41 pages
(Ebook) Practical Data Science with Jupyter: Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning using Python and Jupyter (English Edition) by Prateek Gupta ISBN 9789389898064, 9389898064 - The ebook in PDF format is available for download
100% (6)
(Ebook) Practical Data Science with Jupyter: Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning using Python and Jupyter (English Edition) by Prateek Gupta ISBN 9789389898064, 9389898064 - The ebook in PDF format is available for download
77 pages
(Ebook) Julia Cookbook by Jalem Raj Rohit ISBN 9781785882012, 1785882015 - The ebook in PDF format with all chapters is ready for download
100% (1)
(Ebook) Julia Cookbook by Jalem Raj Rohit ISBN 9781785882012, 1785882015 - The ebook in PDF format with all chapters is ready for download
47 pages
FastAPI Cookbook: Develop high-performance APIs and web applications with Python
From Everand
FastAPI Cookbook: Develop high-performance APIs and web applications with Python
Giunio De Luca
No ratings yet
Download Full Julia Cookbook 1st Edition Jalem Raj Rohit PDF All Chapters
100% (5)
Download Full Julia Cookbook 1st Edition Jalem Raj Rohit PDF All Chapters
61 pages
Data Science with Jupyter: Master Data Science skills with easy-to-follow Python examples
From Everand
Data Science with Jupyter: Master Data Science skills with easy-to-follow Python examples
Prateek Gupta
No ratings yet
Pentaho Analytics for MongoDB Cookbook: Over 50 recipes to learn how to use Pentaho Analytics and MongoDB to create powerful analysis and reporting solutions
From Everand
Pentaho Analytics for MongoDB Cookbook: Over 50 recipes to learn how to use Pentaho Analytics and MongoDB to create powerful analysis and reporting solutions
Joel Andre Latino
No ratings yet
Go Standard Library Cookbook 1st Edition Radomir Sohlich - The ebook in PDF format is available for download
100% (1)
Go Standard Library Cookbook 1st Edition Radomir Sohlich - The ebook in PDF format is available for download
60 pages
Buy ebook Julia Cookbook 1st Edition Jalem Raj Rohit cheap price
No ratings yet
Buy ebook Julia Cookbook 1st Edition Jalem Raj Rohit cheap price
51 pages
Data Engineering with Google Cloud Platform: A guide to leveling up as a data engineer by building a scalable data platform with Google Cloud
From Everand
Data Engineering with Google Cloud Platform: A guide to leveling up as a data engineer by building a scalable data platform with Google Cloud
Adi Wijaya
No ratings yet
[FREE PDF sample] Jira 7 Administration Cookbook 2nd Edition Patrick Li ebooks
100% (2)
[FREE PDF sample] Jira 7 Administration Cookbook 2nd Edition Patrick Li ebooks
81 pages
23680168
No ratings yet
23680168
65 pages
Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- pdf download
100% (2)
Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- pdf download
65 pages
Instant Ebooks Textbook Go Standard Library Cookbook 1st Edition Radomir Sohlich Download All Chapters
100% (7)
Instant Ebooks Textbook Go Standard Library Cookbook 1st Edition Radomir Sohlich Download All Chapters
52 pages
(Ebook) PostgreSQL High Availability Cookbook by Shaun M. Thomas ISBN 9781787125537, 178712553X - Instantly access the full ebook content in just a few seconds
100% (1)
(Ebook) PostgreSQL High Availability Cookbook by Shaun M. Thomas ISBN 9781787125537, 178712553X - Instantly access the full ebook content in just a few seconds
49 pages
(Ebook) Microsoft Power Apps Cookbook: Apply low-code recipes by -- All Chapters Instant Download
100% (9)
(Ebook) Microsoft Power Apps Cookbook: Apply low-code recipes by -- All Chapters Instant Download
58 pages
Spring Boot 3.0 Cookbook: Proven recipes for building modern and robust Java web applications with Spring Boot
From Everand
Spring Boot 3.0 Cookbook: Proven recipes for building modern and robust Java web applications with Spring Boot
Felip Miguel Puig
No ratings yet
Nursery Plant Management System
No ratings yet
Nursery Plant Management System
30 pages
Building a Web Application with PHP and MariaDB: A Reference Guide
From Everand
Building a Web Application with PHP and MariaDB: A Reference Guide
Sai Srinivas Sriparasa
No ratings yet
Learning Hunk: A quick, practical guide to rapidly visualizing and analyzing your Hadoop data using Hunk
From Everand
Learning Hunk: A quick, practical guide to rapidly visualizing and analyzing your Hadoop data using Hunk
Dmitry Anoshin
No ratings yet
(PDF) What You Need To Know About R - Free Ebook
100% (1)
(PDF) What You Need To Know About R - Free Ebook
61 pages
Blazor Web Development Cookbook: Tested recipes for advanced single-page application scenarios in .NET 9
From Everand
Blazor Web Development Cookbook: Tested recipes for advanced single-page application scenarios in .NET 9
Pawel Bazyluk
No ratings yet
Complete Download PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas PDF All Chapters
100% (12)
Complete Download PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas PDF All Chapters
50 pages
PostgreSQL Cookbook 1st Edition Chitij Chauhan download
No ratings yet
PostgreSQL Cookbook 1st Edition Chitij Chauhan download
52 pages
Complete Download (Ebook) Mockito for Spring by Acharya, Sujoy ISBN 9781783983780, 1783983787 PDF All Chapters
100% (9)
Complete Download (Ebook) Mockito for Spring by Acharya, Sujoy ISBN 9781783983780, 1783983787 PDF All Chapters
57 pages
Julia Cookbook 1st Edition Jalem Raj Rohit - Get the ebook instantly with just one click
100% (2)
Julia Cookbook 1st Edition Jalem Raj Rohit - Get the ebook instantly with just one click
51 pages
Learning pandas Python Data Discovery and Analysis Made Easy Heydt Michael pdf download
100% (1)
Learning pandas Python Data Discovery and Analysis Made Easy Heydt Michael pdf download
37 pages
Immediate Download PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas Ebooks 2024
100% (8)
Immediate Download PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas Ebooks 2024
70 pages
Python Data Cleaning Cookbook: Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI
From Everand
Python Data Cleaning Cookbook: Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI
Michael Walker
5/5 (1)
Where can buy (Ebook) Azure Data Factory Cookbook - Second Edition by Dmitry Foshin, Tonya Chernyshova, Dmitry Anoshin, Xenia Hertzenberg ebook with cheap price
100% (7)
Where can buy (Ebook) Azure Data Factory Cookbook - Second Edition by Dmitry Foshin, Tonya Chernyshova, Dmitry Anoshin, Xenia Hertzenberg ebook with cheap price
55 pages
Practical Data Analysis Using Jupyter Notebook Learn How To Speak The Language of Data by Extracting Useful and Actionable Insights Using Python by Marc Wintjen
No ratings yet
Practical Data Analysis Using Jupyter Notebook Learn How To Speak The Language of Data by Extracting Useful and Actionable Insights Using Python by Marc Wintjen
309 pages
PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas - Instantly access the full ebook content in just a few seconds
100% (1)
PostgreSQL High Availability Cookbook 2nd Edition Shaun M. Thomas - Instantly access the full ebook content in just a few seconds
49 pages
PostgreSQL High Availability Cookbook 2nd edition Edition Thomas - The ebook is ready for download to explore the complete content
100% (1)
PostgreSQL High Availability Cookbook 2nd edition Edition Thomas - The ebook is ready for download to explore the complete content
30 pages
Buy ebook Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- cheap price
100% (2)
Buy ebook Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- cheap price
44 pages
Data Engineering with Databricks Cookbook: Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake
From Everand
Data Engineering with Databricks Cookbook: Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake
Pulkit Chadha
No ratings yet
Spark Cookbook
From Everand
Spark Cookbook
Rishi Yadav
No ratings yet
Julia Cookbook 1st Edition Jalem Raj Rohit Download PDF
100% (13)
Julia Cookbook 1st Edition Jalem Raj Rohit Download PDF
60 pages
Download full Drupal 8 Development Cookbook Harness the power of Drupal 8 with this recipe based practical guide 2nd Edition Matt Glaman ebook all chapters
100% (3)
Download full Drupal 8 Development Cookbook Harness the power of Drupal 8 with this recipe based practical guide 2nd Edition Matt Glaman ebook all chapters
86 pages
HBase High Performance Cookbook
From Everand
HBase High Performance Cookbook
Ruchir Choudhry
No ratings yet
Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- - Download the ebook now to never miss important content
100% (3)
Microsoft Power Apps Cookbook Apply low code recipes to solve everyday business challenges and become a Power Apps pro 2nd Edition -- - Download the ebook now to never miss important content
58 pages
58757
No ratings yet
58757
51 pages
Machine Learning for Decision Makers: Cognitive Computing Fundamentals for Better Decision Making 2nd Edition Patanjali Kashyap - Download the entire ebook instantly and explore every detail
100% (1)
Machine Learning for Decision Makers: Cognitive Computing Fundamentals for Better Decision Making 2nd Edition Patanjali Kashyap - Download the entire ebook instantly and explore every detail
50 pages
110927352
No ratings yet
110927352
71 pages
Black Participatory Research: Power, Identity, and the Struggle for Justice in Education 1st Edition Elizabeth R. Drame pdf download
100% (2)
Black Participatory Research: Power, Identity, and the Struggle for Justice in Education 1st Edition Elizabeth R. Drame pdf download
62 pages
Computer Science in Sport Modeling Simulation Data Analysis and Visualization of Sports Related Data 2024th Edition Daniel Memmert pdf download
100% (2)
Computer Science in Sport Modeling Simulation Data Analysis and Visualization of Sports Related Data 2024th Edition Daniel Memmert pdf download
65 pages
A New History of Iowa Bremer instant download
100% (2)
A New History of Iowa Bremer instant download
57 pages
Xam Idea Class 12 Chemistry Sample Papers 2023 2024th Edition Xam Idea download
100% (4)
Xam Idea Class 12 Chemistry Sample Papers 2023 2024th Edition Xam Idea download
60 pages
XamIdea Chemistry Class 12 2024 2024th Edition Xam Idea pdf download
100% (4)
XamIdea Chemistry Class 12 2024 2024th Edition Xam Idea pdf download
60 pages
Scarlet Fever and Scarlet Coats A Pride and Prejudice Variation 2nd Edition Anne Morris download
100% (4)
Scarlet Fever and Scarlet Coats A Pride and Prejudice Variation 2nd Edition Anne Morris download
50 pages
XAM IDEA Mathematics Sample Paper Class 10 2023 2024th Edition Xam Idea instant download
100% (4)
XAM IDEA Mathematics Sample Paper Class 10 2023 2024th Edition Xam Idea instant download
58 pages
Birth Mothers and Transnational Adoption Practice in South Korea: Virtual Mothering 1st Edition Hosu Kim (Auth.) instant download
100% (4)
Birth Mothers and Transnational Adoption Practice in South Korea: Virtual Mothering 1st Edition Hosu Kim (Auth.) instant download
56 pages
Urbanization and Social Welfare in China 1 Edition Liu Gordon G instant download
100% (4)
Urbanization and Social Welfare in China 1 Edition Liu Gordon G instant download
59 pages
Xam Idea Class 12 Biology Sample Papers 2023 2023rd Edition Xam Idea instant download
100% (4)
Xam Idea Class 12 Biology Sample Papers 2023 2023rd Edition Xam Idea instant download
63 pages
The Routledge International Handbook of Neuroaesthetics 1st Edition Martin Skov download
100% (4)
The Routledge International Handbook of Neuroaesthetics 1st Edition Martin Skov download
62 pages
Inside Creativity Coaching 40 Inspiring Case Studies from Around the World 1 Edition Eric Maisel pdf download
100% (4)
Inside Creativity Coaching 40 Inspiring Case Studies from Around the World 1 Edition Eric Maisel pdf download
59 pages
Dwbi Notes-4
No ratings yet
Dwbi Notes-4
34 pages
SQL for Data Science
No ratings yet
SQL for Data Science
8 pages
Relational Database
No ratings yet
Relational Database
45 pages
c Dbadm 2404-Demo
No ratings yet
c Dbadm 2404-Demo
5 pages
Maulidya Yunanda Practice Sec 5 6
No ratings yet
Maulidya Yunanda Practice Sec 5 6
19 pages
1z0-071 UPDATE 15-09
No ratings yet
1z0-071 UPDATE 15-09
337 pages
1 Pengenalan Penambangan Data-IMD
No ratings yet
1 Pengenalan Penambangan Data-IMD
34 pages
DA 100 Mod2 ENU PowerPoint
No ratings yet
DA 100 Mod2 ENU PowerPoint
25 pages
Lecture 2-Data Science
No ratings yet
Lecture 2-Data Science
25 pages
Introduction To SQL
No ratings yet
Introduction To SQL
42 pages
3+years Oracle DBA Interview Questions
No ratings yet
3+years Oracle DBA Interview Questions
3 pages
Rockstar Junkies
No ratings yet
Rockstar Junkies
8 pages
PROYECTO WAKA 1 (13)
No ratings yet
PROYECTO WAKA 1 (13)
12 pages
IP Practical File 2024-25
No ratings yet
IP Practical File 2024-25
20 pages
Excel Training Syllabus
0% (1)
Excel Training Syllabus
2 pages
DBMS Unit 3
No ratings yet
DBMS Unit 3
13 pages
Six-Step Relational Database Design A Step by Step Approach To Relational Database Design and Development by Captain, Fidel A
No ratings yet
Six-Step Relational Database Design A Step by Step Approach To Relational Database Design and Development by Captain, Fidel A
254 pages
ICDL Professional Modules – Computational – Using Databases
No ratings yet
ICDL Professional Modules – Computational – Using Databases
10 pages
MTA: Database Fundamentals: Lab Exercises & Notes
No ratings yet
MTA: Database Fundamentals: Lab Exercises & Notes
24 pages
Data Warehouse - DWDM
No ratings yet
Data Warehouse - DWDM
54 pages
Data Analytics Unit 2
No ratings yet
Data Analytics Unit 2
18 pages
AWS Database Migration Service
No ratings yet
AWS Database Migration Service
17 pages
Oracle 19c Database Upgrade From 11.2.0.4 To 19.2.0.0 Using Manual Method Oracledbwr
No ratings yet
Oracle 19c Database Upgrade From 11.2.0.4 To 19.2.0.0 Using Manual Method Oracledbwr
27 pages
James A. Senn's Information Technology, 3 Edition: Enterprise Databases and Data Warehouses
No ratings yet
James A. Senn's Information Technology, 3 Edition: Enterprise Databases and Data Warehouses
38 pages
SQL L
No ratings yet
SQL L
62 pages
It103 DBMS I
No ratings yet
It103 DBMS I
7 pages
Document15 (1)
No ratings yet
Document15 (1)
15 pages
Hospital Management System: Project Title
No ratings yet
Hospital Management System: Project Title
16 pages

Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process Esppenchutz download

Uploaded by

Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process Esppenchutz download

Uploaded by

Data Ingestion with Python Cookbook: A practical

guide to ingesting, monitoring, and identifying

Download more ebook from https://textbookfull.com

Biota Grow 2C gather 2C cook Loucas

Practical Python Data Visualization: A Fast Track

Hands-On Entity Resolution: A Practical Guide to Data

Practical Data Science Cookbook Data pre processing

Mastering Machine Learning with Python in Six Steps: A

A Python Data Analyst’s Toolkit: Learn Python and

Introduction to Machine Learning with Python A Guide

Azure Data Factory Cookbook: Data engineers guide to

A practical guide to ingesting, monitoring, and identifying

Group Product Manager: Reshma Raman

First published: May 2023

Production reference: 1300523

Published by Packt Publishing Ltd.

About the author

Part 1: Fundamentals of Data Ingestion

Getting ready 8 Applying data governance in ingestion 25

How to do it… 79 Getting ready 86

Getting ready 120 Ingesting data from a JDBC database

How to do it… 130 Getting ready 146

Getting ready 160 Ingesting unstructured data with a

Getting ready 165 Inserting formatted SparkSession

Getting ready… 169 Further reading 179

Getting ready 195 Further reading 210

Part 2: Structuring the Ingestion Pipeline

See also 217 How it works… 229

Creating parallel ingest tasks 270 Getting ready 274

How to do it… 327 There's more… 340

Setting up Grafana for monitoring 358 Further reading 378

Other Books You May Enjoy 388

Who this book is for

What this book covers

To get the most out of this book

Software/Hardware covered in the book OS Requirements

Python 3.8 or higher Windows, Mac OS X, and Linux (any)

Download the example code files

Download the color images

def gets_csv_first_line (csv_file):

Any command-line input or output is written as follows:

Tips or important notes

Share Your Thoughts

Download a free PDF copy of this book

1. Scan the QR code or visit the link below

2. Submit your proof of purchase

• Chapter 1, Introduction to Data Ingestion

• Setting up Python and the environment

• Docker or Docker Desktop

Setting up Python and its environment

3. Go inside this folder:

4. Check your Python version on your operating system as follows:

The output will be similar to the python command, as follows:

You should see similar output:

1. Download one of the stable versions.

Figure 1.1 – Python.org download files for version 3.9

Figure 1.2 – The Python Installer for Windows

1.F.5. Some states do not allow disclaimers of certain implied

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,

Section 2. Information about the Mission

Volunteers and financial support to provide volunteers with the

Section 3. Information about the Project

The Foundation’s business office is located at 809 North 1500 West,

Section 4. Information about Donations to

The Foundation is committed to complying with the laws regulating

While we cannot and do not solicit contributions from states where

International donations are gratefully accepted, but we cannot make

Section 5. General Information About

This website includes information about Project Gutenberg™,

Let us accompany you on the journey of exploring knowledge and

You might also like

Getting ready 8 Applying data governance in ingestion 25

How to do it… 79 Getting ready 86

Getting ready 120 Ingesting data from a JDBC database

How to do it… 130 Getting ready 146

Getting ready 160 Ingesting unstructured data with a

Getting ready 165 Inserting formatted SparkSession

Getting ready… 169 Further reading 179

Getting ready 195 Further reading 210

See also 217 How it works… 229

Creating parallel ingest tasks 270 Getting ready 274

How to do it… 327 There's more… 340

Setting up Grafana for monitoring 358 Further reading 378

Other Books You May Enjoy 388