Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python
By Maria Zervou
About this ebook
Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another significant challenge is that many data professionals lack the skills to analyze unstructured data, so valuable insights that are difficult or impossible to obtain from structured data alone are missed.
To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes the ingestion of data from various sources, the validation and profiling of data for high-quality end tables, and writing data to different sinks. You’ll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You’ll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.
By the end of this book, you’ll be proficient in data cleaning and preparation techniques for both structured and unstructured data.
Python Data Cleaning and Preparation Best Practices
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Managers: Deepesh Patel and Chayan Majumdar
Book Project Manager: Hemangi Lotlikar
Senior Content Development Editor: Manikandan Kurup
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Joshua Misquitta
Senior DevRel Marketing Executive: Nivedita Singh
First published: September 2024
Production reference: 1190924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83763-474-3
www.packtpub.com
I want to extend my deepest thanks to those who have been by my side throughout the journey of writing this book while managing work in parallel. I am immensely grateful to everyone who has cheered me on, offered feedback, and inspired me to keep going. A special thanks to my family, for their unwavering support and for teaching me the power of determination. To my mentors, friends, and partner, who have guided me over the years and helped me see the bigger picture, and from whom I have learned so much! This accomplishment is as much yours as it is mine. Thank you for being part of this journey!
– Maria Zervou
Contributors
About the author
Maria Zervou is a Generative AI and machine learning expert, dedicated to making advanced technologies accessible. With over a decade of experience, she has led impactful AI projects across industries and mentored teams on cutting-edge advancements. As a machine learning specialist at Databricks, Maria drives innovative AI solutions and industry adoption. Beyond her role, she democratizes knowledge through her YouTube channel, featuring experts on AI topics. A recognized thought leader and finalist in the Women in Tech Excellence Awards, Maria advocates for responsible AI use and contributes to open source projects, fostering collaboration and empowering future AI leaders.
About the reviewers
Mohammed Kamil Khan is currently a scientific programmer at UTHealth Houston’s McWilliams School of Biomedical Informatics, where he works on data preprocessing, GWAS, and post-GWAS analysis of imaging data. He has a master’s degree from the University of Houston – Downtown (UHD), having majored in data analytics. With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all. His commitment to sharing his expertise has led him to publish articles on platforms such as DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com. These articles explore a diverse range of topics, including pandas DataFrames, API data extraction, SQL queries, and much more.
Ashish Shukla is a seasoned professional with 12 years of experience, specializing in Azure technologies, particularly Azure Databricks, for the past 9 years. Formerly associated with Microsoft, Ashish has been instrumental in leading numerous successful projects leveraging Azure Databricks. Currently serving as an associate manager of data operations at PepsiCo India, he brings extensive expertise in cloud-based data solutions, ensuring robust and innovative data operations strategies.
Beyond his professional roles, Ashish is an active contributor to the Azure community through his technical blogs and engagements as a speaker on Azure technologies, where he shares valuable insights and best practices in data management and cloud computing.
Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, including C++, Java, Python, Angular, Golang, and data warehouses.
When not working, Krishnan likes to spend time with his wife and daughter, read fiction, nonfiction, and technical books, and participate in hackathons. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group.
You can connect with Krishnan at [email protected] or via LinkedIn.
I’d like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.
Table of Contents
Preface
Part 1: Upstream Data Ingestion and Cleaning
1
Data Ingestion Techniques
Technical requirements
Ingesting data in batch mode
Advantages and disadvantages
Common use cases for batch ingestion
Batch ingestion use cases
Batch ingestion with an example
Ingesting data in streaming mode
Advantages and disadvantages
Common use cases for streaming ingestion
Streaming ingestion in an e-commerce platform
Streaming ingestion with an example
Real-time versus semi-real-time ingestion
Common use cases for near-real-time ingestion
Semi-real-time mode with an example
Data source solutions
Event data processing solution
Ingesting event data with Apache Kafka
Ingesting data from databases
Performing data ingestion from cloud-based file systems
APIs
Summary
2
Importance of Data Quality
Technical requirements
Why data quality is important
Dimensions of data quality
Completeness
Accuracy
Timeliness
Consistency
Uniqueness
Duplication
Data usage
Data compliance
Implementing quality controls throughout the data life cycle
Data silos and the impact on data quality
Summary
3
Data Profiling – Understanding Data Structure, Quality, and Distribution
Technical requirements
Understanding data profiling
Identifying goals of data profiling
Exploratory data analysis options – profiler versus manual
Profiling data with pandas’ ydata_profiling
Overview
Interactions
Correlations
Missing values
Duplicate rows
Sample dataset
Profiling high volumes of data with the pandas data profiler
Data validation with the Great Expectations library
Configuring Great Expectations for your project
Create your first Great Expectations data source
Creating your first Great Expectations suite
Great Expectations Suite report
Manually edit Great Expectations
Checkpoints
Using pandas profiler to build your Great Expectations Suite
Comparing Great Expectations and pandas profiler – when to use what
Great Expectations and big data
Summary
4
Cleaning Messy Data and Data Manipulation
Technical requirements
Renaming columns
Renaming a single column
Renaming all columns
Removing irrelevant or redundant columns
Dealing with inconsistent and incorrect data types
Inspecting columns
Columnar type transformations
Converting to numeric types
Converting to string types
Converting to categorical types
Converting to Boolean types
Working with dates and times
Importing and parsing date and time data
Extracting components from dates and times
Calculating time differences and durations
Handling time zones and daylight saving time
Summary
5
Data Transformation – Merging and Concatenating
Technical requirements
Joining datasets
Choosing the correct merge strategy
Handling duplicates when merging datasets
Why handle duplication in rows and columns?
Dropping duplicate rows
Validating data before merging
Aggregation
Concatenation
Handling duplication in columns
Performance tricks for merging
Set indexes
Sorting indexes
Merge versus join
Concatenating DataFrames
Row-wise concatenation
Column-wise concatenation
Summary
References
6
Data Grouping, Aggregation, Filtering, and Applying Functions
Technical requirements
Grouping data using one or multiple keys
Grouping data using one key
Grouping data using multiple keys
Best practices for grouping
Applying aggregate functions on grouped data
Basic aggregate functions
Advanced aggregation with multiple columns
Applying custom aggregate functions
Best practices for aggregate functions
Using the apply function on grouped data
Data filtering
Multiple criteria for filtering
Best practices for filtering
Performance considerations as data grows
Summary
7
Data Sinks
Technical requirements
Choosing the right data sink for your use case
Relational databases
NoSQL databases
Data warehouses
Data lakes
Streaming data sinks
Which sink is the best for my use case?
Decoding file types for optimal usage
Navigating partitioning
Horizontal versus vertical partitioning
Time-based partitioning
Geographic partitioning
Hybrid partitioning
Considerations for choosing partitioning strategies
Designing an online retail data platform
Summary
Part 2: Downstream Data Cleaning – Consuming Structured Data
8
Detecting and Handling Missing Values and Outliers
Technical requirements
Detecting missing data
Handling missing data
Deletion of missing data
Imputation of missing data
Mean imputation
Median imputation
Creating indicator variables
Comparison between imputation methods
Detecting and handling outliers
Impact of outliers
Identifying univariate outliers
Handling univariate outliers
Identifying multivariate outliers
Handling multivariate outliers
Summary
9
Normalization and Standardization
Technical requirements
Scaling features to a range
Min-max scaling
Z-score scaling
When to use Z-score scaling
Robust scaling
Comparison between methods
Summary
10
Handling Categorical Features
Technical requirements
Label encoding
Use case – employee performance analysis
Considerations for label encoding
One-hot encoding
When to use one-hot encoding
Use case – customer churn prediction
Considerations for one-hot encoding
Target encoding (mean encoding)
When to use target encoding
Use case – sales prediction for retail stores
Considerations for target encoding
Frequency encoding
When to use frequency encoding
Use case – customer product preference analysis
Considerations for frequency encoding
Binary encoding
When to use binary encoding
Use case – customer subscription prediction
Considerations for binary encoding
Summary
11
Consuming Time Series Data
Technical requirements
Understanding the components of time series data
Trend
Seasonality
Noise
Types of time series data
Univariate time series data
Multivariate time series data
Identifying missing values in time series data
Checking for NaNs or null values
Visual inspection
Handling missing values in time series data
Removing missing data
Forward and backward fill
Interpolation
Comparing the different methods for missing values
Analyzing time series data
Autocorrelation and partial autocorrelation
ACF and PACF in the stock market use case
Dealing with outliers
Identifying outliers with seasonal decomposition
Handling outliers – model-based approaches – ARIMA
Moving window techniques
Feature engineering for time series data
Lag features and their importance
Differencing time series
Applying time series techniques in different industries
Summary
Part 3: Downstream Data Cleaning – Consuming Unstructured Data
12
Text Preprocessing in the Era of LLMs
Technical requirements
Relearning text preprocessing in the era of LLMs
Text cleaning
Removing HTML tags and special characters
Handling capitalization and letter case
Dealing with numerical values and symbols
Addressing whitespace and formatting issues
Removing personally identifiable information
Handling rare words and spelling variations
Dealing with rare words
Addressing spelling variations and typos
Chunking
Tokenization
Word tokenization
Subword tokenization
Domain-specific data
Turning tokens into embeddings
BERT – Contextualized Embedding Models
BGE
GTE
Selecting the right embedding model
Solving real problems with embeddings
Summary
13
Image and Audio Preprocessing with LLMs
Technical requirements
The current era of image preprocessing
Loading the images
Resizing and cropping
Normalizing and standardizing the dataset
Data augmentation
Noise reduction
Extracting text from images
PaddleOCR
Using LLMs with OCR
Creating image captions
Handling audio data
Using Whisper for audio-to-text conversion
Extracting text from audio
Future research in audio preprocessing
Summary
This concludes the book! You did it!
Index
Other Books You May Enjoy
Preface
In today’s fast-paced data-driven world, it’s easy to be dazzled by the headlines about artificial intelligence (AI) breakthroughs and advanced machine learning (ML) models. But ask any seasoned data scientist or engineer, and they’ll tell you the same thing: the true foundation of any successful data project is not the flashy algorithms or sophisticated models—it’s the data itself, and more importantly, how that data is prepared.
Throughout my career, I have learned that data preprocessing is the unsung hero of data science. It’s the meticulous, often complex process that turns raw data into a reliable asset, ready for analysis, modeling, and ultimately, decision-making. I’ve seen firsthand how the right preprocessing techniques can transform an organization’s approach to data, turning potential challenges into powerful opportunities.
Yet, despite its importance, data preprocessing is often overlooked or undervalued. Many see it as a tedious step, a bottleneck that slows down the exciting work of building models and delivering insights. But I’ve always believed that this phase is where the most critical work happens. After all, even the most sophisticated algorithms can’t make up for poor-quality data. That’s why I’ve dedicated much of my professional journey to mastering this art—exploring the best tools, techniques, and strategies to make preprocessing more efficient, scalable, and aligned with the ever-evolving landscape of AI.
This book aims to demystify the data preprocessing process, offering both a solid grounding in traditional methods and a forward-looking perspective on emerging techniques. We’ll explore how Python can be leveraged to clean, transform, and organize data more effectively. We’ll also look at how the advent of large language models (LLMs) is redefining what’s possible in this space. These models are already proving to be game changers, automating tasks that were once manual and time-consuming, and providing new ways to enhance data quality and usability.
Throughout the pages, I’ll share insights from my experiences, the challenges faced, and the lessons learned along the way. My hope is to provide you with not just a technical roadmap but also a deeper understanding of the strategic importance of data preprocessing in today’s data ecosystem. I strongly believe in the philosophy of learning by doing, so this book includes a wealth of code examples for you to follow along with. I encourage you to try these examples, experiment with the code, and challenge yourself to apply the techniques to your own datasets.
By the end of this book, you’ll be equipped with the knowledge and skills to approach data preprocessing not just as a necessary step but also as a critical component of your overall data strategy.
So, whether you’re a data scientist, engineer, analyst, or simply someone looking to enhance their understanding of data processes, I invite you to join me on this journey. Together, we will explore how to harness the power of data preprocessing to unlock the full potential of your data.
Who this book is for
This book is for readers with a working knowledge of Python, a good grasp of statistical concepts, and some experience in manipulating data. This book will not start from scratch but will rather build on existing skills, introducing you to sophisticated preprocessing strategies, hands-on code examples, and practical exercises that require a degree of familiarity with the core principles of data science and analytics.
What this book covers
Chapter 1
, Data Ingestion Techniques, provides a comprehensive overview of the data ingestion process, emphasizing its role in collecting and importing data from various sources into storage systems for analysis. You will explore different ingestion methods such as batch and streaming modes, compare real-time and semi-real-time ingestion, and understand the technologies behind data sources. The chapter highlights the advantages, disadvantages, and practical applications of these methods.
Chapter 2
, Importance of Data Quality, emphasizes the critical role data quality plays in business decision-making. It highlights the risks of using inaccurate, inconsistent, or outdated data, which can lead to poor decisions, damaged reputations, and missed opportunities. You will explore why data quality is essential, how to measure it across different dimensions, and the impact of data silos on maintaining data quality.
Chapter 3
, Data Profiling – Understanding Data Structure, Quality, and Distribution, explores data profiling and focuses on scrutinizing and validating datasets to understand their structure, patterns, and quality. You will learn how to perform data profiling using tools such as the pandas Profiler and Great Expectations and understand when to use each tool. Additionally, the chapter covers tactics for handling large data volumes and compares profiling methods to improve data validation.
Chapter 4
, Cleaning Messy Data and Data Manipulation, focuses on the key strategies for cleaning and manipulating data, enabling efficient and accurate analysis. It covers techniques for renaming columns, removing irrelevant or redundant data, fixing inconsistent data types, and handling date and time formats. By mastering these methods, you will learn how to enhance the quality and reliability of your datasets.
Chapter 5
, Data Transformation – Merging and Concatenating, explores techniques for transforming and manipulating data through merging, joining, and concatenating datasets. It covers methods to combine multiple datasets from various sources, handle duplicates effectively, and improve merging performance. The chapter also provides practical tricks to streamline the merging process, ensuring efficient data integration for insightful analysis.
Chapter 6
, Data Grouping, Aggregation, Filtering, and Applying Functions, covers the essential techniques of data grouping and aggregation, which are vital for summarizing large datasets and generating meaningful insights. It discusses methods to handle missing or noisy data by aggregating values, reducing data volume, and enhancing processing efficiency. The chapter also focuses on grouping data by various keys, applying aggregate and custom functions, and filtering data to create valuable features for deeper analysis or ML.
Chapter 7
, Data Sinks, focuses on the critical decisions involved in data processing, particularly the selection of appropriate data sinks for storage and processing needs. It delves into four essential pillars: choosing the right data sink, selecting the correct file type, optimizing partitioning strategies, and understanding how to design a scalable online retail data platform. The chapter equips you with the tools to enhance efficiency, scalability, and performance in data processing pipelines.
Chapter 8
, Detecting and Handling Missing Values and Outliers, delves into techniques for identifying and managing missing values and outliers. It covers a range of methods, from statistical approaches to advanced ML models, to address these issues effectively. The key areas of focus include detecting and handling missing data, identifying univariate and multivariate outliers, and managing outliers in various datasets.
Chapter 9
, Normalization and Standardization, covers essential preprocessing techniques such as feature scaling, normalization, and standardization, which ensure that ML models can effectively learn from data. You will explore different techniques, including scaling features to a range, Z-score scaling, and using a robust scaler, to address various data challenges in ML tasks.
Chapter 10
, Handling Categorical Features, addresses the importance of managing categorical features, which represent non-numerical information in datasets. You will learn various encoding techniques, including label encoding, one-hot encoding, target encoding, frequency encoding, and binary encoding, to transform categorical data for ML models.
Chapter 11
, Consuming Time Series Data, delves into the fundamentals of time series analysis, covering key concepts, methodologies, and applications across various industries. It includes understanding the components and types of time series data, identifying and handling missing values, and techniques for analyzing trends and patterns over time. The chapter also addresses dealing with outliers and feature engineering to enhance predictive modeling with time series data.
Chapter 12
, Text Preprocessing in the Era of LLMs, focuses on mastering text preprocessing techniques that are essential for optimizing the performance of LLMs. It covers methods for cleaning text, handling rare words and spelling variations, chunking, and tokenization strategies. Additionally, it addresses the transformation of tokens into embeddings, highlighting the importance of adapting preprocessing approaches to maximize the potential of LLMs.
Chapter 13
, Image and Audio Preprocessing with LLMs, examines preprocessing techniques for unstructured data, particularly images and audio, to extract meaningful information. It includes methods for image preprocessing, such as optical character recognition (OCR) and image caption generation with the BLIP model. The chapter also explores audio data handling, including converting audio to text using the Whisper model, providing a comprehensive overview of working with multimedia data in the context of LLMs.
To get the most out of this book
To benefit fully from this book, you should have a good knowledge of Python and a grasp of data engineering and data science basics.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
The GitHub repository follows the chapters of the book, and all the scripts are numbered according to the sections within each chapter. Each script is independent of the others, so you can move ahead without having to run all the scripts beforehand. However, it is strongly recommended that you follow the flow of the book so that you don’t miss any necessary information.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices
. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/
. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The delete_entry() function is used to remove an entry, showing how data can be deleted from the store."
A block of code is set as follows:
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
user_satisfaction_scores = [
    random.randint(1, 5) for _ in range(num_users)]
Any command-line input or output is written as follows:
$ mkdir data
pip install pandas
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "It involves storing data on remote servers accessed from anywhere via the internet, rather than on local devices."
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected]
and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata
and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected]
with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com
.
Share your thoughts
Once you’ve read Python Data Cleaning and Preparation Best Practices, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page
for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below
https://packt.link/free-ebook/9781837634743
2. Submit your proof of purchase
3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Upstream Data Ingestion and Cleaning
This part focuses on the foundational stages of data processing, starting from data ingestion to ensuring its quality and structure for downstream tasks. It guides readers through the essential steps of importing, cleaning, and transforming data, which lay the groundwork for effective data analysis. The chapters explore various methods for ingesting data, maintaining high-quality datasets, profiling data for better insights, and cleaning messy data to make it ready for analysis. Further, it covers advanced techniques like merging, concatenating, grouping, and filtering data, along with choosing appropriate data destinations or sinks to optimize processing pipelines. Each chapter in this part equips readers with the knowledge to handle raw data and turn it into a clean, structured, and usable form.
This part has the following chapters:
Chapter 1
, Data Ingestion Techniques
Chapter 2
, Importance of Data Quality
Chapter 3
, Data Profiling – Understanding Data Structure, Quality, and Distribution
Chapter 4
, Cleaning Messy Data and Data Manipulation
Chapter 5
, Data Transformation – Merging and Concatenating
Chapter 6
, Data Grouping, Aggregation, Filtering, and Applying Functions
Chapter 7
, Data Sinks
1
Data Ingestion Techniques
Data ingestion is a critical component of the data life cycle and sets the foundation for subsequent data transformation and cleaning. It involves the process of collecting and importing data from various sources into a storage system where it can be accessed and analyzed. Effective data ingestion is crucial for ensuring data quality, integrity, and availability, which directly impacts the efficiency and accuracy of data transformation and cleaning processes. In this chapter, we will dive deep into the different types of data sources, explore various data ingestion methods, and discuss their respective advantages, disadvantages, and real-world applications.
In this chapter, we’ll cover the following topics:
Ingesting data in batch mode
Ingesting data in streaming mode
Real-time versus semi-real-time ingestion
Data source technologies
Technical requirements
You can find all the code for the chapter in the following GitHub repository:
https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/tree/main/chapter01
You can use your favorite IDE (VS Code, PyCharm, Google Colab, etc.) to write and execute your code.
Ingesting data in batch mode
Batch ingestion is a data processing technique whereby large volumes of data are collected, processed, and loaded into a system at scheduled intervals, rather than in real-time. This approach allows organizations to handle substantial amounts of data efficiently by grouping data into batches, which are then processed collectively. For example, a company might collect customer transaction data throughout the day and then process it in a single batch during off-peak hours. This method is particularly useful for organizations that need to process high volumes of data but do not require immediate analysis.
Batch ingestion is beneficial because it optimizes system resources by spreading the processing load across scheduled times, often when the system is underutilized. This reduces the strain on computational resources and can lower costs, especially in cloud-based environments where computing power is metered. Additionally, batch processing simplifies data management, as it allows for the easy application of consistent transformations and validations across large datasets. For organizations with regular, predictable data flows, batch ingestion provides a reliable, scalable, and cost-effective solution for data processing and analytics.
Let’s explore batch ingestion in more detail, starting with its advantages and disadvantages.
Advantages and disadvantages
Batch ingestion offers several notable advantages that make it an attractive choice for many data processing needs:
Efficiency is a key benefit, as batch processing allows for the handling of large volumes of data in a single operation, optimizing resource usage and minimizing overhead
Cost-effectiveness is another benefit, reducing the need for continuous processing resources and lowering operational expenses
Simplicity makes it easier to manage and implement periodic data processing tasks compared to real-time ingestion, which often requires more complex infrastructure and management
Robustness is also a strength, as batch processing is well-suited for performing complex data transformations and comprehensive data validation, ensuring high-quality, reliable data
However, batch ingestion also comes with certain drawbacks:
Latency is the most obvious drawback: there is an inherent delay between the generation of data and its availability for analysis, which can be a critical issue for applications requiring real-time insights
Resource spikes can occur during batch processing windows, leading to high resource usage and potential performance bottlenecks
Scalability can also be a concern, as handling very large datasets may require significant infrastructure investment and management
Lastly, maintenance is a crucial aspect of batch ingestion; it demands careful scheduling and ongoing attention to ensure the timely and reliable execution of batch jobs
Let’s look at some common use cases for ingesting data in batch mode.
Common use cases for batch ingestion
Any data analytics platform, such as a data warehouse or data lake, requires regularly updated data for Business Intelligence (BI) and reporting. Batch ingestion is integral here, as it ensures that data is continually refreshed with the latest information, enabling businesses to perform comprehensive and up-to-date analyses. By processing data in batches, organizations can efficiently handle vast amounts of transactional and operational data, transforming it into a structured format suitable for querying and reporting. This supports BI initiatives, allowing analysts and decision-makers to generate insightful reports, track Key Performance Indicators (KPIs), and make data-driven decisions.
Extract, Transform, and Load (ETL) processes are a cornerstone of data integration projects, and batch ingestion plays a crucial role in these workflows. In ETL processes, data is extracted from various sources, transformed to fit the operational needs of the target system, and loaded into a database or data warehouse. Batch processing allows for efficient handling of these steps, particularly when dealing with large datasets that require significant transformation and cleansing. This method is ideal for periodic data consolidation, where data from disparate systems is integrated to provide a unified view, supporting activities such as data migration, system integration, and master data management.
Batch ingestion is also widely used for backups and archiving, which are critical processes for data preservation and disaster recovery. Periodic batch processing allows for the scheduled backup of databases, ensuring that all data is captured and securely stored at regular intervals. This approach minimizes the risk of data loss and provides a reliable restore point in case of system failures or data corruption. Additionally, batch processing is used for data archiving, where historical data is periodically moved from active systems to long-term storage solutions. This not only helps in managing storage costs but also ensures that important data is retained and can be retrieved for compliance, auditing, or historical analysis purposes.
Batch ingestion use cases
Batch ingestion is a methodical process involving several key steps: data extraction, data transformation, data loading, scheduling, and automation. To illustrate these steps, let’s explore a use case involving an investment bank that needs to process and analyze trading data for regulatory compliance and performance reporting.
Batch ingestion in an investment bank
An investment bank needs to collect, transform, and load trading data from various financial markets into a central data warehouse. This data will be used for generating daily compliance reports, evaluating trading strategies, and making informed investment decisions.
Data extraction
The first step is identifying the sources from which data will be extracted. For the investment bank, this includes trading systems, market data providers, and internal risk management systems. These sources contain critical data such as trade execution details, market prices, and risk assessments. Once the sources are identified, data is collected using connectors or scripts. This involves setting up data pipelines that extract data from trading systems, import real-time market data feeds, and pull risk metrics from internal systems. The extracted data is then temporarily stored in staging areas before processing.
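As a rough sketch of this step, the following snippet copies newly arrived trade files from a drop folder into a date-partitioned staging area; the paths and the trades_*.csv naming pattern are hypothetical placeholders rather than part of the book's repository:
import shutil
from datetime import date
from pathlib import Path

def extract_to_staging(source_dir, staging_root):
    # Copy today's raw trade exports into a date-partitioned staging folder
    staging_dir = Path(staging_root) / date.today().isoformat()
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged_files = []
    for raw_file in Path(source_dir).glob("trades_*.csv"):
        target = staging_dir / raw_file.name
        shutil.copy2(raw_file, target)  # keep the raw file untouched at the source
        staged_files.append(target)
    return staged_files
In practice, the same idea extends to pulling data from APIs or database extracts, but keeping an untouched raw copy in staging is what gives the downstream transformation step a reliable starting point.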
Data transformation
The extracted data often contains inconsistencies, duplicates, and missing values. Data cleaning is performed to remove duplicates, fill in missing information, and correct errors. For the investment bank, this ensures that trade records are accurate and complete, providing a reliable foundation for compliance reporting and performance analysis. After cleaning, the data undergoes transformations such as aggregations, joins, and calculations. For example, the investment bank might aggregate trade data to calculate daily trading volumes, join trade records with market data to analyze price movements, and calculate key metrics such as Profit and Loss (P&L) and risk exposure. The transformed data must be mapped to the schema of the target system. This involves aligning the data fields with the structure of the data warehouse. For instance, trade data might be mapped to tables representing transactions, market data, and risk metrics, ensuring seamless integration with the existing data model.
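To make the cleaning and aggregation concrete, here is a small pandas sketch; the column names (trade_id, symbol, timestamp, price, quantity) are an assumed schema used purely for illustration:
import pandas as pd

def transform_trades(trades):
    # Remove duplicate and incomplete trade records (hypothetical schema)
    cleaned = (
        trades
        .drop_duplicates(subset="trade_id")
        .dropna(subset=["price", "quantity"])
        .assign(notional=lambda df: df["price"] * df["quantity"])
    )
    # Aggregate to one row per instrument per trading day
    cleaned["trade_date"] = pd.to_datetime(cleaned["timestamp"]).dt.date
    daily = cleaned.groupby(["trade_date", "symbol"], as_index=False).agg(
        daily_volume=("quantity", "sum"),
        daily_notional=("notional", "sum"),
        trade_count=("trade_id", "count"),
    )
    return daily
A real pipeline would add joins against market data and P&L calculations on top of this, but the shape of the code stays the same: clean first, then aggregate into the metrics the warehouse schema expects.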
Data loading
The transformed data is processed in batches, which allows the investment bank to handle large volumes of data efficiently, performing complex transformations and aggregations in a single run. Once processed, the data is loaded into the target storage system, such as a data warehouse or data lake. For the investment bank, this means loading the cleaned and transformed trading data into their data warehouse, where it can be accessed for compliance reporting and performance analysis.
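As a minimal illustration of the loading step, the snippet below appends the transformed batch to a table in a local SQLite database, standing in for the bank's data warehouse; the database file and table name are assumptions made for the sketch:
import sqlite3

def load_to_warehouse(daily_metrics, db_path="warehouse.db"):
    # Append this batch's aggregated metrics (a pandas DataFrame) to the stand-in warehouse table
    with sqlite3.connect(db_path) as conn:
        daily_metrics.to_sql(
            "daily_trading_metrics", conn,
            if_exists="append", index=False
        )
In a production setup, the target would be a proper warehouse or lakehouse table, and the write would typically be wrapped in checks that the same batch has not already been loaded.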
Scheduling and automation
To ensure that the batch ingestion process runs smoothly and consistently, scheduling tools such as Apache Airflow or Cron jobs are used. These tools automate the data ingestion workflows, scheduling them to run at regular intervals, such as every night or every day. This allows the investment bank to have up-to-date data available for analysis without manual intervention. Implementing monitoring is crucial to track the success and performance of batch jobs. Monitoring tools provide insights into job execution, identifying any failures or performance bottlenecks. For the investment bank, this ensures that any issues in the data ingestion process are promptly detected and resolved, maintaining the integrity and reliability of the data pipeline.
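For instance, a nightly run can be wired up as a minimal Apache Airflow DAG along the following lines; the DAG ID, task name, and callable are hypothetical, and the imports assume Airflow 2.x:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_batch_ingestion():
    # Placeholder for the extract, transform, and load steps described above
    print("Running nightly batch ingestion...")

with DAG(
    dag_id="nightly_trading_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="batch_ingest",
        python_callable=run_batch_ingestion,
    )
The equivalent Cron-based setup is simply a crontab entry pointing at the ingestion script, but a scheduler such as Airflow adds the retry, dependency, and monitoring features that help detect failed jobs promptly.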
Batch ingestion with an example
Let’s have a look at a simple example of a batch processing ingestion system written in Python. This example will simulate the ETL process. We’ll generate some mock data, process it in batches, and load it into a simulated database.
You can find the code for this part in the GitHub repository at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/1.batch.py
. To run this example, we don’t need any bespoke library installation. We just need to ensure that we are running it in a standard Python environment (Python 3.x):
We create a generate_mock_data function that generates a list of mock data records:
import random

def generate_mock_data(num_records):
    # Build a list of mock records, each with a random id and value
    data = []
    for _ in range(num_records):
        record = {
            'id': random.randint(1, 1000),
            'value': random.random() * 100
        }
        data.append(record)
    return data
Each record is a dictionary with two fields:
id: A random integer between 1 and 1000
value: A random float between 0 and 100
Let’s have a look at what the data looks like:
print("Original data: ", data)
{'id': 449, 'value': 99.79699336555473}
{'id': 991, 'value': 79.65999078145887}
A list of dictionaries is returned, each representing a data record.
Next, we create a batch processing function:
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
This function takes two parameters: data, the list of records to process, and batch_size, the number of records per batch. It uses a for loop to step through the data in increments of batch_size, yielding one slice of records at a time rather than building all the batches in memory at once.
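To see the generator in action, the two helpers can be combined as follows (a quick illustration rather than the book's next listing); with 10 records and a batch size of 4, it prints batches of 4, 4, and 2 records:
data = generate_mock_data(10)
for batch_number, batch in enumerate(process_in_batches(data, batch_size=4), start=1):
    print(f"Batch {batch_number}: {len(batch)} records")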