Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python

About this ebook

Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another significant challenge is the lack of skills among data professionals to analyze unstructured data, leading to valuable insights being missed that are difficult or impossible to obtain from structured data alone.
To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes the ingestion of data from various sources, the validation and profiling of data for high-quality end tables, and writing data to different sinks. You’ll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You’ll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.
By the end of this book, you’ll be proficient in data cleaning and preparation techniques for both structured and unstructured data.

Language: English
Release date: Sep 27, 2024
ISBN: 9781837632909

    Book preview

    Python Data Cleaning and Preparation Best Practices - Maria Zervou


    Python Data Cleaning and Preparation Best Practices

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Managers: Deepesh Patel and Chayan Majumdar

    Book Project Manager: Hemangi Lotlikar

    Senior Content Development Editor: Manikandan Kurup

    Technical Editor: Kavyashree K S

    Copy Editor: Safis Editing

    Proofreader: Manikandan Kurup

    Indexer: Hemangini Bari

    Production Designer: Joshua Misquitta

    Senior DevRel Marketing Executive: Nivedita Singh

    First published: September 2024

    Production reference: 1190924

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN 978-1-83763-474-3

    www.packtpub.com

    I want to extend my deepest thanks to those who have been by my side throughout the journey of writing this book while managing work in parallel. I am immensely grateful to everyone who has cheered me on, offered feedback, and inspired me to keep going. A special thanks to my family, for their unwavering support and for teaching me the power of determination. To my mentors, friends, and partner, who have guided me over the years and helped me see the bigger picture, and from whom I have learned so much! This accomplishment is as much yours as it is mine. Thank you for being part of this journey!

    – Maria Zervou

    Contributors

    About the author

    Maria Zervou is a Generative AI and machine learning expert, dedicated to making advanced technologies accessible. With over a decade of experience, she has led impactful AI projects across industries and mentored teams on cutting-edge advancements. As a machine learning specialist at Databricks, Maria drives innovative AI solutions and industry adoption. Beyond her role, she democratizes knowledge through her YouTube channel, featuring experts on AI topics. A recognized thought leader and finalist in the Women in Tech Excellence Awards, Maria advocates for responsible AI use and contributes to open source projects, fostering collaboration and empowering future AI leaders.

    About the reviewers

    Mohammed Kamil Khan is currently a scientific programmer at UTHealth Houston’s McWilliams School of Biomedical Informatics, wherein he works on data preprocessing, GWAS, and post-GWAS analysis of imaging data. He has a master’s degree from the University of Houston – Downtown (UHD), having majored in data analytics. With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all. Moreover, Kamil’s commitment to sharing his expertise led him to publish articles on platforms such as DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com. These articles explore a diverse range of topics, including pandas DataFrames, API data extraction, SQL queries, and much more.

    Ashish Shukla is a seasoned professional with 12 years of experience, specializing in Azure technologies, particularly Azure Databricks, for the past 9 years. Formerly associated with Microsoft, Ashish has been instrumental in leading numerous successful projects leveraging Azure Databricks. Currently serving as an associate manager of data operations at PepsiCo India, he brings extensive expertise in cloud-based data solutions, ensuring robust and innovative data operations strategies.

    Beyond his professional roles, Ashish is an active contributor to the Azure community through his technical blogs and engagements as a speaker on Azure technologies, where he shares valuable insights and best practices in data management and cloud computing.

    Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, including C++, Java, Python, Angular, Golang, and data warehouses.

    When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction, nonfiction, and technical books and participating in Hackathons. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group.

    You can connect with Krishnan at [email protected] or via LinkedIn.

    I’d like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.

    Table of Contents

    Preface

    Part 1: Upstream Data Ingestion and Cleaning

    1

    Data Ingestion Techniques

    Technical requirements

    Ingesting data in batch mode

    Advantages and disadvantages

    Common use cases for batch ingestion

    Batch ingestion use cases

    Batch ingestion with an example

    Ingesting data in streaming mode

    Advantages and disadvantages

    Common use cases for streaming ingestion

    Streaming ingestion in an e-commerce platform

    Streaming ingestion with an example

    Real-time versus semi-real-time ingestion

    Common use cases for near-real-time ingestion

    Semi-real-time mode with an example

    Data source solutions

    Event data processing solution

    Ingesting event data with Apache Kafka

    Ingesting data from databases

    Performing data ingestion from cloud-based file systems

    APIs

    Summary

    2

    Importance of Data Quality

    Technical requirements

    Why data quality is important

    Dimensions of data quality

    Completeness

    Accuracy

    Timeliness

    Consistency

    Uniqueness

    Duplication

    Data usage

    Data compliance

    Implementing quality controls throughout the data life cycle

    Data silos and the impact on data quality

    Summary

    3

    Data Profiling – Understanding Data Structure, Quality, and Distribution

    Technical requirements

    Understanding data profiling

    Identifying goals of data profiling

    Exploratory data analysis options – profiler versus manual

    Profiling data with pandas’ ydata_profiling

    Overview

    Interactions

    Correlations

    Missing values

    Duplicate rows

    Sample dataset

    Profiling high volumes of data with the pandas data profiler

    Data validation with the Great Expectations library

    Configuring Great Expectations for your project

    Create your first Great Expectations data source

    Creating your first Great Expectations suite

    Great Expectations Suite report

    Manually edit Great Expectations

    Checkpoints

    Using pandas profiler to build your Great Expectations Suite

    Comparing Great Expectations and pandas profiler – when to use what

    Great Expectations and big data

    Summary

    4

    Cleaning Messy Data and Data Manipulation

    Technical requirements

    Renaming columns

    Renaming a single column

    Renaming all columns

    Removing irrelevant or redundant columns

    Dealing with inconsistent and incorrect data types

    Inspecting columns

    Columnar type transformations

    Converting to numeric types

    Converting to string types

    Converting to categorical types

    Converting to Boolean types

    Working with dates and times

    Importing and parsing date and time data

    Extracting components from dates and times

    Calculating time differences and durations

    Handling time zones and daylight saving time

    Summary

    5

    Data Transformation – Merging and Concatenating

    Technical requirements

    Joining datasets

    Choosing the correct merge strategy

    Handling duplicates when merging datasets

    Why handle duplication in rows and columns?

    Dropping duplicate rows

    Validating data before merging

    Aggregation

    Concatenation

    Handling duplication in columns

    Performance tricks for merging

    Set indexes

    Sorting indexes

    Merge versus join

    Concatenating DataFrames

    Row-wise concatenation

    Column-wise concatenation

    Summary

    References

    6

    Data Grouping, Aggregation, Filtering, and Applying Functions

    Technical requirements

    Grouping data using one or multiple keys

    Grouping data using one key

    Grouping data using multiple keys

    Best practices for grouping

    Applying aggregate functions on grouped data

    Basic aggregate functions

    Advanced aggregation with multiple columns

    Applying custom aggregate functions

    Best practices for aggregate functions

    Using the apply function on grouped data

    Data filtering

    Multiple criteria for filtering

    Best practices for filtering

    Performance considerations as data grows

    Summary

    7

    Data Sinks

    Technical requirements

    Choosing the right data sink for your use case

    Relational databases

    NoSQL databases

    Data warehouses

    Data lakes

    Streaming data sinks

    Which sink is the best for my use case?

    Decoding file types for optimal usage

    Navigating partitioning

    Horizontal versus vertical partitioning

    Time-based partitioning

    Geographic partitioning

    Hybrid partitioning

    Considerations for choosing partitioning strategies

    Designing an online retail data platform

    Summary

    Part 2: Downstream Data Cleaning – Consuming Structured Data

    8

    Detecting and Handling Missing Values and Outliers

    Technical requirements

    Detecting missing data

    Handling missing data

    Deletion of missing data

    Imputation of missing data

    Mean imputation

    Median imputation

    Creating indicator variables

    Comparison between imputation methods

    Detecting and handling outliers

    Impact of outliers

    Identifying univariate outliers

    Handling univariate outliers

    Identifying multivariate outliers

    Handling multivariate outliers

    Summary

    9

    Normalization and Standardization

    Technical requirements

    Scaling features to a range

    Min-max scaling

    Z-score scaling

    When to use Z-score scaling

    Robust scaling

    Comparison between methods

    Summary

    10

    Handling Categorical Features

    Technical requirements

    Label encoding

    Use case – employee performance analysis

    Considerations for label encoding

    One-hot encoding

    When to use one-hot encoding

    Use case – customer churn prediction

    Considerations for one-hot encoding

    Target encoding (mean encoding)

    When to use target encoding

    Use case – sales prediction for retail stores

    Considerations for target encoding

    Frequency encoding

    When to use frequency encoding

    Use case – customer product preference analysis

    Considerations for frequency encoding

    Binary encoding

    When to use binary encoding

    Use case – customer subscription prediction

    Considerations for binary encoding

    Summary

    11

    Consuming Time Series Data

    Technical requirements

    Understanding the components of time series data

    Trend

    Seasonality

    Noise

    Types of time series data

    Univariate time series data

    Multivariate time series data

    Identifying missing values in time series data

    Checking for NaNs or null values

    Visual inspection

    Handling missing values in time series data

    Removing missing data

    Forward and backward fill

    Interpolation

    Comparing the different methods for missing values

    Analyzing time series data

    Autocorrelation and partial autocorrelation

    ACF and PACF in the stock market use case

    Dealing with outliers

    Identifying outliers with seasonal decomposition

    Handling outliers – model-based approaches – ARIMA

    Moving window techniques

    Feature engineering for time series data

    Lag features and their importance

    Differencing time series

    Applying time series techniques in different industries

    Summary

    Part 3: Downstream Data Cleaning – Consuming Unstructured Data

    12

    Text Preprocessing in the Era of LLMs

    Technical requirements

    Relearning text preprocessing in the era of LLMs

    Text cleaning

    Removing HTML tags and special characters

    Handling capitalization and letter case

    Dealing with numerical values and symbols

    Addressing whitespace and formatting issues

    Removing personally identifiable information

    Handling rare words and spelling variations

    Dealing with rare words

    Addressing spelling variations and typos

    Chunking

    Tokenization

    Word tokenization

    Subword tokenization

    Domain-specific data

    Turning tokens into embeddings

    BERT – Contextualized Embedding Models

    BGE

    GTE

    Selecting the right embedding model

    Solving real problems with embeddings

    Summary

    13

    Image and Audio Preprocessing with LLMs

    Technical requirements

    The current era of image preprocessing

    Loading the images

    Resizing and cropping

    Normalizing and standardizing the dataset

    Data augmentation

    Noise reduction

    Extracting text from images

    PaddleOCR

    Using LLMs with OCR

    Creating image captions

    Handling audio data

    Using Whisper for audio-to-text conversion

    Extracting text from audio

    Future research in audio preprocessing

    Summary

    This concludes the book! You did it!

    Index

    Other Books You May Enjoy

    Preface

    In today’s fast-paced data-driven world, it’s easy to be dazzled by the headlines about artificial intelligence (AI) breakthroughs and advanced machine learning (ML) models. But ask any seasoned data scientist or engineer, and they’ll tell you the same thing: the true foundation of any successful data project is not the flashy algorithms or sophisticated models—it’s the data itself, and more importantly, how that data is prepared.

    Throughout my career, I have learned that data preprocessing is the unsung hero of data science. It’s the meticulous, often complex process that turns raw data into a reliable asset, ready for analysis, modeling, and ultimately, decision-making. I’ve seen firsthand how the right preprocessing techniques can transform an organization’s approach to data, turning potential challenges into powerful opportunities.

    Yet, despite its importance, data preprocessing is often overlooked or undervalued. Many see it as a tedious step, a bottleneck that slows down the exciting work of building models and delivering insights. But I’ve always believed that this phase is where the most critical work happens. After all, even the most sophisticated algorithms can’t make up for poor-quality data. That’s why I’ve dedicated much of my professional journey to mastering this art—exploring the best tools, techniques, and strategies to make preprocessing more efficient, scalable, and aligned with the ever-evolving landscape of AI.

    This book aims to demystify the data preprocessing process, offering both a solid grounding in traditional methods and a forward-looking perspective on emerging techniques. We’ll explore how Python can be leveraged to clean, transform, and organize data more effectively. We’ll also look at how the advent of large language models (LLMs) is redefining what’s possible in this space. These models are already proving to be game changers, automating tasks that were once manual and time-consuming, and providing new ways to enhance data quality and usability.

    Throughout the pages, I’ll share insights from my experiences, the challenges faced, and the lessons learned along the way. My hope is to provide you with not just a technical roadmap but also a deeper understanding of the strategic importance of data preprocessing in today’s data ecosystem. I strongly believe in the philosophy of learning by doing, so this book includes a wealth of code examples for you to follow along with. I encourage you to try these examples, experiment with the code, and challenge yourself to apply the techniques to your own datasets.

    By the end of this book, you’ll be equipped with the knowledge and skills to approach data preprocessing not just as a necessary step but also as a critical component of your overall data strategy.

    So, whether you’re a data scientist, engineer, analyst, or simply someone looking to enhance their understanding of data processes, I invite you to join me on this journey. Together, we will explore how to harness the power of data preprocessing to unlock the full potential of your data.

    Who this book is for

    This book is for readers with a working knowledge of Python, a good grasp of statistical concepts, and some experience in manipulating data. It will not start from scratch but will instead build on existing skills, introducing you to sophisticated preprocessing strategies, hands-on code examples, and practical exercises that require a degree of familiarity with the core principles of data science and analytics.

    What this book covers

    Chapter 1

    , Data Ingestion Techniques, provides a comprehensive overview of the data ingestion process, emphasizing its role in collecting and importing data from various sources into storage systems for analysis. You will explore different ingestion methods such as batch and streaming modes, compare real-time and semi-real-time ingestion, and understand the technologies behind data sources. The chapter highlights the advantages, disadvantages, and practical applications of these methods.

    Chapter 2

    , Importance of Data Quality, emphasizes the critical role data quality plays in business decision-making. It highlights the risks of using inaccurate, inconsistent, or outdated data, which can lead to poor decisions, damaged reputations, and missed opportunities. You will explore why data quality is essential, how to measure it across different dimensions, and the impact of data silos on maintaining data quality.

    Chapter 3

    , Data Profiling – Understanding Data Structure, Quality, and Distribution, explores data profiling and focuses on scrutinizing and validating datasets to understand their structure, patterns, and quality. You will learn how to perform data profiling using tools such as the pandas Profiler and Great Expectations and understand when to use each tool. Additionally, the chapter covers tactics for handling large data volumes and compares profiling methods to improve data validation.

    Chapter 4

    , Cleaning Messy Data and Data Manipulation, focuses on the key strategies for cleaning and manipulating data, enabling efficient and accurate analysis. It covers techniques for renaming columns, removing irrelevant or redundant data, fixing inconsistent data types, and handling date and time formats. By mastering these methods, you will learn how to enhance the quality and reliability of your datasets.

    Chapter 5

    , Data Transformation – Merging and Concatenating, explores techniques for transforming and manipulating data through merging, joining, and concatenating datasets. It covers methods to combine multiple datasets from various sources, handle duplicates effectively, and improve merging performance. The chapter also provides practical tricks to streamline the merging process, ensuring efficient data integration for insightful analysis.

    Chapter 6

    , Data Grouping, Aggregation, Filtering, and Applying Functions, covers the essential techniques of data grouping and aggregation, which are vital for summarizing large datasets and generating meaningful insights. It discusses methods to handle missing or noisy data by aggregating values, reducing data volume, and enhancing processing efficiency. The chapter also focuses on grouping data by various keys, applying aggregate and custom functions, and filtering data to create valuable features for deeper analysis or ML.

    Chapter 7

    , Data Sinks, focuses on the critical decisions involved in data processing, particularly the selection of appropriate data sinks for storage and processing needs. It delves into four essential pillars: choosing the right data sink, selecting the correct file type, optimizing partitioning strategies, and understanding how to design a scalable online retail data platform. The chapter equips you with the tools to enhance efficiency, scalability, and performance in data processing pipelines.

    Chapter 8

    , Detecting and Handling Missing Values and Outliers, delves into techniques for identifying and managing missing values and outliers. It covers a range of methods, from statistical approaches to advanced ML models, to address these issues effectively. The key areas of focus include detecting and handling missing data, identifying univariate and multivariate outliers, and managing outliers in various datasets.

    Chapter 9

    , Normalization and Standardization, covers essential preprocessing techniques such as feature scaling, normalization, and standardization, which ensure that ML models can effectively learn from data. You will explore different techniques, including scaling features to a range, Z-score scaling, and using a robust scaler, to address various data challenges in ML tasks.

    Chapter 10

    , Handling Categorical Features, addresses the importance of managing categorical features, which represent non-numerical information in datasets. You will learn various encoding techniques, including label encoding, one-hot encoding, target encoding, frequency encoding, and binary encoding, to transform categorical data for ML models.

    Chapter 11

    , Consuming Time Series Data, delves into the fundamentals of time series analysis, covering key concepts, methodologies, and applications across various industries. It includes understanding the components and types of time series data, identifying and handling missing values, and techniques for analyzing trends and patterns over time. The chapter also addresses dealing with outliers and feature engineering to enhance predictive modeling with time series data.

    Chapter 12

    , Text Preprocessing in the Era of LLMs, focuses on mastering text preprocessing techniques that are essential for optimizing the performance of LLMs. It covers methods for cleaning text, handling rare words and spelling variations, chunking, and tokenization strategies. Additionally, it addresses the transformation of tokens into embeddings, highlighting the importance of adapting preprocessing approaches to maximize the potential of LLMs.

    Chapter 13

    , Image and Audio Preprocessing with LLMs, examines preprocessing techniques for unstructured data, particularly images and audio, to extract meaningful information. It includes methods for image preprocessing, such as optical character recognition (OCR) and image caption generation with the BLIP model. The chapter also explores audio data handling, including converting audio to text using the Whisper model, providing a comprehensive overview of working with multimedia data in the context of LLMs.

    To get the most out of this book

    To benefit fully from this book, you should have a good knowledge of Python and a grasp of data engineering and data science basics.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    The GitHub repository follows the chapters of the book, and all the scripts are numbered according to the sections within each chapter. Each script is independent of the others, so you can move ahead without having to run all the scripts beforehand. However, it is strongly recommended that you follow the flow of the book so that you don’t miss any necessary information.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The delete_entry() function is used to remove an entry, showing how data can be deleted from the store.”

    A block of code is set as follows:

    def process_in_batches(data, batch_size):
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    user_satisfaction_scores = [
        random.randint(1, 5) for _ in range(num_users)]

    Any command-line input or output is written as follows:

    $ mkdir data
    $ pip install pandas

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “It involves storing data on remote servers accessed from anywhere via the internet, rather than on local devices.”

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share your thoughts

    Once you’ve read Python Data Cleaning and Preparation Best Practices, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    1. Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781837634743

    2. Submit your proof of purchase

    3. That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1: Upstream Data Ingestion and Cleaning

    This part focuses on the foundational stages of data processing, starting from data ingestion to ensuring its quality and structure for downstream tasks. It guides readers through the essential steps of importing, cleaning, and transforming data, which lay the groundwork for effective data analysis. The chapters explore various methods for ingesting data, maintaining high-quality datasets, profiling data for better insights, and cleaning messy data to make it ready for analysis. Further, it covers advanced techniques like merging, concatenating, grouping, and filtering data, along with choosing appropriate data destinations or sinks to optimize processing pipelines. Each chapter in this part equips readers with the knowledge to handle raw data and turn it into a clean, structured, and usable form.

    This part has the following chapters:

    Chapter 1

    , Data Ingestion Techniques

    Chapter 2

    , Importance of Data Quality

    Chapter 3

    , Data Profiling – Understanding Data Structure, Quality, and Distribution

    Chapter 4

    , Cleaning Messy Data and Data Manipulation

    Chapter 5

    , Data Transformation – Merging and Concatenating

    Chapter 6

    , Data Grouping, Aggregation, Filtering, and Applying Functions

    Chapter 7

    , Data Sinks

    1

    Data Ingestion Techniques

    Data ingestion is a critical component of the data life cycle and sets the foundation for subsequent data transformation and cleaning. It involves the process of collecting and importing data from various sources into a storage system where it can be accessed and analyzed. Effective data ingestion is crucial for ensuring data quality, integrity, and availability, which directly impacts the efficiency and accuracy of data transformation and cleaning processes. In this chapter, we will dive deep into the different types of data sources, explore various data ingestion methods, and discuss their respective advantages, disadvantages, and real-world applications.

    In this chapter, we’ll cover the following topics:

    Ingesting data in batch mode

    Ingesting data in streaming mode

    Real-time versus semi-real-time ingestion

    Data source technologies

    Technical requirements

    You can find all the code for the chapter in the following GitHub repository:

    https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/tree/main/chapter01

    You can use your favorite IDE (VS Code, PyCharm, Google Colab, etc.) to write and execute your code.

    Ingesting data in batch mode

    Batch ingestion is a data processing technique whereby large volumes of data are collected, processed, and loaded into a system at scheduled intervals, rather than in real-time. This approach allows organizations to handle substantial amounts of data efficiently by grouping data into batches, which are then processed collectively. For example, a company might collect customer transaction data throughout the day and then process it in a single batch during off-peak hours. This method is particularly useful for organizations that need to process high volumes of data but do not require immediate analysis.

    Batch ingestion is beneficial because it optimizes system resources by spreading the processing load across scheduled times, often when the system is underutilized. This reduces the strain on computational resources and can lower costs, especially in cloud-based environments where computing power is metered. Additionally, batch processing simplifies data management, as it allows for the easy application of consistent transformations and validations across large datasets. For organizations with regular, predictable data flows, batch ingestion provides a reliable, scalable, and cost-effective solution for data processing and analytics.

    Let’s explore batch ingestion in more detail, starting with its advantages and disadvantages.

    Advantages and disadvantages

    Batch ingestion offers several notable advantages that make it an attractive choice for many data processing needs:

    Efficiency is a key benefit, as batch processing allows for the handling of large volumes of data in a single operation, optimizing resource usage and minimizing overhead

    Cost-effectiveness is another benefit, as it reduces the need for continuous processing resources and lowers operational expenses

    Simplicity makes it easier to manage and implement periodic data processing tasks compared to real-time ingestion, which often requires more complex infrastructure and management

    Robustness is a further advantage, as batch processing is well-suited for performing complex data transformations and comprehensive data validation, ensuring high-quality, reliable data

    However, batch ingestion also comes with certain drawbacks:

    There is an inherent delay between the generation of data and its availability for analysis, which can be a critical issue for applications requiring real-time insights

    Resource spikes can occur during batch processing windows, leading to high resource usage and potential performance bottlenecks

    Scalability can also be a concern, as handling very large datasets may require significant infrastructure investment and management

    Lastly, maintenance is a crucial aspect of batch ingestion; it demands careful scheduling and ongoing maintenance to ensure the timely and reliable execution of batch jobs

    Let’s look at some common use cases for ingesting data in batch mode.

    Common use cases for batch ingestion

    Data analytics platforms such as data warehouses and data lakes require regularly updated data for Business Intelligence (BI) and reporting. Batch ingestion is integral here, as it ensures that data is continually updated with the latest information, enabling businesses to perform comprehensive and up-to-date analyses. By processing data in batches, organizations can efficiently handle vast amounts of transactional and operational data, transforming it into a structured format suitable for querying and reporting. This supports BI initiatives, allowing analysts and decision-makers to generate insightful reports, track Key Performance Indicators (KPIs), and make data-driven decisions.

    Extract, Transform, and Load (ETL) processes are a cornerstone of data integration projects, and batch ingestion plays a crucial role in these workflows. In ETL processes, data is extracted from various sources, transformed to fit the operational needs of the target system, and loaded into a database or data warehouse. Batch processing allows for efficient handling of these steps, particularly when dealing with large datasets that require significant transformation and cleansing. This method is ideal for periodic data consolidation, where data from disparate systems is integrated to provide a unified view, supporting activities such as data migration, system integration, and master data management.

    Batch ingestion is also widely used for backups and archiving, which are critical processes for data preservation and disaster recovery. Periodic batch processing allows for the scheduled backup of databases, ensuring that all data is captured and securely stored at regular intervals. This approach minimizes the risk of data loss and provides a reliable restore point in case of system failures or data corruption. Additionally, batch processing is used for data archiving, where historical data is periodically moved from active systems to long-term storage solutions. This not only helps in managing storage costs but also ensures that important data is retained and can be retrieved for compliance, auditing, or historical analysis purposes.

    Batch ingestion use cases

    Batch ingestion is a methodical process involving several key steps: data extraction, data transformation, data loading, scheduling, and automation. To illustrate these steps, let’s explore a use case involving an investment bank that needs to process and analyze trading data for regulatory compliance and performance reporting.

    Batch ingestion in an investment bank

    An investment bank needs to collect, transform, and load trading data from various financial markets into a central data warehouse. This data will be used for generating daily compliance reports, evaluating trading strategies, and making informed investment decisions.

    Data extraction

    The first step is identifying the sources from which data will be extracted. For the investment bank, this includes trading systems, market data providers, and internal risk management systems. These sources contain critical data such as trade execution details, market prices, and risk assessments. Once the sources are identified, data is collected using connectors or scripts. This involves setting up data pipelines that extract data from trading systems, import real-time market data feeds, and pull risk metrics from internal systems. The extracted data is then temporarily stored in staging areas before processing.
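
    As a rough illustration of the extraction step, the following minimal sketch stages a daily trade export before any processing takes place. The file path, column layout, and staging directory are hypothetical placeholders for this example, not part of the book’s code:

    import pandas as pd
    from pathlib import Path

    STAGING_DIR = Path("staging")  # hypothetical staging area
    STAGING_DIR.mkdir(exist_ok=True)

    # Extract a daily export from a trading system (placeholder CSV path)
    trades = pd.read_csv("exports/trades_2024-09-01.csv")

    # Persist the raw extract to the staging area untouched, so downstream
    # transformations never modify the original source files
    trades.to_csv(STAGING_DIR / "trades_raw.csv", index=False)
    print(f"Staged {len(trades)} trade records")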

    Data transformation

    The extracted data often contains inconsistencies, duplicates, and missing values. Data cleaning is performed to remove duplicates, fill in missing information, and correct errors. For the investment bank, this ensures that trade records are accurate and complete, providing a reliable foundation for compliance reporting and performance analysis. After cleaning, the data undergoes transformations such as aggregations, joins, and calculations. For example, the investment bank might aggregate trade data to calculate daily trading volumes, join trade records with market data to analyze price movements, and calculate key metrics such as Profit and Loss (P&L) and risk exposure. The transformed data must be mapped to the schema of the target system. This involves aligning the data fields with the structure of the data warehouse. For instance, trade data might be mapped to tables representing transactions, market data, and risk metrics, ensuring seamless integration with the existing data model.
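
    The following is a minimal pandas sketch of these transformations, assuming hypothetical column names (symbol, trade_date, quantity, price, close_price) purely for illustration:

    import pandas as pd

    trades = pd.read_csv("staging/trades_raw.csv")
    market = pd.read_csv("staging/market_prices.csv")  # hypothetical market data extract

    # Cleaning: remove duplicate trade records and fill missing quantities
    trades = trades.drop_duplicates()
    trades["quantity"] = trades["quantity"].fillna(0)

    # Join trade records with market data to analyze price movements
    enriched = trades.merge(market, on=["symbol", "trade_date"], how="left")

    # Calculate a simple per-trade P&L and aggregate daily trading volumes
    enriched["pnl"] = (enriched["close_price"] - enriched["price"]) * enriched["quantity"]
    daily = enriched.groupby(["trade_date", "symbol"]).agg(
        daily_volume=("quantity", "sum"),
        daily_pnl=("pnl", "sum"),
    ).reset_index()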

    Data loading

    The transformed data is processed in batches, which allows the investment bank to handle large volumes of data efficiently, performing complex transformations and aggregations in a single run. Once processed, the data is loaded into the target storage system, such as a data warehouse or data lake. For the investment bank, this means loading the cleaned and transformed trading data into their data warehouse, where it can be accessed for compliance reporting and performance analysis.
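
    To make the loading step concrete, here is a hedged sketch that appends the transformed metrics to a SQL table; SQLite stands in for the data warehouse, and the file and table names are placeholders:

    import sqlite3
    import pandas as pd

    daily = pd.read_csv("staging/daily_metrics.csv")  # output of the transformation step

    # SQLite stands in for the warehouse here; in practice, this would be a
    # connection to the bank's data warehouse or data lake
    conn = sqlite3.connect("warehouse.db")
    daily.to_sql("daily_trading_metrics", conn, if_exists="append", index=False)
    conn.close()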

    Scheduling and automation

    To ensure that the batch ingestion process runs smoothly and consistently, scheduling tools such as Apache Airflow or Cron jobs are used. These tools automate the data ingestion workflows, scheduling them to run at regular intervals, such as every night or every day. This allows the investment bank to have up-to-date data available for analysis without manual intervention. Implementing monitoring is crucial to track the success and performance of batch jobs. Monitoring tools provide insights into job execution, identifying any failures or performance bottlenecks. For the investment bank, this ensures that any issues in the data ingestion process are promptly detected and resolved, maintaining the integrity and reliability of the data pipeline.
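
    As an indication of what such automation can look like, the following is a minimal Apache Airflow sketch for a nightly run; the DAG name, schedule, and task callable are illustrative assumptions rather than the pipeline described in this chapter:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_batch_ingestion():
        # Placeholder for the extract -> transform -> load steps described above
        ...

    with DAG(
        dag_id="nightly_trading_ingestion",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # run once per day, after market close
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="batch_ingestion",
            python_callable=run_batch_ingestion,
        )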

    Batch ingestion with an example

    Let’s have a look at a simple example of a batch ingestion system written in Python. This example will simulate the ETL process: we’ll generate some mock data, process it in batches, and load it into a simulated database.

    You can find the code for this part in the GitHub repository at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/1.batch.py. To run this example, we don’t need any bespoke library installation. We just need to ensure that we are running it in a standard Python environment (Python 3.x):

    We create a generate_mock_data function that generates a list of mock data records:

    def generate_mock_data(num_records):
        data = []
        for _ in range(num_records):
            record = {
                'id': random.randint(1, 1000),
                'value': random.random() * 100
            }
            data.append(record)
        return data

    Each record is a dictionary with two fields:

    id: A random integer between 1 and 1000

    value: A random float between 0 and 100

    Let’s have a look at what the data looks like:

    print("Original data:", data)

    {'id': 449, 'value': 99.79699336555473}

    {'id': 991, 'value': 79.65999078145887}

    A list of dictionaries is returned, each representing a data record.

    Next, we create a batch processing function:

    def process_in_batches(data, batch_size):
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

    This function takes the data, which is a list of data records to process, and batch_size, which represents the number of records per batch, as parameters. The function uses a for loop to iterate over the data in steps of batch_size.
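
    To see how these pieces could fit together, a hypothetical continuation (not the book’s own script, which is available in the repository linked above) might iterate over the batches and load each one into a simulated database:

    def load_to_database(batch):
        # Simulated load step; a real pipeline would write to a database here
        print(f"Loading {len(batch)} records into the database")

    data = generate_mock_data(10)
    for batch in process_in_batches(data, batch_size=3):
        load_to_database(batch)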
