Python Data Cleaning and Preparation Best Practices: A practical guide to organizing and handling data from various sources and formats using Python
By Maria Zervou
About this ebook
Professionals face several challenges in effectively leveraging data in today's data-driven world. One of the main challenges is the low quality of data products, often caused by inaccurate, incomplete, or inconsistent data. Another significant challenge is that many data professionals lack the skills to analyze unstructured data, so valuable insights that are difficult or impossible to obtain from structured data alone are missed.
To help you tackle these challenges, this book will take you on a journey through the upstream data pipeline, which includes the ingestion of data from various sources, the validation and profiling of data for high-quality end tables, and writing data to different sinks. You’ll focus on structured data by performing essential tasks, such as cleaning and encoding datasets and handling missing values and outliers, before learning how to manipulate unstructured data with simple techniques. You’ll also be introduced to a variety of natural language processing techniques, from tokenization to vector models, as well as techniques to structure images, videos, and audio.
By the end of this book, you’ll be proficient in data cleaning and preparation techniques for both structured and unstructured data.
Python Data Cleaning and Preparation Best Practices
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Managers: Deepesh Patel and Chayan Majumdar
Book Project Manager: Hemangi Lotlikar
Senior Content Development Editor: Manikandan Kurup
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Joshua Misquitta
Senior DevRel Marketing Executive: Nivedita Singh
First published: September 2024
Production reference: 1190924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-83763-474-3
www.packtpub.com
I want to extend my deepest thanks to those who have been by my side throughout the journey of writing this book while managing work in parallel. I am immensely grateful to everyone who has cheered me on, offered feedback, and inspired me to keep going. A special thanks to my family, for their unwavering support and for teaching me the power of determination. To my mentors, friends, and partner, who have guided me over the years and helped me see the bigger picture, and from whom I have learned so much! This accomplishment is as much yours as it is mine. Thank you for being part of this journey!
– Maria Zervou
Contributors
About the author
Maria Zervou is a Generative AI and machine learning expert, dedicated to making advanced technologies accessible. With over a decade of experience, she has led impactful AI projects across industries and mentored teams on cutting-edge advancements. As a machine learning specialist at Databricks, Maria drives innovative AI solutions and industry adoption. Beyond her role, she democratizes knowledge through her YouTube channel, featuring experts on AI topics. A recognized thought leader and finalist in the Women in Tech Excellence Awards, Maria advocates for responsible AI use and contributes to open source projects, fostering collaboration and empowering future AI leaders.
About the reviewers
Mohammed Kamil Khan is currently a scientific programmer at UTHealth Houston’s McWilliams School of Biomedical Informatics, where he works on data preprocessing, GWAS, and post-GWAS analysis of imaging data. He has a master’s degree from the University of Houston – Downtown (UHD), having majored in data analytics. With an unwavering passion for democratizing knowledge, Kamil strives to make complex concepts accessible to all. His commitment to sharing his expertise has led him to publish articles on platforms such as DigitalOcean, Open Source For You magazine, and Red Hat’s opensource.com. These articles explore a diverse range of topics, including pandas DataFrames, API data extraction, SQL queries, and much more.
Ashish Shukla is a seasoned professional with 12 years of experience, specializing in Azure technologies, particularly Azure Databricks, for the past 9 years. Formerly associated with Microsoft, Ashish has been instrumental in leading numerous successful projects leveraging Azure Databricks. Currently serving as an associate manager of data operations at PepsiCo India, he brings extensive expertise in cloud-based data solutions, ensuring robust and innovative data operations strategies.
Beyond his professional roles, Ashish is an active contributor to the Azure community through his technical blogs and engagements as a speaker on Azure technologies, where he shares valuable insights and best practices in data management and cloud computing.
Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, including C++, Java, Python, Angular, Golang, and data warehouses.
When not working, Krishnan likes to spend time with his wife and daughter, read fiction, nonfiction, and technical books, and participate in hackathons. Krishnan tries to give back to the community by being part of the GDG – Pune volunteer group.
You can connect with Krishnan at [email protected] or via LinkedIn.
I’d like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.
Table of Contents
Preface
Part 1: Upstream Data Ingestion and Cleaning
1
Data Ingestion Techniques
Technical requirements
Ingesting data in batch mode
Advantages and disadvantages
Common use cases for batch ingestion
Batch ingestion use cases
Batch ingestion with an example
Ingesting data in streaming mode
Advantages and disadvantages
Common use cases for streaming ingestion
Streaming ingestion in an e-commerce platform
Streaming ingestion with an example
Real-time versus semi-real-time ingestion
Common use cases for near-real-time ingestion
Semi-real-time mode with an example
Data source solutions
Event data processing solution
Ingesting event data with Apache Kafka
Ingesting data from databases
Performing data ingestion from cloud-based file systems
APIs
Summary
2
Importance of Data Quality
Technical requirements
Why data quality is important
Dimensions of data quality
Completeness
Accuracy
Timeliness
Consistency
Uniqueness
Duplication
Data usage
Data compliance
Implementing quality controls throughout the data life cycle
Data silos and the impact on data quality
Summary
3
Data Profiling – Understanding Data Structure, Quality, and Distribution
Technical requirements
Understanding data profiling
Identifying goals of data profiling
Exploratory data analysis options – profiler versus manual
Profiling data with pandas’ ydata_profiling
Overview
Interactions
Correlations
Missing values
Duplicate rows
Sample dataset
Profiling high volumes of data with the pandas data profiler
Data validation with the Great Expectations library
Configuring Great Expectations for your project
Create your first Great Expectations data source
Creating your first Great Expectations suite
Great Expectations Suite report
Manually edit Great Expectations
Checkpoints
Using pandas profiler to build your Great Expectations Suite
Comparing Great Expectations and pandas profiler – when to use what
Great Expectations and big data
Summary
4
Cleaning Messy Data and Data Manipulation
Technical requirements
Renaming columns
Renaming a single column
Renaming all columns
Removing irrelevant or redundant columns
Dealing with inconsistent and incorrect data types
Inspecting columns
Columnar type transformations
Converting to numeric types
Converting to string types
Converting to categorical types
Converting to Boolean types
Working with dates and times
Importing and parsing date and time data
Extracting components from dates and times
Calculating time differences and durations
Handling time zones and daylight saving time
Summary
5
Data Transformation – Merging and Concatenating
Technical requirements
Joining datasets
Choosing the correct merge strategy
Handling duplicates when merging datasets
Why handle duplication in rows and columns?
Dropping duplicate rows
Validating data before merging
Aggregation
Concatenation
Handling duplication in columns
Performance tricks for merging
Set indexes
Sorting indexes
Merge versus join
Concatenating DataFrames
Row-wise concatenation
Column-wise concatenation
Summary
References
6
Data Grouping, Aggregation, Filtering, and Applying Functions
Technical requirements
Grouping data using one or multiple keys
Grouping data using one key
Grouping data using multiple keys
Best practices for grouping
Applying aggregate functions on grouped data
Basic aggregate functions
Advanced aggregation with multiple columns
Applying custom aggregate functions
Best practices for aggregate functions
Using the apply function on grouped data
Data filtering
Multiple criteria for filtering
Best practices for filtering
Performance considerations as data grows
Summary
7
Data Sinks
Technical requirements
Choosing the right data sink for your use case
Relational databases
NoSQL databases
Data warehouses
Data lakes
Streaming data sinks
Which sink is the best for my use case?
Decoding file types for optimal usage
Navigating partitioning
Horizontal versus vertical partitioning
Time-based partitioning
Geographic partitioning
Hybrid partitioning
Considerations for choosing partitioning strategies
Designing an online retail data platform
Summary
Part 2: Downstream Data Cleaning – Consuming Structured Data
8
Detecting and Handling Missing Values and Outliers
Technical requirements
Detecting missing data
Handling missing data
Deletion of missing data
Imputation of missing data
Mean imputation
Median imputation
Creating indicator variables
Comparison between imputation methods
Detecting and handling outliers
Impact of outliers
Identifying univariate outliers
Handling univariate outliers
Identifying multivariate outliers
Handling multivariate outliers
Summary
9
Normalization and Standardization
Technical requirements
Scaling features to a range
Min-max scaling
Z-score scaling
When to use Z-score scaling
Robust scaling
Comparison between methods
Summary
10
Handling Categorical Features
Technical requirements
Label encoding
Use case – employee performance analysis
Considerations for label encoding
One-hot encoding
When to use one-hot encoding
Use case – customer churn prediction
Considerations for one-hot encoding
Target encoding (mean encoding)
When to use target encoding
Use case – sales prediction for retail stores
Considerations for target encoding
Frequency encoding
When to use frequency encoding
Use case – customer product preference analysis
Considerations for frequency encoding
Binary encoding
When to use binary encoding
Use case – customer subscription prediction
Considerations for binary encoding
Summary
11
Consuming Time Series Data
Technical requirements
Understanding the components of time series data
Trend
Seasonality
Noise
Types of time series data
Univariate time series data
Multivariate time series data
Identifying missing values in time series data
Checking for NaNs or null values
Visual inspection
Handling missing values in time series data
Removing missing data
Forward and backward fill
Interpolation
Comparing the different methods for missing values
Analyzing time series data
Autocorrelation and partial autocorrelation
ACF and PACF in the stock market use case
Dealing with outliers
Identifying outliers with seasonal decomposition
Handling outliers – model-based approaches – ARIMA
Moving window techniques
Feature engineering for time series data
Lag features and their importance
Differencing time series
Applying time series techniques in different industries
Summary
Part 3: Downstream Data Cleaning – Consuming Unstructured Data
12
Text Preprocessing in the Era of LLMs
Technical requirements
Relearning text preprocessing in the era of LLMs
Text cleaning
Removing HTML tags and special characters
Handling capitalization and letter case
Dealing with numerical values and symbols
Addressing whitespace and formatting issues
Removing personally identifiable information
Handling rare words and spelling variations
Dealing with rare words
Addressing spelling variations and typos
Chunking
Tokenization
Word tokenization
Subword tokenization
Domain-specific data
Turning tokens into embeddings
BERT – Contextualized Embedding Models
BGE
GTE
Selecting the right embedding model
Solving real problems with embeddings
Summary
13
Image and Audio Preprocessing with LLMs
Technical requirements
The current era of image preprocessing
Loading the images
Resizing and cropping
Normalizing and standardizing the dataset
Data augmentation
Noise reduction
Extracting text from images
PaddleOCR
Using LLMs with OCR
Creating image captions
Handling audio data
Using Whisper for audio-to-text conversion
Extracting text from audio
Future research in audio preprocessing
Summary
This concludes the book! You did it!
Index
Other Books You May Enjoy
Preface
In today’s fast-paced data-driven world, it’s easy to be dazzled by the headlines about artificial intelligence (AI) breakthroughs and advanced machine learning (ML) models. But ask any seasoned data scientist or engineer, and they’ll tell you the same thing: the true foundation of any successful data project is not the flashy algorithms or sophisticated models—it’s the data itself, and more importantly, how that data is prepared.
Throughout my career, I have learned that data preprocessing is the unsung hero of data science. It’s the meticulous, often complex process that turns raw data into a reliable asset, ready for analysis, modeling, and ultimately, decision-making. I’ve seen firsthand how the right preprocessing techniques can transform an organization’s approach to data, turning potential challenges into powerful opportunities.
Yet, despite its importance, data preprocessing is often overlooked or undervalued. Many see it as a tedious step, a bottleneck that slows down the exciting work of building models and delivering insights. But I’ve always believed that this phase is where the most critical work happens. After all, even the most sophisticated algorithms can’t make up for poor-quality data. That’s why I’ve dedicated much of my professional journey to mastering this art—exploring the best tools, techniques, and strategies to make preprocessing more efficient, scalable, and aligned with the ever-evolving landscape of AI.
This book aims to demystify the data preprocessing process, offering both a solid grounding in traditional methods and a forward-looking perspective on emerging techniques. We’ll explore how Python can be leveraged to clean, transform, and organize data more effectively. We’ll also look at how the advent of large language models (LLMs) is redefining what’s possible in this space. These models are already proving to be game changers, automating tasks that were once manual and time-consuming, and providing new ways to enhance data quality and usability.
Throughout the pages, I’ll share insights from my experiences, the challenges faced, and the lessons learned along the way. My hope is to provide you with not just a technical roadmap but also a deeper understanding of the strategic importance of data preprocessing in today’s data ecosystem. I strongly believe in the philosophy of learning by doing, so this book includes a wealth of code examples for you to follow along with. I encourage you to try these examples, experiment with the code, and challenge yourself to apply the techniques to your own datasets.
By the end of this book, you’ll be equipped with the knowledge and skills to approach data preprocessing not just as a necessary step but also as a critical component of your overall data strategy.
So, whether you’re a data scientist, engineer, analyst, or simply someone looking to enhance their understanding of data processes, I invite you to join me on this journey. Together, we will explore how to harness the power of data preprocessing to unlock the full potential of your data.
Who this book is for
This book is for readers with a working knowledge of Python, a good grasp of statistical concepts, and some experience in manipulating data. This book will not start from scratch but will rather build on existing skills, introducing you to sophisticated preprocessing strategies, hands-on code examples, and practical exercises that require a degree of familiarity with the core principles of data science and analytics.
What this book covers
Chapter 1
, Data Ingestion Techniques, provides a comprehensive overview of the data ingestion process, emphasizing its role in collecting and importing data from various sources into storage systems for analysis. You will explore different ingestion methods such as batch and streaming modes, compare real-time and semi-real-time ingestion, and understand the technologies behind data sources. The chapter highlights the advantages, disadvantages, and practical applications of these methods.
Chapter 2
, Importance of Data Quality, emphasizes the critical role data quality plays in business decision-making. It highlights the risks of using inaccurate, inconsistent, or outdated data, which can lead to poor decisions, damaged reputations, and missed opportunities. You will explore why data quality is essential, how to measure it across different dimensions, and the impact of data silos on maintaining data quality.
Chapter 3
, Data Profiling – Understanding Data Structure, Quality, and Distribution, explores data profiling and focuses on scrutinizing and validating datasets to understand their structure, patterns, and quality. You will learn how to perform data profiling using tools such as the pandas Profiler and Great Expectations and understand when to use each tool. Additionally, the chapter covers tactics for handling large data volumes and compares profiling methods to improve data validation.
Chapter 4
, Cleaning Messy Data and Data Manipulation, focuses on the key strategies for cleaning and manipulating data, enabling efficient and accurate analysis. It covers techniques for renaming columns, removing irrelevant or redundant data, fixing inconsistent data types, and handling date and time formats. By mastering these methods, you will learn how to enhance the quality and reliability of your datasets.
Chapter 5
, Data Transformation – Merging and Concatenating, explores techniques for transforming and manipulating data through merging, joining, and concatenating datasets. It covers methods to combine multiple datasets from various sources, handle duplicates effectively, and improve merging performance. The chapter also provides practical tricks to streamline the merging process, ensuring efficient data integration for insightful analysis.
Chapter 6
, Data Grouping, Aggregation, Filtering, and Applying Functions, covers the essential techniques of data grouping and aggregation, which are vital for summarizing large datasets and generating meaningful insights. It discusses methods to handle missing or noisy data by aggregating values, reducing data volume, and enhancing processing efficiency. The chapter also focuses on grouping data by various keys, applying aggregate and custom functions, and filtering data to create valuable features for deeper analysis or ML.
Chapter 7
, Data Sinks, focuses on the critical decisions involved in data processing, particularly the selection of appropriate data sinks for storage and processing needs. It delves into four essential pillars: choosing the right data sink, selecting the correct file type, optimizing partitioning strategies, and understanding how to design a scalable online retail data platform. The chapter equips you with the tools to enhance efficiency, scalability, and performance in data processing pipelines.
Chapter 8
, Detecting and Handling Missing Values and Outliers, delves into techniques for identifying and managing missing values and outliers. It covers a range of methods, from statistical approaches to advanced ML models, to address these issues effectively. The key areas of focus include detecting and handling missing data, identifying univariate and multivariate outliers, and managing outliers in various datasets.
Chapter 9
, Normalization and Standardization, covers essential preprocessing techniques such as feature scaling, normalization, and standardization, which ensure that ML models can effectively learn from data. You will explore different techniques, including scaling features to a range, Z-score scaling, and using a robust scaler, to address various data challenges in ML tasks.
Chapter 10
, Handling Categorical Features, addresses the importance of managing categorical features, which represent non-numerical information in datasets. You will learn various encoding techniques, including label encoding, one-hot encoding, target encoding, frequency encoding, and binary encoding, to transform categorical data for ML models.
Chapter 11
, Consuming Time Series Data, delves into the fundamentals of time series analysis, covering key concepts, methodologies, and applications across various industries. It includes understanding the components and types of time series data, identifying and handling missing values, and techniques for analyzing trends and patterns over time. The chapter also addresses dealing with outliers and feature engineering to enhance predictive modeling with time series data.
Chapter 12
, Text Preprocessing in the Era of LLMs, focuses on mastering text preprocessing techniques that are essential for optimizing the performance of LLMs. It covers methods for cleaning text, handling rare words and spelling variations, chunking, and tokenization strategies. Additionally, it addresses the transformation of tokens into embeddings, highlighting the importance of adapting preprocessing approaches to maximize the potential of LLMs.
Chapter 13
, Image and Audio Preprocessing with LLMs, examines preprocessing techniques for unstructured data, particularly images and audio, to extract meaningful information. It includes methods for image preprocessing, such as optical character recognition (OCR) and image caption generation with the BLIP model. The chapter also explores audio data handling, including converting audio to text using the Whisper model, providing a comprehensive overview of working with multimedia data in the context of LLMs.
To get the most out of this book
To benefit fully from this book, you should have a good knowledge of Python and a grasp of data engineering and data science basics.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
The GitHub repository follows the chapters of the book, and all the scripts are numbered according to the sections within each chapter. Each script is independent of the others, so you can move ahead without having to run all the scripts beforehand. However, it is strongly recommended that you follow the flow of the book so that you don’t miss any necessary information.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices
. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/
. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The delete_entry() function is used to remove an entry, showing how data can be deleted from the store."
A block of code is set as follows:
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
user_satisfaction_scores = [
    random.randint(1, 5) for _ in range(num_users)]
Any command-line input or output is written as follows:
$ mkdir data
pip install pandas
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "It involves storing data on remote servers accessed from anywhere via the internet, rather than on local devices."
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected]
and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata
and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected]
with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com
.
Share your thoughts
Once you’ve read Python Data Cleaning and Preparation Best Practices, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page
for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below
https://packt.link/free-ebook/9781837634743
2. Submit your proof of purchase
3. That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Upstream Data Ingestion and Cleaning
This part focuses on the foundational stages of data processing, starting from data ingestion to ensuring its quality and structure for downstream tasks. It guides readers through the essential steps of importing, cleaning, and transforming data, which lay the groundwork for effective data analysis. The chapters explore various methods for ingesting data, maintaining high-quality datasets, profiling data for better insights, and cleaning messy data to make it ready for analysis. Further, it covers advanced techniques like merging, concatenating, grouping, and filtering data, along with choosing appropriate data destinations or sinks to optimize processing pipelines. Each chapter in this part equips readers with the knowledge to handle raw data and turn it into a clean, structured, and usable form.
This part has the following chapters:
Chapter 1
, Data Ingestion Techniques
Chapter 2
, Importance of Data Quality
Chapter 3
, Data Profiling – Understanding Data Structure, Quality, and Distribution
Chapter 4
, Cleaning Messy Data and Data Manipulation
Chapter 5
, Data Transformation – Merging and Concatenating
Chapter 6
, Data Grouping, Aggregation, Filtering, and Applying Functions
Chapter 7
, Data Sinks
1
Data Ingestion Techniques
Data ingestion is a critical component of the data life cycle and sets the foundation for subsequent data transformation and cleaning. It involves the process of collecting and importing data from various sources into a storage system where it can be accessed and analyzed. Effective data ingestion is crucial for ensuring data quality, integrity, and availability, which directly impacts the efficiency and accuracy of data transformation and cleaning processes. In this chapter, we will dive deep into the different types of data sources, explore various data ingestion methods, and discuss their respective advantages, disadvantages, and real-world applications.
In this chapter, we’ll cover the following topics:
Ingesting data in batch mode
Ingesting data in streaming mode
Real-time versus semi-real-time ingestion
Data source technologies
Technical requirements
You can find all the code for the chapter in the following GitHub repository:
https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/tree/main/chapter01
You can use your favorite IDE (VS Code, PyCharm, Google Colab, etc.) to write and execute your code.
Ingesting data in batch mode
Batch ingestion is a data processing technique whereby large volumes of data are collected, processed, and loaded into a system at scheduled intervals, rather than in real-time. This approach allows organizations to handle substantial amounts of data efficiently by grouping data into batches, which are then processed collectively. For example, a company might collect customer transaction data throughout the day and then process it in a single batch during off-peak hours. This method is particularly useful for organizations that need to process high volumes of data but do not require immediate analysis.
Batch ingestion is beneficial because it optimizes system resources by spreading the processing load across scheduled times, often when the system is underutilized. This reduces the strain on computational resources and can lower costs, especially in cloud-based environments where computing power is metered. Additionally, batch processing simplifies data management, as it allows for the easy application of consistent transformations and validations across large datasets. For organizations with regular, predictable data flows, batch ingestion provides a reliable, scalable, and cost-effective solution for data processing and analytics.
Let’s explore batch ingestion in more detail, starting with its advantages and disadvantages.
Advantages and disadvantages
Batch ingestion offers several notable advantages that make it an attractive choice for many data processing needs:
Efficiency is a key benefit, as batch processing allows for the handling of large volumes of data in a single operation, optimizing resource usage and minimizing overhead
Cost-effectiveness is another benefit, reducing the need for continuous processing resources and lowering operational expenses
Simplicity makes it easier to manage and implement periodic data processing tasks compared to real-time ingestion, which often requires more complex infrastructure and management
Robustness is also a strength, as batch processing is well-suited for performing complex data transformations and comprehensive data validation, ensuring high-quality, reliable data
However, batch ingestion also comes with certain drawbacks:
Latency is the most obvious drawback: there is an inherent delay between the generation of data and its availability for analysis, which can be a critical issue for applications requiring real-time insights
Resource spikes can occur during batch processing windows, leading to high resource usage and potential performance bottlenecks
Scalability can also be a concern, as handling very large datasets may require significant infrastructure investment and management
Lastly, maintenance is a crucial aspect of batch ingestion; it demands careful scheduling and ongoing attention to ensure the timely and reliable execution of batch jobs
Let’s look at some common use cases for ingesting data in batch mode.
Common use cases for batch ingestion
Any data analytics platform, such as a data warehouse or data lake, requires regularly updated data for Business Intelligence (BI) and reporting. Batch ingestion is integral here, as it ensures that data is continually refreshed with the latest information, enabling businesses to perform comprehensive and up-to-date analyses. By processing data in batches, organizations can efficiently handle vast amounts of transactional and operational data, transforming it into a structured format suitable for querying and reporting. This supports BI initiatives, allowing analysts and decision-makers to generate insightful reports, track Key Performance Indicators (KPIs), and make data-driven decisions.
Extract, Transform, and Load (ETL) processes are a cornerstone of data integration projects, and batch ingestion plays a crucial role in these workflows. In ETL processes, data is extracted from various sources, transformed to fit the operational needs of the target system, and loaded into a database or data warehouse. Batch processing allows for efficient handling of these steps, particularly when dealing with large datasets that require significant transformation and cleansing. This method is ideal for periodic data consolidation, where data from disparate systems is integrated to provide a unified view, supporting activities such as data migration, system integration, and master data management.
Batch ingestion is also widely used for backups and archiving, which are critical processes for data preservation and disaster recovery. Periodic batch processing allows for the scheduled backup of databases, ensuring that all data is captured and securely stored at regular intervals. This approach minimizes the risk of data loss and provides a reliable restore point in case of system failures or data corruption. Additionally, batch processing is used for data archiving, where historical data is periodically moved from active systems to long-term storage solutions. This not only helps in managing storage costs but also ensures that important data is retained and can be retrieved for compliance, auditing, or historical analysis purposes.
Batch ingestion use cases
Batch ingestion is a methodical process involving several key steps: data extraction, data transformation, data loading, scheduling, and automation. To illustrate these steps, let’s explore a use case involving an investment bank that needs to process and analyze trading data for regulatory compliance and performance reporting.
Batch ingestion in an investment bank
An investment bank needs to collect, transform, and load trading data from various financial markets into a central data warehouse. This data will be used for generating daily compliance reports, evaluating trading strategies, and making informed investment decisions.
Data extraction
The first step is identifying the sources from which data will be extracted. For the investment bank, this includes trading systems, market data providers, and internal risk management systems. These sources contain critical data such as trade execution details, market prices, and risk assessments. Once the sources are identified, data is collected using connectors or scripts. This involves setting up data pipelines that extract data from trading systems, import real-time market data feeds, and pull risk metrics from internal systems. The extracted data is then temporarily stored in staging areas before processing.
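As a rough sketch of this step, the following snippet copies newly arrived trade files from a drop folder into a date-partitioned staging area; the paths and the trades_*.csv naming pattern are hypothetical placeholders rather than part of the book's repository:
import shutil
from datetime import date
from pathlib import Path

def extract_to_staging(source_dir, staging_root):
    # Copy today's raw trade exports into a date-partitioned staging folder
    staging_dir = Path(staging_root) / date.today().isoformat()
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged_files = []
    for raw_file in Path(source_dir).glob("trades_*.csv"):
        target = staging_dir / raw_file.name
        shutil.copy2(raw_file, target)  # keep the raw file untouched at the source
        staged_files.append(target)
    return staged_files
In practice, the same idea extends to pulling data from APIs or database extracts, but keeping an untouched raw copy in staging is what gives the downstream transformation step a reliable starting point.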
Data transformation
The extracted data often contains inconsistencies, duplicates, and missing values. Data cleaning is performed to remove duplicates, fill in missing information, and correct errors. For the investment bank, this ensures that trade records are accurate and complete, providing a reliable foundation for compliance reporting and performance analysis. After cleaning, the data undergoes transformations such as aggregations, joins, and calculations. For example, the investment bank might aggregate trade data to calculate daily trading volumes, join trade records with market data to analyze price movements, and calculate key metrics such as Profit and Loss (P&L) and risk exposure. The transformed data must be mapped to the schema of the target system. This involves aligning the data fields with the structure of the data warehouse. For instance, trade data might be mapped to tables representing transactions, market data, and risk metrics, ensuring seamless integration with the existing data model.
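To make the cleaning and aggregation concrete, here is a small pandas sketch; the column names (trade_id, symbol, timestamp, price, quantity) are an assumed schema used purely for illustration:
import pandas as pd

def transform_trades(trades):
    # Remove duplicate and incomplete trade records (hypothetical schema)
    cleaned = (
        trades
        .drop_duplicates(subset="trade_id")
        .dropna(subset=["price", "quantity"])
        .assign(notional=lambda df: df["price"] * df["quantity"])
    )
    # Aggregate to one row per instrument per trading day
    cleaned["trade_date"] = pd.to_datetime(cleaned["timestamp"]).dt.date
    daily = cleaned.groupby(["trade_date", "symbol"], as_index=False).agg(
        daily_volume=("quantity", "sum"),
        daily_notional=("notional", "sum"),
        trade_count=("trade_id", "count"),
    )
    return daily
A real pipeline would add joins against market data and P&L calculations on top of this, but the shape of the code stays the same: clean first, then aggregate into the metrics the warehouse schema expects.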
Data loading
The transformed data is processed in batches, which allows the investment bank to handle large volumes of data efficiently, performing complex transformations and aggregations in a single run. Once processed, the data is loaded into the target storage system, such as a data warehouse or data lake. For the investment bank, this means loading the cleaned and transformed trading data into their data warehouse, where it can be accessed for compliance reporting and performance analysis.
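As a minimal illustration of the loading step, the snippet below appends the transformed batch to a table in a local SQLite database, standing in for the bank's data warehouse; the database file and table name are assumptions made for the sketch:
import sqlite3

def load_to_warehouse(daily_metrics, db_path="warehouse.db"):
    # Append this batch's aggregated metrics (a pandas DataFrame) to the stand-in warehouse table
    with sqlite3.connect(db_path) as conn:
        daily_metrics.to_sql(
            "daily_trading_metrics", conn,
            if_exists="append", index=False
        )
In a production setup, the target would be a proper warehouse or lakehouse table, and the write would typically be wrapped in checks that the same batch has not already been loaded.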
Scheduling and automation
To ensure that the batch ingestion process runs smoothly and consistently, scheduling tools such as Apache Airflow or Cron jobs are used. These tools automate the data ingestion workflows, scheduling them to run at regular intervals, such as every night or every day. This allows the investment bank to have up-to-date data available for analysis without manual intervention. Implementing monitoring is crucial to track the success and performance of batch jobs. Monitoring tools provide insights into job execution, identifying any failures or performance bottlenecks. For the investment bank, this ensures that any issues in the data ingestion process are promptly detected and resolved, maintaining the integrity and reliability of the data pipeline.
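For instance, a nightly run can be wired up as a minimal Apache Airflow DAG along the following lines; the DAG ID, task name, and callable are hypothetical, and the imports assume Airflow 2.x:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_batch_ingestion():
    # Placeholder for the extract, transform, and load steps described above
    print("Running nightly batch ingestion...")

with DAG(
    dag_id="nightly_trading_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="batch_ingest",
        python_callable=run_batch_ingestion,
    )
The equivalent Cron-based setup is simply a crontab entry pointing at the ingestion script, but a scheduler such as Airflow adds the retry, dependency, and monitoring features that help detect failed jobs promptly.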
Batch ingestion with an example
Let’s have a look at a simple example of a batch processing ingestion system written in Python. This example will simulate the ETL process. We’ll generate some mock data, process it in batches, and load it into a simulated database.
You can find the code for this part in the GitHub repository at https://github.com/PacktPublishing/Python-Data-Cleaning-and-Preparation-Best-Practices/blob/main/chapter01/1.batch.py
. To run this example, we don’t need any bespoke library installation. We just need to ensure that we are running it in a standard Python environment (Python 3.x):
We create a generate_mock_data function that generates a list of mock data records:
import random

def generate_mock_data(num_records):
    # Build a list of mock records, each with a random id and value
    data = []
    for _ in range(num_records):
        record = {
            'id': random.randint(1, 1000),
            'value': random.random() * 100
        }
        data.append(record)
    return data
Each record is a dictionary with two fields:
id: A random integer between 1 and 1000
value: A random float between 0 and 100
Let’s have a look at what the data looks like:
print("Original data: ", data)
{'id': 449, 'value': 99.79699336555473}
{'id': 991, 'value': 79.65999078145887}
A list of dictionaries is returned, each representing a data record.
Next, we create a batch processing function:
def process_in_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
This function takes two parameters: data, the list of records to process, and batch_size, the number of records per batch. It uses a for loop to step through the data in increments of batch_size, yielding one slice of records at a time rather than building all the batches in memory at once.
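To see the generator in action, the two helpers can be combined as follows (a quick illustration rather than the book's next listing); with 10 records and a batch size of 4, it prints batches of 4, 4, and 2 records:
data = generate_mock_data(10)
for batch_number, batch in enumerate(process_in_batches(data, batch_size=4), start=1):
    print(f"Batch {batch_number}: {len(batch)} records")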