Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics
By Roy Jafari
About this ebook
Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who’s developed college-level courses on data preprocessing and related subjects.
With this book, you’ll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.
You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment.
The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you’ll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.
Book preview
Hands-On Data Preprocessing in Python - Roy Jafari
BIRMINGHAM—MUMBAI
Hands-On Data Preprocessing in Python
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Gebin George
Publishing Product Manager: Ali Abidi
Senior Editor: Roshan Kumar
Content Development Editor: Priyanka Soam
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Nilesh Mohite
Marketing Coordinator: Shifa Ansari
First published: January 2022
Production reference: 1161221
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
978-1-80107-213-7
www.packt.com
To my parents,
Soqra Bayati
and
Jahanfar Jafari.
Contributors
About the author
Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands.
Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization.
Roy's style of teaching is hands-on, and he believes the best way to learn is by doing. He follows an active learning teaching philosophy, and readers will get to experience active learning throughout this book.
Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.
About the reviewers
Arsia Takeh is a director of data science at a healthcare company and is responsible for designing algorithms for cutting-edge applications in healthcare. He has over a decade of experience in academia and industry delivering data-driven products. His work involves the research and development of large-scale solutions based on machine learning, deep learning, and generative models for healthcare-related use cases. In his previous role as a co-founder of a digital health start-up, he was responsible for building the first integrated -omics platform that provided a 360-degree view of the user as well as personalized recommendations to help manage chronic diseases.
Sreeraj Chundayil is a software developer with more than 10 years of experience. He is an expert in C, C++, Python, and Bash. He has a B.Tech from the prestigious National Institute of Technology Durgapur in electronics and communication engineering. He likes reading technical books, watching technical videos, and contributing to open source projects. Previously, he was involved in the development of NX, 3D modeling software, at Siemens PLM. He is currently working at Siemens EDA (Mentor Graphics) and is involved in the development of integrated chip verification software.
I would like to thank the C++ and Python communities who have made an immense contribution to molding me into the tech lover I am today.
Table of Contents
Preface
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas
Technical requirements
Overview of the Jupyter Notebook
Are we analyzing data via computer programming?
Overview of the basic functions of NumPy
The np.arange() function
The np.zeros() and np.ones() functions
The np.linspace() function
Overview of Pandas
Pandas data access
Boolean masking for filtering a DataFrame
Pandas functions for exploring a DataFrame
Pandas applying a function
The Pandas groupby function
Pandas multi-level indexing
Pandas pivot and melt functions
Summary
Exercises
Chapter 2: Review of Another Core Module – Matplotlib
Technical requirements
Drawing the main plots in Matplotlib
Summarizing numerical attributes using histograms or boxplots
Observing trends in the data using a line plot
Relating two numerical attributes using a scatterplot
Modifying the visuals
Adding a title to visuals and labels to the axis
Adding legends
Modifying ticks
Modifying markers
Subplots
Resizing visuals and saving them
Resizing
Saving
Example of Matplotlib assisting data preprocessing
Summary
Exercises
Chapter 3: Data – What Is It Really?
Technical requirements
What is data?
Why this definition?
DIKW pyramid
Data preprocessing for data analytics versus data preprocessing for machine learning
The most universal data structure – a table
Data objects
Data attributes
Types of data values
Analytics standpoint
Programming standpoint
Information versus pattern
Understanding everyday use of the word information
Statistical use of the word information
Statistical meaning of the word pattern
Summary
Exercises
References
Chapter 4: Databases
Technical requirements
What is a database?
Understanding the difference between a database and a dataset
Types of databases
The differentiating elements of databases
Relational databases (SQL databases)
Unstructured databases (NoSQL databases)
A practical example that requires a combination of both structured and unstructured databases
Distributed databases
Blockchain
Connecting to, and pulling data from, databases
Direct connection
Web page connection
API connection
Request connection
Publicly shared
Summary
Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization
Technical requirements
Summarizing a population
Example of summarizing numerical attributes
Example of summarizing categorical attributes
Comparing populations
Example of comparing populations using boxplots
Example of comparing populations using histograms
Example of comparing populations using bar charts
Investigating the relationship between two attributes
Visualizing the relationship between two numerical attributes
Visualizing the relationship between two categorical attributes
Visualizing the relationship between a numerical attribute and a categorical attribute
Adding visual dimensions
Example of a five-dimensional scatter plot
Showing and comparing trends
Example of visualizing and comparing trends
Summary
Exercise
Chapter 6: Prediction
Technical requirements
Predictive models
Forecasting
Regression analysis
Linear regression
Example of applying linear regression to perform regression analysis
MLP
How does MLP work?
Example of applying MLP to perform regression analysis
Summary
Exercises
Chapter 7: Classification
Technical requirements
Classification models
Example of designing a classification model
Classification algorithms
KNN
Example of using KNN for classification
Decision Trees
Example of using Decision Trees for classification
Summary
Exercises
Chapter 8: Clustering Analysis
Technical requirements
Clustering model
Clustering example using a two-dimensional dataset
Clustering example using a three-dimensional dataset
K-Means algorithm
Using K-Means to cluster a two-dimensional dataset
Using K-Means to cluster a dataset with more than two dimensions
Centroid analysis
Summary
Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I – Cleaning Up the Table
Technical requirements
The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11
Purpose of data analytics
Tools for data analytics
Levels of data cleaning
Mapping the purposes and tools of analytics to the levels of data cleaning
Data cleaning level I – cleaning up the table
Example 1 – unwise data collection
Example 2 – reindexing (multi-level indexing)
Example 3 – intuitive but long column titles
Summary
Exercises
Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
Technical requirements
Example 1 – unpacking columns and reformulating the table
Unpacking FileName
Unpacking Content
Reformulating a new table for visualization
The last step – drawing the visualization
Example 2 – restructuring the table
Example 3 – level I and II data cleaning
Level I cleaning
Level II cleaning
Doing the analytics – using linear regression to create a predictive model
Summary
Exercises
Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
Technical requirements
Missing values
Detecting missing values
Example of detecting missing values
Causes of missing values
Types of missing values
Diagnosis of missing values
Dealing with missing values
Outliers
Detecting outliers
Dealing with outliers
Errors
Types of errors
Dealing with errors
Detecting systematic errors
Summary
Exercises
Chapter 12: Data Fusion and Data Integration
Technical requirements
What are data fusion and data integration?
Data fusion versus data integration
Directions of data integration
Frequent challenges regarding data fusion and integration
Challenge 1 – entity identification
Challenge 2 – unwise data collection
Challenge 3 – index mismatched formatting
Challenge 4 – aggregation mismatch
Challenge 5 – duplicate data objects
Challenge 6 – data redundancy
Example 1 (challenges 3 and 4)
Example 2 (challenges 2 and 3)
Example 3 (challenges 1, 3, 5, and 6)
Checking for duplicate data objects
Designing the structure for the result of data integration
Filling songIntegrate_df from billboard_df
Filling songIntegrate_df from songAttribute_df
Filling songIntegrate_df from artist_df
Checking for data redundancy
The analysis
Example summary
Summary
Exercise
Chapter 13: Data Reduction
Technical requirements
The distinction between data reduction and data redundancy
The objectives of data reduction
Types of data reduction
Performing numerosity data reduction
Random sampling
Stratified sampling
Random over/undersampling
Performing dimensionality data reduction
Linear regression as a dimension reduction method
Using a decision tree as a dimension reduction method
Using random forest as a dimension reduction method
Brute-force computational dimension reduction
PCA
Functional data analysis
Summary
Exercises
Chapter 14: Data Transformation and Massaging
Technical requirements
The whys of data transformation and massaging
Data transformation versus data massaging
Normalization and standardization
Binary coding, ranking transformation, and discretization
Example one – binary coding of nominal attribute
Example two – binary coding or ranking transformation of ordinal attributes
Example three – discretization of numerical attributes
Understanding the types of discretization
Discretization – the number of cut-off points
A summary – from numbers to categories and back
Attribute construction
Example – construct one transformed attribute from two attributes
Feature extraction
Example – extract three attributes from one attribute
Example – Morphological feature extraction
Feature extraction examples from the previous chapters
Log transformation
Implementation – doing it yourself
Implementation – the working module doing it for you
Smoothing, aggregation, and binning
Smoothing
Aggregation
Binning
Summary
Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 – Mental Health in Tech
Technical requirements
Introducing the case study
The audience of the results of analytics
Introduction to the source of the data
Integrating the data sources
Cleaning the data
Detecting and dealing with outliers and errors
Detecting and dealing with missing values
Analyzing the data
Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?
Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?
Analysis question three – do more supportive companies have mentally healthier employees?
Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?
Summary
Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
Technical requirements
Introducing the case study
Introducing the source of the data
Preprocessing the data
Designing the dataset to support the prediction
Filling up the placeholder dataset
Supervised dimension reduction
Analyzing the data
Summary
Chapter 17: Case Study 3 – United States Counties Clustering Analysis
Technical requirements
Introducing the case study
Introduction to the source of the data
Preprocessing the data
Transforming election_df to partisan_df
Cleaning edu_df, employ_df, pop_df, and pov_df
Data integration
Data cleaning level III – missing values, errors, and outliers
Checking for data redundancy
Analyzing the data
Using PCA to visualize the dataset
K-Means clustering analysis
Summary
Chapter 18: Summary, Practice Case Studies, and Conclusions
A summary of the book
Part 1 – Technical requirements
Part 2 – Analytics goals
Part 3 – The preprocessing
Part 4 – Case studies
Practice case studies
Google COVID-19 mobility dataset
Police killings in the US
US accidents
San Francisco crime
Data analytics job market
FIFA 2018 player of the match
Hot hands in basketball
Wildfires in California
Silicon Valley diversity profile
Recognizing fake job posting
Hunting more practice case studies
Conclusions
Other Books You May Enjoy
Preface
Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.
This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.
Who this book is for
Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.
What this book covers
Chapter 1, Review of the Core Modules of NumPy and Pandas, introduces two of the three main modules used for data manipulation, using real dataset examples to show their relevant capabilities.
Chapter 2, Review of Another Core Module – Matplotlib, introduces the last of the three modules used for data manipulation, using real dataset examples to show its relevant capabilities.
Chapter 3, Data – What Is It Really?, puts forth a technical definition of data and introduces data concepts and languages that are necessary for data preprocessing.
Chapter 4, Databases, explains the role of databases, the different kinds, and teaches you how to connect and pull data from relational databases. It also teaches you how to pull data from databases using APIs.
Chapter 5, Data Visualization, showcases some analytics examples using data visualizations to inform you of the potential of data visualization.
Chapter 6, Prediction, introduces predictive models and explains how to use Multivariate Regression and a Multi-Layered Perceptron (MLP).
Chapter 7, Classification, introduces classification models and explains how to use Decision Trees and K-Nearest Neighbors (KNN).
Chapter 8, Clustering Analysis, introduces clustering models and explains how to use K-means.
Chapter 9, Data Cleaning Level I – Cleaning Up the Table, introduces three different levels of data cleaning and covers the first level through examples.
Chapter 10, Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table, covers the second level of data cleaning through examples.
Chapter 11, Data Cleaning Level III – Missing Values, Outliers, and Errors, covers the third level of data cleaning through examples.
Chapter 12, Data Fusion and Data Integration, covers the technique for mixing different data sources.
Chapter 13, Data Reduction, introduces data reduction and, with the help of examples, shows how its different cases and versions can be done via Python.
Chapter 14, Data Transformation and Massaging, introduces data transformation and massaging and, through many examples, shows their requirements and capabilities for analysis.
Chapter 15, Case Study 1 – Mental Health in Tech, introduces an analytic problem and preprocesses the data to solve it.
Chapter 16, Case Study 2 – Predicting COVID-19 Hospitalizations, introduces an analytic problem and preprocesses the data to solve it.
Chapter 17, Case Study 3 – United States Counties Clustering Analysis, introduces an analytic problem and preprocesses the data to solve it.
Chapter 18, Summary, Practice Case Studies, and Conclusions, introduces some possible practice cases that users can use to learn in more depth and start creating their analytics portfolios.
To get the most out of this book
The book assumes basic programming skills such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python. Other than that, you can start your journey from the beginning of the book and start learning.
The Jupyter Notebook is an excellent UI for learning and practicing programming and data analytics. It can be downloaded and installed easily using Anaconda Navigator. Visit this page for installation: https://docs.anaconda.com/anaconda/navigator/install/.
While Anaconda has most of the modules that the book uses already installed, you will need to install a few other modules such as Seaborn and Graphviz. Don't worry; when the time comes, the book will instruct you on how to go about these installations.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
While learning, keep a file of your own code from each chapter. This learning repository can be used in the future for deeper learning and real projects. The Jupyter Notebook is especially great for this purpose as it allows you to take notes along with the code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801072137_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: To create this interactive visual, we have used the interact and widgets programming objects from the ipywidgets module.
A block of code is set as follows:
from ipywidgets import interact, widgets
interact(plotyear,year=widgets.IntSlider(min=2010,max=2019,step=1,value=2010))
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Xs_t.plot.scatter(x='PC1',y='PC2',c='PC3',sharex=False,
vmin=-1/0.101, vmax=1/0.101,
figsize=(12,9))
x_ticks_vs = [-2.9*4 + 2.9*i for i in range(9)]
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: The missing values for the attributes from SupportQ1 to AttitudeQ3 are from the same data objects.
Tips or Important Notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you've read Hands-On Data Preprocessing in Python, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
Part 1: Technical Needs
After reading this part of the book, you will be able to use Python to effectively manipulate data.
This part comprises the following chapters:
Chapter 1, Review of the Core Modules of NumPy and Pandas
Chapter 2, Review of Another Core Module – Matplotlib
Chapter 3, Data – What Is It Really?
Chapter 4, Databases
Chapter 1: Review of the Core Modules of NumPy and Pandas
The NumPy and Pandas modules are capable of meeting your needs for the majority of data analytics and data preprocessing tasks. Before we start reviewing these two valuable modules, I would like to let you know that this chapter is not meant to be a comprehensive teaching guide to these modules, but rather a collection of concepts, functions, and examples that will be invaluable as we cover data analytics and data preprocessing in the following chapters.
In this chapter, we will first review the Jupyter Notebook and its capability as an excellent coding User Interface (UI). Next, we will review the most relevant data analytics resources of the NumPy and Pandas Python modules.
The following topics will be covered in this chapter:
Overview of the Jupyter Notebook
Are we analyzing data via computer programming?
Overview of the basic functions of NumPy
Overview of Pandas
Technical requirements
The easiest way to get started with Python programming is by installing Anaconda Navigator. It is open source software that brings together many useful open source tools for developers. You can download Anaconda Navigator by following this link: https://www.anaconda.com/products/individual.
We will be using Jupyter Notebook throughout this book. Jupyter Notebook is one of the open source tools that Anaconda Navigator provides. Anaconda Navigator also installs a Python version on your computer. So, following Anaconda Navigator's easy installation, all you need to do is open Anaconda Navigator and then select Jupyter Notebook.
You will be able to find all of the code and the dataset that is used in this book in a GitHub repository exclusively created for this book. To find the repository, click on the following link: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Each chapter in this book will have a folder that contains all of the code and datasets that were used in the chapter.
Overview of the Jupyter Notebook
The Jupyter Notebook is becoming increasingly popular as a successful User Interface (UI) for Python programming. As a UI, the Jupyter Notebook provides an interactive environment where you can run your Python code, see immediate outputs, and take notes.
Fernando Pérez and Brian Granger, the architects of the Jupyter Notebook, outline the following things they were looking for in an innovative programming UI:
Space for individual exploratory work
Space for collaboration
Space for learning and education
If you have used the Jupyter Notebook already, you can attest that it delivers all these promises, and if you have not yet used it, I have good news for you: we will be using Jupyter Notebook for the entirety of this book. Some of the code that I will be sharing will be in the form of screenshots from the Jupyter Notebook UI.
The UI design of the Jupyter Notebook is very simple. You can think of it as one column of material, presented either as code chunks or as Markdown chunks. The solution development and the actual coding happen under the code chunks, whereas notes for yourself or other developers are presented under Markdown chunks. The following screenshot shows an example of both a Markdown chunk and a code chunk. You can see that the code chunk has been executed, the requested print has taken place, and the output is shown immediately after the code chunk:
Figure 1.1 – Code for printing Hello World in a Jupyter notebook
To create a new chunk, you can click on the + sign on the top ribbon of the UI. The newly added chunk will be a code chunk by default. You can switch the code chunk to a Markdown chunk by using the drop-down list on the top ribbon. Moreover, you can move the chunks up or down by using the up and down arrows on the ribbon. You can see these three buttons in the following screenshot:
Figure 1.2 – Jupyter Notebook control ribbon
You can see the following in the preceding screenshot:
The ribbon shown in the screenshot also allows you to Cut, Copy, and Paste the chunks.
The Run button on the ribbon is to execute the code of a chunk.
The Stop button is to stop running code. You normally use this button if your code has been running for a while with no output.
The Restart button wipes the slate clean; it removes all of the variables you have defined so you can start over.
Finally, the Restart & Run button restarts the kernel and runs all of the chunks of code in the Jupyter Notebook file.
There is more to the Jupyter Notebook, such as useful keyboard shortcuts to speed up development and specific Markdown syntax to format the text under Markdown chunks. However, the introduction here is just enough for you to start meaningfully analyzing data using Python through the Jupyter Notebook UI.
Are we analyzing data via computer programming?
To benefit most from the two modules that we will cover in this chapter, we need to understand what they really are and what we are really doing when we use them. I am sure whoever is in the business of content development for data analytics using Python, including me (guilty as charged), would tell you that when you use these modules to manipulate your data, you are analyzing your data using computer programming. However, what you are actually doing is not computer programming. The computer programming part has already been done for the most part. In fact, this has been done by the top-notch programmers who put together these invaluable packages. What you do is use their code made available to you as programming objects and functions under these modules. Well, if I am being completely honest, you are doing a tad bit of computer programming, but just enough to access the good stuff (these modules). Thanks to these modules, you will not experience any difficulty in analyzing data using computer programming.
So, before embarking on your journey in this chapter and this book, remember this: for the most part, our job as data analysts is to connect three things – our business problem, our data, and technology. The technology could be commercial software such as Excel or Tableau, or, in the case of this book, these modules.
Overview of the basic functions of NumPy
In short, as the name suggests, NumPy is a Python module brimming with useful functions for dealing with numbers. The Num in the first part of the name NumPy stands for numbers, and Py stands for Python. There you have it. If you have numbers and you are in Python, you know what you need to import. That is correct; you need to import NumPy, simple as that. See the following screenshot:
Figure 1.3 – Code for importing the NumPy module
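Since the screenshot in Figure 1.3 is not reproduced in this preview, here is a minimal sketch of the equivalent code; the version check is just an extra line to confirm that the import worked:

import numpy as np   # np is the conventional alias used throughout this book
print(np.__version__)   # optional: confirm that the module was imported successfully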
As you can see, we have given the alias np to the module after importing it. You can actually assign any alias that you wish and your code would function; however, I suggest sticking with np. I have two compelling reasons for doing so:
First, everyone else uses this alias, so if you share your code with others, they know what you are doing throughout your project.
Second, a lot of the time, you end up using code written by others in your projects, so consistency will make your job easier. You will see that most of the famous modules also have a famous alias, for example, pd for Pandas, and plt for matplotlib.pyplot.
Good practice advice
NumPy can handle all types of mathematical and statistical calculations for a collection of numbers, such as mean, median, standard deviation (std), and variance (var). If you have something else in mind and are not sure whether NumPy has it, I suggest googling it before trying to write your own. If it involves numbers, chances are NumPy has it.
The following screenshot shows the mean, for example, applied to a collection of numbers:
Figure 1.4 – Example of using the np.mean() NumPy function and the .mean() NumPy array function
As shown in Figure 1.4, there are two ways to do this. The first one, portrayed in the top chunk, uses np.mean(). This function is one of the properties of the NumPy module and can be accessed directly. The great aspect of using this approach is that you do not need to change your data type most of the time before NumPy honors your request. You can input lists, Pandas series, or DataFrames. You can see in the top chunk that np.mean() easily calculated the mean of lst_nums, which is of the list type. The second way, as shown in the bottom chunk, is to first use np.array() to transform the list into a NumPy array and then use the .mean() function, which is a property of any NumPy array. Before continuing with this chapter, take a moment and use the Python type() function to see the different types of lst_nums and ary_nums, as shown in the following screenshot:
Figure 1.5 – The application of the type() function
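As Figures 1.4 and 1.5 are only available as screenshots, the following sketch mirrors the two approaches just described; the values in lst_nums are made up for illustration:

lst_nums = [4, 8, 15, 16, 23, 42]   # hypothetical values; any list of numbers works
print(np.mean(lst_nums))   # the module-level function accepts a plain list directly

ary_nums = np.array(lst_nums)   # transform the list into a NumPy array
print(ary_nums.mean())   # .mean() is a property (method) of any NumPy array

print(type(lst_nums))   # <class 'list'>
print(type(ary_nums))   # <class 'numpy.ndarray'>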
Next we will learn about four NumPy functions: np.arange(), np.zeros(), np.ones(), and np.linspace().
The np.arange() function
This function, as shown in the following screenshot, produces a sequence of numbers with equal increments. You can see in the figure that by changing the two inputs, you can get the function to output many different sequences of numbers that are required for your analytic purposes:
Figure 1.6 – Examples of using the np.arange() function
Pay attention to the three chunks of code in the preceding figure to see the default behavior of np.arange() when only one or two inputs are passed.
When only one input is passed, as in the first chunk of code, the default of np.arange() is that you want a sequence of numbers from zero to the input number with increments of one.
When two inputs are passed, as in the second chunk of code, the default of the function is that you want a sequence of numbers from the first input to the second input with increments of one.
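As the screenshot itself is not included here, the following sketch illustrates the behaviors just described; the specific numbers are only illustrative:

print(np.arange(5))   # one input: from 0 up to (but not including) 5, increments of one
print(np.arange(2, 7))   # two inputs: from 2 up to (but not including) 7, increments of one
print(np.arange(0, 1, 0.25))   # three inputs: start, stop, and a custom increment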
The np.zeros() and np.ones() functions
np.ones() creates a NumPy array filled with ones, and np.zeros() does the same thing with zeros. Unlike np.arange(), which takes the input to calculate what needs to be included in the output array, np.zeros() and np.ones() take the input to structure the output array. For instance, the top chunk of the following screenshot specifies the request for an array with four rows and five columns filled with zeros. As you can see in the bottom chunk, if you only pass in one number, the output array will have only one dimension:
Figure 1.7 – Examples of np.zeros() and np.ones()
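A brief sketch of the two calls described above, assuming the same shapes as in the screenshot:

print(np.zeros([4, 5]))   # a two-dimensional array with four rows and five columns, filled with zeros
print(np.ones(5))   # passing a single number gives a one-dimensional array, here filled with ones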
These two functions are excellent resources for creating a placeholder to keep the results of calculations in a loop. For instance, review the following example and observe how this function facilitated the coding.
Example – Using a placeholder to accommodate analytics
Given the grade data of 10 students, write code using NumPy that calculates and reports their grade averages.
The data of the 10 students and the solution to this example are provided in the following screenshots. Please review and try this code before progressing:
Figure 1.8 – Grade data for the example
Now that you've had a chance to engage with this example, allow me to highlight a few matters about the provided solution presented in Figure 1.9:
Notice how np.zeros() facilitated the solution by streamlining it significantly. After the code is done, all of the average grades are calculated and saved already. Compare the printed values before and after the for loop.
The enumerate() function in the for loop might sound strange to you. It gives the loop both an index (i) and the item (name) from the collection (Names).
The .format() function is an invaluable property of any string variable. If the string contains {} placeholders, this function replaces them, in order, with the arguments that are passed in.
# better-looking report is a comment in the second chunk of the code. Comments are not compiled and their only purpose is to communicate something with whoever reads the source code.
Figure 1.9 – Solution to the preceding example
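Because Figures 1.8 and 1.9 are only available as screenshots, here is a sketch of one way the example can be solved; the student names and grades below are invented placeholders, not the data shown in the figures:

Names = ['Student_{}'.format(i) for i in range(1, 11)]   # hypothetical student names
Grades = np.random.randint(50, 101, size=(10, 4))   # hypothetical grades: 10 students, 4 courses

average_grades = np.zeros(len(Names))   # placeholder array to keep the results of the loop
print(average_grades)   # all zeros before the for loop

for i, name in enumerate(Names):   # enumerate() gives both the index (i) and the item (name)
    average_grades[i] = Grades[i].mean()

print(average_grades)   # the calculated averages after the for loop

# better-looking report
for i, name in enumerate(Names):
    print('{} has an average grade of {:.1f}'.format(name, average_grades[i]))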
The np.linspace() function
This function returns evenly spaced numbers over a specified interval. The function takes three inputs. The first two inputs specify the interval, and the third shows the number of elements that the output will have. For example, refer to the following screenshot:
Figure 1.10 – Examples of using the np.linspace() function
In the first code block, 21 numbers are evenly spaced between 0 and 1, producing 20 equal increments of 0.05. The second chunk gives another example. After trying out the two examples in the screenshot, try np.linspace(0,1,20) and, after investigating the results, think about why I chose 21 over 20 in my example.
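Since the screenshot is not reproduced here, a quick sketch of the call just discussed and the suggested variation (the middle example is purely illustrative):

print(np.linspace(0, 1, 21))   # 21 evenly spaced numbers from 0 to 1, in neat steps of 0.05
print(np.linspace(0, 10, 5))   # an illustrative second example: 5 numbers from 0 to 10
print(np.linspace(0, 1, 20))   # compare: 20 numbers give less tidy increments of about 0.0526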
np.linspace() is a very handy function for situations where you need to try out different values to find the one that best fits your needs. The following example showcases a simple situation like that.
Example – np.linspace() to create solution candidates
We are interested in finding the value(s) that hold the following mathematical statement: x² - 5x + 6 = 0.
Imagine that we don't know that the statement can be simplified easily, for example by factoring it as (x - 2)(x - 3) = 0, to ascertain that either 2 or 3 will hold the statement.
So we would like to use NumPy to try out all of the whole numbers between -1000 and 1000 and find the answer.
The following screenshot shows Python code that provides a solution to this problem:
Figure 1.11 – Solution to the preceding example
Please review and try this code before moving on.
Now that you've had a chance to engage with this example, allow me to highlight a couple of things:
Notice how smart use of np.linspace() leads to an array with all of the numbers that we were interested in trying out.
Uncomment #print(Candidates) and review all of the numbers that were tried out to establish the desired answers.
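Figure 1.11 is only available as a screenshot, so the following is a sketch of the approach described above; it uses the equation as reconstructed earlier and the Candidates name mentioned in the bullet points:

Candidates = np.linspace(-1000, 1000, 2001)   # every whole number from -1000 to 1000
#print(Candidates)   # uncomment to review all of the numbers that are tried out
answers = Candidates[Candidates**2 - 5*Candidates + 6 == 0]   # keep only the values that hold the statement
print(answers)   # expected output: [2. 3.]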
This concludes our review of the NumPy module. Next, we will review another very useful Python module, Pandas.
Overview of Pandas
In short, Pandas is our main module for working with data. The module is brimming with useful functions and tools, but let's get down to the basics first. The greatest tool of Pandas is its data structure, which is known as a DataFrame. In short, a DataFrame is a two-dimensional data structure with a good interface and great codability.
The DataFrame makes itself useful to you right off the bat. The moment you read a data source using Pandas, the data is restructured and shown to you as a DataFrame. Let's give it a try.
We will use the famous adult dataset (adult.csv) to practice and learn the different functionalities of Pandas. Refer to the following screenshot, which shows the importing of Pandas and then reading and showing the dataset. In this code, .head() requests that only the top five rows of the data are output. The .tail() function does the same for the bottom five rows of the data.
Figure 1.12 – Reading the adult.csv file using pd.read_csv() and showing its first five rows
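A minimal sketch of the code behind Figure 1.12, assuming adult.csv is in your working directory (the file is available in the book's GitHub repository):

import pandas as pd   # pd is the conventional alias for Pandas
adult_df = pd.read_csv('adult.csv')   # read the CSV file into a DataFrame
adult_df.head()   # show the first five rows; adult_df.tail() shows the last five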
The adult dataset has six continuous and eight categorical attributes. Due to print limitations, I have only been able to include some parts of the data; however, if you pay attention to Figure 1.12, the output comes with a scroll bar at the bottom that you can use to see the rest of the attributes. Give this code a try and study its attributes. As you will see, all of the attributes in this dataset are self-explanatory, apart from fnlwgt. The title is short for final weight, and it is calculated by the Census Bureau to represent the proportion of the population that each row represents.
Good practice advice
It is good practice to always get to know the dataset you are about to work on. This process always starts with making sure you understand each attribute, the way I just did now. If you have just received a dataset and you don't know what each attribute is, ask. Trust me, you will look more like a pro than not.
There are other steps to get to know a dataset. I will mention them all here and you will learn how to do them by the end of this chapter.
Step one: Understand each attribute as I just explained.
Step two: Check the shape of the dataset. How many rows and columns does the dataset have? This one is easy. For instance, just try adult_df.shape and review the result.
Step three: Check whether the data has any missing values.
Step four: Calculate summarizing values for numerical attributes such as mean, median, and standard deviation, and compute all the possible values for categorical attributes.
Step five: Visualize the attributes. For numerical attributes, use a histogram or a boxplot, and for categorical ones, use a bar chart.
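As a preview of what the rest of the chapter covers, here is a hedged sketch of Pandas calls that typically support steps two to four; the education column name is an assumption about this particular file:

print(adult_df.shape)   # step two: the number of rows and columns
print(adult_df.isna().sum())   # step three: count of missing values in each column
print(adult_df.describe())   # step four: mean, standard deviation, and quartiles of numerical attributes
print(adult_df['education'].value_counts())   # step four: possible values of a categorical attribute (assumed column name)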
As you just saw, before you know it, you are enjoying the benefits of a Pandas DataFrame. So it is important to better understand the structure of a DataFrame. Simply put, a DataFrame is a collection of series. A series is another Pandas data structure that does not get as much credit, but is useful all the same, if not more so.
To understand this better, try to call some of the columns of the adult dataset. Each column is a property of a DataFrame, so to access it, all you need to do is to use .ColumnName after the DataFrame. For instance, try running adult_df.age to see the column age. Try running all of the columns and study them, and if you come across errors for some of them, do not worry about it; we will address them soon if you continue reading. The following screenshot shows how you can confirm what was just described for the adult dataset:
Figure 1.13 – Checking the type of adult_df and adult_df.age
It gets more exciting. Not only is each attribute a series, but each row is also a series. To access each row of a DataFrame, you need to use .loc[] after the DataFrame. What comes between the brackets is the index of each row. Go back and study the output of adult_df.head() in Figure 1.12 and you will see that each row is represented by an index. The indices do not have to be numerical, and we will see how the indices of a Pandas DataFrame can be adjusted, but when reading data using pd.read_csv() with default properties, numerical indices will be assigned. So give it a try and access some of the rows and study them. For instance, you can access the second row by running adult_df.loc[1]. After running a few of them, run type(adult_df.loc[1]) to confirm that each row is a series.
When accessed separately, each column or row of a DataFrame is a series. The only difference between a column series and a row series is that the index of a column series is the index of the DataFrame, and the index of a row series is the column names. Study the following screenshot, which compares the index of the first row of adult_df and the index of the first column of adult_df:
Figure 1.14 – Investigating the index for a column series and a row series
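Figures 1.13 and 1.14 are screenshots, so here is a sketch of the checks described in the last two paragraphs; it assumes age is the first column of adult_df:

print(type(adult_df))   # a Pandas DataFrame
print(type(adult_df.age))   # a column, accessed on its own, is a Pandas series
print(type(adult_df.loc[0]))   # a row, accessed on its own, is also a Pandas series

print(adult_df.age.index)   # the index of a column series is the index of the DataFrame
print(adult_df.loc[0].index)   # the index of a row series is the column names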
Now that we have been introduced to Pandas data structures, next we will cover how we can access the values that are presented in them.
Pandas data access
One of the greatest advantages of both Pandas series and DataFrames is the excellent access they afford us. Let's start with DataFrames, and then we will move on to series as there are lots of commonalities between the two.
Pandas DataFrame access
As DataFrames are two-dimensional, this section first addresses how to access rows, and then columns. The end part of the section will address how to access each value.
DataFrame access rows