Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics
By Roy Jafari
About this ebook
Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who’s developed college-level courses on data preprocessing and related subjects.
With this book, you’ll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.
You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment.
The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you’ll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.
Book preview
Hands-On Data Preprocessing in Python - Roy Jafari
BIRMINGHAM—MUMBAI
Hands-On Data Preprocessing in Python
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Gebin George
Publishing Product Manager: Ali Abidi
Senior Editor: Roshan Kumar
Content Development Editor: Priyanka Soam
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Nilesh Mohite
Marketing Coordinator: Shifa Ansari
First published: January 2022
Production reference: 1161221
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
978-1-80107-213-7
www.packt.com
To my parents,
Soqra Bayati
and
Jahanfar Jafari.
Contributors
About the author
Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands.
Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization.
Roy's style of teaching is hands-on, and he believes the best way to learn is by doing. He follows an active learning teaching philosophy, and readers will get to experience active learning throughout this book.
Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.
About the reviewers
Arsia Takeh is a director of data science at a healthcare company and is responsible for designing algorithms for cutting-edge applications in healthcare. He has over a decade of experience in academia and industry delivering data-driven products. His work involves the research and development of large-scale solutions based on machine learning, deep learning, and generative models for healthcare-related use cases. In his previous role as a co-founder of a digital health start-up, he was responsible for building the first integrated -omics platform that provided a 360-degree view of the user as well as personalized recommendations to help manage chronic diseases.
Sreeraj Chundayil is a software developer with more than 10 years of experience. He is an expert in C, C++, Python, and Bash. He has a B.Tech from the prestigious National Institute of Technology Durgapur in electronics and communication engineering. He likes reading technical books, watching technical videos, and contributing to open source projects. Previously, he was involved in the development of NX, 3D modeling software, at Siemens PLM. He is currently working at Siemens EDA (Mentor Graphics) and is involved in the development of integrated chip verification software.
I would like to thank the C++ and Python communities who have made an immense contribution to molding me into the tech lover I am today.
Table of Contents
Preface
Part 1: Technical Needs
Chapter 1: Review of the Core Modules of NumPy and Pandas
Technical requirements
Overview of the Jupyter Notebook
Are we analyzing data via computer programming?
Overview of the basic functions of NumPy
The np.arange() function
The np.zeros() and np.ones() functions
The np.linspace() function
Overview of Pandas
Pandas data access
Boolean masking for filtering a DataFrame
Pandas functions for exploring a DataFrame
Pandas applying a function
The Pandas groupby function
Pandas multi-level indexing
Pandas pivot and melt functions
Summary
Exercises
Chapter 2: Review of Another Core Module – Matplotlib
Technical requirements
Drawing the main plots in Matplotlib
Summarizing numerical attributes using histograms or boxplots
Observing trends in the data using a line plot
Relating two numerical attributes using a scatterplot
Modifying the visuals
Adding a title to visuals and labels to the axis
Adding legends
Modifying ticks
Modifying markers
Subplots
Resizing visuals and saving them
Resizing
Saving
Example of Matplotlib assisting data preprocessing
Summary
Exercises
Chapter 3: Data – What Is It Really?
Technical requirements
What is data?
Why this definition?
DIKW pyramid
Data preprocessing for data analytics versus data preprocessing for machine learning
The most universal data structure – a table
Data objects
Data attributes
Types of data values
Analytics standpoint
Programming standpoint
Information versus pattern
Understanding everyday use of the word information
Statistical use of the word information
Statistical meaning of the word pattern
Summary
Exercises
References
Chapter 4: Databases
Technical requirements
What is a database?
Understanding the difference between a database and a dataset
Types of databases
The differentiating elements of databases
Relational databases (SQL databases)
Unstructured databases (NoSQL databases)
A practical example that requires a combination of both structured and unstructured databases
Distributed databases
Blockchain
Connecting to, and pulling data from, databases
Direct connection
Web page connection
API connection
Request connection
Publicly shared
Summary
Exercises
Part 2: Analytic Goals
Chapter 5: Data Visualization
Technical requirements
Summarizing a population
Example of summarizing numerical attributes
Example of summarizing categorical attributes
Comparing populations
Example of comparing populations using boxplots
Example of comparing populations using histograms
Example of comparing populations using bar charts
Investigating the relationship between two attributes
Visualizing the relationship between two numerical attributes
Visualizing the relationship between two categorical attributes
Visualizing the relationship between a numerical attribute and a categorical attribute
Adding visual dimensions
Example of a five-dimensional scatter plot
Showing and comparing trends
Example of visualizing and comparing trends
Summary
Exercise
Chapter 6: Prediction
Technical requirements
Predictive models
Forecasting
Regression analysis
Linear regression
Example of applying linear regression to perform regression analysis
MLP
How does MLP work?
Example of applying MLP to perform regression analysis
Summary
Exercises
Chapter 7: Classification
Technical requirements
Classification models
Example of designing a classification model
Classification algorithms
KNN
Example of using KNN for classification
Decision Trees
Example of using Decision Trees for classification
Summary
Exercises
Chapter 8: Clustering Analysis
Technical requirements
Clustering model
Clustering example using a two-dimensional dataset
Clustering example using a three-dimensional dataset
K-Means algorithm
Using K-Means to cluster a two-dimensional dataset
Using K-Means to cluster a dataset with more than two dimensions
Centroid analysis
Summary
Exercises
Part 3: The Preprocessing
Chapter 9: Data Cleaning Level I – Cleaning Up the Table
Technical requirements
The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11
Purpose of data analytics
Tools for data analytics
Levels of data cleaning
Mapping the purposes and tools of analytics to the levels of data cleaning
Data cleaning level I – cleaning up the table
Example 1 – unwise data collection
Example 2 – reindexing (multi-level indexing)
Example 3 – intuitive but long column titles
Summary
Exercises
Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table
Technical requirements
Example 1 – unpacking columns and reformulating the table
Unpacking FileName
Unpacking Content
Reformulating a new table for visualization
The last step – drawing the visualization
Example 2 – restructuring the table
Example 3 – level I and II data cleaning
Level I cleaning
Level II cleaning
Doing the analytics – using linear regression to create a predictive model
Summary
Exercises
Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors
Technical requirements
Missing values
Detecting missing values
Example of detecting missing values
Causes of missing values
Types of missing values
Diagnosis of missing values
Dealing with missing values
Outliers
Detecting outliers
Dealing with outliers
Errors
Types of errors
Dealing with errors
Detecting systematic errors
Summary
Exercises
Chapter 12: Data Fusion and Data Integration
Technical requirements
What are data fusion and data integration?
Data fusion versus data integration
Directions of data integration
Frequent challenges regarding data fusion and integration
Challenge 1 – entity identification
Challenge 2 – unwise data collection
Challenge 3 – index mismatched formatting
Challenge 4 – aggregation mismatch
Challenge 5 – duplicate data objects
Challenge 6 – data redundancy
Example 1 (challenges 3 and 4)
Example 2 (challenges 2 and 3)
Example 3 (challenges 1, 3, 5, and 6)
Checking for duplicate data objects
Designing the structure for the result of data integration
Filling songIntegrate_df from billboard_df
Filling songIntegrate_df from songAttribute_df
Filling songIntegrate_df from artist_df
Checking for data redundancy
The analysis
Example summary
Summary
Exercise
Chapter 13: Data Reduction
Technical requirements
The distinction between data reduction and data redundancy
The objectives of data reduction
Types of data reduction
Performing numerosity data reduction
Random sampling
Stratified sampling
Random over/undersampling
Performing dimensionality data reduction
Linear regression as a dimension reduction method
Using a decision tree as a dimension reduction method
Using random forest as a dimension reduction method
Brute-force computational dimension reduction
PCA
Functional data analysis
Summary
Exercises
Chapter 14: Data Transformation and Massaging
Technical requirements
The whys of data transformation and massaging
Data transformation versus data massaging
Normalization and standardization
Binary coding, ranking transformation, and discretization
Example one – binary coding of nominal attribute
Example two – binary coding or ranking transformation of ordinal attributes
Example three – discretization of numerical attributes
Understanding the types of discretization
Discretization – the number of cut-off points
A summary – from numbers to categories and back
Attribute construction
Example – construct one transformed attribute from two attributes
Feature extraction
Example – extract three attributes from one attribute
Example – Morphological feature extraction
Feature extraction examples from the previous chapters
Log transformation
Implementation – doing it yourself
Implementation – the working module doing it for you
Smoothing, aggregation, and binning
Smoothing
Aggregation
Binning
Summary
Exercise
Part 4: Case Studies
Chapter 15: Case Study 1 – Mental Health in Tech
Technical requirements
Introducing the case study
The audience of the results of analytics
Introduction to the source of the data
Integrating the data sources
Cleaning the data
Detecting and dealing with outliers and errors
Detecting and dealing with missing values
Analyzing the data
Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?
Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?
Analysis question three – do more supportive companies have mentally healthier employees?
Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?
Summary
Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations
Technical requirements
Introducing the case study
Introducing the source of the data
Preprocessing the data
Designing the dataset to support the prediction
Filling up the placeholder dataset
Supervised dimension reduction
Analyzing the data
Summary
Chapter 17: Case Study 3 – United States Counties Clustering Analysis
Technical requirements
Introducing the case study
Introduction to the source of the data
Preprocessing the data
Transforming election_df to partisan_df
Cleaning edu_df, employ_df, pop_df, and pov_df
Data integration
Data cleaning level III – missing values, errors, and outliers
Checking for data redundancy
Analyzing the data
Using PCA to visualize the dataset
K-Means clustering analysis
Summary
Chapter 18: Summary, Practice Case Studies, and Conclusions
A summary of the book
Part 1 – Technical requirements
Part 2 – Analytics goals
Part 3 – The preprocessing
Part 4 – Case studies
Practice case studies
Google COVID-19 mobility dataset
Police killings in the US
US accidents
San Francisco crime
Data analytics job market
FIFA 2018 player of the match
Hot hands in basketball
Wildfires in California
Silicon Valley diversity profile
Recognizing fake job posting
Hunting more practice case studies
Conclusions
Other Books You May Enjoy
Preface
Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.
This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.
Who this book is for
Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.
What this book covers
Chapter 1, Review of the Core Modules of NumPy and Pandas, introduces two of the three main modules used for data manipulation, using real dataset examples to show their relevant capabilities.
Chapter 2, Review of Another Core Module – Matplotlib, introduces the last of the three modules used for data manipulation, using real dataset examples to show its relevant capabilities.
Chapter 3, Data – What Is It Really?, puts forth a technical definition of data and introduces data concepts and languages that are necessary for data preprocessing.
Chapter 4, Databases, explains the role of databases, the different kinds, and teaches you how to connect and pull data from relational databases. It also teaches you how to pull data from databases using APIs.
Chapter 5, Data Visualization, showcases some analytics examples using data visualizations to inform you of the potential of data visualization.
Chapter 6, Prediction, introduces predictive models and explains how to use Multivariate Regression and a Multi-Layered Perceptron (MLP).
Chapter 7, Classification, introduces classification models and explains how to use Decision Trees and K-Nearest Neighbors (KNN).
Chapter 8, Clustering Analysis, introduces clustering models and explains how to use K-means.
Chapter 9, Data Cleaning Level I – Cleaning Up the Table, introduces three different levels of data cleaning and covers the first level through examples.
Chapter 10, Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table, covers the second level of data cleaning through examples.
Chapter 11, Data Cleaning Level III – Missing Values, Outliers, and Errors, covers the third level of data cleaning through examples.
Chapter 12, Data Fusion and Data Integration, covers the technique for mixing different data sources.
Chapter 13, Data Reduction, introduces data reduction and, with the help of examples, shows how its different cases and versions can be done via Python.
Chapter 14, Data Transformation and Massaging, introduces data transformation and massaging and, through many examples, shows their requirements and capabilities for analysis.
Chapter 15, Case Study 1 – Mental Health in Tech, introduces an analytic problem and preprocesses the data to solve it.
Chapter 16, Case Study 2 – Predicting COVID-19 Hospitalizations, introduces an analytic problem and preprocesses the data to solve it.
Chapter 17, Case Study 3 – United States Counties Clustering Analysis, introduces an analytic problem and preprocesses the data to solve it.
Chapter 18, Summary, Practice Case Studies, and Conclusions, introduces some possible practice cases that users can use to learn in more depth and start creating their analytics portfolios.
To get the most out of this book
The book assumes basic programming skills such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python. Other than that, you can start your journey from the beginning of the book and start learning.
The Jupyter Notebook is an excellent UI for learning and practicing programming and data analytics. It can be downloaded and installed easily using Anaconda Navigator. Visit this page for installation: https://docs.anaconda.com/anaconda/navigator/install/.
While Anaconda has most of the modules that the book uses already installed, you will need to install a few other modules such as Seaborn and Graphviz. Don't worry; when the time comes, the book will instruct you on how to go about these installations.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
While learning, keep a file of your own code from each chapter. This learning repository can be used in the future for deeper learning and real projects. The Jupyter Notebook is especially great for this purpose as it allows you to take notes along with the code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801072137_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: To create this interactive visual, we have used the interact and widgets programming objects from the ipywidgets module.
A block of code is set as follows:
from ipywidgets import interact, widgets
interact(plotyear,year=widgets.IntSlider(min=2010,max=2019,step=1,value=2010))
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Xs_t.plot.scatter(x='PC1',y='PC2',c='PC3',sharex=False,
vmin=-1/0.101, vmax=1/0.101,
figsize=(12,9))
x_ticks_vs = [-2.9*4 + 2.9*i for i in range(9)]
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: The missing values for the attributes from SupportQ1 to AttitudeQ3 are from the same data objects.
Tips or Important Notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you've read Hands-On Data Preprocessing in Python, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
Part 1: Technical Needs
After reading this part of the book, you will be able to use Python to effectively manipulate data.
This part comprises the following chapters:
Chapter 1, Review of the Core Modules of NumPy and Pandas
Chapter 2, Review of Another Core Module – Matplotlib
Chapter 3, Data – What Is It Really?
Chapter 4, Databases
Chapter 1: Review of the Core Modules of NumPy and Pandas
The NumPy and Pandas modules are capable of meeting your needs for the majority of data analytics and data preprocessing tasks. Before we start reviewing these two valuable modules, I would like to let you know that this chapter is not meant to be a comprehensive teaching guide to these modules, but rather a collection of concepts, functions, and examples that will be invaluable as we cover data analytics and data preprocessing in the following chapters.
In this chapter, we will first review the Jupyter Notebook and its capability as an excellent coding User Interface (UI). Next, we will review the most relevant data analytics resources of the NumPy and Pandas Python modules.
The following topics will be covered in this chapter:
Overview of the Jupyter Notebook
Are we analyzing data via computer programming?
Overview of the basic functions of NumPy
Overview of Pandas
Technical requirements
The easiest way to get started with Python programming is by installing Anaconda Navigator. It is open source software that brings together many useful open source tools for developers. You can download Anaconda Navigator by following this link: https://www.anaconda.com/products/individual.
We will be using Jupyter Notebook throughout this book. Jupyter Notebook is one of the open source tools that Anaconda Navigator provides. Anaconda Navigator also installs a Python version on your computer. So, following Anaconda Navigator's easy installation, all you need to do is open Anaconda Navigator and then select Jupyter Notebook.
You will be able to find all of the code and the dataset that is used in this book in a GitHub repository exclusively created for this book. To find the repository, click on the following link: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Each chapter in this book will have a folder that contains all of the code and datasets that were used in the chapter.
Overview of the Jupyter Notebook
The Jupyter Notebook is becoming increasingly popular as a successful User Interface (UI) for Python programming. As a UI, the Jupyter Notebook provides an interactive environment where you can run your Python code, see immediate outputs, and take notes.
Fernando Pérez and Brian Granger, the architects of the Jupyter Notebook, outline the following things they were looking for in an innovative programming UI:
Space for individual exploratory work
Space for collaboration
Space for learning and education
If you have used the Jupyter Notebook already, you can attest that it delivers all these promises, and if you have not yet used it, I have good news for you: we will be using Jupyter Notebook for the entirety of this book. Some of the code that I will be sharing will be in the form of screenshots from the Jupyter Notebook UI.
The UI design of the Jupyter Notebook is very simple. You can think of it as one column of material, presented either as code chunks or as Markdown chunks. The solution development and the actual coding happen under the code chunks, whereas notes for yourself or other developers are presented under Markdown chunks. The following screenshot shows an example of both a Markdown chunk and a code chunk. You can see that the code chunk has been executed, the requested print has taken place, and the output is shown immediately after the code chunk:
Figure 1.1 – Code for printing Hello World in a Jupyter notebook
To create a new chunk, you can click on the + sign on the top ribbon of the UI. The newly added chunk will be a code chunk by default. You can switch the code chunk to a Markdown chunk by using the drop-down list on the top ribbon. Moreover, you can move the chunks up or down by using the up and down arrows on the ribbon. You can see these three buttons in the following screenshot:
Figure 1.2 – Jupyter Notebook control ribbon
You can see the following in the preceding screenshot:
The ribbon shown in the screenshot also allows you to Cut, Copy, and Paste the chunks.
The Run button on the ribbon is to execute the code of a chunk.
The Stop button is to stop running code. You normally use this button if your code has been running for a while with no output.
The Restart button wipes the slate clean; it removes all of the variables you have defined so you can start over.
Finally, the Restart & Run button restarts the kernel and runs all of the chunks of code in the Jupyter Notebook file.
There is more to the Jupyter Notebook, such as useful keyboard shortcuts to speed up development and specific Markdown syntax to format the text under Markdown chunks. However, the introduction here is just enough for you to start meaningfully analyzing data using Python through the Jupyter Notebook UI.
Are we analyzing data via computer programming?
To benefit most from the two modules that we will cover in this chapter, we need to understand what they really are and what we are really doing when we use them. I am sure whoever is in the business of content development for data analytics using Python, including me (guilty as charged), would tell you that when you use these modules to manipulate your data, you are analyzing your data using computer programming. However, what you are actually doing is not computer programming. The computer programming part has already been done for the most part. In fact, this has been done by the top-notch programmers who put together these invaluable packages. What you do is use their code made available to you as programming objects and functions under these modules. Well, if I am being completely honest, you are doing a tad bit of computer programming, but just enough to access the good stuff (these modules). Thanks to these modules, you will not experience any difficulty in analyzing data using computer programming.
So, before embarking on your journey in this chapter and this book, remember this: for the most part, our job as data analysts is to connect three things – our business problem, our data, and technology. The technology could be commercial software such as Excel or Tableau, or, in the case of this book, these modules.
Overview of the basic functions of NumPy
In short, as the name suggests, NumPy is a Python module brimming with useful functions for dealing with numbers. The Num in the first part of the name NumPy stands for numbers, and Py stands for Python. There you have it. If you have numbers and you are in Python, you know what you need to import. That is correct; you need to import NumPy, simple as that. See the following screenshot:
Figure 1.3 – Code for importing the NumPy module
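Since the screenshot in Figure 1.3 is not reproduced in this preview, here is a minimal sketch of the equivalent code; the version check is just an extra line to confirm that the import worked:

import numpy as np   # np is the conventional alias used throughout this book
print(np.__version__)   # optional: confirm that the module was imported successfully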
As you can see, we have given the alias np to the module after importing it. You can actually assign any alias that you wish and your code would function; however, I suggest sticking with np. I have two compelling reasons for doing so:
First, everyone else uses this alias, so if you share your code with others, they know what you are doing throughout your project.
Second, a lot of the time, you end up using code written by others in your projects, so consistency will make your job easier. You will see that most of the famous modules also have a famous alias, for example, pd for Pandas, and plt for matplotlib.pyplot.
Good practice advice
NumPy can handle all types of mathematical and statistical calculations for a collection of numbers, such as mean, median, standard deviation (std), and variance (var). If you have something else in mind and are not sure whether NumPy has it, I suggest googling it before trying to write your own. If it involves numbers, chances are NumPy has it.
The following screenshot shows the mean, for example, applied to a collection of numbers:
Figure 1.4 – Example of using the np.mean() NumPy function and the .mean() NumPy array function
As shown in Figure 1.4, there are two ways to do this. The first one, portrayed in the top chunk, uses np.mean(). This function is one of the properties of the NumPy module and can be accessed directly. The great aspect of using this approach is that you do not need to change your data type most of the time before NumPy honors your request. You can input lists, Pandas series, or DataFrames. You can see in the top chunk that np.mean() easily calculated the mean of lst_nums, which is of the list type. The second way, as shown in the bottom chunk, is to first use np.array() to transform the list into a NumPy array and then use the .mean() function, which is a property of any NumPy array. Before continuing with this chapter, take a moment and use the Python type() function to see the different types of lst_nums and ary_nums, as shown in the following screenshot:
Figure 1.5 – The application of the type() function
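As Figures 1.4 and 1.5 are only available as screenshots, the following sketch mirrors the two approaches just described; the values in lst_nums are made up for illustration:

lst_nums = [4, 8, 15, 16, 23, 42]   # hypothetical values; any list of numbers works
print(np.mean(lst_nums))   # the module-level function accepts a plain list directly

ary_nums = np.array(lst_nums)   # transform the list into a NumPy array
print(ary_nums.mean())   # .mean() is a property (method) of any NumPy array

print(type(lst_nums))   # <class 'list'>
print(type(ary_nums))   # <class 'numpy.ndarray'>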
Next we will learn about four NumPy functions: np.arange(), np.zeros(), np.ones(), and np.linspace().
The np.arange() function
This function, as shown in the following screenshot, produces a sequence of numbers with equal increments. You can see in the figure that by changing the two inputs, you can get the function to output many different sequences of numbers that are required for your analytic purposes:
Figure 1.6 – Examples of using the np.arange() function
Pay attention to the three chunks of code in the preceding figure to see the default behavior of np.arange() when only one or two inputs are passed.
When only one input is passed, as in the first chunk of code, the default of np.arange() is that you want a sequence of numbers from zero to the input number with increments of one.
When two inputs are passed, as in the second chunk of code, the default of the function is that you want a sequence of numbers from the first input to the second input with increments of one.
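As the screenshot itself is not included here, the following sketch illustrates the behaviors just described; the specific numbers are only illustrative:

print(np.arange(5))   # one input: from 0 up to (but not including) 5, increments of one
print(np.arange(2, 7))   # two inputs: from 2 up to (but not including) 7, increments of one
print(np.arange(0, 1, 0.25))   # three inputs: start, stop, and a custom increment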
The np.zeros() and np.ones() functions
np.ones() creates a NumPy array filled with ones, and np.zeros() does the same thing with zeros. Unlike np.arange(), which takes the input to calculate what needs to be included in the output array, np.zeros() and np.ones() take the input to structure the output array. For instance, the top chunk of the following screenshot specifies the request for an array with four rows and five columns filled with zeros. As you can see in the bottom chunk, if you only pass in one number, the output array will have only one dimension:
Figure 1.7 – Examples of np.zeros() and np.ones()
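A brief sketch of the two calls described above, assuming the same shapes as in the screenshot:

print(np.zeros([4, 5]))   # a two-dimensional array with four rows and five columns, filled with zeros
print(np.ones(5))   # passing a single number gives a one-dimensional array, here filled with ones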
These two functions are excellent resources for creating a placeholder to keep the results of calculations in a loop. For instance, review the following example and observe how this function facilitated the coding.
Example – Using a placeholder to accommodate analytics
Given the grade data of 10 students, write code using NumPy that calculates and reports their grade averages.
The data of the 10 students and the solution to this example are provided in the following screenshots. Please review and try this code before progressing:
Figure 1.8 – Grade data for the example
Now that you've had a chance to engage with this example, allow me to highlight a few matters about the provided solution presented in Figure 1.9:
Notice how np.zeros() facilitated the solution by streamlining it significantly. After the code is done, all of the average grades are calculated and saved already. Compare the printed values before and after the for loop.
The enumerate() function in the for loop might sound strange to you. It gives the loop both an index (i) and the item (name) from the collection (Names).
The .format() function is an invaluable property of any string variable. If the string contains {} placeholders, this function replaces them, in order, with the arguments that are passed in.
# better-looking report is a comment in the second chunk of the code. Comments are not compiled and their only purpose is to communicate something with whoever reads the source code.
Figure 1.9 – Solution to the preceding example
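Because Figures 1.8 and 1.9 are only available as screenshots, here is a sketch of one way the example can be solved; the student names and grades below are invented placeholders, not the data shown in the figures:

Names = ['Student_{}'.format(i) for i in range(1, 11)]   # hypothetical student names
Grades = np.random.randint(50, 101, size=(10, 4))   # hypothetical grades: 10 students, 4 courses

average_grades = np.zeros(len(Names))   # placeholder array to keep the results of the loop
print(average_grades)   # all zeros before the for loop

for i, name in enumerate(Names):   # enumerate() gives both the index (i) and the item (name)
    average_grades[i] = Grades[i].mean()

print(average_grades)   # the calculated averages after the for loop

# better-looking report
for i, name in enumerate(Names):
    print('{} has an average grade of {:.1f}'.format(name, average_grades[i]))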
The np.linspace() function
This function returns evenly spaced numbers over a specified interval. The function takes three inputs. The first two inputs specify the interval, and the third shows the number of elements that the output will have. For example, refer to the following screenshot:
Figure 1.10 – Examples of using the np.linspace() function
In the first code block, 21 numbers are evenly spaced between 0 and 1, producing 20 equal increments of 0.05. The second chunk gives another example. After trying out the two examples in the screenshot, try np.linspace(0,1,20) and, after investigating the results, think about why I chose 21 over 20 in my example.
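Since the screenshot is not reproduced here, a quick sketch of the call just discussed and the suggested variation (the middle example is purely illustrative):

print(np.linspace(0, 1, 21))   # 21 evenly spaced numbers from 0 to 1, in neat steps of 0.05
print(np.linspace(0, 10, 5))   # an illustrative second example: 5 numbers from 0 to 10
print(np.linspace(0, 1, 20))   # compare: 20 numbers give less tidy increments of about 0.0526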
np.linspace() is a very handy function for situations where you need to try out different values to find the one that best fits your needs. The following example showcases a simple situation like that.
Example – np.linspace() to create solution candidates
We are interested in finding the value(s) that hold the following mathematical statement: x² - 5x + 6 = 0.
Imagine that we don't know that the statement can be simplified easily, for example by factoring it as (x - 2)(x - 3) = 0, to ascertain that either 2 or 3 will hold the statement.
So we would like to use NumPy to try out all of the whole numbers between -1000 and 1000 and find the answer.
The following screenshot shows Python code that provides a solution to this problem:
Figure 1.11 – Solution to the preceding example
Please review and try this code before moving on.
Now that you've had a chance to engage with this example, allow me to highlight a couple of things:
Notice how smart use of np.linspace() leads to an array with all of the numbers that we were interested in trying out.
Uncomment #print(Candidates) and review all of the numbers that were tried out to establish the desired answers.
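Figure 1.11 is only available as a screenshot, so the following is a sketch of the approach described above; it uses the equation as reconstructed earlier and the Candidates name mentioned in the bullet points:

Candidates = np.linspace(-1000, 1000, 2001)   # every whole number from -1000 to 1000
#print(Candidates)   # uncomment to review all of the numbers that are tried out
answers = Candidates[Candidates**2 - 5*Candidates + 6 == 0]   # keep only the values that hold the statement
print(answers)   # expected output: [2. 3.]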
This concludes our review of the NumPy module. Next, we will review another very useful Python module, Pandas.
Overview of Pandas
In short, Pandas is our main module for working with data. The module is brimming with useful functions and tools, but let's get down to the basics first. The greatest tool of Pandas is its data structure, which is known as a DataFrame. In short, a DataFrame is a two-dimensional data structure with a good interface and great codability.
The DataFrame makes itself useful to you right off the bat. The moment you read a data source using Pandas, the data is restructured and shown to you as a DataFrame. Let's give it a try.
We will use the famous adult dataset (adult.csv) to practice and learn the different functionalities of Pandas. Refer to the following screenshot, which shows the importing of Pandas and then reading and showing the dataset. In this code, .head() requests that only the top five rows of the data are output. The .tail() function does the same for the bottom five rows of the data.
Figure 1.12 – Reading the adult.csv file using pd.read_csv() and showing its first five rows
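A minimal sketch of the code behind Figure 1.12, assuming adult.csv is in your working directory (the file is available in the book's GitHub repository):

import pandas as pd   # pd is the conventional alias for Pandas
adult_df = pd.read_csv('adult.csv')   # read the CSV file into a DataFrame
adult_df.head()   # show the first five rows; adult_df.tail() shows the last five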
The adult dataset has six continuous and eight categorical attributes. Due to print limitations, I have only been able to include some parts of the data; however, if you pay attention to Figure 1.12, the output comes with a scroll bar at the bottom that you can use to see the rest of the attributes. Give this code a try and study its attributes. As you will see, all of the attributes in this dataset are self-explanatory, apart from fnlwgt. The title is short for final weight, and it is calculated by the Census Bureau to represent the proportion of the population that each row represents.
Good practice advice
It is good practice to always get to know the dataset you are about to work on. This process always starts with making sure you understand each attribute, the way I just did now. If you have just received a dataset and you don't know what each attribute is, ask. Trust me, you will look more like a pro than not.
There are other steps to get to know a dataset. I will mention them all here and you will learn how to do them by the end of this chapter.
Step one: Understand each attribute as I just explained.
Step two: Check the shape of the dataset. How many rows and columns does the dataset have? This one is easy. For instance, just try adult_df.shape and review the result.
Step three: Check whether the data has any missing values.
Step four: Calculate summarizing values for numerical attributes such as mean, median, and standard deviation, and compute all the possible values for categorical attributes.
Step five: Visualize the attributes. For numerical attributes, use a histogram or a boxplot, and for categorical ones, use a bar chart.
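As a preview of what the rest of the chapter covers, here is a hedged sketch of Pandas calls that typically support steps two to four; the education column name is an assumption about this particular file:

print(adult_df.shape)   # step two: the number of rows and columns
print(adult_df.isna().sum())   # step three: count of missing values in each column
print(adult_df.describe())   # step four: mean, standard deviation, and quartiles of numerical attributes
print(adult_df['education'].value_counts())   # step four: possible values of a categorical attribute (assumed column name)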
As you just saw, before you know it, you are enjoying the benefits of a Pandas DataFrame. So it is important to better understand the structure of a DataFrame. Simply put, a DataFrame is a collection of series. A series is another Pandas data structure that does not get as much credit, but is useful all the same, if not more so.
To understand this better, try to call some of the columns of the adult dataset. Each column is a property of a DataFrame, so to access it, all you need to do is to use .ColumnName after the DataFrame. For instance, try running adult_df.age to see the column age. Try running all of the columns and study them, and if you come across errors for some of them, do not worry about it; we will address them soon if you continue reading. The following screenshot shows how you can confirm what was just described for the adult dataset:
Figure 1.13 – Checking the type of adult_df and adult_df.age
It gets more exciting. Not only is each attribute a series, but each row is also a series. To access each row of a DataFrame, you need to use .loc[] after the DataFrame. What comes between the brackets is the index of each row. Go back and study the output of adult_df.head() in Figure 1.12 and you will see that each row is represented by an index. The indices do not have to be numerical, and we will see how the indices of a Pandas DataFrame can be adjusted, but when reading data using pd.read_csv() with default properties, numerical indices will be assigned. So give it a try and access some of the rows and study them. For instance, you can access the second row by running adult_df.loc[1]. After running a few of them, run type(adult_df.loc[1]) to confirm that each row is a series.
When accessed separately, each column or row of a DataFrame is a series. The only difference between a column series and a row series is that the index of a column series is the index of the DataFrame, and the index of a row series is the column names. Study the following screenshot, which compares the index of the first row of adult_df and the index of the first column of adult_df:
Figure 1.14 – Investigating the index for a column series and a row series
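Figures 1.13 and 1.14 are screenshots, so here is a sketch of the checks described in the last two paragraphs; it assumes age is the first column of adult_df:

print(type(adult_df))   # a Pandas DataFrame
print(type(adult_df.age))   # a column, accessed on its own, is a Pandas series
print(type(adult_df.loc[0]))   # a row, accessed on its own, is also a Pandas series

print(adult_df.age.index)   # the index of a column series is the index of the DataFrame
print(adult_df.loc[0].index)   # the index of a row series is the column names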
Now that we have been introduced to Pandas data structures, next we will cover how we can access the values that are presented in them.
Pandas data access
One of the greatest advantages of both Pandas series and DataFrames is the excellent access they afford us. Let's start with DataFrames, and then we will move on to series as there are lots of commonalities between the two.
Pandas DataFrame access
As DataFrames are two-dimensional, this section first addresses how to access rows, and then columns. The end part of the section will address how to access each value.
DataFrame access rows