Hands-On Data Preprocessing in Python: Learn how to effectively prepare data for successful data analytics
Ebook · 1,014 pages · 7 hours


About this ebook

Hands-On Data Preprocessing is a primer on the best data cleaning and preprocessing techniques, written by an expert who’s developed college-level courses on data preprocessing and related subjects.
With this book, you’ll be equipped with the optimum data preprocessing techniques from multiple perspectives, ensuring that you get the best possible insights from your data.
You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment.
The hands-on examples and easy-to-follow chapters will help you gain a comprehensive articulation of data preprocessing, its whys and hows, and identify opportunities where data analytics could lead to more effective decision making. As you progress through the chapters, you’ll also understand the role of data management systems and technologies for effective analytics and how to use APIs to pull data.
By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques, and handle outliers or missing values to effectively prepare data for analytic tools.

Language: English
Release date: Jan 21, 2022
ISBN: 9781801079952



    Hands-On Data Preprocessing in Python - Roy Jafari


    BIRMINGHAM—MUMBAI

    Hands-On Data Preprocessing in Python

    Copyright © 2022 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Gebin George

    Publishing Product Manager: Ali Abidi

    Senior Editor: Roshan Kumar

    Content Development Editor: Priyanka Soam

    Technical Editor: Sonam Pandey

    Copy Editor: Safis Editing

    Project Coordinator: Aparna Ravikumar Nair

    Proofreader: Safis Editing

    Indexer: Pratik Shirodkar

    Production Designer: Nilesh Mohite

    Marketing Coordinator: Shifa Ansari

    First published: January 2022

    Production reference: 1161221

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    978-1-80107-213-7

    www.packt.com

    To my parents,

    Soqra Bayati

    and

    Jahanfar Jafari.

    Contributors

    About the author

    Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands.

    Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization.

    Roy's style of teaching is hands-on, and he believes the best way to learn is by doing. He follows an active learning teaching philosophy, and readers will get to experience active learning in this book.

    Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.

    About the reviewers

    Arsia Takeh is a director of data science at a healthcare company and is responsible for designing algorithms for cutting-edge applications in healthcare. He has over a decade of experience in academia and industry delivering data-driven products. His work involves the research and development of large-scale solutions based on machine learning, deep learning, and generative models for healthcare-related use cases. In his previous role as a co-founder of a digital health start-up, he was responsible for building the first integrated -omics platform that provided a 360-degree view of the user as well as personalized recommendations for managing chronic diseases.

    Sreeraj Chundayil is a software developer with more than 10 years of experience. He is an expert in C, C++, Python, and Bash. He has a B.Tech from the prestigious National Institute of Technology Durgapur in electronics and communication engineering. He likes reading technical books, watching technical videos, and contributing to open source projects. Previously, he was involved in the development of NX, 3D modeling software, at Siemens PLM. He is currently working at Siemens EDA (Mentor Graphics) and is involved in the development of integrated chip verification software.

    I would like to thank the C++ and Python communities who have made an immense contribution to molding me into the tech lover I am today.

    Table of Contents

    Preface

    Part 1: Technical Needs

    Chapter 1: Review of the Core Modules of NumPy and Pandas

    Technical requirements

    Overview of the Jupyter Notebook

    Are we analyzing data via computer programming?

    Overview of the basic functions of NumPy

    The np.arange() function

    The np.zeros() and np.ones() functions

    The np.linspace() function

    Overview of Pandas

    Pandas data access

    Boolean masking for filtering a DataFrame

    Pandas functions for exploring a DataFrame

    Pandas applying a function

    The Pandas groupby function

    Pandas multi-level indexing

    Pandas pivot and melt functions

    Summary

    Exercises

    Chapter 2: Review of Another Core Module – Matplotlib

    Technical requirements

    Drawing the main plots in Matplotlib

    Summarizing numerical attributes using histograms or boxplots

    Observing trends in the data using a line plot

    Relating two numerical attributes using a scatterplot

    Modifying the visuals

    Adding a title to visuals and labels to the axis

    Adding legends

    Modifying ticks

    Modifying markers

    Subplots

    Resizing visuals and saving them

    Resizing

    Saving

    Example of Matplotlib assisting data preprocessing

    Summary

    Exercises

    Chapter 3: Data – What Is It Really?

    Technical requirements

    What is data?

    Why this definition?

    DIKW pyramid

    Data preprocessing for data analytics versus data preprocessing for machine learning

    The most universal data structure – a table

    Data objects

    Data attributes

    Types of data values

    Analytics standpoint

    Programming standpoint

    Information versus pattern

    Understanding everyday use of the word information

    Statistical use of the word information

    Statistical meaning of the word pattern

    Summary

    Exercises

    References

    Chapter 4: Databases

    Technical requirements

    What is a database?

    Understanding the difference between a database and a dataset

    Types of databases

    The differentiating elements of databases

    Relational databases (SQL databases)

    Unstructured databases (NoSQL databases)

    A practical example that requires a combination of both structured and unstructured databases

    Distributed databases

    Blockchain

    Connecting to, and pulling data from, databases

    Direct connection

    Web page connection

    API connection

    Request connection

    Publicly shared

    Summary

    Exercises

    Part 2: Analytic Goals

    Chapter 5: Data Visualization

    Technical requirements

    Summarizing a population

    Example of summarizing numerical attributes

    Example of summarizing categorical attributes

    Comparing populations

    Example of comparing populations using boxplots

    Example of comparing populations using histograms

    Example of comparing populations using bar charts

    Investigating the relationship between two attributes

    Visualizing the relationship between two numerical attributes

    Visualizing the relationship between two categorical attributes

    Visualizing the relationship between a numerical attribute and a categorical attribute

    Adding visual dimensions

    Example of a five-dimensional scatter plot

    Showing and comparing trends

    Example of visualizing and comparing trends

    Summary

    Exercise

    Chapter 6: Prediction

    Technical requirements

    Predictive models

    Forecasting

    Regression analysis

    Linear regression

    Example of applying linear regression to perform regression analysis

    MLP

    How does MLP work?

    Example of applying MLP to perform regression analysis

    Summary

    Exercises

    Chapter 7: Classification

    Technical requirements

    Classification models

    Example of designing a classification model

    Classification algorithms

    KNN

    Example of using KNN for classification

    Decision Trees

    Example of using Decision Trees for classification

    Summary

    Exercises

    Chapter 8: Clustering Analysis

    Technical requirements

    Clustering model

    Clustering example using a two-dimensional dataset

    Clustering example using a three-dimensional dataset

    K-Means algorithm

    Using K-Means to cluster a two-dimensional dataset

    Using K-Means to cluster a dataset with more than two dimensions

    Centroid analysis

    Summary

    Exercises

    Part 3: The Preprocessing

    Chapter 9: Data Cleaning Level I – Cleaning Up the Table

    Technical requirements

    The levels, tools, and purposes of data cleaning – a roadmap to chapters 9, 10, and 11

    Purpose of data analytics

    Tools for data analytics

    Levels of data cleaning

    Mapping the purposes and tools of analytics to the levels of data cleaning

    Data cleaning level I – cleaning up the table

    Example 1 – unwise data collection

    Example 2 – reindexing (multi-level indexing)

    Example 3 – intuitive but long column titles

    Summary

    Exercises

    Chapter 10: Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table

    Technical requirements

    Example 1 – unpacking columns and reformulating the table

    Unpacking FileName

    Unpacking Content

    Reformulating a new table for visualization

    The last step – drawing the visualization

    Example 2 – restructuring the table

    Example 3 – level I and II data cleaning

    Level I cleaning

    Level II cleaning

    Doing the analytics – using linear regression to create a predictive model

    Summary

    Exercises

    Chapter 11: Data Cleaning Level III – Missing Values, Outliers, and Errors

    Technical requirements

    Missing values

    Detecting missing values

    Example of detecting missing values

    Causes of missing values

    Types of missing values

    Diagnosis of missing values

    Dealing with missing values

    Outliers

    Detecting outliers

    Dealing with outliers

    Errors

    Types of errors

    Dealing with errors

    Detecting systematic errors

    Summary

    Exercises

    Chapter 12: Data Fusion and Data Integration

    Technical requirements

    What are data fusion and data integration?

    Data fusion versus data integration

    Directions of data integration

    Frequent challenges regarding data fusion and integration

    Challenge 1 – entity identification

    Challenge 2 – unwise data collection

    Challenge 3 – index mismatched formatting

    Challenge 4 – aggregation mismatch

    Challenge 5 – duplicate data objects

    Challenge 6 – data redundancy

    Example 1 (challenges 3 and 4)

    Example 2 (challenges 2 and 3)

    Example 3 (challenges 1, 3, 5, and 6)

    Checking for duplicate data objects

    Designing the structure for the result of data integration

    Filling songIntegrate_df from billboard_df

    Filling songIntegrate_df from songAttribute_df

    Filling songIntegrate_df from artist_df

    Checking for data redundancy

    The analysis

    Example summary

    Summary

    Exercise

    Chapter 13: Data Reduction

    Technical requirements

    The distinction between data reduction and data redundancy

    The objectives of data reduction

    Types of data reduction

    Performing numerosity data reduction

    Random sampling

    Stratified sampling

    Random over/undersampling

    Performing dimensionality data reduction

    Linear regression as a dimension reduction method

    Using a decision tree as a dimension reduction method

    Using random forest as a dimension reduction method

    Brute-force computational dimension reduction

    PCA

    Functional data analysis

    Summary

    Exercises

    Chapter 14: Data Transformation and Massaging

    Technical requirements

    The whys of data transformation and massaging

    Data transformation versus data massaging

    Normalization and standardization

    Binary coding, ranking transformation, and discretization

    Example one – binary coding of nominal attribute

    Example two – binary coding or ranking transformation of ordinal attributes

    Example three – discretization of numerical attributes

    Understanding the types of discretization

    Discretization – the number of cut-off points

    A summary – from numbers to categories and back

    Attribute construction

    Example – construct one transformed attribute from two attributes

    Feature extraction

    Example – extract three attributes from one attribute

    Example – Morphological feature extraction

    Feature extraction examples from the previous chapters

    Log transformation

    Implementation – doing it yourself

    Implementation – the working module doing it for you

    Smoothing, aggregation, and binning

    Smoothing

    Aggregation

    Binning

    Summary

    Exercise

    Part 4: Case Studies

    Chapter 15: Case Study 1 – Mental Health in Tech

    Technical requirements

    Introducing the case study

    The audience of the results of analytics

    Introduction to the source of the data

    Integrating the data sources

    Cleaning the data

    Detecting and dealing with outliers and errors

    Detecting and dealing with missing values

    Analyzing the data

    Analysis question one – is there a significant difference between the mental health of employees across the attribute of gender?

    Analysis question two – is there a significant difference between the mental health of employees across the Age attribute?

    Analysis question three – do more supportive companies have mentally healthier employees?

    Analysis question four – does the attitude of individuals toward mental health influence their mental health and their seeking of treatments?

    Summary

    Chapter 16: Case Study 2 – Predicting COVID-19 Hospitalizations

    Technical requirements

    Introducing the case study

    Introducing the source of the data

    Preprocessing the data

    Designing the dataset to support the prediction

    Filling up the placeholder dataset

    Supervised dimension reduction

    Analyzing the data

    Summary

    Chapter 17: Case Study 3 – United States Counties Clustering Analysis

    Technical requirements

    Introducing the case study

    Introduction to the source of the data

    Preprocessing the data

    Transforming election_df to partisan_df

    Cleaning edu_df, employ_df, pop_df, and pov_df

    Data integration

    Data cleaning level III – missing values, errors, and outliers

    Checking for data redundancy

    Analyzing the data

    Using PCA to visualize the dataset

    K-Means clustering analysis

    Summary

    Chapter 18: Summary, Practice Case Studies, and Conclusions

    A summary of the book

    Part 1 – Technical requirements

    Part 2 – Analytics goals

    Part 3 – The preprocessing

    Part 4 – Case studies

    Practice case studies

    Google COVID-19 mobility dataset

    Police killings in the US

    US accidents

    San Francisco crime

    Data analytics job market

    FIFA 2018 player of the match

    Hot hands in basketball

    Wildfires in California

    Silicon Valley diversity profile

    Recognizing fake job posting

    Hunting more practice case studies

    Conclusions

    Other Books You May Enjoy

    Preface

    Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.

    This book will equip you with the optimum data preprocessing techniques from multiple perspectives. You'll learn about different technical and analytical aspects of data preprocessing – data collection, data cleaning, data integration, data reduction, and data transformation – and get to grips with implementing them using the open source Python programming environment. This book will provide a comprehensive articulation of data preprocessing, its whys and hows, and help you identify opportunities where data analytics could lead to more effective decision making. It also demonstrates the role of data management systems and technologies for effective analytics and how to use APIs to pull data.

    By the end of this Python data preprocessing book, you'll be able to use Python to read, manipulate, and analyze data; perform data cleaning, integration, reduction, and transformation techniques; and handle outliers or missing values to effectively prepare data for analytic tools.

    Who this book is for

    Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.

    What this book covers

    Chapter 1, Review of the Core Modules of NumPy and Pandas, introduces two of the three main modules used for data manipulation, using real dataset examples to show their relevant capabilities.

    Chapter 2, Review of Another Core Module – Matplotlib, introduces the last of the three modules used for data manipulation, using real dataset examples to show its relevant capabilities.

    Chapter 3, Data – What Is It Really?, puts forth a technical definition of data and introduces data concepts and languages that are necessary for data preprocessing.

    Chapter 4, Databases, explains the role of databases and the different kinds, and teaches you how to connect to and pull data from relational databases. It also teaches you how to pull data from databases using APIs.

    Chapter 5, Data Visualization, showcases some analytics examples using data visualizations to inform you of the potential of data visualization.

    Chapter 6, Prediction, introduces predictive models and explains how to use Multivariate Regression and a Multi-Layered Perceptron (MLP).

    Chapter 7, Classification, introduces classification models and explains how to use Decision Trees and K-Nearest Neighbors (KNN).

    Chapter 8, Clustering Analysis, introduces clustering models and explains how to use K-means.

    Chapter 9, Data Cleaning Level I – Cleaning Up the Table, introduces three different levels of data cleaning and covers the first level through examples.

    Chapter 10, Data Cleaning Level II – Unpacking, Restructuring, and Reformulating the Table, covers the second level of data cleaning through examples.

    Chapter 11, Data Cleaning Level III – Missing Values, Outliers, and Errors, covers the third level of data cleaning through examples.

    Chapter 12, Data Fusion and Data Integration, covers the technique for mixing different data sources.

    Chapter 13, Data Reduction, introduces data reduction and, with the help of examples, shows how its different cases and versions can be done via Python.

    Chapter 14, Data Transformation and Massaging, introduces data transformation and massaging and, through many examples, shows their requirements and capabilities for analysis.

    Chapter 15, Case Study 1 – Mental Health in Tech, introduces an analytic problem and preprocesses the data to solve it.

    Chapter 16, Case Study 2 – Predicting COVID-19 Hospitalizations, introduces an analytic problem and preprocesses the data to solve it.

    Chapter 17, Case Study 3 – United States Counties Clustering Analysis, introduces an analytic problem and preprocesses the data to solve it.

    Chapter 18, Summary, Practice Case Studies, and Conclusions, introduces some possible practice cases that users can use to learn in more depth and start creating their analytics portfolios.

    To get the most out of this book

    The book assumes basic programming skills such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python. Other than that, you can simply begin your journey from the start of the book.

    The Jupyter Notebook is an excellent UI for learning and practicing programming and data analytics. It can be downloaded and installed easily using Anaconda Navigator. Visit this page for installation: https://docs.anaconda.com/anaconda/navigator/install/.

    While Anaconda has most of the modules that the book uses already installed, you will need to install a few other modules such as Seaborn and Graphviz. Don't worry; when the time comes, the book will instruct you on how to go about these installations.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    While learning, keep a file of your own code from each chapter. This learning repository can be used in the future for deeper learning and real projects. The Jupyter Notebook is especially great for this purpose as it allows you to take notes along with the code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. If there's an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801072137_ColorImages.pdf.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: To create this interactive visual, we have used the interact and widgets programming objects from the ipywidgets module.

    A block of code is set as follows:

    from ipywidgets import interact, widgets

    interact(plotyear,year=widgets.IntSlider(min=2010,max=2019,step=1,value=2010))

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    Xs_t.plot.scatter(x='PC1',y='PC2',c='PC3',sharex=False,

                      vmin=-1/0.101, vmax=1/0.101,

    figsize=(12,9))

    x_ticks_vs = [-2.9*4 + 2.9*i for i in range(9)]

    Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: The missing values for the attributes from SupportQ1 to AttitudeQ3 are from the same data objects.

    Tips or Important Notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you've read Hands-On Data Preprocessing in Python, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

    Part 1: Technical Needs

    After reading this part of the book, you will be able to use Python to effectively manipulate data.

    This part comprises the following chapters:

    Chapter 1, Review of the Core Modules of NumPy and Pandas

    Chapter 2, Review of Another Core Module – Matplotlib

    Chapter 3, Data – What Is It Really?

    Chapter 4, Databases

    Chapter 1: Review of the Core Modules of NumPy and Pandas

    The NumPy and Pandas modules are capable of meeting your needs for the majority of data analytics and data preprocessing tasks. Before we start reviewing these two valuable modules, I would like to let you know that this chapter is not meant to be a comprehensive teaching guide to them, but rather a collection of concepts, functions, and examples that will be invaluable as we cover data analytics and data preprocessing in the following chapters.

    In this chapter, we will first review the Jupyter Notebook and its capability as an excellent coding User Interface (UI). Next, we will review the most relevant data analytics resources of the NumPy and Pandas Python modules.

    The following topics will be covered in this chapter:

    Overview of the Jupyter Notebook

    Are we analyzing data via computer programming?

    Overview of the basic functions of NumPy

    Overview of Pandas

    Technical requirements

    The easiest way to get started with Python programming is by installing Anaconda Navigator. It is open source software that brings together many useful open source tools for developers. You can download Anaconda Navigator by following this link: https://www.anaconda.com/products/individual.

    We will be using Jupyter Notebook throughout this book. Jupyter Notebook is one of the open source tools that Anaconda Navigator provides. Anaconda Navigator also installs a Python version on your computer. So, following Anaconda Navigator's easy installation, all you need to do is open Anaconda Navigator and then select Jupyter Notebook.

    You will be able to find all of the code and the dataset that is used in this book in a GitHub repository exclusively created for this book. To find the repository, click on the following link: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Each chapter in this book will have a folder that contains all of the code and datasets that were used in the chapter.

    Overview of the Jupyter Notebook

    The Jupyter Notebook is becoming increasingly popular as a successful User Interface (UI) for Python programming. As a UI, the Jupyter Notebook provides an interactive environment where you can run your Python code, see immediate outputs, and take notes.

    Fernando Pérez and Brian Granger, the architects of the Jupyter Notebook, outline the following goals for an innovative programming UI:

    Space for individual exploratory work

    Space for collaboration

    Space for learning and education

    If you have used the Jupyter Notebook already, you can attest that it delivers all these promises, and if you have not yet used it, I have good news for you: we will be using Jupyter Notebook for the entirety of this book. Some of the code that I will be sharing will be in the form of screenshots from the Jupyter Notebook UI.

    The UI design of the Jupyter Notebook is very simple. You can think of it as one column of material. These materials could be under code chunks or Markdown chunks. The solution development and the actual coding happen under the code chunks, whereas notes for yourself or other developers are presented under Markdown chunks. The following screenshot shows an example of both a Markdown chunk and a code chunk. You can see that the code chunk has been executed, the requested print has taken place, and the output is shown immediately after the code chunk:

    Figure 1.1 – Code for printing Hello World in a Jupyter notebook

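    Typed into a code cell, the chunk shown in the figure is simply a one-line print statement (reproduced here as a plain snippet, with the message held in a variable for illustration):

```python
# A minimal code chunk: running this cell prints the message
# immediately below the chunk, as shown in Figure 1.1.
message = 'Hello World'
print(message)
```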

    To create a new chunk, you can click on the + sign on the top ribbon of the UI. The newly added chunk will be a code chunk by default. You can switch the code chunk to a Markdown chunk by using the drop-down list on the top ribbon. Moreover, you can move the chunks up or down by using the up and down arrows on the ribbon. You can see these three buttons in the following screenshot:

    Figure 1.2 – Jupyter Notebook control ribbon


    You can see the following in the preceding screenshot:

    The ribbon shown in the screenshot also allows you to Cut, Copy, and Paste the chunks.

    The Run button on the ribbon is to execute the code of a chunk.

    The Stop button is to stop running code. You normally use this button if your code has been running for a while with no output.

    The Restart button wipes the slate clean; it removes all of the variables you have defined so you can start over.

    Finally, the Restart & Run button restarts the kernel and runs all of the chunks of code in the Jupyter Notebook files.

    There is more to the Jupyter Notebook, such as useful short keys to speed up development and specific Markdown syntax to format the text under Markdown chunks. However, the introduction here is just enough for you to start meaningfully analyzing data using Python through the Jupyter Notebook UI.

    Are we analyzing data via computer programming?

    To benefit most from the two modules that we will cover in this chapter, we need to understand what they really are and what we are really doing when we use them. I am sure whoever is in the business of content development for data analytics using Python, including me (guilty as charged), would tell you that when you use these modules to manipulate your data, you are analyzing your data using computer programming. However, what you are actually doing is not computer programming. The computer programming part has already been done for the most part. In fact, this has been done by the top-notch programmers who put together these invaluable packages. What you do is use their code made available to you as programming objects and functions under these modules. Well, if I am being completely honest, you are doing a tad bit of computer programming, but just enough to access the good stuff (these modules). Thanks to these modules, you will not experience any difficulty in analyzing data using computer programming.

    So, before embarking on your journey in this chapter and this book, remember this: for the most part, our job as data analysts is to connect three things – our business problem, our data, and technology. The technology could be commercial software such as Excel or Tableau, or, in the case of this book, these modules.

    Overview of the basic functions of NumPy

    In short, as the name suggests, NumPy is a Python module brimming with useful functions for dealing with numbers. The Num in the first part of the name NumPy stands for numbers, and Py stands for Python. There you have it. If you have numbers and you are in Python, you know what you need to import. That is correct; you need to import NumPy, simple as that. See the following screenshot:

    Figure 1.3 – Code for importing the NumPy module

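The import in Figure 1.3 is the standard one-liner:

```python
# Import NumPy under its conventional alias
import numpy as np

# A quick sanity check that the module is working
print(np.mean([1, 2, 3]))   # 2.0
```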
    As you can see, we have given the alias np to the module after importing it. You can actually assign any alias that you wish and your code would function; however, I suggest sticking with np. I have two compelling reasons for doing so:

    First, everyone else uses this alias, so if you share your code with others, they know what you are doing throughout your project.

    Second, a lot of the time, you end up using code written by others in your projects, so consistency will make your job easier. You will see that most of the famous modules also have a famous alias, for example, pd for Pandas, and plt for matplotlib.pyplot.

    Good practice advice

    NumPy can handle all types of mathematical and statistical calculations for a collection of numbers, such as mean, median, standard deviation (std), and variance (var). If you have something else in mind and are not sure whether NumPy has it, I suggest googling it before trying to write your own. If it involves numbers, chances are NumPy has it.

    The following screenshot shows the mean, for example, applied to a collection of numbers:

    Figure 1.4 – Example of using the np.mean() NumPy function and the .mean() NumPy array function

    As shown in Figure 1.4, there are two ways to do this. The first one, portrayed in the top chunk, uses np.mean(). This function is one of the properties of the NumPy module and can be accessed directly. The great aspect of using this approach is that you do not need to change your data type most of the time before NumPy honors your request. You can input lists, Pandas series, or DataFrames. You can see on the top chunk that np.mean() easily calculated the mean of lst_nums, which is of the list type. The second way, as shown in the bottom chunk, is to first use np.array() to transform the list into a NumPy array and then use the .mean() function, which is a property of any NumPy array. Before progressing further with this chapter, take a moment and use the Python type() function to see the different types of lst_nums and ary_nums, as shown in the following screenshot:

    Figure 1.5 – The application of the type() function

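The two approaches in Figures 1.4 and 1.5 can be sketched as follows. The actual numbers in the screenshots are not reproduced in this excerpt, so the sample values below are assumptions; the variable names lst_nums and ary_nums come from the text:

```python
import numpy as np

# Assumed sample data standing in for the numbers in Figure 1.4
lst_nums = [1, 2, 3, 4, 5]

# Approach 1: np.mean() accepts a plain list directly
print(np.mean(lst_nums))   # 3.0

# Approach 2: convert to a NumPy array, then use its .mean() method
ary_nums = np.array(lst_nums)
print(ary_nums.mean())     # 3.0

# Confirm the two different types, as in Figure 1.5
print(type(lst_nums))      # <class 'list'>
print(type(ary_nums))      # <class 'numpy.ndarray'>
```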
    Next we will learn about four NumPy functions: np.arange(), np.zeros(), np.ones(), and np.linspace().

    The np.arange() function

    This function, as shown in the following screenshot, produces a sequence of numbers with equal increments. You can see in the figure that by changing the two inputs, you can get the function to output many different sequences of numbers that are required for your analytic purposes:

    Figure 1.6 – Examples of using the np.arange() function

    Pay attention to the three chunks of code in the preceding figure to see the default behavior of np.arange() when only one or two inputs are passed.

    When only one input is passed, as in the first chunk of code, the default of np.arange() is that you want a sequence of numbers from zero to the input number with increments of one.

    When two inputs are passed, as in the second chunk of code, the default of the function is that you want a sequence of numbers from the first input to the second input with increments of one.
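The defaults above can be sketched as follows. Note that, like Python's built-in range(), the stop value itself is excluded from the output:

```python
import numpy as np

# One input: a sequence from 0 up to (but not including) 5, step 1
print(np.arange(5))         # [0 1 2 3 4]

# Two inputs: from 2 up to (but not including) 7, step 1
print(np.arange(2, 7))      # [2 3 4 5 6]

# Three inputs: from 0 up to 10, step 2
print(np.arange(0, 10, 2))  # [0 2 4 6 8]
```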

    The np.zeros() and np.ones() functions

    np.ones() creates a NumPy array filled with ones, and np.zeros() does the same thing with zeros. Unlike np.arange(), which takes the input to calculate what needs to be included in the output array, np.zeros() and np.ones() take the input to structure the output array. For instance, the top chunk of the following screenshot specifies the request for an array with four rows and five columns filled with zeros. As you can see in the bottom chunk, if you only pass in one number, the output array will have only one dimension:

    Figure 1.7 – Examples of np.zeros() and np.ones()

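A minimal sketch of the two chunks in Figure 1.7:

```python
import numpy as np

# A 4-by-5 array of zeros: the tuple input gives the shape
print(np.zeros((4, 5)))

# Passing a single number yields a one-dimensional array
print(np.ones(3))   # [1. 1. 1.]
```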
    These two functions are excellent resources for creating a placeholder to keep the results of calculations in a loop. For instance, review the following example and observe how this function facilitated the coding.

    Example – Using a placeholder to accommodate analytics

    Given the grade data of 10 students, create a code using NumPy that calculates and reports their grade average.

    The data of the 10 students and the solution to this example are provided in the following screenshots. Please review and try this code before progressing:

    Figure 1.8 – Grade data for the example

    Now that you've had a chance to engage with this example, allow me to highlight a few matters about the provided solution presented in Figure 1.9:

    Notice how np.zeros() facilitated the solution by streamlining it significantly. After the code is done, all of the average grades are calculated and saved already. Compare the printed values before and after the for loop.

    The enumerate() function in the for loop might sound strange to you. What it does is give the code both the index (i) and the item (name) from the collection (Names).

    The .format() function is an invaluable property of any string variable. If there are any placeholders such as {} in the string, this function will replace them, in order, with the inputs it is given.

    # better-looking report is a comment in the second chunk of the code. Comments are not executed; their only purpose is to communicate something to whoever reads the source code.

    Figure 1.9 – Solution to the preceding example

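The grade data in Figure 1.8 and the solution in Figure 1.9 are screenshots, so the sketch below uses assumed names and grades for three students rather than the book's ten; the structure (an np.zeros() placeholder, enumerate(), and .format()) mirrors the described solution:

```python
import numpy as np

# Assumed stand-in for the grade data in Figure 1.8
Names = ['Amy', 'Bob', 'Cal']
Grades = {'Amy': [90, 85, 95],
          'Bob': [70, 80, 75],
          'Cal': [88, 92, 84]}

# Placeholder array: one slot per student, filled as the loop runs
Grade_Averages = np.zeros(len(Names))
print(Grade_Averages)   # all zeros before the loop

# enumerate() yields both the index (i) and the item (name)
for i, name in enumerate(Names):
    Grade_Averages[i] = np.mean(Grades[name])
print(Grade_Averages)   # the averages, after the loop

# better-looking report
for i, name in enumerate(Names):
    print('{} got an average grade of {}'.format(name, Grade_Averages[i]))
```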
    The np.linspace() function

    This function returns evenly spaced numbers over a specified interval. The function takes three inputs. The first two inputs specify the interval, and the third specifies the number of elements that the output will have. For example, refer to the following screenshot:

    Figure 1.10 – Examples of using the np.linspace() function

    In the first code block, 19 numbers are evenly spaced between 0 and 1; together with the two endpoints, they create an array with 21 numbers. The second chunk gives another example. After trying out the two examples in the screenshot, try np.linspace(0,1,20) and, after investigating the results, think about why I chose 21 over 20 in my example.
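The first example can be reproduced as follows. With 21 elements, the interval [0, 1] splits into 20 equal steps, which is why the increments come out as a round 0.05:

```python
import numpy as np

# 21 evenly spaced numbers between 0 and 1 (increments of 0.05)
print(np.linspace(0, 1, 21))

# Compare with 20 numbers: the increments are no longer round
print(np.linspace(0, 1, 20))
```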

    np.linspace() is a very handy function for situations where you need to try out different values to find the one that best fits your needs. The following example showcases a simple situation like that.

    Example – np.linspace() to create solution candidates

    We are interested in finding the value(s) that hold the following mathematical statement:

    Imagine that we don't know that the statement can be simplified easily to ascertain that either 2 or 3 will hold the statement:

    So we would like to use NumPy to try out all whole numbers between -1000 and 1000 and find the answer.

    The following screenshot shows Python code that provides a solution to this problem:

    Figure 1.11 – Solution to the preceding example

    Please review and try this code before moving on.

    Now that you've had a chance to engage with this example, allow me to highlight a couple of things:

    Notice how smart use of np.linspace() leads to an array with all of the numbers that we were interested in trying out.

    Uncomment #print(Candidates) and review all of the numbers that were tried out to establish the desired answers.
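The equation itself appears only as an image in the original, so the sketch below uses a hypothetical stand-in with roots 2 and 3, namely x² - 5x + 6 = 0; the linspace-based candidate array follows the description in the text:

```python
import numpy as np

# 2001 whole numbers from -1000 to 1000, inclusive (step of 1)
Candidates = np.linspace(-1000, 1000, 2001)
#print(Candidates)

# Hypothetical stand-in equation with roots 2 and 3: x**2 - 5x + 6 = 0
Answers = [float(x) for x in Candidates if x**2 - 5*x + 6 == 0]
print(Answers)   # [2.0, 3.0]
```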

    This concludes our review of the NumPy module. Next, we will review another very useful Python module, Pandas.

    Overview of Pandas

    In short, Pandas is our main module for working with data. The module is brimming with useful functions and tools, but let's get down to the basics first. The greatest tool of Pandas is its data structure, which is known as a DataFrame. In short, a DataFrame is a two-dimensional data structure with a good interface and great codability.

    The DataFrame makes itself useful to you right off the bat. The moment you read a data source using Pandas, the data is restructured and shown to you as a DataFrame. Let's give it a try.

    We will use the famous adult dataset (adult.csv) to practice and learn the different functionalities of Pandas. Refer to the following screenshot, which shows the importing of Pandas and then reading and showing the dataset. In this code, .head() requests that only the top five rows of the data are output. The .tail() function does the same for the bottom five rows of the data.

    Figure 1.12 – Reading the adult.csv file using pd.read_csv() and showing its first five rows

    The adult dataset has six continuous and eight categorical attributes. Due to print limitations, I have only been able to include some parts of the data; however, if you pay attention to Figure 1.12, the output comes with a scroll bar at the bottom that you can use to see the rest of the attributes. Give this code a try and study the attributes. As you will see, all of the attributes in this dataset are self-explanatory, apart from fnlwgt. The name is short for final weight, and it is calculated by the Census Bureau to represent the proportion of the population that each row represents.
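The read-and-show step of Figure 1.12 can be sketched as follows. The adult.csv filename comes from the text; since the file may not be on your machine, the tiny inline frame below is a hypothetical stand-in so the snippet runs anywhere:

```python
import pandas as pd

# With the file in your working directory, the read in Figure 1.12 is:
# adult_df = pd.read_csv('adult.csv')

# A small hypothetical stand-in frame so this sketch runs without the file
adult_df = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37],
    'education': ['Bachelors', 'Bachelors', 'HS-grad',
                  '11th', 'Bachelors', 'Masters'],
    'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K']})

print(adult_df.head())   # the first five rows
print(adult_df.tail())   # the last five rows
```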

    Good practice advice

    It is good practice to always get to know the dataset you are about to work on. This process always starts with making sure you understand each attribute, the way I just did now. If you have just received a dataset and you don't know what each attribute is, ask. Trust me, you will look more like a pro than not.

    There are other steps to get to know a dataset. I will mention them all here and you will learn how to do them by the end of this chapter.

    Step one: Understand each attribute as I just explained.

    Step two: Check the shape of the dataset. How many rows and columns does the dataset have? This one is easy. For instance, just try adult_df.shape and review the result.

    Step three: Check whether the data has any missing values.

    Step four: Calculate summarizing values for numerical attributes such as mean, median, and standard deviation, and compute all the possible values for categorical attributes.

    Step five: Visualize the attributes. For numerical attributes, use a histogram or a boxplot, and for categorical ones, use a bar chart.
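The steps above can be sketched as follows, again on a small hypothetical stand-in frame (with the real data you would start from pd.read_csv('adult.csv')):

```python
import pandas as pd

# Hypothetical stand-in for the adult dataset
adult_df = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th']})

# Step two: the shape of the dataset as (rows, columns)
print(adult_df.shape)                       # (4, 2)

# Step three: count of missing values per column
print(adult_df.isnull().sum())

# Step four: summaries for numerical and categorical attributes
print(adult_df.describe())                  # mean, std, quartiles for age
print(adult_df['education'].value_counts())

# Step five: visualize, for example adult_df['age'].hist()
# (requires matplotlib, so it is left as a comment here)
```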

    As you just saw, before you know it, you are enjoying the benefits of a Pandas DataFrame. So it is important to better understand the structure of a DataFrame. Simply put, a DataFrame is a collection of series. A series is another Pandas data structure that does not get as much credit, but is useful all the same, if not more so.

    To understand this better, try to call some of the columns of the adult dataset. Each column is a property of a DataFrame, so to access it, all you need to do is use .ColumnName after the DataFrame. For instance, try running adult_df.age to see the column age. Try running all of the columns and study them, and if you come across errors for some of them, do not worry about it; we will address them soon if you continue reading. The following screenshot shows how you can confirm what was just described for the adult dataset:

    Figure 1.13 – Checking the type of adult_df and adult_df.age

    It gets more exciting. Not only is each attribute a series, but each row is also a series. To access each row of a DataFrame, you need to use .loc[] after the DataFrame. What comes between the brackets is the index of each row. Go back and study the output of adult_df.head() in Figure 1.12 and you will see that each row is represented by an index. The indices do not have to be numerical, and we will see how the indices of a Pandas DataFrame can be adjusted, but when reading data using pd.read_csv() with default properties, numerical indices will be assigned. So give it a try and access some of the rows and study them. For instance, you can access the second row by running adult_df.loc[1]. After running a few of them, run type(adult_df.loc[1]) to confirm that each row is a series.
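A minimal sketch of both access styles, on a small hypothetical stand-in for the adult data:

```python
import pandas as pd

# Hypothetical stand-in frame
adult_df = pd.DataFrame({'age': [39, 50, 38],
                         'education': ['Bachelors', 'Bachelors', 'HS-grad']})

# Each column is a property of the DataFrame...
print(adult_df.age)           # the age column as a series

# ...and each row is reached with .loc[] and its index
print(adult_df.loc[1])        # the second row as a series

# Both are Pandas series
print(type(adult_df.age))     # <class 'pandas.core.series.Series'>
print(type(adult_df.loc[1]))  # <class 'pandas.core.series.Series'>
```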

    When accessed separately, each column or row of a DataFrame is a series. The only difference between a column series and a row series is that the index of a column series is the index of the DataFrame, and the index of a row series is the column names. Study the following screenshot, which compares the index of the first row of adult_df and the index of the first column of adult_df:

    Figure 1.14 – Investigating the index for a column series and a row series

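The comparison in Figure 1.14 can be reproduced as follows, using a small hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in frame
adult_df = pd.DataFrame({'age': [39, 50, 38],
                         'education': ['Bachelors', 'Bachelors', 'HS-grad']})

# A row series is indexed by the column names...
print(adult_df.loc[0].index)  # Index(['age', 'education'], dtype='object')

# ...while a column series is indexed by the DataFrame's index
print(adult_df.age.index)     # RangeIndex(start=0, stop=3, step=1)
```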
    Now that we have been introduced to Pandas data structures, next we will cover how we can access the values that are presented in them.

    Pandas data access

    One of the greatest advantages of both Pandas series and DataFrames is the excellent access they afford us. Let's start with DataFrames, and then we will move on to series as there are lots of commonalities between the two.

    Pandas DataFrame access

    As DataFrames are two-dimensional, this section first addresses how to access rows, and then columns. The end part of the section will address how to access each value.

    Accessing DataFrame rows
