Exploratory Data Analysis with Python Cookbook: Over 50 Recipes to Analyze, Visualize, and Extract Insights from Structured and Unstructured Data, by Ayodele Oluleye
Exploratory Data Analysis
with Python Cookbook
Ayodele Oluleye
BIRMINGHAM—MUMBAI
Exploratory Data Analysis with Python Cookbook
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
ISBN 978-1-80323-110-5
www.packtpub.com
To my wife and daughter, I am deeply grateful for your unwavering support throughout this journey.
Your love and encouragement were pillars of strength that constantly propelled me forward. Your
sacrifices and belief in me have been a constant source of inspiration, and I am truly blessed to have
you both by my side.
To my dad, thank you for instilling in me a solid foundation in technology right from my formative
years. You exposed me to the world of technology in my early teenage years. This has been very
instrumental in shaping my career in tech. To my mum (of blessed memory), thank you for your
unwavering belief in my abilities and constantly nudging me to be my best self.
To PwC Nigeria, Data Scientists Network (DSN) and the Young Data Professionals group (YDP),
thank you for the invaluable role you played in my growth and development in the field of data
science. Your unwavering support, resources, and opportunities have significantly contributed to my
professional growth.
Ayodele Oluleye
Contributors
Table of Contents
1
Generating Summary Statistics
Technical requirements
Analyzing the mean of a dataset
Getting ready
How to do it…
How it works...
There’s more...
Identifying the standard deviation of a dataset
Getting ready
How to do it…
How it works...
There’s more...
2
Preparing Data for EDA
Technical requirements
Grouping data
Getting ready
How to do it…
How it works...
There’s more...
See also
Appending data
Getting ready
How to do it…
How it works...
There’s more...
Concatenating data
Getting ready
How to do it…
How it works...
There’s more...
See also
Merging data
Getting ready
How to do it…
How it works...
There’s more...
See also
Sorting data
Getting ready
How to do it…
How it works...
There’s more...
Categorizing data
Getting ready
How to do it…
How it works...
There’s more...
Removing duplicate data
Getting ready
How to do it…
How it works...
There’s more...
Dropping data rows and columns
Getting ready
How to do it…
How it works...
There’s more...
Replacing data
Getting ready
How to do it…
How it works...
There’s more...
See also
Changing a data format
Getting ready
How to do it…
How it works...
There’s more...
See also
3
Visualizing Data in Python
Technical requirements
Preparing for visualization
Getting ready
How to do it…
How it works...
There’s more...
Visualizing data in Matplotlib
Getting ready
How to do it…
How it works...
There’s more...
See also
Visualizing data in Seaborn
Getting ready
How to do it…
How it works...
There’s more...
See also
Visualizing data in GGPLOT
Getting ready
How to do it…
How it works...
There’s more...
See also
Visualizing data in Bokeh
Getting ready
How to do it…
How it works...
There’s more...
See also
4
Performing Univariate Analysis in Python
Technical requirements
Performing univariate analysis using a histogram
Getting ready
How to do it…
How it works...
Performing univariate analysis using a boxplot
Getting ready
How to do it…
How it works...
There’s more...
Performing univariate analysis using a violin plot
Getting ready
How to do it…
How it works...
5
Performing Bivariate Analysis in Python
Technical requirements
Analyzing two variables using a scatter plot
Getting ready
How to do it…
How it works...
There’s more...
See also...
Creating a crosstab/two-way table on bivariate data
Getting ready
How to do it…
How it works...
Analyzing two variables using a pivot table
Getting ready
How to do it…
How it works...
There is more...
Generating pairplots on two variables
Getting ready
How to do it…
How it works...
Analyzing two variables using a bar chart
Getting ready
How to do it…
How it works...
There is more...
Generating box plots for two variables
Getting ready
How to do it…
How it works...
Creating histograms on two variables
Getting ready
How to do it…
How it works...
Analyzing two variables using a correlation analysis
Getting ready
How to do it…
How it works...
6
Performing Multivariate Analysis in Python
Technical requirements
Implementing Cluster Analysis on multiple variables using Kmeans
Getting ready
How to do it…
How it works...
There is more...
See also...
Choosing the optimal number of clusters in Kmeans
Getting ready
How to do it…
How it works...
There is more...
See also...
Profiling Kmeans clusters
Getting ready
How to do it…
How it works...
There’s more...
Implementing principal component analysis on multiple variables
Getting ready
How to do it…
How it works...
There is more...
See also...
Choosing the number of principal components
Getting ready
How to do it…
How it works...
Analyzing principal components
Getting ready
How to do it…
How it works...
There’s more...
See also...
Implementing factor analysis on multiple variables
Getting ready
How to do it…
How it works...
There is more...
Determining the number of factors
Getting ready
How to do it…
How it works...
Analyzing the factors
Getting ready
How to do it…
How it works...
7
Analyzing Time Series Data in Python
Technical requirements
Using line and boxplots to visualize time series data
8
Analysing Text Data in Python
Technical requirements
Preparing text data
Getting ready
How to do it…
How it works...
There’s more…
See also…
Dealing with stop words
Getting ready
How to do it…
How it works...
There’s more…
Analyzing part of speech
Getting ready
How to do it…
How it works...
Performing stemming and lemmatization
Getting ready
How to do it…
How it works...
Analyzing ngrams
Getting ready
How to do it…
9
Dealing with Outliers and Missing Values
Technical requirements
Identifying outliers
Getting ready
How to do it…
How it works...
Spotting univariate outliers
Getting ready
How to do it…
How it works...
Finding bivariate outliers
Getting ready
How to do it…
How it works...
Identifying multivariate outliers
Getting ready
How to do it…
How it works...
See also
Flooring and capping outliers
Getting ready
How to do it…
How it works...
Removing outliers
Getting ready
How to do it…
How it works...
Replacing outliers
Getting ready
How to do it…
How it works...
Identifying missing values
Getting ready
How to do it…
How it works...
10
Performing Automated Exploratory Data Analysis in Python
Technical requirements
Doing Automated EDA using pandas profiling
Getting ready
How to do it…
How it works...
See also…
Performing Automated EDA using dtale
Getting ready
How to do it…
How it works...
See also
Doing Automated EDA using AutoViz
Getting ready
How to do it…
How it works...
See also
Performing Automated EDA using Sweetviz
Getting ready
How to do it…
How it works...
See also
Implementing Automated EDA using custom functions
Getting ready
How to do it…
How it works...
There’s more…
Index
SciPy to compute measures (like the mean, median, mode, standard deviation, percentiles, and other
critical summary statistics). By the end of the chapter, you will have gained the required knowledge
for generating summary statistics in Python. You will also have gained the foundational knowledge
required for understanding some of the more complex EDA techniques covered in other chapters.
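As a rough illustration of the measures listed above, the following sketch computes them with pandas and NumPy. The scores series is invented for this example and is not data from the book; SciPy offers equivalent functions, which are left out here to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: ten made-up exam scores (not data from the book).
scores = pd.Series([55, 61, 61, 70, 72, 75, 80, 84, 90, 98])

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().tolist(),    # mode() returns every most-frequent value
    "std": scores.std(),               # sample standard deviation (ddof=1)
    "p25": np.percentile(scores, 25),  # 25th percentile, linear interpolation
    "p75": np.percentile(scores, 75),  # 75th percentile, linear interpolation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that pandas uses the sample standard deviation by default, while `np.std` defaults to the population version, so the two can disagree on small datasets.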
Chapter 2, Preparing Data for EDA, focuses on the critical steps required to prepare data for analysis.
Real-world data rarely comes in a ready-made format, which is why this step is so crucial in EDA.
Through practical examples, you will learn aggregation techniques such as grouping, concatenating,
appending, and merging. You will also learn data-cleaning techniques, such as handling missing
values, changing data formats, removing records, and replacing records. Lastly, you will learn how to
transform data by sorting and categorizing it.
By the end of this chapter, you will have mastered the techniques in Python required for preparing
data for EDA.
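The preparation steps named above can be sketched in a few lines of pandas. The store names, sales figures, and bin edges below are all invented for illustration. Appending is shown with `pd.concat`, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical sales records; all names and values are invented.
q1 = pd.DataFrame({"store": ["A", "B", "A"], "sales": [100, 150, 100]})
q2 = pd.DataFrame({"store": ["B", "C"], "sales": [200, 50]})
regions = pd.DataFrame(
    {"store": ["A", "B", "C"], "region": ["North", "South", "South"]}
)

# Concatenate (append) the two quarters row-wise, then drop exact duplicate rows.
sales = pd.concat([q1, q2], ignore_index=True).drop_duplicates()

# Merge in the region lookup, then group by region and sort by total sales.
merged = sales.merge(regions, on="store", how="left")
by_region = (
    merged.groupby("region", as_index=False)["sales"].sum()
    .sort_values("sales", ascending=False)
)

# Categorize each sale into a labeled bin: (0, 100] is small, (100, 1000] is large.
merged["size"] = pd.cut(
    merged["sales"], bins=[0, 100, 1000], labels=["small", "large"]
)

print(by_region)
```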
Chapter 3, Visualizing Data in Python, covers data visualization tools critical for uncovering hidden
trends and patterns in data. It focuses on popular visualization libraries in Python, such as Matplotlib,
Seaborn, GGPLOT and Bokeh, which are used to create compelling representations of data. It also
provides the required foundation for subsequent chapters in which some of the libraries will be used.
With practical examples and a step-by-step guide, you will learn how to plot charts and customize
them to present data effectively. By the end of this chapter, you will be equipped with the knowledge
and hands-on experience of Python’s visualization capabilities to uncover valuable insights.
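To give a flavor of plotting and customizing a chart, here is a minimal Matplotlib sketch. The monthly totals are invented, and the chart is rendered to an in-memory buffer on the non-interactive `Agg` backend so that the script runs without a display; both choices are assumptions made for this example rather than the book's setup.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical monthly totals, invented for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
totals = [12, 18, 9, 15]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(months, totals, color="steelblue")

# Customize the chart with a title and axis labels.
ax.set_title("Monthly totals")
ax.set_xlabel("Month")
ax.set_ylabel("Total")

# Render to an in-memory PNG; pass a filename instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```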
Chapter 4, Performing Univariate Analysis in Python, focuses on essential techniques for analyzing
and visualizing a single variable of interest to gain insights into its distribution and characteristics.
Through practical examples, it delves into a wide range of visualizations such as histograms, boxplots,
bar plots, summary tables, and pie charts required to understand the underlying distribution of a
single variable and uncover hidden patterns in the variable. It also covers univariate analysis for both
categorical and numerical variables.
By the end of this chapter, you will be equipped with the knowledge and skills required to perform
comprehensive univariate analysis in Python to uncover insights.
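The numbers behind these charts can be computed directly, which is sometimes a useful check before plotting. The sketch below uses an invented two-column dataset: it derives a five-number summary and histogram bin counts for a numerical variable, and a frequency table for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numerical and one categorical variable.
df = pd.DataFrame({
    "age": [22, 25, 25, 31, 38, 41, 47, 52],
    "plan": ["basic", "pro", "basic", "basic", "pro", "free", "basic", "pro"],
})

# Numerical variable: five-number summary (the values a boxplot draws).
five_num = df["age"].quantile([0, 0.25, 0.5, 0.75, 1.0])

# Numerical variable: counts per bin (the bar heights of a histogram).
counts, edges = np.histogram(df["age"], bins=3)

# Categorical variable: frequency table (the data behind a bar or pie chart).
plan_counts = df["plan"].value_counts()

print(five_num)
print(counts, edges)
print(plan_counts)
```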
Chapter 5, Performing Bivariate Analysis in Python, explores techniques for analyzing the relationships
between two variables of interest and uncovering meaningful insights embedded in them. It delves
into various techniques, such as correlation analysis, scatter plots, and box plots required to effectively
understand relationships, trends, and patterns that exist between two variables. It also explores the
various bivariate analysis options for different variable combinations, such as numerical-numerical,
numerical-categorical, and categorical-categorical. By the end of this chapter, you will have gained
the knowledge and hands-on experience required to perform in-depth bivariate analysis in Python
to uncover meaningful insights.
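The three variable combinations mentioned above each have a natural first summary in pandas. The following sketch uses an invented dataset of study hours and scores; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 73],
    "group": ["a", "a", "b", "b", "a", "b"],
    "passed": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical-numerical: Pearson correlation coefficient.
r = df["hours"].corr(df["score"])

# Numerical-categorical: mean of the numerical variable per category.
mean_by_group = df.groupby("group")["score"].mean()

# Categorical-categorical: two-way frequency table (crosstab).
table = pd.crosstab(df["group"], df["passed"])

print(f"correlation: {r:.3f}")
print(mean_by_group)
print(table)
```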
Chapter 6, Performing Multivariate Analysis in Python, builds on previous chapters and delves into some
more advanced techniques required to gain insights and identify complex patterns within multiple
variables of interest. Through practical examples, it covers concepts such as cluster analysis,
principal component analysis, and factor analysis, which enable the understanding of interactions
among multiple variables of interest. By the end of this chapter, you will have the skills required to
apply advanced analysis techniques to uncover hidden patterns in multiple variables.
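To make one of these techniques concrete, the sketch below runs principal component analysis on synthetic data using only NumPy's singular value decomposition. This is a stand-in chosen to keep the example dependency-free; it is not necessarily the library or approach used in the chapter, and the data is randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 variables; the first two are strongly correlated.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Standardize each variable to zero mean and unit variance.
z = (data - data.mean(axis=0)) / data.std(axis=0)

# SVD of the standardized data: rows of vt are the principal directions.
u, s, vt = np.linalg.svd(z, full_matrices=False)

# Share of total variance captured by each component, in descending order.
explained_ratio = s**2 / np.sum(s**2)

# Project the observations onto the principal components.
scores = z @ vt.T

print(np.round(explained_ratio, 3))
```

Because two of the three variables carry almost the same information, the first component should capture roughly two thirds of the variance and the last one almost none.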
Chapter 7, Analyzing Time Series Data, offers a practical guide to analyzing and visualizing time series
data. It introduces time series terminology and techniques (such as trend analysis, decomposition,
seasonality detection, differencing, and smoothing) and provides practical examples and code on
how to implement them using various libraries in Python. It also covers how to spot patterns within
time series data to uncover valuable insights. By the end of the chapter, you will be equipped with the
relevant skills required to explore, analyze, and derive insights from time series data.
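Smoothing and differencing, two of the techniques named above, can be sketched with pandas alone. The series below is synthetic: a linear trend plus a weekly cycle, constructed so that the effect of each operation is easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend plus a 7-day cycle.
idx = pd.date_range("2023-01-01", periods=28, freq="D")
steps = np.arange(28)
ts = pd.Series(steps + 5 * np.sin(2 * np.pi * steps / 7), index=idx)

# Smoothing: a 7-day moving average cancels the weekly cycle, leaving the trend.
smoothed = ts.rolling(window=7).mean()

# Differencing: the first difference removes the linear trend.
differenced = ts.diff()

# Seasonal differencing: a lag-7 difference removes the weekly cycle.
seasonal_diff = ts.diff(7)

print(smoothed.dropna().head())
print(seasonal_diff.dropna().head())
```

The first six values of the moving average are missing because a full 7-day window is not yet available.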
Chapter 8, Analyzing Text Data, covers techniques for analyzing text data, a form of unstructured
data. It provides a comprehensive guide on how to effectively analyze and extract insights from text
data. Through practical steps, it covers key concepts and techniques for data preprocessing such as
stop-word removal, tokenization, stemming, and lemmatization. It also covers essential techniques
for text analysis such as sentiment analysis, n-gram analysis, topic modeling, and part-of-speech
tagging. By the end of this chapter, you will have the skills required to process and analyze
various forms of text data to unpack valuable insights.
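The preprocessing steps above can be illustrated with the standard library alone. This sketch tokenizes a sentence, removes stop words, and counts bigrams. The stop-word list is a tiny hand-written assumption for this example; dedicated NLP libraries ship much fuller lists along with stemmers and lemmatizers.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


text = "The food is great and the service is great. It is a great place to eat."
tokens = remove_stop_words(tokenize(text))
bigram_counts = Counter(ngrams(tokens, 2))

print(tokens)
print(bigram_counts.most_common(3))
```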
Chapter 9, Dealing with Outliers and Missing Values, explores the process of effectively handling outliers
and missing values within data. It highlights the importance of dealing with missing values and outliers
and provides step-by-step instructions on how to handle them using visualization techniques and
statistical methods in Python. It also delves into various strategies for handling missing values and
outliers within different scenarios. By the end of the chapter, you will have essential knowledge of
the tools and techniques required to handle missing values and outliers in various scenarios.
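One common statistical approach is the interquartile range (IQR) rule, sketched below on an invented series. The 1.5 multiplier is the conventional default rather than a value taken from the book, and filling the gap with the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one extreme value and one missing value.
s = pd.Series([10, 12, 11, 13, 12, 95, np.nan, 11])

# Identify missing values.
missing = s.isna().sum()

# Identify outliers with the IQR rule (1.5 is the conventional multiplier).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Floor and cap: pull values outside the fences back to the fence.
capped = s.clip(lower=lower, upper=upper)

# Replace the missing value with the median of the capped series.
filled = capped.fillna(capped.median())

print(f"missing: {missing}, outliers: {is_outlier.sum()}")
print(filled.tolist())
```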
Chapter 10, Performing Automated EDA, focuses on speeding up the EDA process through automation.
It explores the popular automated EDA libraries in Python, such as pandas profiling, dtale, Sweetviz,
and AutoViz. It also provides hands-on guidance on how to build custom functions to automate the
EDA process yourself. With step-by-step instructions and practical examples, it will empower you to
gain deep insights quickly from data and save time during the EDA process.
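A custom function of the kind described above might look like the following sketch. The function name `quick_eda` and the columns it reports are hypothetical choices for this example, not the book's implementation.

```python
import pandas as pd


def quick_eda(df):
    """Return a one-row-per-column overview of a DataFrame (hypothetical helper)."""
    overview = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })
    # Numeric summaries align on column name; other columns get NaN.
    numeric = df.select_dtypes(include="number")
    overview["mean"] = numeric.mean()
    overview["std"] = numeric.std()
    return overview


# Invented sample data with a gap in each column.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Lagos", "Abuja", "Lagos", None],
})
report = quick_eda(df)
print(report)
```

Wrapping the checks in one function means the same overview can be produced for any new dataset with a single call.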
If you are using the digital version of this book, we advise you to type the code yourself or access
the code from the book’s GitHub repository (a link is available in the next section). Doing so will
help you avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Create a
histogram using the histplot method in seaborn and specify the data using the data parameter
of the method.”
A block of code is set as follows:
import numpy as np
import pandas as pd
import seaborn as sns
When we wish to draw your attention to a particular part of a code block, the relevant lines or items
are set in bold:
data.shape
(30,2)
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance,
words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the
Administration panel.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com
and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.