Learn Data Science Using Python: A Quick-Start Guide
Engy Fouda
Hopewell Junction, NY, USA
Table of Contents

Introduction ............................................................ xiii
Modeling .................................................................. 19
    My 2016 Predictions ................................................... 20
    My 2020 Predictions ................................................... 21
Summary ................................................................... 22
Histogram ................................................................. 57
Series Plot ............................................................... 62
Bar Chart ................................................................. 66
    How to Sort a Bar Chart? .............................................. 69
    Create a Histogram Using a Bar Chart .................................. 75
Bubble Chart .............................................................. 77
Summary ................................................................... 83
While Loop ............................................................... 113
For Loop ................................................................. 115
Summary .................................................................. 121
Chapter 7: Regression .................................................... 155
    Simple Linear Regression ............................................. 155
    Multiple Linear Regression ........................................... 167
    Logistic Regression .................................................. 172
    Summary .............................................................. 176
Index .................................................................... 177
About the Author
Engy Fouda is an adjunct lecturer at SUNY New Paltz, teaching
Introduction to Data Science Using SAS Studio and Introduction to
Machine Learning Using Python. She is an Apress and Packt Publishing
author. As a freelance instructor, she currently teaches course tracks at
several venues covering SAS fundamentals, intermediate SAS, advanced
SAS, SAS SQL, introduction to Python, Python for data science, Docker
fundamentals, Docker Enterprise for developers, Docker Enterprise for
operations, Kubernetes, and test prep for the DCA and SAS exams.
She also works as a freelance writer for Geek Culture, Towards Data
Science, and the Medium Partner Program. She holds two master's degrees:
one in journalism from Harvard University, the Extension School, and
the other in computer engineering from Cairo University. Moreover, she
earned a Data Science Graduate Professional Certificate from Harvard
University, the Extension School. She volunteers as the chair of the Egypt
Scholars board and is a former executive manager and former team leader
of Momken (Engineering for the Blind). She is the author of the books
Learn Data Science Using SAS Studio and A Complete Guide to Docker for
Operations and Development, published by Apress, and a co-author of The
Docker Workshop, published by Packt.
About the Technical Reviewer
Suvoraj Biswas is a subject matter expert and thought leader in enterprise
generative AI, specializing in its architecture and governance. He authored
the award-winning book Enterprise Generative AI Well-Architected
Framework & Patterns, which has received global acclaim from industry
veterans for its original contribution in defining a well-architected
framework for designing and building secure, scalable, and trustworthy
enterprise generative AI across industries.
Professionally, Suvoraj is an architect at Ameriprise Financial, a
130-year-old Fortune 500 organization. With more than 19 years of
experience in enterprise IT across India, the United States, and Canada, he
has held architectural and consulting roles at companies like Thomson
Reuters and IBM.
His expertise spans secure and scalable enterprise system design,
multicloud adoption, digital transformation, and advanced technology
implementation, including SaaS platform engineering, AWS cloud
migration, generative AI adoption, and smart IoT platform development.
His LinkedIn profile is https://www.linkedin.com/in/suvoraj/.
Introduction
This book shows you, in step-by-step sequences, how to use Python to
accomplish data cleaning, statistics, and visualization tasks, even if you do
not have any programming or statistics background.
The book’s case study predicts the presidential elections in the state of
Maine, which is a project I did at Harvard University. Chapter 1 explains
the case study in more detail. In addition to the presidential elections,
the book provides real-life examples, including analyzing stocks, oil and
gold prices, crime, marketing, and healthcare. You will see data science
in action and how to accomplish complicated tasks and visualizations
in Python.
You will learn, step-by-step, how to do visualizations, with explanations
of the code and screenshots of every step. The whole book follows the
paradigm of data science in action, demonstrating how to perform
complicated tasks and visualizations in Python in an easy-to-follow way,
with real big data for interesting hands-on labs.
It will provide readers the required expertise in data science and
analytics using Python:
• Combine datasets
• T-test
• Linear regression
• Visualizations
CHAPTER 1
Data Science in Action
Subsequently, the third stage entails cleaning the raw data; addressing
issues like managing missing values, outliers, and repeated rows; and
correcting misspellings while standardizing data types and formats across
columns.
The fourth stage involves experimenting with multiple models and
comparing how well each performs on the given problem. For instance,
I utilized Monte Carlo and Bayes algorithms in the presidential election
scenario.
The final stage involves visualizing and articulating the results in
accessible language within reports. This step is paramount as it culminates
in addressing the initial question that instigated the entire process, thereby
fulfilling the primary objective of the endeavor.
Population
The next step involves gathering relevant data extensively. First, I focused
on understanding the population distribution across Maine’s counties,
utilizing information from the US Census Bureau. This exploration
revealed that Maine's population distribution is unusual, with vast areas
either sparsely populated or inhabited by solitary individuals. While the tiny
dots in the southern region of the map may seem insignificant, each
represents over 5,000 people. Thus, be cautious about misinterpreting maps
disseminated by presidential campaigns or mainstream media outlets.
The initial dataset was full of erroneous values and outliers. For
instance, one voter’s age was listed as 220 years despite their birthdate
indicating an age of approximately 67 years. Additionally, some voter
information was entirely blank. As emphasized earlier, it is imperative to
meticulously clean your data by addressing outliers and missing values,
formatting inconsistencies, and conducting thorough data exploration. In
Chapter 6, we will perform the data cleaning together in step-by-step labs.
Data source: https://www.census.gov/data/tables/time-series/demo/voting-and-registration/p20-585.html
Gender
As depicted in Figure 1-4, my analysis revealed that the number of
registered female voters surpasses that of registered male voters in
Maine. However, the percentage disparity between the two genders is
within ±2%. Regarding the strategic implications for political campaigns,
representatives could consider leveraging this information by, for example,
wearing the pink cancer awareness ribbon to resonate with female voters,
given their historically higher turnout than men. This recommendation
highlights the significant influence that data and statistics wield, shaping
speech topics and influencing political representatives’ attire.
Race
Concerning racial demographics in Maine, over 96% of the population
identifies as white, as indicated in Figure 1-5. Consequently, I categorized
Black, Asian, and Hispanic voters as non-white registered voters.
Age
In 2016, there was a notable decline in registered voters over 65, as
depicted in Figure 1-6 (note the varying scales). Conversely, there was an
uptick in the 18–24 and 25–44 age brackets. This observation suggests a
need for adjustments in speech topics to resonate with a broader voter
base. For instance, campaign representatives should prioritize issues
like student loans and home mortgages rather than medical insurance and
retirement funding discussions.
Voter Turnout
I examined historical voter turnout to gauge the number of individuals
who braved Maine’s snowy streets to cast their votes. Figure 1-7 illustrates
voter turnout percentages from 2000 to 2016, with the Democratic Party
emerging as the victor.
If you try this exercise with a swing state, the columns' colors will
change to reflect the winning party.
Figure 1-8. Maine voter turnout according to party, age, race, and
gender groupings
Categories/Issues
Each presidential election debate addresses pertinent issues and
elucidates each candidate’s proposed approach. These strategies can
sway voter opinions and influence their decision-making on these topics.
Consequently, it is imperative to examine these issues thoroughly.
1. Data source: http://www.cnn.com/election/results/exit-polls/maine/president
2. Data sources: https://ballotpedia.org/Maine_2000_ballot_measures
https://ballotpedia.org/Maine_2004_ballot_measures
https://ballotpedia.org/Maine_2008_ballot_measures
https://ballotpedia.org/Maine_2012_ballot_measures
https://ballotpedia.org/Maine_2016_ballot_measures
2. Forestry
3. Fishing
4. Hunting
5. Taxes
3. Data source: https://www.pressherald.com/2019/03/27/maine-incomes-up-4-percent-in-2018/
4. Data source: http://www.deptofnumbers.com/income/maine/
5. Data source: http://www.cnn.com/election/results/exit-polls/maine/president
6. Data source: https://www.bangordailynews.com/2016/11/09/politics/elections/clinton-leads-maine-but-trump-poised-to-take-one-electoral-vote/
Table 1-3. Maine presidential election results by year (columns: Year, D, D%, R, R%; the table's rows are not shown in this excerpt)
Upon reviewing the data presented in Table 1-3, I was taken aback
to discover Maine’s historical political trajectory: transitioning from
a predominantly Republican state to a staunchly Democratic one.
Consequently, I opted to characterize Maine as a “Lean Democrat” state,
given that Democrats emerged victorious in 7 out of 15 elections from 1960
to 2016. My prediction for 2020 leans toward a predominantly Democratic
outcome, albeit by a narrow margin. Notably, since 1992, Maine has
consistently favored the Democratic Party. However, neglecting Maine
could potentially lead to a shift toward Republican allegiance. According to
the Bangor Daily News, Clinton did not visit Maine after September 2015,
opting instead to send surrogates. Consequently, for the first time in
years, the Democratic Party lost one electoral vote to the Republican Party.
Modeling
In modeling, it is advisable to experiment with various algorithms for
prediction and assess their outcomes rather than relying solely on one
model. In this project, I used Monte Carlo and Bayes algorithms. Regarding
statistical tests, the methods utilized included
• Histograms
• Box plots
• Proportion test
• T-test
• Decision tree
• Chi-square test
7. Data source: https://ballotpedia.org/Presidential_election_in_Maine,_2016
Summary
In this chapter, we extracted the key concepts from the case study of the
presidential elections project in Maine, along with its accompanying
charts. In the next chapter, we will delve into writing our first Python
program and the data types.
8. Data source: https://ballotpedia.org/Presidential_election_in_Maine,_2020
CHAPTER 2
Getting Started
We will start this chapter by installing Anaconda and taking a tour of the
Jupyter Notebook interface to get more familiar with it. When studying any
programming language, we always cover the following topics: the datatypes,
which we will discuss in this chapter; comparison and logical operators in
Chapter 5; conditional statements in Chapter 6; and loops in Chapter 5.
Installation
You can install Anaconda for free for personal and educational purposes.
Anaconda is an easy-to-install environment manager, Python distribution,
and collection of over 720 open source packages. Its main advantage is that
it comes bundled with Python, Jupyter Notebook, the Spyder IDE, and data
analysis and machine learning libraries and packages, as shown in
Figure 2-1. You can install it for free from this site:
https://www.anaconda.com/download/
Alternatively, you can install Jupyter Notebook on its own:
https://jupyter.org/install
If you prefer not to install anything locally, there are several Python
cloud playgrounds:
• Kaggle: https://www.kaggle.com/code
• Sololearn: https://code.sololearn.com/python
• Replit: https://replit.com/languages/python3
• Trinket: https://trinket.io/library/trinkets/
create?lang=python3
What Is Python?
Python is a programming and scripting language. It is popular, fast, and
more flexible than other programming languages. The main advantage is
that an enormous community contributes to it. Therefore, you will easily
find batteries/packages and code samples for whatever you need for your
projects.
Tour
Each place to enter text is called a cell, as shown in Figure 2-2. You can
enter Python code or any text. If you want to insert plain English text,
select Markdown from the cell-type drop-down menu.
To make titles in a Markdown cell, type # and a space before your
title. When you run the cell, Jupyter renders it as a large, bold heading, as
shown in Figure 2-3.
To run a cell, click the Run icon, as shown in Figure 2-3, or press
Shift+Return: hold the Shift key and press the Return key. If you want
to run all the Jupyter Notebook cells, click the double-arrows icon, as
highlighted in Figure 2-3.
To delete any cell, press Esc and then press the D key twice, or select
Delete Cell from the Edit menu, as in Figure 2-4.
To add any cell, click the plus icon. As shown in Figure 2-5, you can
move a cell up and down by clicking the arrow icons in the toolbar. You
can also insert cells from the Insert menu.
Datatypes
There are different ways to classify the datatypes in Python. Usually,
they fall into the following six categories:
1. Numeric Types
2. Sequence Types
3. Mapping Types
4. Set Types
5. Boolean Types
6. Binary Types
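A quick sketch with one literal from each of the six categories (the variable names here are our own, for illustration):

n = 3.14                 # 1. Numeric (int, float, complex)
seq = [1, 'two', 3.0]    # 2. Sequence (list, tuple, range, str)
m = {'key': 'value'}     # 3. Mapping (dict)
s = {1, 2, 3}            # 4. Set (set, frozenset)
flag = True              # 5. Boolean (bool)
raw = b'bytes'           # 6. Binary (bytes, bytearray, memoryview)
print(type(n), type(seq), type(m), type(s), type(flag), type(raw))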
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)
The output is as in Figure 2-6. The first column on the left is not a
component of the DataFrame; it is an index that serves as an identifier for
the rows.
import numpy as np
arr = np.array([1, 2, 3])
print(arr)
import pandas as pd
s = pd.Series([1, 2, 3])
print(s)
The output is in Figure 2-8. We use the Pandas Series to store one
column from a Pandas DataFrame, as we are going to see in many of the
book’s examples.
Typecasting
Typecasting is converting from one datatype to another. Here are examples
within these categories:
x = int(3.14) # x is now 3
print(x)
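Only the int() example appears in this excerpt; here are a few more typecasting sketches in the same spirit (the values are our own):

y = float(3)             # y is now 3.0
s = str(3.14)            # s is now '3.14'
lst = list('abc')        # lst is now ['a', 'b', 'c']
t = tuple([1, 2, 3])     # t is now (1, 2, 3)
unique = set([1, 1, 2])  # unique is now {1, 2}
flag = bool(0)           # flag is now False
print(y, s, lst, t, unique, flag)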
Pandas
Throughout this book, we will use Pandas and Numpy extensively. Pandas
is a powerful and popular Python library for data manipulation and
analysis. It provides easy-to-use data structures and functions designed
to work with structured data seamlessly. One of its notable features is
the ability to present data in a human-readable table format while also
allowing numerical interpretation for extensive computations. Think of
Pandas as a supercharged version of Excel within Python, allowing you to
handle and analyze data more efficiently and programmatically.
Key Concepts
• Series: A Series is a one-dimensional array-like object
that can hold various datatypes, including numbers,
strings, and dates. Think of it as a single column in a
spreadsheet, as shown in Listing 2-13.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
series=df['A']
print(series)
• DataFrame: A DataFrame is a two-dimensional, spreadsheet-like
table of rows and labeled columns. For example, it can be created
from a dictionary:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Numpy
Numpy, short for “numerical Python,” is a foundational Python library
for numerical computing. It supports arrays, matrices, and high-level
mathematical functions to operate on these data structures. Numpy is the
backbone of many other data science libraries, including Pandas.
We frequently transfer data from our Pandas DataFrame to Numpy
arrays. Pandas DataFrames are beneficial because they include column
names and other text data, making them easy for humans to read.
However, Numpy arrays are the most efficient for computers to perform
calculations.
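As a minimal sketch of that handoff (the column names are our own):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
arr = df.to_numpy()  # drop the labels and keep a pure numeric grid
print(arr)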
Key Concepts
• Array: A Numpy array is a grid of values, all of the same
type, as in Listing 2-15. A tuple of nonnegative integers
indexes the array.
import numpy as np
array = np.array([1, 2, 3, 4, 5])
print(array)
Figure 2-15. Creating two Numpy matrices: a 3×3 matrix filled with
zeros and a smaller 2×2 matrix filled with ones
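The code behind Figure 2-15 is not shown in this excerpt; a minimal sketch that produces such matrices:

import numpy as np

zeros_matrix = np.zeros((3, 3))  # 3x3 matrix filled with zeros
ones_matrix = np.ones((2, 2))    # 2x2 matrix filled with ones
print(zeros_matrix)
print(ones_matrix)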
Figure 2-19. Reshaping the matrix from 3×3 to 3×2 without losing
or changing data
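The code behind Figure 2-19 is likewise not shown. Note that reshape() requires the element count to stay the same, so this sketch uses compatible shapes (2×3 to 3×2):

import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2x3 matrix
reshaped = matrix.reshape(3, 2)            # 3x2 matrix, same six elements
print(reshaped)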
Summary
This chapter represents the foundation of our journey in learning Python.
It starts with the installation and Python Cloud Playgrounds and then
explains the interface and the essential components of a Jupyter Notebook.
It also describes the datatypes and typecasting.
CHAPTER 3
Data Visualizations
There is an adage that says, “A picture is worth a thousand words.”
Visualization is a tedious and demanding task. Luckily, Python has plenty of
batteries that contain functions with options to make expressive graphs.
This chapter uses big datasets that are available in the public domain. We
shall follow the data science process mentioned in Chapter 1, where we
start every section with a question and seek its answer through a graph.
This chapter covers various essential charts, such as scatterplots,
histograms, series plots, bar charts and sorted bar charts, and bubble plots.
Scatterplot
Scatterplots are used to show the interdependence between variables by
describing the relationship's direction, strength, and linearity. Moreover,
they make outliers easy to spot.
For this example, our question is: Do the salaries in the City of Seattle
depend on the age range of the employees, or is there no relationship?
You will find the data file in the Datasets Folder, or you can download
the CSV file from the Data.Gov project at the following link: https://web.archive.org/web/20170116035611/http://catalog.data.gov/dataset/city-of-seattle-wages-comparison-by-gender-average-hourly-wage-by-age-353b2.
3. Create a scatterplot.
   b. Adding a title
df=pd.read_csv('../Datasets/Chapter 3/City_of_Seattle_Wages__Comparison_by_Gender_-_Average_Hourly_Wage_by_Age.csv')
print(df.head())
From Figure 3-1, we find that all eight columns are numeric except the
AGE RANGE. It is of a string datatype.
Let us explore the dataset more by using the describe() method. It
returns the summary statistics report of the dataset, as in Listing 3-1-3.
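Listing 3-1-3 itself is not shown in this excerpt; it boils down to a single call:

# Listing 3-1-3 (sketch): summary statistics report
print(df.describe())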
Figure 3-2. The output of the describe() method
Figure 3-2 shows a few statistics for each column. Note that it
only gives statistics for the numerical columns. Here are the statistics
definitions:
• Count: This is the count of non-missing rows.
plt.figure(figsize=(12, 6))
plt.scatter(df['AGE RANGE'], df['Total Average of HOURLY RATE'])
plt.xlabel('AGE RANGE')
plt.ylabel('Total Average of HOURLY RATE')
To avoid plots being drawn over each other, we use the figure
method, where we also set the plot dimensions, as in Listing 3-1-4. Then,
we use the scatter method and pass it the two variables we want to plot.
To enhance the diagram, we set the labels of the x and y axes using the
xlabel and ylabel methods.
In Figure 3-3, the last two values of the AGE RANGE are not needed
and give a wrong impression about the relationship between the two
variables. Hence, we shall filter them out and remove them from the data,
as in Listing 3-1-5.
Listing 3-1-5. Filter the last two values in the AGE RANGE
plt.figure(figsize=(12, 6))
df1=df['AGE RANGE']
df2=df['Total Average of HOURLY RATE']
plt.scatter(df1[:-2], df2[:-2])
plt.xlabel('AGE RANGE')
plt.ylabel('Total Average of HOURLY RATE')
plt.title('Interdependence between Age Range and Total Average of HOURLY RATE')
From the final scatterplot in Figure 3-4, we can confirm that there is an
interdependence between the AGE RANGE and the Total Average of
HOURLY RATE in Seattle. The relationship is not linear; it is curvilinear.
Scatterplot Relationships
Figure 3-5 displays different examples of scatterplot relationships. To
describe the scatterplot, you should state if it is a linear relationship or
curvilinear as we have seen in the previous example.
As always, we read the dataset and explore it to verify that it has been
read correctly using the head() method, as in Listing 3-2-1.
df=pd.read_excel('../Datasets/Chapter 3/MAINECOUNTIESPARTIES.xlsx')
print(df.head())
The output is as in Figure 3-6 showing the first five rows of the dataset.
The code plots two overlapped scatterplots, where both have the counties
on the x axis; the y axis of the first scatterplot is the number of Democrats,
and that of the second is the number of Republicans. To distinguish
between the two, we set the color option, as in Listing 3-2-2, using c=['r'].
Moreover, we change the marker of the Republican Party points to a
triangle, using the marker='^' option of the scatter function, so the points
are distinguishable even in grayscale. We changed the label of the y
axis to "Republican:Red Triangles Democrat:Blue Circles." Finally, we set
the title of the plot.
# Listing 3-2-2 (the scatter lines are reconstructed; the column names are assumed)
plt.scatter(df['COUNTY'], df['DEMOCRAT'], c=['b'])
plt.scatter(df['COUNTY'], df['REPUBLICAN'], c=['r'], marker='^')
plt.xlabel('counties')
plt.ylabel('Republican:Red Triangles Democrat:Blue Circles')
plt.title('Maine Counties and Parties')
From Figure 3-7, we can say that Cumberland County has the highest
number of Democratic voters and the highest number of registered voters
in general. It might imply that it has the highest population as well, but we
cannot be sure of that without further verification. However, we are sure
that the difference between the Democrats and Republicans is enormous
there. Beyond this county, Democrats outnumber Republicans in 8 of the
remaining 15 counties, with trivial differences. Hence, Democrats must pay
more attention to Maine, so it does not turn back to red.
Histogram
The histogram task creates a chart that displays the frequency distribution
of a numeric variable. For this example, we shall see how the annual
wages of employees at the City of Charlotte, North Carolina, were
distributed in March 2018.
You will find the data file in the Datasets Folder, or you can download
the CSV file from the following link: https://data.amerigeoss.org/dataset/city-employee-salaries-march-2018/resource/264fa25e-93cb-46b8-8edd-24ad224f3b74?view_id=4c9ee967-a039-4d72-ad78-2f556909b207.
As always, load the CSV file to a DataFrame using the Pandas read_csv()
method. Then, explore it using the head() method to verify that it
has been read correctly, as in Listing 3-3-1.
df=pd.read_csv('../Datasets/Chapter 3/City_Employee_Salaries_March_2018.csv')
print(df.head())
The output is as in Figure 3-8 showing the first five rows of the dataset.
We have nine columns. However, we are interested in one column only,
Annual_Rt, to plot our histogram.
plt.figure(figsize=(18, 10))
plt.hist(df['Annual_Rt'])
plt.xlabel('Annual_Rt')
Figure 3-9 shows the histogram plot. However, we need to enhance the
appearance of this plot. It is a right-skewed curve, and there are some clear
outliers. Also, we can resize the bins to make the distribution curve
smoother. The last step will be adding a normal distribution curve over the
histogram. We shall do all these steps in Listing 3-3-3.
Listing 3-3-3 starts with importing norm from scipy.stats. This
library is for plotting the normal distribution curve; it has functions for
working with normal (Gaussian) distributions.
Then, we generate an array of values for the x axis ranging from 0 to
140000 with step=1. We set the maximum annual salary rate to 140000 and
will ignore any value larger than that by setting a mask over the Annual_Rt
series.
Afterward, we plot the histogram using the new masked series, change
the bins to 10, and normalize the histogram to draw the density and return
a probability density. Next is plotting the normal distribution with the mean
and standard deviation calculated from the data. The final changes are
giving a label to the x axis, setting the title, setting the legend, and showing
all that.
plt.figure(figsize=(15, 10))
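Only the plt.figure line above survives from Listing 3-3-3 in this excerpt; here is a self-contained sketch of the full listing, following the steps just described (style details assumed):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Mask: keep only annual rates up to the 140000 cap
masked = df['Annual_Rt'][df['Annual_Rt'] <= 140000]

plt.figure(figsize=(15, 10))
plt.hist(masked, bins=10, density=True, label='Annual rate')  # density histogram
x = np.arange(0, 140000, 1)
plt.plot(x, norm.pdf(x, masked.mean(), masked.std()), label='Normal distribution')
plt.xlabel('Annual_Rt')
plt.title('Annual Salary Distribution')
plt.legend()
plt.show()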
Series Plot
The series plot displays the values against time. In this example, let us plot
the trends of stock prices over time.
In this exercise, we do not have a dataset to load. Rather, we will
use the Yahoo Finance library, yfinance, to load the stocks data of any
company. First, as in Listing 3-4-1, we will load Apple Inc. stocks’ historical
data. Then, in Listing 3-4-2, we will load multiple companies and plot
them vs. each other to compare values. The companies are Apple (AAPL),
Intel (INTC), Microsoft (MSFT), and ASML Holding (ASML).
Listing 3-4-1. Series plot to show Apple Inc. closing prices over time
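The code of Listing 3-4-1 is not shown in this excerpt; a minimal sketch with the yfinance library (the ticker comes from the text; the date range is assumed):

import yfinance as yf
import matplotlib.pyplot as plt

# Download Apple Inc. daily historical data
aapl = yf.download('AAPL', start='2021-01-01', end='2023-01-01')

plt.figure(figsize=(12, 6))
plt.plot(aapl.index, aapl['Close'], color='b', label='AAPL Close')
plt.title('Apple Inc. Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.legend()
plt.show()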
We can customize the line by setting the color. Finally, as usual, we set the
plot title and the labels of the x and y axes, show the grid, and display the
plot with all these settings. The output is as in Figure 3-11.
The output of Listing 3-4-2 is Figure 3-12. This code fetches historical
closing prices for the specified stocks and plots them on the same graph
for the given date range. You can customize the stock symbols and date
range based on your preferences.
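Listing 3-4-2 itself is not shown in this excerpt; a compact sketch with the four tickers and the January 2021 to January 2023 range the text describes:

import yfinance as yf
import matplotlib.pyplot as plt

tickers = ['AAPL', 'INTC', 'MSFT', 'ASML']
# For multiple tickers, 'Close' selects one closing-price column per ticker
closes = yf.download(tickers, start='2021-01-01', end='2023-01-01')['Close']

closes.plot(figsize=(12, 6), title='Closing Prices Comparison')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()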
It is clear from Figure 3-12 that ASML has the highest closing rate, even
at its lowest rates around October 2022, while Intel (INTC) has the lowest
closing rate from January 2021 to January 2023.
Bar Chart
The bar chart is for comparing values side by side, whether vertically
or horizontally. For this example, we shall use the crime data in NYC to
compare the crime rate in the different NYC boroughs in 2018. You can
download the 2018 NYPD Complaint Data from this link: https://www.kaggle.com/datasets/mihalw28/nypd-complaint-data-current-ytd-july-2018/data?select=NYPD_Complaint_Data_Current_YTD.csv.
For downloading the most recent data, it is available through the
NYC Open Data Project website. You will find the data file in the Datasets
Folder, or you can download the dataset from this link: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-Year-To-Date-/5uac-w243/data.
To download the dataset of the current year from the above link, the
steps are similar to the ones in Figure 3-13 by clicking on Export then CSV.
To load the data, we write Listing 3-5-1. In this example, we insert the
values in the code instead of reading external data to practice other ways to
read data in Python.
# Data
Borough = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
crimes = [50153, 67489, 56691, 44137, 10285]
In Listing 3-5-1, we insert the names of the boroughs and the total
number of crimes in each one in two lists, Borough and crimes. Then, we
create the bar chart using the plt.bar() method and set the two variables
and the color. Finally, as always, we set the labels of the x and y axes and
the plot title and then show the plot.
The output of Listing 3-5-1 is Figure 3-14.
# Data
Borough = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
crimes = [50153, 67489, 56691, 44137, 10285]
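The sorting and plotting lines of Listing 3-5-2 are not shown in this excerpt; a sketch consistent with the description that follows:

import matplotlib.pyplot as plt

# Sort borough/crime pairs by crime count (ascending)
sorted_data = sorted(zip(Borough, crimes), key=lambda x: x[1])
sorted_boroughs, sorted_crimes = zip(*sorted_data)

plt.figure(figsize=(10, 6))
plt.bar(sorted_boroughs, sorted_crimes, color='skyblue')
plt.xlabel('Borough')
plt.ylabel('Number of Crimes')
plt.title('Crimes per NYC Borough (2018)')
plt.show()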
The sorted_crimes variable is on the y axis, and the bars are colored sky
blue. The plt.xlabel, plt.ylabel, and plt.title functions add the x axis
label, the y axis label, and the chart title, respectively. Finally, plt.show()
is used to display the created bar chart. The output of Listing 3-5-2 is the
ascending sorted graph in Figure 3-15.
To reverse the output and get a descending vertical bar chart, set the
reverse option to True in sorted_data = sorted(zip(Borough, crimes),
key=lambda x: x[1], reverse=True), as in Listing 3-5-3.
# Data
Borough = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
crimes = [50153, 67489, 56691, 44137, 10285]
# Data
Borough = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
crimes = [50153, 67489, 56691, 44137, 10285]
Candidate    Winning %
Obama        56.27
Romney       40.98
Johnson      1.31
Stein        1.14
Paul         0.29
Anderson     0.01
Reed         0.00
Again, instead of loading the data from an external file, we shall insert
the values in the code as in Listing 3-6.
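Listing 3-6 itself is not shown in this excerpt; a sketch assembled from the description that follows (the exact party colors and title are assumptions):

import matplotlib.pyplot as plt

candidates = ['Obama', 'Romney', 'Johnson', 'Stein', 'Paul', 'Anderson', 'Reed']
winning_percentages = [56.27, 40.98, 1.31, 1.14, 0.29, 0.01, 0.0]
colors = ['blue', 'red', 'gold', 'green', 'gray', 'purple', 'orange']  # assumed

plt.figure(figsize=(10, 6))
bars = plt.bar(candidates, winning_percentages, color=colors)
# Label each bar with its winning percentage, two decimal places
for bar, pct in zip(bars, winning_percentages):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
             f'{pct:.2f}', ha='center', va='bottom',
             fontweight='bold', color='black')
plt.xlabel('Candidate')
plt.ylabel('Winning Percentage')
plt.title('Presidential Election Results')
plt.show()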
The added value of this bar chart in Listing 3-6 is coloring the bars with
the corresponding party color. Again, we read the data from the two lists,
candidates and winning_percentages. Then, we define a list of colors that
we want to use in our chart. The plt.figure(figsize=(10, 6)) sets the
figure size. The plt.bar function is then used to create a vertical bar chart,
where the candidates list is on the x axis, the winning_percentages
variable is on the y axis, and the bars are colored according to the
specified colors’ list that we pre-assigned. Then, we loop to iterate through
each bar and its corresponding winning percentage.
The plt.text function is then used to add a text label above each bar,
displaying the winning percentage with two decimal places. The labels are
centered using ha='center' and aligned at the bottom using va='bottom'.
The labels are also bold and black. Then, at the end, as always, set the
labels of the x and y axes and the plot title. The output is Figure 3-18.
Bubble Chart
The bubble chart is useful in displaying three variables of data at a time.
In this example, we will create a bubble chart to display the height, weight,
and the age of a class.
For this example, I generated random data for the class, as shown in
Listing 3-7-1, to serve as a third kind of data source in this chapter.
# (The imports and the seed line of Listing 3-7-1 are not shown in this
# excerpt; they are re-added here per the description below.)
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed for reproducibility
students_data = {
    'Name': [f'Student{i}' for i in range(1, 21)],
    'Height': np.random.uniform(150, 190, 20),  # heights in centimeters
    'Weight': np.random.uniform(45, 90, 20),    # weights in kilograms
    'Age': np.random.randint(18, 25, 20)        # ages in years
}
students_df = pd.DataFrame(students_data)
# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Age (years)')
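Only the colorbar fragment above survives from Listing 3-7-1's plotting code in this excerpt; a self-contained sketch of the whole chart, consistent with the description below (the exact bubble-size scaling is an assumption):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
scatter = plt.scatter(students_df['Height'], students_df['Weight'],
                      s=students_df['Age'] * 500 / students_df['Age'].max(),
                      c=students_df['Age'], cmap='viridis', alpha=0.7)
# Annotate each bubble with the student's name
for _, row in students_df.iterrows():
    plt.text(row['Height'], row['Weight'], row['Name'], fontsize=8)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Class Height, Weight, and Age')

# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Age (years)')
plt.show()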
Listing 3-7-1 starts with setting the random seed for reproducibility
using np.random.seed(42). Making the seed fixed in the randomization
means that every time we recompile and re-run the code, the same
random values will be generated. But if we do not specify a seed, every
time we re-run the cell, new random values will be generated. Then, use
np.random.uniform function to generate random data for 20 students.
The dataset includes columns for student names, heights, weights, and
ages. Then, create a bubble chart using Matplotlib. As always, the
plt.figure(figsize) call sets the figure size.
Now, the surprise is using plt.scatter function again. Yes, we use
it for drawing scatter and bubble charts. In the scatterplot, the x axis
represents heights, the y axis represents weights, the bubble sizes are
proportional to ages scaled by a factor of 500, and the colors represent
ages. The cmap='viridis' specifies the color map, and alpha=0.7 sets the
transparency of the bubbles.
Then use the plt.text function to add annotations for each point on
the chart, displaying the student names at the corresponding positions.
The code adds labels to the x axis, y axis, and a title to the chart.
A colorbar legend is added to the chart to represent the ages of
students. The cbar.set_label function sets the label for the colorbar.
Finally, plt.show() is used to display the created bubble chart. The output
is Figure 3-19.
# Add colorbar
cbar = plt.colorbar(scatter)
cbar.set_label('Age (years)')
The output of Listing 3-7-2 is Figure 3-20. Now, the chart clearly
displays the class outliers. For example, Student 5 in a purple bubble is
the smallest student in the class vs. the biggest student, Student 4, in a
yellow one.
Summary
This chapter digs deeper into data visualization. It starts with the most
essential plots, such as scatterplots and histograms. Moreover, it shows
how to overlay plots on each other, changing the colors and
markers to ease comparison of the findings. Data visualization in
Python is one of the most powerful features and has plenty of options to
customize plots.
CHAPTER 4
Statistical Analysis and Linear Models
In Chapter 3, we explored graphs and data visualization. In this chapter,
we will explore statistics and linear models.
Statistical Analysis
After reading datasets, there are standard statistical analysis tasks that should
be performed to understand the data and to be included in the reports.
In this chapter, we will discuss frequency tables, summary statistics,
correlation analysis, and T-tests.
Frequency Tables
The frequency tables task is usually one of the first steps when exploring any
dataset. It generates frequency tables from your data showing the unique
levels of each variable. Also, the table can help you identify the outliers.
For this example, we explore the HEART.csv in the Datasets Folder
to see how many people have high cholesterol. To do that, we focus on
CHOL_STATUS, which is a string variable. The code is in Listing 4-1 in
Chapter 4.ipynb in the Example Code Folder.
In Listing 4-1, we learn how to generate a frequency table and plot the
frequency distribution in a bar chart.
import pandas as pd
import matplotlib.pyplot as plt
After importing the required libraries and reading the CSV file into a
Pandas DataFrame, heart_df, we calculate the frequency table. This code line,
frequency_table = heart_df['Chol_Status'].value_counts().reset_index().rename(columns={'index': 'Chol_Status', 'Chol_Status': 'Frequency'}),
performs the following:
• heart_df['Chol_Status'].value_counts(): This calculates the
frequency of unique values in the 'Chol_Status' column of the
DataFrame heart_df.
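Only the imports of Listing 4-1 appear in this excerpt; a sketch of the remaining steps (the file path is assumed; the rename chain is quoted from the text and matches older Pandas versions, where reset_index() yields an 'index' column):

heart_df = pd.read_csv('../Datasets/Chapter 4/HEART.csv')

# Frequency table of cholesterol status
frequency_table = heart_df['Chol_Status'].value_counts().reset_index().rename(
    columns={'index': 'Chol_Status', 'Chol_Status': 'Frequency'})
print(frequency_table)

# Plot the frequency distribution as a bar chart
plt.bar(frequency_table['Chol_Status'], frequency_table['Frequency'])
plt.xlabel('Chol_Status')
plt.ylabel('Frequency')
plt.show()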
Summary Statistics
The summary statistics task provides descriptive statistics for variables
across all observations and within groups of observations. You can also
summarize your data in a graphical display, such as a histogram or a
box plot.
For this example, we review Maine’s past elections to see if there are
any predictable outcomes (patterns). The dataset is in the Datasets folder
and has the name Maine_Past_Elections.xlsx. The contents of the Excel file
are as shown in Table 4-1.
Let us read the data and compute the summary statistics for both
columns of Democrats and Republicans, D and R variables in the dataset,
as in Listing 4-2.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
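The rest of Listing 4-2 is not shown in this excerpt; a sketch matching the description that follows (the file path is assumed):

past_elections_df = pd.read_excel('../Datasets/Chapter 4/Maine_Past_Elections.xlsx')

# Summary statistics for the Democratic and Republican vote columns
print(past_elections_df['D'].describe())
print(past_elections_df['R'].describe())

# Count missing values per column
print(past_elections_df.isnull().sum())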
After importing the libraries and loading the dataset, we compute and print
descriptive statistics for columns D and R in the DataFrame past_elections_
df. The .describe() method generates summary statistics such as count,
mean, standard deviation, minimum, and maximum for the specified column.
The .isnull() method checks for missing values in the DataFrame
past_elections_df, which returns a DataFrame of Boolean values
indicating the presence of missing values. The .sum() method then
calculates the sum of missing values for each column and prints the result.
The output is Figure 4-3. As shown, there are no null values in the dataset.
Correlation Analysis
Before explaining the correlation concepts, we want to establish the
difference between causation and correlation. Causation is when one thing
causes another thing to happen, while correlation is when two or more
things appear to be related. Correlation does not always mean causation.
For example, travel causes people to buy suitcases and tickets; hence,
suitcase sales and ticket sales are correlated, as shown in Figure 4-4.
Listing 4-3. Import the required libraries, load the dataset, and
explore it
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the CARS dataset (this line is not shown in the excerpt; the path is assumed)
cars_df = pd.read_csv('../Datasets/Chapter 4/CARS.csv')

# Selecting variables
variables = cars_df[['Weight', 'Length', 'Horsepower']]

plt.figure(figsize=(12, 6))
plt.scatter(cars_df['Weight'], cars_df['Horsepower'])
plt.xlabel('Weight')
plt.ylabel('Horsepower')
plt.figure(figsize=(12, 6))
plt.scatter(cars_df['Length'], cars_df['Horsepower'])
plt.xlabel('Length')
plt.ylabel('Horsepower')
• https://en.wikipedia.org/wiki/United_States_presidential_election_in_Maine,_2000
• http://www.270towin.com/states/Maine
• https://www.pressherald.com/2016/11/08/mainers-head-to-polls-in-historic-election/
• https://www.mainepublic.org/politics/2016-11-08/maine-voter-turnout-extremely-heavy
• https://www.csmonitor.com/USA/Elections/2012/1106/Voter-turnout-the-6-states-that-rank-highest-and-why/Maine
• https://www.census.gov/history/pdf/2008presidential_election-32018.pdf
• https://bipartisanpolicy.org/report/2012-voter-turnout/
• https://www.census.gov/content/dam/Census/library/publications/2008/demo/p20-557.pdf
• https://www.census.gov/content/dam/Census/library/publications/2006/demo/p20-556.pdf
• https://www.census.gov/content/dam/Census/library/publications/2002/demo/p20-542.pdf
• https://www.deptofnumbers.com/gdp/maine/
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
correlation = pd.read_excel('../Datasets/Chapter 4/correlation.xlsx')
print(correlation.head())
print(correlation.describe())
plt.figure(figsize=(12, 6))
plt.scatter(correlation['gdp'], correlation['voterturnout'])
plt.xlabel('gdp')
plt.ylabel('voterturnout')
Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences about
a population parameter based on sample data. It involves comparing
observed data to an expected or null hypothesis to determine whether any
observed differences or relationships are statistically significant or simply
due to random chance.
Examples of hypothesis testing for continuous data analysis are the T-test
and ANOVA, while the chi-square test is for categorical data analysis.
T-Test
Frequently, we seek to ascertain whether there exists a genuine disparity in
means between two distinct groups or if the perceived distinction is merely
a result of random variation. The T-test serves this purpose by gauging the
likelihood of disparity between two datasets. In other words, it assesses the
mean values of two samples. It is worth noting that the T-test is employed for
small sample sizes since they may not adhere to the normal distribution.
One-Sample T-Test
In this example, I gathered all the voting data from past presidential
elections in Maine for candidates representing both major political parties
across various years. This dataset served as the basis for predicting the
election outcomes in Maine for both the 2016 and 2020 elections.
The dataset, MaineVotesDR.xlsx, comprises four columns: Election
Year, Democratic Votes, Republican Votes, and Result. The Result
column records a binary outcome: 0 signifies victory for the Republican
candidate, while 1 indicates success for the Democratic candidate.
Our hypothesis is that the Republican candidate will be victorious in
the 2020 presidential elections in Maine, while the alternative hypothesis
suggests the victory of the Democratic candidate.
Null hypothesis (H0): result = 0
Alternative hypothesis (Ha): result = 1
Now, let us load the dataset and explore how the T-test can shed light
on the outcomes of the 2020 election, as shown in Listing 4-5.
import pandas as pd
from scipy import stats
maine_votes_df = pd.read_excel('../Datasets/Chapter 4/MaineVotesDR.xlsx')
# Test for normality
normality_test = stats.shapiro(maine_votes_df['Result'])
print("Normality Test:")
print(normality_test)
# t-test
t_stat, p_value = stats.ttest_1samp(maine_votes_df['Result'], 0)
print("\nT-test:")
print("T-statistic:", t_stat)
print("P-value:", p_value)
The p-value = 0.0004. Since the p-value is less than 0.05, we can
reject the null hypothesis. Hence, we can conclude that the Republican
candidate will lose in Maine in the 2020 election. He did lose Maine.
Two-Sample T-Test
In this example, the question is: Is there a significant difference in height
between the two genders?
The null hypothesis is that there is no difference in height. The
alternate hypothesis is that there is a difference in height based on gender.
In this example, load the CLASS dataset, as in Listing 4-6-1.
import pandas as pd
import numpy as np
from scipy import stats

# Load the CLASS dataset (this line is not shown in the excerpt; the path is assumed)
class_df = pd.read_csv('../Datasets/Chapter 4/CLASS.csv')
# t-test by Gender
# comparing males to females
male_heights = class_df[class_df['Gender'] == 'M']['Height']
female_heights = class_df[class_df['Gender'] == 'F']['Height']
t_stat, p_value = stats.ttest_ind(male_heights, female_heights)
print(f"\nT-test between male and female Heights:")
print(f"T-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
Figure 4-11. Box plot and line plot for the height distribution by Gender
Summary
This chapter explains the most crucial statistical concepts that will be
needed for almost all analytics and data science reports. They are the
frequency tables, summary statistics, correlation analysis, and T-tests. In
the next chapter, we shall learn the operators and loops.
CHAPTER 5
Data Preprocessing and Feature Engineering
This chapter shows some essential data cleaning and querying methods:
the comment statement, arithmetic and comparison operators, how to
represent missing values, and loops.
Comment Statement
It is crucial to document your code so you and your peers can understand
the reasoning behind how the analysis is constructed. In general, if you
do not document your code, you might look at it after a few months and
not remember why you used a certain function or why you made a certain
check. Hence, code documentation is crucial to follow the logic flow.
The comment statement in Python can be written in two ways, as
follows:
#message
"""message"""
Operator   Description        Example           Result
+          Addition           2 + 3             5
+=         Add, assign        X = 5; X += 15    X = 20
-          Subtraction        5 - 3             2
-=         Subtract, assign   X = 5; X -= 15    X = -10
/          Division           5 / 2             2.5
/=         Divide, assign     X = 15; X /= 5    X = 3.0
//         Quotient           5 // 2            2
%          Remainder          5 % 2             1
*          Multiplication     2 * 3             6
*=         Multiply, assign   X = 15; X *= 5    X = 75
**         Exponentiation     2 ** 3            8
We will cover some of the methods to represent and handle the missing
values in your dataset later in this chapter and in the next one.
Python follows the standard order of operations (see the sketch after
this list):
• E: Exponentiation
• M: Multiplication
• D: Division
• A: Addition
• S: Subtraction
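As a quick sketch of this order of operations:

# Exponentiation binds tightest, then multiplication/division, then addition/subtraction
print(2 + 3 * 2 ** 2)    # 2 + (3 * 4) = 14
print((2 + 3) * 2 ** 2)  # parentheses override the default order: 20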
Comparison Operators
Comparison operators set up a comparison, operation, or calculation with
two variables, constants, or expressions. If the comparison is true, the
result is True. If the comparison is false, the result is False. Comparison
operators can be expressed as in Table 5-2.
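The table itself is not shown in this excerpt; a small sketch of the common comparison operators:

x, y = 5, 3
print(x == y)  # False: equal to
print(x != y)  # True: not equal to
print(x > y)   # True: greater than
print(x < y)   # False: less than
print(x >= y)  # True: greater than or equal to
print(x <= y)  # False: less than or equal to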
The Python interpreter converts missing-value markers such as null, NA,
nan, and NaN to NaN and handles them as missing values. Please pay
attention to the case, as the following values are not interpreted as
missing: None, none, na, Null.
Let us test these values and see with an example how the Python
interpreter computes them. In Listing 5-1, we test all the mentioned
values above and some of the missing values’ representations of other
statistical programming languages. For example, “.” represents the missing
values in SAS.
Listing 5-1. Try different values to represent missing values, and see
which ones the Python interpreter is able to read correctly
#import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#create a new empty dataframe that has the Height and Missing variables
From Figure 5-2, we verified that Python only accepts null, empty
values, NA, nan, and NaN to represent the missing values in raw data.
Loops
There are two types of loops in Python for iteration: while and for loops.
While Loop
For this example of the while loop, we keep looping, asking the user to
input numbers, and the program will output whether each number is even
or odd. Again, the most important thing to think about when using the
while loop is the exiting or breaking condition, in other words, when the
loop will end; otherwise, it is an infinite loop and will generate an error. For
this example, the exit condition is when the user enters zero. The program
will then display an exit message and terminate.
The loops in Python have a unique feature that is not in other
languages, which is using else with the while and for loops. In this
example, we use the while-else to display the exit message, as in
Listing 5-2.
#input the first number from the user before the loop
input_no = int(input('Enter your number (0 to exit): '))
#check whether the input value equals zero; if not, the while loop is entered
while input_no != 0:
    #check whether the remainder after dividing by 2 is zero
    if input_no % 2 == 0:
        #if the remainder is zero, the input number is even
        print("your input is even")
    else:
        print("your input is odd")
    #ask the user to enter a new number
    input_no = int(input('Enter your number (0 to exit): '))
else:
    #it keeps looping till the user enters zero, then prints this message and exits
    print("Bye and Have a great day!")
In Listing 5-2, we start with asking the user to enter the first value. The
program checks in the while condition if it equals zero or not. If the input
is not equal to zero, it enters the loop and checks the modulus of the
number using the if condition. If the remainder equals zero, then it is an
even number; otherwise, it is odd.
If the input number equals zero, it goes to the else of the while loop.
The program prints the message and exits. Figure 5-3 shows a sample
output of the listing.
For Loop
The for loop is more common than the while loop and is used to loop
over containers, DataFrames, lines in a file, and more. In this section, we
have some examples to show the usage of the for loop. The first example
is the most standard looping example of printing a sequence of numbers
and printing their sum. The second one is for looping over the letters of a
word. The third one is looping over a list, as an example of the container
datatypes. The last example is looping over a dataset to replace the null
values with zeros.
The for loop in Python has a different syntax than the other
programming languages. The syntax is as follows:
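The syntax block and the first example do not appear in this excerpt; here is the general form, followed by a minimal sum-of-a-sequence sketch:

# General syntax:
# for <variable> in <iterable>:
#     <body>

total = 0
for i in range(1, 6):  # loops over 1, 2, 3, 4, 5
    print(i)
    total += i
print("Sum:", total)   # prints: Sum: 15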
We slice the string and use -1 as the step to reverse it. The slicing
square-bracket syntax is as follows: [start:end:step].
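The listing itself is not shown in this excerpt; a sketch of reversing a string with a -1 step, printing one letter per line as in Figure 5-5:

p = 'Python'
for c in p[::-1]:  # the -1 step walks the string backward
    print(c)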
Figure 5-5 shows that when we used step=-1, it started from the last
letter and reversed the word's letters. Also, it is worth mentioning that the
print() function in Python prints each iteration on a new line by default.
It is equivalent to the println() function in languages such as Java. Later
in Listing 5-6, we will see how to print all the iterations' output on the
same line.
In the following example, we re-use the range() function for printing
the string in a reverse order. In Listing 5-5, we initialize a variable p with
the string. Then, we loop in the range starting from the length of the string
to zero with a step of -1 to reverse the order. Remember that the range
loops from the start number to the end number -1; so, if you put 1 instead
of zero, the loop won’t print the first letter of the string. However, feel
free to play with the values to understand how it works. The output is as
Figure 5-5.
p = 'Python'
for i in range(len(p), 0, -1):
    print(p[i-1])
The output of Listing 5-5 is the same as that of the previous listing,
Listing 5-4. Both produce the output in Figure 5-5.
my_list = ['I', 'Love', "Python"]
for c in my_list:
    print(c, end="^")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# (The DataFrame construction is not shown in the excerpt;
# a hypothetical example with null values)
students_df = pd.DataFrame({'Name': ['Amy', 'Ben', 'Cal'],
                            'Grade': [90, np.nan, 85],
                            'Age': [np.nan, 20, 21]})
print(students_df, '\n\n\n')
for i in students_df.columns:
    students_df[i] = students_df[i].fillna(0)
print(students_df)
Figure 5-7. The dataset before and after handling the missing values
Summary
This chapter starts with how to make a comment in Python programs and
how to use the arithmetic, comparison, logical, membership, and identity
operators. Further, we discussed the different values that you can use to
represent missing values in your datasets, so the Python interpreter reads
them correctly. Moreover, we discussed the loops in detail with plenty of
examples. There are the while and for loops. At the end of the chapter, we
learned one of the ways of how to handle the missing values in Python. In
the next chapter, we shall learn how to prepare for analysis and learn the
essential conditional statement.
CHAPTER 6
Preparing Data for Analysis
As we mentioned earlier, in any programming language, you learn the data
types, operators, IF condition, and loops. We covered all these topics in
Chapters 2 and 5, except the IF condition statements. In this chapter, we
learn about them and more advanced data preparation and processing
techniques.
Rename
In data analysis, an essential first step is to understand the variables of
your dataset. You need to identify which variables are dependent and
independent, as well as distinguish between numeric/continuous and
character variables. Typically, your client provides this information in a
text file accompanying the dataset. This text file, known as the dictionary,
explains the variable names and their types. For instance, without the
dictionary, variable names like YOB, ENROLL, and DT_ACCEPT might be
unclear. The dictionary clarifies that YOB stands for Year Of Birth.
To add these descriptive labels to a dataset, we use a process of
renaming columns. The dataset is in the Datasets Folder and is named
Voters_A.csv. After uploading and importing the file, we rename the
dataset to Voters. Note that this dataset is synthetic, created using the
variable names provided by the Secretary of Maine, but it does not contain
any real voter information.
import pandas as pd
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print(df.head())
# Renaming the columns to add labels
df = df.rename(columns={
    'FIRST_NAME': 'Name',
    'YOB': 'Year Of Birth',
    'ENROLL': 'Enrollment Code',
    'DESIGNATE': 'Special Designations',
    'DT_ACCEPT': 'Date Accepted (Date of Registration)',
    'CG': 'Congressional District',
    'CTY': 'County ID',
    'DT_CHG': 'Date Changed',
    'DT_LAST_VPH': 'Date Of Last Statewide Election with VPH'
})
Format
One of the most popular data exercises in healthcare is calculating the
time a patient spends at a clinic or hospital and, based on historical data,
making predictions of the patients' arrival times to decrease the waiting
time. In this example, Listing 6-2, we make up a dataset with patients' ID,
age, arrival date, arrival time, leaving date, leaving time, and visit fees.
Then, we compute the time the patients spent inside the clinic and print
a report of this duration and of the fees with a dollar sign and commas in
the thousands. This example aims to show how Python handles date and
time processing and currency formatting.
import pandas as pd
from datetime import datetime
# Sample DataFrame
data = {
    'PatientID': [1, 2, 3],
    'Age': [25, 34, 50],
    'ArrivingDate': ['2024-06-10', '2024-06-11', '2024-06-12'],
    'ArrivingTime': ['09:00', '10:30', '14:45'],
    'LeavingDate': ['2024-06-10', '2024-06-11', '2024-06-12'],
    'LeavingTime': ['10:00', '11:15', '15:30'],
    'VisitFees': [100.0, 150.75, 200.5]
}
df = pd.DataFrame(data)
print("Datatypes of raw data:")
print("\n",df.dtypes)
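# The listing continues by combining the date and time columns and
# computing the visit duration, as described below (a minimal sketch)
df['ArrivingDateTime'] = pd.to_datetime(df['ArrivingDate'] + ' ' + df['ArrivingTime'])
df['LeavingDateTime'] = pd.to_datetime(df['LeavingDate'] + ' ' + df['LeavingTime'])
df['VisitDuration'] = df['LeavingDateTime'] - df['ArrivingDateTime']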
dummy=df[['VisitDuration', 'LeavingDateTime', 'ArrivingDateTime']]
print("\nComputing the visit duration:")
print("\n",dummy.head())
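# The two formatting helpers described below; a minimal sketch
# (the exact output strings are an assumption)
def format_duration(duration):
    # Convert the Timedelta into an hours-and-minutes string
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return '{}h {}m'.format(hours, minutes)

def format_fees(fees):
    # Dollar sign, thousands separators, two decimal places
    return '${:,.2f}'.format(fees)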
df['FormattedDuration'] = df['VisitDuration'].apply(format_duration)
df['FormattedFees'] = df['VisitFees'].apply(format_fees)
print("\n", df[['PatientID', 'Age', 'FormattedDuration', 'FormattedFees']])
Let us explain the code in Listing 6-2. First, we create a DataFrame with
columns: PatientID, Age, ArrivingDate, ArrivingTime, LeavingDate,
LeavingTime, and VisitFees, and initialize it with dummy values.
Then, we use the pd.to_datetime() method to combine
ArrivingDate and ArrivingTime into a single ArrivingDateTime column.
Similarly, combine LeavingDate and LeavingTime into a LeavingDateTime
column. To calculate the duration of the visit, subtract ArrivingDateTime
from LeavingDateTime to compute the VisitDuration. We will next write
a function to format this VisitDuration variable.
To format the duration, we define a function using the def keyword
that takes the duration as an argument and returns it formatted. The
format_duration(duration) converts the duration into a string format
of hours and minutes, as in the output in Figure 6-2. Please check the
following link to learn more about formatting dates: https://www.w3schools.com/python/python_datetime.asp.
To format the currency, we define another function,
format_fees(fees), to format the fees with a dollar sign and commas
using '${:,.2f}'.format(fees), where $ sets the currency symbol, :,
is the thousands separator, and .2f is the floating-point precision,
setting it to two decimal places. Here is a link to learn more about currency
formatting: https://www.geeksforgeeks.org/how-to-format-numbers-as-currency-strings-in-python/.
Finally, we specify the columns to print the report, as in Figure 6-2.
import pandas as pd
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("Dataframe Shape (rows, columns):",df.shape)
print("\nRaw data-first 5 rows:\n",df.head())
print("\nDatatypes of raw data:\n")
print("\n",df.dtypes)
#Typecasting-converting to numeric
df['YOB_numeric'] = pd.to_numeric(df['YOB'], errors='coerce')
print("\nDatatypes after typecasting:\n")
print("\n",df.dtypes)
# Calculate age using 'YOB_numeric'
age = 2024 - df['YOB_numeric']
print("\nage:\n",age.head())
df['age']=age
print("\nDataframe's first 5 rows after creating new
variables:\n",df.head())
print("\nDataframe Shape after adding new columns to it (rows,
columns):",df.shape)
As always, after loading the libraries and the dataset, you verify that it
has been loaded correctly by printing the dataset dimensions using the
df.shape property, the first five rows using the df.head() method, and the
datatypes of the variables using the df.dtypes property to check if we
need any typecasting before performing the mathematical operations.
As shown in Figure 6-3, the dimensions of the dataset are 100 rows
and 9 columns. However, the datatypes of all the columns are object. This
output indicates that there are empty strings or non-numeric values in
the YOB column that cannot be directly converted to integers. Therefore,
we will need to change the datatype of the YOB to numeric to be able to
perform an arithmetic operation over it. As in Listing 6-3, we use the
pd.to_numeric() method to do the typecasting with the errors='coerce'
parameter, which will convert invalid parsing to NaN. This identification
allows us to fill these NaN values with a default value before calculating the
age. However, in this exercise, we will leave it as NaN.
If you would like to fill the NaN values in YOB_numeric with a default
value (e.g., 2006 to get the minimum voting age of 18 years old or any
appropriate value), you can use df['YOB_numeric'] =
df['YOB_numeric'].fillna(2006). It will replace the NaN values.
Afterward, we print the datatypes again to check the new variable
YOB_numeric's datatype. As in Figure 6-3, it is float64. Now, it is a numeric
datatype, and we can perform the subtraction to compute age. Then, we
print the first five rows of age using the age.head() method. To add this
Series as a new column to the DataFrame df, we use df['age']=age. For final
verification, we print the head and the dimensions of the DataFrame to
find that the columns increased from 9 to 11.
import pandas as pd
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("\nRaw data-first 5 rows:\n",df.head())
df['YOB_numeric'] = pd.to_numeric(df['YOB'], errors='coerce')
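# A sketch of the rest of the listing: compute age, then reorder the
# columns so that age comes right after YOB, as in Figure 6-4
# (the exact reordering code is an assumption)
df['age'] = 2024 - df['YOB_numeric']
cols = list(df.columns)
cols.insert(cols.index('YOB') + 1, cols.pop(cols.index('age')))
df = df[cols]
print("\nDataframe after reordering:\n", df.head())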
Figure 6-4. Voters dataset after reordering its variables to have age
after YOB
IF Statement
The age column has a wrong value for one of the voters, whose computed
age is 344 years. If this dataset were real, we would have several
options to correct it.
import pandas as pd
import numpy as np
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("\nRaw data-first 5 rows:\n",df.head())
df['YOB_numeric'] = pd.to_numeric(df['YOB'], errors='coerce')
df['age'] = 2024 - df['YOB_numeric']
for x in df.index:
    if df.loc[x, "age"] > 150:
        print("\nThe rows with the wrong ages:\n",df.loc[x])
        df.loc[x, "age"] = np.nan
        print("\nReplacing the wrong ages with NaN:\n",df.loc[x])
IF-ELIF Statements
Now, we will analyze the number of voters in the following age groups:
18–24, 25–29, 30–39, 40–49, 50–64, and 65+.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("\nRaw data-first 5 rows:\n",df.head())
df['YOB_numeric'] = pd.to_numeric(df['YOB'], errors='coerce')
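# A sketch of the missing step: compute age, then assign each voter an
# age group with IF-ELIF statements (the exact implementation is an
# assumption; the bins follow the groups listed above)
df['age'] = 2024 - df['YOB_numeric']
def age_group(age):
    if pd.isna(age):
        return None
    elif age < 25:
        return '18-24'
    elif age < 30:
        return '25-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 65:
        return '50-64'
    else:
        return '65+'
df['age_group'] = df['age'].apply(age_group)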
print("\nchecking age_group:\n",df.head())
Drop a Row
As mentioned in the "IF Statement" section, there are various ways to
handle the wrong data. In the previous section, we saw how to replace
them with missing values. In this section, we will learn how to drop the
whole row.
import pandas as pd
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("Dataframe Shape (rows, columns):",df.shape)
print("\nRaw data-first 5 rows:\n",df.head())
df['YOB_numeric'] = pd.to_numeric(df['YOB'], errors='coerce')
rows_with_errors = df[df['YOB_numeric'].isna()]
print("\nRows with coercing errors in 'YOB':")
print(rows_with_errors)
df.drop([83],axis=0,inplace=True)
rows_with_errors = df[df['YOB_numeric'].isna()]
print("\nChecking if the row with errors was deleted or not:")
print(rows_with_errors)
print("Further verification by checking the Dataframe Shape (rows, columns):",df.shape)
Listing 6-7 starts by loading the libraries and the dataset, as always.
Then, we print the dataset dimensions, shown in Figure 6-7, as 100
rows and 9 columns. Then, we print the first five rows using the
df.head() method.
Afterward, we typecast YOB to a new numeric column using the
pd.to_numeric() method. We subset the rows with errors using the
.isna() method.
Drop a Column
The DROP method specifies excluding some variables from the dataset. In
the Voters_A.csv dataset, the Designate column is empty. In cleaning
the data, we should delete this column; in other words, drop it, as in
Listing 6-8.
import pandas as pd
import numpy as np
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
print("The size of the dataframe:",df.shape)
df.replace(to_replace=[r'^\s*$', '', None], value=np.nan, regex=True, inplace=True)
print(df.isnull().sum())
After loading the file correctly using the pd.read_csv() method, we verified
the DataFrame size, which is shown in Figure 6-8 as (100,9).
Sometimes, empty cells might not be read as empty strings ('') but
as None, whitespace characters, or other representations. To handle the
various types of empty values and replace them with NaN, we can
use df.replace(to_replace=[r'^\s*$', '', None], value=np.nan,
regex=True, inplace=True). This method replaces all representations
of empty values with NaN, using a RegEx (Regular Expression) pattern,
r'^\s*$', to match any string that consists only of whitespace characters.
It also replaces empty strings and None.
The df.isnull().sum() counts the NaN values in each column. The
output, as in Figure 6-8, shows that all the rows in the Designate column
are NaN. We verify that using the .head() method.
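A minimal sketch of the drop step itself, assuming the empty column
carries its raw name, DESIGNATE, as in the original file:
df.drop(columns=['DESIGNATE'], inplace=True)
print("The size of the dataframe after dropping the column:", df.shape)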
Subset
There are multiple ways to subset the Pandas DataFrames in Python.
One of them is Boolean indexing. In this example, we continue using the
dataset of the previous sections, Voters_A.CSV. We want to create two new
datasets as subsets of the original, where the first one contains only voters
who are enrolled as Democrats and the other includes voters who are
enrolled as Republicans.
import pandas as pd
import numpy as np
df = pd.read_csv('../Datasets/Chapter 6/Voters_A.csv')
democrats=df[df['ENROLL']=="D"]
print(democrats.iloc[:,:6].head())
print(democrats.shape)
republicans=df[df['ENROLL']=="R"]
print(republicans.iloc[:,:6].head())
print(republicans.shape)
frequency_table = df['ENROLL'].value_counts().reset_index().rename(columns={'index': 'ENROLL', 'ENROLL': 'Frequency'})
frequency_table = frequency_table.sort_values(by='Frequency', ascending=False)
print(frequency_table)
In Listing 6-9, after importing the library and loading the dataset, we
use df[df['ENROLL'] == "D"] to filter the DataFrame to include only the
rows where the ENROLL column is equal to "D." This subset is stored in the
democrats DataFrame.
Append
Appending datasets is the same as vertical merging, which means
adding rows. In Python, you can merge DataFrames both vertically and
horizontally using the pandas library. In this section, we will append the
two DataFrames, democrats and republicans, that we created in the
previous section. In the next section, we will have an example of horizontal
merging.
You must run Listing 6-9 first to load the Democratic and Republican
datasets before appending them using Listing 6-10.
import pandas as pd
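# A minimal sketch of the append step with pd.concat(), assuming the
# democrats and republicans DataFrames from Listing 6-9 are in memory
appended = pd.concat([democrats, republicans], ignore_index=True)
print("Appended shape (rows, columns):", appended.shape)
print(appended.head())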
Merge
In the previous section, we learned how to append two datasets and
increase the rows. In this section, we will learn how to merge two datasets
over one common variable, where one dataset has many variables and the
other has only a few.
In this example, we will use the datasets of a Kaggle competition
posted by Zillow to predict house prices. The contest has two datasets.
The Many dataset is loaded from a CSV file called properties_2016.csv,
which contains 58 columns and about three million rows. The Few dataset
is loaded from another CSV file called train_2016_v2.csv, which contains
three columns and more than 90K rows. The common variable between
both files is parcelid. You can download the original datasets from this
link: https://www.kaggle.com/competitions/zillow-prize-1/data.
As the sizes of these files are enormous, I selected only 3000 rows from
the first file and 1000 rows from the second file and named them many.csv
and few.csv. You will find them in the Datasets Folder.
import pandas as pd
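# A sketch of the loading step (the file locations are assumed from the
# text; the files are in the Datasets Folder)
file1 = pd.read_csv('../Datasets/Chapter 6/many.csv')
file2 = pd.read_csv('../Datasets/Chapter 6/few.csv')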
file1 = file1.drop_duplicates(subset='parcelid')
file2 = file2.drop_duplicates(subset='parcelid')
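# Merge the two files over the common key; pd.merge() defaults to an
# inner join (a sketch)
zillow_merged = pd.merge(file1, file2, on='parcelid')
print("Merged shape (rows, columns):", zillow_merged.shape)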
If you want to include all rows from file1 regardless of whether there
is a matching parcelid in file2, you can use other types of joins, such
as a left join. The statement will be zillow_merged = pd.merge(file1,
file2, on='parcelid', how='left'). You will have NaN for the
nonmatching columns. The merged DataFrame, in that case, will have
approximately the same number of rows as the Many file, which is 3000 rows.
Merging in Python is easy and does not require sorting by the key
identifier, as some other languages do.
The output is shown in Figure 6-11.
Summary
This chapter explains the steps to clean and prepare data for analysis. It
discusses how to rename variables and adjust the reports' formatting.
It also explains how to create new variables and how to rearrange them in
datasets. It explains the IF and IF-ELIF statements. Moreover, it describes
how to drop a row and a column, subset a dataset, and append and merge
multiple datasets. Now, we are done with the most common data analysis
tasks, and the next chapter will introduce predicting the future using
regression.
CHAPTER 7
Regression
The linear regression task fits a linear model to predict a single continuous
dependent variable from one or more continuous or categorical predictor
variables. This task produces statistics and graphs for interpreting the
results.
Simple Linear Regression
The simple linear regression model equation is as follows:
Y = a + bX
where Y is the dependent variable whose value we want to predict, a is the
intercept, b is the coefficient, and X is the independent variable. The
error term is ignored.
The null hypothesis is that there is no linear relationship between the
variables. In other words, the value of b is zero. The alternative hypothesis
is that there is a linear relationship, and b is not equal to zero.
H0: b = 0
Ha: b ≠ 0
There are several libraries that you can use in Python to perform linear
regression. Two of the popular libraries are statsmodels and scikit-learn.
The choice between using Ordinary Least Squares (OLS) from
the statsmodels library and linear regression from scikit-learn
(sklearn) depends on your specific needs and the context of your
analysis. Both libraries offer linear regression functionalities, but they
serve somewhat different purposes and have different features.
If you are primarily interested in statistical analysis, hypothesis testing,
and detailed diagnostics, and you don't need the model for broader
machine learning tasks, statsmodels might be a better fit.
If you are more focused on predictive modeling, integration with other
machine learning tools, and simplicity, scikit-learn is a solid choice.
In some cases, analysts use both libraries: statsmodels for initial
exploration and hypothesis testing and then scikit-learn for building
a predictive model that can be easily integrated into a broader machine
learning workflow. In our code, we shall include both scikit-learn
and statsmodels, followed by diagnostic plots using seaborn to use the
advantages of each one.
In the following example, we shall use the scikit-learn library for the
model fitting and the fit diagnostics diagrams.
The steps to perform simple linear regression with its fit diagnostics are
as follows:
1. Import the required libraries, load the dataset, and explore it.
2. Specify the independent and target variables, create the model, and fit the data to it.
3. Test the model, and compute the predictions and accuracy scores.
4. Plot five fit diagnostics for the oil price per barrel in
the presidential election years.
In this step, we import the required libraries, load the dataset, and explore
it. To explore the data more, we plot the multiple pairwise bivariate
distributions of the variables.
Listing 7-1-1. Import the required libraries, load the dataset, and
explore it
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
# importing r2_score module
from sklearn.metrics import r2_score
df = pd.read_excel('../Datasets/sp_oil_gold.xlsx')
print(df.head())
df=df[['oil price','gold_prices']]
print(df.head())
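# Plot the pairwise bivariate distributions of the two variables
# (a sketch of the exploration step described above)
sns.pairplot(df)
plt.show()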
To verify that the data has been loaded correctly, we print the first five
rows of the dataset using the df.head() method. We keep only the oil
price and the gold_prices variables because simple linear regression
needs only one independent variable, and we drop the rest of the columns.
We do that with the statement:
df=df[['oil price','gold_prices']]
Figure 7-1 shows the output of Listing 7-1-1. The figure first shows
the first five rows of all the columns in the dataset. Then, it shows the
histogram of the oil price and the scatterplot of the oil price on the
y axis against the gold_prices on the x axis. In the second row of the
pairwise distribution figure, the scatterplot is repeated with the
gold_prices on the y axis and the oil price on the x axis, and the second
row and column is the gold_prices histogram. In both scatterplots, there
are three clear outliers, whose cause we will discuss later in this chapter.
In this step, we specify the independent and target variables. Because the
dataset is small, we shall skip the train/test splitting step and use the
whole dataset for both training and testing. Then, we create the model and
fit the data to it. From the parameter estimates, we write the line
equation. Finally, we plot the fitted line overlaid on the scatterplot of
the variables.
x=df['gold_prices'].values.reshape(-1, 1)
y=df['oil price'].values
# Create the model and fit the data to it
LR = LinearRegression()
LR.fit(x, y)
#parameter estimates
print('intercept:', LR.intercept_)
print('slope:', LR.coef_)
print('equation of oil prices in the presidential election years=',LR.intercept_,"+",LR.coef_[0],"*gold_prices")
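# Plot the fitted line overlaid on the scatterplot of the variables
# (a sketch of the plotting step described above)
plt.figure(1)
plt.scatter(x, y)
plt.plot(x, LR.predict(x), color='red')
plt.xlabel('gold_prices')
plt.ylabel('oil price')
plt.show()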
Figure 7-2 shows that there is a linear relationship between oil and
gold prices. The slope of the line = 0.05965. From the parameter
estimates, we can write the equation of the line, as shown in the figure.
Finally, we evaluate the model using R-squared, mean squared error, and
root mean squared error.
Listing 7-1-3. Test the model, and compute the predictions and
accuracy score.
#Listing 7-1-3
#Test Model:
y_prediction = LR.predict(x)
print('y_prediction=',y_prediction)
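# Evaluate the model; a sketch of the metrics named above
from sklearn.metrics import mean_squared_error
print('R-squared=', r2_score(y, y_prediction))
mse = mean_squared_error(y, y_prediction)
print('Mean squared error=', mse)
print('Root mean squared error=', np.sqrt(mse))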
This part of the code uses statsmodels to fit an OLS model, and then,
it extracts influence statistics such as R-Student, leverage, and Cook's
D. The final part of the code creates diagnostic plots for R-Student vs.
Predicted, R-Student vs. Leverage, and Cook's D vs. Observation using
seaborn.
#Listing 7-1-4
import statsmodels.api as sm
X_ols = sm.add_constant(x)
ols_model = sm.OLS(y, X_ols).fit()
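# A sketch of extracting the influence statistics and drawing the
# diagnostic plots described above (the figure layout is an assumption)
influence = ols_model.get_influence()
rstudent = influence.resid_studentized_external
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.scatterplot(x=ols_model.fittedvalues, y=rstudent, ax=axes[0])
axes[0].set_title('R-Student vs. Predicted')
sns.scatterplot(x=leverage, y=rstudent, ax=axes[1])
axes[1].set_title('R-Student vs. Leverage')
sns.scatterplot(x=np.arange(len(cooks_d)), y=cooks_d, ax=axes[2])
axes[2].set_title("Cook's D vs. Observation")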
plt.tight_layout()
plt.show()
Figure 7-4 shows the fit diagnostics for oil_price. The Dependent
Variable vs. Predicted Value plot visualizes variability in the
prediction. In this plot, the dots are random, with no pattern, which
indicates a constant variance of the error. The outliers are clear in this plot.
Residual vs. Fitted shows a random pattern of dots above and below
the 0 line, which indicates an adequate model. Again, the three outliers are
clear here.
The R-Student vs. Predicted Value plot shows a couple of points outside
the ±2 limits. A third dot lies between the limits but far away from the
rest of the dots.
The R-Student vs. Leverage plot shows the outliers that have leverage
on the calculation of the regression coefficients.
The Cook's D plot is designed to identify outliers. Here, the three points
stand out clearly; they are at Rows 12, 13, and 14. We shall
explore these outliers and the reason behind them in the next section.
These diagnostic plots help assess the assumptions and performance
of the linear regression model. They include checking for linearity,
homoscedasticity (constant variance of residuals), and influential
observations. The plots can provide insights into the model's behavior and
identify potential issues that might require further investigation or model
refinement.
Multiple Linear Regression
The multiple linear regression model equation is as follows:
Y = a + b1X1 + b2X2 + … + biXi
where Y is the dependent variable whose value we want to predict, a is the
intercept, bi are the coefficients, and Xi are the independent variables.
The error term is ignored.
H0: bi = 0
Ha: bi ≠ 0
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
df = pd.read_excel('../Datasets/sp_oil_gold.xlsx')
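# The plotting code below uses years, oil_price, gold_prices, and the
# stock market returns as plain arrays; a sketch of extracting them
# (assumption: the year column is named 'Year' -- adjust to your file)
years = df['Year'].values
oil_price = df['oil price'].values
gold_prices = df['gold_prices'].values
stock_returns = df['Stock Market Returns'].values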
# Figure Size
plt.figure(0)
# set width of bar
barWidth = 0.25
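# Bar plot of the stock market returns (reconstructed by analogy with
# the two charts that follow; an assumption)
fig = plt.subplots(figsize =(20, 8))
plt.xticks(years)
plt.yticks(stock_returns)
plt.title('Stock Market Returns')
plt.axhline(y=0, color='b', linestyle='-')
plt.bar(years, stock_returns)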
# Show Plot
plt.show()
plt.figure(1)
# set width of bar
barWidth = 0.25
fig = plt.subplots(figsize =(20, 8))
plt.xticks(years)
plt.yticks(oil_price)
plt.title('oil_price')
plt.axhline(y=0, color='b', linestyle='-')
# Bar Plot
plt.bar(years,oil_price)
# Show Plot
plt.show()
plt.figure(2)
# set width of bar
barWidth = 0.25
fig = plt.subplots(figsize =(20, 8))
plt.xticks(years)
plt.yticks(gold_prices)
plt.title('gold_prices')
plt.axhline(y=0, color='b', linestyle='-')
# Bar Plot
plt.bar(years,gold_prices)
# Show Plot
plt.show()
x=df[['oil price','gold_prices']].values
y=df['Stock Market Returns'].values
# Create the model and fit the data to it
LR = LinearRegression()
LR.fit(x, y)
print('intercept:', LR.intercept_)
print('slope:', LR.coef_)
print('equation of Stock market returns=',LR.intercept_,"+",LR.coef_[0],'* oil price +',LR.coef_[1],"*gold_prices")
The output will be as in Figures 7-5, 7-6, 7-7, and 7-8. The 2008 financial
crisis is clear in all the graphs. This crisis was the worst economic disaster
in the United States since the Great Depression of 1929. The oil and gold
prices spiked in 2012.
Figure 7-8 shows the output of the parameter estimates and the
equation.
You can repeat the same steps from the last section to plot the fit
diagnostics. However, this is enough to demonstrate the concept of
multiple linear regression.
Logistic Regression
Logistic regression is similar to linear regression except that the
dependent variable is binary, not continuous. The model equation gives
the probability that one of the two discrete values of the dependent
variable occurs.
The binary logistic regression model equation is as follows:
P(Y = 1) = 1 / (1 + e^-(a + b1X1 + b2X2 + … + biXi))
#Listing 7-3-1
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('../Datasets/titanic.csv')
print(df.head())
All the columns of this dataset are numeric except Gender, which is a
string. We cannot use a string variable in regression. There are multiple
ways to encode string variables to numeric, for example, the ordinal
and one-hot encodings. However, in our case, we do not need these
encodings because Gender has only two values, male and female. So,
it is easier to build a mask on one of the values: if the mask is
true, the value will be 1; otherwise, it will be 0.
We create a Pandas Series that will be a series of True and False
values (True if the passenger is male and False if the passenger is female).
df['Gender']=='male'
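A minimal sketch of building x and y for the model, assuming the target
column is named Survived (a hypothetical name) and the remaining
columns serve as predictors:
# Replace Gender with the mask, cast to 1/0 for the model
df['Gender'] = (df['Gender'] == 'male').astype(int)
y = df['Survived'].values              # 'Survived' is an assumed column name
x = df.drop(columns=['Survived']).values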
#Listing7-3-2
model = LogisticRegression()
model.fit(x, y)
print('model.coef_=',model.coef_)
print('model.intercept_=',model.intercept_)
Figure 7-10 shows the parameter estimates, and you can substitute the
values in the equation of the logistic regression.
#Listing 7-3-3
print('The predictions of the first 5 rows=',model.predict(x[:5]))
# [0 1 1 1 0]
print('The actual y values of the first 5 rows=',y[:5])
# [0 1 1 1 0]
y_pred = model.predict(x)
print('accuracy=',(y == y_pred).sum()/y.shape[0])
print('model score=',model.score(x,y))
Figure 7-11 shows the output of Listing 7-3-3. In the figure, we compare
the manually computed accuracy with the output of the model.score()
function. As shown, there is no difference between them, and this
model's accuracy is about 81%.
Summary
This chapter explains simple linear, multiple linear, and logistic
regression. It explains in detail the diagnostic plots and the meaning of
each of them. Moreover, it shows how to easily spot the outliers in several
of the diagnostic plots.