ISOM 3400 – PYTHON FOR BUSINESS ANALYTICS
7. Pandas and Matplotlib
Yingpeng Robin Zhu
JUL 06, 2022
Pandas
Pandas Introduction
What is Pandas?
Pandas is a Python library used for working with data sets
It has functions for analyzing, cleaning, exploring, and manipulating data
The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis"; the library was created by Wes McKinney in 2008
Anaconda already includes Pandas, so no additional installation is
needed
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical
theories
Pandas can clean messy data sets, and make them readable and relevant
Relevant data is very important in data science
Pandas Installation (in case Pandas is not installed)
Installation of Pandas
If you have Python and pip already installed on a system, then installing
Pandas is very easy: run pip install pandas from the command line
If you have Python but have not installed pip, then install pip first
(https://phoenixnap.com/kb/install-pip-windows), and then install Pandas
Pandas Introduction
Let’s create a Pandas data frame first:
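A minimal sketch of creating a DataFrame from a Python dictionary. The column names and values are hypothetical and chosen only for illustration; they are not the data shown on the original slide.

```python
import pandas as pd

# Hypothetical workout data: each dictionary key becomes a column
data = {
    "Duration": [60, 45, 30],
    "Calories": [409.1, 282.4, 195.0],
}

# Build the DataFrame; pandas adds a default index 0, 1, 2
df = pd.DataFrame(data)
print(df)
```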
Pandas Introduction
Data sets in Pandas are usually multi-dimensional tables, called DataFrames
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array,
or a table with rows and columns
What does a DataFrame look like? A preview
Figure: a DataFrame preview labeling its columns, rows, index, and data
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and
columns
Pandas uses the .loc[] attribute (with square brackets) to return one or more specified row(s)
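As a small sketch of row selection, the example below rebuilds a hypothetical DataFrame and selects rows by index label. Note that the selector is written with square brackets.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

first_row = df.loc[0]       # a single label returns a Series
two_rows = df.loc[[0, 1]]   # a list of labels returns a DataFrame

print(first_row)
print(two_rows)
```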
Locate Column
You can get access to the values of a column by indicating column names
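A short sketch of column access, again on hypothetical data. Passing one name returns a Series; passing a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

durations = df["Duration"]               # one column name: a Series
subset = df[["Duration", "Calories"]]    # list of column names: a DataFrame

print(durations)
print(subset)
```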
Locate Items
You can locate a specific item in the data frame with .at[row_index, column_name]
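The following sketch reads and then overwrites a single cell with .at. The data and the replacement value are hypothetical.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

# Read one item: row with index label 1, column "Calories"
value = df.at[1, "Calories"]
print(value)

# .at can also be used to overwrite a single item
df.at[1, "Calories"] = 300.0
print(df)
```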
Pandas Read CSV
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text and are a well-known format that can be read by almost any tool,
including Pandas
For the data that we have created, we could also import directly from .csv file
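A minimal sketch of loading CSV data with read_csv(). To keep the example self-contained, an in-memory text buffer stands in for a file on disk; with a real file you would pass its path instead. The column names and values are hypothetical.

```python
from io import StringIO

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content).
# With a real file, pass the path, e.g. pd.read_csv("my_data.csv")
csv_text = StringIO("Duration,Calories\n60,409.1\n45,282.4\n30,195.0\n")

df = pd.read_csv(csv_text)
print(df.head())   # head() shows the first rows
```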
Load Files Into a DataFrame
You can choose to load the selected columns, by indicating usecols
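A sketch of loading only selected columns with usecols. The in-memory buffer stands in for a file, and the column names are hypothetical.

```python
from io import StringIO

import pandas as pd

# Stand-in for a CSV file with three columns (hypothetical content)
csv_text = StringIO(
    "Duration,Date,Calories\n"
    "60,2020/12/01,409.1\n"
    "45,2020/12/02,282.4\n"
)

# Only the listed columns are loaded; "Duration" is skipped
df = pd.read_csv(csv_text, usecols=["Date", "Calories"])
print(df)
```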
Load Files Into a DataFrame
read_csv() parameters
You can adjust the way of data input by adjusting the parameters
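As an illustration of a few read_csv() parameters, the sketch below reads a semicolon-separated text with a leading comment line. The content is hypothetical; sep, skiprows, index_col, and nrows are standard read_csv() parameters, and this is only a small selection of what is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical export: one junk line, then a semicolon-separated table
csv_text = StringIO(
    "exported by tracker\n"
    "id;Duration;Calories\n"
    "1;60;409.1\n"
    "2;45;282.4\n"
    "3;30;195.0\n"
)

df = pd.read_csv(
    csv_text,
    sep=";",          # column separator (default is ",")
    skiprows=1,       # skip the first line of the file
    index_col="id",   # use the "id" column as the index
    nrows=2,          # read only the first two data rows
)
print(df)
```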
Data Cleaning
Data cleaning means fixing bad data in your data set
Bad data could be
Empty cells
Data in wrong format
Wrong data
Duplicates
Now, please open our sample data: data_lab7_demo_dirtydata.csv
The data set contains some empty cells ("Date" in row 24, and "Calories" in row 20
and 30)
The data set contains wrong format ("Date" in row 28)
The data set contains wrong data ("Duration" in row 9)
The data set contains duplicates (row 13 and 14)
Data Cleaning
Remove Rows (e.g., rows that include empty cells)
o If you want to change the original DataFrame, use the
inplace = True argument
o With inplace = True, dropna() will NOT return a new
DataFrame; instead, it removes all rows containing NULL values
from the original DataFrame
Note: By default, the dropna() method returns a new DataFrame, and will not
change the original
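The difference between the default behavior and inplace=True can be sketched as follows. The small DataFrame is hypothetical and contains one empty cell.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell (NaN) in "Calories"
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, np.nan, 195.0]})

# Default: returns a new DataFrame, the original keeps all rows
new_df = df.dropna()
rows_before = len(df)
print(rows_before, len(new_df))

# inplace=True: modifies the original DataFrame and returns None
df.dropna(inplace=True)
print(len(df))
```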
Data Cleaning
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead
This way you do not have to delete entire rows just because of some empty cells
The fillna() method allows us to replace empty cells with a value
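A minimal sketch of fillna() on hypothetical data. The replacement value 130 is arbitrary; note that fillna() called on the whole DataFrame fills empty cells in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, np.nan, 195.0]})

# Replace every empty cell in the DataFrame with 130 (arbitrary value)
filled = df.fillna(130)
print(filled)
```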
Data Cleaning
Replace Only For Specified Columns
Most of the time, we want to fill the empty cells of a specific column with that column's mean or median
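A sketch of filling the empty cells of one column with that column's mean, on hypothetical data. Assigning the result back to the column is one common way to do this; median() can be used in place of mean().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell in "Calories"
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [400.0, np.nan, 200.0]})

# mean() ignores empty cells by default
mean_calories = df["Calories"].mean()

# Fill only the "Calories" column and assign the result back
df["Calories"] = df["Calories"].fillna(mean_calories)
print(df)
```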
Data Cleaning
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time
To discover duplicates, we can use the duplicated() method
The duplicated() method returns a Boolean value for each row (True if the row repeats an earlier row)
Removing Duplicates
To remove duplicates, use the drop_duplicates() method
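Both methods can be sketched on a small hypothetical DataFrame in which the last row repeats the second one.

```python
import pandas as pd

# Hypothetical data: the third row is an exact copy of the second row
df = pd.DataFrame({"Duration": [60, 45, 45], "Calories": [409.1, 282.4, 282.4]})

# One Boolean per row; True marks a row that repeats an earlier row
flags = df.duplicated()
print(flags)

# Remove the duplicated rows from the original DataFrame
df.drop_duplicates(inplace=True)
print(df)
```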
Python Matplotlib
What is Matplotlib?
Installation of Matplotlib
Python distributions like Anaconda already have Matplotlib installed
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the
statement: import matplotlib
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually
imported under the plt alias: import matplotlib.pyplot as plt
First Plot Example
Draw a line in a diagram from position (0,0) to position (10,500)
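A minimal sketch of this first plot, using NumPy arrays for the two end points of the line.

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 10])    # x-coordinates of the two end points
ypoints = np.array([0, 500])   # y-coordinates of the two end points

# plot() returns a list of line objects; here there is exactly one
line, = plt.plot(xpoints, ypoints)
plt.show()
```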
Matplotlib Plotting
Plotting x and y points
The plot() function is used to draw points (markers) in a diagram
By default, the plot() function draws a line from point to point
The function takes parameters for specifying points in the diagram
Parameter 1 is an array containing the points on the x-axis or the horizontal axis.
(e.g., np.array([0,10]))
Parameter 2 is an array containing the points on the y-axis or the vertical axis (e.g.,
np.array([10,500]))
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and
[3, 10] to the plot function
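The line from (1, 3) to (8, 10) described above can be sketched as follows: the first array holds both x-values, the second holds both y-values.

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])    # x-values of the points (1, 3) and (8, 10)
ypoints = np.array([3, 10])   # y-values of the points (1, 3) and (8, 10)

line, = plt.plot(xpoints, ypoints)
plt.show()
```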
Matplotlib Plotting
You can plot as many points as you like, just make sure you have the same number of
points on both axes
Default X-Points
If we do not specify the points in the x-axis, they will get the default values 0, 1, 2, 3,…
(etc. depending on the length of the y-points)
So, if we take the same example as above, and leave out the x-points, the diagram will
look like this
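A short sketch of the default x-points. The y-values are hypothetical; because only one array is passed, Matplotlib uses 0, 1, 2, 3 as x-values.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical y-values

# Only y-values are given, so the x-values default to 0, 1, 2, 3
line, = plt.plot(ypoints)
print(line.get_xdata())
plt.show()
```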
Matplotlib Plotting
You can use the keyword argument marker to emphasize each point with a specified
marker
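A sketch of the marker argument on hypothetical y-values. Here "o" draws a circle at each point; other marker codes such as "*" or "s" work the same way.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical y-values

# marker="o" emphasizes each point with a circle
line, = plt.plot(ypoints, marker="o")
plt.show()
```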
Matplotlib Plotting
You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line
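A sketch of changing the line style, on hypothetical y-values. The long form linestyle="dotted" and the short form ls=":" are equivalent.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical y-values

# Dotted line; ls=":" would give the same result
line, = plt.plot(ypoints, linestyle="dotted")
plt.show()
```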
Matplotlib Plotting
You can use the keyword argument color or the shorter c to set the color of the line
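A sketch of setting the line color on hypothetical y-values. "r" is the short code for red; color names such as "hotpink" and hexadecimal values such as "#4CAF50" are also accepted.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical y-values

# Red line; c="r" is the shorter spelling of color="r"
line, = plt.plot(ypoints, color="r")
plt.show()
```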
Matplotlib Plotting
Multiple Lines: You can plot as many lines as you like by simply adding more plt.plot()
functions
Note: we only specified the points on the y-axis, meaning that the points on the x-axis got the default values (0, 1, 2, 3, ...)
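A sketch of drawing two lines with two plt.plot() calls. Both arrays are hypothetical, and since only y-values are given, both lines use the default x-values.

```python
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 8, 1, 10])   # hypothetical y-values of the first line
y2 = np.array([6, 2, 7, 11])   # hypothetical y-values of the second line

# Each call adds one line to the same diagram
plt.plot(y1)
plt.plot(y2)

ax = plt.gca()   # current axes, used here only to count the lines
print(len(ax.lines))
plt.show()
```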
Matplotlib Plotting
You can also plot many lines by adding the points for the x- and y-axis for each line in
the same plt.plot() function, so that the x- and y- values come in pairs.
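The same two lines can be drawn in a single call by passing the arrays as x, y pairs. All values are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

# The arguments come in pairs: (x1, y1) is the first line, (x2, y2) the second
lines = plt.plot(x1, y1, x2, y2)
plt.show()
```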
Matplotlib Labels and Title
With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and
y-axis
With Pyplot, you can use the title() function to set a title for the plot
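A sketch that adds axis labels and a title. The data and the label texts are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([80, 85, 90, 95, 100])        # hypothetical pulse values
y = np.array([240, 250, 260, 270, 280])    # hypothetical calorie values

plt.plot(x, y)
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

ax = plt.gca()   # current axes, used here only to read the texts back
print(ax.get_title(), "|", ax.get_xlabel(), "|", ax.get_ylabel())
plt.show()
```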
Matplotlib Labels and Title
With Pyplot, you can use label="name" within the plt.plot() method to label a line
Remember to call the legend() function, otherwise the labels are not displayed
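A sketch of labeling two lines and showing the legend. The data and the label names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 8, 1, 10])   # hypothetical values
y2 = np.array([6, 2, 7, 11])   # hypothetical values

plt.plot(y1, label="Product A")
plt.plot(y2, label="Product B")

# Without this call the labels are stored but never drawn
legend = plt.legend()
plt.show()
```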
Matplotlib Labels and Title
Add Grid Lines to a Plot
With Pyplot, you can use the grid() function to add grid lines to the plot
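A sketch of adding grid lines to a plot of hypothetical data. Calling grid() without arguments turns on grid lines for both axes; the axis parameter ("x" or "y") restricts them to one direction.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([80, 85, 90, 95, 100])        # hypothetical values
y = np.array([240, 250, 260, 270, 280])    # hypothetical values

line, = plt.plot(x, y)
plt.grid()             # grid lines for both axes
# plt.grid(axis="y")   # alternative: horizontal grid lines only
plt.show()
```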
Matplotlib Subplot
Display Multiple Plots (one row, two columns)
Figure: two plots side by side in one row, the first plot in the first column and the second plot in the second column
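A sketch of the one-row, two-column layout with subplot(). The three arguments are the number of rows, the number of columns, and the position of the current plot; the data is hypothetical. Swapping the first two arguments, subplot(2, 1, ...), stacks the plots in two rows instead.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

# The first plot: 1 row, 2 columns, position 1
plt.subplot(1, 2, 1)
plt.plot(x, np.array([3, 8, 1, 10]))

# The second plot: 1 row, 2 columns, position 2
plt.subplot(1, 2, 2)
plt.plot(x, np.array([10, 20, 30, 40]))

fig = plt.gcf()   # current figure, used here only to count the plots
print(len(fig.axes))
plt.show()
```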
Matplotlib Subplot
Display Multiple Plots (two rows, one column)
Figure: two stacked plots in one column, the first plot in the first row and the second plot in the second row
Matplotlib Subplot
Display Multiple Plots (as many as you want)
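A sketch of a larger layout: six plots arranged in two rows and three columns. The data is hypothetical, and the loop is just one convenient way to fill the grid.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

# Positions are numbered 1..6, row by row
for position in range(1, 7):
    plt.subplot(2, 3, position)
    plt.plot(x, x * position)          # hypothetical data for each plot
    plt.title(f"Plot {position}")

fig = plt.gcf()   # current figure, used here only to count the plots
print(len(fig.axes))
plt.show()
```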
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot
The scatter() function plots one dot for each observation. It needs two arrays of the
same length, one for the values of the x-axis, and one for values on the y-axis
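A minimal scatter plot sketch. The two arrays are hypothetical and have the same length, one value per observation.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([5, 7, 8, 7, 2, 17])          # hypothetical x-values
y = np.array([99, 86, 87, 88, 111, 86])    # hypothetical y-values

# One dot is drawn for each (x, y) pair
points = plt.scatter(x, y)
plt.show()
```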
Matplotlib Scatter
Change color and size of the markers
Note: make sure the arrays for colors and sizes have the same
length as the arrays for the x- and y-axis
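A sketch of per-marker colors and sizes using the c and s arguments of scatter(). All values are hypothetical; each of the four arrays has five entries.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([5, 7, 8, 7, 2])
y = np.array([99, 86, 87, 88, 111])
colors = np.array(["red", "green", "blue", "orange", "purple"])
sizes = np.array([20, 50, 100, 200, 500])

# c sets the color and s the size of each marker, one value per point
points = plt.scatter(x, y, c=colors, s=sizes)
plt.show()
```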
Matplotlib Bars
With Pyplot, you can use the bar() function to draw bar graphs
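A minimal bar chart sketch with hypothetical categories and values.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array(["A", "B", "C", "D"])   # hypothetical categories
y = np.array([3, 8, 1, 10])          # hypothetical values

# One vertical bar per category
bars = plt.bar(x, y)
plt.show()
```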
Matplotlib Bars
You can use plt.savefig("path") to save the generated image
Note: Call plt.savefig() before plt.show(); if the figure has already been shown and closed,
the saved file may be blank
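A sketch of saving a chart before showing it. The data and the file name bar_chart.png are hypothetical; the file is written to the system's temporary folder so the example does not clutter your working directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.array(["A", "B", "C", "D"])   # hypothetical categories
y = np.array([3, 8, 1, 10])          # hypothetical values

# Hypothetical output location inside the temporary folder
out_path = os.path.join(tempfile.gettempdir(), "bar_chart.png")

plt.bar(x, y)
plt.savefig(out_path)   # save first ...
plt.show()              # ... then show

print("Saved to", out_path)
```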
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh()
function
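A sketch of a horizontal bar chart with hypothetical categories and values. With barh(), the values determine the length of each bar along the x-axis.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array(["A", "B", "C", "D"])   # hypothetical categories
y = np.array([3, 8, 1, 10])          # hypothetical values

# One horizontal bar per category
bars = plt.barh(x, y)
plt.show()
```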
Horizontal Bars
Use the keyword argument color to set the color of the bars
Use the keyword argument width (for bar()) or height (for barh()) to set the thickness of the bars
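A sketch showing color together with width for vertical bars and height for horizontal bars. The data, colors, and the thickness 0.1 are hypothetical choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array(["A", "B", "C", "D"])   # hypothetical categories
y = np.array([3, 8, 1, 10])          # hypothetical values

# Vertical bars: thickness is controlled by width
plt.subplot(1, 2, 1)
vbars = plt.bar(x, y, color="red", width=0.1)

# Horizontal bars: thickness is controlled by height
plt.subplot(1, 2, 2)
hbars = plt.barh(x, y, color="green", height=0.1)

plt.show()
```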
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions
It is a graph showing the number of observations within each given interval
Say you ask 250 people for their height; you might end up with a histogram like
this:
Matplotlib Histograms
Create Histogram
In Matplotlib, we use the hist() function to create histograms
The hist() function uses an array of numbers to create a histogram; the array is
passed to the function as an argument
We use NumPy to randomly generate an array with 250 values, where the values
will concentrate around 170, and the standard deviation is 10
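The histogram described above can be sketched as follows. np.random.normal(170, 10, 250) draws 250 values from a normal distribution with mean 170 and standard deviation 10; the values differ on every run because no random seed is set.

```python
import matplotlib.pyplot as plt
import numpy as np

# 250 random heights concentrated around 170 with standard deviation 10
x = np.random.normal(170, 10, 250)

# hist() returns the bar counts, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(x)
plt.show()
```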
Matplotlib Pie Charts
Create Pie Charts
With Pyplot, you can use the pie() function to draw pie charts
By default, the plotting of the first wedge starts from the x-axis and moves
counterclockwise
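A minimal pie chart sketch with hypothetical values. Each value becomes one wedge whose size is its share of the total.

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])   # hypothetical values, one per wedge

# The first element of the result is the list of wedges
result = plt.pie(y)
wedges = result[0]
plt.show()
```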
Matplotlib Pie Charts
Labels, titles, and percentages
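A sketch of a pie chart with wedge labels, a title, and percentages. The values and names are hypothetical; autopct takes a format string that controls how each percentage is printed.

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])                         # hypothetical values
labels = ["Apples", "Bananas", "Cherries", "Dates"]    # hypothetical names

# labels names each wedge; autopct prints the share with one decimal
result = plt.pie(y, labels=labels, autopct="%1.1f%%")
percent_texts = result[2]   # text objects created because autopct is set
plt.title("Fruit Sales")

print([t.get_text() for t in percent_texts])
plt.show()
```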
Matplotlib Pie Charts
Explode
Maybe you want one of the wedges to stand out? The explode parameter allows
you to do that
The explode parameter, if specified, and not None, must be an array with one value
for each wedge
Each value represents how far from the center each wedge is displayed
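A sketch of the explode parameter with hypothetical values. The explode list has one entry per wedge; here only the first wedge is pulled out. Adding shadow=True to the same call draws a shadow under the chart.

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])                         # hypothetical values
labels = ["Apples", "Bananas", "Cherries", "Dates"]    # hypothetical names

# One value per wedge: 0.2 pulls the first wedge away from the center
explode = [0.2, 0, 0, 0]

result = plt.pie(y, labels=labels, explode=explode)
wedges = result[0]
plt.show()
```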
Matplotlib Pie Charts
Explode with Shadow
Python in Business Analytics
Class Objectives
Revisit important concepts of machine learning
What is Machine Learning?
What is linear regression and how does it work?
What are the evaluation metrics for linear regression?
How to train and interpret a linear regression model using Scikit-learn?
What is Machine Learning?
“Machine Learning is a field of study that gives computers the ability
to learn without being explicitly programmed.”
---- Samuel, A. (1959)
“An umbrella of a specific set of algorithms that all have one specific
purpose to learn to detect certain patterns from data.”
---- neurospace
What is Machine Learning?
Machine Learning is making the computer learn from studying data and statistics
A computer program that analyses data and learns to predict the outcome
https://www.youtube.com/watch?v=nKW8Ndu7Mjw
What is Machine Learning?
Semi-automated extraction of
knowledge from data
Start with questions that might be
answerable using data
Exploit different types of models to
provide insights of the data using
computers
Still require human judgement and
decision-making
Two Main Categories of ML
Supervised Learning
Is an email spam? How do sales differ
by gender?
There is a specific outcome we are trying to
predict (Label)
Unsupervised learning
Extracting structure from data, or finding the best
way to represent the data
Segment grocery shoppers to clusters with
similar behavior
Recommend movies/music based on past
viewing data
There is no right or wrong answer
Two Main Categories of ML
Supervised Learning Unsupervised Learning
How Does Supervised Learning Work?
Step One: Model training (train a machine
learning model using labeled data)
Step Two: Model prediction on new data
(for which the label is unknown)
Step Three: Evaluate the accuracy of the
model (percentage of correct predictions
using labeled data)
Some Concepts
                 Features           Target/Label
Student ID   Attendance   GPA   Grade (Real Value)   Predicted Value   Dataset
20465532     0.95         4.0   95                                     Training
20339901     0.82         3.8   88                                     Training
20567789     0.5          2.2   60                                     Training
20339912     1.0          3.5   98                                     Training
…            …            …     …
20429981     0.90         3.9   93                   90                Testing
20890012     0.89         2.5   85                   86                Testing
Supervised Learning Terminology
Features – the values we observe and use to make predictions
Also known as predictors or independent variables; each row of feature values is known as an observation, sample, instance, or record
Response – the value we try to predict
Also known as target, outcome, label, dependent variable
The task of supervised learning
For a new observation, given features, we want to predict the label of this
observation
Supervised Learning Algorithms
Supervised learning in which the response is
continuous
Linear regression
Supervised learning in which the response is
categorical
Logistic Regression
K-nearest Neighbors Classifiers
Naïve Bayes Classifier
Linear Regression
A machine learning model that can be used to predict continuous variables,
such as sales, stock prices, or amounts
Runs quickly
No tuning required
Easily understandable
It is well-known and well-documented
Linear Regression
It assumes a linear relationship between features and response
So may not generate good prediction if the underlying relationship is
nonlinear
Linear Regression Example
Assume a person’s IQ is jointly determined by his/her father’s IQ and
his/her mother’s IQ according to the following regression function:
Linear Regression Example
Question 1: how do we get this function?
Question 2: how accurate is the prediction? (i.e., how to evaluate?)
Evaluation Metrics For Regression
Mean absolute error (MAE) – the mean of the absolute value of the errors
Mean squared error (MSE) – the mean of the squared errors
Root mean squared error (RMSE) – the square root of the mean of the
squared errors
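The three metrics can be computed by hand with NumPy, as sketched below. The two real and predicted grades are the testing rows from the "Some Concepts" table (93 vs. 90 and 85 vs. 86); the variable names are illustrative.

```python
import numpy as np

y_true = np.array([93, 85])   # real values
y_pred = np.array([90, 86])   # predicted values

errors = y_true - y_pred      # 3 and -1

mae = np.mean(np.abs(errors))   # mean of the absolute errors
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # square root of the MSE

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize large errors more heavily than MAE does.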
Scikit-learn for Model Development
Scikit-learn Requirement
Features and response are separate objects
Features and response should be NumPy arrays
Pandas DataFrames and Series are built on top of NumPy arrays, so they can be used as well
Features and response should have specific shapes
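A sketch of the expected shapes, using the training rows from the "Some Concepts" table. Treating Attendance and GPA as the features and Grade as the response is an assumption made for this example. The feature object is two-dimensional (rows by columns) and the response is one-dimensional with one value per row.

```python
import pandas as pd

# Training rows from the table; the choice of feature columns is an assumption
df = pd.DataFrame({
    "Attendance": [0.95, 0.82, 0.5, 1.0],
    "GPA": [4.0, 3.8, 2.2, 3.5],
    "Grade": [95, 88, 60, 98],
})

X = df[["Attendance", "GPA"]]   # features: 2-dimensional
y = df["Grade"]                 # response: 1-dimensional

print(X.shape)   # (number of observations, number of features)
print(y.shape)   # (number of observations,)
```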
Scikit-learn 7-step Modeling
Step 1: define features and response columns
Step 2: split data into training vs. test data
Step 3: import the model you want to use
Step 4: instantiate the model
Step 5: fit your model with training data
Step 6: make predictions for the test data
Step 7: estimate the accuracy of the model
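The seven steps can be sketched end to end as follows. The data set is synthetic and hypothetical: grades are generated without noise from the made-up rule Grade = 40 + 30 × Attendance + 7 × GPA, so the fitted model should recover these coefficients almost exactly. With real data the errors would not be close to zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression                     # Step 3
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical, noise-free data: Grade = 40 + 30*Attendance + 7*GPA
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Attendance": rng.uniform(0.5, 1.0, 50),
    "GPA": rng.uniform(2.0, 4.0, 50),
})
df["Grade"] = 40 + 30 * df["Attendance"] + 7 * df["GPA"]

# Step 1: define features and response columns
X = df[["Attendance", "GPA"]]
y = df["Grade"]

# Step 2: split data into training vs. test data (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Step 4: instantiate the model
model = LinearRegression()

# Step 5: fit the model with training data
model.fit(X_train, y_train)

# Step 6: make predictions for the test data
y_pred = model.predict(X_test)

# Step 7: estimate the accuracy of the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("MAE:", mae, "RMSE:", rmse)
```

Each coefficient is interpreted as the expected change in the response for a one-unit increase in that feature, holding the other feature fixed.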
Jupyter Notebook