Python Pandas and Matplotlib 7

Pandas is a Python library used for working with and analyzing data. It allows users to load, clean, and manipulate data stored in CSV files or other formats. Data is stored and manipulated as a DataFrame, which contains rows and columns like a spreadsheet. Pandas provides methods for selecting, filtering, aggregating, and cleaning data. Matplotlib is a Python library for creating plots and visualizing data. It allows customizing plots with options like titles, labels, colors, markers and line styles. Multiple lines or datasets can be plotted on the same axes.

Uploaded by

denny

ISOM 3400 – PYTHON FOR BUSINESS ANALYTICS

7. Pandas and Matplotlib


Yingpeng Robin Zhu

JUL 06, 2022

1
Pandas

2
Pandas Introduction

 What is Pandas?
 Pandas is a Python library used for working with data sets
 It has functions for analyzing, cleaning, exploring, and manipulating data
 The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the
library was created by Wes McKinney in 2008
 Anaconda already includes Pandas, so no additional installation is needed

 Why Use Pandas?


 Pandas allows us to analyze large data sets and draw conclusions based on statistical
theories
 Pandas can clean messy data sets, and make them readable and relevant
 Relevant data is very important in data science
3
Pandas Installation (in case Pandas is not installed)

 Installation of Pandas
 If you have Python and pip already installed on a system, then installation of
Pandas is very easy: run pip install pandas from the command line

 If you have Python but have not installed pip, then install the pip command first
https://phoenixnap.com/kb/install-pip-windows , and then install pandas

4
Pandas Introduction
 Let’s create a Pandas data frame first:
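A minimal sketch of creating a DataFrame from a dictionary. The column names and values below are hypothetical placeholders, not the data used on the original slide.

```python
import pandas as pd

# Hypothetical workout records; names and values are illustrative only.
data = {
    "Duration": [60, 45, 30],
    "Calories": [409.1, 282.4, 195.0],
}

# Dictionary keys become column names; each list becomes one column.
df = pd.DataFrame(data)
print(df)
```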

5
Pandas Introduction

 Data sets in Pandas are usually multi-dimensional tables, called DataFrames


 A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array,
or a table with rows and columns
 What does a DataFrame look like? A preview
[Figure: preview of a DataFrame with its columns, rows, index, and data labeled]

6
Locate Row

 As you can see from the result above, the DataFrame is like a table with rows and
columns
 Pandas uses the .loc attribute, written with square brackets (e.g., df.loc[0]), to return one or more specified row(s)
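A short sketch of row selection with .loc, assuming a small hypothetical DataFrame with the default integer index:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

first_row = df.loc[0]       # a single label returns that row as a Series
two_rows = df.loc[[0, 1]]   # a list of labels returns a DataFrame

print(first_row)
print(two_rows)
```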

7
Locate Column

 You can get access to the values of a column by indicating column names
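For instance, with a hypothetical DataFrame, a single column name returns a Series and a list of names returns a smaller DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

calories = df["Calories"]              # one column, returned as a Series
subset = df[["Duration", "Calories"]]  # several columns, returned as a DataFrame

print(calories)
```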

9
Locate Items

 You can locate a specific item in the data frame with .at[row_index, column_name]
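A small sketch of reading and updating a single cell with .at, again on hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.0]})

value = df.at[1, "Calories"]   # read the cell in row 1, column "Calories"
print(value)

df.at[1, "Calories"] = 300.0   # .at can also assign a new value to one cell
```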

10
Pandas Read CSV

 A simple way to store big data sets is to use CSV files (comma-separated values files).
 CSV files contain plain text and are a well-known format that can be read by almost
any tool, including Pandas
 The data that we created above could also be imported directly from a .csv file
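A minimal sketch of read_csv(). To keep it self-contained, the CSV content is a hypothetical in-memory string; with a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Duration,Date,Calories\n60,2020/12/01,409.1\n45,2020/12/02,282.4\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```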

11
Load Files Into a DataFrame

 You can choose to load the selected columns, by indicating usecols
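As an illustration, usecols takes a list of the column names to keep (the CSV content here is hypothetical):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Duration,Date,Calories\n60,2020/12/01,409.1\n45,2020/12/02,282.4\n"

# Only the listed columns are loaded; "Date" is skipped.
df = pd.read_csv(io.StringIO(csv_text), usecols=["Duration", "Calories"])
print(df)
```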

12
Load Files Into a DataFrame

 read_csv() parameters
 You can adjust the way of data input by adjusting the parameters
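A sketch of a few commonly used read_csv() parameters. The slide does not say which parameters it demonstrated, so the selection below (sep, header, names, parse_dates, nrows) and the data are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical file content: semicolon-separated, without a header row.
csv_text = "60;2020/12/01;409.1\n45;2020/12/02;282.4\n30;2020/12/03;195.0\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                                 # field separator used in the file
    header=None,                             # the file has no header row
    names=["Duration", "Date", "Calories"],  # supply the column names ourselves
    parse_dates=["Date"],                    # convert this column to datetime
    nrows=2,                                 # read only the first two data rows
)
print(df)
print(df.dtypes)
```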

13
Data Cleaning

 Data cleaning means fixing bad data in your data set


 Bad data could be
 Empty cells
 Data in wrong format
 Wrong data
 Duplicates
 Now, please open our sample data: data_lab7_demo_dirtydata.csv
 The data set contains some empty cells ("Date" in row 24, and "Calories" in row 20
and 30)
 The data set contains a value in the wrong format ("Date" in row 28)
 The data set contains wrong data ("Duration" in row 9)
 The data set contains duplicates (row 13 and 14)

14
Data Cleaning

 Remove Rows (e.g., rows that include empty cells)

o If you want to change the original DataFrame, use the inplace = True argument
o Now, the dropna(inplace = True) will NOT return a new
DataFrame, but it will remove all rows containing NULL values
from the original DataFrame

Note: By default, the dropna() method returns a new DataFrame, and will not
change the original

15
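The difference between the two forms of dropna() can be sketched as follows, using a hypothetical DataFrame with one empty cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, np.nan, 195.0]})

cleaned = df.dropna()        # returns a new DataFrame; df itself is unchanged
print(len(df), len(cleaned))

df.dropna(inplace=True)      # modifies df directly and returns None
print(len(df))
```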
Data Cleaning

 Replace Empty Values


 Another way of dealing with empty cells is to insert a new value instead
 This way you do not have to delete entire rows just because of some empty cells
 The fillna() method allows us to replace empty cells with a value
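A minimal sketch of fillna(); the replacement value 130 and the data are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data with empty cells in both columns.
df = pd.DataFrame({"Duration": [60, None, 30], "Calories": [409.1, 282.4, None]})

# Every empty cell in the whole DataFrame is replaced with 130.
filled = df.fillna(130)
print(filled)
```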

16
Data Cleaning

 Replace Only For Specified Columns


 Most of the time, we want to replace the empty cells of a specific column with that column's mean or median
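One way to do this is to compute the statistic first and then fill only that column. The data below is hypothetical; median() can be used in place of mean().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty "Calories" cell.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [400.0, np.nan, 200.0]})

# mean() skips missing values by default.
mean_calories = df["Calories"].mean()

# Fill only the "Calories" column; other columns are untouched.
df["Calories"] = df["Calories"].fillna(mean_calories)
print(df)
```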

17
Data Cleaning

 Discovering Duplicates
 Duplicate rows are rows that have been registered more than one time

 To discover duplicates, we can use the duplicated() method


 The duplicated() method returns a Boolean value for each row
 Removing Duplicates
 To remove duplicates, use the drop_duplicates() method
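Both methods can be sketched on a hypothetical DataFrame whose last row repeats the second one:

```python
import pandas as pd

# Hypothetical data; the third row duplicates the second.
df = pd.DataFrame({"Duration": [60, 45, 45], "Calories": [409.1, 282.4, 282.4]})

flags = df.duplicated()          # True for each row that repeats an earlier row
print(flags)

deduped = df.drop_duplicates()   # returns a new DataFrame without the repeats
print(deduped)
```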

18
Data Cleaning

19
Python Matplotlib

20
What is Matplotlib?

21
What is Matplotlib?

 Installation of Matplotlib
 Python distributions such as Anaconda already have Matplotlib installed
 Import Matplotlib
 Once Matplotlib is installed, import it in your applications by adding the import
module statement: import matplotlib
 Most of the Matplotlib utilities lie under the pyplot submodule, and are usually
imported under the plt alias: import matplotlib.pyplot as plt

22
First Plot Example

 Draw a line in a diagram from position (0,0) to position (10,500)
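A minimal sketch of this first plot. The variable names are arbitrary, and plt.show() is left commented out so the snippet also runs without a display.

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 10])    # x-coordinates of the two end points
ypoints = np.array([0, 500])   # y-coordinates of the two end points

# plot() returns a list of line objects; unpack the single line.
line, = plt.plot(xpoints, ypoints)

# plt.show()  # uncomment to display the figure in an interactive session
```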

23
Matplotlib Plotting

 Plotting x and y points


 The plot() function is used to draw points (markers) in a diagram
 By default, the plot() function draws a line from point to point
 The function takes parameters for specifying points in the diagram
 Parameter 1 is an array containing the points on the x-axis or the horizontal axis
(e.g., np.array([0, 10]))
 Parameter 2 is an array containing the points on the y-axis or the vertical axis (e.g.,
np.array([0, 500]))
 If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and
[3, 10] to the plot function
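The example in the last bullet can be sketched as:

```python
import matplotlib.pyplot as plt
import numpy as np

# The x-values of both points go in one array, the y-values in another.
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

line, = plt.plot(xpoints, ypoints)   # draws a line from (1, 3) to (8, 10)

# plt.show()  # uncomment to display the figure
```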

24
Matplotlib Plotting

25
Matplotlib Plotting

 You can plot as many points as you like, just make sure you have the same number of
points on both axes
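For instance, a line through four hypothetical points:

```python
import matplotlib.pyplot as plt
import numpy as np

# Four illustrative points: (1, 3), (2, 8), (6, 1), (8, 10).
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

# Both arrays must have the same length, otherwise plot() raises an error.
line, = plt.plot(xpoints, ypoints)

# plt.show()  # uncomment to display the figure
```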

26
Default X-Points

 If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3,…
(etc. depending on the length of the y-points)
 So, if we take the same example as above, and leave out the x-points, the diagram will
look like this
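A small sketch with illustrative y-values, showing the x-values that Matplotlib fills in:

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # illustrative y-values

# With only one array, plot() treats it as y and uses 0, 1, 2, 3 for x.
line, = plt.plot(ypoints)
print(line.get_xdata())

# plt.show()  # uncomment to display the figure
```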

27
Matplotlib Plotting

 You can use the keyword argument marker to emphasize each point with a specified
marker
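For example, marker="o" draws a circle at each point (the data is illustrative; other marker codes such as "*" or "s" work the same way):

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # illustrative y-values

# Each data point is drawn with a circle marker on top of the line.
line, = plt.plot(ypoints, marker="o")

# plt.show()  # uncomment to display the figure
```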

28
Matplotlib Plotting

 You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line
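A brief sketch of both spellings, using illustrative data:

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # illustrative y-values

dotted_line, = plt.plot(ypoints, linestyle="dotted")   # full keyword and style name
dashed_line, = plt.plot(ypoints + 2, ls="--")          # short keyword and short style code

# plt.show()  # uncomment to display the figure
```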

29
Matplotlib Plotting

 You can use the keyword argument color or the shorter c to set the color of the line
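For instance, with illustrative data, colors can be given as short codes, names, or hexadecimal strings:

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # illustrative y-values

red_line, = plt.plot(ypoints, color="r")           # short color code
green_line, = plt.plot(ypoints + 2, c="#4CAF50")   # short keyword with a hex color

# plt.show()  # uncomment to display the figure
```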

30
Matplotlib Plotting

 Multiple Lines: You can plot as many lines as you like by simply adding more plt.plot()
functions

Note: we only specified the points on the y-axis, meaning that the points on the x-axis got the default values (0, 1, 2, 3, ...)

32
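A sketch of two lines drawn on the same axes with two plt.plot() calls. The y-values are illustrative, and plt.figure() is only there to start from an empty figure.

```python
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 8, 1, 10])   # illustrative values for the first line
y2 = np.array([6, 2, 7, 11])   # illustrative values for the second line

plt.figure()    # start a fresh figure
plt.plot(y1)    # first line, x defaults to 0, 1, 2, 3
plt.plot(y2)    # second line is added to the same axes

lines = plt.gca().get_lines()   # all lines on the current axes

# plt.show()  # uncomment to display the figure
```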
Matplotlib Plotting

 You can also plot many lines by adding the points for the x- and y-axis for each line in
the same plt.plot() function, so that the x- and y- values come in pairs.
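For example, with illustrative data, a single call can take x1, y1, x2, y2:

```python
import matplotlib.pyplot as plt
import numpy as np

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

# Arguments come in x/y pairs; plot() returns one line object per pair.
lines = plt.plot(x1, y1, x2, y2)

# plt.show()  # uncomment to display the figure
```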

33
Matplotlib Labels and Title
 With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and
y-axis
 With Pyplot, you can use the title() function to set a title for the plot
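A minimal sketch of these three functions. The data and label texts are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements for illustration.
x = np.array([80, 90, 100, 110, 120])
y = np.array([240, 260, 280, 300, 320])

plt.figure()
plt.plot(x, y)
plt.xlabel("Average Pulse")       # label for the horizontal axis
plt.ylabel("Calorie Burnage")     # label for the vertical axis
plt.title("Sports Watch Data")    # title shown above the plot

# plt.show()  # uncomment to display the figure
```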

34
Matplotlib Labels and Title
 With Pyplot, you can use label="name" within the plt.plot() method to label a line
 Remember to call the legend() function, otherwise the labels are not displayed
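A short sketch with two labeled lines; the data and label names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])

plt.figure()
plt.plot(y1, label="Line A")   # label is stored with the line
plt.plot(y2, label="Line B")
legend = plt.legend()          # draws a legend box built from the labels

# plt.show()  # uncomment to display the figure
```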

35
Matplotlib Labels and Title
 Add Grid Lines to a Plot
 With Pyplot, you can use the grid() function to add grid lines to the plot
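For instance, with illustrative data:

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.array([3, 8, 1, 10])   # illustrative y-values

plt.figure()
plt.plot(y)
plt.grid(True)   # turn on grid lines for both axes

# plt.show()  # uncomment to display the figure
```

Passing axis="x" or axis="y" to grid() limits the grid lines to one direction.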

36
Matplotlib Subplot
 Display Multiple Plots (1 row two columns)
[Figure: two plots side by side (one row, two columns); the first plot on the left, the second plot on the right]
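A sketch of this layout with plt.subplot(nrows, ncols, index), where the index counts plots from 1. The data is illustrative. Swapping the first two arguments, as in plt.subplot(2, 1, 1) and plt.subplot(2, 1, 2), stacks the plots in two rows and one column instead.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

plt.figure()

plt.subplot(1, 2, 1)                  # 1 row, 2 columns, first plot
plt.plot(x, np.array([3, 8, 1, 10]))

plt.subplot(1, 2, 2)                  # 1 row, 2 columns, second plot
plt.plot(x, np.array([10, 20, 30, 40]))

# plt.show()  # uncomment to display the figure
```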

37
Matplotlib Subplot
 Display Multiple Plots (2 rows one column)

[Figure: two plots stacked vertically (two rows, one column); the first plot on top, the second plot below]

38
Matplotlib Subplot
 Display Multiple Plots (as many as you want)
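As an illustration, a grid of 2 rows and 3 columns filled in a loop. The grid size and data are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

plt.figure()
for i in range(6):
    plt.subplot(2, 3, i + 1)     # positions are numbered 1..6, row by row
    plt.plot(x, x * (i + 1))     # a different illustrative line in each plot
    plt.title(f"Plot {i + 1}")

# plt.show()  # uncomment to display the figure
```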

39
Matplotlib Scatter
 Creating Scatter Plots
 With Pyplot, you can use the scatter() function to draw a scatter plot
 The scatter() function plots one dot for each observation. It needs two arrays of the
same length, one for the values of the x-axis, and one for values on the y-axis
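A minimal sketch of scatter() with hypothetical observations:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical observations: x and y must have the same length.
x = np.array([5, 7, 8, 7, 2, 17])
y = np.array([99, 86, 87, 88, 111, 86])

plt.figure()
points = plt.scatter(x, y)   # one dot per (x, y) pair

# plt.show()  # uncomment to display the figure
```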

40
Matplotlib Scatter
 Change color and size of the markers

Note: make sure the arrays for colors and sizes have the same
length as the arrays for the x- and y-axis
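A sketch using the c and s keyword arguments of scatter(); the data, colors, and sizes are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([5, 7, 8, 7])
y = np.array([99, 86, 87, 88])

colors = ["red", "green", "blue", "orange"]   # one color per point
sizes = np.array([20, 50, 100, 200])          # one marker size per point

plt.figure()
points = plt.scatter(x, y, c=colors, s=sizes)

# plt.show()  # uncomment to display the figure
```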

41
Matplotlib Bars
 With Pyplot, you can use the bar() function to draw bar graphs
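For example, with hypothetical categories and values:

```python
import matplotlib.pyplot as plt
import numpy as np

categories = np.array(["A", "B", "C", "D"])   # bar labels on the x-axis
values = np.array([3, 8, 1, 10])              # bar heights

plt.figure()
bars = plt.bar(categories, values)

# plt.show()  # uncomment to display the figure
```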

42
Matplotlib Bars
 You can use plt.savefig("path") to save the generated image

Note: Call plt.savefig() before plt.show(); if the figure is shown first,
the saved image may be blank
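A sketch of saving a bar chart. The output location (a file in the system temporary directory) and the data are assumptions for illustration; any writable path works.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

categories = np.array(["A", "B", "C", "D"])
values = np.array([3, 8, 1, 10])

plt.figure()
plt.bar(categories, values)

# Hypothetical output path; the file extension selects the image format.
out_path = os.path.join(tempfile.gettempdir(), "bar_chart.png")
plt.savefig(out_path)   # save first ...

# plt.show()            # ... then show (uncomment in an interactive session)
```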

43
Horizontal Bars
 If you want the bars to be displayed horizontally instead of vertically, use the barh()
function
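The same hypothetical data drawn with barh():

```python
import matplotlib.pyplot as plt
import numpy as np

categories = np.array(["A", "B", "C", "D"])   # bar labels on the y-axis
values = np.array([3, 8, 1, 10])              # bar lengths

plt.figure()
bars = plt.barh(categories, values)

# plt.show()  # uncomment to display the figure
```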

44
Horizontal Bars
 Use the keyword argument color to set the color of the bars
 Use the keyword argument width (for bar()) or height (for barh()) to set the thickness of the bars
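A brief sketch of these keyword arguments; the colors and the thickness 0.1 are arbitrary illustrative values.

```python
import matplotlib.pyplot as plt
import numpy as np

categories = np.array(["A", "B", "C", "D"])
values = np.array([3, 8, 1, 10])

plt.figure()
vertical = plt.bar(categories, values, color="red", width=0.1)       # thin vertical bars

plt.figure()
horizontal = plt.barh(categories, values, color="green", height=0.1)  # thin horizontal bars

# plt.show()  # uncomment to display the figures
```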

45
Matplotlib Histograms
 Histogram
 A histogram is a graph showing frequency distributions
 It is a graph showing the number of observations within each given interval
 Say you ask for the height of 250 people, you might end up with a histogram like
this:

46
Matplotlib Histograms
 Create Histogram
 In Matplotlib, we use the hist() function to create histograms
 The hist() function takes an array of numbers as its argument and creates a
histogram from it
 We use NumPy to randomly generate an array with 250 values drawn from a normal
distribution with a mean of 170 and a standard deviation of 10
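A sketch of generating the data and drawing the histogram. The fixed seed is an assumption added so that the example is reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np

# 250 values from a normal distribution with mean 170 and standard deviation 10.
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=250)

plt.figure()
# hist() returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, patches = plt.hist(heights)

# plt.show()  # uncomment to display the figure
```

By default hist() uses 10 bins; the bins argument changes that.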

47
Matplotlib Pie Charts
 Create Pie Charts
 With Pyplot, you can use the pie() function to draw pie charts
By default, the plotting of the first wedge starts from the x-axis and moves
counterclockwise
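A minimal sketch of pie() with illustrative values:

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.array([35, 25, 25, 15])   # illustrative values, one wedge each

plt.figure()
# Without autopct, pie() returns the wedges and their (empty) label texts.
wedges, texts = plt.pie(values)

# plt.show()  # uncomment to display the figure
```

The starting angle can be changed with the startangle argument, given in degrees.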

49
Matplotlib Pie Charts
 Labels, titles, and percentages
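A sketch of these three options, assuming hypothetical category names and values: labels names each wedge, autopct formats the percentage printed on it, and plt.title() adds a title.

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.array([35, 25, 25, 15])                 # illustrative values
names = ["Apples", "Bananas", "Cherries", "Dates"]  # hypothetical wedge labels

plt.figure()
# With autopct set, pie() also returns the percentage text objects.
wedges, texts, autotexts = plt.pie(values, labels=names, autopct="%1.1f%%")
plt.title("Fruit Share")

# plt.show()  # uncomment to display the figure
```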

50
Matplotlib Pie Charts
 Explode
 Maybe you want one of the wedges to stand out? The explode parameter allows
you to do that
 The explode parameter, if specified, and not None, must be an array with one value
for each wedge
 Each value represents how far from the center each wedge is displayed
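For example, pulling only the first wedge away from the center (the data and the offset 0.2 are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.array([35, 25, 25, 15])
names = ["Apples", "Bananas", "Cherries", "Dates"]   # hypothetical labels
offsets = [0.2, 0, 0, 0]   # one value per wedge; 0 keeps the wedge in place

plt.figure()
wedges, texts = plt.pie(values, labels=names, explode=offsets)

# plt.show()  # uncomment to display the figure
```

Adding shadow=True to the same call draws a shadow under the wedges.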

51
Matplotlib Pie Charts
 Explode with Shadow

52
Python in Business Analytics

53
Class Objectives

Revisit important concepts of machine learning


 What is Machine Learning?
 What is linear regression and how does it work?
 What are the evaluation metrics for linear regression?
 How to train and interpret a linear regression model using Scikit-learn?

54
What is Machine Learning?

“Machine Learning is a field of study that gives computers the ability
to learn without being programmed.”
---- Samuel, A. (1959)

“An umbrella of a specific set of algorithms that all have one specific
purpose to learn to detect certain patterns from data.”
---- neurospace

55
What is Machine Learning?

 Machine Learning is making the computer learn from studying data and statistics
 A computer program that analyses data and learns to predict the outcome

https://www.youtube.com/watch?v=nKW8Ndu7Mjw

56
What is Machine Learning?

 Semi-automated extraction of
knowledge from data
 Start with questions that might be
answerable using data
 Exploit different types of models to
provide insights of the data using
computers
 Still require human judgement and
decision-making

57
Two Main Categories of ML

 Supervised Learning
 Is an email ‘spam’? How do sales differ
by gender?
 There is a specific outcome we are trying to
predict (Label)
 Unsupervised learning
 Extracting structure from data, or finding
the best representation of the data
 Segment grocery shoppers to clusters with
similar behavior
 Recommend movies/music based on past
viewing data
 There is no right or wrong answer
58
Two Main Categories of ML

Supervised Learning Unsupervised Learning


59
How Does Supervised Learning Work?

 Step One: Model training (train a machine learning model using labeled data)

 Step Two: Model prediction on new data (for which the label is unknown)

 Step Three: Evaluate the accuracy of the model (percentage of correct predictions
using labeled data)
60
Some Concepts
Features: Attendance, GPA. Target/Label: Grade.

Training Dataset
Student ID    Attendance    GPA    Grade
20465532      0.95          4.0    95
20339901      0.82          3.8    88
20567789      0.5           2.2    60
20339912      1.0           3.5    98
…             …             …      …

Testing Dataset
Student ID    Attendance    GPA    Grade (Real Value)    Predicted Value
20429981      0.90          3.9    93                    90
20890012      0.89          2.5    85                    86

61
Supervised Learning Terminology

Features – the values we observe and use to predict the response
 Also known as observation, sample, instance, record, independent variable

Response – the value we try to predict
 Also known as target, outcome, label, dependent variable

 The task of supervised learning


 For a new observation, given features, we want to predict the label of this
observation

62
Supervised Learning Algorithms

Supervised learning in which the response is continuous
 Linear regression

Supervised learning in which the response is categorical
 Logistic Regression
 K-nearest Neighbors Classifiers
 Naïve Bayes Classifier

63
Linear Regression

A machine learning model that can be used to predict continuous variables,
such as sales, stock price, or amount
 Runs quickly
 No tuning required
 Easily understandable
 It is well-known and well-documented

64
Linear Regression

It assumes a linear relationship between features and response, so it may not
generate good predictions if the underlying relationship is nonlinear

65
Linear Regression Example

Assume a person’s IQ is jointly determined by his/her father’s IQ and
his/her mother’s IQ according to the following regression function:

66
Linear Regression Example

 Question 1: how do we get this function?

 Question 2: how accurate is the prediction? (i.e., how to evaluate?)


Real Value

67
Evaluation Metrics For Regression

Mean absolute error (MAE) – the mean of the absolute value of the errors

Mean squared error (MSE) – the mean of the squared errors

Root mean squared error (RMSE) – the square root of the mean of the
squared errors
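The three metrics can be sketched with NumPy. As sample input, this uses the two real and predicted grades from the testing dataset in the "Some Concepts" table (93 vs. 90 and 85 vs. 86); the variable names are arbitrary.

```python
import numpy as np

y_true = np.array([93, 85])   # real values
y_pred = np.array([90, 86])   # predicted values

errors = y_true - y_pred          # prediction errors: 3 and -1

mae = np.mean(np.abs(errors))     # mean of the absolute errors
mse = np.mean(errors ** 2)        # mean of the squared errors
rmse = np.sqrt(mse)               # square root of the MSE

print(mae, mse, rmse)
```

For these two observations the MAE is 2.0 and the MSE is 5.0. Because the errors are squared before averaging, MSE and RMSE penalize large errors more heavily than MAE does.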

68
Scikit-learn for Model Development

69
Scikit-learn Requirement

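Features and response are separate objects

Features and response should be NumPy arrays
 Pandas DataFrames and Series are built on top of NumPy arrays

Features and response should have specific shapes (features: a 2-dimensional table with one row per observation and one column per feature; response: a 1-dimensional array with one value per observation)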
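A sketch of these requirements, using the four training rows from the "Some Concepts" table as sample data. The names X and y follow a common convention and are not prescribed by the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "Attendance": [0.95, 0.82, 0.5, 1.0],
    "GPA": [4.0, 3.8, 2.2, 3.5],
    "Grade": [95, 88, 60, 98],
})

X = df[["Attendance", "GPA"]]   # features: 2-D, one row per observation
y = df["Grade"]                 # response: 1-D, one value per observation

# The underlying NumPy arrays have the shapes scikit-learn expects.
X_array = X.to_numpy()
y_array = y.to_numpy()
print(X_array.shape, y_array.shape)
```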

70
Scikit-learn 7-step Modeling

 Step 1: define features and response columns
 Step 2: split data into training vs. test data
 Step 3: import the model you want to use
 Step 4: instantiate the model
 Step 5: fit your model with training data
 Step 6: make prediction for the test data
 Step 7: estimate the accuracy of the model
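The seven steps can be sketched for linear regression as follows. This is only an illustration under stated assumptions: the data is synthetic (generated from a made-up linear rule plus noise), the column names echo the earlier student example, and the 75/25 split and the use of RMSE as the accuracy measure are choices made here, not taken from the slides.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression      # Step 3: import the model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): Grade is a linear function of the features plus noise.
rng = np.random.default_rng(seed=0)
n = 100
df = pd.DataFrame({
    "Attendance": rng.uniform(0.5, 1.0, n),
    "GPA": rng.uniform(2.0, 4.0, n),
})
df["Grade"] = 20 + 40 * df["Attendance"] + 10 * df["GPA"] + rng.normal(0, 2, n)

# Step 1: define features and response columns
X = df[["Attendance", "GPA"]]
y = df["Grade"]

# Step 2: split data into training vs. test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Step 4: instantiate the model
model = LinearRegression()

# Step 5: fit the model with training data
model.fit(X_train, y_train)

# Step 6: make predictions for the test data
y_pred = model.predict(X_test)

# Step 7: estimate the accuracy of the model (here: RMSE on the test data)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("RMSE:", rmse)
```

The fitted intercept_ and coef_ give the estimated regression function, which is how the model can be interpreted.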

71
Jupyter Notebook

72
