Lesson 5 - Supervised Learning: Classification
Learning Objectives
Decision Tree
Random Forest
Naïve Bayes
Kernel SVM
Classification: Example

[Figure: News articles (e.g., an Acme article, a food article, a bar article) are sorted by a classifier into categories such as Technology, Food, Sports, and Entertainment.]
Classification

[Figure: A typical classifier model workflow with input training data and output labels: training algorithms learn a classifier from the training data, which is then tested on unseen data such as the record (Jeff, Professor, 4).]
Classification: A Supervised Learning Algorithm

[Figure: (a) Training: input passes through a feature extractor; the features and their labels feed a machine learning algorithm that produces a classifier model. (b) Prediction: new input passes through the same feature extractor, and the classifier model maps the features to a label.]
Decision Tree
[Figure: Anatomy of a decision tree: the root node splits into branches/sub-trees, internal nodes test splitting attributes, and leaves assign classes.]

[Figure: Training data with attributes Tid, Refund, Marital Status, and Taxable Income, and the class label Cheat.]
Decision Tree: Example 2
Forming a decision tree to check whether the match will be played or not, based on climatic conditions.
[Figure: A decision tree on the climatic data. The root (Play = 9, Don't Play = 5) splits on Outlook (Sunny, Cloudy, Rainy); branches split further on conditions such as Windy and temperature (°C), ending in leaves with pure Play or Don't Play counts.]
Decision Tree Formation
The attribute with the highest information gain is selected as the splitting attribute. For the full sample S: [9+, 5-], the entropy is E(S) = 0.940; comparing candidate attributes such as Humidity and Windy, Outlook yields the largest gain, Gain(S, Outlook) = 0.246.
The subsets produced by Outlook are then split further according to the gains of the remaining attributes. For each possible value of Humidity, you can add a successor to the tree.
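To make the computation concrete, here is a minimal Python sketch that reproduces E(S) = 0.940 and Gain(S, Outlook) ≈ 0.246. The per-branch counts Sunny [2+, 3-], Overcast [4+, 0-], and Rainy [3+, 2-] are the standard play-tennis partition, assumed here because the slide shows only the totals.

import math

# Entropy of a set with `pos` positive and `neg` negative examples
def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

e_s = entropy(9, 5)  # whole sample S: [9+, 5-] -> 0.940

# Splitting on Outlook partitions S into Sunny [2+, 3-],
# Overcast [4+, 0-], and Rainy [3+, 2-] (assumed standard counts)
subsets = [(2, 3), (4, 0), (3, 2)]
remainder = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)
print(f"E(S) = {e_s:.3f}, Gain(S, Outlook) = {e_s - remainder:.3f}")
# E(S) = 0.940, Gain(S, Outlook) = 0.247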
Which Attribute Is the Best Classifier?
[Figure: The partial tree after splitting on Outlook. The Overcast branch {D3, D7, D12, D13} is pure [4+, 0-]; the Sunny branch is split by Humidity ({D1, D2, D8} is No, {D9, D11} is Yes) and the Rainy branch by Windy ({D4, D5, D10} is Yes, {D6, D14} is No).]
Overfitting of Decision Trees
Post Pruning
A fully grown tree can overfit the training data; post pruning grows the full tree first and then removes branches that do not improve performance on held-out data.
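As a hedged illustration of post pruning, the sketch below uses scikit-learn's cost-complexity pruning, one common post-pruning strategy (the slides do not name a specific algorithm): grow the full tree, enumerate candidate pruning strengths, and keep the pruned tree that scores best on held-out data. The synthetic dataset is purely illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (alphas) for the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha and keep the tree that generalizes best
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print("pruned-tree held-out accuracy:", best.score(X_test, y_test))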
Classification
Topic 4: Random Forest Classifier
Bagging and Bootstrapping
[Figure: Bagging with bootstrapping: repeated bootstrap samples (N examples, M features each) are drawn with replacement from the training set.]
Decision Tree Classifier
Each bootstrap sample contributes a decision tree classifier to the ensemble.
[Figure: Random forest on an example gene-function dataset: each bootstrap sample of N examples with M features (e.g., gene expression, domain-motif, interaction, neighbor degree, expression/process similarity, tissue, centrality, location) grows its own decision tree, and the forest takes the majority vote across the trees.]
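The minimal sketch below mirrors the figure: bootstrap samples drawn with replacement each train a decision tree, and the ensemble predicts by majority vote. The synthetic dataset and the choice of 25 trees are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.default_rng(1)
trees = []
for _ in range(25):
    # Bootstrapping: sample N examples with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Bagging: take the majority vote across the individual trees
votes = np.stack([t.predict(X_test) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (forest_pred == y_test).mean())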
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a (TP)       b (FN)
CLASS      Class=No     c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
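A quick numeric check of the formula, with assumed counts:

a, b, c, d = 50, 10, 5, 35              # assumed TP, FN, FP, TN counts
print((a + d) / (a + b + c + d))        # 0.85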
Limitation of Accuracy
Accuracy can be misleading when classes are imbalanced: a model that always predicts the majority class scores high accuracy while never detecting the minority class.
Given a cost matrix C(i|j), the cost of classifying a class j example as class i, accuracy generalizes to:

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
Computing Cost of Classification
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a            b
CLASS      Class=No     c            d

N = a + b + c + d, and Accuracy = (a + d) / N.

Cost matrix (p = cost of a correct prediction, q = cost of a wrong one):

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    p            q
CLASS      Class=No     q            p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N[q - (q - p) · Accuracy]
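A quick check that the algebra holds, again with assumed counts and costs:

a, b, c, d = 50, 10, 5, 35     # assumed confusion-matrix counts
p, q = -1, 1                   # assumed costs: -1 per correct, +1 per wrong
N = a + b + c + d
accuracy = (a + d) / N
print(p * (a + d) + q * (b + c))          # direct cost: -70
print(N * (q - (q - p) * accuracy))       # N[q - (q - p)·Accuracy]: -70.0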
Assisted Practice
Random Forest Classifier Duration: 15 mins.
Problem Statement: Predict the survival of a horse based on various observed medical conditions. Load
the data from "horses.csv" and observe whether it contains missing values. The dataset contains many
categorical features; encode them with label encoding. Replace the missing values with the most frequent value
in each column. Fit a decision tree classifier and a random forest classifier, and observe their accuracy.
Objective: Learn to fit a decision tree and compare its accuracy with a random forest classifier.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password
that are generated. Click on the Launch Lab button. On the page that appears, enter the username and
password in the respective fields, and click Login.
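If you want a starting point, here is a rough sketch of the steps above. The target column name "outcome" is a hypothetical placeholder; check the actual columns in horses.csv.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("horses.csv")
print(df.isnull().sum())                  # observe the missing values

df = df.fillna(df.mode().iloc[0])         # most frequent value per column
for col in df.select_dtypes(include="object"):
    df[col] = LabelEncoder().fit_transform(df[col])   # label encoding

X, y = df.drop("outcome", axis=1), df["outcome"]      # "outcome" is assumed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in (DecisionTreeClassifier(), RandomForestClassifier()):
    print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))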
Unassisted Practice
Random Forest Classifier Duration: 15 mins.
Problem Statement: PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you would
want to invest in people with a high probability of paying you back.
You, as an ML expert, will create a model that helps predict whether a borrower will pay back the loan.
Objective: Increase profits by up to 20%, as NPAs (non-performing assets) will be reduced by disbursing loans only to creditworthy borrowers.
Note: This practice is not graded. It is only intended for you to apply the knowledge you gained to solve real-world
problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Import Libraries
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
Get the Data
Code
loans = pd.read_csv('loan_borowwer_data.csv')
loans.describe()
Exploratory Data Analysis
Create a histogram of two FICO distributions on top of each other, one for
each credit.policy outcome.
Code
plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')
Exploratory Data Analysis
Create a similar histogram of the two FICO distributions, this time with one distribution for each not.fully.paid outcome.
Code
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',
bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',
bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')
Exploratory Data Analysis
Create a countplot using seaborn showing the counts of loans by purpose,
with the hue defined by not.fully.paid.
Code
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')
Setting Up the Data
Create a list of elements containing the string "purpose". Call this list cat_feats.
Code
cat_feats = ['purpose']
Setting Up the Data
Now use pd.get_dummies(loans, columns=cat_feats, drop_first=True) to create a larger data frame that has new feature columns with dummy variables. Set this data frame as final_data.
Code
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.info()
Train-Test Split
Code
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state=101)
Training Decision Tree Model
Code
from sklearn.tree import DecisionTreeClassifier

# The original training cell was not shown on this slide; a tree with
# default settings is assumed
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
Create predictions from the test set, and create a classification report and a confusion matrix.
Code
predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
Confusion Matrix
Code
print(confusion_matrix(y_test,predictions))
Training Random Forest Model
Code
from sklearn.ensemble import RandomForestClassifier

# The original training cell was not shown on this slide; n_estimators=100
# is an assumed setting
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

Create predictions from the test set, and create a classification report.
Code
predictions = rfc.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
Printing the Confusion Matrix
Code
print(confusion_matrix(y_test,predictions))
Classification
Topic 7: Naïve Bayes Classifier
Naïve Bayes Classifier and Bayes' Theorem
Naive Bayes is a classification technique based on Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)

Note: The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Naïve Bayes Classifier: Example
As the first step toward prediction with Naive Bayes, estimate the frequency of each attribute value for each class.
Frequency Tables

                        Play
                        Yes    No
Outlook    Sunny        3      2
           Overcast     4      0
           Rainy        3      2

                        Play
                        Yes    No
Humidity   High         3      4
           Normal       6      1

                        Play
                        Yes    No
Wind       Strong       6      2
           Weak         3      3
Building Likelihood Tables
Likelihood Table
                        Play
                        Yes    No
Outlook    Sunny        3/9    2/5    5/14
           Overcast     4/9    0/5    4/14
           Rainy        3/9    2/5    5/14
                        10/14  4/14

P(B|A) = P(Sunny|Yes) = 3/9 = 0.33
P(B)   = P(Sunny)     = 5/14 = 0.36
P(A)   = P(Yes)       = 10/14 = 0.71

P(Yes|High) = 0.33 × 0.6 / 0.5 = 0.42        P(Yes|Weak) = 0.67 × 0.64 / 0.57 = 0.75
P(No|High)  = 0.8 × 0.36 / 0.5 = 0.58        P(No|Weak)  = 0.4 × 0.36 / 0.57 = 0.25
Getting the Output
Outlook = Rainy
Humidity = High
Wind = Weak
Play = ?
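A minimal sketch that plugs the likelihoods from the tables above into Bayes' theorem for this query (following the likelihood table, the denominators are 9 "Yes" days and 5 "No" days):

# Likelihoods read from the frequency tables above
likelihood = {
    "Yes": {"Rainy": 3/9, "High": 3/9, "Weak": 3/9},
    "No":  {"Rainy": 2/5, "High": 4/5, "Weak": 3/5},
}
prior = {"Yes": 9/14, "No": 5/14}

# Unnormalized posterior: P(class) times the product of P(feature | class)
score = {c: prior[c] * likelihood[c]["Rainy"] * likelihood[c]["High"]
                     * likelihood[c]["Weak"] for c in ("Yes", "No")}
total = sum(score.values())
for c, s in score.items():
    print(c, round(s / total, 3))   # Yes ≈ 0.26, No ≈ 0.74

With these table values the "No" class scores higher, so this day would be classified as Don't Play.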
Support Vector Machines

Consider binary separation, which can be viewed as the task of separating two classes in feature space:

Separating hyperplane:  w^T x + b = 0
Decision regions:       w^T x + b > 0  and  w^T x + b < 0
Classifier:             f(x) = sign(w^T x + b)
Optimal Separation
The distance from an example x_i to the separator is:

r_i = (w^T x_i + b) / ||w||

For a support vector x_s, after rescaling w and b so that y_s(w^T x_s + b) = 1:

r = y_s(w^T x_s + b) / ||w|| = 1 / ||w||

Then the margin can be expressed through the (rescaled) w and b as:

ρ = 2r = 2 / ||w||
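To connect the formulas to code, this sketch fits a (nearly) hard-margin linear SVM on separable toy data and recovers w, b, and the margin 2/||w||. The dataset and the large C value are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)
clf = SVC(kernel="linear", C=1000).fit(X, y)   # large C approximates a hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("margin 2/||w|| =", 2 / np.linalg.norm(w))

# Each support vector lies at distance |w.x + b| / ||w||, about 1/||w||
print(np.abs(clf.support_vectors_ @ w + b) / np.linalg.norm(w))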
Linear SVM: Mathematically

[Figure: Scenario 2: one-dimensional data on the x-axis that no single threshold can separate. Scenario 3: the same data mapped to a higher-dimensional space via φ: x → φ(x), where it becomes linearly separable.]
The Kernel Trick
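A small sketch of the idea behind the kernel trick: concentric-circle data cannot be separated by a linear boundary in the original space, but an RBF-kernel SVM separates it almost perfectly. The dataset and parameters are illustrative assumptions.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    score = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
    print(kernel, score)   # linear is near chance, rbf is near 1.0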
Assisted Practice
Support Vector Machines

Problem Statement: Motion Studios is the largest radio production house in Europe, with total revenue of $1B+. The company has launched a new reality show, "The Star RJ," about finding a new radio jockey who will be the star presenter on upcoming shows.
In the first round, participants have to upload their voice clips online. The clips are evaluated by experts for selection to the next round, with separate teams evaluating male and female voices.
The response to the show is unprecedented, and the company is flooded with voice clips.
You, as an ML expert, have to classify each voice as male or female so that the first level of filtration is quicker.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that
are generated. Click on the Launch Lab button. On the page that appears, enter the username and password
in the respective fields, and click Login.
Unassisted Practice
Support Vector Machines Duration: 15 mins.
Problem Statement: Load the data from "college.csv," which contains attributes collected about private and public colleges for a particular year. Predict the private/public status of the colleges from the other attributes.
Use LabelEncoder to encode the target variable to numerical form. Split the data such that 20% is set aside for testing. Fit a linear SVM from scikit-learn and observe the accuracy. [Hint: Use LinearSVC]
Preprocess the data using StandardScaler and fit the same model again. Observe the change in accuracy.
Use scikit-learn's grid search to select the best hyperparameters for a nonlinear SVM. Identify the model with the best score and its parameters. [Hint: Refer to the model_selection module of scikit-learn]
Objective: Employ SVMs from scikit-learn for binary classification, and measure the impact of preprocessing the data and of hyperparameter search using grid search.
Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-world
problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the respective
fields, and click Login.
Import the Dataset
Code
import pandas as pd
df = pd.read_csv("College.csv")
df.columns
Label Encoding
Code
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# The original cell was not shown on this slide; 'Private' as the target
# column name is an assumption about college.csv
df['Private'] = LabelEncoder().fit_transform(df['Private'])
X = df.drop('Private', axis=1).select_dtypes(include='number')
y = df['Private']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
classifier = LinearSVC()
Code
classifier.fit(X_train,y_train)
y_predict = classifier.predict(X_test)
classifier.score(X_test,y_test)
Obtain the Performance Metrics
Code
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predict))
Fit the SVC Classifier
Code
from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train,y_train)
classifier.score(X_test,y_test)
Preprocess the Data
Code
from sklearn.preprocessing import StandardScaler

# The scaling cell was not shown on this slide; StandardScaler follows
# the problem statement
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Code
classifier = SVC()
classifier.fit(X_train,y_train)
classifier.score(X_test,y_test)
Fitting Grid Search
Code
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)

# The cross-validation setup is an assumed choice
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
grid.fit(X_train, y_train)
Getting the Best Hyperparameter
Code
# The original cell was not shown; best_params_ and best_score_ hold the
# selected hyperparameters and their cross-validated score
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
a. x1
b. x2
c. x3
d. y
d. Go back to the parent node and select a different feature to split so that the y values are
not all the same at this node
The correct answer is b: Create a leaf that predicts the y value of all the data.
Lesson-End Project Duration: 20 mins.
Problem Statement: Load the kinematics dataset as measured on mobile sensors from the file
“run_or_walk.csv.” List the columns in the dataset. Let the target variable “y” be the activity, and assign all the
columns after it to “x.”
Using scikit-learn, fit a Gaussian Naive Bayes model and observe the accuracy. Generate a classification report
using scikit-learn. Refit the model once using only the acceleration values as predictors and then using only
the gyro values as predictors. Comment on the difference in accuracy between the two models.
Objective: Practice classification based on Naive Bayes algorithm. Identify the predictors that can be
influential.
Access: Click the Labs tab in the left side panel of the LMS. Copy or note the username and password that are
generated. Click the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
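As a non-authoritative starting point, here is a sketch of the workflow described above; the column names ("activity", "acceleration_*", "gyro_*") are assumptions about run_or_walk.csv.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

df = pd.read_csv("run_or_walk.csv")
print(df.columns)                                    # list the columns

y = df["activity"]                                   # target (assumed name)
X = df.iloc[:, df.columns.get_loc("activity") + 1:]  # all columns after it

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print(classification_report(y_test, model.predict(X_test)))

# Repeat with only the acceleration columns, then only the gyro columns:
# X_acc = X[[c for c in X.columns if c.startswith("acceleration")]]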
Thank You