14 Model Selection and Boosting

How To Make The Best Use Of Live Sessions

• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session

• All participants will be on mute by default to avoid any background noise; however, the instructor will unmute you if required. Please use the "Questions" tab on your webinar tool to interact with the instructor at any point during the class

• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic

• Raise a ticket through your LMS in case of any queries. Our dedicated support team is available 24 x 7 for your assistance

• Your feedback is very much appreciated. Please share feedback after each class to help us enhance your learning experience


Course Outline

▪ Introduction to Python
▪ Sequences and File Operations
▪ Deep Dive - Functions, OOPS, Modules, Errors and Exceptions
▪ Introduction to Numpy, Pandas and Matplotlib
▪ Data Manipulation
▪ Introduction to Machine Learning with Python
▪ Supervised Learning - I
▪ Dimensionality Reduction
▪ Supervised Learning - II
▪ Unsupervised Learning
▪ Association Rules Mining and Recommendation Systems
▪ Reinforcement Learning
▪ Time Series Analysis
▪ Model Selection and Boosting


Model Selection and Boosting
Topics
The topics covered in this module are:

▪ Model Evaluation

▪ Parameters of Model Evaluation

▪ Boosting

▪ AdaBoost Mechanism



Objectives
After completing this module, you should be able to:

▪ Understand what Model Evaluation is

▪ Discuss different measures of Model Evaluation

▪ Explain the significance of Model Selection

▪ Describe Boosting and its importance

▪ Demonstrate the AdaBoost mechanism


Model Evaluation & its Parameters



Importance of Model Evaluation

1. You know when you've succeeded
2. You know how much you've succeeded
3. You can decide when to stop
4. You can decide when to update the model


Model Evaluation
Models can be evaluated based on their measure of performance. A model may claim, for instance: "I have the highest accuracy and hence, I'm the best model."


Evaluating Measure of Performance
Three ways of evaluating model performance are introduced here:

▪ Train / Test on same data
▪ Train / Test on different data
▪ Cross Validation


Train / Test on Same Data
The entire dataset is used for both training and testing.

Maximizing training accuracy rewards only complex models that overfit the training data.
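To see why this is misleading, here is a minimal sketch (using scikit-learn's bundled iris dataset rather than the course dataset): a 1-nearest-neighbour classifier evaluated on its own training data scores perfectly, yet that number says nothing about performance on unseen data.

# Hypothetical illustration: training and testing on the same data
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

x, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1)   # 1-NN memorizes the training set
knn.fit(x, y)
print(accuracy_score(y, knn.predict(x)))    # 1.0, perfect but meaningless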


Train / Test on Different Data
The dataset is split into separate training and testing portions.

This gives a better estimate than testing on the training data, but a high-variance one: the estimate changes with the observations that happen to land in the testing set.


Cross Validation
A bunch of train/test splits (five, in the pictured example) is created, the testing accuracy is computed for each of them, and the results are averaged together.


Now, let's check the accuracy of each of these methods with a use case in Python.


Scenario
Following is a sample of the mtcars dataset, containing the variables:

cars: the types of cars you have
mpg: miles per gallon
cyl: number of cylinders
disp: displacement
hp: horsepower
drat: rear axle ratio
Feedback: feedback received on the cars

Cars                 mpg    cyl   disp    hp    drat   Feedback
MazdaRX4             21     6     160     110   3.9    1
MazdaRX4_WAG         21     6     160     110   3.9    0
Datsun_710           22.8   4     108     93    3.85   1
Hornet_4_Drive       21.4   6     258     110   3.08   1
Hornet_Sportabout    18.7   8     360     175   3.15   1
Valiant              18.1   6     225     105   2.76   1
Duster_360           14.3   8     360     245   3.21   0
Merc_240D            24.4   4     146.7   62    3.69   0
Merc_230             22.8   4     140.8   95    3.92   1
Merc_280             19.2   6     167.6   123   3.92   0
Merc_280C            17.8   6     167.6   123   3.92   0
Merc_450SE           16.4   8     275.8   180   3.07   0


Tasks To Do

01. Import the 'mtcars for manymerge' dataset in Python
02. Implement the KNN algorithm to find the feedback received with respect to the cyl and hp of the cars
03. Calculate the accuracy of the KNN model each time, changing the amount of testing data


Checking Accuracy if Random State = 5
If you perform the train/test split with random_state = 5:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=5)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))


The accuracy came out to be 66%. Now, let's see the accuracy if the random state is changed, say to 10.


Checking Accuracy if Random State = 10
If you perform the train/test split with random_state = 10:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))


The accuracy came out to be 50%. So when the training/testing split varies, the accuracy changes. You can address this by creating a bunch of train/test splits and averaging the results together, which is the idea behind Cross Validation. A small sketch of that averaging follows.
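As a minimal sketch (reusing x, y and the imports from the slides above; the seed values are arbitrary), repeating the split for several random states and averaging the accuracies is essentially what cross-validation automates:

import numpy as np

scores = []
for seed in (5, 10, 20, 30, 40):                  # arbitrary random states
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=4)
    knn.fit(x_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(x_test)))
print(np.mean(scores))                            # averaged accuracy estimate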


Cross Validation (K-Fold)
Steps for cross-validation (a hand-written sketch of this loop follows the list):

01. Split the dataset into K equal partitions (folds)
02. Use fold 1 as the testing set and the union of the other folds as the training set
03. Calculate the testing accuracy
04. Repeat steps 2 and 3 K times, using a different fold as the testing set each time
05. Use the average testing accuracy as the estimate of out-of-sample accuracy
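The loop below spells these steps out with scikit-learn's KFold splitter (a sketch assuming the x DataFrame and y Series from the earlier slides):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

kf = KFold(n_splits=5, shuffle=True, random_state=1)   # step 1: K folds
fold_scores = []
for train_idx, test_idx in kf.split(x):
    knn = KNeighborsClassifier(n_neighbors=4)
    knn.fit(x.iloc[train_idx], y.iloc[train_idx])      # step 2: train on K-1 folds
    y_pred = knn.predict(x.iloc[test_idx])             # step 3: test on the held-out fold
    fold_scores.append(metrics.accuracy_score(y.iloc[test_idx], y_pred))
print(np.mean(fold_scores))                            # step 5: average testing accuracy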


5-Fold Cross Validation Example
Each fold serves as the testing set for one iteration and as part of the training set for the other 4 iterations.


Cross Validation vs Train/Test Split

Cross-Validation
▪ More accurate estimate of out-of-sample accuracy
▪ More efficient use of data (every observation is used for both training and testing)

Train/Test Split
▪ Runs K times faster than K-fold cross-validation
▪ Simpler to examine the detailed results of the testing process


Let's see how Cross Validation can help in choosing between different models.


Cross Validation – Model Selection

1. Implement 10-fold cross-validation within one model and check the accuracy of the model
2. Implement cross-validation on another model and check the accuracy on the same data
3. Compare the accuracies of both models
4. Select the model with the highest accuracy


Cross Validation – Model Selection (Example)
Consider the same sample of the mtcars dataset shown earlier, with the variables cars, mpg, cyl, disp, hp, drat and Feedback.


Tasks To Do

01. Run a 10-fold KNN model on the 'mtcars for manymerge' dataset
02. Perform 10-fold cross-validation with another model, say logistic regression
03. Check the accuracy of both models
04. Select the model with the highest accuracy


Defining a 10 Fold KNN Model
Define a 10-fold KNN model and average the accuracy scores so obtained:

from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=4)
print(cross_val_score(knn, x, y, cv=10, scoring='accuracy').mean())

The accuracy came out to be 56.66%.

Note: cross_val_score directly computes a numpy array of accuracy scores, one per train/test split.
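If you want to inspect the individual fold scores rather than just their mean, hold on to the array first (a small sketch reusing the same knn, x and y):

scores = cross_val_score(knn, x, y, cv=10, scoring='accuracy')
print(scores)         # one accuracy value per fold
print(scores.mean())  # the averaged estimate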


Defining a 10 Fold Logistic Regression Model
Define a 10-fold Logistic Regression model and average the accuracy scores so obtained:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
print(cross_val_score(logreg, x, y, cv=10, scoring='accuracy').mean())

The accuracy came out to be 28.33%.

Hence, you can infer that the KNN model is better suited to this task than the Logistic Regression model.


Accuracy is one of the primary criteria for Model Selection. Let's work on how to improve it.


Concept of Weak Learners
Consider a classification scenario with two classes (Class 1 and Class 2). Classifiers can be placed on an error-rate scale from 0 to 1:

▪ Strong Learner: a classifier with zero error rate
▪ Weak Learner: a classifier with a somewhat significant error rate, but still better than random guessing (error rate below 0.5)
▪ Poor Learner: a classifier with an error rate at or above 0.5; such a classifier is discarded


Boosting
The concept of converting a group of Weak Learners into a Strong Learner.


Boosting – Working
Consider a scenario where you have 3 classifiers (H1, H2, H3) for your data:

Step 1: Identify the classifier that best classifies the data with respect to accuracy, say H1
Step 2: Identify the region where H1 produces errors, add weight to it, and produce a H2 classifier
Step 3: Exaggerate those samples for which H1 gives a different result from H2, producing a H3 classifier


Different Boosting Algorithms

AdaBoost (Adaptive Boosting)

Gradient Tree Boosting

XGBoost



Let's understand AdaBoost with the help of the popular decision tree stumping process.


Scenario
Consider a scenario where you have a scatter of + (plus) and – (minus) data points.

Task: classify the +'s and –'s.


AdaBoost: Step 01
Initially, equal weights are assigned to each data point and a decision stump is applied to classify them as + (plus) or – (minus). The decision stump (D1) generates a vertical line on the left side to separate the data points.

Note: This vertical line incorrectly predicts three + (plus) as – (minus). In such a case, higher weights are assigned to these three + (plus) and another decision stump is applied. Also note that the classification could proceed using a horizontal line as well.


AdaBoost: Step 02
The three incorrectly predicted + (plus) points are made bigger (given higher weight) compared to the rest of the data points, so the second decision stump (D2) will try to predict them correctly.

Note: A vertical line (D2) on the right side of the box classifies the three mis-classified + (plus) correctly, but in doing so it mis-classifies three – (minus) points.


AdaBoost: Step 03
Higher weights are assigned to the three mis-classified – (minus) points and another decision stump (D3) is applied. This time a horizontal line is generated to classify + (plus) and – (minus) based on the higher weights of the mis-classified observations.


AdaBoost: Step 04
D1, D2 and D3 are combined to form a strong predictor whose rule is more complex than that of any individual weak learner.


Let's understand the mechanism on which AdaBoost works.


The AdaBoost Mechanism

Step 01: Initially, each data point is weighted equally with weight $W_i = \frac{1}{n}$, where $n$ is the number of samples.

Step 02: A classifier is picked that best classifies the data with minimal error rate. Let this classifier be H1.

Step 03: The weighing factor $\alpha_t$ depends on the error rate $\epsilon_t$ caused by the H1 classifier:
$$\alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$

Step 04: The weight after time $t$ can thus be defined as:
$$W_i^{t+1} = \frac{W_i^t}{Z}\, e^{-\alpha_t \, h_1(x) \, y(x)}$$
where $Z$ is the normalizing factor and $h_1(x)\,y(x)$ tells you the sign of the current output (positive when the prediction agrees with the label, negative otherwise).

The sample data with errors will now be weighted with $W_i^{t+1}$ to produce the H2 classifier, and so on.
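As a worked illustration of these formulas (with assumed numbers, not taken from the slides): suppose a stump mis-classifies 3 of 10 equally weighted points, so $\epsilon_t = 0.3$. Then
$$\alpha_t = \frac{1}{2} \ln\frac{1-0.3}{0.3} \approx 0.42$$
so each correctly classified point's weight is scaled by $e^{-0.42} \approx 0.65$ and each mis-classified point's weight by $e^{+0.42} \approx 1.53$; after dividing by $Z$, the mis-classified points carry a larger share of the total weight in the next round.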


The AdaBoost Mechanism: Flowchart

1. Weigh each data point equally with weight $W_i = \frac{1}{n}$
2. Pick a classifier that minimizes the error rate $\epsilon_t$
3. Pick $\alpha_t$, a kind of weighing factor
4. Calculate $W_i^{t+1}$ and repeat from step 2 with the reweighted data

A from-scratch sketch of this loop is given below.
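The following is a minimal from-scratch sketch of the loop (an illustration, not the course code), assuming -1/+1 labels to match the $h(x)\,y(x)$ sign convention and using scikit-learn decision stumps as the weak learners; adaboost_fit and adaboost_predict are hypothetical helper names:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    # y is assumed to be a numpy array of -1/+1 labels
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: W_i = 1/n
    stumps, alphas = [], []
    for t in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # step 2: fit a weak learner
        pred = stump.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 3: weighing factor
        w = w * np.exp(-alpha * pred * y)         # step 4: reweight samples
        w = w / w.sum()                           # normalize (divide by Z)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Combine the weak learners by a weighted vote, as in Step 04 earlier
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)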


Let us understand the implementation of AdaBoost in Python.


Scenario
Consider the 'Pima Indians Diabetes Dataset' with the following attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

Goal: Classify whether the person is diabetic or not with maximum accuracy.


Tasks To Do

1. Import the 'Diabetes.txt' file from your local system into the Python environment
2. Use 10-fold cross-validation to split the data into training and testing sets
3. Apply the AdaBoost classification algorithm, maximizing the classification accuracy, and calculate it


AdaBoost Using Python

import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

df = pd.read_csv('Diabetes.txt', sep=",", header=None)
df.columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi',
              'age', 'class']
array = df.values
X = array[:, 0:8]
Y = array[:, 8]
# shuffle=True is needed when passing a random_state to KFold
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
model = AdaBoostClassifier(n_estimators=30, random_state=7)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
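Since the number of weak learners is the main knob here, a quick way to probe it (a sketch reusing X, Y and kfold from above; the exact scores will depend on your data) is:

for n in (10, 30, 50, 100):
    model = AdaBoostClassifier(n_estimators=n, random_state=7)
    score = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
    print(n, score)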


Congratulations! The accuracy of your classification model, as a result of boosting, comes out to be about 76%.


Summary
▪ Evaluating Measure of Performance

▪ Cross Validation

▪ Boosting

▪ Boosting Algorithms

▪ AdaBoost Mechanism

▪ AdaBoost using Python



Copyright © edureka and/or its affiliates. All rights reserved.