14 Model Selection and Boosting
• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session
• All participants will be on mute by default to avoid any background noise. However, the instructor will unmute you if required. Please use the “Questions” tab on your webinar tool to interact with the instructor at any point during the class
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic
• Raise a ticket through your LMS in case of any queries. Our dedicated support team is available 24 x 7 for your assistance
• Your feedback is much appreciated. Please share feedback after each class; it will help us enhance your learning experience
▪ Model Evaluation
▪ Boosting
▪ AdaBoost Mechanism
“I have the highest accuracy and hence, I’m the best model!”
Cross Validation
Maximizing training accuracy rewards only complex models that overfit the training data
A train/test split gives a better estimate of out-of-sample performance, but it is still a high-variance estimate because it depends on which observations end up in the testing set
Cross validation creates a set of train/test splits, checks the testing accuracy for each of them, and averages the results together
Calculate the accuracy of the KNN model each time by changing the train/test split (the random_state passed to train_test_split)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the dataset and select the features and the target
df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']

# Split the data with random_state = 5 and evaluate a KNN classifier
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=5)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
metrics.accuracy_score(y_test, y_pred)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the dataset and select the features and the target
df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']

# Split the data with random_state = 10 and evaluate the same KNN classifier
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
metrics.accuracy_score(y_test, y_pred)
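The same experiment can be written as a loop. Here is a minimal sketch (assuming the same 'mtcars for manymerge.csv' file and the cyl/hp features and Feedback target shown above; the random_state values are arbitrary) that prints the testing accuracy for several different splits, illustrating why a single split is a high-variance estimate:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']

# Testing accuracy changes from split to split -> a high-variance estimate
for seed in [1, 5, 10, 20, 42]:
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    print('random_state =', seed, '-> accuracy =',
          metrics.accuracy_score(y_test, knn.predict(x_test)))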
K-fold cross validation proceeds as follows:
01 Split the dataset into K equal partitions/folds
02 Use fold 1 as the testing set and the union of the other folds as the training set
03 Calculate the testing accuracy
04 Repeat steps 02 and 03 K times, using a different fold as the testing set each time
05 Use the average testing accuracy as the estimate of out-of-sample accuracy
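A minimal sketch of these five steps (assuming the same 'mtcars for manymerge.csv' file and columns used earlier; K = 5 is an illustrative choice):

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']].values
y = df['Feedback'].values

# 01: split into K equal folds; 02-04: rotate the testing fold and record accuracy
kf = KFold(n_splits=5, shuffle=True, random_state=5)
accuracies = []
for train_idx, test_idx in kf.split(x):
    knn = KNeighborsClassifier(n_neighbors=4)
    knn.fit(x[train_idx], y[train_idx])
    accuracies.append(metrics.accuracy_score(y[test_idx], knn.predict(x[test_idx])))

# 05: the average testing accuracy estimates out-of-sample accuracy
print('Estimated out-of-sample accuracy:', sum(accuracies) / len(accuracies))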
Note: cross_val_score directly computes a NumPy array of the accuracy scores for the different train/test folds
Hence, you can infer that the KNN model is better suited to this task than the Logistic Regression model
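A rough sketch of how such a comparison could be run with cross_val_score (assuming the same file and columns as above; 5 folds chosen for illustration):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('mtcars for manymerge.csv')
x = df[['cyl', 'hp']]
y = df['Feedback']

# cross_val_score returns a NumPy array with one accuracy score per fold
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=4), x, y,
                             cv=5, scoring='accuracy')
logreg_scores = cross_val_score(LogisticRegression(max_iter=1000), x, y,
                                cv=5, scoring='accuracy')

# Compare the average out-of-sample accuracy of the two models
print('KNN mean accuracy:', knn_scores.mean())
print('Logistic Regression mean accuracy:', logreg_scores.mean())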
[Figure: error rate scale from 0 to 1, with 0.5 as the midpoint, for a two-class (Class 1 vs Class 2) problem]
▪ Strong Learner: A classifier with zero error rate
▪ Weak Learner: A classifier with a somewhat significant error rate
▪ Poor Learner: A classifier that is discarded
XGBoost
Task: Classify the + (plus) and – (minus) data points
[Figure: scatter of + (plus) and – (minus) data points]
Step 01
Initially, equal weights are assigned to each data point and a decision stump is applied to classify them as + (plus) or – (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points
Note: This vertical line has incorrectly predicted three + (plus) as – (minus). In this case, higher weights are assigned to these three + (plus) points and another decision stump is applied. Also note that the classification could equally proceed using a horizontal line
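As a minimal sketch of what a weighted decision stump looks like in code (the toy +/– coordinates are invented for illustration; a stump is just a depth-1 decision tree):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data in the spirit of the +/- figure (coordinates are made up)
X = np.array([[1, 5], [2, 4], [3, 6], [6, 5], [7, 3], [8, 6],
              [2, 1], [5, 2], [7, 1], [8, 3]])
y = np.array([+1, +1, +1, +1, +1, -1, -1, -1, -1, -1])

# Initially every point gets the same weight W_i = 1/n
weights = np.full(len(y), 1.0 / len(y))

# A decision stump is a depth-1 decision tree (a single split, e.g. a vertical line)
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=weights)

# Points the stump gets wrong would receive higher weights in the next round
misclassified = stump.predict(X) != y
print('Misclassified points:', np.where(misclassified)[0])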
Step 02
Note: A vertical line (D2) on the right side of this box has classified the three previously misclassified + (plus) points correctly. However, it has in turn caused misclassification errors on three – (minus) points
AdaBoost Mechanism
Step 01: Initially, each data point is weighted equally with weight $W_i = \frac{1}{n}$, where $n$ is the number of samples
Step 02: A classifier is picked that best classifies the data with minimal error rate. Let this classifier be H1
Step 03: The weighing factor $\alpha_t$ depends on the errors $\epsilon_t$ caused by the H1 classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
Step 04: The weight after time $t$ can thus be defined as $W_i^{t+1} = \frac{W_i^{t}\, e^{-\alpha_t \cdot h_1(x) \cdot y(x)}}{Z}$, where $Z$ is the normalizing factor and $h_1(x) \cdot y(x)$ tells you the sign of the current output
The sample data points with errors will now be weighted with $W_i^{t+1}$ to produce the H2 classifier, and so on
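Putting the four steps together, here is a minimal NumPy sketch of a single boosting round (the toy labels and H1 predictions are invented for illustration) that computes $\epsilon_t$, $\alpha_t$, and the updated weights $W_i^{t+1}$ exactly as defined above:

import numpy as np

# Toy example: true labels y(x) and the predictions of classifier H1, h1(x)
y = np.array([+1, +1, +1, -1, -1, -1, +1, -1])
h1 = np.array([+1, +1, -1, -1, -1, -1, -1, -1])   # H1 misclassifies two points

# Step 01: equal initial weights W_i = 1/n
n = len(y)
w = np.full(n, 1.0 / n)

# Step 03: weighted error of H1 and the weighing factor alpha_t
eps = np.sum(w[h1 != y])
alpha = 0.5 * np.log((1 - eps) / eps)

# Step 04: re-weight the samples; h1(x)*y(x) is +1 if correct, -1 if wrong
w_new = w * np.exp(-alpha * h1 * y)
w_new /= w_new.sum()          # divide by Z, the normalizing factor

print('epsilon_t =', eps, 'alpha_t =', alpha)
print('Updated weights:', w_new)   # misclassified points now carry more weight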
Calculate $W^{t+1}$
Sample Dataset
Goal: Classify whether the person is diabetic or not with maximum accuracy
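One way this goal could be approached is with scikit-learn's AdaBoostClassifier, which boosts decision stumps by default. This is a sketch only: the file name 'diabetes.csv' and the 'Outcome' target column are assumptions for illustration and should be adjusted to the actual sample dataset used in class.

import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Load the sample dataset (file name and target column are assumed for illustration)
df = pd.read_csv('diabetes.csv')
x = df.drop('Outcome', axis=1)
y = df['Outcome']

# AdaBoost with decision stumps as weak learners; estimate out-of-sample accuracy
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=5)
scores = cross_val_score(ada, x, y, cv=5, scoring='accuracy')
print('Mean cross-validated accuracy:', scores.mean())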
▪ Cross Validation
▪ Boosting
▪ Boosting Algorithms
▪ AdaBoost Mechanism