Lesson 8 - Ensemble Learning
Learning Objectives
• Ensemble Learning
• Model Selection
• Cross-Validation
[Diagram: inputs are fed to multiple models (Model 1 … Model 3) whose outputs are combined into a single prediction.]
Significance of Ensemble Learning
01 Robustness
02 Accuracy
Ensemble Learning Methods
Averaging
Equal weights are assigned to the different models, and the combined prediction is the simple average of the individual predictions:
P = (p1 + p2 + p3) / 3
[Example table: predictions [50, 0, 0] for 50 samples whose true class is setosa.]
Weighted Averaging
Different weights are assigned to the different models, and the combined prediction is the weighted average of the individual predictions:
P = w1·p1 + w2·p2 + w3·p3
[Example table: model weights of 50, 35, and 15 percent; predictions [50, 0, 0] for 50 samples whose true class is setosa.]
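As a minimal sketch of both methods (the three class-probability vectors below are illustrative assumptions, not the slides' data), simple and weighted averaging can be written as:

import numpy as np

# Hypothetical class-probability predictions from three models for one sample
p1 = np.array([0.9, 0.1, 0.0])
p2 = np.array([0.8, 0.1, 0.1])
p3 = np.array([0.7, 0.2, 0.1])

# Simple averaging: equal weights for every model
p_avg = (p1 + p2 + p3) / 3

# Weighted averaging: e.g., weights of 0.50, 0.35, and 0.15
weights = np.array([0.50, 0.35, 0.15])
p_weighted = np.average(np.vstack([p1, p2, p3]), axis=0, weights=weights)

print("Averaged prediction:", p_avg, "-> class", p_avg.argmax())
print("Weighted prediction:", p_weighted, "-> class", p_weighted.argmax())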
Bagging
[Diagram: the original training data D is resampled into bootstrap resamples D1, D2, …, Dt-1, Dt, and a model is trained on each resample.]
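As an illustrative sketch (not part of the original slides), bootstrap resampling and model combination can be done with scikit-learn's BaggingClassifier, using the Iris data purely as an example:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each of the 10 base estimators (decision trees by default) is trained
# on a bootstrap resample of the training data, and their votes are combined
bagging = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))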
Boosting
[Analogy: three doctors diagnose a patient in sequence. The first doctor is sure the headache is caused by X but cannot explain the stomach pain or the leg pain. The second doctor agrees on X and says the leg hurts because of Y, but cannot explain the stomach pain. The third doctor agrees on X, sort of agrees on Y, and is sure the stomach pain is caused by Z.]
Final Diagnosis: headache caused by X, leg pain caused by Y, stomach pain caused by Z. Each successive opinion focuses on what the previous ones could not explain, and the combined diagnosis is stronger than any single one.
Boosting (Contd.)
1. Train a classifier H1 that best classifies the data with respect to accuracy.
2. Identify the region where H1 produces errors, add weights to those samples, and produce an H2 classifier.
3. Exaggerate those samples for which H1 gives a different result from H2 and produce an H3 classifier.
Random Forest
From the full feature set {1, 2, 3, 4}, random subsets such as {1, 2}, {2, 3}, {3, 4}, and {1, 4} are drawn. Smaller trees are built using these subsets, creating tree diversity.
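A minimal sketch of this idea with scikit-learn (using the Iris data purely as an example): the per-split feature subsetting is controlled by the max_features parameter of RandomForestClassifier.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each tree sees a bootstrap sample of rows and, at every split,
# only a random subset of the 4 features (max_features=2 here)
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=42)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))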
Adaboost Working
Consider a scenario where the data points are labeled '+' and '–'.
Objective: Classify the '+' and '–' points.
[Diagram: a scatter of '+' and '–' points to be separated.]
Adaboost Working: Step 01
• A decision stump (D1) generates a vertical plane on the left side to classify the data points; some '+' points are misclassified and receive higher weights.
Adaboost Working: Step 02
• The second decision stump (D2) tries to predict the misclassified points correctly.
• The new vertical plane (D2) now classifies the three previously misclassified '+' points correctly.
Adaboost Working: Step 03
• A third decision stump (D3) is trained, focusing on the points that are still misclassified.
Adaboost Working: Step 04
• D1, D2, and D3 are combined to form a strong classifier whose decision boundary is more complex than that of any individual weak learner: this is the final classifier.
Adaboost Algorithm
1. Initially, each data point is weighted equally with weight W_i = 1/n.
2. A classifier H1 is picked that best classifies the data with minimal error rate.
3. The weighting factor α depends on the errors (ε_t) caused by the H1 classifier.
4. The weight after time t is given as W_i^(t+1) = (W_i^t / Z) · e^(−α_t · h1(x) · y(x)); calculate W^(t+1) for every data point and repeat with the updated weights.
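A minimal numpy sketch of this weight update for one boosting round (the toy labels and the weak classifier h1 below are illustrative assumptions, not the slides' data):

import numpy as np

# Toy labels y(x) in {-1, +1} and the predictions of a weak classifier h1(x)
y = np.array([1, 1, -1, -1, 1])
h1 = np.array([1, -1, -1, -1, 1])   # one sample is misclassified

n = len(y)
w = np.full(n, 1.0 / n)             # 1. equal initial weights W_i = 1/n

eps = np.sum(w[h1 != y])            # 3. weighted error of H1
alpha = 0.5 * np.log((1 - eps) / eps)

# 4. weight update: W_i^(t+1) = (W_i^t / Z) * exp(-alpha * h1(x) * y(x))
w_new = w * np.exp(-alpha * h1 * y)
w_new /= w_new.sum()                # Z normalizes the weights to sum to 1

print("alpha:", round(alpha, 3))
print("updated weights:", np.round(w_new, 3))  # the misclassified sample gains weight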
Gradient Boosting (GBM)
Gradient boosting involves three elements:
1. A loss function to be optimized
2. A weak learner to make predictions
3. An additive model that adds weak learners to minimize the loss function
GBM minimizes the loss function (MSE) of a model by adding weak learners using a gradient descent procedure.
GBM Mechanism
Step 01: Fit a simple model on the data and make predictions.
Step 02: Calculate the error residuals (actual value minus predicted value).
Step 03: Fit a new model on the error residuals as the target variable, with the same input variables.
Step 04: Add the predicted residuals to the previous predictions.
Step 05: Fit another model on the residuals that remain and repeat steps 2 to 5 until the model starts to overfit or the sum of residuals becomes constant.
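A minimal sketch of this residual-fitting loop (an illustrative reconstruction, not the slides' code), using shallow decision trees as the weak learners on a synthetic regression problem:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

pred = np.full_like(y, y.mean())            # Step 1: start from a simple prediction
learning_rate, trees = 0.1, []

for _ in range(100):
    residuals = y - pred                     # Step 2: error residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                   # Step 3: fit a weak learner on the residuals
    pred += learning_rate * tree.predict(X)  # Step 4: add the predicted residuals
    trees.append(tree)                       # Step 5: repeat

print("Final training MSE:", round(np.mean((y - pred) ** 2), 2))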
XGBoost
eXtreme Gradient Boosting is a library for developing fast and high-performance gradient boosting tree models.
• Uses a custom tree building algorithm
• Used for classification, regression, and ranking, with custom loss functions
• Provides interfaces for Python and R, and can be executed on YARN
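A minimal usage sketch (assuming the xgboost package is installed, and using the Iris data purely as an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient-boosted trees via the scikit-learn style interface
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.3)
model.fit(X_train, y_train)
print("XGBoost accuracy:", model.score(X_test, y_test))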
XGBoost Parameters
XGBoost parameters fall into three groups:
1. General Parameters
• Number of threads
2. Booster Parameters
• Step size
• Regularization
3. Task Parameters
• Objective
• Evaluation metric
XGBoost Library Features
The XGBoost library's features are built for the sole purpose of improving model performance and computational speed.
General Parameters
• nthread
- Number of parallel threads
- If no value is entered, the algorithm automatically detects the number of cores and runs on all of them
• booster
- gbtree: tree-based model
- gblinear: linear function
Booster Parameters
• eta
- Step size shrinkage used in the update step to prevent overfitting
- Range [0, 1], default 0.3
• gamma
- Minimum loss reduction required to make a split
- Range [0, ∞], default 0
• max_depth
- Maximum depth of a tree
- Range [1, ∞], default 6
• min_child_weight
- Minimum sum of instance weight needed in a child
- Range [0, ∞], default 1
Booster Parameters (Contd.)
• max_delta_step
- Maximum delta step allowed in each tree's weight estimation
- Range [0, ∞], default 0
• subsample
- Subsample ratio of the training instances
- Range [0, 1], default 1
• colsample_bytree
- Subsample ratio of columns when constructing each tree
- Range [0, 1], default 1
Booster Parameters (Contd.)
• lambda
- L2 regularization term on weights
- default 1
• alpha
- L1 regularization term on weights
- default 0
• lambda_bias
- L2 regularization term on bias
- default 0
Task Parameters
Task parameters specify the optimization objective and the evaluation metric to be calculated at each step.
1. Objective
• The learning task to be optimized (for example, regression or classification)
2. Evaluation Metric
• "rmse"
• "logloss"
• "error"
• "auc"
• "merror"
• "mlogloss"
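A minimal sketch showing where each parameter group appears when training with the native xgboost API (the parameter values and the breast-cancer example data are illustrative assumptions, not recommendations):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
dtrain, dtest = xgb.DMatrix(X_train, label=y_train), xgb.DMatrix(X_test, label=y_test)

params = {
    "nthread": 4,                    # general parameter
    "booster": "gbtree",             # general parameter
    "eta": 0.3, "max_depth": 6,      # booster parameters
    "lambda": 1, "gamma": 0,         # booster parameters
    "objective": "binary:logistic",  # task parameter
    "eval_metric": "logloss",        # task parameter
}
model = xgb.train(params, dtrain, num_boost_round=50)
preds = (model.predict(dtest) > 0.5).astype(int)
print("Test accuracy:", (preds == y_test).mean())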
Reference: https://xgboost.readthedocs.io/en/latest/parameter.html
Assisted Practice
Boosting (Duration: 15 mins)
Problem Statement: The Pima Indians Diabetes dataset contains diagnostic measures, such as BMI and blood pressure, for female patients who are at least 21 years old.
Objective: Predict the onset of diabetes from the diagnostic measures using a boosting classifier.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Unassisted Practice
Boosting (Duration: 20 mins)
Problem Statement: The Iris plant has 3 species: Iris Setosa, Iris Versicolour, and Iris Virginica. One class is linearly separable from the other two, whereas the latter are not linearly separable from each other.
Objective: Classify the three Iris species using an AdaBoost classifier.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Step 1: Data Import
Code
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Step 2: Classifier
Code
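The classifier code is not reproduced in this extract; below is a minimal sketch consistent with the imports and variables from Step 1 (the hyperparameter values are illustrative assumptions):

# Build and evaluate an AdaBoost classifier on the split created in Step 1
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))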
Examples: Train/Test Split
[Diagram: a labeled dataset of '+' and '–' points is split into a training set, used to learn a hypothesis from the hypothesis space H, and a testing set, used to evaluate the learned hypothesis.]
Train/Test Split (Contd.)
[Diagram: the hypothesis learned from the training data is evaluated on the testing set, classifying 9 of 13 test points correctly.]
K-Fold Cross-Validation
[Diagram: the data is split into K folds; each fold serves once as the test set while the remaining folds are used for training.]
Leave-One-Out
[Diagram: leave-one-out cross-validation, where each single observation is held out once as the test set.]
Train/Test Split vs. Cross-Validation
▪ Cross-validation: more accurate estimate of out-of-sample accuracy
▪ Train/test split: runs K times faster than K-fold cross-validation
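A minimal sketch contrasting the two approaches (using KNN on the Iris data purely as an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Single train/test split: fast, but the score depends on one particular split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Train/test accuracy:", knn.fit(X_train, y_train).score(X_test, y_test))

# 5-fold cross-validation: K times slower, but a more stable estimate
scores = cross_val_score(knn, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())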
Assisted Practice
Cross-Validation
Problem Statement: A few learners have implemented a random forest classifier on the Iris data, but a better accuracy estimate can be achieved using the cross-validation sampling technique.
Objective:
• Generate the random forest using the cross-validation splitting technique.
• Determine the accuracy as the average of all the resultant accuracies.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter the
username and password in the respective fields, and click Login.
Unassisted Practice
Cross-Validation (Duration: 20 mins)
Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. In order to classify cars, the company has come up with two classification models (KNN and Logistic Regression).
Objective: Perform a model selection between the above two models using 10-fold cross-validation as the sampling technique.
Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-
world problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Defining a 10-Fold KNN Model
Define a 10-fold cross-validated KNN model and a 10-fold cross-validated Logistic Regression model, and calculate the average accuracy of each.
Code
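A minimal sketch of the comparison (the Iris data below is only a runnable stand-in; substitute the mtcars feature matrix and class labels, whose preprocessing is not shown in this extract):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in data: replace with the mtcars features X and target y
X, y = load_iris(return_X_y=True)

# 10-fold cross-validated accuracy for each candidate model
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10, scoring="accuracy")
logreg_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")

print("KNN mean accuracy:", knn_scores.mean())
print("Logistic Regression mean accuracy:", logreg_scores.mean())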
Hence, you can infer that the KNN model is better suited to this task than the Logistic Regression model.
Key Takeaways
• XGBoost has proven to push the limits of computing power for boosted trees.
Lesson-End Project
Car Evaluation Database (Duration: 30 mins)
Problem Statement: The used car market has grown significantly in recent times, with clients ranging from used car dealers to buyers. You are provided with a car evaluation dataset that has features like price, doors, safety, and so on.
You are required to create a robust model that allows stakeholders to predict the condition of a used vehicle.
Objective:
▪ Predict the condition of a vehicle based on features
▪ Plot the most important features
▪ Train multiple classifiers and compare the accuracy
▪ Evaluate XGBoost model with K-fold cross-validation
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Thank You