
Statistical Machine Learning

Unit 02 Lecture 08

Dr. Mohammad Asif Khan


Contents
 K-Nearest Neighbors
 A Small Example: Predicting Loan Default
 Distance Metrics
 One Hot Encoder
 Standardization (Normalization, z-Scores)
 Choosing K
 KNN as a Feature Engine
 Tree Models
 A Simple Example
 The Recursive Partitioning Algorithm
 Measuring Homogeneity or Impurity
 Stopping the Tree from Growing
 Predicting a Continuous Value
 How Trees Are Used
 Bagging and the Random Forest
 Bagging
 Random Forest
 Variable Importance
 Hyperparameters
 Boosting
 The Boosting Algorithm
 XGBoost
 Regularization: Avoiding Overfitting
 Hyperparameters and Cross-Validation
K-Nearest Neighbors
 KNN is one of the simpler prediction/classification
techniques:
 There is no model to be fit (as in regression).
 The idea behind nearest neighbors is to select one or more
examples from the training data to decide the predicted value
for the sample at hand.
 The simplest way to do that is to iterate through the whole training dataset and pick the closest data points.
 The prediction results depend on how the features are
scaled, how similarity is measured, and how big K is set.
 Also, all predictors must be in numeric form.
K-Nearest Neighbors
 The idea behind K-Nearest Neighbors (KNN) is:
 Find K records that have similar features (at least 5 to 7).
 For classification, find out what the majority class is among those
similar records and assign that class to the new record.
 For prediction, find the average among those similar records,
and predict that average for the new record.
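To make the idea concrete, here is a minimal from-scratch sketch (my own code, not the lecture's) showing both uses: majority vote for classification and averaging for regression. The tiny arrays are made-up illustrative data.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, classify=True):
    """Predict for one new record using its K nearest neighbors."""
    # Euclidean distance from the new record to every training record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the K closest training records
    nearest = np.argsort(dists)[:k]
    if classify:
        # Classification: majority class among the K neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression/prediction: average of the K neighbors' values
    return y_train[nearest].mean()

# Made-up example: two numeric features per record
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_class = np.array([0, 0, 1, 1])               # class labels
y_value = np.array([10.0, 12.0, 50.0, 55.0])   # numeric target

print(knn_predict(X_train, y_class, np.array([1.2, 1.9]), k=3))                  # majority class
print(knn_predict(X_train, y_value, np.array([5.5, 8.5]), k=3, classify=False))  # neighbor average
```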
K-Nearest Neighbors (Distance metrics)
 Similarity (nearness) is determined using a
distance metric, which is a function that
measures how far two records (x1, x2, …, xp)
and (u1, u2, …, up) are from one another.
 The most popular distance metric between two
vectors is Euclidean distance.
 To measure the Euclidean distance between two
vectors, subtract one from the other, square the
differences, sum them, and take the square root:
d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}

 Another common distance metric for numeric data is Manhattan distance:

d(x, u) = |x_1 - u_1| + |x_2 - u_2| + \cdots + |x_p - u_p|
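As a quick check of the two formulas, the sketch below (mine, not from the slides) computes both distances for a pair of made-up vectors, by hand with NumPy and with scipy.spatial.distance.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([3.0, 4.0, 1.0])
u = np.array([0.0, 0.0, 1.0])

# Euclidean: square the differences, sum them, take the square root
euclid = np.sqrt(((x - u) ** 2).sum())
# Manhattan: sum of the absolute differences
manhattan = np.abs(x - u).sum()

print(euclid, manhattan)                                    # 5.0 and 7.0
print(distance.euclidean(x, u), distance.cityblock(x, u))   # same values via SciPy
```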
K-Nearest Neighbors (Choosing k value)
 The choice of K is very important to the performance of KNN.
 The simplest choice is to set K = 1, known as the 1-nearest neighbor classifier.
 If K is too low, we may be overfitting: including the noise in the data.
 Overfitting is an undesirable machine learning behavior that occurs when the machine
learning model gives accurate predictions for training data but not for new data.
 Higher values of K provide smoothing that reduces the risk of overfitting in the
training data.
 If K is too high, we may oversmooth the data and miss out on KNN’s ability to
capture the local structure in the data, one of its main advantages.
 The K that best balances between overfitting and oversmoothing is typically
determined by accuracy metrics and, in particular, accuracy with holdout or validation
data.
 If you are using KNN with an even number of classes (e.g., 2), it is a good idea to choose an odd value of K to avoid a tie,
 and the inverse: use an even number for K when you have an odd number of classes.
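A common way to pick K in practice is to compare accuracy on held-out data across a range of K values. A minimal sketch with scikit-learn; the synthetic dataset and the K range are my own choices, not the lecture's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the loan-default example
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Small K risks overfitting; very large K oversmooths
for k in [1, 3, 5, 11, 21, 51, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_valid, y_valid), 3))   # holdout accuracy for each K
```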
K-Nearest Neighbors (Choosing k value)
 In scikit-learn, you will see that KNeighborsClassifier
implements three different algorithms.
 One is brute force, which is just traversing the whole dataset
as described,
 but you also have BallTree
 and KDTree
 These are data structures that can accelerate the search for
nearest neighbors,
 but they need to be constructed in advance from the data.
So the fitting step here is building the data structure that will
help you find the nearest neighbors.
K-Nearest Neighbors (code)
 The example takes the 20 nearest neighbors.
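The code itself is not reproduced on this page; a minimal sketch of what a 20-neighbor scikit-learn fit typically looks like (the feature matrix and labels are placeholders, not the lecture's loan data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: 100 records, 4 numeric predictors
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

# n_neighbors=20 -> the model votes among the 20 nearest records;
# algorithm can be 'brute', 'ball_tree', 'kd_tree', or 'auto'
knn = KNeighborsClassifier(n_neighbors=20, algorithm='auto')
knn.fit(X_train, y_train)                  # for tree-based search, fit builds the tree
print(knn.predict(np.random.rand(3, 4)))   # predicted classes for 3 new records
```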
K-Nearest Neighbors (code)
 fit_transform() is used on the training data so that we both scale the training data and learn the scaling parameters of that data. Here, the scaler learns the mean and variance of the features of the training set.
 These learned parameters are then used to scale our test data.
 So what is actually happening here?
 The fit method calculates the mean and variance of each of the features present in our data. The transform method transforms all the features using the respective mean and variance.
 Using the transform method, we apply the same mean and variance calculated from our training data to transform our test data.
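A minimal sketch of this fit_transform / transform pattern with scikit-learn's StandardScaler (the arrays are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
# fit_transform: learn mean and variance from the TRAINING data, then scale it
X_train_std = scaler.fit_transform(X_train)
# transform: reuse the training mean/variance to scale the TEST data
X_test_std = scaler.transform(X_test)

print(scaler.mean_, scaler.var_)   # parameters learned from the training set
print(X_train_std)
print(X_test_std)
```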
Tree Models
 A tree model is a set of “if-then-else” rules that are easy to
understand and to implement.
 In contrast to linear and logistic regression, trees have the ability
to discover hidden patterns corresponding to complex
interactions in the data.
 Unlike KNN or naive Bayes, simple tree models can be expressed
in terms of predictor relationships that are easily interpretable.
 Tree models are also called Classification and Regression Trees (CART), decision trees, or just trees.
 Tree models, and their more powerful descendants random
forests and boosted trees, form the basis for the most
widely used and powerful predictive modeling tools in data
science for prediction/classification.
Decision Tree
 A Decision tree is one of the easiest and most popular
classification algorithms used to understand and interpret
data.
 It can be utilized for both classification and regression
problems.
 A decision tree is a flowchart-like tree structure where an
internal node represents a feature (or attribute), the branch
represents a decision rule, and each leaf node represents the
outcome.
 The topmost node in a decision tree is known as the root node.
 It learns to partition on the basis of the attribute value.
It partitions the data in a recursive manner, called recursive partitioning.
Decision Tree

[Figure: example decision tree structure with root, decision, and leaf nodes]
Decision Tree
 A decision tree is a white box type of ML algorithm.
 It shares internal decision-making logic, which is not
available in the black box type of algorithms such as with a
neural network.
 Its training time is faster compared to the neural network
algorithm.
 The time complexity of decision trees is a function of the
number of records and attributes in the given data.
Decision Tree (Algorithm)
 The basic idea behind any decision tree algorithm is as follows:
1. Select the best attribute using Attribute Selection Measures
(ASM) to split the records.
2. Make that attribute a decision node and break the dataset
into smaller subsets.
3. Start tree building by repeating this process recursively for
each child until one of the following conditions is met:
 All the tuples belong to the same attribute value.
 There are no more remaining attributes.
 There are no more instances.
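A compact sketch of this recursive procedure (my own simplified code, assuming numeric features and Gini impurity as the attribute selection measure):

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def grow_tree(X, y, depth=0, max_depth=3, min_size=5):
    """Recursive partitioning: pick the best split, then recurse on each child."""
    # Stopping conditions: pure node, too few records, or maximum depth reached
    if len(np.unique(y)) == 1 or len(y) < min_size or depth == max_depth:
        return {"leaf": np.bincount(y).argmax()}            # majority class
    best = None
    for j in range(X.shape[1]):                             # candidate attribute
        for t in np.unique(X[:, j]):                        # candidate split value
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            # weighted impurity of the two resulting partitions
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:
        return {"leaf": np.bincount(y).argmax()}
    _, j, t, left = best
    return {"feature": j, "threshold": t,
            "left": grow_tree(X[left], y[left], depth + 1, max_depth, min_size),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}

# Tiny made-up example: one feature that separates the two classes at 3.0
X = np.array([[2.0], [3.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
print(grow_tree(X, y, min_size=1))
```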
Decision Tree (Measuring Homogeneity or Impurity)
 Tree models recursively create partitions (sets of records), A,
that predict an outcome of Y = 0 or Y = 1.
 You can see from the preceding algorithm that we need a way
to measure homogeneity, also called class purity, within a
partition.
 Accuracy is not a good measure for impurity.
 Instead, two common measures for impurity are the Gini
impurity and entropy of information.
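For reference, for a partition with class proportions p_1, ..., p_m, Gini impurity is 1 - \sum p_i^2 and entropy is -\sum p_i \log_2 p_i. A quick sketch (my own labels) computing both:

```python
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)      # 0 for a pure node, larger when classes are mixed

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))   # 0 for a pure node, 1 for a 50/50 binary split

mixed = [0, 0, 1, 1, 1, 1]           # 2/6 vs 4/6 split
pure = [1, 1, 1, 1]
print(gini_impurity(mixed), entropy(mixed))   # ~0.444 and ~0.918
print(gini_impurity(pure), entropy(pure))     # both 0 for a pure node
```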
Decision Tree (Attribute Selection Method)
 Attribute selection measure is a heuristic for selecting the
splitting criterion that partitions data in the best possible
manner.
 It is also known as splitting rules because it helps us to
determine breakpoints for tuples on a given node.
 ASM provides a rank to each feature (or attribute) by
explaining the given dataset.
 The best-scoring attribute will be selected as the splitting attribute.
Decision Tree (Attribute Selection Method)
 Entropy is defined as a measure of the randomness or disorder of the information being processed in machine learning.
 In other words, entropy is a metric that measures the unpredictability or impurity in the system.
 If it is easy to draw a valuable conclusion from a piece of information, the entropy is lower; if the entropy is higher, it is difficult to draw any conclusion from that piece of information.
Decision Tree (Attribute Selection Method)
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
 where p_i is the probability that an arbitrary tuple in D belongs to class C_i.
 Info(D) is the average amount of information needed to identify the class label of a tuple in D.

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
 |D_j|/|D| acts as the weight of the jth partition.
 Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.

Gain(A) = Info(D) - Info_A(D)
 The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
 The Gini index can also be used to create split points.
 *For more on these concepts, see the additional Decision Tree slides on elearning.
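As a tiny worked example (my own numbers, not the lecture's), the sketch below computes Info(D), Info_A(D), and Gain(A) for a hypothetical binary attribute A that splits a 10-tuple dataset D into two partitions of 5:

```python
import numpy as np

def info(labels):
    """Info(D): entropy of the class distribution in a partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical dataset D: 6 "yes" (1) and 4 "no" (0)
D = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
# Attribute A splits D into two partitions D1 and D2
D1 = np.array([1, 1, 1, 1, 0])   # mostly yes
D2 = np.array([1, 1, 0, 0, 0])   # mostly no

info_D = info(D)                                                         # Info(D)
info_A = (len(D1) / len(D)) * info(D1) + (len(D2) / len(D)) * info(D2)   # Info_A(D)
gain_A = info_D - info_A                                                 # Gain(A)
print(round(info_D, 3), round(info_A, 3), round(gain_A, 3))              # ~0.971 0.846 0.125
```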
Decision Tree (code)
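The slide's code is not reproduced here; a minimal scikit-learn sketch of fitting and inspecting a decision tree (the iris dataset is my stand-in, not the lecture's data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be 'gini' (default) or 'entropy', matching the impurity measures above
tree = DecisionTreeClassifier(criterion='gini', random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned if-then-else rules
```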
Decision Tree (Stopping the Tree from Growing)
 As the tree grows bigger, the splitting rules become more detailed,
and the tree gradually shifts from identifying “big” rules that identify
real and reliable relationships in the data to “tiny” rules that reflect
only noise.
 A fully grown tree results in completely pure leaves and, hence,
100% accuracy in classifying the data that it is trained on (i.e.,
an overfitted classifier).
 Thus we need to make the tree generalize to fit new data. There are various ways to stop splitting:
 avoid splitting a partition if a resulting subpartition is too small,
 or if a terminal leaf is too small; and don’t split a partition if the new partition does not “significantly” reduce the impurity.
 Max depth should be approximately 5 to 30.
 Min split should be approximately 20 to 100.
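These stopping rules map onto tree hyperparameters; a sketch using scikit-learn's names (max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease), with values in the ranges quoted above and a synthetic dataset of my choosing:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

pruned = DecisionTreeClassifier(
    max_depth=10,                # depth in the suggested 5-30 range
    min_samples_split=50,        # don't split partitions smaller than this (20-100)
    min_samples_leaf=20,         # don't create terminal leaves smaller than this
    min_impurity_decrease=0.001  # require a "significant" impurity reduction
).fit(X, y)

full = DecisionTreeClassifier().fit(X, y)     # fully grown: pure leaves, overfits
print(pruned.get_depth(), full.get_depth())   # the pruned tree is much shallower
```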

Ensemble
 An ensemble averages (or takes majority votes of) the predictions of multiple models; this turns out to be more accurate than just selecting one model.
 It combines multiple algorithms to obtain better predictive performance than could be obtained from any single model.
 The simple version of ensembles is as follows:
 Develop a predictive model and record the predictions for a
given data set.
 Repeat for multiple models on the same data.
 For each record to be predicted, take an average (or a weighted
average), or a majority vote of the predictions.
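A minimal sketch of this simple version, with three different models whose predicted probabilities are averaged (the models and synthetic data are my choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Steps 1-2: fit several models on the same data and record their predictions
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(max_depth=5),
          KNeighborsClassifier(n_neighbors=15)]
probs = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]

# Step 3: average the predictions (a majority vote works similarly for labels)
avg_prob = np.mean(probs, axis=0)
ensemble_pred = (avg_prob >= 0.5).astype(int)
print((ensemble_pred == y_test).mean())   # accuracy of the averaged ensemble
```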
Model Error and Reducing this Error with Ensembles
 There is no predefined number of models to consider, and some business goals may require more models than others.
 Model error can be decomposed as: Bias + Variance + Irreducible error.
 Bias error
 This is useful to quantify how much, on average, the predicted
values are different from the actual value.
 A high bias error means we have an underperforming model that
keeps missing essential trends.
Model Error and Reducing this Error with Ensembles

 Variance
 On the other side, Variance quantifies how the
predictions made on the same observation differ.
 A high variance model will over-fit on your training
population and perform poorly on any observation
beyond training.
 The following diagram will give you more clarity (assume
that the red spot is the real value, and the blue dots are
predictions):
[Diagram omitted: red spot = real value, blue dots = predictions]
Model Error and Reducing this Error with Ensembles

 Typically, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model.
 However, this only happens until a particular
point.
 As you continue to make your model more
complex, you end up over-fitting your model,
and hence your model will start suffering from high variance.
 Here we discuss two main types of ensembles:
 Bagging
 Boosting
Bagging

 Bagging, which stands for “bootstrap aggregating,” was introduced by Leo Breiman in 1994.
 Suppose we have a response Y and P predictor variables X =
X1, X2, ⋯, XP with N records.
 Bagging is like the basic algorithm for ensembles, except that,
instead of fitting the various models to the same data, each
new model is fitted to a bootstrap resample (with
replacement).
 This technique is similar to random forest, but it uses all the
predictors, whereas random forest uses only a subset of
predictors in each tree.
Bagging

 You can use any model as a classifier with bagging.
Bagging

 Here is the algorithm presented more formally (see the sketch below):
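The formal algorithm is not reproduced on the slide; the sketch below implements the standard bagging loop by hand (take a bootstrap resample with replacement, fit a model on it, repeat M times, average the predictions), with a decision tree as the base classifier, and then shows scikit-learn's BaggingClassifier doing the same thing. The values of M, n, and the data are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

rng = np.random.default_rng(3)
M, n = 25, len(X_train)            # M models, n records per bootstrap resample
probs = np.zeros(len(X_test))
for m in range(M):
    # Bootstrap resample: n records drawn WITH replacement from the training data
    idx = rng.integers(0, len(X_train), size=n)
    model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    probs += model.predict_proba(X_test)[:, 1]
probs /= M                          # bagged estimate: average of the M models
print(((probs >= 0.5).astype(int) == y_test).mean())

# Equivalent with scikit-learn (any base estimator can be plugged in)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=3)
print(bag.fit(X_train, y_train).score(X_test, y_test))
```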
Random Forest (Bagging+DT+Random attribute selection)

 Random forests are an ensemble method, meaning they combine predictions from other models.
 The random forest is based on applying bagging to decision trees, with
one important extension:
 In addition to sampling the records, the algorithm also samples the
variables.
 With random forests, at each stage of the algorithm, the choice of variable
is limited to a random subset of variables.
 Compared to the basic tree algorithm, the random forest algorithm adds
two more steps:
the bagging, and the bootstrap sampling of variables at each split.
 How many variables to sample at each step? A rule of thumb is to choose \sqrt{P}, where P is the number of predictor variables.
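A minimal scikit-learn sketch; max_features='sqrt' implements the \sqrt{P} rule for the number of variables sampled at each split (the data and other settings are mine):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=25, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bagged trees
    max_features='sqrt',   # sample sqrt(P) predictors at each split
    random_state=4)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))
print(rf.feature_importances_[:5])   # variable importance scores
```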
Random Forest (Bagging+DT+Random attribute selection)

 Imagine you have a complex problem to solve, and you gather a group of experts from different fields to provide their input. Each expert provides their opinion based on their expertise and experience. Then, the experts vote to arrive at a final decision.
 In a random forest classification, multiple decision trees are
created using different random subsets of the data and
features. Each decision tree is like an expert, providing its
opinion on how to classify the data. Predictions are made by
calculating the prediction for each decision tree, then taking the
most popular result. (For regression, predictions use an
averaging technique instead.)
Random Forest (Bagging+DT+Random attribute selection)

 In the diagram below, we have a random forest with n decision trees, and
we’ve shown the first 5, along with their predictions (either “Dog” or “Cat”).
Each tree is exposed to a different subset of features and a different sample of the original dataset, and as such, every tree can be different.
Each tree makes a prediction. Looking at the first 5 trees, we can see that
4/5 predicted the sample was a Cat. The green circles indicate a
hypothetical path the tree took to reach its decision. The random forest
would count the number of predictions from decision trees for Cat and for
Dog, and choose the most popular prediction.
[Diagram omitted: n decision trees each predicting “Dog” or “Cat”; the majority vote decides]
Random Forest (Bagging+DT+Random attribute selection)


For code see elearning


Boosting

 Boosting is a general technique to create an ensemble of models.


 Like bagging, boosting is most commonly used with decision trees.
 Whereas bagging can be done with relatively little tuning, boosting requires much greater care in its application.
 Boosting fits a series of models, in which each successive model seeks
to minimize the error of the previous model.
 Several variants of the algorithm are commonly used: Adaboost, gradient
boosting, and stochastic gradient boosting.
Boosting

[Figure: illustration of the boosting process, omitted]

 You continue until all observations are correctly classified.
Types of Boosting

 AdaBoost
 Gradient Boosting
 XGBoost
 AdaBoost
 AdaBoost is a boosting algorithm that works on the principle of stagewise addition, where multiple weak learners are combined to obtain a strong learner. Unlike gradient boosting and XGBoost, an alpha parameter (the weight of each weak learner) is calculated from the errors of that weak learner; the value of alpha is inversely proportional to the error of the weak learner.
 Gradient Boosting
 In the gradient boosting algorithm, the first weak learner is not trained on the dataset; it simply returns the mean of the target column. The residuals of the first weak learner's output are then calculated and used as the target column for the next weak learner to be trained.
 Following the same pattern, the second weak learner is trained and its residuals are calculated and used as the target column for the next weak learner; this process continues until we reach (approximately) zero residuals.
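To make the residual-fitting idea concrete, here is a small hand-rolled sketch (my own code: regression with squared error, two boosting rounds with tree stumps), followed by scikit-learn's packaged gradient boosting estimator. It illustrates the description above rather than any particular library's internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Round 0: the first "weak learner" just predicts the mean of the target
pred = np.full_like(y, y.mean())

# Each subsequent weak learner is trained on the residuals of the current prediction
learners = []
for _ in range(2):
    residuals = y - pred
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    learners.append(stump)
    pred += stump.predict(X)          # add the new learner's correction

print(np.mean((y - pred) ** 2))       # the residual error shrinks with each round

# The same idea, packaged: scikit-learn's gradient boosting
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1).fit(X, y)
print(np.mean((y - gbr.predict(X)) ** 2))
```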
Types of Boosting

 XGBoost
 XGBoost is also a boosting machine learning algorithm; it is the next version built on top of the gradient boosting algorithm.
 The full name of the XGBoost algorithm is eXtreme Gradient Boosting; as the name suggests, it is an extreme version of the previous gradient boosting algorithm.
 The main difference between gradient boosting and XGBoost is that XGBoost uses a regularization technique; in simple words, it is a regularized form of the existing gradient-boosting algorithm.
 Because of this, XGBoost often performs better than a plain gradient boosting algorithm, and its optimized implementation also makes it much faster. It also performs well when the dataset contains a mix of numerical and categorical features.
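A minimal sketch with the xgboost package's scikit-learn wrapper, highlighting the regularization hyperparameters (reg_lambda for the L2 penalty and reg_alpha for the L1 penalty on leaf weights). The data and hyperparameter values are illustrative, and the example assumes xgboost is installed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # pip install xgboost

X, y = make_classification(n_samples=2000, n_features=20, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,          # stochastic gradient boosting: row subsampling
    reg_lambda=1.0,         # L2 regularization term (XGBoost's extra ingredient)
    reg_alpha=0.0,          # L1 regularization term
    eval_metric='logloss')
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))    # accuracy on held-out data
```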
Single Classifier/Bagging/Boosting
[Figure: comparison of a single classifier, bagging, and boosting]
Summary

 We learned different classification methods of machine learning.
 We learned the KNN method.
 We learned the concept of decision trees.
 We learned boosting.
 We learned bagging.