U02Lecture08 Statistical Machine Learning
Unit 02 Lecture 08
K-Nearest Neighbors (Distance metrics)
Similarity (nearness) is determined using a
distance metric, which is a function that
measures how far two records (x1, x2, …, xp)
and (u1, u2, …, up) are from one another.
The most popular distance metric between two
vectors is Euclidean distance.
To measure the Euclidean distance between two
vectors, subtract one from the other, square the
differences, sum them, and take the square root:
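In standard notation, the Euclidean distance between the two records is:
d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}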
K-Nearest Neighbors (Choosing k value)
The choice of K is very important to the performance of KNN.
The simplest choice is to set K = 1, known as the 1-nearest neighbor classifier.
If K is too low, we may be overfitting: including the noise in the data.
Overfitting is an undesirable machine learning behavior that occurs when the machine
learning model gives accurate predictions for training data but not for new data.
Higher values of K provide smoothing that reduces the risk of overfitting in the
training data.
If K is too high, we may oversmooth the data and miss out on KNN's ability to
capture the local structure in the data, one of its main advantages.
The K that best balances overfitting and oversmoothing is typically
determined by accuracy metrics and, in particular, accuracy on holdout or validation
data (see the sketch below).
If you have an even number of classes (e.g., 2), it is a good idea to choose an odd
value of K to avoid ties, and, conversely, an even K when the number of classes is odd.
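A minimal sketch of choosing K with holdout (validation) accuracy, assuming scikit-learn and an illustrative dataset (the candidate K values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data; any labeled (X, y) classification dataset would do.
X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Evaluate several candidate K values on the holdout (validation) split.
for k in [1, 3, 5, 7, 9, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_valid, y_valid))  # higher holdout accuracy = better balance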
K-Nearest Neighbors (scikit-learn algorithms)
In scikit-learn, you will see that KNeighborsClassifier
implements three different algorithms.
One is brute force, which simply traverses the whole dataset
as described, but there are also BallTree and KDTree.
These are data structures that can accelerate the search for
nearest neighbors, but they need to be constructed in advance from the data.
So the fitting step here is building the data structure that will
help you find the nearest neighbors.
K-Nearest Neighbors (code)
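A minimal sketch of fitting and querying a KNN classifier in scikit-learn, including the algorithm parameter discussed above (the dataset and settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# algorithm can be 'brute', 'ball_tree', 'kd_tree', or 'auto' (the default);
# fitting builds the chosen search structure. metric='euclidean' is the
# Euclidean distance described earlier.
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)
print(knn.predict(X_test[:5]))    # class labels of the first 5 test records
print(knn.score(X_test, y_test))  # test accuracy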
Tree Models
A tree model is a set of “if-then-else” rules that are easy to
understand and to implement.
In contrast to linear and logistic regression, trees have the ability
to discover hidden patterns corresponding to complex
interactions in the data.
Unlike KNN or naive Bayes, simple tree models can be expressed
in terms of predictor relationships that are easily interpretable.
Tree models are also called Classification and Regression Trees
(CART), decision trees, or just trees.
Tree models, and their more powerful descendants random
forests and boosted trees, form the basis of the most
widely used and powerful predictive modeling tools in data
science for prediction and classification.
Decision Tree
A decision tree is one of the easiest and most popular
classification algorithms, and it is straightforward to understand
and interpret.
It can be utilized for both classification and regression
problems.
A decision tree is a flowchart-like tree structure where an
internal node represents a feature (or attribute), a branch
represents a decision rule, and each leaf node represents the
outcome.
The topmost node in a decision tree is known as the root node.
It learns to partition the data on the basis of attribute values.
It partitions the data in a recursive manner, a process called
recursive partitioning.
Decision Tree
A decision tree is a white box type of ML algorithm.
It exposes its internal decision-making logic, which is not
available in black-box algorithms such as neural networks.
Its training time is faster than that of neural network
algorithms.
The time complexity of decision trees is a function of the
number of records and attributes in the given data.
Decision Tree (Algorithm)
The basic idea behind any decision tree algorithm is as follows:
1. Select the best attribute using Attribute Selection Measures
(ASM) to split the records.
2. Make that attribute a decision node and break the dataset
into smaller subsets.
3. Build the tree by repeating this process recursively for
each child until one of the following conditions is met (see the sketch after this list):
All the tuples belong to the same class.
There are no more remaining attributes.
There are no more instances.
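A compact Python sketch of this recursive procedure, using information gain as the attribute selection measure (the toy weather data and the helper names are purely illustrative):

from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # The ASM step: pick the attribute with the highest information gain.
    base = entropy(labels)
    def gain(attr):
        info_a = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            info_a += len(subset) / len(labels) * entropy(subset)
        return base - info_a
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes, parent_majority=None):
    if not rows:                               # no more instances
        return parent_majority
    if len(set(labels)) == 1:                  # all tuples belong to the same class
        return labels[0]
    if not attributes:                         # no more remaining attributes
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)    # step 1: select the best attribute
    majority = Counter(labels).most_common(1)[0][0]
    node = {attr: {}}                                  # step 2: the attribute becomes a decision node
    remaining = [a for a in attributes if a != attr]
    for value in set(row[attr] for row in rows):       # step 3: recurse on each subset
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        node[attr][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       remaining, majority)
    return node

# Toy example: predict whether to play based on weather attributes.
rows = [{"outlook": "sunny", "windy": False}, {"outlook": "rainy", "windy": True},
        {"outlook": "sunny", "windy": True}, {"outlook": "overcast", "windy": False}]
labels = ["yes", "no", "no", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))  # splits on 'windy' in this toy case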
Decision Tree (Measuring Homogeneity or Impurity)
Tree models recursively create partitions (sets of records), A,
that predict an outcome of Y = 0 or Y = 1.
You can see from the preceding algorithm that we need a way
to measure homogeneity, also called class purity, within a
partition.
Accuracy is not a good measure for impurity.
Instead, two common measures for impurity are the Gini
impurity and entropy of information.
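For a two-class partition A, with p the proportion of records belonging to one class, the two measures can be written as:
Gini impurity: I(A) = p(1 - p)
Entropy: I(A) = -p \log_2(p) - (1 - p) \log_2(1 - p)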
Decision Tree (Attribute Selection Method)
Attribute selection measure is a heuristic for selecting the
splitting criterion that partitions data in the best possible
manner.
It is also known as splitting rules because it helps us to
determine breakpoints for tuples on a given node.
ASM assigns a rank to each feature (or attribute) according to
how well it explains the given dataset.
The attribute with the best score is selected as the splitting
attribute.
Decision Tree (Attribute Selection Method)
Entropy is defined as the randomness, or a measure of the
disorder, of the information being processed in machine
learning.
In other words, entropy is a metric that measures the
unpredictability or impurity in the system.
The easier it is to draw a valuable conclusion from a piece of
information, the lower its entropy; if entropy is higher, it is
difficult to draw any conclusion from that piece of information.
Decision Tree (Attribute Selection Method)
Info(D) is the average amount of information needed to identify the class label of a tuple in
D, where Pi is the probability that an arbitrary tuple in D belongs to class Ci.
InfoA(D) is the expected information required to classify a tuple from D based on the
partitioning by attribute A; |Dj|/|D| acts as the weight of the jth partition.
The attribute A with the highest information gain, Gain(A), is chosen as the splitting
attribute at node N.
The Gini index provides an alternative method to create split points (see the formulas below).
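The formulas these definitions refer to (standard entropy-based selection for m classes and v partitions, plus the Gini index) are:
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2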
*For more concepts, see the additional Decision Tree slides on eLearning.
Decision Tree (code)
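A minimal scikit-learn sketch of a decision tree classifier (illustrative dataset; criterion can be 'gini' or 'entropy', matching the impurity measures above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # holdout accuracy
print(export_text(tree))           # the learned if-then-else rules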
Decision Tree (Stopping the Tree from Growing)
As the tree grows bigger, the splitting rules become more detailed,
and the tree gradually shifts from identifying “big” rules that identify
real and reliable relationships in the data to “tiny” rules that reflect
only noise.
A fully grown tree results in completely pure leaves and, hence,
100% accuracy in classifying the data that it is trained on (i.e.,
overfitted classifier).
Thus we need to make the tree generalize so that it fits new data. There are
various ways to stop splitting (sketched below):
Avoid splitting a partition if a resulting subpartition or terminal leaf would be too small.
Don't split a partition if the new partition does not "significantly" reduce the impurity.
Limit the maximum depth (approximately 5 to 30 is typical).
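A minimal sketch of these stopping rules as scikit-learn hyperparameters (the values are illustrative, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one stopping rule above:
#   min_samples_split      - don't split a partition that is too small
#   min_samples_leaf       - don't create terminal leaves that are too small
#   min_impurity_decrease  - don't split unless impurity drops "significantly"
#   max_depth              - cap the overall depth of the tree
tree = DecisionTreeClassifier(
    min_samples_split=20,
    min_samples_leaf=10,
    min_impurity_decrease=0.01,
    max_depth=10,
)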
Model Error and Reducing this Error with Ensembles
Bias error
This quantifies how much, on average, the predicted values
differ from the actual values.
A high bias error means we have an underperforming
model that keeps missing essential trends.
Variance
On the other side, variance quantifies how much the
predictions made on the same observation differ from one another.
A high-variance model will overfit your training
population and perform poorly on any observation
beyond the training data.
The following diagram will give you more clarity (assume
that the red spot is the real value, and the blue dots are
predictions):
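Beyond the diagram, the same idea is often written as the bias-variance decomposition of the expected squared prediction error (a standard result, stated here for reference):
E\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2
where \sigma^2 is the irreducible error. Ensembles such as bagging and boosting primarily reduce the variance and bias terms, respectively.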
Bagging
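As a minimal sketch: bagging (bootstrap aggregating) trains many models on bootstrap samples of the data and combines their predictions by voting or averaging. Assuming scikit-learn's BaggingClassifier and an illustrative dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 100 base models (decision trees by default), each trained on a bootstrap
# sample of the rows; predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=100, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())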
Random Forest (Bagging+DT+Random attribute selection)
In the diagram below, we have a random forest with n decision trees, and
we’ve shown the first 5, along with their predictions (either “Dog” or “Cat”).
Each tree is exposed to a different random subset of the features and a different
sample of the original dataset, and as such, every tree can be different.
Each tree makes a prediction. Looking at the first 5 trees, we can see that
4/5 predicted the sample was a Cat. The green circles indicate a
hypothetical path the tree took to reach its decision. The random forest
would count the number of predictions from decision trees for Cat and for
Dog, and choose the most popular prediction.
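A minimal scikit-learn sketch of the same majority-vote idea (illustrative data; max_features controls the random attribute selection at each split):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 trees sees a bootstrap sample of the rows and considers a
# random subset of the features ('sqrt' of the total) at every split; the
# forest predicts by majority vote across the trees.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))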
Boosting
Types of Boosting
AdaBoost
Gradient Boosting
XGBoost
AdaBoost
AdaBoost is a boosting algorithm that works on the principle of stagewise
addition, where multiple weak learners are combined to obtain a strong learner.
Unlike gradient boosting and XGBoost, the alpha parameter calculated here is tied to the
error of each weak learner: the value of alpha is inversely proportional to the error of
the weak learner.
Gradient Boosting
In the gradient boosting algorithm, the first weak learner is not trained on the
dataset; it simply returns the mean of the target column. The residuals of the first
weak learner's output are then calculated and used as the output (target) column for
the next weak learner to be trained.
Following the same pattern, the second weak learner is trained and its residuals are
calculated, which are again used as the output column for the next weak learner;
this process continues until the residuals approach zero.
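A minimal sketch of both algorithms using scikit-learn's implementations (illustrative data and hyperparameters):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost: stagewise addition of weak learners (decision stumps by default),
# each weighted by an alpha that shrinks as its error grows.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

# Gradient boosting: each new tree is fit to the residuals of the current model.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

for name, model in [("AdaBoost", ada), ("GradientBoosting", gb)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())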
Types of Boosting
XGBoost
XGBoost is also a boosting machine learning algorithm; it is the next version built on top
of the gradient boosting algorithm.
The full name of the XGBoost algorithm is the eXtreme Gradient Boosting algorithm; as
the name suggests, it is an extreme version of the earlier gradient boosting algorithm.
Due to this, XGBoost typically performs better than a plain gradient boosting algorithm
and is also much faster. It also tends to perform well when the dataset contains both
numerical and categorical features.
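A minimal sketch using the xgboost library's scikit-learn-style wrapper (this assumes the xgboost package is installed; data and settings are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# XGBoost adds regularization and fast, parallelized tree construction
# on top of the gradient boosting idea.
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))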
Single Classifier/Bagging/Boosting
Summary