
UE20EC352-Machine Learning &

Applications
Unit 3 - Non Parametric Supervised
Learning

Slides prepared by:


Dr. Arpita Thakre
Department of ECE, PESU.

Acknowledgement: Examples and figures taken from text book and reference books
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ In supervised learning, whether regression or classification, we use N training examples/samples to learn a mapping between d attributes/features/factors x = [x1, x2, x3, …, xd]^T and the response/target/label r (y is another commonly used symbol for the response).
⚫ In parametric supervised learning we assume at the outset that this mapping can be completely described in terms of a finite number of parameters θ0, θ1, θ2, … (w, β are other commonly used symbols for parameters).
⚫ We then use the N training examples to learn the parameters θ0, θ1, θ2, …
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ Recall that this is exactly what we did in linear regression, logistic regression, and discriminant analysis (during Unit 2).
⚫ In non parametric supervised learning, we do not make such an assumption to begin with.
⚫ We try to build a model using the N training samples – a model that exploits the “similarity” between the N training examples.
⚫ Non parametric supervised learning consists of finding similar past instances from the training set using a “suitable distance measure” and interpolating from them to find the right output for the test data.
MACHINE LEARNING
What is Non Parametric Supervised Learning

Let us consider the univariate regression training set given below. In parametric regression, we are asked to learn a mapping between x and r. Suppose we assume that this mapping can be described by a straight line, i.e., r = θ0 + θ1x. We then learn θ0, θ1 from the N training examples.

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6
MACHINE LEARNING
What is Non Parametric Supervised Learning
⚫ Prediction: for any new x (outside the training set) given to us, we predict the response to be θ0 + θ1x.
⚫ Note that the number of parameters, which is 2 here, does not depend on N.
⚫ In a parametric model, all of the N training instances affect the final global model θ0 + θ1x.
⚫ In the nonparametric case, however, there is no single global model; local models are estimated as they are needed, affected only by the nearby training instances.
⚫ In nonparametric supervised learning we do end up calculating a large number of parameters, and the number of parameters grows with the number of training instances N.
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ Non parametric learning does not mean there are no parameters to be estimated.
⚫ Also called instance-based learning, memory-based learning, or lazy learning.
⚫ Nonparametric learning needs the training instances/examples to be stored – space complexity grows as O(N).
⚫ Nonparametric learning needs the similarity between a test example and the N training instances to be computed – time complexity grows as O(N).
⚫ The time complexity of parametric learning techniques grows as O(d) or O(d²).
MACHINE LEARNING
When is Non Parametric Supervised Learning preferred over Parametric Learning

⚫ Can give more accurate predictions than parametric techniques, when the assumed parametric form does not match the true mapping.
⚫ When we suspect the presence of outliers in the training set.
⚫ When we need to do anomaly detection.
⚫ When we are looking to build a mapping that can be easily interpreted and understood by someone not skilled in ML.
MACHINE LEARNING
Disadvantages of Non Parametric Supervised Learning

⚫ More training data is needed compared to parametric techniques
⚫ Slower compared to parametric techniques, since all of the computation is deferred to prediction time
⚫ Susceptible to overfitting
MACHINE LEARNING
What we will learn in this Unit

⚫ Density Estimation
Histogram
Naïve Estimator
Estimation using Gaussian Kernel
K nearest neighbour density estimator
⚫ Nonparametric Regression
Regressogram
Running Mean Smoother
Kernel Smoother with Gaussian Kernel
LOESS
MACHINE LEARNING
What we will learn in this Unit

⚫ Nonparametric Classification
Discriminant Function based
Distance Measure based
K Nearest Neighbour
⚫Condensed Nearest Neighbour
⚫ Decision Tree
Classification
Regression
MACHINE LEARNING
Density Estimation - Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3

• The nonparametric estimate for the density function, which is the derivative of the cumulative distribution, can be calculated from the bin counts as

\hat{p}(x) = \frac{\#\{\, x^t \text{ in the same bin as } x \,\}}{Nh}

where h is the bin width and N is the number of training samples.
MACHINE LEARNING
Density Estimation - Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Choose a beginning point (origin), say x = 0
⚫ Count the number of samples in the bins (0, 6] and (6, 12]
⚫ These counts are 5 and 4
⚫ The histogram estimate is 5/54 in the range (0, 6] and 4/54 in the range (6, 12], since Nh = 9 × 6 = 54 (a small code sketch follows)
⚫ The area under the curve has to be 1: 6 × 5/54 + 6 × 4/54 = 1
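A minimal Python sketch (not from the slides) of this histogram estimate, reproducing the two values above; the sample, h = 6 and the origin 0 are taken from the worked example:

import numpy as np

X = np.array([0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3])

def histogram_density(x, X, h=6.0, origin=0.0):
    k = np.ceil((x - origin) / h) - 1            # index of the bin (origin + k*h, origin + (k+1)*h]
    lo, hi = origin + k * h, origin + (k + 1) * h
    count = np.sum((X > lo) & (X <= hi))         # samples falling in the same bin as x
    return count / (len(X) * h)

print(histogram_density(3.0, X))   # 5/54 ~ 0.0926 for x in (0, 6]
print(histogram_density(8.0, X))   # 4/54 ~ 0.0741 for x in (6, 12]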
MACHINE LEARNING
Histogram
MACHINE LEARNING
Disadvantage of Histogram
MACHINE LEARNING
Disadvantage of Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3

⚫ The histogram depends on the choice of the beginning point (origin)
⚫ There are jumps in the histogram at the bin boundaries
MACHINE LEARNING
Density Estimation – Naïve Estimator

The naive estimator frees us from having to set an origin: to remove the dependence on the bin boundaries, a window of width h is centred at the point x where the density is estimated.

\hat{p}(x) = \frac{\#\{\, x - h/2 < x^t \le x + h/2 \,\}}{Nh}

Equivalently, in terms of a weight (window) function w(·),

\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}
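A minimal Python sketch (not from the slides) of the naive estimator defined above:

import numpy as np

def naive_estimator(x, X, h):
    X = np.asarray(X, dtype=float)
    count = np.sum((X > x - h / 2) & (X <= x + h / 2))   # samples inside the window of width h centred at x
    return count / (len(X) * h)

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
print(naive_estimator(2.0, X, h=6))   # 5 samples fall in (-1, 5], so 5/54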
MACHINE LEARNING
Density Estimation – Naïve Estimator

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Partition the real line (x takes real values) into the segments (x1 − h/2, x1 + h/2], (x2 − h/2, x2 + h/2], …, (x9 − h/2, x9 + h/2]
⚫ Count the number of samples in these segments
⚫ These counts are 5, 5, 5, 5, 5, 3, 3, 3, 1
⚫ The area under the curve has to be 1
⚫ The density is 5/57.1 in the range (−2.6, 5.4], 3/57.1 in the range (5.4, 10], and 1/57.1 in the range (10, 13.3]
MACHINE LEARNING
Naïve Estimator
Naive estimator:
MACHINE LEARNING
Disadvantage of Naïve Estimator

The naive estimator still has jumps: the estimate is discontinuous at the points x^t ± h/2.

The smaller the h, the larger the number of jumps and the more jagged the estimated density.
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel

To get a smooth estimate, we use a smooth weight function, called a kernel function.

• Kernel function, e.g., the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)

• Kernel estimator (Parzen windows):

\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel

o The kernel function K(·) determines the shape of the influence of each sample, and the window width h determines its width.

o All the x^t have an effect on the density estimate at x, and this effect decreases smoothly as |x − x^t| increases.

o To simplify calculation, K(·) can be taken to be 0 for |x − x^t| > 3h (this follows from the properties of the Gaussian distribution, which has negligible mass beyond three standard deviations)
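A minimal Python sketch (not from the slides) of the Gaussian-kernel (Parzen window) estimator above:

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kernel_density(x, X, h):
    X = np.asarray(X, dtype=float)
    return gaussian_kernel((x - X) / h).sum() / (len(X) * h)

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
for x in (1.0, 5.0, 8.0):
    print(x, kernel_density(x, X, h=1.0))   # smooth estimate, no jumps at bin boundaries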
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator

• The degree of smoothing is controlled by k, number of nearest


neighbours

• k is chosen much smaller than N

• Instead of fixing bin width h and counting the number of training


instances in the bin, (as we did in histogram, Naive estimator,
Smoothing with Gaussian Kernel), we fix k, the number of
observations to fall in the bin, and compute the bin size.
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator

• Estimated density: \hat{p}(x) = \dfrac{k}{2\,N\,d_k(x)}

• This estimate integrates to ∞, not to 1, so it is not a proper density.

• N = total number of training samples

• Choose k; suppose k = 3

• d_k(x) = distance from x to its k-th nearest neighbour, i.e., to the farthest of its k nearest neighbours
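A minimal Python sketch (not from the slides) of this k-NN density estimate:

import numpy as np

def knn_density(x, X, k):
    X = np.asarray(X, dtype=float)
    d = np.sort(np.abs(X - x))            # distances from x to all training samples
    return k / (2 * len(X) * d[k - 1])    # d[k-1] is d_k(x), the k-th nearest distance

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
print(knn_density(2.0, X, k=3))   # d_3(2.0) = 1.2, so 3 / (2 * 9 * 1.2)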
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator
MACHINE LEARNING
Density Estimation

To get a smoother estimate, we can use a kernel function whose effect decreases with increasing distance. This is like a kernel estimator with an adaptive smoothing parameter h = d_k(x). K(·) is typically taken to be the Gaussian kernel.

Note: K is used for the number of classes in classification, K(·) is used here to denote the kernel, and k appears again in k-NN. They have different meanings, so interpret k/K as per the problem/context.
MACHINE LEARNING
Non Parametric Regression – Regressogram

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6
MACHINE LEARNING
Non Parametric Regression – Regressogram
MACHINE LEARNING
Non Parametric Regression – Regressogram

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Choose a beginning point, say x = 0
⚫ Count the number of samples in the bins (0, 6] and (6, 12]
⚫ These counts are 10 and 2
⚫ The estimated response/output is the average of the responses in each bin: (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2+2.4+1.8)/10 = 1.28 in the range (0, 6], and (1.2+2.6)/2 = 1.9 in the range (6, 12] (verified in the sketch below)
⚫ The higher the h, the smoother the curve and the lower the variance
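A minimal Python sketch (not from the slides) of the regressogram: the prediction at a query point is the average response of the training samples that fall in the same bin.

import numpy as np

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])

def regressogram(xq, x, r, h=6.0, origin=0.0):
    k = np.ceil((xq - origin) / h) - 1                            # bin index of the query point
    in_bin = (x > origin + k * h) & (x <= origin + (k + 1) * h)
    return r[in_bin].mean()                                       # average response in that bin

print(regressogram(3.0, x, r))   # mean of the 10 responses in (0, 6]  -> 1.28
print(regressogram(7.0, x, r))   # mean of the  2 responses in (6, 12] -> 1.9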
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Partition the real line (x takes real values) into the segments (x1 − h, x1 + h], (x2 − h, x2 + h], …, (x12 − h, x12 + h]
⚫ Count the number of samples in these segments
⚫ These counts are 8, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 9
⚫ The estimated response/output is (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2)/8 = 1.075 in the range (−5.5, 6.5], (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2+2.4+1.8+1.2)/11 = 1.27 in the range (6.5, 6.9], 1.383 in the range (6.9, 12.3], and 1.300 in the range (12.3, 13.3]
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother

⚫ The mapping between attributes and response is described by 4


quantities, the smoothed responses in the ranges (-5.5, 6.5],
(6.5,6.9], (6.9,12.3], (12.3,13.3].
⚫ For reduced h, number of bins and number of computations
would increase
⚫ If instead of 12, we had 18 training instances, then even for h =
6, number of bins and number of computations would have
been higher.
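A minimal Python sketch of a running mean smoother (an assumption about the window convention: the estimate at a query point xq is the average response of the training samples in (xq − h, xq + h]; the slide's own bin arithmetic may use a slightly different convention):

import numpy as np

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])

def running_mean(xq, x, r, h=6.0):
    in_window = (x > xq - h) & (x <= xq + h)   # training samples within h of the query point
    return r[in_window].mean()

print(running_mean(0.5, x, r))   # average response over the window (-5.5, 6.5]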
MACHINE LEARNING
Non Parametric Regression – Kernel Smoother with Gaussian Kernel

K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)
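The smoother itself did not survive extraction; the standard Gaussian kernel smoother (a Nadaraya–Watson weighted average of the responses, which is presumably what this slide shows) is

\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}

and a minimal Python sketch of it is:

import numpy as np

def kernel_smoother(xq, x, r, h=1.0):
    w = np.exp(-0.5 * ((xq - x) / h) ** 2)   # Gaussian weights; the 1/sqrt(2*pi) factor cancels
    return np.sum(w * r) / np.sum(w)

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])
print(kernel_smoother(3.0, x, r, h=1.0))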
MACHINE LEARNING
Non Parametric Regression – Kernel Smoother with Gaussian Kernel
MACHINE LEARNING
Non Parametric Regression – Running Line Smoother - LOESS

⚫ Instead of taking an average of the responses over a bin of


width h, and giving a constant fit at a point, we can take into
account responses of neighbouring points to fit a line.
⚫ k = no of neighbours (we use symbol K for number of classes
when we deal with classification problem)
⚫ For a given training example, say, xi, find its nearest k
neighbours, find intercept and slope of a straight line that best
fits the xi and its k nearest neighbours.
⚫ We will end up representing the mapping between attributes and response in terms of multiple local regression lines, as shown in the next slide; a small code sketch is given below.
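A minimal Python sketch in the spirit of LOESS (an assumption, not the textbook's exact algorithm): fit a weighted straight line through the k nearest neighbours of the query point, using hypothetical tricube weights, and evaluate that line at the query point.

import numpy as np

def loess_point(xq, x, r, k=5):
    idx = np.argsort(np.abs(x - xq))[:k]                     # k nearest neighbours of xq
    xn, rn = x[idx], r[idx]
    d = np.abs(xn - xq).max() + 1e-12                        # neighbourhood radius
    w = (1 - np.clip(np.abs(xn - xq) / d, 0, 1) ** 3) ** 3   # tricube weights
    sw = np.sqrt(w)
    A = np.column_stack([np.ones_like(xn), xn])              # design matrix for a line [1, x]
    theta, *_ = np.linalg.lstsq(A * sw[:, None], rn * sw, rcond=None)   # weighted least squares
    return theta[0] + theta[1] * xq                          # local line evaluated at xq

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])
print(loess_point(3.0, x, r, k=5))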
MACHINE LEARNING
LOESS – Locally Weighted Least Squares Regression
MACHINE LEARNING
Multivariate Data

⚫ The non parametric density estimation and regression examples discussed till now assumed the number of attributes to be 1 (d = 1), i.e., all were univariate examples.
⚫ All the techniques described so far can be extended to the multivariate case, i.e., when the number of attributes ≥ 2.
MACHINE LEARNING
Multivariate Data

⚫ For example, the multivariate kernel density estimator for a sample of d-dimensional observations X = {x^t} is

\hat{p}(x) = \frac{1}{Nh^{d}}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

⚫ with the multivariate Gaussian kernel

K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left(-\frac{\lVert u \rVert^{2}}{2}\right)
MACHINE LEARNING
Multivariate Data

⚫ When we discuss non parametric classification and decision


tree, we will take univariate as well as multivariate examples.
MACHINE LEARNING
Curse of Dimensionality

⚫ However, care should be taken when using nonparametric estimates in high-dimensional attribute/feature spaces because of the curse of dimensionality:
⚫ Suppose x is eight-dimensional (d = 8) and we use a histogram with ten bins per dimension; then there are 10^8 bins, and unless we have lots of data, most of these bins will be empty and the estimates there will be 0
MACHINE LEARNING
Choice of h or k
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based

⚫ Discriminant functions, as we learnt in parametric classification, are used to partition the feature space into K disjoint volumes/regions, where K is the number of classes (a standard form is given below).
⚫ Here P(Ci) is the prior probability of class Ci, which can be estimated from the training samples if needed.
⚫ p(x|Ci) is the class-conditional density. We assumed it to be Gaussian in Unit 2, and then estimated the means and covariance matrices from the training samples.
⚫ Here i goes from 1 to K.
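The discriminant expression itself did not survive extraction; a standard Bayes'-rule-based form consistent with the description above (an assumption) is

g_i(x) = \hat{p}(x \mid C_i)\,\hat{P}(C_i), \qquad \text{choose } C_i \text{ if } g_i(x) = \max_{j} g_j(x)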
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based

⚫ In non parametric classification, we cannot assume any


distribution for the class conditional densities.

⚫ We have to estimate the class conditional densities from


training instances/examples/samples using nonparametric
density estimators, for e.g., Naïve estimator, estimation with
Gaussian Kernel, etc.
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour

⚫ If we use the k-NN density estimator for estimating the class-conditional densities in discriminant-function-based classification, then the class-conditional density and the posterior probability of class Ci take the simple forms given below,
⚫ where ki is the number of neighbours out of the k nearest that belong to Ci, and Vk(x) is the volume of the d-dimensional hypersphere centred at x with radius r = ∥x − x(k)∥, where x(k) is the k-th nearest observation to x (among all neighbours from all classes).
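The two expressions did not survive extraction; the standard k-NN estimates, consistent with the notation above (with Ni the number of training examples of class Ci), are

\hat{p}(x \mid C_i) = \frac{k_i}{N_i\,V_k(x)}, \qquad \hat{P}(C_i) = \frac{N_i}{N}, \qquad \hat{p}(x) = \frac{k}{N\,V_k(x)}

so that

\hat{P}(C_i \mid x) = \frac{\hat{p}(x \mid C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}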
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour

⚫ This nonparametric classifier is called the k-NN classifier
⚫ A simple yet very good classifier
⚫ The test case belongs to the ‘green’ class because, out of its 3 nearest neighbours, 2 belong to the class ‘green’ (a small code sketch follows)
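A minimal Python sketch (not from the slides; the 2-D points and labels are hypothetical) of the k-NN classifier: predict the majority class among the k nearest training examples of the test point.

from collections import Counter
import numpy as np

def knn_classify(xq, X, y, k=3):
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - xq, axis=1)       # Euclidean distances to all training points
    nearest = np.argsort(d)[:k]              # indices of the k nearest neighbours
    return Counter(y[i] for i in nearest).most_common(1)[0][0]   # majority vote

X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = ["green", "green", "green", "yellow", "yellow", "yellow"]
print(knn_classify(np.array([2.0, 2.0]), X, y, k=3))   # -> 'green'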
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
k-Nearest Neighbour – choice of k (Extra Reading)

⚫ Changing k changes the prediction for a given test case/example, i.e., k affects the prediction. The following plots show that when k = 1 the classifier overfits (no misclassifications, all training instances are predicted correctly); such a classifier will have a high variance and a high testing/generalization error. When k = 100 the classifier is much less flexible (there are misclassifications on the training set), but it will therefore have a low variance and typically a lower generalization/testing error.
MACHINE LEARNING
k-Nearest Neighbour – choice of k (Extra Reading)
MACHINE LEARNING
k-Nearest Neighbour – Special case k = 1

⚫ A special case of k-nn is the nearest neighbor classifier where k = 1


and the input is assigned to the class of the nearest pattern. This
divides the space in the form of a Voronoi tessellation
⚫ Observe carefully, 1-NN approximates the discriminant in a piecewise
linear manner.
MACHINE LEARNING
k-Nearest Neighbour – Special case k = 1

⚫ The time and space complexity of k-NN is O(N)
⚫ Only the instances that define the discriminant need to be kept.
⚫ An instance inside a class region need not be stored, since its nearest neighbour is of the same class and its absence does not cause any error (on the training set).
⚫ Such a subset is called a consistent subset, and we would like to find the minimal consistent subset.
⚫ This gives rise to the Condensed Nearest Neighbour algorithm (see the sketch below).
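A minimal Python sketch (an assumption based on the condensed nearest neighbour idea described above, not code from the textbook): keep a subset Z of the training set and add an instance only if the current Z misclassifies it with the 1-NN rule.

import numpy as np

def condense(X, y, passes=5):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Z = [0]                                         # start by storing one instance
    for _ in range(passes):
        changed = False
        for i in range(len(X)):
            d = np.linalg.norm(X[Z] - X[i], axis=1)
            if y[Z[int(np.argmin(d))]] != y[i]:     # misclassified by 1-NN on the stored subset
                Z.append(i)
                changed = True
        if not changed:                             # consistent subset found
            break
    return X[Z], y[Z]

Xs, ys = condense([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]],
                  ["green", "green", "green", "yellow", "yellow", "yellow"])
print(len(ys))   # a consistent subset, usually much smaller than the full training set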
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

⚫ Given N training instances from K classes, we have to determine which class a test case/instance belongs to – this is what we do in classification.
⚫ We can use the Mahalanobis distance as the distance metric to determine which class the test case belongs to.
⚫ The mean and covariance matrix are to be estimated for each of the K classes from the training instances.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

⚫ There are 13 training examples, 7 from class ‘green’ and 6 from class ‘yellow’. I need to determine which class the test point ‘red’ belongs to, using the Mahalanobis distance as the distance measure.
⚫ I need to find the sample mean of the ‘green’ examples, estimate the covariance matrix of the ‘green’ examples, and then calculate the MD of the ‘red’ point from the ‘green’ cluster/distribution.
⚫ I need to repeat the above step for the ‘yellow’ examples and calculate the MD of the ‘red’ point from the ‘yellow’ cluster/distribution.
⚫ The smallest MD corresponds to the class the ‘red’ test case belongs to.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

m_i is the mean of class C_i
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Given the training data set below, determine which class the point (66, 640, 44) belongs to. Use the Mahalanobis distance as the distance metric.

x1   x2    x3   Class
64   580   29   C0
64   570   33   C0
68   590   37   C0
69   660   46   C0
73   600   55   C0
80   580   21   C1
82   570   22   C1
89   590   39   C1
87   660   19   C1
77   600   25   C1
72   595   38   C1
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Sample mean m0 = [67.6 600 40]^T

Covariance matrix S0 (using the 1/N convention) is

S0 = [ 11.44    52    30.6
        52    1000   164
        30.6   164    88  ]

The Mahalanobis distance (squared form, (x − m0)^T S0^{-1} (x − m0)) of the point x = (66, 640, 44) from the cluster consisting of the examples from class C0 is 11.65.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

Similarly, calculate m1 and the covariance matrix S1.

Calculate the Mahalanobis distance MD1 of the point x = (66, 640, 44) from the cluster consisting of the examples from class C1.

If MD0 < MD1, the point x belongs to class C0; otherwise it belongs to class C1. A sketch of this computation follows.
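A minimal Python sketch (not from the slides) that carries out this procedure: estimate the mean and the 1/N covariance matrix of each class and assign the test point to the class with the smaller (squared) Mahalanobis distance.

import numpy as np

X0 = np.array([[64, 580, 29], [64, 570, 33], [68, 590, 37], [69, 660, 46], [73, 600, 55]], float)
X1 = np.array([[80, 580, 21], [82, 570, 22], [89, 590, 39], [87, 660, 19], [77, 600, 25], [72, 595, 38]], float)
x = np.array([66, 640, 44], float)

def mahalanobis_sq(x, X):
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)      # 1/N covariance, as used on the slide
    d = x - m
    return d @ np.linalg.solve(S, d)            # (x - m)^T S^{-1} (x - m)

md0, md1 = mahalanobis_sq(x, X0), mahalanobis_sq(x, X1)
print(md0, md1)                                 # md0 is about 11.65
print("C0" if md0 < md1 else "C1")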
MACHINE LEARNING
Decision Tree

⚫ A very widely used non parametric supervised learning technique, used for both classification and regression.
⚫ Advantages
⚫ Easy interpretability – someone not skilled in ML understands a decision tree easily
⚫ Identifies and handles outliers very well
⚫ Disadvantages
⚫ Prone to overfitting – pruning is the solution
⚫ Very sensitive to changes in the training data set – not in textbook, you may skip
MACHINE LEARNING
Decision Tree – An example

(Figure: an example decision tree – the root node is at depth 0, intermediate nodes at depth 1, and the leaves at depth 2.)
MACHINE LEARNING
Decision Tree for Classification
MACHINE LEARNING
Decision Tree for Classification

⚫ Binary Classification Problem


⚫ 10 training examples given
⚫ 4 attributes/features, namely, ear shape, face shape, whiskers, and weight.
⚫ Given an animal with certain ear shape, face shape, whiskers and
weight, we have to predict if the animal is a cat or not.
MACHINE LEARNING
Decision Tree for Classification

⚫ Three attributes take values from a finite set, whereas weight is a


continuous valued variable.
⚫ We will build a decision tree based on these 10 instances. And when a
new example/test case arrives, we will predict its class by traversing the
tree top to bottom beginning from root node. The traversal will end at
one of the leaf nodes.
MACHINE LEARNING
Decision Tree for Classification

⚫ We ask a question at the root node and split the dataset into two, thereby creating 2 intermediate nodes at depth 1. We ask further questions at the 2 intermediate nodes and split the two datasets, thereby creating 4 regions, or 4 intermediate nodes at depth 2. This is called recursive splitting. We split until some stopping criterion is satisfied.
⚫ Which attribute should be chosen as the splitting variable?
MACHINE LEARNING
Decision Tree for Classification

⚫ Once a splitting variable is selected, what should be the splitting point?


for example, should I ask “Is weight > 7.2 ?”, or should I ask “Is weight > 11 ?”.
7.2 or 11 is the splitting point.
⚫ We would design an error function (also widely known in literature as cost
function/ loss function/ objective function), and write an algorithm that
would suggest a splitting variable and a splitting point that minimizes the
error function.
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm will continue to split the tree (increase the depth of the tree) until some stopping criterion is met.
⚫ Widely used stopping criterion (not in text book, you may skip)
⚫ When node is 100% pure
⚫ When splitting a node will result in the tree exceeding a maximum depth (also called pre pruning)
⚫ Information gain from additional splits is less than threshold
⚫ When number of examples in a node is below a threshold
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm may use weighted Entropy / Impurity as the error function.
⚫ It means every split will maximise the gain in Information (which is nothing
but change in Impurity or change in weighted Entropy)
⚫ Algorithm will favour that splitting variable and that splitting point – which
leads to maximum reduction in entropy, or leads to maximum reduction in
Impurity
⚫ Typical stopping criterion for this case should be – splitting should stop
when Information gain from additional splits is less than threshold
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm may use weighted Gini Index as the error function.
⚫ It means every split will maximise the change in Gini Index
⚫ The algorithm will favour the splitting variable and splitting point that lead to the minimum possible weighted Gini index over the child nodes.
⚫ Typical stopping criterion for this case should be 100% node purity
(when a node has all examples belonging to a single class)
⚫ When an attribute takes two values (the attribute whiskers: present or
not present), no decision on splitting point to be taken – the complexity
of the problem gets reduced significantly.
MACHINE LEARNING
Decision Tree for Classification

⚫ Entropy = -\sum_{k=1}^{K} \hat{p}_{mk} \log_2 \hat{p}_{mk}

⚫ K = number of classes; the log is base 2

⚫ \hat{p}_{mk} is the proportion of training observations in the m-th region/node that are from the k-th class.
⚫ If K = 2, the expression for entropy becomes
- p log2 p - (1 - p) log2 (1 - p)
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ The splitting variable for the root node is ‘Ear Shape’. We are asking if the
animal’s ‘Ear Shape is Pointy or not?’
MACHINE LEARNING
Decision Tree for Classification – Calculation of Entropy, Impurity

⚫ Training set has 5 cats, 5 No cats. Entropy at root node (before any
splitting on the dataset) = - plog2p - (1-p)log2(1-p) = 1, p=5/10 =0.5
⚫ After splitting, the left node has 4 cats, 1 no cat. Entropy = - plog2p - (1-
p)log2(1-p) = 0.72, p=4/5 = 0.8
⚫ After splitting, the right node has 1 cat, 4 no cats. Entropy = - plog2p -
(1-p)log2(1-p) = 0.72, p=1/5 = 0.2
⚫ Weighted Entropy = Impurity = (5/10)*0.72 + (5/10)*0.72 = 0.72
⚫ Why 5/10 ? Because out of 10 examples at root node, 5 examples are
into the left child of root node and the other 5 examples fall into the
right child of the root node.
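A minimal Python sketch (not from the slides) reproducing the entropy calculation above for the 'Ear Shape' split (5 cats / 5 not-cats at the root, then a 4/1 and a 1/4 split):

import numpy as np

def entropy(p):
    if p in (0.0, 1.0):                          # define 0*log2(0) = 0
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

root = entropy(5 / 10)                           # 1.0 bit before the split
left, right = entropy(4 / 5), entropy(1 / 5)     # 0.72 each
weighted = (5 / 10) * left + (5 / 10) * right    # impurity after the split = 0.72
print(weighted, "information gain:", root - weighted)   # gain = 0.28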
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ The splitting variable for the root node is ‘Face Shape’. We are asking if the
animal’s ‘Face Shape is Round or not?’
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ We choose the splitting variable and splitting point that give the highest change in impurity (weighted entropy), i.e., the largest information gain.
⚫ This is to be repeated at every node at every depth, until the stopping criterion is met.
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ IRIS dataset
⚫ 3 classes – Setosa, Versicolor, Virginica
⚫ 50 training examples of each class – 150 examples in total
⚫ 4 attributes – what are those? (sepal length, sepal width, petal length, petal width)
⚫ Gini index = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})
⚫ \hat{p}_{mk} is the proportion of training observations in the m-th region/node that are from the k-th class.
⚫ If K = 2, the expression for the Gini index becomes p(1 − p) + (1 − p)p = 2p(1 − p)
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ From the 150 examples, a train-test split has been done. Among the 105 training examples, 35 belong to Setosa, 33 to Versicolor, and 37 to Virginica
⚫ The Gini index at the root node is (35/105)*(1-35/105) + (33/105)*(1-33/105) + (37/105)*(1-37/105) = 0.666
⚫ The splitting variable at the root node is petal width
⚫ The splitting point at the root node is 0.8
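A minimal Python sketch (not from the slides) reproducing the Gini index at the root node for the 105 IRIS training examples:

def gini(counts):
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

print(gini([35, 33, 37]))   # about 0.666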
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ Calculate Gini Index for left child of root node and right child of root
node and verify that you indeed get these values.
⚫ To be repeated at every node at every depth, until stopping criterion is
met.
⚫ The ‘orange’ node has a Gini index = 0. Therefore no more splitting, it
becomes a leaf.
MACHINE LEARNING
Decision Tree for Classification – Classification Error Rate

For K = 2, i.e., Binary Classification
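The expression did not survive extraction; the usual definition of the classification error rate for node m, consistent with the entropy and Gini notation above, is

E_m = 1 - \max_{k} \hat{p}_{mk}

and for K = 2 this becomes E = 1 − max(p, 1 − p) = min(p, 1 − p).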


MACHINE LEARNING
Decision Tree for Classification – Prediction

⚫ Once a decision tree is built using training examples, you need to


predict a class for the test case.
⚫ Suppose test case falls into l-th leaf.
⚫ l-th leaf has N1 examples with labels = class 1, N2 examples with labels =
class 2 , and N3 examples with labels = class 3 .
⚫ Prediction for the test case will be that k for which Nk is the highest
⚫ Predicted class is Virginica if test case falls into this bin
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression

⚫ Error function /Loss function/ Cost Function/ Objective Function for


Regression?
⚫ Can be mean square error
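The expression did not survive extraction; a standard choice, consistent with the r^t and g_m notation defined on the next slide, is the within-node mean squared error

E_m = \frac{1}{N_m}\sum_{x^t \in X_m} \left(r^t - g_m\right)^2, \qquad g_m = \frac{1}{N_m}\sum_{x^t \in X_m} r^t

where X_m is the set of training examples reaching node m and N_m is their number; the split chosen is the one that gives the largest reduction in this error.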
MACHINE LEARNING
Decision Tree for Regression

⚫ r^t is the t-th response
⚫ For the above example, t = 1, 2, 3, …, 10
⚫ The corresponding r values are 7.2, 8.8, 15, 9.2, …
⚫ g_m is the average (mean) of the responses of all the training examples falling into the m-th node/leaf/region
⚫ g1 = (7.2 + 8.4 + 7.2 + 10.2)/4 = 8.25
⚫ g2 = 9.2/1 = 9.2
⚫ g3 = (15 + 18 + 20)/3 = 17.67
⚫ How many leaves are there? (Three – one per g_m.)
MACHINE LEARNING
Decision Tree for Regression

⚫ Stopping Criterion? (Not in text book, you may skip)


⚫ When splitting a node will result in the tree exceeding a maximum
depth (also called pre pruning)
⚫ Reduction in mean square error (you can think of it as reduction in
variance) from additional splits is less than threshold
⚫ When number of examples in a node is below a threshold
MACHINE LEARNING
Decision Tree - Pruning

⚫ A decision tree (regression tree or classification tree) tends to overfit
⚫ The greater the depth of the tree, the higher the overfitting.
⚫ Overfitting means high variance, which in turn means high testing error.
⚫ Solution – Pruning
⚫ Two types of pruning
⚫ Pre Pruning
⚫ Post Pruning
MACHINE LEARNING
Decision Tree – Pre Pruning

⚫ We don’t let the tree grow beyond a certain depth
⚫ We use the tree depth as the stopping criterion
MACHINE LEARNING
Decision Tree – Post Pruning

⚫ Let the tree grow as per a stopping criterion; use any stopping criterion other than tree depth
⚫ Now prune the tree to see if a modified tree with a reduced number of leaves (fewer leaves means smaller depth) serves a similar purpose as the original tree
⚫ How to measure which tree is best?
⚫ Use cross validation – this is the only way out
MACHINE LEARNING
Decision Tree – Post Pruning
⚫ Tree built using the Hitters dataset.
⚫ Cross validation shows that a tree with 3 leaves gives the least test error.
⚫ Therefore the pruned tree with 3 leaves (shown in the next slide) is to be used
MACHINE LEARNING
Decision Tree – Post Pruning

⚫ Tree size = no of leaves


⚫ Pruning concept to be understood clearly.
⚫ No need to run this dataset and go into details of this example.
UE20EC352-Machine Learning &
Applications
Unit 3 - Non Parametric Supervised
Learning

Slides prepared by:


Prof. Veena
Department of ECE, PESU.
MACHINE LEARNING
Rule extraction from trees

➢ A decision tree does its own feature extraction. The univariate tree only uses the necessary
variables, and after the tree is built, certain features may not be used at all.
➢ We can also say that features closer to the root are more important globally.
➢ Another main advantage of decision trees is interpretability.
➢ Each path from the root to a leaf corresponds to one conjunction of tests, as all those conditions should be satisfied to reach the leaf.
➢ These paths together can be written down as a set of IF-THEN rules, called a rule base
MACHINE LEARNING
Rule extraction from trees

➢ The decision tree given in


figure uses x1, x2, and x4, but
not x3. It is possible to use a
decision tree for feature
extraction
MACHINE LEARNING
Rule extraction from trees

➢ Such a rule base allows knowledge extraction; it can be easily understood and allows
experts to verify the model learned from data.
➢ For each rule, one can also calculate the percentage of training data covered by the
rule, namely, rule support.
➢ In the case of a classification tree, there may be more than one leaf labeled with the
same class. In such a case, these multiple conjunctive expressions corresponding to
different paths can be combined as a disjunction (OR).
• IF (x1 ≤ w10) OR ((x1 > w10) AND (x2 ≤ w20)) THEN C1
• One can also prune rules for simplification.
MACHINE LEARNING
Learning Rules from data

➢ Learning rules directly from data, instead of first building a tree and converting it to IF-THEN statements, is called rule induction.
➢ Rules are learned one at a time.
➢ Each rule is a conjunction of conditions on discrete or numeric attributes (as in decision trees) and
these conditions are added one at a time, to optimize some criterion, for example, minimize entropy.
➢ Sequential Covering:
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is met
MACHINE LEARNING
Learning Rules from data
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

➢ It stands for Repeated Incremental Pruning to Produce Error Reduction. The Ripper Algorithm
is a Rule-based classification algorithm. It derives a set of rules from the training set. It is a
widely used rule induction algorithm.
➢ Case I: Training records belong to only two classes
• Among the records given, it identifies the majority class (the one that appears the most) and takes this class as the default class. For example, if there are 100 records and 80 belong to Class A and 20 to Class B, then Class A will be the default class.
• For the other class, it tries to learn/derive rules to detect that class.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm
➢ How rule is learned:
• In the first instance, it tries to derive rules for those records which belong to class C1. Records
belonging to C1 will be considered as positive examples(+ve) and other classes will be
considered as negative examples(-ve).
• Next, at this junction Ripper tries to derive rules for C2 distinguishing it from the other classes.
• This process is repeated until the stopping criterion is met, i.e., when only Cn (the default class) is left.
• Ripper extracts rules from minority class to the majority class.
➢ Rule Growing in RIPPER Algorithm:
• Ripper uses a general-to-specific strategy for growing rules: it starts from an empty rule and keeps adding the best conjunct to the rule.
• For evaluating candidate conjuncts, the metric used is FOIL’s information gain; the conjunct with the highest gain is chosen.

• Stopping Criteria for adding the conjuncts – when the rule starts covering the negative (-ve)
examples.
• The new rule is pruned based on its performance on the validation set.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

➢ Rule Pruning Using RIPPER:


• We need to identify whether a particular rule should be pruned or not. To determine this a metric
is used, which is (P-N)/(P+N)

P = number of positive examples in the validation set covered by the rule.


N = number of negative examples in the validation set covered by the rule.
➢ Whenever a conjunct is added or removed we calculate the value of the above metric for the
original rule (before adding/removing) and the new rule (after adding/removing).
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

Building a Rule Set:


➢ Use sequential covering algorithm
• Finds the best rule that covers the current set of positive examples
• Eliminate both positive and negative examples covered by the rule
➢ Each time a rule is added to the rule set, compute the new description length
• stop adding new rules when the new description length is d bits longer than the smallest
description length obtained so far
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

Optimize the rule set:


➢ For each rule r in the rule set R
• Consider 2 alternative rules:
– Replacement rule (r*): grow new rule from scratch
– Revised rule(r’): add conjuncts to extend the rule r
• Compare the rule set for r against the rule set for r* and r’
• Choose rule set that minimizes MDL principle
➢ Repeat rule generation and rule optimization for the remaining positive examples
MACHINE LEARNING
Outlier Detection

➢ An outlier, novelty, or anomaly is an instance that is very much different from


other instances in the sample. An outlier may indicate an abnormal behavior of
the system.
➢ Outlier detection is sometimes posed as one-class classification.
➢ Once we model the typical instances, any instance that does not fit the model
(and this may occur in many different ways) is an anomaly.
➢ Outlier detection basically implies spotting what does not normally happen;
that is, it is density estimation followed by checking for instances with too
small probability under the estimated density.
➢ As usual, the fitted model can be parametric, semiparametric, or
nonparametric.
MACHINE LEARNING
Outliers
➢ In nonparametric density estimation, as we discussed in the preceding
sections, the estimated probability is high where there are many training
instances nearby and the probability decreases as the neighborhood
becomes more sparse.

➢ The local outlier factor (LOF) compares the denseness of the neighborhood of an instance with the average denseness of the neighborhoods of its neighbors

➢ LOF is given by
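The formula did not survive extraction; a simple form of the local outlier factor consistent with the description above (an assumption; here d_k(x) is the distance from x to its k-th nearest neighbour and N(x) is the set of its k nearest neighbours) is

\mathrm{LOF}(x) = \frac{d_k(x)}{\dfrac{1}{|N(x)|}\displaystyle\sum_{s \in N(x)} d_k(s)}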

➢ If LOF(x) is close to 1, x is not an outlier; as it gets larger, the probability that


it is an outlier increases
MACHINE LEARNING
Outliers
