
UE20EC352-Machine Learning &

Applications
Unit 3 - Non Parametric Supervised
Learning

Slides prepared by:


Dr. Arpita Thakre
Department of ECE, PESU.

Acknowledgement: Examples and figures taken from text book and reference books
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ In supervised learning, whether regression or classification, we use N training examples/samples to learn a mapping between d attributes/features/factors x = [x1, x2, x3, …, xd]^T and the response/target/label r (y is another commonly used symbol for the response).
⚫ In parametric supervised learning we assume at the outset that this mapping can be completely described in terms of a finite number of parameters θ0, θ1, θ2, … (w, β are other commonly used symbols for parameters).
⚫ We then use the N training examples to learn the parameters θ0, θ1, θ2, …
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ Recall that this is exactly what we did in linear regression, logistic regression, and discriminant analysis (during Unit 2).
⚫ In non parametric supervised learning, we do not make such an assumption to begin with.
⚫ We try to build a model using the N training samples – a model that exploits the “similarity” between the N training examples.
⚫ Non parametric supervised learning consists of finding similar past instances from the training set using a “suitable distance measure” and interpolating from them to find the right output for the test data.
MACHINE LEARNING
What is Non Parametric Supervised Learning

Let us consider the univariate regression training set given below. In parametric regression, we are asked to learn a mapping between x and r. Suppose we assume that this mapping can be described by a straight line, i.e., r = θ0 + θ1x. We then learn θ0, θ1 from the N training examples.

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6
MACHINE LEARNING
What is Non Parametric Supervised Learning
⚫ Prediction: for any new x (outside the training set) given to us, we predict the response to be θ0 + θ1x.
⚫ Note that the number of parameters, which is 2 here, does not depend on N.
⚫ In a parametric model, all of the N training instances affect the final global model θ0 + θ1x.
⚫ In the nonparametric case, however, there is no single global model; local models are estimated as they are needed, affected only by the nearby training instances.
⚫ In nonparametric supervised learning we do end up calculating a large number of parameters, and the number of parameters grows with the number of training instances N.
MACHINE LEARNING
What is Non Parametric Supervised Learning

⚫ Non parametric learning does not mean there are no parameters to be estimated.
⚫ Also called instance-based learning, memory-based learning, or lazy learning.
⚫ Nonparametric learning needs the training instances/examples to be stored – space complexity grows as O(N).
⚫ Nonparametric learning needs the similarity between a test example and the N training instances to be computed – time complexity grows as O(N).
⚫ The time complexity of parametric learning techniques grows as O(d) or O(d²).
MACHINE LEARNING
When is Non Parametric Supervised Learning preferred over Parametric Learning

⚫ Can give more accurate predictions than parametric techniques, when the assumed parametric form does not match the true mapping.
⚫ When we suspect the presence of outliers in the training set.
⚫ When we need to do anomaly detection.
⚫ When we are looking to build a mapping that can be easily interpreted and understood by someone not skilled in ML.
MACHINE LEARNING
Disadvantages of Non Parametric Supervised Learning

⚫ More training data is needed compared to parametric techniques
⚫ Slower compared to parametric techniques, since all of the computation is deferred to prediction time
⚫ Susceptible to overfitting
MACHINE LEARNING
What we will learn in this Unit

⚫ Density Estimation
Histogram
Naïve Estimator
Estimation using Gaussian Kernel
K nearest neighbour density estimator
⚫ Nonparametric Regression
Regressogram
Running Mean Smoother
Kernel Smoother with Gaussian Kernel
LOESS
MACHINE LEARNING
What we will learn in this Unit

⚫ Nonparametric Classification
Discriminant Function based
Distance Measure based
K Nearest Neighbour
⚫Condensed Nearest Neighbour
⚫ Decision Tree
Classification
Regression
MACHINE LEARNING
Density Estimation - Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3

• The nonparametric estimate for the density function, which is the derivative of the cumulative distribution, can be calculated from the bin counts as

\hat{p}(x) = \frac{\#\{\, x^t \text{ in the same bin as } x \,\}}{Nh}

where h is the bin width and N is the number of training samples.
MACHINE LEARNING
Density Estimation - Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Choose a beginning point (origin), say x = 0
⚫ Count the number of samples in the bins (0, 6] and (6, 12]
⚫ These counts are 5 and 4
⚫ The histogram estimate is 5/54 in the range (0, 6] and 4/54 in the range (6, 12], since Nh = 9 × 6 = 54 (a small code sketch follows)
⚫ The area under the curve has to be 1: 6 × 5/54 + 6 × 4/54 = 1
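A minimal Python sketch (not from the slides) of this histogram estimate, reproducing the two values above; the sample, h = 6 and the origin 0 are taken from the worked example:

import numpy as np

X = np.array([0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3])

def histogram_density(x, X, h=6.0, origin=0.0):
    k = np.ceil((x - origin) / h) - 1            # index of the bin (origin + k*h, origin + (k+1)*h]
    lo, hi = origin + k * h, origin + (k + 1) * h
    count = np.sum((X > lo) & (X <= hi))         # samples falling in the same bin as x
    return count / (len(X) * h)

print(histogram_density(3.0, X))   # 5/54 ~ 0.0926 for x in (0, 6]
print(histogram_density(8.0, X))   # 4/54 ~ 0.0741 for x in (6, 12]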
MACHINE LEARNING
Histogram
MACHINE LEARNING
Disadvantage of Histogram
MACHINE LEARNING
Disadvantage of Histogram

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3

⚫ The histogram depends on the choice of the beginning point (origin)
⚫ There are jumps in the histogram at the bin boundaries
MACHINE LEARNING
Density Estimation – Naïve Estimator

The naive estimator frees us from having to set an origin: to remove the dependence on the bin boundaries, a window of width h is centred at the point x where the density is estimated.

\hat{p}(x) = \frac{\#\{\, x - h/2 < x^t \le x + h/2 \,\}}{Nh}

Equivalently, in terms of a weight (window) function w(·),

\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}
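A minimal Python sketch (not from the slides) of the naive estimator defined above:

import numpy as np

def naive_estimator(x, X, h):
    X = np.asarray(X, dtype=float)
    count = np.sum((X > x - h / 2) & (X <= x + h / 2))   # samples inside the window of width h centred at x
    return count / (len(X) * h)

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
print(naive_estimator(2.0, X, h=6))   # 5 samples fall in (-1, 5], so 5/54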
MACHINE LEARNING
Density Estimation – Naïve Estimator

X: 0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Partition the real line (x takes real values) into the segments (x1 − h/2, x1 + h/2], (x2 − h/2, x2 + h/2], …, (x9 − h/2, x9 + h/2]
⚫ Count the number of samples in these segments
⚫ These counts are 5, 5, 5, 5, 5, 3, 3, 3, 1
⚫ The area under the curve has to be 1
⚫ The density is 5/57.1 in the range (−2.6, 5.4], 3/57.1 in the range (5.4, 10], and 1/57.1 in the range (10, 13.3]
MACHINE LEARNING
Naïve Estimator
Naive estimator:
MACHINE LEARNING
Disadvantage of Naïve Estimator

The naive estimator still has jumps: the estimate is discontinuous at the points x^t ± h/2.

The smaller the h, the larger the number of jumps and the more jagged the estimated density.
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel

To get a smooth estimate, we use a smooth weight function, called a kernel function.

• Kernel function, e.g., the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)

• Kernel estimator (Parzen windows):

\hat{p}(x) = \frac{1}{Nh}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel

o The kernel function K(·) determines the shape of the influence of each sample, and the window width h determines its width.

o All the x^t have an effect on the density estimate at x, and this effect decreases smoothly as |x − x^t| increases.

o To simplify calculation, K(·) can be taken to be 0 for |x − x^t| > 3h (this follows from the properties of the Gaussian distribution, which has negligible mass beyond three standard deviations)
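A minimal Python sketch (not from the slides) of the Gaussian-kernel (Parzen window) estimator above:

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kernel_density(x, X, h):
    X = np.asarray(X, dtype=float)
    return gaussian_kernel((x - X) / h).sum() / (len(X) * h)

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
for x in (1.0, 5.0, 8.0):
    print(x, kernel_density(x, X, h=1.0))   # smooth estimate, no jumps at bin boundaries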
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator

• The degree of smoothing is controlled by k, number of nearest


neighbours

• k is chosen much smaller than N

• Instead of fixing bin width h and counting the number of training


instances in the bin, (as we did in histogram, Naive estimator,
Smoothing with Gaussian Kernel), we fix k, the number of
observations to fall in the bin, and compute the bin size.
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator

• Estimated density: \hat{p}(x) = \dfrac{k}{2\,N\,d_k(x)}

• This estimate integrates to ∞, not to 1, so it is not a proper density.

• N = total number of training samples

• Choose k; suppose k = 3

• d_k(x) = distance from x to its k-th nearest neighbour, i.e., to the farthest of its k nearest neighbours
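A minimal Python sketch (not from the slides) of this k-NN density estimate:

import numpy as np

def knn_density(x, X, k):
    X = np.asarray(X, dtype=float)
    d = np.sort(np.abs(X - x))            # distances from x to all training samples
    return k / (2 * len(X) * d[k - 1])    # d[k-1] is d_k(x), the k-th nearest distance

X = [0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3]
print(knn_density(2.0, X, k=3))   # d_3(2.0) = 1.2, so 3 / (2 * 9 * 1.2)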
MACHINE LEARNING
Density Estimation – K Nearest Neighbour Estimator
MACHINE LEARNING
Density Estimation

To get a smoother estimate, we can use a kernel function whose effect decreases with increasing distance. This is like a kernel estimator with an adaptive smoothing parameter h = d_k(x). K(·) is typically taken to be the Gaussian kernel.

Note: K is used for the number of classes in classification, K(·) is used here to denote the kernel, and k appears again in k-NN. They have different meanings, so interpret k/K as per the problem/context.
MACHINE LEARNING
Non Parametric Regression – Regressogram

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6
MACHINE LEARNING
Non Parametric Regression – Regressogram
MACHINE LEARNING
Non Parametric Regression – Regressogram

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Choose a beginning point, say x = 0
⚫ Count the number of samples in the bins (0, 6] and (6, 12]
⚫ These counts are 10 and 2
⚫ The estimated response/output is the average of the responses in each bin: (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2+2.4+1.8)/10 = 1.28 in the range (0, 6], and (1.2+2.6)/2 = 1.9 in the range (6, 12] (verified in the sketch below)
⚫ The higher the h, the smoother the curve and the lower the variance
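A minimal Python sketch (not from the slides) of the regressogram: the prediction at a query point is the average response of the training samples that fall in the same bin.

import numpy as np

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])

def regressogram(xq, x, r, h=6.0, origin=0.0):
    k = np.ceil((xq - origin) / h) - 1                            # bin index of the query point
    in_bin = (x > origin + k * h) & (x <= origin + (k + 1) * h)
    return r[in_bin].mean()                                       # average response in that bin

print(regressogram(3.0, x, r))   # mean of the 10 responses in (0, 6]  -> 1.28
print(regressogram(7.0, x, r))   # mean of the  2 responses in (6, 12] -> 1.9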
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother

x     y/r
0.5   1.2
0.8   1.7
0.9   2
1.9  -1.5
2.3  -1.8
2.9   2.6
4.0   2.2
5.5   2.2
5.7   2.4
6     1.8
6.3   1.2
7.3   2.6

⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Partition the real line (x takes real values) into the segments (x1 − h, x1 + h], (x2 − h, x2 + h], …, (x12 − h, x12 + h]
⚫ Count the number of samples in these segments
⚫ These counts are 8, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 9
⚫ The estimated response/output is (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2)/8 = 1.075 in the range (−5.5, 6.5], (1.2+1.7+2−1.5−1.8+2.6+2.2+2.2+2.4+1.8+1.2)/11 = 1.27 in the range (6.5, 6.9], 1.383 in the range (6.9, 12.3], and 1.300 in the range (12.3, 13.3]
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother

⚫ The mapping between attributes and response is described by 4


quantities, the smoothed responses in the ranges (-5.5, 6.5],
(6.5,6.9], (6.9,12.3], (12.3,13.3].
⚫ For reduced h, number of bins and number of computations
would increase
⚫ If instead of 12, we had 18 training instances, then even for h =
6, number of bins and number of computations would have
been higher.
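A minimal Python sketch of a running mean smoother (an assumption about the window convention: the estimate at a query point xq is the average response of the training samples in (xq − h, xq + h]; the slide's own bin arithmetic may use a slightly different convention):

import numpy as np

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])

def running_mean(xq, x, r, h=6.0):
    in_window = (x > xq - h) & (x <= xq + h)   # training samples within h of the query point
    return r[in_window].mean()

print(running_mean(0.5, x, r))   # average response over the window (-5.5, 6.5]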
MACHINE LEARNING
Non Parametric Regression – Kernel Smoother with Gaussian Kernel

K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)
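The smoother itself did not survive extraction; the standard Gaussian kernel smoother (a Nadaraya–Watson weighted average of the responses, which is presumably what this slide shows) is

\hat{g}(x) = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)}

and a minimal Python sketch of it is:

import numpy as np

def kernel_smoother(xq, x, r, h=1.0):
    w = np.exp(-0.5 * ((xq - x) / h) ** 2)   # Gaussian weights; the 1/sqrt(2*pi) factor cancels
    return np.sum(w * r) / np.sum(w)

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])
print(kernel_smoother(3.0, x, r, h=1.0))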
MACHINE LEARNING
Non Parametric Regression – Kernel Smoother with Gaussian Kernel
MACHINE LEARNING
Non Parametric Regression – Running Line Smoother - LOESS

⚫ Instead of taking an average of the responses over a bin of


width h, and giving a constant fit at a point, we can take into
account responses of neighbouring points to fit a line.
⚫ k = no of neighbours (we use symbol K for number of classes
when we deal with classification problem)
⚫ For a given training example, say, xi, find its nearest k
neighbours, find intercept and slope of a straight line that best
fits the xi and its k nearest neighbours.
⚫ We will end up representing the mapping between attributes and response in terms of multiple local regression lines, as shown in the next slide; a small code sketch is given below.
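A minimal Python sketch in the spirit of LOESS (an assumption, not the textbook's exact algorithm): fit a weighted straight line through the k nearest neighbours of the query point, using hypothetical tricube weights, and evaluate that line at the query point.

import numpy as np

def loess_point(xq, x, r, k=5):
    idx = np.argsort(np.abs(x - xq))[:k]                     # k nearest neighbours of xq
    xn, rn = x[idx], r[idx]
    d = np.abs(xn - xq).max() + 1e-12                        # neighbourhood radius
    w = (1 - np.clip(np.abs(xn - xq) / d, 0, 1) ** 3) ** 3   # tricube weights
    sw = np.sqrt(w)
    A = np.column_stack([np.ones_like(xn), xn])              # design matrix for a line [1, x]
    theta, *_ = np.linalg.lstsq(A * sw[:, None], rn * sw, rcond=None)   # weighted least squares
    return theta[0] + theta[1] * xq                          # local line evaluated at xq

x = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r = np.array([1.2, 1.7, 2.0, -1.5, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])
print(loess_point(3.0, x, r, k=5))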
MACHINE LEARNING
LOESS – Locally Weighted Least Squares Regression
MACHINE LEARNING
Multivariate Data

⚫ The non parametric density estimation and regression examples discussed till now assumed the number of attributes to be 1 (d = 1), i.e., all were univariate examples.
⚫ All the techniques described so far can be extended to the multivariate case, i.e., when the number of attributes ≥ 2.
MACHINE LEARNING
Multivariate Data

⚫ For example, the multivariate kernel density estimator for a sample of d-dimensional observations X = {x^t} is

\hat{p}(x) = \frac{1}{Nh^{d}}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

⚫ with the multivariate Gaussian kernel

K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left(-\frac{\lVert u \rVert^{2}}{2}\right)
MACHINE LEARNING
Multivariate Data

⚫ When we discuss non parametric classification and decision


tree, we will take univariate as well as multivariate examples.
MACHINE LEARNING
Curse of Dimensionality

⚫ However, care should be taken when using nonparametric estimates in high-dimensional attribute/feature spaces because of the curse of dimensionality:
⚫ Suppose x is eight-dimensional (d = 8) and we use a histogram with ten bins per dimension; then there are 10^8 bins, and unless we have lots of data, most of these bins will be empty and the estimates there will be 0
MACHINE LEARNING
Choice of h or k
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based

⚫ Discriminant functions, as we learnt in parametric classification, are used to partition the feature space into K disjoint volumes/regions, where K is the number of classes (a standard form is given below).
⚫ Here P(Ci) is the prior probability of class Ci, which can be estimated from the training samples if needed.
⚫ p(x|Ci) is the class-conditional density. We assumed it to be Gaussian in Unit 2, and then estimated the means and covariance matrices from the training samples.
⚫ Here i goes from 1 to K.
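The discriminant expression itself did not survive extraction; a standard Bayes'-rule-based form consistent with the description above (an assumption) is

g_i(x) = \hat{p}(x \mid C_i)\,\hat{P}(C_i), \qquad \text{choose } C_i \text{ if } g_i(x) = \max_{j} g_j(x)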
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based

⚫ In non parametric classification, we cannot assume any


distribution for the class conditional densities.

⚫ We have to estimate the class conditional densities from


training instances/examples/samples using nonparametric
density estimators, for e.g., Naïve estimator, estimation with
Gaussian Kernel, etc.
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour

⚫ If we use the k-NN density estimator for estimating the class-conditional densities in discriminant-function-based classification, then the class-conditional density and the posterior probability of class Ci take the simple forms given below,
⚫ where ki is the number of neighbours out of the k nearest that belong to Ci, and Vk(x) is the volume of the d-dimensional hypersphere centred at x with radius r = ∥x − x(k)∥, where x(k) is the k-th nearest observation to x (among all neighbours from all classes).
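The two expressions did not survive extraction; the standard k-NN estimates, consistent with the notation above (with Ni the number of training examples of class Ci), are

\hat{p}(x \mid C_i) = \frac{k_i}{N_i\,V_k(x)}, \qquad \hat{P}(C_i) = \frac{N_i}{N}, \qquad \hat{p}(x) = \frac{k}{N\,V_k(x)}

so that

\hat{P}(C_i \mid x) = \frac{\hat{p}(x \mid C_i)\,\hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}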
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour

⚫ This nonparametric classifier is called the k-NN classifier
⚫ A simple yet very good classifier
⚫ The test case belongs to the ‘green’ class because, out of its 3 nearest neighbours, 2 belong to the class ‘green’ (a small code sketch follows)
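A minimal Python sketch (not from the slides; the 2-D points and labels are hypothetical) of the k-NN classifier: predict the majority class among the k nearest training examples of the test point.

from collections import Counter
import numpy as np

def knn_classify(xq, X, y, k=3):
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - xq, axis=1)       # Euclidean distances to all training points
    nearest = np.argsort(d)[:k]              # indices of the k nearest neighbours
    return Counter(y[i] for i in nearest).most_common(1)[0][0]   # majority vote

X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = ["green", "green", "green", "yellow", "yellow", "yellow"]
print(knn_classify(np.array([2.0, 2.0]), X, y, k=3))   # -> 'green'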
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
k-Nearest Neighbour – choice of k (Extra Reading)

⚫ Changing k changes the prediction for a given test case/example, i.e., k affects the prediction. The following plots show that when k = 1 the classifier overfits (no misclassifications, all training instances are predicted correctly); such a classifier will have a high variance and a high testing/generalization error. When k = 100 the classifier is much less flexible (there are misclassifications on the training set), but it will therefore have a low variance and typically a lower generalization/testing error.
MACHINE LEARNING
k-Nearest Neighbour – choice of k (Extra Reading)
MACHINE LEARNING
k-Nearest Neighbour – Special case k = 1

⚫ A special case of k-nn is the nearest neighbor classifier where k = 1


and the input is assigned to the class of the nearest pattern. This
divides the space in the form of a Voronoi tessellation
⚫ Observe carefully, 1-NN approximates the discriminant in a piecewise
linear manner.
MACHINE LEARNING
k-Nearest Neighbour – Special case k = 1

⚫ The time and space complexity of k-NN is O(N)
⚫ Only the instances that define the discriminant need to be kept.
⚫ An instance inside a class region need not be stored, since its nearest neighbour is of the same class and its absence does not cause any error (on the training set).
⚫ Such a subset is called a consistent subset, and we would like to find the minimal consistent subset.
⚫ This gives rise to the Condensed Nearest Neighbour algorithm (see the sketch below).
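A minimal Python sketch (an assumption based on the condensed nearest neighbour idea described above, not code from the textbook): keep a subset Z of the training set and add an instance only if the current Z misclassifies it with the 1-NN rule.

import numpy as np

def condense(X, y, passes=5):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Z = [0]                                         # start by storing one instance
    for _ in range(passes):
        changed = False
        for i in range(len(X)):
            d = np.linalg.norm(X[Z] - X[i], axis=1)
            if y[Z[int(np.argmin(d))]] != y[i]:     # misclassified by 1-NN on the stored subset
                Z.append(i)
                changed = True
        if not changed:                             # consistent subset found
            break
    return X[Z], y[Z]

Xs, ys = condense([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]],
                  ["green", "green", "green", "yellow", "yellow", "yellow"])
print(len(ys))   # a consistent subset, usually much smaller than the full training set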
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

⚫ Given N training instances from K classes, we have to determine which class a test case/instance belongs to – this is what we do in classification.
⚫ We can use the Mahalanobis distance as the distance metric to determine which class the test case belongs to.
⚫ The mean and covariance matrix are to be estimated for each of the K classes from the training instances.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

⚫ There are 13 training examples, 7 from class ‘green’ and 6 from class ‘yellow’. I need to determine which class the test point ‘red’ belongs to, using the Mahalanobis distance as the distance measure.
⚫ I need to find the sample mean of the ‘green’ examples, estimate the covariance matrix of the ‘green’ examples, and then calculate the MD of the ‘red’ point from the ‘green’ cluster/distribution.
⚫ I need to repeat the above step for the ‘yellow’ examples and calculate the MD of the ‘red’ point from the ‘yellow’ cluster/distribution.
⚫ The smallest MD corresponds to the class the ‘red’ test case belongs to.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

m_i is the mean of class C_i
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Given the training data set below, determine which class the point (66, 640, 44) belongs to. Use the Mahalanobis distance as the distance metric.

x1   x2    x3   Class
64   580   29   C0
64   570   33   C0
68   590   37   C0
69   660   46   C0
73   600   55   C0
80   580   21   C1
82   570   22   C1
89   590   39   C1
87   660   19   C1
77   600   25   C1
72   595   38   C1
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Sample mean m0 = [67.6 600 40]^T

Covariance matrix S0 (using the 1/N convention) is

S0 = [ 11.44    52    30.6
        52    1000   164
        30.6   164    88  ]

The Mahalanobis distance (squared form, (x − m0)^T S0^{-1} (x − m0)) of the point x = (66, 640, 44) from the cluster consisting of the examples from class C0 is 11.65.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification

Similarly, calculate m1 and the covariance matrix S1.

Calculate the Mahalanobis distance MD1 of the point x = (66, 640, 44) from the cluster consisting of the examples from class C1.

If MD0 < MD1, the point x belongs to class C0; otherwise it belongs to class C1. A sketch of this computation follows.
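A minimal Python sketch (not from the slides) that carries out this procedure: estimate the mean and the 1/N covariance matrix of each class and assign the test point to the class with the smaller (squared) Mahalanobis distance.

import numpy as np

X0 = np.array([[64, 580, 29], [64, 570, 33], [68, 590, 37], [69, 660, 46], [73, 600, 55]], float)
X1 = np.array([[80, 580, 21], [82, 570, 22], [89, 590, 39], [87, 660, 19], [77, 600, 25], [72, 595, 38]], float)
x = np.array([66, 640, 44], float)

def mahalanobis_sq(x, X):
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)      # 1/N covariance, as used on the slide
    d = x - m
    return d @ np.linalg.solve(S, d)            # (x - m)^T S^{-1} (x - m)

md0, md1 = mahalanobis_sq(x, X0), mahalanobis_sq(x, X1)
print(md0, md1)                                 # md0 is about 11.65
print("C0" if md0 < md1 else "C1")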
MACHINE LEARNING
Decision Tree

⚫ A very widely used non parametric supervised learning technique, used for both classification and regression.
⚫ Advantages
⚫ Easy interpretability – someone not skilled in ML understands a decision tree easily
⚫ Identifies and handles outliers very well
⚫ Disadvantages
⚫ Prone to overfitting – pruning is the solution
⚫ Very sensitive to changes in the training data set – not in textbook, you may skip
MACHINE LEARNING
Decision Tree – An example

(Figure: an example decision tree – the root node is at depth 0, intermediate nodes at depth 1, and the leaves at depth 2.)
MACHINE LEARNING
Decision Tree for Classification
MACHINE LEARNING
Decision Tree for Classification

⚫ Binary Classification Problem


⚫ 10 training examples given
⚫ 4 attributes/features, namely, ear shape, face shape, whiskers, and weight.
⚫ Given an animal with certain ear shape, face shape, whiskers and
weight, we have to predict if the animal is a cat or not.
MACHINE LEARNING
Decision Tree for Classification

⚫ Three attributes take values from a finite set, whereas weight is a


continuous valued variable.
⚫ We will build a decision tree based on these 10 instances. And when a
new example/test case arrives, we will predict its class by traversing the
tree top to bottom beginning from root node. The traversal will end at
one of the leaf nodes.
MACHINE LEARNING
Decision Tree for Classification

⚫ We ask a question at the root node and split the dataset into two, thereby creating 2 intermediate nodes at depth 1. We ask further questions at the 2 intermediate nodes and split the two datasets, thereby creating 4 regions, or 4 intermediate nodes at depth 2. This is called recursive splitting. We split until some stopping criterion is satisfied.
⚫ Which attribute should be chosen as the splitting variable?
MACHINE LEARNING
Decision Tree for Classification

⚫ Once a splitting variable is selected, what should be the splitting point?


for example, should I ask “Is weight > 7.2 ?”, or should I ask “Is weight > 11 ?”.
7.2 or 11 is the splitting point.
⚫ We would design an error function (also widely known in literature as cost
function/ loss function/ objective function), and write an algorithm that
would suggest a splitting variable and a splitting point that minimizes the
error function.
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm will continue to split the tree (increase the depth of the tree) until some stopping criterion is met.
⚫ Widely used stopping criterion (not in text book, you may skip)
⚫ When node is 100% pure
⚫ When splitting a node will result in the tree exceeding a maximum depth (also called pre pruning)
⚫ Information gain from additional splits is less than threshold
⚫ When number of examples in a node is below a threshold
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm may use weighted Entropy / Impurity as the error function.
⚫ It means every split will maximise the gain in Information (which is nothing
but change in Impurity or change in weighted Entropy)
⚫ Algorithm will favour that splitting variable and that splitting point – which
leads to maximum reduction in entropy, or leads to maximum reduction in
Impurity
⚫ Typical stopping criterion for this case should be – splitting should stop
when Information gain from additional splits is less than threshold
MACHINE LEARNING
Decision Tree for Classification

⚫ The algorithm may use weighted Gini Index as the error function.
⚫ It means every split will maximise the change in Gini Index
⚫ The algorithm will favour the splitting variable and splitting point that lead to the minimum possible weighted Gini index over the child nodes.
⚫ Typical stopping criterion for this case should be 100% node purity
(when a node has all examples belonging to a single class)
⚫ When an attribute takes two values (the attribute whiskers: present or
not present), no decision on splitting point to be taken – the complexity
of the problem gets reduced significantly.
MACHINE LEARNING
Decision Tree for Classification

⚫ Entropy = -\sum_{k=1}^{K} \hat{p}_{mk} \log_2 \hat{p}_{mk}

⚫ K = number of classes; the log is base 2

⚫ \hat{p}_{mk} is the proportion of training observations in the m-th region/node that are from the k-th class.
⚫ If K = 2, the expression for entropy becomes
- p log2 p - (1 - p) log2 (1 - p)
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ The splitting variable for the root node is ‘Ear Shape’. We are asking if the
animal’s ‘Ear Shape is Pointy or not?’
MACHINE LEARNING
Decision Tree for Classification – Calculation of Entropy, Impurity

⚫ Training set has 5 cats, 5 No cats. Entropy at root node (before any
splitting on the dataset) = - plog2p - (1-p)log2(1-p) = 1, p=5/10 =0.5
⚫ After splitting, the left node has 4 cats, 1 no cat. Entropy = - plog2p - (1-
p)log2(1-p) = 0.72, p=4/5 = 0.8
⚫ After splitting, the right node has 1 cat, 4 no cats. Entropy = - plog2p -
(1-p)log2(1-p) = 0.72, p=1/5 = 0.2
⚫ Weighted Entropy = Impurity = (5/10)*0.72 + (5/10)*0.72 = 0.72
⚫ Why 5/10 ? Because out of 10 examples at root node, 5 examples are
into the left child of root node and the other 5 examples fall into the
right child of the root node.
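A minimal Python sketch (not from the slides) reproducing the entropy calculation above for the 'Ear Shape' split (5 cats / 5 not-cats at the root, then a 4/1 and a 1/4 split):

import numpy as np

def entropy(p):
    if p in (0.0, 1.0):                          # define 0*log2(0) = 0
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

root = entropy(5 / 10)                           # 1.0 bit before the split
left, right = entropy(4 / 5), entropy(1 / 5)     # 0.72 each
weighted = (5 / 10) * left + (5 / 10) * right    # impurity after the split = 0.72
print(weighted, "information gain:", root - weighted)   # gain = 0.28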
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ The splitting variable for the root node is ‘Face Shape’. We are asking if the
animal’s ‘Face Shape is Round or not?’
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity

⚫ We choose the splitting variable and splitting point that give the highest change in impurity (weighted entropy), i.e., the largest information gain.
⚫ This is to be repeated at every node at every depth, until the stopping criterion is met.
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ IRIS dataset
⚫ 3 classes – Setosa, Versicolor, Virginica
⚫ 50 training examples of each class – 150 examples in total
⚫ 4 attributes – what are those? (sepal length, sepal width, petal length, petal width)
⚫ Gini index = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})
⚫ \hat{p}_{mk} is the proportion of training observations in the m-th region/node that are from the k-th class.
⚫ If K = 2, the expression for the Gini index becomes p(1 − p) + (1 − p)p = 2p(1 − p)
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ From the 150 examples, a train-test split has been done. Among the 105 training examples, 35 belong to Setosa, 33 to Versicolor, and 37 to Virginica
⚫ The Gini index at the root node is (35/105)*(1-35/105) + (33/105)*(1-33/105) + (37/105)*(1-37/105) = 0.666
⚫ The splitting variable at the root node is petal width
⚫ The splitting point at the root node is 0.8
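A minimal Python sketch (not from the slides) reproducing the Gini index at the root node for the 105 IRIS training examples:

def gini(counts):
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

print(gini([35, 33, 37]))   # about 0.666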
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index

⚫ Calculate Gini Index for left child of root node and right child of root
node and verify that you indeed get these values.
⚫ To be repeated at every node at every depth, until stopping criterion is
met.
⚫ The ‘orange’ node has a Gini index = 0. Therefore no more splitting, it
becomes a leaf.
MACHINE LEARNING
Decision Tree for Classification – Classification Error Rate

For K = 2, i.e., Binary Classification
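The expression did not survive extraction; the usual definition of the classification error rate for node m, consistent with the entropy and Gini notation above, is

E_m = 1 - \max_{k} \hat{p}_{mk}

and for K = 2 this becomes E = 1 − max(p, 1 − p) = min(p, 1 − p).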


MACHINE LEARNING
Decision Tree for Classification – Prediction

⚫ Once a decision tree is built using training examples, you need to


predict a class for the test case.
⚫ Suppose test case falls into l-th leaf.
⚫ l-th leaf has N1 examples with labels = class 1, N2 examples with labels =
class 2 , and N3 examples with labels = class 3 .
⚫ Prediction for the test case will be that k for which Nk is the highest
⚫ Predicted class is Virginica if test case falls into this bin
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression

⚫ Error function /Loss function/ Cost Function/ Objective Function for


Regression?
⚫ Can be mean square error
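The expression did not survive extraction; a standard choice, consistent with the r^t and g_m notation defined on the next slide, is the within-node mean squared error

E_m = \frac{1}{N_m}\sum_{x^t \in X_m} \left(r^t - g_m\right)^2, \qquad g_m = \frac{1}{N_m}\sum_{x^t \in X_m} r^t

where X_m is the set of training examples reaching node m and N_m is their number; the split chosen is the one that gives the largest reduction in this error.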
MACHINE LEARNING
Decision Tree for Regression

⚫ r^t is the t-th response
⚫ For the above example, t = 1, 2, 3, …, 10
⚫ The corresponding r values are 7.2, 8.8, 15, 9.2, …
⚫ g_m is the average (mean) of the responses of all the training examples falling into the m-th node/leaf/region
⚫ g1 = (7.2 + 8.4 + 7.2 + 10.2)/4 = 8.25
⚫ g2 = 9.2/1 = 9.2
⚫ g3 = (15 + 18 + 20)/3 = 17.67
⚫ How many leaves are there? (Three – one per g_m.)
MACHINE LEARNING
Decision Tree for Regression

⚫ Stopping Criterion? (Not in text book, you may skip)


⚫ When splitting a node will result in the tree exceeding a maximum
depth (also called pre pruning)
⚫ Reduction in mean square error (you can think of it as reduction in
variance) from additional splits is less than threshold
⚫ When number of examples in a node is below a threshold
MACHINE LEARNING
Decision Tree - Pruning

⚫ A decision tree (regression tree or classification tree) tends to overfit
⚫ The greater the depth of the tree, the higher the overfitting.
⚫ Overfitting means high variance, which in turn means high testing error.
⚫ Solution – Pruning
⚫ Two types of pruning
⚫ Pre Pruning
⚫ Post Pruning
MACHINE LEARNING
Decision Tree – Pre Pruning

⚫ We don’t let the tree grow beyond a certain depth
⚫ We use the tree depth as the stopping criterion
MACHINE LEARNING
Decision Tree – Post Pruning

⚫ Let the tree grow as per a stopping criterion; use any stopping criterion other than tree depth
⚫ Now prune the tree to see if a modified tree with a reduced number of leaves (fewer leaves means smaller depth) serves a similar purpose as the original tree
⚫ How to measure which tree is best?
⚫ Use cross validation – this is the only way out
MACHINE LEARNING
Decision Tree – Post Pruning
⚫ Tree built using the Hitters dataset.
⚫ Cross validation shows that a tree with 3 leaves gives the least test error.
⚫ Therefore the pruned tree with 3 leaves (shown in the next slide) is to be used
MACHINE LEARNING
Decision Tree – Post Pruning

⚫ Tree size = no of leaves


⚫ Pruning concept to be understood clearly.
⚫ No need to run this dataset and go into details of this example.
UE20EC352-Machine Learning &
Applications
Unit 3 - Non Parametric Supervised
Learning

Slides prepared by:


Prof. Veena
Department of ECE, PESU.
MACHINE LEARNING
Rule extraction from trees

➢ A decision tree does its own feature extraction. The univariate tree only uses the necessary
variables, and after the tree is built, certain features may not be used at all.
➢ We can also say that features closer to the root are more important globally.
➢ Another main advantage of decision trees is interpretability.
➢ Each path from the root to a leaf corresponds to one conjunction of tests, as all those conditions should be satisfied to reach the leaf.
➢ These paths together can be written down as a set of IF-THEN rules, called a rule base
MACHINE LEARNING
Rule extraction from trees

➢ The decision tree given in


figure uses x1, x2, and x4, but
not x3. It is possible to use a
decision tree for feature
extraction
MACHINE LEARNING
Rule extraction from trees

➢ Such a rule base allows knowledge extraction; it can be easily understood and allows
experts to verify the model learned from data.
➢ For each rule, one can also calculate the percentage of training data covered by the
rule, namely, rule support.
➢ In the case of a classification tree, there may be more than one leaf labeled with the
same class. In such a case, these multiple conjunctive expressions corresponding to
different paths can be combined as a disjunction (OR).
• IF (x1 ≤ w10) OR ((x1 > w10) AND (x2 ≤ w20)) THEN C1
• One can also prune rules for simplification.
MACHINE LEARNING
Learning Rules from data

➢ Learning rules directly from data, instead of first building a tree and converting it to IF-THEN statements, is called rule induction.
➢ Rules are learned one at a time.
➢ Each rule is a conjunction of conditions on discrete or numeric attributes (as in decision trees) and
these conditions are added one at a time, to optimize some criterion, for example, minimize entropy.
➢ Sequential Covering:
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is met
MACHINE LEARNING
Learning Rules from data
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

➢ It stands for Repeated Incremental Pruning to Produce Error Reduction. The Ripper Algorithm
is a Rule-based classification algorithm. It derives a set of rules from the training set. It is a
widely used rule induction algorithm.
➢ Case I: Training records belong to only two classes
• Among the records given, it identifies the majority class (the one that appears the most) and takes this class as the default class. For example, if there are 100 records and 80 belong to Class A and 20 to Class B, then Class A will be the default class.
• For the other class, it tries to learn/derive rules to detect that class.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm
➢ How rule is learned:
• In the first instance, it tries to derive rules for those records which belong to class C1. Records
belonging to C1 will be considered as positive examples(+ve) and other classes will be
considered as negative examples(-ve).
• Next, at this junction Ripper tries to derive rules for C2 distinguishing it from the other classes.
• This process is repeated until the stopping criterion is met, i.e., when only Cn (the default class) is left.
• Ripper extracts rules from minority class to the majority class.
➢ Rule Growing in RIPPER Algorithm:
• Ripper uses a general-to-specific strategy for growing rules: it starts from an empty rule and keeps adding the best conjunct to the rule.
• For evaluating candidate conjuncts, the metric used is FOIL’s information gain; the conjunct with the highest gain is chosen.

• Stopping Criteria for adding the conjuncts – when the rule starts covering the negative (-ve)
examples.
• The new rule is pruned based on its performance on the validation set.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

➢ Rule Pruning Using RIPPER:


• We need to identify whether a particular rule should be pruned or not. To determine this a metric
is used, which is (P-N)/(P+N)

P = number of positive examples in the validation set covered by the rule.


N = number of negative examples in the validation set covered by the rule.
➢ Whenever a conjunct is added or removed we calculate the value of the above metric for the
original rule (before adding/removing) and the new rule (after adding/removing).
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

Building a Rule Set:


➢ Use sequential covering algorithm
• Finds the best rule that covers the current set of positive examples
• Eliminate both positive and negative examples covered by the rule
➢ Each time a rule is added to the rule set, compute the new description length
• stop adding new rules when the new description length is d bits longer than the smallest
description length obtained so far
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm

Optimize the rule set:


➢ For each rule r in the rule set R
• Consider 2 alternative rules:
– Replacement rule (r*): grow new rule from scratch
– Revised rule(r’): add conjuncts to extend the rule r
• Compare the rule set for r against the rule set for r* and r’
• Choose rule set that minimizes MDL principle
➢ Repeat rule generation and rule optimization for the remaining positive examples
MACHINE LEARNING
Outlier Detection

➢ An outlier, novelty, or anomaly is an instance that is very much different from


other instances in the sample. An outlier may indicate an abnormal behavior of
the system.
➢ Outlier detection is sometimes posed as one-class classification.
➢ Once we model the typical instances, any instance that does not fit the model
(and this may occur in many different ways) is an anomaly.
➢ Outlier detection basically implies spotting what does not normally happen;
that is, it is density estimation followed by checking for instances with too
small probability under the estimated density.
➢ As usual, the fitted model can be parametric, semiparametric, or
nonparametric.
MACHINE LEARNING
Outliers
➢ In nonparametric density estimation, as we discussed in the preceding
sections, the estimated probability is high where there are many training
instances nearby and the probability decreases as the neighborhood
becomes more sparse.

➢ The local outlier factor (LOF) compares the denseness of the neighborhood of an instance with the average denseness of the neighborhoods of its neighbors

➢ LOF is given by
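The formula did not survive extraction; a simple form of the local outlier factor consistent with the description above (an assumption; here d_k(x) is the distance from x to its k-th nearest neighbour and N(x) is the set of its k nearest neighbours) is

\mathrm{LOF}(x) = \frac{d_k(x)}{\dfrac{1}{|N(x)|}\displaystyle\sum_{s \in N(x)} d_k(s)}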

➢ If LOF(x) is close to 1, x is not an outlier; as it gets larger, the probability that


it is an outlier increases
MACHINE LEARNING
Outliers
