UE20EC352 - Machine Learning & Applications
Unit 3 - Non Parametric Supervised Learning
Acknowledgement: Examples and figures taken from text book and reference books
MACHINE LEARNING
What is Non Parametric Supervised Learning
⚫ Density Estimation
Histogram
Naïve Estimator
Estimation using Gaussian Kernel
K nearest neighbour density estimator
⚫ Nonparametric Regression
Regressogram
Running Mean Smoother
Kernel Smoother with Gaussian Kernel
LOESS
MACHINE LEARNING
What we will learn in this Unit
⚫ Nonparametric Classification
Discriminant Function based
Distance Measure based
K Nearest Neighbour
⚫ Condensed Nearest Neighbour
⚫ Decision Tree
Classification
Regression
MACHINE LEARNING
Density Estimation - Histogram
X = {0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3}
• The nonparametric estimate for the density function, which is the derivative of the cumulative distribution, can be calculated as
$$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}$$
MACHINE LEARNING
Density Estimation - Histogram
X = {0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3}
⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Choose a beginning point, say x = 0
⚫ Count the number of samples in the bins (0, 6] and (6, 12].
⚫ These counts are 5 and 4
⚫ The histogram estimate is 5/54 in the range (0, 6] and 4/54 in the range (6, 12], since $\hat{p}(x) = \text{count}/(Nh)$ with N = 9 and h = 6.
⚫ The area under the curve has to be 1
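A minimal sketch (not from the slides) of this histogram estimate in Python; the sample, beginning point x0 = 0 and h = 6 follow the example above, and the bin convention [mh, (m+1)h) reproduces the counts of 5 and 4 for this data.

```python
import numpy as np

X = np.array([0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3])
N, h, x0 = len(X), 6.0, 0.0   # 9 samples, bin width 6, beginning point 0

def hist_density(x):
    """p_hat(x) = (# samples in the same bin as x) / (N * h)."""
    same_bin = np.floor((X - x0) / h) == np.floor((x - x0) / h)
    return same_bin.sum() / (N * h)

print(hist_density(3.0))   # 5/54 ≈ 0.0926 for the bin (0, 6]
print(hist_density(7.0))   # 4/54 ≈ 0.0741 for the bin (6, 12]
```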
MACHINE LEARNING
Histogram
MACHINE LEARNING
Disadvantage of Histogram
MACHINE LEARNING
Disadvantage of Histogram
X = {0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 7.3}
⚫ The histogram depends on the choice of beginning point
⚫ There will be jumps in the histogram at bin boundaries
MACHINE LEARNING
Density Estimation – Naïve Estimator
$$\hat{p}(x) = \frac{\#\{\, x - h/2 < x^t \le x + h/2 \,\}}{N h}$$
$$\hat{p}(x) = \frac{1}{N h}\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right), \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}$$
MACHINE LEARNING
Density Estimation – Naïve Estimator
MACHINE LEARNING
Density Estimation – Smoothing with Gaussian Kernel
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$$
• Kernel estimator (Parzen windows)
$$\hat{p}(x) = \frac{1}{N h}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$$
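A short sketch of the Parzen-window estimator with the Gaussian kernel above; the sample and the bandwidth h = 1 are illustrative choices, not values from the slides.

```python
import numpy as np

def gaussian_kernel(u):
    """K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, X, h):
    """p_hat(x) = (1/(N h)) * sum_t K((x - x_t)/h) at query point(s) x."""
    x = np.atleast_1d(x)[:, None]                 # shape (Q, 1) for broadcasting
    return gaussian_kernel((x - X[None, :]) / h).sum(axis=1) / (len(X) * h)

X = np.array([0.4, 0.7, 0.8, 1.9, 2.4, 6.1, 6.3, 7.0, 10.3])
print(kde([1.0, 6.5], X, h=1.0))   # smooth estimate, no jumps at bin edges
```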
MACHINE LEARNING
Density Estimation – k-Nearest Neighbour Estimator
⚫ Choose k, suppose k = 3
⚫ The amount of smoothing adapts to the local density of the data: $\hat{p}(x) = \dfrac{k}{2\,N\,d_k(x)}$, where $d_k(x)$ is the distance from x to its k-th nearest sample
MACHINE LEARNING
Non Parametric Regression – Regressogram
x y/r
0.5 1.2
0.8 1.7
0.9 2
1.9 -1.
2.3 -1.8
2.9 2.6
4.0 2.2
5.5 2.2
5.7 2.4
6 1.8
6.3 1.2
7.3 2.6
MACHINE LEARNING
Non Parametric Regression – Regressogram
MACHINE LEARNING
Non Parametric Regression – Regressogram
⚫ h = bin width/smoothing parameter
⚫ Choose h. Let h = 6
⚫ Partition the real line (x takes real values) into segments (x1 − h, x1 + h], (x2 − h, x2 + h], …, (x12 − h, x12 + h].
⚫ Count the number of samples in these segments
⚫ These counts are 8, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 9.
⚫ The estimated response/output is (1.2 + 1.7 + 2 − 1 − 1.8 + 2.6 + 2.2 + 2.2)/8 = 1.125 in the range (−5.5, 6.5], (1.2 + 1.7 + 2 − 1 − 1.8 + 2.6 + 2.2 + 2.2 + 2.4 + 1.8 + 1.2)/11 = 1.27 in the range (6.5, 6.9], 1.383 in the range (6.9, 12.3], and 1.300 in the range (12.3, 13.3]
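A minimal regressogram sketch on the (x, r) table above: average the responses of all training points that fall in the same bin as the query. The origin x0 = 0 and bin width h = 2 are illustrative assumptions, not the binning behind the slide's numbers.

```python
import numpy as np

x_train = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r_train = np.array([1.2, 1.7, 2.0, -1.0, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])

def regressogram(x, h=2.0, x0=0.0):
    """g_hat(x) = mean of r_t over training points falling in the same bin as x."""
    same_bin = np.floor((x_train - x0) / h) == np.floor((x - x0) / h)
    return r_train[same_bin].mean() if same_bin.any() else np.nan

print(regressogram(1.0))   # average response of points with x in [0, 2)
print(regressogram(6.5))   # average response of points with x in [6, 8)
```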
MACHINE LEARNING
Non Parametric Regression – Running Mean Smoother
⚫ The running mean smoother averages the responses of the training points within distance h of x:
$$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right) r^t}{\sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)}, \qquad
w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
⚫ The kernel smoother replaces the hard window w(u) with a Gaussian kernel
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)$$
MACHINE LEARNING
Non Parametric Regression – Kernel Smoother with Gaussian Kernel
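A hedged sketch of the kernel smoother with a Gaussian kernel (Nadaraya-Watson form), reusing the (x, r) table from the regressogram example; the bandwidth h = 1 is an assumption for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_smoother(x, x_train, r_train, h):
    """g_hat(x) = sum_t K((x - x_t)/h) r_t / sum_t K((x - x_t)/h)."""
    w = gaussian_kernel((x - x_train) / h)     # weight of every training point
    return (w * r_train).sum() / w.sum()

x_train = np.array([0.5, 0.8, 0.9, 1.9, 2.3, 2.9, 4.0, 5.5, 5.7, 6.0, 6.3, 7.3])
r_train = np.array([1.2, 1.7, 2.0, -1.0, -1.8, 2.6, 2.2, 2.2, 2.4, 1.8, 1.2, 2.6])
print(kernel_smoother(3.0, x_train, r_train, h=1.0))
```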
MACHINE LEARNING
Non Parametric Regression – Running Line Smoother - LOESS
⚫ Multivariate kernel density estimator for a sample of d-dimensional observations X = {x^t}:
$$\hat{p}(x) = \frac{1}{N h^{d}}\sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right), \qquad
K(u) = \left(\frac{1}{2\pi}\right)^{d/2} \exp\!\left(-\frac{\lVert u \rVert^{2}}{2}\right)$$
MACHINE LEARNING
Multivariate Data
MACHINE LEARNING
Non Parametric Classification – Discriminant Function based
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
⚫ The k-NN estimate of the class-conditional density is
$$\hat{p}(x \mid C_i) = \frac{k_i}{N_i\, V^k(x)}$$
where ki is the number of neighbors out of the k nearest that belong to Ci, Ni is the number of training examples of class Ci, and Vk(x) is the volume of the d-dimensional hypersphere centered at x, with radius r = ∥x − x(k)∥ where x(k) is the k-th nearest observation to x (among all neighbors from all classes of x)
⚫ Plugging these estimates into Bayes' rule gives the discriminant $\hat{P}(C_i \mid x) = k_i / k$, so x is assigned to the class with the most members among its k nearest neighbors
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
⚫ This nonparametric classifier is called the k-NN classifier
⚫ A simple yet very good classifier
⚫ The test case belongs to the 'green' class because, out of its 3 nearest neighbors, 2 belong to the class 'green'
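A minimal k-NN classifier sketch echoing the majority-vote idea above; the 2-D points and labels are made up for illustration.

```python
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # 'green' examples
                    [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]])   # 'yellow' examples
y_train = np.array(['green', 'green', 'green', 'yellow', 'yellow', 'yellow'])

def knn_predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances to x
    nearest = y_train[np.argsort(dists)[:k]]         # labels of the k nearest
    return Counter(nearest).most_common(1)[0][0]     # majority vote

print(knn_predict(np.array([1.5, 1.2])))   # -> 'green'
```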
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
Non Parametric Classification – k-Nearest Neighbour
MACHINE LEARNING
k-Nearest Neighbour – choice of k (Extra Reading)
⚫ An instance inside the class regions need not be stored, as its nearest neighbor is of the same class and its absence does not cause any error (on the training set).
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
⚫ There are 13 training examples, 7 from class 'green' and 6 from class 'yellow'. We need to determine the class to which the test point 'red' belongs. Use the Mahalanobis distance (MD) as the distance measure.
⚫ Find the sample mean of the 'green' examples, estimate the covariance matrix of the 'green' examples, and then calculate the MD of the 'red' point from the 'green' cluster/distribution.
⚫ Repeat the above step for the 'yellow' examples and calculate the MD of the 'red' point from the 'yellow' cluster/distribution.
⚫ The smallest MD corresponds to the class to which the 'red' test case belongs.
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
mi is the mean of class Ci
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Given the training data set below, determine the class to which the point (66, 640, 44) belongs. Use the Mahalanobis distance as the distance metric.
x1   x2    x3   Class
64 580 29 C0
64 570 33 C0
68 590 37 C0
69 660 46 C0
73 600 55 C0
80 580 21 C1
82 570 22 C1
89 590 39 C1
87 660 19 C1
77 600 25 C1
72 595 38 C1
MACHINE LEARNING
Non Parametric Classification – Distance measure based Classification
Sample mean m0 = [67.6  600  40]^T
Covariance matrix S0 (using the 1/N convention) is
$$S_0 = \begin{bmatrix} 11.44 & 52 & 30.6 \\ 52 & 1000 & 164 \\ 30.6 & 164 & 88 \end{bmatrix}$$
If MD0 < MD1, the point x belongs to class C0, else belongs to class C1.
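A sketch of the full Mahalanobis-distance classification for this example; the covariance uses the 1/N convention, which reproduces the S0 shown above.

```python
import numpy as np

X0 = np.array([[64, 580, 29], [64, 570, 33], [68, 590, 37],
               [69, 660, 46], [73, 600, 55]], dtype=float)                 # class C0
X1 = np.array([[80, 580, 21], [82, 570, 22], [89, 590, 39],
               [87, 660, 19], [77, 600, 25], [72, 595, 38]], dtype=float)  # class C1

def mahalanobis(x, X):
    m = X.mean(axis=0)                        # sample mean of the class
    S = np.cov(X, rowvar=False, bias=True)    # covariance, dividing by N
    d = x - m
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

x = np.array([66, 640, 44], dtype=float)
md0, md1 = mahalanobis(x, X0), mahalanobis(x, X1)
print(md0, md1)
print('C0' if md0 < md1 else 'C1')            # assign x to the class with smaller MD
```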
MACHINE LEARNING
Decision Tree
[Figure: a decision tree with a root node at depth 0, intermediate nodes at depths 1 and 2, and leaf nodes]
MACHINE LEARNING
Decision Tree for Classification
MACHINE LEARNING
Decision Tree for Classification
⚫ We ask a question at the root node and split the dataset into two, thereby creating 2 intermediate nodes at depth 1. We ask further questions at the 2 intermediate nodes and split the two datasets, thereby creating 4 regions, or 4 intermediate nodes at depth 2. This is called Recursive Splitting. We keep splitting until some stopping criterion is satisfied.
⚫ Which attribute to be chosen as splitting variable?
MACHINE LEARNING
Decision Tree for Classification
MACHINE LEARNING
Decision Tree for Classification
⚫ The algorithm will continue to split the tree (increase the depth of the tree) until some stopping criterion is met.
⚫ Widely used stopping criterion (not in text book, you may skip)
⚫ When node is 100% pure
⚫ When splitting a node will result in the tree exceeding a maximum depth (also called pre pruning)
⚫ Information gain from additional splits is less than threshold
⚫ When number of examples in a node is below a threshold
MACHINE LEARNING
Decision Tree for Classification
⚫ The algorithm may use weighted Entropy / Impurity as the error function.
⚫ This means every split maximises the Information Gain (which is nothing but the change in Impurity, i.e., the change in weighted Entropy)
⚫ The algorithm favours the splitting variable and the splitting point that lead to the maximum reduction in entropy, i.e., the maximum reduction in Impurity
⚫ A typical stopping criterion for this case: stop splitting when the Information Gain from additional splits is less than a threshold
MACHINE LEARNING
Decision Tree for Classification
⚫ The algorithm may use the weighted Gini Index as the error function.
⚫ This means every split maximises the change in Gini Index
⚫ The algorithm favours the splitting variable and the splitting point that lead to the minimum possible weighted Gini index over the child nodes.
⚫ A typical stopping criterion for this case is 100% node purity (when a node has all examples belonging to a single class)
⚫ When an attribute takes only two values (e.g. the attribute whiskers: present or not present), no splitting point needs to be chosen, so the complexity of the problem is reduced significantly.
MACHINE LEARNING
Decision Tree for Classification
⚫ Entropy = $-\sum_i p_i \log_2 p_i$ (for two classes, $-p\log_2 p - (1-p)\log_2(1-p)$)
⚫ The splitting variable for the root node is 'Ear Shape'. We are asking: is the animal's 'Ear Shape' Pointy or not?
MACHINE LEARNING
Decision Tree for Classification – Calculation of Entropy, Impurity
⚫ Training set has 5 cats, 5 non-cats. Entropy at the root node (before any splitting of the dataset) = $-p\log_2 p - (1-p)\log_2(1-p)$ = 1, with p = 5/10 = 0.5
⚫ After splitting, the left node has 4 cats, 1 non-cat. Entropy = $-p\log_2 p - (1-p)\log_2(1-p)$ = 0.72, with p = 4/5 = 0.8
⚫ After splitting, the right node has 1 cat, 4 non-cats. Entropy = $-p\log_2 p - (1-p)\log_2(1-p)$ = 0.72, with p = 1/5 = 0.2
⚫ Weighted Entropy = Impurity = (5/10)·0.72 + (5/10)·0.72 = 0.72
⚫ Why 5/10? Because out of the 10 examples at the root node, 5 examples fall into the left child of the root node and the other 5 fall into the right child.
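A small sketch reproducing these entropy numbers (5/5 at the root, a 4/1 and a 1/4 child); the information gain of 1 − 0.72 = 0.28 is implied by the slide's figures.

```python
import math

def entropy(p):
    """Binary entropy -p*log2(p) - (1-p)*log2(1-p); defined as 0 for a pure node."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

root = entropy(5 / 10)                                             # 1.0
weighted = (5 / 10) * entropy(4 / 5) + (5 / 10) * entropy(1 / 5)   # ≈ 0.72
print(round(root, 2), round(weighted, 2), round(root - weighted, 2))   # 1.0 0.72 0.28
```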
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity
⚫ The splitting variable for the root node is ‘Face Shape’. We are asking if the
animal’s ‘Face Shape is Round or not?’
MACHINE LEARNING
Decision Tree for Classification - Calculation of Entropy, Impurity
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index
⚫ IRIS dataset
⚫ 3 classes – Setosa, Virginica, Versicolor
⚫ 50 training examples of each class – total 150 training cases
⚫ 4 attributes – what are those?
⚫ Gini Index
⚫ From the 150 examples, a train-test split is done. Among the 105 training examples, 35 belong to Setosa, 33 to Versicolor, 37 to Virginica
⚫ Gini index at root node is (35/105)*(1-35/105) + (33/105)*(1-33/105) +
(37/105)*(1-37/105) = 0.666
⚫ Splitting variable is petal width at root node
⚫ Splitting point is 0.8 at root node
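A one-function sketch of the Gini computation above (35/33/37 out of 105 at the root); the same helper applies to any child node.

```python
def gini(counts):
    """Gini index = sum_i p_i * (1 - p_i) over the class counts in a node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

print(round(gini([35, 33, 37]), 3))   # ≈ 0.666 at the root node
print(gini([35, 0, 0]))               # 0.0 for a pure node -> becomes a leaf
```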
MACHINE LEARNING
Decision Tree for Classification - Calculation of Gini Index
⚫ Calculate Gini Index for left child of root node and right child of root
node and verify that you indeed get these values.
⚫ To be repeated at every node at every depth, until stopping criterion is
met.
⚫ The ‘orange’ node has a Gini index = 0. Therefore no more splitting, it
becomes a leaf.
MACHINE LEARNING
Decision Tree for Classification – Classification Error Rate
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree for Regression
⚫ r^t is the t-th response
⚫ For the above example, t = 1, 2, 3, …, 10.
⚫ The corresponding r values are 7.2, 8.8, 15, 9.2, …
⚫ gm is average (mean) of all the training examples falling into the m-th
node/leaf/region
⚫ g1 = (7.2+8.4+7.2+10.2)/4 = 8.25
⚫ g2 = (9.2)/1 = 9.2
⚫ g3 = (15+18+20)/3 = 17.66
⚫ How many leaves are there?
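A tiny sketch of how a regression tree predicts: each leaf outputs the mean response of the training examples that fall into it, as in g1, g2, g3 above.

```python
# Responses grouped by the leaf they fall into, taken from the example above.
leaves = {
    "g1": [7.2, 8.4, 7.2, 10.2],
    "g2": [9.2],
    "g3": [15, 18, 20],
}
for name, responses in leaves.items():
    print(name, round(sum(responses) / len(responses), 2))   # 8.25, 9.2, 17.67
```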
MACHINE LEARNING
Decision Tree for Regression
MACHINE LEARNING
Decision Tree - Pruning
⚫ Let the tree grow as per stopping criterion. Use any criterion for
stopping other than tree depth
⚫ Now prune the tree to see if a modified tree with a reduced number of leaves (fewer leaves means smaller depth) serves a similar purpose as the original tree
⚫ How to measure which tree is best?
⚫ Use Cross Validation – this is the only way out
MACHINE LEARNING
Decision Tree – Post Pruning
⚫ Tree built using the Hitters dataset.
⚫ Cross validation shows that a tree with 3 leaves gives the least test error.
⚫ Therefore the pruned tree with 3 leaves (shown in the next slide) is to be used
MACHINE LEARNING
Decision Tree – Post Pruning
➢ A decision tree does its own feature extraction. The univariate tree only uses the necessary
variables, and after the tree is built, certain features may not be used at all.
➢ We can also say that features closer to the root are more important globally.
➢ Another main advantage of decision trees is interpretability.
➢ Each path from the root to a leaf corresponds to one conjunction of tests, as all those conditions must be satisfied to reach the leaf.
➢ These paths together can be written down as a set of IF-THEN rules, called a rule base
MACHINE LEARNING
Rule extraction from trees
➢ Such a rule base allows knowledge extraction; it can be easily understood and allows
experts to verify the model learned from data.
➢ For each rule, one can also calculate the percentage of training data covered by the
rule, namely, rule support.
➢ In the case of a classification tree, there may be more than one leaf labeled with the
same class. In such a case, these multiple conjunctive expressions corresponding to
different paths can be combined as a disjunction (OR).
• IF (x1 ≤ w10) OR ((x1 > w10) AND (x2 ≤ w20)) THEN C1
• One can also prune rules for simplification.
MACHINE LEARNING
Learning Rules from data
➢ Learning rules directly from data, instead of extracting IF-THEN statements from a trained tree, is called rule induction.
➢ Rules are learned one at a time.
➢ Each rule is a conjunction of conditions on discrete or numeric attributes (as in decision trees) and these conditions are added one at a time, to optimize some criterion, for example, minimize entropy.
➢ Sequential Covering (a simplified sketch follows below):
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
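A heavily simplified sequential-covering sketch on a toy 1-D dataset, where Learn-One-Rule just picks the threshold rule "IF x > t THEN positive" that covers the most remaining positives and no negatives; RIPPER's actual rule growing uses FOIL's information gain and rule pruning, which this sketch omits.

```python
def learn_one_rule(data):
    """Return the threshold t whose rule (x > t) covers the most positives
    and no negatives among the remaining records, or None if no such rule."""
    best_t, best_cov = None, 0
    for t in sorted({x for x, _ in data}):
        covered = [(x, y) for x, y in data if x > t]
        if covered and all(y == 1 for _, y in covered) and len(covered) > best_cov:
            best_t, best_cov = t, len(covered)
    return best_t

def sequential_covering(data):
    rules = []
    while any(y == 1 for _, y in data):              # positives still uncovered
        t = learn_one_rule(data)
        if t is None:
            break
        rules.append(t)
        data = [(x, y) for x, y in data if x <= t]   # remove covered records
    return rules

data = [(1, 0), (2, 0), (2.5, 0), (3, 1), (4, 1), (5, 1), (6, 1)]
print(sequential_covering(data))   # [2.5] -> rule: IF x > 2.5 THEN positive
```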
MACHINE LEARNING
Learning Rules from data
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm
➢ It stands for Repeated Incremental Pruning to Produce Error Reduction. The Ripper Algorithm
is a Rule-based classification algorithm. It derives a set of rules from the training set. It is a
widely used rule induction algorithm.
➢ Case I: Training records belong to only two classes
• Among the records given, it identifies the majority class (the one that appears most often) and takes this class as the default class. For example: if there are 100 records and 80 belong to Class A and 20 to Class B, then Class A will be the default class.
• For the other class, it tries to learn/derive various rules to detect that class.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm
➢ How a rule is learned:
• First, it tries to derive rules for the records which belong to class C1. Records belonging to C1 are treated as positive (+ve) examples and records of the other classes as negative (-ve) examples.
• Next, RIPPER tries to derive rules for C2, distinguishing it from the remaining classes.
• This process is repeated until the stopping criterion is met, i.e., when only Cn (the default class) is left.
• RIPPER extracts rules from the minority class towards the majority class.
➢ Rule Growing in the RIPPER Algorithm:
• RIPPER uses a general-to-specific strategy for growing rules. It starts from an empty rule and keeps adding the best conjunct to the rule.
• For evaluating candidate conjuncts, the metric chosen is FOIL's Information Gain; the conjunct with the best gain is added.
• Stopping criterion for adding conjuncts: when the rule starts covering negative (-ve) examples.
• The new rule is then pruned based on its performance on the validation set.
MACHINE LEARNING
Learning Rules from data- RIPPER Algorithm
➢ LOF is given by