
NPTEL

NPTEL ONLINE CERTIFICATION COURSE

Introduction to Machine Learning

Lecture 10

Prof. Balaraman Ravindran


Computer Science and Engineering
Indian Institute of Technology Madras

Statistical Decision Theory – Classification

(Refer Slide Time: 00:16)

In this module we are going to look at the case where the output variable is drawn from a discrete space, or in other words, we are going to look at the classification problem. As before, the input is coming from a p-dimensional space R^p. The output, which I am denoting by G here, I am going to assume is coming from some space script G, which is a discrete set. It could be "buys computer" / "does not buy computer", so script G could just consist of {buys computer, does not buy computer}, or it could consist of, say, five different outcomes: has the disease, a mild form of the disease, a severe form of the disease, does not have the disease, and so on and so forth.
So it could be a variety of outcomes, but a small discrete set; that space is denoted by script G, while capital G is the random variable corresponding to the output. Then, like before, we are going to have a joint distribution on the input and the output, and the training data is going to consist of pairs (x1, g1), (x2, g2), all the way up to (xN, gN). The goal here is to learn a function f̂(x) that is going to take you from the p-dimensional input space R^p to the discrete space script G.
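For reference, the setup being described can be written compactly as follows (a sketch of the slide content in standard notation; N and K are assumed symbols for the number of training pairs and the number of classes):

```latex
X \in \mathbb{R}^{p}, \qquad
G \in \mathcal{G} = \{1, \dots, K\}, \qquad
\text{training data } \{(x_i, g_i)\}_{i=1}^{N}, \qquad
\hat{f} : \mathbb{R}^{p} \to \mathcal{G}.
```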

And so the thing that we have to look at now is: what is an appropriate loss function in this case, since we are talking about a discrete output? I really cannot talk about squared error as a loss function, even though in cases where the discrete values have been encoded as numeric outputs people do use squared error, and we will see that later. So people do use squared error as a measure, as long as your space script G has been encoded numerically.

But in general we are going to define the loss as a K × K matrix, where K is the cardinality of the discrete space script G that we are looking at. So suppose there are 5 classes; then my loss matrix is going to be a 5 × 5 matrix. The thing here is that it is going to have 0 on the diagonal, and the (k, l)-th entry in the loss matrix is essentially the cost that you incur for classifying the output k as l: the true output is k but you say l. So the cost of classifying k as l is denoted by the (k, l)-th entry of the loss matrix.

The most popular loss function that you use is known as the 0-1 loss function. The 0-1 loss function essentially says the following. Suppose I have three classes; then my loss matrix would look like the matrix sketched below. If I classify to the right class I get a penalty of zero, but if I classify to the wrong class I get a penalty of one, regardless of which wrong class I classify to. So one entry says: the data point actually belongs to class 1, I have classified it as class 2, what is the penalty? One. The data point belongs to class 1, I classify it as class 3, what is the penalty? One. And so on and so forth. This is called the 0-1 loss function because all the entries in the loss matrix are either 0 or 1.
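For the three-class case being described, the 0-1 loss matrix on the slide would be the following (rows index the true class k, columns the predicted class l):

```latex
L \;=\;
\begin{pmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{pmatrix},
\qquad
L(k, l) =
\begin{cases}
0 & \text{if } k = l, \\
1 & \text{otherwise.}
\end{cases}
```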

(Refer Slide Time: 04:26)


So what we are again going to look at is the expected prediction error, and we can do the same thing that we did earlier: I can start conditioning on x. So the expected prediction error is the expectation over X of the expectation over G given X of the loss L(G, f̂), that is, the loss of G versus f̂ given that the input is x. But if you think about it, this is not a continuous distribution; this is actually a discrete distribution, because G can take only finitely many values. So instead of writing it out as this expectation, I can actually simplify it and write it out as a sum over the classes (sketched below).
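Written out, the quantity being simplified here is (a sketch of the slide content in standard notation, with G_k denoting the k-th class in script G):

```latex
\mathrm{EPE} \;=\; \mathbb{E}\big[\, L\big(G, \hat{f}(X)\big) \,\big]
\;=\; \mathbb{E}_{X} \sum_{k=1}^{K} L\big(\mathcal{G}_k, \hat{f}(X)\big)\, \Pr\big(\mathcal{G}_k \mid X\big).
```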

So this is the loss that I will incur if k was the true class and my prediction was f̂(x), times the probability that k is the true class given the input x. I am essentially writing out the expectation here; because it is a discrete distribution, I am able to write it out in a compact form. And again I can do this minimization pointwise, like we talked about earlier; pointwise would mean that I make a specific assumption about what the value of x is.
So I am going to look at the pointwise quantity (written out below); we are essentially following the same treatment that we did in the regression case, except that we are using a discrete output space instead of a continuous output space. This essentially says that I am going to pick the prediction g that gives me the smallest expected loss. Now suppose I have the 0-1 loss function; assume the 0-1 loss. What does this mean? I should essentially set my g to be that k which has the highest probability. Why is that?
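Pointwise, the minimization being described is (a sketch in the same notation as above):

```latex
\hat{f}(x) \;=\; \operatorname*{arg\,min}_{g \in \mathcal{G}} \;\sum_{k=1}^{K} L\big(\mathcal{G}_k, g\big)\, \Pr\big(\mathcal{G}_k \mid X = x\big).
```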
So if we think about it, each of these probability terms contributes to an element in the summation, and by my choice of g I can set exactly one of those elements to 0. Suppose I choose g to be 1; then my L(1, 1) term will become 0, but my L(2, 1), L(3, 1), and so on and so forth will all be 1, so the probability of 2 given x, the probability of 3 given x, and so on will all actually appear in the summation.

So if I set my g to that value of k which has the highest probability, then that will yield the best possible solution here. To see this, let us assume that there are 3 classes, and my true distribution says that the probability of class 1 given the data point x is, let us say, 0.6, the probability of class 2 given x is, let us say, 0.2, and the probability of class 3 given x is another 0.2. And of course my loss function is going to be the 3 × 3 0-1 loss matrix from before, with 0 on the diagonal and 1 everywhere else.

So suppose I guess that my class label is going to be 2, that is, I set g = 2. What is going to happen? If the true class label is 1, I look at the loss corresponding to (1, 2), so I get 1 times 0.6; if the true class label is 2, I look at L(2, 2), so I get 0 times 0.2; and if the true class label is 3, I look at L(3, 2), so I get 1 times 0.2. So I get a score of 0.8.

So as you can see, it depends on which value I choose: if I choose g = 2 then I will be zeroing out the second entry, and if I choose g = 1 I will be zeroing out the first entry, so by choosing g = 1 I will basically get a score of 0.4. So what I have to do in order to get the minimum here is to pick that g for which this probability is the highest (a small computation sketching this is given below).
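A minimal sketch of this arithmetic in Python, assuming the three-class 0-1 loss matrix and the conditional probabilities 0.6, 0.2, 0.2 used in the example (the names loss and prob are illustrative, not from the lecture):

```python
# 0-1 loss matrix: loss[k][l] = cost of predicting class l when the true class is k
loss = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
prob = [0.6, 0.2, 0.2]  # Pr(true class = k | x) for k = 1, 2, 3

# Expected loss of predicting class g: sum over k of loss(k, g) * Pr(k | x)
for g in range(3):
    expected_loss = sum(loss[k][g] * prob[k] for k in range(3))
    print(f"predict class {g + 1}: expected loss = {expected_loss:.1f}")

# Prints 0.4 for class 1 and 0.8 for classes 2 and 3, so the
# minimum-expected-loss prediction is the most probable class (class 1).
```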
(Refer Slide Time: 11:31)
So I will set f̂(x) to the arg max, as written out below. Can people see why the min here became the max here, based on the argument that we just did? This is essentially saying: classify the data point to the most probable class. And if I knew this conditional probability, what would I do? I would set the prediction to the most probable output. So this is what the Bayes optimal classifier says: I look at the conditional distribution given x, look at the probability of each g, take the g that has the highest probability, and assign it as the output. That is essentially what the Bayes optimal classifier would say.
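Under the 0-1 loss, the pointwise minimization above reduces to the Bayes optimal classifier (a sketch in the same notation as before):

```latex
\hat{f}(x) \;=\; \operatorname*{arg\,max}_{g \in \mathcal{G}} \; \Pr\big(g \mid X = x\big).
```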

But then you do not know this distribution, so what you have to do is estimate this probability. How would you estimate this probability? Do we know of any method for estimating it? Of course we do: we know how to do nearest neighbours. So what you would do in this case, instead of taking the average over the neighbours like we did in the regression case, is estimate the probabilities in the neighbourhood. You take a data point, look at the k nearest neighbours of the data point, and find out what their class labels are.

Then, for each label, count the number of occurrences of that label among the k neighbours and divide by k; this will give you the probability of the class label in the neighbourhood. But we really do not have to do this much work. Why? Because we are not interested in the actual probability; all we need is the label that has the maximum probability, and since the denominator is going to be k for all the probabilities, we can ignore the denominator and just look at the numerator.

So what we can do is count the occurrences of each class label in the neighbourhood, and whichever occurs most often, we assign that as the class label. Think about it for a minute: what we are essentially doing when we take the majority is estimating this probability and taking the class with the maximum probability. So take the majority label in the neighbourhood and use that as your prediction; this essentially gives you the k-nearest-neighbour classifier (a sketch is given below).
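A minimal sketch of this majority-vote procedure in Python, assuming Euclidean distance and a toy two-class data set; the function name knn_classify and the data are illustrative, not from the lecture:

```python
from collections import Counter
import math

def knn_classify(x, train_X, train_g, k):
    """Predict the label of x as the majority label among its k nearest neighbours."""
    # Distance from x to every training point
    dists = [math.dist(x, xi) for xi in train_X]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Count label occurrences in the neighbourhood; dividing the counts by k would
    # give the estimated class probabilities, but the argmax only needs the counts
    counts = Counter(train_g[i] for i in nearest)
    return counts.most_common(1)[0][0]

# Toy usage: two classes in R^2
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_g = [0, 0, 1, 1, 1]
print(knn_classify((0.95, 1.0), train_X, train_g, k=3))  # -> 1
```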

So what we saw earlier was a k-nearest-neighbour regressor, and all the caveats that we talked about for the k-nearest-neighbour regressor apply to the k-nearest-neighbour classifier as well: you have to be careful about using it in very high dimensions, and you really need large values of k and large values of n before you can get stable estimates. But having said all that, I should say that it turns out to be a really powerful classifier in practice, and we will come back a little later to why it is such a powerful classifier.

And can we use linear regression, or the linearity assumption, here? It turns out that you could use linear regression almost directly for solving this problem. The way you do it is the following: you take this data set (x1, g1), (x2, g2), and so on, and convert it into a data set suitable for doing regression. So how do I do that? I take each pair (xi, gi).

(Refer Slide Time: 16:10)


Let us say, for simplicity's sake, that I have only two classes, g1 and g2, say 1 and 2; in fact, let me say they are 0 and 1. So instead of having some arbitrary classes, I am going to say the label is 0 or 1. What I am now going to do is turn my training set into something like this: instead of having arbitrary symbols g1, g2 as the outputs, I am going to have 0, 1, 1, 0 and so on and so forth. Now I can just solve this as a regression problem, and whatever output I get, I can read it as an estimate of the probability of g given x, if you think about it.

That is, read it as the probability of g = 1 given x. For the same value of x there may be multiple labels: suppose the same value x occurs, say, 5 times in my training data, 3 of those times the label was 1 and 2 of those times it was 0. When I try to do a prediction at that x, I would expect to end up at the average of those labels, which would be 3/5, and that also turns out to be the probability with which the output is 1 given that x. So if I do regression with this as my training data, what I will be learning is, roughly, the probability that g = 1 given x. There are a lot of caveats in this which we will look at when we do regression later; obviously you cannot treat this directly as a probability, because the regression curve can become negative.
So you cannot really treat it as probabilities, but it is a useful intuition to have. The output that you learn here, f̂(x), is then used as follows: if f̂(x) ≥ 0.5 you say the class is 1, and if it is < 0.5 you say the class is 0. So you can use linear regression to solve this as well (a sketch is given below).
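A minimal sketch of this thresholded-regression idea in Python, assuming NumPy and a toy one-dimensional data set with 0/1 labels; the variable names are illustrative, not from the lecture:

```python
import numpy as np

# Toy training data: x in R, labels encoded as 0 or 1
X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit f_hat(x) = w0 + w1 * x by ordinary least squares
A = np.column_stack([np.ones_like(X), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def classify(x):
    f_hat = w[0] + w[1] * x          # rough estimate of Pr(G = 1 | x)
    return 1 if f_hat >= 0.5 else 0  # threshold at 0.5

print(classify(0.8), classify(3.2))  # -> 0 1
```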
So what we have done in these couple of modules is to look at a unifying formulation for classification and regression problems, that is, supervised learning problems, and to look at a couple of different classifiers and regressors that arise out of making certain assumptions about the function that we are trying to learn.

In the subsequent classes we will start looking at each of these in more detail, starting off with linear regression, and we will look at these different classifiers in greater detail. Thank you.

IIT Madras Production

Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India

www.nptel.ac.in

Copyrights Reserved
