Mod09-ppt2-ML_in_Image_Classification
CLASSIFICATION
K NEAREST NEIGHBOUR (KNN)
SUPPORT VECTOR MACHINE
NAÏVE BAYES
We have 2 Good and 1 Bad among the three nearest neighbours. Since 2 > 1, we conclude that a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 is included in the Good category.
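A minimal sketch of the 3-nearest-neighbour vote behind this conclusion. The four training tuples below are illustrative assumptions (not taken from the slide), chosen only so that the query (X1 = 3, X2 = 7) ends up with a 2-Good/1-Bad vote:

```python
import math

# Hypothetical training data: (X1, X2, class) -- assumed values for illustration
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)   # the new paper tissue: X1 = 3, X2 = 7
k = 3

# Euclidean distance from the query to every training point, nearest first
dists = sorted((math.dist(query, (x1, x2)), label) for x1, x2, label in train)

# Majority vote among the k nearest neighbours
votes = [label for _, label in dists[:k]]
prediction = max(set(votes), key=votes.count)
print(votes, "->", prediction)   # ['Good', 'Good', 'Bad'] -> 'Good'
```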
SUPPORT VECTOR MACHINE (SVM)
• SVM is a kernel method
• SVMs often give better classification performance than other ML algorithms on reasonably sized datasets.
• They do not work well on extremely large datasets, since they involve a data matrix inversion, which is very expensive.
SVM – When data is linearly separable
OPTIMAL SEPARATION
Three different classification lines. Is there any reason why one is better than the others?
OPTIMAL SEPARATION
• All three of the lines that are drawn separate out the two classes, so in some sense they are ‘correct’, and the Perceptron would stop its training if it reached any one of them.
• However, we prefer a line that runs through the middle of the separation between the datapoints from the two classes, staying approximately equidistant from the data in both classes.
• If we pick the lines shown in the left or right graphs, then there is a chance that a datapoint from one class will be on the wrong side of the line, just because we have put the line tight up against some of the datapoints we have seen in the training set.
The Margin and Support Vectors
The margin is the largest region we can place between the two classes without any datapoints falling inside it, where the region is bounded by two lines parallel to the decision boundary.
The classifier in the middle of the Figure has the largest margin of the three. It has the imaginative name of the maximum margin (linear) classifier.
The datapoints in each class that lie closest to the classification line are called support vectors.
SVM
• Using the argument that the best classifier is the one that goes
through the middle of no-man’s land, we can now make two
arguments:
• the margin should be as large as possible, and
• the support vectors are the most useful datapoints because they are the ones
that we might get wrong.
• This leads to an interesting feature of these algorithms:
• after training we can throw away all of the data except for the support vectors, and use them for classification (see the sketch below)
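As an illustration of that last point, scikit-learn's SVC keeps exactly these retained points in its support_vectors_ attribute. A minimal sketch, assuming scikit-learn is available and using made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable blobs (made-up numbers)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# Only these points define the decision boundary; the rest of the
# training set could be discarded without changing predictions.
print(clf.support_vectors_)
print(clf.predict([[3.0, 3.0]]))
```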
SVM
• Computing the optimal decision boundary from a given set of datapoints
• w - weight vector (a vector, not a matrix, since there is only one output)
• x - input vector
• Output y = w · x + b, with b being the contribution from the bias weight
• We use the classifier line by saying that
• any x value that gives a positive value for w · x + b is above the line, and so is an example of the ‘+’ class,
• any x that gives a negative value is in the ‘o’ class (see the numeric sketch below).
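A minimal numeric sketch of this decision rule; the weight vector, bias and test points below are made-up values:

```python
import numpy as np

w = np.array([1.0, -1.0])   # weight vector (assumed values)
b = -0.5                    # bias contribution (assumed value)

def classify(x):
    """Return '+' if w.x + b is positive, 'o' if it is negative."""
    y = np.dot(w, x) + b
    return "+" if y > 0 else "o"

print(classify(np.array([2.0, 0.5])))   # w.x + b =  1.0 -> '+'
print(classify(np.array([0.5, 2.0])))   # w.x + b = -2.0 -> 'o'
```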
SVM
• Let us include our no-man’s land: if the absolute value of w · x + b is less than our margin M, the point lies inside the grey box.
• w · x is the inner or scalar product. This can also be written as wT x, which means that we can treat the vectors as degenerate matrices and use the normal matrix multiplication rules.
• For a given margin value M we can say that
• any point x where wT x + b ≥ M is a plus, and
• any point x where wT x + b ≤ −M is a circle.
• The actual separating hyperplane is specified by wT x + b = 0.
SVM
• support vector - a point x+ that lies on the ‘+’ class boundary line, so that wT x+ + b = M
• If we want to find the closest point that lies on the boundary line for the ‘o’ class, then we travel perpendicular to the ‘+’ boundary line until we hit the ‘o’ boundary line.
• The point that we hit is the closest point, and we’ll call it x−
• the distance travelled from x+ to the separating hyperplane is M
• the distance from x+ to x− is 2M
• to write down the margin size M in terms of w:
• w is perpendicular to the classifier line and to the ‘+’ and ‘o’ boundary lines,
• so the direction travelled from x+ to x− is along w
• we make w a unit vector, w/||w||, and so we see that the margin is 1/||w|| (see the check below)
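A small numeric check of that final step, using the common scaling in which the boundary lines satisfy wT x + b = ±1; w, b and the boundary point below are made-up values:

```python
import numpy as np

w = np.array([3.0, 4.0])   # assumed weight vector, ||w|| = 5
b = -10.0                  # assumed bias

margin = 1.0 / np.linalg.norm(w)   # claimed margin on each side of the hyperplane
print(margin)                      # 0.2

# A point on the '+' boundary line (w.x + b = 1): 3*1 + 4*2 - 10 = 1
x_plus = np.array([1.0, 2.0])

# Its perpendicular distance to the separating hyperplane w.x + b = 0
dist = abs(np.dot(w, x_plus) + b) / np.linalg.norm(w)
print(dist)                        # 0.2, i.e. 1/||w||
```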
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
• Total probability theorem: P(B) = Σ(i=1..M) P(B|Ai) P(Ai) (numeric check below)
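A quick numeric check of the total probability theorem with a made-up two-event partition:

```python
# Partition A1, A2 with P(A1) = 0.3, P(A2) = 0.7 (assumed numbers)
p_a = [0.3, 0.7]
# Conditional probabilities P(B | Ai) (assumed numbers)
p_b_given_a = [0.9, 0.2]

# Total probability: P(B) = sum over i of P(B | Ai) * P(Ai)
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))
print(p_b)   # 0.9*0.3 + 0.2*0.7 = 0.41
```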
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only
  P(Ci|X) ∝ P(X|Ci) P(Ci)
  needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
  P(X|Ci) = Π(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts the class
distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))
  and P(xk|Ci) is
  P(xk|Ci) = g(xk, μCi, σCi)
  (a small sketch follows below)
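A minimal sketch of the Gaussian (continuous-attribute) case; the class mean, standard deviation and attribute value are made-up numbers:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)**2 / (2 * sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed class-conditional statistics for a continuous attribute, e.g. 'age' in class Ci
mu_ci, sigma_ci = 38.0, 12.0
x_k = 35.0

# P(x_k | Ci) estimated from the class-conditional Gaussian
print(gaussian(x_k, mu_ci, sigma_ci))
```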
Naïve Bayes Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayes Classifier: An Example
(training dataset table repeated from the previous slide)
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”) = 5/14 = 0.357
• The remaining factors P(xk|Ci) for X are counted from the same table (see the sketch below).
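The classification of X can be finished by counting directly from the 14 training tuples above. A minimal sketch, assuming the table is transcribed exactly as shown; the larger of P(X|Ci)P(Ci) gives the predicted class:

```python
from collections import Counter

# Training data transcribed from the slide's table:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")   # tuple X to classify

labels = [row[-1] for row in data]
class_counts = Counter(labels)          # {'yes': 9, 'no': 5}

scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(data)             # P(Ci), e.g. 9/14 = 0.643 for 'yes'
    likelihood = 1.0
    for k, value in enumerate(x):
        # P(xk|Ci): fraction of class-Ci tuples having this attribute value
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c
    scores[c] = prior * likelihood      # P(X|Ci) * P(Ci)

print(scores)
print(max(scores, key=scores.get))      # predicted class for X: 'yes'
```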