Mod09-ppt2-ML_in_Image_Classification

The document provides an overview of three machine learning classification methods: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Naïve Bayes. It explains the principles behind each method, including instance-based classification, optimal separation in SVM, and Bayesian probability in Naïve Bayes. Additionally, it highlights the advantages and limitations of these classifiers in practical applications.


MACHINE LEARNING

CLASSIFICATION
K NEAREST NEIGHBOUR (KNN)
SUPPORT VECTOR MACHINE
NAÏVE BAYES

DR. SHILOAH ELIZABETH D


Assistant Professor
Department of Computer Science and Engineering
Anna University
SUPERVISED LEARNING
Instance-Based Classifiers
• Store the training records
• Use training records to predict the class label of unseen cases
• Examples:
• Rote-learner - Memorizes the entire training data and performs classification only
if the attributes of a record exactly match one of the training examples
• Nearest neighbor - Uses the k “closest” points (nearest neighbors) to perform
classification
Nearest Neighbor Classifiers
• Requires three things
• The set of stored records
• Distance Metric to compute distance
between records
• The value of k, the number of nearest
neighbors to retrieve
• To classify an unknown record:
• Compute distance to other training
records
• Identify k nearest neighbors
• Use class labels of nearest neighbors
to determine the class label of
unknown record (e.g., by taking
majority vote)
• The k-nearest neighbors of a record x are the data points that have the k smallest
distances to x
Nearest Neighbor
• Compute the distance between two points, e.g., using the Euclidean distance:
d(p, q) = √( Σ_i (p_i − q_i)² )
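A minimal sketch of this distance computation in plain Python (the function name and the sample points are illustrative):

import math

def euclidean_distance(p, q):
    # Euclidean distance between two attribute vectors p and q
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((3, 7), (7, 7)))   # 4.0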
Nearest Neighbor Classification
• Determine the class from the nearest-neighbor list, e.g., by taking a majority vote
of their class labels
Nearest Neighbor Classification
• Choosing the value of k:
• If k is too small, sensitive to noise points
• If k is too large, neighborhood may include points from other classes
Nearest Neighbor Classification
• Normalization - attributes may need to be rescaled so that no single attribute
dominates the distance measure (see the sketch below)
• Curse of Dimensionality - distances become less meaningful as the number of
attributes grows
• k-NN classifiers are lazy learners
• They do not build models explicitly
• Unlike eager learners such as decision tree induction and rule-based systems
• Classifying unknown records is relatively expensive
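A minimal sketch of min-max normalization, one common way to rescale attributes before computing distances (the attribute values are illustrative):

def min_max_normalize(values):
    # Rescale a list of attribute values to the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Without rescaling, incomes in dollars would dominate ages in years
print(min_max_normalize([25, 35, 45, 60]))                # ages
print(min_max_normalize([20000, 50000, 80000, 120000]))   # incomes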
Example
We have data from a questionnaire survey (asking people's opinion) and from objective testing, with two
attributes (acid durability and strength), to classify whether a special paper tissue is good or not.
Here are four training samples:
X1 = Acid Durability (seconds) and X2 = Strength (kg/square meter); Y = Classification
(7, 7, Bad); (7, 4, Bad); (3, 4, Good); (1, 4, Good)
Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
Without another expensive survey, can we guess what the classification of this new tissue is?

Using Euclidean distance with k = 3, the squared distances from the query (3, 7) are 9 for (3, 4), 13 for (1, 4),
16 for (7, 7) and 25 for (7, 4), so the three nearest neighbours are (3, 4, Good), (1, 4, Good) and (7, 7, Bad).
We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test
with X1 = 3 and X2 = 7 belongs to the Good category.
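A minimal KNN sketch in plain Python that reproduces this example (k = 3 and Euclidean distance are assumptions consistent with the vote above; the slides do not state them explicitly):

from collections import Counter
import math

# Training samples: (acid durability, strength) -> class
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_classify(query, data, k=3):
    # Sort training samples by distance to the query, then take a majority vote among the k nearest
    by_distance = sorted(data, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((3, 7), training))   # -> 'Good'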
SUPPORT VECTOR MACHINE (SVM)
SUPPORT VECTOR MACHINE (SVM)
• SVM is a kernel method
• SVMs often give better classification performance than other ML algorithms on
reasonably sized datasets.
• They do not work well on extremely large datasets, since they involve a data
matrix inversion, which is very expensive.
SVM – When data is linearly separable
OPTIMAL SEPARATION

Three different classification lines. Is there any reason why one is better than the others?
OPTIMAL SEPARATION
• All three of the lines that are drawn separate out the two classes,
• so in some sense they are ‘correct’, and
• the Perceptron would stop its training if it reached any one of them.
• We prefer a line that runs through the middle of the separation
between the datapoints from the two classes,
• staying approximately equidistant from the data in both classes.
• If we pick the lines shown in the left or right graphs,
• then there is a chance that a datapoint from one class will be on the wrong
side of the line,
• just because we have put the line tight up against some of the datapoints we
have seen in the training set.
The Margin and Support Vectors

The margin is the largest region we can put that separates the classes without there being any points inside, where the
box is made from two lines that are parallel to the decision boundary.
The classifier in the middle of the Figure has the largest margin of the three. It has the imaginative name of the
maximum margin (linear) classifier.
The datapoints in each class that lie closest to the classification line are called support vectors.
SVM
• Using the argument that the best classifier is the one that goes
through the middle of no-man’s land, we can now make two
arguments:
• the margin should be as large as possible, and
• the support vectors are the most useful datapoints because they are the ones
that we might get wrong.
• This leads to an interesting feature of these algorithms:
• after training we can throw away all of the data except for the support
vectors, and use them for classification
SVM
• Computing optimal decision boundary from a given set of datapoints
• w - weight vector (a vector, not a matrix, since there is only one output)
• x - input vector
• Output y = w · x + b, with b being the contribution from the bias weight
• We use the classifier line by saying that
• any x value that gives a positive value for w · x + b is above the line, and so is
an example of the ‘+’ class,
• any x that gives a negative value is in the ‘o’ class.
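A minimal sketch of this decision rule (the weight vector and bias values below are made up purely for illustration):

import numpy as np

w = np.array([1.0, -2.0])   # illustrative weight vector
b = 0.5                     # illustrative bias

def classify(x):
    # '+' class if w . x + b is positive, 'o' class otherwise
    return '+' if np.dot(w, x) + b > 0 else 'o'

print(classify(np.array([3.0, 0.5])))   # w . x + b = 2.5 -> '+'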
SVM
• Let us include our no-man’s land (the margin region):
• a point lies inside the grey box if the absolute value of w · x + b is less than our
margin M.
• w · x is the inner (or scalar) product of w and x.
• It can also be written as wᵀx, which means that we can treat the vectors as
degenerate matrices and use the normal matrix multiplication rules.
• For a given margin value M we can say that
• any point x where wᵀx + b ≥ M is a plus, and
• any point where wᵀx + b ≤ −M is a circle.
• The actual separating hyperplane is specified by wᵀx + b = 0.
SVM
• Support vector: a point x⁺ that lies on the ‘+’ class boundary line, so that
wᵀx⁺ + b = M
• If we want to find the closest point that lies on the boundary line for the ‘o’
class, then we travel perpendicular to the ‘+’ boundary line until we hit the
‘o’ boundary line.
• The point that we hit is the closest point, and we call it x⁻.
• The distance travelled to get to the separating hyperplane is M,
• and the distance from x⁺ to x⁻ is 2M.
• To write down the margin size M in terms of w:
• w is perpendicular to the classifier line and to the ‘+’ and ‘o’ boundary lines,
• so the direction travelled from x⁺ to x⁻ is along w.
• Rescaling w to a unit vector w/||w||, we see that the margin is 1/||w||.
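A short sketch using scikit-learn's linear SVM to recover w, b, the support vectors and the margin 1/||w|| (the toy dataset below is an assumption, chosen only to be linearly separable; it is not from the slides):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (toy data for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector of the separating hyperplane
b = clf.intercept_[0]               # bias term
margin = 1.0 / np.linalg.norm(w)    # distance from the hyperplane to each boundary line

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b, "margin =", margin)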
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
• Total Probability Theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
• Bayes’ Theorem: P(H | X) = P(X | H) P(H) / P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X) (i.e., the posteriori probability): the
probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
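A small numeric sketch of Bayes' theorem for this buying-a-computer setting (the prior, likelihood and evidence values below are made up for illustration):

# P(H): prior that a customer buys a computer (illustrative value)
p_h = 0.6
# P(X|H): likelihood of the profile "31..40, medium income" among buyers (illustrative)
p_x_given_h = 0.3
# P(X): overall probability of observing that profile (illustrative)
p_x = 0.25

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.72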
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H | X) = P(X | H) P(H) / P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost

Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, P(Ci | X) ∝ P(X | Ci) P(Ci), so only
P(X | Ci) P(Ci)
needs to be maximized

Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)

• This greatly reduces the computation cost: only the class distribution needs to
be counted
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
and P(xk|Ci) is
P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
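A minimal sketch of this Gaussian estimate for a continuous attribute (the mean, standard deviation and test value below are illustrative):

import math

def gaussian(x, mu, sigma):
    # Gaussian density g(x, mu, sigma), used as P(x_k | C_i) for continuous attributes
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Example: suppose the ages in class C_i have mean 38 and standard deviation 12
print(gaussian(35, 38.0, 12.0))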
Naïve Bayes Classifier: Training Dataset
Class: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayes Classifier: An Example
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
• Therefore, X belongs to class “buys_computer = yes”
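A plain-Python sketch that reproduces this computation from the 14 training tuples (attribute order: age, income, student, credit_rating; the last field is the class label):

from collections import Counter

data = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31…40", "high",  "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31…40", "low",   "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium","no",  "excellent", "yes"),
    ("31…40", "high",  "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")   # tuple to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(data)                       # P(Ci)
    likelihood = 1.0
    for attr_index, value in enumerate(X):        # naive independence assumption
        matches = sum(1 for row in data if row[-1] == c and row[attr_index] == value)
        likelihood *= matches / n_c               # P(x_k | Ci)
    scores[c] = prior * likelihood                # P(X | Ci) * P(Ci)

print(scores)                       # yes: ~0.028, no: ~0.007
print(max(scores, key=scores.get))  # 'yes'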
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise, the predicted probability will be zero
P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
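A minimal sketch of the Laplacian correction applied to the income counts in this example:

counts = {"low": 0, "medium": 990, "high": 10}   # raw counts from the 1000 tuples

# Add 1 to each value's count; the denominator grows by the number of distinct values
total = sum(counts.values()) + len(counts)       # 1000 + 3 = 1003
corrected = {value: (n + 1) / total for value, n in counts.items()}

print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003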
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of
accuracy
• Practically, dependencies exist among variables
• E.g., in hospitals, patient data includes a profile (age, family history, etc.),
symptoms (fever, cough, etc.) and diseases (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks