
Classification (Bayes, Lazy)

Data Mining* (CSC521)


Dr M Muzammal

*The instructor thanks Dr Jae-Gil Lee for sharing the lecture slides.
Contents

• Decision Tree Induction


• Bayes Classification
• Support Vector Machines (SVM)
• Ensemble Methods
Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:
      P(C | A) = P(A, C) / P(A)
      P(A | C) = P(A, C) / P(C)
• Bayes theorem:
      P(C | A) = P(A | C) P(C) / P(A)
Rule of Multiplication

The probability that Events A and B both occur is equal to the probability that
Event A occurs times the probability that Event B occurs, given that A has
occurred.
P(A ∩ B) = P(A) P(B|A)

Example: Rule of Multiplication
• An urn contains 6 red marbles and 4 black marbles. Two marbles are
drawn without replacement from the urn. What is the probability that both of the
marbles are black?
• Let A = the event that the first marble is black, and
• let B = the event that the second marble is black
      P(A) = 4/10 (4 out of 10 marbles in the urn are black)
      P(B|A) = 3/9 (3 out of 9 marbles in the urn are black now)

      P(A ∩ B) = P(A) P(B|A) = (4/10) × (3/9) = 12/90 = 2/15
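A quick check of this arithmetic in Python, as a minimal sketch using the standard fractions module:

```python
from fractions import Fraction

# Rule of multiplication for the urn example: 6 red + 4 black marbles,
# two draws without replacement, both black.
p_first_black = Fraction(4, 10)              # P(A): 4 of the 10 marbles are black
p_second_black_given_first = Fraction(3, 9)  # P(B|A): 3 of the remaining 9 are black

p_both_black = p_first_black * p_second_black_given_first
print(p_both_black)   # 2/15
```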
Example (1): Bayes Theorem
• A doctor knows that meningitis causes stiff neck 50% of the time → P(S|M) = 0.5
• Prior probability of any patient having meningitis is 1/50,000 → P(M) = 1/50000
• Prior probability of any patient having stiff neck is 1/20 → P(S) = 1/20
• If a patient has stiff neck, what's the probability he/she has meningitis? → P(M|S)?

      P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
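The same calculation as a small Python sketch:

```python
# Bayes' theorem for the meningitis example: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5
p_m = 1 / 50_000
p_s = 1 / 20

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)   # ≈ 0.0002
```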
Example (2): Bayes Theorem
• Anka is getting married tomorrow, at an outdoor ceremony in the Hills. In
recent years, it has rained only 5 days each year. Unfortunately, the
weatherman has predicted rain for tomorrow. When it actually rains, the
weatherman correctly forecasts rain 90% of the time. When it doesn't rain,
he incorrectly forecasts rain 10% of the time. What is the probability that it
will rain on the day of Anka's wedding?
Example (2): Bayes Theorem
• The sample space is defined by two mutually exclusive events – it rains or it
does not rain. Additionally, a third event occurs when the weatherman predicts rain.
• Event A1: It rains on Anka's wedding.
• Event A2: It does not rain on Anka's wedding.
• Event B: The weatherman predicts rain.
Example (2): Bayes Theorem
• P(A1) = 5/365 = 0.0136985      [It rains 5 days out of the year.]
• P(A2) = 360/365 = 0.9863014    [It does not rain 360 days out of the year.]
• P(B|A1) = 0.9                  [When it rains, the weatherman predicts rain 90% of the time.]
• P(B|A2) = 0.1                  [When it does not rain, the weatherman predicts rain 10% of the time.]
Example (2): Bayes Theorem
• Compute P(A1 | B), the probability that it will rain on the day of Anka's wedding,
given a forecast for rain by the weatherman:

      P(A1 | B) = P(A1) P(B|A1) / [ P(A1) P(B|A1) + P(A2) P(B|A2) ]
                = (0.0137 × 0.9) / (0.0137 × 0.9 + 0.9863 × 0.1)
                ≈ 0.111

Even when the weatherman predicts rain, it rains only about 11% of the time.
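The same total-probability calculation as a small Python sketch:

```python
# Weatherman example: P(A1|B) = P(A1) P(B|A1) / (P(A1) P(B|A1) + P(A2) P(B|A2))
p_rain = 5 / 365               # P(A1)
p_dry = 360 / 365              # P(A2)
p_forecast_given_rain = 0.9    # P(B|A1)
p_forecast_given_dry = 0.1     # P(B|A2)

p_forecast = p_rain * p_forecast_given_rain + p_dry * p_forecast_given_dry  # P(B)
p_rain_given_forecast = p_rain * p_forecast_given_rain / p_forecast
print(round(p_rain_given_forecast, 3))   # ≈ 0.111
```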
Bayesian Classifiers (1/2)
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An)
  • The goal is to predict the class C
  • Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers (2/2)
• Approach
  • Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes
theorem

      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  • Choose the value of C that maximizes P(C | A1, A2, …, An)
  • Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since
P(A1, A2, …, An) is constant for all classes
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes)

      P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

• We can estimate P(Ai | Cj) for all Ai and Cj
• A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal
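The decision rule as a minimal Python sketch; the priors and cond_probs structures here are hypothetical placeholders, not from the slides:

```python
import math

# priors:     {class_label: P(class)}
# cond_probs: {class_label: [f_0, f_1, ...]} where f_i(value) returns P(A_i = value | class)
def naive_bayes_predict(record, priors, cond_probs):
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)                  # work in log space to avoid underflow
        for i, value in enumerate(record):
            p = cond_probs[c][i](value)          # P(A_i = value | C = c)
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```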
How to Estimate Probabilities from Data? (1/3)
• For discrete attributes:

      P(Ai | Ck) = |Aik| / Nc

  • where |Aik| is the number of instances that have attribute value Ai and belong to class Ck,
and Nc is the number of instances in class Ck
  • e.g., P(Status=Married | No) = 4/7

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Evade):

  Tid  Refund  Marital Status  Taxable Income  Evade
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
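A minimal Python sketch of this counting, using records that mirror the table above (an illustration, not from the slides):

```python
# Estimating P(Ai | Ck) by relative frequency within each class.
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def p_attr_given_class(attr_index, value, cls):
    in_class = [r for r in records if r[-1] == cls]          # N_c instances
    matching = [r for r in in_class if r[attr_index] == value]  # |A_ik| instances
    return len(matching) / len(in_class)

print(p_attr_given_class(1, "Married", "No"))   # 4/7 ≈ 0.571
```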
How to Estimate Probabilities from Data? (2/3)

• For continuous attributes:


• Assume the attribute follows a normal distribution
• Use data to estimate parameters of the distribution (e.g., mean and standard
deviation)
• Once the probability distribution is known, we can use it to estimate the conditional
probability P(Ai | C)

• e.g., see the next page


How to Estimate Probabilities from Data? (3/3)
• Normal distribution:

      P(Ai | cj) = 1 / √(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

  • One for each (Ai, cj) pair
• e.g., (Income, Class=No), using the training data above
  • If Class=No:
    • sample mean = 110
    • sample variance = 2975

      P(Income = 120 | No) = 1 / (√(2π) × 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
An Example
Given a test record: X = (Refund = No, Marital Status = Married, Income = 120K)

Naïve Bayes Classifier (estimated from the training data):
  P(Refund=Yes | No) = 3/7                 P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0/3                P(Refund=No | Yes) = 3/3
  P(Marital Status=Single | No) = 2/7      P(Marital Status=Single | Yes) = 2/3
  P(Marital Status=Divorced | No) = 1/7    P(Marital Status=Divorced | Yes) = 1/3
  P(Marital Status=Married | No) = 4/7     P(Marital Status=Married | Yes) = 0/3
  Taxable Income:
    If Class=No:  sample mean = 110, sample variance = 2975
    If Class=Yes: sample mean = 90,  sample variance = 25

• P(X | Class=No) = P(Refund=No | Class=No)
                    × P(Married | Class=No)
                    × P(Income=120K | Class=No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X | Class=Yes) = P(Refund=No | Class=Yes)
                     × P(Married | Class=Yes)
                     × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), i.e., 0.0024 × 7/10 > 0 × 3/10,
therefore P(No|X) > P(Yes|X)  =>  Class = No
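Reproducing this worked example in Python, plugging in the probabilities stated above:

```python
# Test record X = (Refund=No, Married, Income=120K)
p_refund_no_given_no   = 4 / 7
p_married_given_no     = 4 / 7
p_income_120_given_no  = 0.0072     # Gaussian with mean 110, variance 2975 (see above)

p_refund_no_given_yes  = 3 / 3
p_married_given_yes    = 0 / 3
p_income_120_given_yes = 1.2e-9     # Gaussian with mean 90, variance 25

p_x_given_no  = p_refund_no_given_no * p_married_given_no * p_income_120_given_no
p_x_given_yes = p_refund_no_given_yes * p_married_given_yes * p_income_120_given_yes

p_no, p_yes = 7 / 10, 3 / 10        # class priors from the training table
print(p_x_given_no * p_no, p_x_given_yes * p_yes)   # ≈ 0.00165 vs 0.0  =>  Class = No
```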
M-Estimate of Conditional Probability
• If one of the conditional probabilities is zero, then the entire expression
becomes zero
• Probability estimation:

      Original:    P(Ai | C) = Nic / Nc
      Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
      m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  where c is the number of classes, p is the prior probability, and m is a parameter
M-Estimate
• The basic idea for estimating conditional probabilities is that the prior probability p can be
estimated from an unconditional sample
  • If we don't have any knowledge of p, assume the attribute is uniformly distributed over all possible
values
• Interpretation of m
  • A higher value of m means that we are more confident in the prior probability p
  • m controls the balance between the relative frequency and the prior probability:

      P(Ai | C) = (Nc / (Nc + m)) · (Nic / Nc) + (m / (Nc + m)) · p

• An example
  • We assume p = 1 / number of attribute values = 1/3
  • The m value is arbitrary, and we will use m = 4
  • P(Marital Status = Married | Yes) = (0 + 4 × 1/3) / (3 + 4) = 4/21 ≈ 0.19 (instead of 0!)
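A small Python sketch of the m-estimate, applied to the zero probability from the earlier example (p = 1/3 and m = 4 as above):

```python
def m_estimate(n_ic, n_c, p, m):
    """m-estimate of P(Ai | C): (N_ic + m*p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

# P(Marital Status = Married | Yes) was 0/3 in the earlier example.
print(m_estimate(0, 3, 1 / 3, 4))   # 4/21 ≈ 0.19, instead of 0
```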
Summary of Naïve Bayes
• Robust to isolated noise points
• Able to handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
  • If Xi is an irrelevant attribute, P(Xi | Y) becomes almost uniformly distributed
• Independence assumption may not hold for some attributes


Classification (SVM)
Ice Breaking
Measuring Happiness Using Wearable Technology
Amount and direction of movement in three dimensions and with
high resolution (50 times a second, or once every 20 ms)
History and Applications
• Proposed by Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis' statistical learning theory in 1960s
• Characteristics: training can be slow, but accuracy is high owing to its ability
to model complex nonlinear decision boundaries (margin maximization)
• Usage: classification and numeric prediction
• Applications:
  • handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
Intuition (1/2)
[Figure: data points with two candidate decision boundaries B1 and B2]
Find a linear hyperplane (decision boundary) that will separate the data

Intuition (2/2)
[Figure: the same data with boundaries B1 and B2]
Which one is better? How do you define better?

Basic Idea
[Figure: boundaries B1 and B2 with their margins (b11/b12 and b21/b22) and the support vectors marked]
Find the hyperplane that maximizes the margin
→ B1 is better than B2
When Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its
associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find
the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum
marginal hyperplane (MMH)
Formalization (1/3)
• A separating hyperplane: w · x – b = 0
  • w: a normal vector
  • b: a scalar value (bias)
• Two parallel hyperplanes:
  • w · x – b = 1
  • w · x – b = -1
• → Maximize the margin 2/||w|| (i.e., minimize ||w||), subject to
  • w · xi – b ≥ 1 for xi of the first class
  • w · xi – b ≤ -1 for xi of the second class
  → yi (w · xi – b) ≥ 1 for all i
Formalization (2/3)
• Primal form:

      minimize (1/2) ||w||²  subject to  yi (w · xi – b) ≥ 1 for all i

  → Substituting ||w|| with (1/2) ||w||² for mathematical convenience
• This becomes a constrained (convex) quadratic optimization problem
Soft Margin (1/2)

• If there exists no hyperplane that can split


the "yes" and "no" examples, the soft
margin method will choose a hyperplane
that splits the examples as cleanly as
possible, while still maximizing the
distance to the nearest cleanly split
examples
→ Allow mislabeled examples
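A rough Python sketch of a soft-margin linear SVM trained by subgradient descent on the hinge loss. This is not the constrained QP from the formalization slides, but it minimizes the same soft-margin objective approximately; the toy data and learning-rate choices are assumptions for illustration:

```python
import numpy as np

def linear_svm_sgd(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.

    Minimizes 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b)); y must be in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w - b)
            if margin < 1:                       # point violates the (soft) margin
                w -= lr * (w - C * y[i] * X[i])
                b -= lr * (C * y[i])
            else:
                w -= lr * w                      # only the regularizer contributes
    return w, b

# Toy usage: two well-separated Gaussian blobs, labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = linear_svm_sgd(X, y)
pred = np.sign(X @ w - b)
print((pred == y).mean())   # training accuracy; should be 1.0 on this toy data
```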
When Linearly Inseparable
[Figure: a linearly separable dataset vs. a dataset that is not linearly separable]
Projecting data that is not linearly separable into a higher
dimensional space can make it linearly separable
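A tiny illustration of the projection idea in Python, using hypothetical circular data: points inside vs. outside a circle are not linearly separable in two dimensions, but adding a squared-radius feature makes a linear rule sufficient in the projected space:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)    # label: inside the unit circle?

Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])     # project (x1, x2) -> (x1, x2, x1^2 + x2^2)
pred = np.where(Z[:, 2] < 1.0, 1, -1)                     # a linear rule on the new coordinate
print((pred == y).mean())                                 # 1.0: separable after projection
```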
Why Is SVM Effective on High Dimensional
Data?
• The complexity of a trained classifier is characterized by the number of support
vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples —they lie
closest to the decision boundary (MMH)
• If all other training examples are removed and the training is repeated, the
same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper)
bound on the expected error rate of the SVM classifier, which is independent of
the data dimensionality
• Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
Multi-Class SVM
• Reduce the single multiclass problem into multiple binary classification
problems
• Two methods to build the binary classifiers:
  • Between one of the labels and the rest (one-versus-all)
  • Between every pair of classes (one-versus-one)
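A minimal one-versus-rest wrapper sketch in Python; train_binary is a placeholder for any binary classifier that returns a scoring function (for example, the linear SVM sketch above):

```python
import numpy as np

def one_vs_rest_train(X, y, train_binary):
    """Train one binary scorer per class label (one-versus-all).

    train_binary(X, y_pm) must return a function f(X) -> real-valued scores,
    where y_pm is the label vector recoded to {+1, -1}.
    """
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

def one_vs_rest_predict(models, X):
    labels = list(models)
    scores = np.column_stack([models[c](X) for c in labels])  # one score column per class
    return np.array(labels)[scores.argmax(axis=1)]            # class with the largest score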
LIBSVM
• A library for Support Vector Machines
• Provides interfaces for many programming languages, including Java,
MATLAB, R, Python, and C#
• Developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
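For a quick start in Python, scikit-learn's SVC classifier is built on LIBSVM; a minimal usage sketch on synthetic data (dataset and parameter choices are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)      # RBF kernel, soft-margin parameter C
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # test accuracy
print(len(clf.support_))            # number of support vectors found
```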
SVM Related Links
• http://www.svms.org/
• http://www.support-vector-machines.org/
• http://www.kernel-machines.org/
Thank You!
Questions?
