Chapter 4: Classification & Prediction: 4.1 Basic Concepts of Classification and Prediction 4.2 Decision Tree Induction
Bayes' Theorem in the Classification Context
} X is a data tuple. In Bayesian terms, it is considered "evidence"
} H is some hypothesis that X belongs to a specified class C

P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}

Example: predict whether a customer will buy a computer or not
" Customers are described by two attributes: age and income
" X is a 35-year-old customer with an income of 40k
" H is the hypothesis that the customer will buy a computer

} P(H|X) is the posterior probability of H conditioned on X
" It reflects the probability that customer X will buy a computer, given that we know the customer's age and income
} P(X|H) is the posterior probability of X conditioned on H
" It reflects the probability that customer X is 35 years old and earns 40k, given that we know the customer will buy a computer
} P(H) is the prior probability of H
" It is the probability that a customer will buy a computer, regardless of age, income, or any other information
" The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X
} P(X) is the prior probability of X
" It is the probability that a person from our set of customers is 35 years old and earns 40k
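As a small numeric illustration of how these four quantities fit together, the sketch below estimates P(H), P(X), and P(X|H) from a tiny, purely hypothetical table of customer records (invented for illustration, not taken from the chapter) and applies Bayes' theorem to obtain P(H|X).

# Minimal sketch of Bayes' theorem on the customer example.
# The records below are hypothetical and exist only to make the arithmetic concrete.
records = [
    # (age, income_in_k, buys_computer)
    (35, 40, True), (35, 40, False), (35, 40, True),
    (50, 80, True), (22, 25, False), (35, 40, True),
    (60, 90, False), (35, 40, False), (41, 55, True), (28, 30, False),
]

x = (35, 40)                          # evidence X: a 35-year-old customer earning 40k
n = len(records)
buyers = [r for r in records if r[2]]

p_h = len(buyers) / n                                                     # P(H): prior of buying
p_x = sum(1 for r in records if (r[0], r[1]) == x) / n                    # P(X): prior of the evidence
p_x_given_h = sum(1 for r in buyers if (r[0], r[1]) == x) / len(buyers)   # P(X|H)

p_h_given_x = p_x_given_h * p_h / p_x                                     # Bayes' theorem: P(H|X)
print(f"P(H|X) = {p_h_given_x:.2f}")                                      # 0.60 for this toy table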
Naïve Bayesian Classification
D: A training set of tuples and their associated class labels
Each tuple is represented by an n-dimensional vector X = (x1, …, xn), holding n
measurements of the n attributes A1, …, An
Classes: suppose there are m classes C1,…,Cm
Principle
} Given a tuple X, the classifier will predict that X belongs to the
class having the highest posterior probability conditioned on X
} Predict that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X)
for 1 ≤ j ≤ m, j ≠ i. By Bayes' theorem:

P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}

} P(X) is constant for all classes; thus, it suffices to maximize P(X|Ci)P(Ci), as the decision rule below makes explicit
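The resulting decision rule, which is just a restatement of the maximization above, can be written compactly as

\hat{C}(X) = \arg\max_{1 \le i \le m} P(X \mid C_i)\, P(C_i)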
Naïve Bayesian Classification
} To maximize P(X|Ci)P(Ci), we need to know class prior
probabilities
" If the probabilities are not known, assume that P(C1)=P(C2)=…=P
(Cm) ⇒ maximize P(X|Ci)
" Class prior probabilities can be estimated by P(Ci)=|Ci,D|/|D|
} Assume Class Conditional Independence to reduce
computational cost of P(X|Ci)
" given X(x1,…,xn), P(X|Ci) is:
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
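A minimal sketch of this computation for purely categorical attributes is given below: class priors estimated as |Ci,D|/|D|, class-conditional probabilities estimated by simple counting, and prediction by maximizing P(X|Ci)P(Ci). The training tuples and attribute values are hypothetical, and no smoothing or continuous attributes are handled here.

from collections import Counter, defaultdict

# Hypothetical training data: each entry is ((x1, ..., xn), class_label).
train = [
    (("youth",  "high"),   "no"),
    (("youth",  "medium"), "no"),
    (("middle", "high"),   "yes"),
    (("senior", "medium"), "yes"),
    (("senior", "low"),    "yes"),
    (("middle", "low"),    "yes"),
    (("youth",  "low"),    "no"),
]

# Class priors P(Ci) estimated as |Ci,D| / |D|.
class_counts = Counter(label for _, label in train)
total = len(train)
priors = {c: class_counts[c] / total for c in class_counts}

# Conditional counts for P(xk | Ci): one counter per (class, attribute index).
cond_counts = defaultdict(Counter)
for x, c in train:
    for k, xk in enumerate(x):
        cond_counts[(c, k)][xk] += 1

def predict(x):
    """Return the class maximizing P(X|Ci) P(Ci) under the naive independence assumption."""
    best_class, best_score = None, -1.0
    for c in priors:
        score = priors[c]
        for k, xk in enumerate(x):
            score *= cond_counts[(c, k)][xk] / class_counts[c]   # P(xk | Ci) by counting
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(("youth", "medium")))   # -> "no" for this toy data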
" If attribute Ak is continuous-valued, P(xk|Ci) is typically assumed to follow a
Gaussian distribution with mean µCi and standard deviation σCi:

P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \, e^{-\frac{(x_k-\mu_{C_i})^2}{2\sigma_{C_i}^2}}

" Estimate µCi and σCi, the mean and standard deviation of the values of attribute
Ak for the training tuples of class Ci
" Example
X a 35 years-old costumer with an income of 40k (age, income)
Assume the age attribute is continuous-valued
Consider class Cyes (the costumer will buy a computer)
We find that in D, the costumers who will buy a computer are
38±12 years of age ⇒ µCyes=38 and σCyes=12
Tuple to classify is
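To make the number concrete, the snippet below evaluates the Gaussian density g(35, 38, 12) with the mean and standard deviation estimated above; the income attribute would be handled the same way with its own µCyes and σCyes.

import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous-valued attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# P(age = 35 | Cyes) = g(35, 38, 12) ≈ 0.032
print(gaussian(35, 38, 12))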
} Ex. Suppose a training set with 1000 tuples: income = low (0 tuples), income =
medium (990 tuples), and income = high (10 tuples)
" The zero count gives Prob(income = low) = 0, which would force any product
P(X|Ci) containing that factor to 0, no matter what the other attributes say
} Use the Laplacian correction (or Laplacian estimator)
" Add 1 to each count
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
" The “corrected” prob. estimates are close to their “uncorrected”
counterparts
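These corrected estimates can be reproduced directly from the counts; a minimal sketch:

# Counts of income = low, medium, high among the 1000 tuples.
counts = {"low": 0, "medium": 990, "high": 10}

# Laplacian correction: add 1 to each count; the denominator grows by the number
# of distinct values (1000 + 3 = 1003), and no estimate is exactly zero anymore.
total = sum(counts.values()) + len(counts)
corrected = {value: (c + 1) / total for value, c in counts.items()}

print(corrected)   # low: 1/1003 ≈ 0.001, medium: 991/1003 ≈ 0.988, high: 11/1003 ≈ 0.011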
Summary of Section 4.3
} Advantages
" Easy to implement
" Good results obtained in most of the cases
} Disadvantages
" Assumption: class conditional independence, therefore loss of
accuracy
" Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
} How to deal with these dependencies?
" Bayesian Belief Networks
4.3.2 Bayesian Belief Networks
} A Bayesian belief network allows a subset of the variables to be
conditionally independent
} A graphical model of causal relationships
" Represents dependency among the variables
" Gives a specification of joint probability distribution
" Given both the network structure and all variables observable: learn
only the CPTs
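As an illustration of how a belief network gives a specification of the joint probability distribution, a network over variables Y1, …, Yn factorizes the joint as

P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid \mathrm{Parents}(Y_i)\big)

where each factor P(yi | Parents(Yi)) is read off the CPT attached to node Yi.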