Classification
1. Objectives
2. Classification vs. Prediction
   2.1. Definitions
   2.2. Supervised vs. Unsupervised Learning
   2.3. Classification and Prediction Related Issues
3. Common Test Corpora
4. Classification
5. Decision Tree Induction
   5.1. Decision Tree Induction Algorithm
   5.2. Other Attribute Selection Measures
   5.3. Extracting Classification Rules from Trees
   5.4. Avoid Overfitting in Classification
   5.5. Classification in Large Databases
6. Bayesian Classification
   6.1. Basics
   6.2. Naïve Bayesian Classifier
7. Bayesian Belief Networks
   7.1. Definition
8. Neural Networks: Classification by Backpropagation
   8.1. Neural Network Issues
   8.2. Backpropagation Algorithm
9. Prediction
   9.1. Regression Analysis and Log-Linear Models in Prediction
10. Classification Accuracy: Estimating Error Rates
1. Objectives
2. Classification vs. Prediction
2.1. Definitions
• Classification:
Predicts categorical class labels (discrete or
nominal)
Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
• Prediction:
Models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical Applications
Document categorization
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis
• Data Preparation
Data cleaning
o Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
o Remove the irrelevant or redundant
attributes
Data transformation
o Generalize and/or normalize data
• Performance Analysis
Predictive accuracy:
o Ability to classify new or previously unseen
data.
Speed and scalability
o Time to construct the model
o Time to use the model
Robustness
o Ability of the model to make correct predictions
given noisy data or data with missing values
Scalability
o Efficiency in disk-resident databases
Interpretability:
o Understanding and insight provided by the
model
Goodness of rules
o Decision tree size
o Compactness of classification rules
4. Classification
• Another Example:
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.

[Figure: the equivalent decision tree repeatedly tests x against the
thresholds 90, 80, 70, and 60 and assigns the grades A, B, C, D, and F
at the leaves.]
• Classification types:
o Distance based
o Partitioning based
• Model construction: describing a set of predetermined
classes
o Each data sample is assumed to belong to a predefined
class, as determined by the class label attribute
o Use a training dataset for model construction.
o The model is represented as classification rules,
decision trees, or mathematical formulae
[Figure: model construction. The training data is fed to a classification
algorithm, which produces the classifier (the model).]
• Model usage: for classifying future or unknown objects
o Estimate accuracy of the model
The known label of each test sample is compared with
the label predicted by the model
The accuracy rate is the percentage of test set
samples that are correctly classified by the model
The test set is independent of the training set but is
drawn from the same probability distribution
[Figure: model usage. The classifier is applied to the testing data;
for a new, unseen sample the model predicts, e.g., Tenured? = yes.]
• Common Techniques
• Neural Networks (NNet): learn a non-linear mapping from
input data samples to categories.
• Support Vector Machines (SVMs).
5. Decision Tree Induction
• A decision tree for “buys_computer”
[Figure: the root of the tree tests age; the age 31..40 branch leads
directly to a "yes" leaf, while the age <= 30 and age > 40 branches
test student and credit_rating before reaching "yes"/"no" leaves.]
5.1. Decision Tree Induction Algorithm
o Expected information (entropy) needed to classify a given
sample, where s_i is the number of samples in class C_i
(i = 1, ..., m) and s is the total number of samples:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
o Entropy after choosing attribute A with values
{a1,a2,…,av}
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
o Class P:
buys_computer = “yes”
p: number of samples in class P (here p = 9)
o Class N:
buys_computer = “no”
n: number of samples in class N (here n = 5)
E(\text{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

The term \frac{5}{14} I(2,3) means that “age <= 30” covers 5 of the 14
samples, with 2 yes’es and 3 no’s.
Hence
Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246
Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
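These numbers can be reproduced with a short script. A minimal sketch
using only the counts given above (9 “yes” / 5 “no” overall, and the
per-branch yes/no counts for age):

from math import log2

def info(counts):
    """Expected information I(s1, ..., sm) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts for the whole training set: 9 "yes", 5 "no".
i_all = info([9, 5])                       # ~0.940

# (yes, no) counts within each partition induced by age.
age_partitions = [(2, 3),   # age <= 30
                  (4, 0),   # age 31..40
                  (3, 2)]   # age > 40

n = sum(sum(p) for p in age_partitions)   # 14
e_age = sum(sum(p) / n * info(p) for p in age_partitions)   # ~0.694

print("Gain(age) =", round(i_all - e_age, 3))   # ~0.246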
[Figure: the root is split on age, giving three branches: age <= 30,
age 31..40, and age > 40. The age 31..40 partition is pure (all samples
are “yes”); the other two partitions are:]

age <= 30:
  income    student   credit_rating   class
  high      no        fair            no
  high      no        excellent       no
  low       yes       fair            yes
  medium    no        fair            no
  medium    yes       excellent       yes

age > 40:
  income    student   credit_rating   class
  medium    no        fair            yes
  low       yes       fair            yes
  low       yes       excellent       no
  medium    yes       fair            yes
  medium    no        excellent       no
• ID3 Algorithm:
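A minimal recursive sketch of ID3-style induction, assuming categorical
attributes and the information-gain measure defined above. Samples are
assumed to be dicts mapping attribute names to values, with class labels
passed separately; the helper names are illustrative, not from the notes:

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(samples, labels, attr):
    """Information gain obtained by splitting the samples on attr."""
    remainder = 0.0
    for value in set(s[attr] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(samples, labels, attributes):
    # Stop when the node is pure or no attributes remain: majority-class leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: gain(samples, labels, a))
    tree = {"attribute": best, "branches": {}}
    for value in set(s[best] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        tree["branches"][value] = id3([samples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [a for a in attributes if a != best])
    return tree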
5.2. Other Attribute Selection Measures
• Formal Definition
o If a data set T contains examples from n classes, gini
index, gini(T) is defined as
gini(T) = 1 - \sum_{j=1}^{n} p_j^{2}

where p_j is the relative frequency of class j in T.

o If T is split into two subsets T_1 and T_2, containing N_1 and
N_2 samples respectively (N = N_1 + N_2), the gini index of the
split is defined as

gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

o The attribute that provides the smallest gini_split(T) is
chosen to split the node (need to enumerate all possible
splitting points for each attribute).
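A small sketch of this computation; the full-node counts reuse the
9 “yes” / 5 “no” split from earlier in these notes, and the binary
split shown is a hypothetical one for illustration only:

def gini(counts):
    """Gini index for a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Gini index after a binary split into two child nodes."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([9, 5]))                 # impurity of the full node
print(gini_split([2, 3], [7, 2]))   # hypothetical binary split of the 14 samples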
5.3. Extracting Classification Rules from Trees
5.4. Avoid Overfitting in Classification

• Use a set of data different from the training data
to decide which is the “best pruned tree”
• Visualization of a Decision Tree in SGI/MineSet 3.0
6. Bayesian Classification
• Why Bayesian?
o Probabilistic learning: calculates explicit probabilities
for hypotheses; among the most practical approaches to
certain types of learning problems
6.1. Basics
• Prior Probability:
o P(H): prior probability of hypothesis H
o It is the initial probability before we observe any data
o It reflects the background knowledge
• Bayesian Theorem:
P(H | X) = \frac{P(X | H)\, P(H)}{P(X)}

posterior = \frac{likelihood \times prior}{evidence}
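o A small worked example with hypothetical numbers: if P(H) = 0.01,
P(X | H) = 0.9, and P(X) = 0.05, then

P(H | X) = \frac{0.9 \times 0.01}{0.05} = 0.18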
• Practical difficulty:
o Requires initial knowledge of many probabilities
o Significant computational cost
6.2. Naïve Bayesian Classifier
• Algorithm:
o A simplified assumption: attributes are conditionally
independent:
P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)
• Example:
X=(age<=30,Income=medium,Student=yes,Credit_rating=Fair)
o Compute P(xk|Ci):
P(x_k | C_i) = \frac{s_{ik}}{s_i}
Where sik is the number of training samples of Class
Ci having the value xk for the attribute Ak and si is
the number of training samples belonging to Ci
o Compute P(X|Ci) :
P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)
o P(X|Ci)*P(Ci) :
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007
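A few lines of code reproduce these two scores. The per-attribute
conditional probabilities below (e.g., P(age <= 30 | yes) = 2/9) are
assumed from the usual 14-sample buys_computer training data, since the
full count table is not reproduced in this section:

# Priors from the training set: 9 "yes", 5 "no" out of 14 samples.
prior = {"yes": 9 / 14, "no": 5 / 14}

# Assumed conditional probabilities P(x_k | C_i) for the sample
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
cond = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

for c in ("yes", "no"):
    p_x_given_c = 1.0
    for p in cond[c]:
        p_x_given_c *= p
    print(c, round(p_x_given_c * prior[c], 3))   # yes: ~0.028, no: ~0.007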
• Advantages:
o Easy to implement
o Good results obtained in most of the cases
• Disadvantages
o Assumption: class conditional independence, therefore
loss of accuracy
o Practically, dependencies exist among variables
o E.g., in a hospital setting, a patient’s profile (age, family
history, etc.), symptoms (fever, cough, etc.), and disease
(lung cancer, diabetes, etc.) are interrelated
o Dependencies among these cannot be modeled by the
naïve Bayesian classifier
7. Bayesian Belief Networks
• Objectives:
o The naïve Bayesian classifier assumes that attributes
are conditionally independent
o Bayesian belief networks relax this assumption by allowing
dependencies among subsets of attributes to be modeled
7.1. Definition
• Example:

[Figure: a belief network over three Boolean variables H, B, and J.
H is a parent of B and of J, and B is also a parent of J. Each node
stores a conditional probability table (CPT):]

P(H):
  P(H=1) = 0.05    P(H=0) = 0.95

P(B | H):
  h = 1:  P(B=1 | H=1) = 0.95    P(B=0 | H=1) = 0.05
  h = 0:  P(B=1 | H=0) = 0.03    P(B=0 | H=0) = 0.97

P(J | H, B):
  h = 1, b = 1:  P(J=1 | h, b) = 0.8    P(J=0 | h, b) = 0.2
  h = 1, b = 0:  P(J=1 | h, b) = 0.8    P(J=0 | h, b) = 0.2
  h = 0, b = 1:  P(J=1 | h, b) = 0.3    P(J=0 | h, b) = 0.7
  h = 0, b = 0:  P(J=1 | h, b) = 0.3    P(J=0 | h, b) = 0.7
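Given these tables, any joint or marginal probability can be computed by
enumerating the variable assignments. A minimal sketch, with the variable
names mirroring the tables above:

from itertools import product

# CPTs from the example above (1 = true, 0 = false).
p_h = {1: 0.05, 0: 0.95}
p_b_given_h = {(1, 1): 0.95, (0, 1): 0.05,        # keys are (b, h)
               (1, 0): 0.03, (0, 0): 0.97}
p_j_given_hb = {(1, 1, 1): 0.8, (0, 1, 1): 0.2,   # keys are (j, h, b)
                (1, 1, 0): 0.8, (0, 1, 0): 0.2,
                (1, 0, 1): 0.3, (0, 0, 1): 0.7,
                (1, 0, 0): 0.3, (0, 0, 0): 0.7}

def joint(h, b, j):
    """P(H=h, B=b, J=j), factorized along the network structure."""
    return p_h[h] * p_b_given_h[(b, h)] * p_j_given_hb[(j, h, b)]

# Marginal P(J=1) and posterior P(H=1 | J=1) by enumeration.
p_j1 = sum(joint(h, b, 1) for h, b in product((0, 1), repeat=2))
p_h1_j1 = sum(joint(1, b, 1) for b in (0, 1)) / p_j1
print(round(p_j1, 3), round(p_h1_j1, 3))   # ~0.325 and ~0.12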
• Challenges
o Efficient ways to use BNs
o Ways to create BNs
o Ways to maintain BNs
o Reason about time
8. Neural Networks: Classification by Backpropagation
• Perceptron
One of the simplest neural networks
No hidden layers: the inputs are connected directly to the output
• Supervised learning (a minimal training sketch is given below)
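A minimal sketch of the classical perceptron learning rule, assuming
numeric inputs and binary (0/1) class labels; the example data set is
hypothetical:

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """samples: list of feature vectors; labels: 0/1 class labels."""
    w = [0.0] * len(samples[0])   # weights
    b = 0.0                       # bias
    for _ in range(epochs):
        for x, t in zip(samples, labels):
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = t - y           # update only on misclassification
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Example: learn the logical AND function.
w, b = train_perceptron([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1])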
• Algorithms: Propagation, Backpropagation, Gradient
Descent
• Example:
8.2. Backpropagation Algorithm
• A Neuron:
[Figure: a single neuron j with inputs x0, x1, ..., xn, connection
weights w0, w1, ..., wn, a bias θj, a weighted-sum node Σ, an activation
function f, and an output y.]
• Algorithm:
o Initialize the weights in the NN and the bias associated
with each neuron. The values are generally chosen as small
random numbers, e.g., between –1 and +1 or –0.5 and +0.5.
o For each sample X, propagate the inputs forward:
The net input and output of each neuron j in the
hidden and output layers are computed as

I_j = \sum_{i} w_{ij} O_i + \theta_j
Where
wij is the weight of the connection from unit
i in the previous layer to unit j
Oi is the output of unit i from the previous
layer.
θj is the bias of the unit: it is used to vary
the activity of the unit.
o Backpropagate the error. For a unit j in a hidden layer,
where Errk is the error of unit k in the next layer:

Err_j = O_j (1 - O_j) \sum_{k} Err_k \, w_{jk}
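A compact sketch of one training step for a network with a single hidden
layer of sigmoid units, following the two formulas above. The sigmoid
output O_j = 1/(1 + e^{-I_j}), the output-layer error
Err_j = O_j(1 - O_j)(T_j - O_j), and the update w_ij <- w_ij + lr * Err_j * O_i
are the standard companion rules and are assumed here:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, w_h, b_h, w_o, b_o, lr=0.5):
    """One forward + backward pass for a one-hidden-layer network.

    x, target: input and desired output vectors (numpy arrays)
    w_h, b_h:  hidden-layer weights (n_in x n_hid) and biases
    w_o, b_o:  output-layer weights (n_hid x n_out) and biases
    """
    # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
    o_h = sigmoid(x @ w_h + b_h)
    o_o = sigmoid(o_h @ w_o + b_o)

    # Output-layer error: Err_j = O_j (1 - O_j) (T_j - O_j)
    err_o = o_o * (1 - o_o) * (target - o_o)
    # Hidden-layer error: Err_j = O_j (1 - O_j) sum_k Err_k w_jk
    err_h = o_h * (1 - o_h) * (err_o @ w_o.T)

    # Weight and bias updates: w_ij += lr * Err_j * O_i
    w_o += lr * np.outer(o_h, err_o)
    b_o += lr * err_o
    w_h += lr * np.outer(x, err_h)
    b_h += lr * err_h
    return o_o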
9. Prediction
• Linear regression: Y = a + b X
o Two parameters, a and b, specify the line and are to be
estimated using the data at hand.
o Apply the least squares criterion to the known values
Y1, Y2, …, X1, X2, … (see Example 7.6 on page 320 of
your textbook); a least-squares sketch is given after this list.
• Non-linear models:
o Many non-linear models (e.g., polynomial regression) can be
transformed into a linear model
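A minimal least-squares sketch for the linear model Y = a + bX; the
sample x/y values are hypothetical and used only for illustration:

def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data (x = years of experience, y = salary in $1000s).
a, b = fit_line([1, 2, 3, 4, 5], [25, 30, 34, 41, 45])
print(a, b)   # the fitted line predicts y = a + b*x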
10. Classification Accuracy: Estimating Error Rates
• Partition: Training-and-testing
o Use two independent data sets, e.g., training set (2/3) and
test set (1/3)
o Used for data sets with a large number of samples
• Cross-validation
o Divide the data set into k subsamples
o Use k–1 subsamples as training data and one subsample
as test data (k-fold cross-validation; a splitting sketch
follows this list)
o Used for data sets of moderate size
• Leave-one-out and bootstrapping
o Leave-one-out is k-fold cross-validation with k equal to
the number of samples; bootstrapping samples the training
data with replacement
o Both are used for small data sets
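A minimal sketch of producing k-fold train/test splits by index
(shuffling and stratification are omitted for brevity):

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        end = (i + 1) * fold_size if i < k - 1 else n_samples
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(test_idx))   # 8 and 2 on each fold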
• Confusion Matrix:
o This matrix shows not only how well the classifier predicts
the different classes overall, but also which classes are
confused with which
o It summarizes the relationship between the actual and the
detected (predicted) classes:
                              Detected
                     Positive               Negative
Actual   Positive    A: True Positive       B: False Negative
         Negative    C: False Positive      D: True Negative
• The recall (or the true positive rate) and the precision (or
the positive predictive rate) can be derived from the
confusion matrix as follows:
o Recall = \frac{A}{A + B}

o Precision = \frac{A}{A + C}
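A short sketch computing these two rates from the cells of the confusion
matrix; the example counts are hypothetical:

def recall_precision(tp, fn, fp):
    """Recall = TP/(TP+FN); Precision = TP/(TP+FP)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision

# Hypothetical counts: A = 40 true positives, B = 10 false negatives,
# C = 5 false positives (D, the true negatives, is not needed here).
print(recall_precision(40, 10, 5))   # (0.8, ~0.889)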