Unit 3 Classification
Supervised Learning
General Approach to Solving a
Classification Problem
• A classification technique (or classifier) is a
systematic approach to building classification models
from an input data set.
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
• Each technique employs a learning algorithm
to identify a model that best fits the
relationship between the attribute set and
class label of the input data.
• The key objective of the learning algorithm is to
build models with good generalization
capability, i.e., models that accurately predict
the class labels of previously unknown records.
• First, a training set
consisting of records whose
class labels are known must
be provided.
• The training set is used to
build a classification model,
which is subsequently
applied to the test set,
which consists of records
with unknown class labels.
• Evaluation of the performance of a classification
model is based on the counts of test records
correctly and incorrectly predicted by the
model. These counts are tabulated in a table
known as a confusion matrix.
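Tabulating these counts can be sketched in a few lines of Python (the function name and the tiny label lists are illustrative, not from the slides):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Tabulate counts of (actual, predicted) label pairs."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
m = confusion_matrix(actual, predicted, ["yes", "no"])
# rows are actual classes, columns are predicted classes
accuracy = (m[0][0] + m[1][1]) / len(actual)  # correct predictions / total
```

Accuracy is then just the diagonal of the matrix divided by the total number of test records.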
Issues regarding classification and prediction
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
10/27/22 12
Evaluation of Classifiers
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
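Testing a sample's attribute values against the tree can be sketched as follows (the nested-dict tree representation and the small age tree are illustrative assumptions, not from the slides):

```python
def classify(tree, record):
    """Walk the tree: an internal node is a dict holding an attribute
    test and its branches; a leaf is a class label."""
    while isinstance(tree, dict):
        tree = tree["branches"][record[tree["attr"]]]
    return tree

# hypothetical one-level tree for the age example
tree = {"attr": "age",
        "branches": {"<=30": "no", "30..40": "yes", ">40": "no"}}
classify(tree, {"age": "30..40"})  # -> "yes"
```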
Training Dataset
[Figure: a decision tree built from the training data. The root node tests
age? with branches <=30, 30..40, and >40; the leaves carry the class labels
yes / no.]
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
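The greedy top-down procedure above can be sketched for categorical attributes as follows (a minimal sketch, using weighted entropy as the selection measure; all function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(records, labels, attributes):
    """records: list of dicts of categorical attribute values."""
    if len(set(labels)) == 1:            # all samples belong to one class
        return labels[0]
    if not attributes:                    # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    def split_entropy(attr):              # weighted entropy after splitting
        parts = {}
        for r, y in zip(records, labels):
            parts.setdefault(r[attr], []).append(y)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    best = min(attributes, key=split_entropy)   # maximize information gain
    node = {"attr": best, "branches": {}}
    for v in {r[best] for r in records}:        # partition recursively
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        node["branches"][v] = build_tree([r for r, _ in sub],
                                         [y for _, y in sub],
                                         [a for a in attributes if a != best])
    return node
```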
Attribute Selection Measure
Information Gain (ID3/C4.5)
I(p, n) = − (p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
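The expected information I(p, n) for a node with p positive and n negative examples can be computed directly (a minimal sketch; the 9/5 split is just an example count):

```python
import math

def info(p, n):
    """Expected information I(p, n) to classify p positive
    and n negative training examples."""
    def term(x):
        # by convention, 0 * log2(0) is taken to be 0
        return 0.0 if x == 0 else -(x / (p + n)) * math.log2(x / (p + n))
    return term(p) + term(n)

info(9, 5)  # ~0.940 bits for a node with 9 positive and 5 negative examples
```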
Information Gain in Decision Tree
Induction
Gini Index (IBM IntelligentMiner)
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree
—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size
Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.
Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
[Figure: splitting based on an ordinal attribute. A binary split on Size
groups {Small, Large} against {Medium}. What about this split? It violates
the order Small < Medium < Large, so such groupings are normally avoided.]
Splitting Based on Continuous Attributes
[Figure: two options. A binary split tests Taxable Income > 80K? with
branches Yes / No; a multi-way split on Taxable Income? forms ranges
such as < 10K, ..., > 80K.]
Greedy approach:
– Nodes with homogeneous class distribution are
preferred
Need a measure of node impurity:
C0: 5, C1: 5 → non-homogeneous, high degree of impurity
C0: 9, C1: 1 → homogeneous, low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
How to Find the Best Split
[Figure: before splitting, the node has impurity measure M0. A binary test
A? (Yes / No) yields children with impurities M1 and M2, combined into M12;
a test B? yields M3 and M4, combined into M34.]
Gain = M0 – M12 vs. M0 – M34: pick the test with the higher gain.
Measure of Impurity: GINI
GINI(t) = 1 − Σj [ p(j | t) ]²
(NOTE: p( j | t) is the relative frequency of class j at node t).
– Maximum (1 - 1/nc) when records are equally distributed
among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information
C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
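Computing the Gini index from a node's class counts is a one-liner (a minimal sketch; the function name is illustrative):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# the four class distributions from the slide
[round(gini(c), 3) for c in ([0, 6], [1, 5], [2, 4], [3, 3])]
# -> [0.0, 0.278, 0.444, 0.5]
```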
Examples for Computing GINI
GINI(t) = 1 − Σj [ p(j | t) ]²
Parent node: C1 = 6, C2 = 6, Gini = 0.500
Binary split B? (Yes / No) into Node N1 and Node N2:
N1: C1 = 5, C2 = 2; N2: C1 = 1, C2 = 4
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
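The quality of a split is the weighted average of the children's Gini values, weighted by each child's share of the records (note each child's Gini uses that child's own record count as the denominator). A minimal sketch, using the child counts N1 = (5, 2) and N2 = (1, 4):

```python
def gini(counts):
    """Gini index from the class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted average of the children's Gini values,
    weighted by each child's share of the records."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

gini_split([[5, 2], [1, 4]])  # split B?: ~0.371
```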
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the
dataset
Use the count matrix to make decisions
Bayes Classifier
• A probabilistic framework for solving
classification problems
• Conditional probability:
P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)
• Bayes theorem:
P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
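The arithmetic of the meningitis example is a direct application of Bayes theorem (the function name is illustrative):

```python
def posterior(p_s_given_m, p_m, p_s):
    """Bayes theorem: P(M|S) = P(S|M) P(M) / P(S)."""
    return p_s_given_m * p_m / p_s

posterior(0.5, 1 / 50000, 1 / 20)  # -> 0.0002
```

Despite the strong symptom likelihood P(S|M) = 0.5, the tiny prior makes the posterior only 0.0002.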
Bayesian Classifiers
• Consider each attribute and class label as random
variables
• Given a record with attributes (A1, A2, ..., An), the
goal is to predict class C; specifically, to find the
value of C that maximizes P(C | A1, A2, ..., An)
How to Estimate Probabilities from Data?
• Class-conditional probabilities are estimated by
counting in the training data, e.g.
P(Status=Married | No) = 4/7 and P(Refund=Yes | Yes) = 0
[Figure: the vertebrate test record Give Birth = yes, Can Fly = no,
Live in Water = yes, Have Legs = no. Since P(A | M) P(M) > P(A | N) P(N),
the record is classified as a mammal.]
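Estimating the probabilities by counting and picking the class that maximizes P(C) ∏ P(Ai | C) can be sketched as follows (a minimal sketch without smoothing; the tiny dataset in the usage below is illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(C) from class frequencies and P(Ai = v | C) by counting."""
    prior = Counter(labels)                  # class -> count
    cond = defaultdict(Counter)              # (class, attribute) -> value counts
    for r, y in zip(records, labels):
        for a, v in r.items():
            cond[(y, a)][v] += 1
    return prior, cond

def classify_nb(record, prior, cond, n):
    """Pick the class C maximizing P(C) * prod_i P(Ai = v_i | C)."""
    best, best_p = None, -1.0
    for c, count in prior.items():
        p = count / n
        for a, v in record.items():
            p *= cond[(c, a)][v] / count     # no smoothing: unseen value -> 0
        if p > best_p:
            best, best_p = c, p
    return best
```

In practice Laplace smoothing is used so that a single unseen attribute value (like P(Refund=Yes | Yes) = 0 above) does not zero out the whole product.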
Naïve Bayes (Summary)
• Robust to isolated noise points
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables)
are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at a time,
considering the most important attributes first
Bayesian Belief Networks (I)
[Figure: a belief network with parent nodes Family History and Smoker, a
conditional probability table whose columns are (FH, S), (FH, ~S), (~FH, S),
(~FH, ~S), and descendant nodes PositiveXRay and Dyspnea.]
Bayesian Belief Networks (II)
• A Bayesian belief network allows a subset of the variables
to be conditionally independent
• A graphical model of causal relationships
• Several cases of learning Bayesian belief networks
– Given both network structure and all the variables: easy
– Given network structure but only some variables
– When the network structure is not known in advance
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: to classify a test record, compute its distance to the training
records, identify the k nearest neighbors, and use their class labels to
decide the label of the test record.]
• Euclidean distance between two points p and q:
d(p, q) = √( Σi (pi − qi)² )
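A nearest-neighbor classifier needs only the distance function and a vote (a minimal sketch; the function names and the toy points in the test are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(test, train, labels, k=3):
    """Majority vote among the k training records closest to the test record."""
    nearest = sorted(range(len(train)),
                     key=lambda i: euclidean(test, train[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```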
Nearest Neighbor Classification…
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
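Min-max scaling is one common fix: rescale every attribute to [0, 1] before computing distances (a minimal sketch; the height and income values echo the examples above):

```python
def min_max_scale(columns):
    """Rescale each attribute column to [0, 1] so that no single
    attribute dominates the distance computation."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled.append([(x - lo) / (hi - lo) for x in col])
    return scaled

heights = [1.5, 1.65, 1.8]              # metres
incomes = [10_000, 505_000, 1_000_000]  # dollars
h, inc = min_max_scale([heights, incomes])  # both now span [0, 1]
```

Without this step, raw income differences of hundreds of thousands would swamp height differences of fractions of a metre.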
Nearest Neighbor Classification…
• Problem with Euclidean measure:
– High dimensional data
• curse of dimensionality
– Can produce counter-intuitive results
111111111110 vs 011111111111 → d = 1.4142
100000000000 vs 000000000001 → d = 1.4142
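The counter-intuitive result can be checked directly: both pairs differ in exactly two positions, so Euclidean distance treats a pair sharing eleven 1-bits the same as a pair sharing almost nothing (a minimal sketch; the helper name is illustrative):

```python
import math

def euclid_bits(a, b):
    """Euclidean distance between two equal-length bit strings."""
    return math.sqrt(sum((int(x) - int(y)) ** 2 for x, y in zip(a, b)))

euclid_bits("111111111110", "011111111111")  # -> sqrt(2) = 1.4142...
euclid_bits("100000000000", "000000000001")  # -> sqrt(2) = 1.4142...
```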