Classification
Contents
• Classification vs. Prediction
• Classification—A Two-Step Process
• Supervised vs. Unsupervised Learning
• Decision Tree Induction
• Attribute Selection Measures
• Bayesian Classification
• Rule Based Classification
• Classification by Back Propagation
• Support Vector Machines
• Associative Classification
• Lazy Learners – k-Nearest Neighbor Classifiers
• Prediction
• Accuracy and Error Measures
Classification vs. Prediction
• Classification
– predicts categorical class labels (discrete or
nominal)
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
• Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
Classification vs. Prediction
• Classification predicts categorical (discrete,
unordered) labels.
– Ex: categorize bank loan applications as safe or risky.
– Classification is the data analysis task in which a model or
classifier is constructed to predict categorical labels
such as “safe”, “risky”, “yes”, “no”, “treatment A”, etc.
• Prediction models continuous-valued functions
– Ex : predict the expenditures in dollars of potential
customers on computer equipment given their income
and occupation.
– The model is a “predictor”
– Regression Analysis is typically used for prediction
Classification—A Two-Step Process
• Model construction (learning step): build a model describing a set of
predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage (Classification) : for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
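A minimal Python sketch of this two-step process (illustrative only: scikit-learn, the iris data, the 70/30 split, and the decision-tree learner are assumed choices, not part of these slides):

# Step 1 (learning): build a classifier from a training set.
# Step 2 (classification): estimate accuracy on an independent test set,
# then use the model on data whose class labels are unknown.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)          # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))        # compare known vs. predicted labels
print(f"Accuracy on the independent test set: {accuracy:.2f}")  # if acceptable, classify unseen tuples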
Training Set
• Training Set :
- Since the class label of each training tuple is provided, this step is also
known as supervised learning
(unsupervised learning, in contrast, corresponds to clustering)
[Figure: the learned classifier is first applied to testing data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc.
with the aim of establishing the existence of classes
or clusters in the data
Preparing the Data for Classification or Prediction
• Data cleaning – reduce noise and handle missing values
• Attribute relevance analysis – use correlation analysis or attribute
subset selection to find a reduced set of attributes
• Data transformation and reduction – use normalization, methods
involving distance measures, or concept hierarchies to generalize
the data to higher-level concepts
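A small Python sketch of these preparation steps (pandas is assumed available; the column names, values, and the 0.1 relevance threshold are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 52_000, None, 78_000, 61_000],
    "age":    [25, 38, 45, 52, 33],
    "zip":    [11111, 22222, 33333, 44444, 55555],   # likely irrelevant to the class
    "buys":   [0, 1, 1, 1, 0],                       # class label
})

# Data cleaning: fill the missing income value with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: min-max normalize income into [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Attribute relevance analysis: drop attributes nearly uncorrelated with the class.
corr = df.corr(numeric_only=True)["buys"].abs()
irrelevant = [c for c in corr.index if c != "buys" and corr[c] < 0.1]
df = df.drop(columns=irrelevant)
print(df)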
Decision Tree Induction
• In the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine
learning, developed ID3 (Iterative Dichotomiser)
– uses entropy as a measure of how informative a node is
• C4.5 is a benchmark to which new supervised learning algorithms are
often compared
– an extension of ID3: accounts for unavailable values, continuous
attributes, pruning of decision trees, and rule derivation
– does not necessarily generate a binary tree
• Classification and Regression Trees (CART)
– uses the Gini index to determine the best split (see the sketch after this list)
– builds binary decision trees
• These algorithms adopt a greedy (i.e., non-backtracking) approach in which
decision trees are constructed in a top-down, recursive, divide-and-conquer manner
– the training set is recursively partitioned into smaller subsets as the tree is
being built
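As referenced above, a short Python sketch of the Gini index used by CART (standard definitions; the candidate binary split shown is hypothetical):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_i p_i^2 over the class proportions in D
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    # size-weighted Gini of a candidate binary split D -> (D1, D2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes"] * 9 + ["no"] * 5                 # 9 'yes', 5 'no', as in the training data below
print(round(gini(labels), 3))                     # impurity before splitting, about 0.459
print(round(gini_split(["yes"] * 6 + ["no"],      # one hypothetical binary split
                       ["yes"] * 3 + ["no"] * 4), 3))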
Classification by Decision Tree Induction
• A decision tree is a flowchart-like structure
- each internal node denotes a test on an attribute
- each branch represents an outcome of the test
- each leaf node holds a class label
[Figure: example decision tree with branches labeled <=30, 31..40, and >40 and leaves labeled no / yes / yes]
Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3:

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information
gain)
– The splitting criterion indicates the splitting attribute and may also indicate a split point or
splitting subset
– A partition is pure if all of its tuples belong to the same class
– The splitting attribute A can be
• Discrete-valued: the outcomes of the test are the known values of A
• Continuous-valued: the outcomes are A <= split_point and A > split_point
• Discrete-valued with a binary tree required: the test is of the form “A ∈ S_A”, where S_A is
the splitting subset, giving yes and no outcomes
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed for
classifying the leaf
– There are no tuples left, i.e., the partition is empty
Generate_Decision_tree
• Input :
– Data partition D which is a set of training
tuples and their associated class label
– attribute list, the set of candidate attributes
– Attribute_selection_method – a procedure to
determine the splitting criterion that best
partitions the data tuples into individual
classes. The criterion consists of a splitting
attribute and the split point or splitting subset
• Output : a decision tree
Generate_Decision_tree : Method
Create a node N;
If the tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
If attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
Apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
Label node N with splitting_criterion;
If splitting_attribute is discrete-valued and multiway splits are allowed then
    attribute_list <- attribute_list – splitting_attribute;
For each outcome j of splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_Decision_tree(Dj, attribute_list) to node N;
End for
Return N;
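A compact Python rendering of this pseudocode, assuming discrete attributes, multiway splits, and information gain as the Attribute_selection_method (ID3-style); the tuple-as-dict representation and all names are my own:

import math
from collections import Counter

def entropy(tuples, label):
    counts = Counter(t[label] for t in tuples)
    n = len(tuples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_attribute(tuples, attributes, label):
    # Attribute_selection_method: choose the attribute with the highest information gain
    base = entropy(tuples, label)
    def gain(a):
        info_a = 0.0
        for v in {t[a] for t in tuples}:
            dj = [t for t in tuples if t[a] == v]
            info_a += len(dj) / len(tuples) * entropy(dj, label)
        return base - info_a
    return max(attributes, key=gain)

def generate_decision_tree(D, attribute_list, label):
    classes = [t[label] for t in D]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:              # all tuples belong to the same class C
        return classes[0]
    if not attribute_list:                  # no attributes left: majority voting
        return majority
    a = best_attribute(D, attribute_list, label)        # splitting attribute
    remaining = [x for x in attribute_list if x != a]   # remove the discrete splitting attribute
    node = {a: {}}
    for v in {t[a] for t in D}:             # one branch per outcome of the test on a
        Dj = [t for t in D if t[a] == v]
        node[a][v] = generate_decision_tree(Dj, remaining, label)
    return node

# e.g. tree = generate_decision_tree(data, ["age", "income", "student", "credit_rating"], "buys_computer")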
Pros and Cons of Decision Tree Classification
• Advantages
– Construction of Decision tree does not require any domain
knowledge or parameter setting and so is appropriate for
exploratory knowledge discovery
– Can handle high dimensional data
– are able to generate understandable rules
– able to handle both numerical and categorical attributes
– indicate which fields are important for prediction or classification
• Disadvantages
– Error-prone when there are few training samples per class
– Can be computationally expensive.
Attribute Selection Measure
• A heuristic for selecting the splitting criterion that best separates a
given data partition D
• Ideally, each resulting partition should be pure
• Information gain
– The attribute with the highest information gain is chosen as the
splitting attribute for node N
– This attribute minimizes the information required to classify the
tuples in the resulting partitions
– A log function to the base 2 is used because
information is encoded in bits
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i,
estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
• Information still needed after using attribute A to split D into v partitions D_1, …, D_v:
  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on A: Gain(A) = Info(D) − Info_A(D)
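A worked check of these formulas on the 14-tuple buys_computer training data shown earlier (9 “yes”, 5 “no”); the arithmetic follows directly from the definitions:

from math import log2

def info(counts):                     # Info over a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

info_D = info([9, 5])                 # about 0.940 bits
# splitting on age: <=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
info_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])   # about 0.694 bits
gain_age = info_D - info_age          # about 0.246 bits, the highest gain among the
print(round(gain_age, 3))             # attributes, so age is selected at the root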
Rule Based Classification: Rule Extraction from a Decision Tree
• One rule is created for each path from the root to a leaf: each
attribute-value pair along the path forms a conjunction; the leaf holds the
class prediction
• Rules are mutually exclusive and exhaustive
Naïve Bayesian Classifier
• Assumes class-conditional independence of the attributes, so that
  P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)
• This greatly reduces the computation cost: only the class distributions
need to be counted
• If A_k is categorical, P(x_k | C_i) is the number of tuples in C_i having value
x_k for A_k divided by |C_i,D| (the number of tuples of C_i in D)
• If A_k is continuous-valued, P(x_k | C_i) is usually computed from a
Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) e^( −(x−μ)² / (2σ²) )
  and P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})
Naïve Bayesian Classifier: Training Dataset
Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Naïve Bayesian Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
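The example can be completed by counting the class-conditional probabilities for X = (age <= 30, income = medium, student = yes, credit_rating = fair) from the same training table; a short sketch of the arithmetic:

p_yes, p_no = 9/14, 5/14                         # the priors computed above

# P(xk | Ci) = tuples of class Ci with value xk, divided by |Ci,D|
px_given_yes = (2/9) * (4/9) * (6/9) * (6/9)     # age<=30, income=medium, student=yes, credit=fair
px_given_no  = (3/5) * (2/5) * (1/5) * (2/5)

print(round(px_given_yes * p_yes, 3))            # about 0.028
print(round(px_given_no * p_no, 3))              # about 0.007 -> predict buys_computer = "yes"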
Classification by Back Propagation
[Figure: a multi-layer feed-forward network with an input layer (fed by input vector X), a hidden layer, and an output layer; weights w_ij connect unit i in one layer to unit j in the next.]
• Net input to a hidden or output unit j (w.r.t. the previous layer i): I_j = Σ_i w_ij O_i + θ_j
• Output of unit j (logistic function): O_j = 1 / (1 + e^(−I_j))
• Error of an output-layer unit j: Err_j = O_j (1 − O_j)(T_j − O_j)
• Error of a hidden-layer unit j (w.r.t. the next layer k): Err_j = O_j (1 − O_j) Σ_k Err_k w_jk
• Weight update: Δw_ij = (l) Err_j O_i ;  w_ij = w_ij + Δw_ij
• Bias update: Δθ_j = (l) Err_j ;  θ_j = θ_j + Δθ_j
Back Propagation Algorithm
Input :
D, a data set consisting of the training tuples and associated target values
l, the learning rate
Network, a multi layer feed-forward network
Output :
A trained neural network
Method :
Initialize all weights and biases in network
While terminating condition is not satisfied {
for each training tuple X in D {
// Propagate the inputs forward:
for each input layer unit j {
Oj = Ij; } // the output of an input unit is its actual input value
for each hidden or output layer unit j {
Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
Oj = 1 / (1 + e^(−Ij)); } // compute the output of each unit j
Back Propagation Algorithm
// Back propagate the errors :
for each unit j in the output layer
Errj= Oj(1-Oj)(Tj-Oj); // Compute the error
for each unit j in the hidden layers from last to first hidden layer
Errj = Oj(1-Oj) Σk Errk wjk; // Compute error with resp. to next
higher layer,k
for each weight wij in network {
Δwij = (l) ErrjOi; // weight increment
wij = wij + Δwij ; } // weight update
For each bias θj in network {
Δθj = (l) Errj;
θj = θj + Δθj;
}}
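A minimal NumPy sketch of one forward/backward pass for a single-hidden-layer network, following the pseudocode above (the two-input toy tuple, layer sizes, and learning rate l = 0.5 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, l = 2, 3, 1, 0.5                  # l is the learning rate

# initialize weights and biases with small random numbers in [-1, 1]
W1, b1 = rng.uniform(-1, 1, (n_in, n_hidden)), rng.uniform(-1, 1, n_hidden)
W2, b2 = rng.uniform(-1, 1, (n_hidden, n_out)), rng.uniform(-1, 1, n_out)

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))             # Oj = 1 / (1 + e^(-Ij))

X = np.array([0.0, 1.0])                                 # one training tuple
T = np.array([1.0])                                      # its target value

# propagate the inputs forward: Ij = sum_i wij*Oi + theta_j, Oj = sigmoid(Ij)
O1 = sigmoid(X @ W1 + b1)                                # hidden layer outputs
O2 = sigmoid(O1 @ W2 + b2)                               # output layer outputs

# back propagate the errors
Err2 = O2 * (1 - O2) * (T - O2)                          # output units: Oj(1-Oj)(Tj-Oj)
Err1 = O1 * (1 - O1) * (W2 @ Err2)                       # hidden units: Oj(1-Oj) * sum_k Errk*wjk

# weight and bias updates: delta_wij = l*Errj*Oi, delta_theta_j = l*Errj
W2 += l * np.outer(O1, Err2); b2 += l * Err2
W1 += l * np.outer(X, Err1);  b1 += l * Err1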
Important points
• Initialize the weights with small random numbers ranging from -1.0
to 1.0
• θj is the bias of the unit – it acts as a threshold used to vary the activity of the unit
• The logistic or sigmoid function - squashing function - maps a large
input domain to a smaller range between 0 & 1
• l is the learning rate – typically value between 0.0 and 1.0 - can be
set to 1/t where t is the number of iterations through the training set
so far
• Terminating condition
– When increments in weights are too small
– Number of tuples misclassified in the previous epoch is below some threshold
– A prespecified number of epochs has expired
Neural Network as a Classifier
• Advantages
– High tolerance to noisy data
– Ability to classify untrained patterns
– Well-suited for continuous-valued inputs and outputs
– Successful on a wide array of real-world data
– Algorithms are inherently parallel
– Techniques have recently been developed for the extraction of
rules from trained neural networks
• Disadvantages
– Long training time
– Require a number of parameters that are typically best determined
empirically, e.g., the network topology or “structure”
– Poor interpretability: difficult to interpret the symbolic meaning behind
the learned weights and the “hidden units” in the network
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their
ability to model complex nonlinear decision boundaries (margin
maximization)
• Used both for classification and prediction
• Applications:
– handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
SVM—General Philosophy
[Figure: possible separating hyperplanes for a two-class dataset; the SVM searches for the hyperplane with the maximum margin between the classes.]

Prediction: Linear Regression
• Straight-line regression models the response y as a linear function of a
single predictor variable x: y = w0 + w1 x
• The regression coefficients are estimated by the method of least squares:
  w1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)² ,   w0 = ȳ − w1 x̄
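A least-squares fit of y = w0 + w1·x using the closed-form coefficients above; the (income, spend) pairs are made-up illustrative values:

xs = [30.0, 45.0, 60.0, 75.0, 90.0]     # e.g., income in $1000s
ys = [1.2, 1.8, 2.1, 2.9, 3.4]          # e.g., computer-equipment spend in $1000s

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print(f"predictor: y = {w0:.3f} + {w1:.3f} * x")
print("predicted spend for income 70:", round(w0 + w1 * 70.0, 3))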
Ensemble Methods
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous
classifiers
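A small bagging sketch: train k classifiers on bootstrap samples and combine them by majority vote (scikit-learn is assumed available; the base learner, k = 7, and the iris data are illustrative choices):

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
k = 7
models = []
for i in range(k):
    Xb, yb = resample(X, y, random_state=i)               # bootstrap sample (with replacement)
    models.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))

def bagged_predict(x):
    votes = [int(m.predict([x])[0]) for m in models]      # each learned model Mi casts one vote
    return Counter(votes).most_common(1)[0][0]            # majority vote gives the combined model M*

print(bagged_predict(X[0]), y[0])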