Week 5 - Classification
Supervised vs. Unsupervised Learning
• Supervised learning
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Classification - predicts categorical class labels
• Regression – predicts continuous values
• Unsupervised learning
• The class labels of the training data are unknown
• E.g. Clustering: Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or clusters in the data
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
[Figure: a classification algorithm builds a classifier from the training data.]
Process (2): Using the Model in Prediction
[Figure: the classifier is applied to the test data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
Decision Tree Induction: An Example
[Figure: a training set with attributes age, income, student, and credit_rating and class label buys_computer, together with the decision tree induced from it (leaves labeled no / yes).]
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
• Expected information (entropy) needed to classify a tuple in $D$:
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
• Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
• Information gained by branching on attribute $A$:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
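To make these formulas concrete, here is a minimal Python sketch (the function and variable names are ours, not from the slides) that computes Info(D) and Gain(A) for a categorical attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the categorical attribute at attr_index."""
    n = len(labels)
    # Partition the class labels by the value of attribute A
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```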
Comparing Attribute Selection Measures
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
• Use validation test set of class-labeled tuples instead of training
set when assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
• Accuracy (recognition rate) = (TP + TN)/All: percentage of test set tuples that are correctly classified
• Sensitivity = TP/P: true positive recognition rate (P = TP + FN)
• Specificity = TN/N: true negative recognition rate (N = FP + TN)
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: $\text{Precision} = \frac{TP}{TP + FP}$
• Recall: completeness – what % of positive tuples the classifier labeled as positive: $\text{Recall} = \frac{TP}{TP + FN}$
• F-measure ($F_1$): harmonic mean of precision and recall: $F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
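As a worked example, a small Python sketch of these metrics (the function name and argument order are ours), computed directly from confusion-matrix counts:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # exactness
    recall = tp / (tp + fn)      # completeness (= sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```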
Evaluating Classifier Accuracy
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
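As an illustration, a minimal holdout split in Python (names are ours; random subsampling simply repeats this k times and averages the accuracies):

```python
import random

def holdout_split(data, train_frac=2/3, seed=42):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)
```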
Model Selection Using Statistical Tests of Significance
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Estimating Confidence Intervals: Null Hypothesis
• Null hypothesis: M1 and M2 are the same, i.e., the observed difference in error rates is due to chance
• If the null hypothesis can be rejected, the difference between M1 and M2 is statistically significant
Estimating Confidence Intervals: t-test
• Pairwise comparison:
• For the $i$-th round of 10-fold cross-validation, the same cross partitioning is used to obtain $err(M_1)_i$ and $err(M_2)_i$
• Average over 10 rounds to get $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• The t-test computes the t-statistic with $k-1$ degrees of freedom:
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$
where
$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$$
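A short Python sketch of this paired t-statistic (the function name is ours; err1[i] and err2[i] are the per-fold error rates of M1 and M2 on the same partitions):

```python
import math

def paired_t_statistic(err1, err2):
    """t-statistic with k-1 degrees of freedom for paired per-fold errors."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k
    return mean_diff / math.sqrt(var / k)
```

Compare |t| against the critical value from the t-distribution table (next slide) to decide whether to reject the null hypothesis.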
Estimating Confidence Intervals: Table for t-distribution
• The t-distribution is symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit: z = critical value(sig/2, df)
Estimating Confidence Intervals: Statistical Significance
• If the computed t value lies in the rejection region (|t| > z), reject the null hypothesis: the difference between M1 and M2 is statistically significant, so select the model with the lower error rate
• Otherwise, conclude that any observed difference is due to chance
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
with numerical outputs
• Shows the trade-off between the true
positive rate and the false positive rate
• Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
• The area under the ROC curve (AUC)
is a measure of the accuracy of the
model
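A minimal Python sketch of this ranking procedure (names are ours; scores are the classifier's numerical outputs, labels are 1 for positive and 0 for negative tuples):

```python
def roc_points(scores, labels):
    """Rank tuples by decreasing score and trace out (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```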
More Measures in Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging
• Boosting
Bagging
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• Majority voting: The bagged classifier M* counts the votes and assigns
the class with the most votes to X
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proven to give improved accuracy in prediction
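A hedged Python sketch of this procedure (names are ours; learn is any routine that fits a base classifier and returns it as a callable):

```python
import random
from collections import Counter

def bagging_train(data, labels, k, learn, seed=0):
    """Learn k models, each from a bootstrap sample of d tuples drawn
    with replacement from the d-tuple training set D."""
    rng = random.Random(seed)
    n = len(data)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample D_i
        models.append(learn([data[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority voting: assign the class with the most votes to x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```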
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where the weight of classifier $M_i$'s vote is a function of its accuracy:
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j), \qquad W_i = \log\frac{1 - error(M_i)}{error(M_i)}$$
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
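A minimal AdaBoost-style reweighting step in Python (names are ours; miss[j] flags whether tuple j was misclassified by the classifier M_i just learned; it assumes 0 < error < 1):

```python
import math

def boosting_reweight(weights, miss):
    """Update tuple weights so the next classifier M_{i+1} pays more
    attention to the tuples misclassified by M_i."""
    error = sum(w for w, m in zip(weights, miss) if m) / sum(weights)
    vote_weight = math.log((1 - error) / error)  # weight of M_i's vote
    updated = [w * math.exp(vote_weight) if m else w
               for w, m in zip(weights, miss)]
    total = sum(updated)  # normalize so the weights sum to 1
    return [w / total for w in updated], vote_weight
```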
Classification of Class-Imbalanced Data Sets
Multiclass Classification - from binary
• Classification involving more than two classes (i.e., > 2 Classes)
• Method 1. One-vs.-all (OVA): Learn a classifier one at a time
• Given m classes, train m classifiers: one for each class
• Classifier j: treat tuples in class j as positive & all others as negative
• To classify a tuple X, the set of classifiers vote as an ensemble
• Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
• Given m classes, construct m(m-1)/2 binary classifiers
• A classifier is trained using tuples of the two classes
• To classify a tuple X, each classifier votes. X is assigned to the class with
maximal vote
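A sketch of one-vs.-all in Python (names are ours; learn_binary fits a binary classifier on ±1 labels and returns a scoring function):

```python
def one_vs_all_train(data, labels, classes, learn_binary):
    """Train m classifiers: classifier j treats class j as positive,
    all other classes as negative."""
    return {c: learn_binary(data, [1 if y == c else -1 for y in labels])
            for c in classes}

def one_vs_all_predict(models, x):
    """The classifiers vote as an ensemble: assign x to the class
    whose classifier returns the highest score."""
    return max(models, key=lambda c: models[c](x))
```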
Classification: A Mathematical Mapping
• View classification as learning a mapping: given training tuples $(x_i, y_i)$ with feature vectors $x_i \in \mathbb{R}^n$ and labels $y_i \in \{+1, -1\}$, find a function $f: \mathbb{R}^n \to \{+1, -1\}$
Linear Classifier
[Figure: linearly separable training samples, $y_i = +1$ vs. $y_i = -1$, in the $(x_1, x_2)$ plane, separated by a line.]
• In order to classify new data close to the line correctly, the minimum distance between the line and the training samples should be as large as possible.
• In addition to correctness, we aim to find a robust classifier with good generalization ability to new data.
Pick the one with the largest margin
[Figure: among all separating lines, the maximum margin classifier is the one with the largest margin between the two classes.]
Specifying a hyper-plane
• A hyper-plane: $w^T x + b = 0$
• For $(x_i, y_i) \in D$: if $y_i = +1$, then $w^T x_i + b > 0$; if $y_i = -1$, then $w^T x_i + b < 0$ (the correctness constraints)
[Figure: the hyperplane $w^T x + b = 0$ with normal vector $w$; the half-space $w^T x + b > 0$ contains the $y_i = +1$ samples and $w^T x + b < 0$ the $y_i = -1$ samples.]
Maximum margin
• Margin: $\gamma = \frac{2c}{\|w\|}$
• Goal: maximize the margin
[Figure: parallel hyperplanes $w^T x + b = c$ and $w^T x + b = -c$ pass through the support vectors on either side of $w^T x + b = 0$; scaling by $\alpha$ (i.e., $\alpha w^T x + \alpha b = \alpha c$) leaves the classifier unchanged.]
• Note: $c$ is arbitrary (we can normalize the equations by $c$)
Maximum margin
• Margin: $\gamma = \frac{2}{\|w\|}$
• Goal: maximize the margin
[Figure: after normalizing by $c$, the support vectors lie on $w^T x + b = 1$ and $w^T x + b = -1$, on either side of the separating hyperplane $w^T x + b = 0$.]
• Note: the $w$ here equals $\frac{w}{c}$ of the last slide, and likewise for $b$
Primal optimization problem
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1,\ i = 1, 2, \ldots, m \tag{1}$$
Dual problem
• Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i\left(1 - y_i(w^T x_i + b)\right), \quad \alpha = (\alpha_1; \alpha_2; \ldots; \alpha_m),\ \alpha_i \ge 0$$
$$\max_{\alpha:\,\alpha_i \ge 0} L(w, b, \alpha) = \begin{cases} \frac{1}{2}\|w\|^2 & \text{if } (w, b) \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$$
• Dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ i = 1, 2, \ldots, m$$
• The resulting decision function:
$$f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y_i\, x_i^T x + b$$
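A small Python sketch of this decision function (names are ours; the $\alpha_i$ and $b$ are assumed to come from a QP solver, and only the support vectors, those with $\alpha_i > 0$, contribute to the sum):

```python
def svm_decision(alphas, ys, xs, b, x):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b; its sign is the class."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    return sum(a * y * dot(xi, x)
               for a, y, xi in zip(alphas, ys, xs) if a > 0) + b
```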
SVM—Linearly Inseparable
• Map the data into a higher-dimensional feature space via $\phi(\cdot)$; the dual problem keeps the same constraints ($\sum_{i=1}^{m} \alpha_i y_i = 0$, $\alpha_i \ge 0$), with $x_i^T x_j$ replaced by $\phi(x_i)^T \phi(x_j)$
• Calculating the inner product of feature vectors in the feature space can be costly because the space is high-dimensional.
• The kernel trick comes to the rescue: a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ carries out the feature-space inner product calculation in the original input space.
SVM: Different Kernel functions
• Symmetric and positive semi-definite ⇔ kernel function ⇔ $K(x, y) = \langle \phi(x), \phi(y) \rangle$ for some $\phi(\cdot)$
Kernel function
• Examples of commonly-used kernel functions:
• Linear kernel: $K(x_i, x_j) = x_i^T x_j$
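For illustration, the linear kernel and the Gaussian RBF kernel (a commonly used kernel, though not listed above) in Python:

```python
import math

def linear_kernel(x, y):
    """K(x_i, x_j) = x_i^T x_j"""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel: exp(-||x - y||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```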
Optimization objective
• Maximize the margin while minimizing the number of training samples that don't satisfy the constraint:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \ell_{0/1}\left(y_i(w^T x_i + b) - 1\right)$$
[Figure: plot of the 0/1 loss $\ell_{0/1}$ over the interval $[-2, 2]$.]
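A direct Python transcription of this objective (names are ours; $\ell_{0/1}(z)$ is 1 when $z < 0$ and 0 otherwise):

```python
def soft_margin_objective(w, b, data, labels, C):
    """(1/2)*||w||^2 + C * #{i : y_i * (w^T x_i + b) < 1}"""
    reg = 0.5 * sum(wi * wi for wi in w)
    violations = sum(1 for x, y in zip(data, labels)
                     if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1)
    return reg + C * violations
```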
Artificial Neural Networks
• A set of connected input/output units
where each connection has a weight
associated with it
Perceptron
• Activation function: $f(x) = \frac{1}{1 + e^{-x}}$
• For example, for input $x = -0.5$: $f(-0.5) = \frac{1}{1 + e^{0.5}} = 0.3775$
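The same computation as a runnable Python snippet:

```python
import math

def sigmoid(x):
    """Logistic activation: f(x) = 1 / (1 + e^(-x)), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(-0.5), 4))  # 0.3775
```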
Activation functions
• Limit the output to a fixed range (e.g., $(0, 1)$ for the sigmoid)
• Enable non-linear transformation
• Learning:
• Given the network structure – determined empirically
• Learn the weights
• Learning Approach:
• Random initial weights – usually small
• Iterative: Backpropagation to update the weights based
on the model output “error”
• Terminating condition (when the error is very small, enough epochs have completed, etc.)
Backpropagation
• An optimization problem
• Loss function: evaluates how well the neural
network models the training data, for example,
mean squared error
Gradient Descent
• In each iteration:
• Propagate the inputs forward
• Backpropagate the error
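As a minimal illustration of these two steps, a single sigmoid unit trained with squared-error loss (names are ours; a real MLP repeats the same forward/backward passes layer by layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(samples, lr=0.5, epochs=1000):
    """Gradient descent for one sigmoid unit: forward pass, then
    backpropagate the error to update the weight and bias."""
    w, b = 0.1, 0.0                    # small initial weights
    for _ in range(epochs):
        for x, target in samples:
            out = sigmoid(w * x + b)   # propagate the input forward
            # chain rule: dLoss/dnet = (out - target) * sigmoid'(net)
            delta = (out - target) * out * (1 - out)
            w -= lr * delta * x        # backpropagate the error
            b -= lr * delta
    return w, b
```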
Neural Network as a Classifier
• Multilayer Perceptron (MLP)
• Strength
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Algorithms are inherently parallel
• Weakness
• Long training time
• Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the hidden units of the network
Deep Learning
• Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.