Week 5 - Classification
Supervised vs. Unsupervised Learning
• Supervised learning
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Classification - predicts categorical class labels
• Regression – predicts continuous values
• Unsupervised learning
• The class labels of the training data are unknown
• E.g. Clustering: Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or clusters in the data
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
[Figure: a classification algorithm builds a classifier from the training data.]
Process (2): Using the Model in Prediction
[Figure: the classifier is applied to the test data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
Decision Tree Induction: An Example
[Figure: a training set with attributes age, income, student, and credit_rating and class label buys_computer, together with the decision tree induced from it (leaves labeled no / yes).]
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
• Expected information (entropy) needed to classify a tuple in $D$:
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
• Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
• Information gained by branching on attribute $A$:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
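To make these formulas concrete, here is a minimal Python sketch (the function and variable names are ours, not from the slides) that computes Info(D) and Gain(A) for a categorical attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the categorical attribute at attr_index."""
    n = len(labels)
    # Partition the class labels by the value of attribute A
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```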
Comparing Attribute Selection Measures
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
• Use validation test set of class-labeled tuples instead of training
set when assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
• Accuracy (recognition rate) = (TP + TN)/All: percentage of test set tuples that are correctly classified
• Sensitivity = TP/P: true positive recognition rate (P = TP + FN)
• Specificity = TN/N: true negative recognition rate (N = FP + TN)
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: $\text{Precision} = \frac{TP}{TP + FP}$
• Recall: completeness – what % of positive tuples the classifier labeled as positive: $\text{Recall} = \frac{TP}{TP + FN}$
• F-measure ($F_1$): harmonic mean of precision and recall: $F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
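As a worked example, a small Python sketch of these metrics (the function name and argument order are ours), computed directly from confusion-matrix counts:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # exactness
    recall = tp / (tp + fn)      # completeness (= sensitivity)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```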
Evaluating Classifier Accuracy
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
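As an illustration, a minimal holdout split in Python (names are ours; random subsampling simply repeats this k times and averages the accuracies):

```python
import random

def holdout_split(data, train_frac=2/3, seed=42):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)
```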
Model Selection Using Statistical Tests of Significance
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Estimating Confidence Intervals: Null Hypothesis
• Null hypothesis: M1 and M2 are the same, i.e., the observed difference in error rates is due to chance
• If the null hypothesis can be rejected, the difference between M1 and M2 is statistically significant
Estimating Confidence Intervals: t-test
• Pairwise comparison:
• For the $i$-th round of 10-fold cross-validation, the same cross partitioning is used to obtain $err(M_1)_i$ and $err(M_2)_i$
• Average over 10 rounds to get $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• The t-test computes the t-statistic with $k-1$ degrees of freedom:
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$
where
$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$$
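A short Python sketch of this paired t-statistic (the function name is ours; err1[i] and err2[i] are the per-fold error rates of M1 and M2 on the same partitions):

```python
import math

def paired_t_statistic(err1, err2):
    """t-statistic with k-1 degrees of freedom for paired per-fold errors."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k
    return mean_diff / math.sqrt(var / k)
```

Compare |t| against the critical value from the t-distribution table (next slide) to decide whether to reject the null hypothesis.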
Estimating Confidence Intervals: Table for t-distribution
• The t-distribution is symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit: z = critical value(sig/2, df)
Estimating Confidence Intervals: Statistical Significance
• If the computed t value lies in the rejection region (|t| > z), reject the null hypothesis: the difference between M1 and M2 is statistically significant, so select the model with the lower error rate
• Otherwise, conclude that any observed difference is due to chance
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
with numerical outputs
• Shows the trade-off between the true
positive rate and the false positive rate
• Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
• The area under the ROC curve (AUC)
is a measure of the accuracy of the
model
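A minimal Python sketch of this ranking procedure (names are ours; scores are the classifier's numerical outputs, labels are 1 for positive and 0 for negative tuples):

```python
def roc_points(scores, labels):
    """Rank tuples by decreasing score and trace out (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```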
More Measures in Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging
• Boosting
Bagging
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• Majority voting: The bagged classifier M* counts the votes and assigns
the class with the most votes to X
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proven to give improved accuracy in prediction
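A hedged Python sketch of this procedure (names are ours; learn is any routine that fits a base classifier and returns it as a callable):

```python
import random
from collections import Counter

def bagging_train(data, labels, k, learn, seed=0):
    """Learn k models, each from a bootstrap sample of d tuples drawn
    with replacement from the d-tuple training set D."""
    rng = random.Random(seed)
    n = len(data)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample D_i
        models.append(learn([data[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority voting: assign the class with the most votes to x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```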
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How does boosting work?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where the weight of classifier $M_i$'s vote is a function of its accuracy:
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j), \qquad W_i = \log\frac{1 - error(M_i)}{error(M_i)}$$
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
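A minimal AdaBoost-style reweighting step in Python (names are ours; miss[j] flags whether tuple j was misclassified by the classifier M_i just learned; it assumes 0 < error < 1):

```python
import math

def boosting_reweight(weights, miss):
    """Update tuple weights so the next classifier M_{i+1} pays more
    attention to the tuples misclassified by M_i."""
    error = sum(w for w, m in zip(weights, miss) if m) / sum(weights)
    vote_weight = math.log((1 - error) / error)  # weight of M_i's vote
    updated = [w * math.exp(vote_weight) if m else w
               for w, m in zip(weights, miss)]
    total = sum(updated)  # normalize so the weights sum to 1
    return [w / total for w in updated], vote_weight
```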
Classification of Class-Imbalanced Data Sets
Multiclass Classification - from binary
• Classification involving more than two classes (i.e., > 2 Classes)
• Method 1. One-vs.-all (OVA): Learn a classifier one at a time
• Given m classes, train m classifiers: one for each class
• Classifier j: treat tuples in class j as positive & all others as negative
• To classify a tuple X, the set of classifiers vote as an ensemble
• Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
• Given m classes, construct m(m-1)/2 binary classifiers
• A classifier is trained using tuples of the two classes
• To classify a tuple X, each classifier votes. X is assigned to the class with
maximal vote
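A sketch of one-vs.-all in Python (names are ours; learn_binary fits a binary classifier on ±1 labels and returns a scoring function):

```python
def one_vs_all_train(data, labels, classes, learn_binary):
    """Train m classifiers: classifier j treats class j as positive,
    all other classes as negative."""
    return {c: learn_binary(data, [1 if y == c else -1 for y in labels])
            for c in classes}

def one_vs_all_predict(models, x):
    """The classifiers vote as an ensemble: assign x to the class
    whose classifier returns the highest score."""
    return max(models, key=lambda c: models[c](x))
```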
Classification: A Mathematical Mapping
• View classification as learning a mapping: given training tuples $(x_i, y_i)$ with feature vectors $x_i \in \mathbb{R}^n$ and labels $y_i \in \{+1, -1\}$, find a function $f: \mathbb{R}^n \to \{+1, -1\}$
Linear Classifier
[Figure: linearly separable training samples, $y_i = +1$ vs. $y_i = -1$, in the $(x_1, x_2)$ plane, separated by a line.]
• In order to classify new data close to the line correctly, the minimum distance between the line and the training samples should be as large as possible.
• In addition to correctness, we aim to find a robust classifier with good generalization ability to new data.
Pick the one with the largest margin
[Figure: among all separating lines, the maximum margin classifier is the one with the largest margin between the two classes.]
Specifying a hyper-plane
• A hyper-plane: $w^T x + b = 0$
• For $(x_i, y_i) \in D$: if $y_i = +1$, then $w^T x_i + b > 0$; if $y_i = -1$, then $w^T x_i + b < 0$ (the correctness constraints)
[Figure: the hyperplane $w^T x + b = 0$ with normal vector $w$; the half-space $w^T x + b > 0$ contains the $y_i = +1$ samples and $w^T x + b < 0$ the $y_i = -1$ samples.]
Maximum margin
• Margin: $\gamma = \frac{2c}{\|w\|}$
• Goal: maximize the margin
[Figure: parallel hyperplanes $w^T x + b = c$ and $w^T x + b = -c$ pass through the support vectors on either side of $w^T x + b = 0$; scaling by $\alpha$ (i.e., $\alpha w^T x + \alpha b = \alpha c$) leaves the classifier unchanged.]
• Note: $c$ is arbitrary (we can normalize the equations by $c$)
Maximum margin
• Margin: $\gamma = \frac{2}{\|w\|}$
• Goal: maximize the margin
[Figure: after normalizing by $c$, the support vectors lie on $w^T x + b = 1$ and $w^T x + b = -1$, on either side of the separating hyperplane $w^T x + b = 0$.]
• Note: the $w$ here equals $\frac{w}{c}$ of the last slide, and likewise for $b$
Primal optimization problem
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1,\ i = 1, 2, \ldots, m \tag{1}$$
Dual problem
• Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i\left(1 - y_i(w^T x_i + b)\right), \quad \alpha = (\alpha_1; \alpha_2; \ldots; \alpha_m),\ \alpha_i \ge 0$$
$$\max_{\alpha:\,\alpha_i \ge 0} L(w, b, \alpha) = \begin{cases} \frac{1}{2}\|w\|^2 & \text{if } (w, b) \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$$
• Dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ i = 1, 2, \ldots, m$$
• The resulting decision function:
$$f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y_i\, x_i^T x + b$$
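A small Python sketch of this decision function (names are ours; the $\alpha_i$ and $b$ are assumed to come from a QP solver, and only the support vectors, those with $\alpha_i > 0$, contribute to the sum):

```python
def svm_decision(alphas, ys, xs, b, x):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b; its sign is the class."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    return sum(a * y * dot(xi, x)
               for a, y, xi in zip(alphas, ys, xs) if a > 0) + b
```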
SVM—Linearly Inseparable
• Map the data into a higher-dimensional feature space via $\phi(\cdot)$; the dual problem keeps the same constraints ($\sum_{i=1}^{m} \alpha_i y_i = 0$, $\alpha_i \ge 0$), with $x_i^T x_j$ replaced by $\phi(x_i)^T \phi(x_j)$
• Calculating the inner product of feature vectors in the feature space can be costly because the space is high-dimensional.
• The kernel trick comes to the rescue: a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ carries out the feature-space inner product calculation in the original input space.
SVM: Different Kernel functions
• Symmetric and positive semi-definite ⇔ kernel function ⇔ $K(x, y) = \langle \phi(x), \phi(y) \rangle$ for some $\phi(\cdot)$
Kernel function
• Examples of commonly-used kernel functions:
• Linear kernel: $K(x_i, x_j) = x_i^T x_j$
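For illustration, the linear kernel and the Gaussian RBF kernel (a commonly used kernel, though not listed above) in Python:

```python
import math

def linear_kernel(x, y):
    """K(x_i, x_j) = x_i^T x_j"""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel: exp(-||x - y||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```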
Optimization objective
• Maximize the margin while minimizing the number of training samples that don't satisfy the constraint:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \ell_{0/1}\left(y_i(w^T x_i + b) - 1\right)$$
[Figure: plot of the 0/1 loss $\ell_{0/1}$ over the interval $[-2, 2]$.]
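A direct Python transcription of this objective (names are ours; $\ell_{0/1}(z)$ is 1 when $z < 0$ and 0 otherwise):

```python
def soft_margin_objective(w, b, data, labels, C):
    """(1/2)*||w||^2 + C * #{i : y_i * (w^T x_i + b) < 1}"""
    reg = 0.5 * sum(wi * wi for wi in w)
    violations = sum(1 for x, y in zip(data, labels)
                     if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1)
    return reg + C * violations
```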
Artificial Neural Networks
• A set of connected input/output units
where each connection has a weight
associated with it
Perceptron
• Activation function: $f(x) = \frac{1}{1 + e^{-x}}$
• For example, for input $x = -0.5$: $f(-0.5) = \frac{1}{1 + e^{0.5}} = 0.3775$
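The same computation as a runnable Python snippet:

```python
import math

def sigmoid(x):
    """Logistic activation: f(x) = 1 / (1 + e^(-x)), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(-0.5), 4))  # 0.3775
```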
Activation functions
• Limit the output to a fixed range (e.g., $(0, 1)$ for the sigmoid)
• Enable non-linear transformation
• Learning:
• Given the network structure – determined empirically
• Learn the weights
• Learning Approach:
• Random initial weights – usually small
• Iterative: Backpropagation to update the weights based
on the model output “error”
• Terminating condition (when the error is very small, enough epochs have completed, etc.)
Backpropagation
• An optimization problem
• Loss function: evaluates how well the neural
network models the training data, for example,
mean squared error
Gradient Descent
• In each iteration:
• Propagate the inputs forward
• Backpropagate the error
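As a minimal illustration of these two steps, a single sigmoid unit trained with squared-error loss (names are ours; a real MLP repeats the same forward/backward passes layer by layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(samples, lr=0.5, epochs=1000):
    """Gradient descent for one sigmoid unit: forward pass, then
    backpropagate the error to update the weight and bias."""
    w, b = 0.1, 0.0                    # small initial weights
    for _ in range(epochs):
        for x, target in samples:
            out = sigmoid(w * x + b)   # propagate the input forward
            # chain rule: dLoss/dnet = (out - target) * sigmoid'(net)
            delta = (out - target) * out * (1 - out)
            w -= lr * delta * x        # backpropagate the error
            b -= lr * delta
    return w, b
```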
Neural Network as a Classifier
• Multilayer Perceptron (MLP)
• Strength
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Algorithms are inherently parallel
• Weakness
• Long training time
• Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the hidden units of the network
Deep Learning
• Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.