
Week 5

- The document discusses classification techniques for supervised learning, including decision tree induction.
- It describes the basic process of classification as having two steps: model construction using a training set, and then model usage to classify future data.
- Decision tree induction is explained as a top-down process that recursively partitions data based on attribute tests, selected using an information gain heuristic, to construct a tree for classification.


CP610 Data Analysis - Classification

Classification

• Classification: Basic Concepts


• Decision Tree Induction
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Support Vector Machine
• Artificial Neural Networks

Supervised vs. Unsupervised Learning
• Supervised learning
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Classification - predicts categorical class labels
• Regression – predicts continuous values

• Unsupervised learning
• The class labels of the training data are unknown
• E.g., clustering: given a set of measurements, observations, etc., the aim
is to establish the existence of classes or clusters in the data

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction

Training Data  →  Classification Algorithms  →  Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Classifier

Testing Data  →  Classifier  →  estimate accuracy
Unseen Data   →  Classifier  →  predicted class label

Unseen Data: (Jeff, Professor, 4)  →  Tenured?
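A minimal sketch of the two-step process (not part of the original slides), assuming scikit-learn is available; the numeric encoding of rank and the choice of DecisionTreeClassifier are illustrative assumptions.

```python
# Minimal sketch of the two-step process, assuming scikit-learn is available.
# The toy data mirrors the tenure table above; the rank encoding is hypothetical.
from sklearn.tree import DecisionTreeClassifier

rank = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Step 1: model construction from the training set (rank, years) -> tenured
X_train = [[rank["Assistant Prof"], 3],   # Mike
           [rank["Assistant Prof"], 7],   # Mary
           [rank["Professor"],      2],   # Bill
           [rank["Associate Prof"], 7],   # Jim
           [rank["Assistant Prof"], 6],   # Dave
           [rank["Associate Prof"], 3]]   # Anne
y_train = ["no", "yes", "yes", "yes", "no", "no"]
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage - classify the unseen tuple (Jeff, Professor, 4)
print(model.predict([[rank["Professor"], 4]]))  # likely 'yes' under the learned rule
```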

Decision Tree Induction: An Example
• Training data set: Buys_computer
• The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

• Resulting tree:

age?
├─ <=30    →  student?
│              ├─ no   →  no
│              └─ yes  →  yes
├─ 31..40  →  yes
└─ >40     →  credit_rating?
               ├─ excellent  →  no
               └─ fair       →  yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let $p_i$ be the probability that an arbitrary tuple in D belongs to
class $C_i$, estimated by $|C_{i,D}|/|D|$
• Expected information (entropy) needed to classify a tuple in D:

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

• Information needed (after using A to split D into v partitions) to classify D:

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

• Information gained by branching on attribute A:

  $Gain(A) = Info(D) - Info_A(D)$

Attribute Selection: Information Gain
• Class P: buys_computer = “yes” (9 tuples)
• Class N: buys_computer = “no” (5 tuples)

  $Info(D) = I(9,5) = -\frac{9}{14}\log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14}) = 0.940$

  age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31…40    4     0     0
  >40      3     2     0.971

  $Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

  Here $\frac{5}{14}I(2,3)$ means that “age <= 30” covers 5 of the 14 samples,
  with 2 yes’s and 3 no’s.

  $Gain(age) = Info(D) - Info_{age}(D) = 0.246$

• Similarly (on the same training table as before):

  $Gain(income) = 0.029$
  $Gain(student) = 0.151$
  $Gain(credit\_rating) = 0.048$
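The numbers above can be checked with a short script. A rough sketch (not from the slides) that recomputes Info(D), Info_age(D) and Gain(age) from the 14-tuple table:

```python
# Sketch: recompute Info(D), Info_age(D) and Gain(age) for the buys_computer data.
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected information needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# (age, buys_computer) pairs from the training table above
data = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
        ("31..40", "yes"), (">40", "no")]

labels = [y for _, y in data]
info_d = entropy(labels)                                  # about 0.940

info_age = sum(
    len([y for a, y in data if a == v]) / len(data)
    * entropy([y for a, y in data if a == v])
    for v in {a for a, _ in data}
)                                                         # about 0.694
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# -> 0.94 0.694 0.247 (the slides round the gain to 0.246)
```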
Computing Information-Gain for
Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the maximum information gain for A is selected as
the split-point for A (see the sketch below)
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
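A rough sketch of the midpoint search described above (an assumption, not the slides' own code); it reuses the same entropy idea:

```python
# Sketch: choose the best split point for a continuous attribute A by testing
# every midpoint between adjacent sorted values and keeping the max-gain one.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    xs, ys = [v for v, _ in pairs], [y for _, y in pairs]
    base, best = entropy(ys), (None, -1.0)
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue
        split = (xs[i] + xs[i + 1]) / 2.0          # midpoint (a_i + a_{i+1}) / 2
        left, right = ys[: i + 1], ys[i + 1:]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - expected
        if gain > best[1]:
            best = (split, gain)
    return best                                     # (split-point, information gain)

# Example with raw (non-discretized) ages; the values are hypothetical
print(best_split_point([25, 28, 35, 42, 45, 50], ["no", "no", "yes", "yes", "yes", "no"]))
```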
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
  $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
• GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex.

• gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the
splitting attribute
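As a quick check of the example (a sketch, not from the slides): in the 14-tuple table, income takes the values high (4 tuples), medium (6) and low (4), which gives the 1.557 above.

```python
# Sketch: SplitInfo_income(D) and the gain ratio for the income attribute.
from math import log2

counts = {"high": 4, "medium": 6, "low": 4}     # income value frequencies
n = sum(counts.values())
split_info = -sum((c / n) * log2(c / n) for c in counts.values())
gain_income = 0.029                              # Gain(income) from the earlier slide
print(round(split_info, 3), round(gain_income / split_info, 3))   # -> 1.557 0.019
```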
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, gini index,
gini(D) is defined as
  $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

• The Gini index can replace the entropy Info(D) in the ID3 algorithm.
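For instance (a sketch, not from the slides), the Gini index of the buys_computer training set with 9 “yes” and 5 “no” tuples:

```python
# Sketch: Gini index of a class distribution.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # -> 0.459 for the buys_computer training set
```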

Comparing Attribute Selection Measures

• The three measures, in general, return good results but


• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in
both partitions

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases

• G-statistic: has a close approximation to χ2 distribution


• MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
• The best tree as the one that requires the fewest # of bits to both (1) encode
the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
• CART: finds multivariate splits based on a linear comb. of attrs.
• Which attribute selection measure is the best?
• Most give good results, but none is significantly superior to the others
Overfitting and Tree Pruning
• Overfitting: a model that fits the training data too closely and cannot
generalize to new data
• An induced Decision Tree may be overfit: Too many branches, some may
reflect anomalies due to noise or outliers

• Approaches to avoid overfitting


• Prepruning: halt the tree’s construction in advance
• Postpruning: Remove branches from a “fully grown” tree

Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
• Use a validation/test set of class-labeled tuples, rather than the training
set, when assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

• Given m classes, an entry CM_{i,j} in a confusion matrix indicates the
number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
A\P   C    ¬C
C     TP   FN   | P
¬C    FP   TN   | N
      P’   N’   | All

• Classifier Accuracy, or recognition rate: percentage of test set tuples
that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: 1 – accuracy
  Error rate = (FP + FN) / All

• Class Imbalance Problem (anomaly detection):
  • One class may be rare, e.g. fraud, or HIV-positive
  • Significant majority of the negative class and minority of the positive class
  • Sensitivity: True Positive recognition rate, Sensitivity = TP/P
  • Specificity: True Negative recognition rate, Specificity = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive
  Precision = TP / (TP + FP)

• Recall: completeness – what % of positive tuples did the classifier label
as positive?
  Recall = TP / (TP + FN)

• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × Precision × Recall / (Precision + Recall)

• Fβ: weighted measure of precision and recall; assigns β times as much
weight to recall as to precision
  Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.40 (accuracy)

• Precision = 90/230 = 39.13%        Recall = 90/300 = 30.00%
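A small sketch (not part of the slides) that derives these metrics from the confusion-matrix entries above:

```python
# Sketch: metrics from the cancer confusion matrix (TP=90, FN=210, FP=140, TN=9560).
TP, FN, FP, TN = 90, 210, 140, 9560

sensitivity = TP / (TP + FN)                 # recall on the positive (rare) class
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
recall      = sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"sensitivity={sensitivity:.2%}  specificity={specificity:.2%}  "
      f"precision={precision:.2%}  recall={recall:.2%}  F1={f1:.2%}")
# -> sensitivity=30.00%  specificity=98.56%  precision=39.13%  recall=30.00%  F1=33.96%
```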

Evaluating Classifier Accuracy
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation

• Cross-validation (k-fold, where k = 10 is most popular)


• Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
• At i-th iteration, use Di as test set and others as training set

• Variation 1. Leave-one-out: k folds where k = # of tuples, for


small sized data
• Variation 2. Stratified cross-validation: folds are stratified so
that the class distribution in each fold is approximately the same as
that in the initial data (see the sketch below)
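A minimal sketch of stratified 10-fold cross-validation with scikit-learn; the dataset and classifier are placeholders, not from the slides.

```python
# Sketch: stratified 10-fold cross-validation to estimate classifier accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
print(scores.mean(), scores.std())   # estimated accuracy and its variability
```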
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
• About 63.2% of the original data end up in the bootstrap sample, and the
remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^(–1) = 0.368)
• Repeat the sampling procedure k times; the overall accuracy averages the
test-set and training-set accuracies of each repetition:
  $Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{test} + 0.368 \times Acc(M_i)_{train}\right)$

Model Selection Using Statistical Tests of
Significance
Estimating Confidence Intervals: Classifier Models M1 vs. M2

• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain the mean error rates of M1 and M2

• What if the difference between the 2 error rates is just attributed


to chance?
• Use a test of statistical significance (e.g., Student’s t-test)

• Obtain confidence limits for our error estimates

Estimating Confidence Intervals:
Null Hypothesis

• Assume samples follow a t distribution with k–1 degrees of


freedom
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same
• If we can reject null hypothesis, then
• we conclude that the difference between M1 & M2 is
statistically significant
• Choose the model with the lower error rate

Estimating Confidence Intervals: t-test

• Pairwise comparison:
• For the i-th round of 10-fold cross-validation, the same cross
partitioning is used to obtain err(M1)_i and err(M2)_i
• Average over the 10 rounds to get the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• The t-test computes the t-statistic with k–1 degrees of freedom:

  $t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$

  where

  $var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$
Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level,
e.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different
for 95% of
population
• Confidence limit, z =
critical value(sig/2,df)

Estimating Confidence Intervals:
Statistical Significance

• Are M1 & M2 significantly different?


• Compute t. Select significance level (e.g. sig = 5%)
• Consult table for t-distribution: Find t value
corresponding to k-1 degrees of freedom (here, 9)
• t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
t = critical value(sig/2,df)
• If t > z or t < -z, then t value lies in rejection region:
• Reject null hypothesis that mean error rates of M1 & M2 are
same
• Conclude: statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is chance
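A minimal sketch of this procedure using SciPy's paired t-test; the per-fold error rates below are hypothetical values, not from the slides.

```python
# Sketch: paired t-test on per-fold error rates of two classifiers (k = 10).
from scipy import stats

err_m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]  # hypothetical
err_m2 = [0.15, 0.14, 0.16, 0.15, 0.14, 0.13, 0.17, 0.15, 0.14, 0.16]  # hypothetical

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # k-1 = 9 degrees of freedom
sig = 0.05
print(t_stat, p_value)
if p_value < sig:
    print("Reject the null hypothesis: M1 and M2 are significantly different")
else:
    print("Any observed difference may be attributed to chance")
```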

Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
with numerical outputs
• Shows the trade-off between the true
positive rate and the false positive rate
• Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
• The closer to the diagonal line (i.e.,
the closer the area is to 0.5), the less
accurate is the model
• The area under the ROC curve (AUC)
is a measure of the accuracy of the
model
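A rough sketch of computing ROC points and AUC with scikit-learn; the dataset and classifier are placeholders, not from the slides.

```python
# Sketch: ROC curve points (FPR, TPR) and AUC for a probabilistic classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)        # trade-off between TPR and FPR
print("AUC =", roc_auc_score(y_te, scores))  # area under the ROC curve
```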

More Measures in Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging
• Boosting

Bagging
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• Majority voting: The bagged classifier M* counts the votes and assigns
the class with the most votes to X
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proven to give improved accuracy in prediction

Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier, where the
weight of each classifier’s vote is a function of its accuracy:

  $error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$

  weight of classifier M_i’s vote: $\log\frac{1 - error(M_i)}{error(M_i)}$
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
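Both methods are available off the shelf; a minimal sketch with scikit-learn (the dataset and base learner are placeholders, not from the slides).

```python
# Sketch: bagging vs. boosting over decision trees, compared by cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging  = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=10).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=10).mean())
```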

Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: Rare positive example but numerous


negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs
• Typical methods for imbalanced data in binary classification:
• Oversampling: re-sampling of data from positive class
• Under-sampling: randomly eliminate tuples from negative
class
• Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance
of costly false negative errors
• Ensemble techniques: Ensemble multiple classifiers

Multiclass Classification - from binary
• Classification involving more than two classes (i.e., > 2 Classes)
• Method 1. One-vs.-all (OVA): Learn a classifier one at a time
• Given m classes, train m classifiers: one for each class
• Classifier j: treat tuples in class j as positive & all others as negative
• To classify a tuple X, the set of classifiers vote as an ensemble
• Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
• Given m classes, construct m(m-1)/2 binary classifiers
• A classifier is trained using tuples of the two classes
• To classify a tuple X, each classifier votes. X is assigned to the class with
maximal vote
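Both schemes exist as wrappers in scikit-learn; a minimal sketch (dataset and base classifier are placeholders, not from the slides).

```python
# Sketch: one-vs.-all (OVA) and all-vs.-all (AVA) multiclass schemes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                                # m = 3 classes
ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # m classifiers
ava = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)    # m(m-1)/2 classifiers
print(ova.predict(X[:3]), ava.predict(X[:3]))
```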

Classification: A Mathematical Mapping

• $x \in X = \mathbb{R}^n$, $y \in Y = \{+1, -1\}$
• We want to derive a function f: X → Y

• Linear Classification
• Binary classification problem
• Data above the line belongs to class ‘x’
• Data below the line belongs to class ‘o’
• Examples: SVM, Perceptron, Probabilistic Classifiers, Logistic Regression
Linear Classifier

• To classify new data near the decision line correctly, the minimum
distance between the line and the training samples should be as large as
possible.
• In addition to correctness, we want a robust classifier with good
generalization ability to new data.

Pick the one with the largest margin

• Among all separating lines, pick the one with the largest margin:
the maximum margin classifier.
Specifying a hyper-plane

• A hyper-plane: $w^T x + b = 0$
• For $(x_i, y_i) \in D$: if $y_i = +1$, then $w^T x_i + b > 0$; if $y_i = -1$,
then $w^T x_i + b < 0$ (the correctness constraints)
Maximum margin

• Margin: $\gamma = \frac{2c}{\|w\|}$
• Goal: maximize the margin
• The margin boundaries are $w^T x + b = c$ and $w^T x + b = -c$, with the
separating hyper-plane $w^T x + b = 0$ in between; the training samples
lying on the boundaries are the support vectors.
• Scaling by any $\alpha > 0$ describes the same hyper-plane:
$\alpha w^T x + \alpha b = \alpha c$
• Note: c is arbitrary (we can normalize the equations by c)
Maximum margin (normalized)

• Margin: $\gamma = \frac{2}{\|w\|}$
• Goal: maximize the margin
• After normalizing by c, the margin boundaries are $w^T x + b = 1$ and
$w^T x + b = -1$, with $w^T x + b = 0$ in between; the samples lying on the
boundaries are the support vectors.
• Note: the w and b here equal the w and b of the previous slide divided by c.
Primal optimization problem

• $w^T x_i + b \ge 1$ if $y_i = +1$, and $w^T x_i + b \le -1$ if $y_i = -1$;
together: $y_i(w^T x_i + b) \ge 1$, $i = 1, 2, \dots, m$

• Maximize the margin:

  $\max_{w,b} \frac{2}{\|w\|}$   s.t. $y_i(w^T x_i + b) \ge 1$, $i = 1, 2, \dots, m$

• which is equivalent to the primal problem (1):

  $\min_{w,b} \frac{1}{2}\|w\|^2$   s.t. $y_i(w^T x_i + b) \ge 1$, $i = 1, 2, \dots, m$
Dual problem

• Lagrangian:

  $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^T x_i + b)\right)$,
  $\alpha = (\alpha_1; \alpha_2; \dots; \alpha_m)$, $\alpha_i \ge 0$

• $\max_{\alpha: \alpha_i \ge 0} L(w, b, \alpha) = \frac{1}{2}\|w\|^2$ if $(w, b)$ satisfies the
primal constraints, and $\infty$ otherwise

• So (1) can be re-written as: $\min_{w,b} \max_{\alpha: \alpha_i \ge 0} L(w, b, \alpha)$
Lagrangian duality

• The dual problem:

  $\max_{\alpha: \alpha_i \ge 0} \min_{w,b} L(w, b, \alpha)$, where
  $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i(w^T x_i + b)\right)$

• We minimize L with respect to w and b first:

  $\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$   (*)

  $\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$   (**)

• Plugging (*) back into $L(w, b, \alpha)$ and using (**), we have:

  $L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j$
Dual problem

• Now we have the dual problem:

  $\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j$

  s.t. $\sum_{i=1}^{m} \alpha_i y_i = 0$, $\alpha_i \ge 0$, $i = 1, 2, \dots, m$

• Solving this problem gives $\alpha$, then $w$ and $b$, and finally the model:

  $f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y_i x_i^T x + b$
SVM—Linearly Inseparable

• Transform the original input data into a higher dimensional space
• Search for a linear separating hyperplane in the new space
Kernel function

With this mapping φ, our dual optimization problem becomes:

  $\max_{\alpha} L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$

  s.t. $\sum_{i=1}^{m} \alpha_i y_i = 0$, $\alpha_i \ge 0$, $i = 1, 2, \dots, m$

• Calculating the inner product of feature vectors in the feature space can be
costly because the space is high dimensional.
• The kernel trick comes to the rescue: a kernel function
$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ computes the inner product of feature
vectors in the feature space by a calculation in the original input space.
SVM: Different Kernel functions

• Symmetric and positive semi-definite ⇔ kernel function ⇔ $\langle \phi(x), \phi(y) \rangle$ for some φ(·)

Kernel function

• Examples of commonly-used kernel functions:

  Linear kernel: $K(x_i, x_j) = x_i^T x_j$

  Polynomial kernel: $K(x_i, x_j) = (1 + x_i^T x_j)^p$

  Gaussian (Radial Basis Function, RBF) kernel: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

  Sigmoid kernel: $K(x_i, x_j) = \tanh(\beta_0\, x_i^T x_j + \beta_1)$
Data with noise

• Allow some training samples to violate the constraint
$y_i(w^T x_i + b) \ge 1$
• Maximize the margin while minimizing the number of training samples that
violate the constraint
Optimization objective

Maximize the margin and minimize the number of training samples that violate
the constraint:

  $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \ell_{0/1}\left(y_i(w^T x_i + b) - 1\right)$

• $\ell_{0/1}$ is the 0/1 loss function: $\ell_{0/1}(z) = 1$ if $z < 0$, and 0 otherwise
• $C > 0$ is a trade-off parameter; $C = \infty$ gives the hard-margin SVM
• But $\ell_{0/1}$ is not convex and is discontinuous
Surrogate loss functions

• hinge loss: $\ell_{hinge}(z) = \max(0, 1 - z)$
• exponential loss: $\ell_{exp}(z) = \exp(-z)$
• logistic loss: $\ell_{log}(z) = \log(1 + \exp(-z))$
Soft margin SVM

If we apply the hinge loss, the optimization objective becomes:

  $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max\left(0,\, 1 - y_i(w^T x_i + b)\right)$   (*)

Introducing slack variables $\xi_i$, (*) can be rewritten as the soft-margin SVM:

  $\min_{w,b,\xi_i} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$

  s.t. $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, 2, \dots, m$

• Every sample has a slack variable $\xi_i$, which measures the degree to
which that sample violates the constraint.
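In practice the soft-margin SVM with a kernel is available in libraries such as LIBSVM (wrapped by scikit-learn's SVC). A rough sketch, with C as the trade-off parameter above and an RBF kernel; the dataset is a placeholder.

```python
# Sketch: soft-margin SVM with an RBF kernel; C is the trade-off parameter.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```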


SVM Related Links
• SVM Website: http://www.kernel-machines.org/
• Representative implementations
• LIBSVM: an efficient implementation of SVM, multi-class
classifications, nu-SVM, one-class SVM, including also various
interfaces with java, python, etc.
• SVM-light: simpler, but its performance is not better than LIBSVM;
supports only binary classification and is available only in C
• SVM-torch: another recent implementation also written in C

Artificial Neural Networks
• A set of connected input/output units
where each connection has a weight
associated with it

• During the learning phase, the network


learns by adjusting the weights so as
to be able to predict the correct class
label of the input tuples
May have more hidden layers
• At one node/neuron (hidden layer or output layer):

Perceptron

• Given enough hidden units and enough training samples, they can closely
approximate any function
• An example: inputs 1 and 0.5, weights 0.25 and −1.5, output f(x) = 0.3775

  Weighted sum: (1 × 0.25) + (0.5 × (−1.5)) = 0.25 + (−0.75) = −0.5

  Activation function: $f(x) = \frac{1}{1 + e^{-x}}$

  Then $f(-0.5) = \frac{1}{1 + e^{0.5}} = 0.3775$
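The same single-neuron computation in a few lines of Python (a sketch, not from the slides):

```python
# Sketch: one neuron with a sigmoid activation, reproducing the example above.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs, weights = [1.0, 0.5], [0.25, -1.5]
weighted_sum = sum(i * w for i, w in zip(inputs, weights))   # -> -0.5
print(weighted_sum, round(sigmoid(weighted_sum), 4))          # -> -0.5 0.3775
```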
Activation functions
• Limit the output to a bounded range
• Enable non-linear transformation
• Learning:
• Given the network structure – determined empirically
• Learn the weights

• Learning Approach:
• Random initial weights – usually small
• Iterative: Backpropagation to update the weights based
on the model output “error”
• Terminating condition (when the error is very small, or enough epochs
have completed, etc.)
Backpropagation
• An optimization problem
• Loss function: evaluates how well the neural
network models the training data, for example,
mean squared error
Gradient Descent
• In each iteration:
• Propagate the inputs forward
• Backpropagate the error
Neural Network as a Classifier
• Multilayer Perceptron (MLP)
• Strength
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Algorithms are inherently parallel
• Weakness
• Long training time
• Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
• Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of “hidden units” in the network

Deep Learning
• Deep Learning is a branch of machine learning
based on a set of algorithms that attempt to model
high level abstractions in data by using a deep
graph with multiple processing layers, composed of
multiple linear and non-linear transformations.

• Based on artificial neural networks with


representation learning.

• Applications: Text, Image, Video…


Special structures
• No longer Fully Connected Neural Networks!

• May not be feed-forward!

• Model Structure based on application
