Introduction To Support Vector Machines: BTR Workshop Fall 2006
Outline
Statistical Machine Learning Basics
Training error, generalization error, hypothesis space
Supervised Learning
Find function from input space X to output space Y
Example: binary classification, y ∈ {+1, -1}
Example: regression, y ∈ R (e.g., y = 7.3)
Instance Space X:
Feature vector of word occurrences => binary features
N features (N typically > 50000)
Target Concept c:
Spam (+1) / Ham (-1)
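For concreteness, a small sketch of building such binary word-occurrence feature vectors; this is an illustration only (scikit-learn usage and the two example messages are not part of the slides):

```python
# Sketch: binary word-occurrence features for spam (+1) vs. ham (-1).
# Assumes scikit-learn; the two example messages are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting agenda for tomorrow"]
labels = [1, -1]                              # +1 = spam, -1 = ham

vectorizer = CountVectorizer(binary=True)     # 1 if the word occurs, else 0
X = vectorizer.fit_transform(docs)            # sparse binary feature vectors

clf = LinearSVC().fit(X, labels)
print(clf.predict(vectorizer.transform(["buy cheap pills now"])))
```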
Real-world process P(X,Y)
Training sample S = ((x1,y1), ..., (xn,yn)), drawn i.i.d. from P(X,Y)
Learner selects hypothesis h from hypothesis space H
Test sample Stest = ((xn+1,yn+1), ...), drawn i.i.d. from P(X,Y)
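As a hedged illustration of this setup (synthetic data from make_classification stands in for the unknown P(X,Y), and LinearSVC stands in for choosing a hypothesis from a linear hypothesis space H):

```python
# Minimal sketch: draw an i.i.d. training sample, pick a hypothesis from a
# linear hypothesis space, and compare training error with test error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

h = LinearSVC().fit(X_train, y_train)            # learner picks h from H
print("training error:", 1 - h.score(X_train, y_train))
print("test error:    ", 1 - h.score(X_test, y_test))
```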
Notation
Optimal Hyperplanes
Linear Hard-Margin Support Vector Machine
Assumption: Training examples are linearly separable.
Hard-Margin Separation
Goal: Find hyperplane with the largest distance to the
closest training examples.
Optimization Problem (Primal):
minimize over w, b:   (1/2) w·w
subject to:   yi (w·xi + b) ≥ 1   for all i = 1, ..., n
Soft-Margin Separation
Idea: Maximize margin and minimize training error.
Hard-Margin OP (Primal):
minimize over w, b:   (1/2) w·w
subject to:   yi (w·xi + b) ≥ 1   for all i
Soft-Margin OP (Primal):
minimize over w, b, ξ:   (1/2) w·w + C Σi ξi
subject to:   yi (w·xi + b) ≥ 1 - ξi  and  ξi ≥ 0   for all i
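A minimal sketch of the soft-margin primal as a quadratic program, assuming cvxpy is available; the toy data and the value C = 1 are invented for illustration:

```python
# Sketch: solve the soft-margin primal directly on a tiny toy data set.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0], [0.5, -0.5]])
y = np.array([1, 1, -1, -1, 1], dtype=float)
C = 1.0

w = cp.Variable(X.shape[1])
b = cp.Variable()
xi = cp.Variable(X.shape[0], nonneg=True)              # slack variables

constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]    # y_i (w.x_i + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "slacks =", xi.value)
```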
[Figure: inputs x1, ..., x7 with weights w1, ..., w7 and bias b; candidate hyperplanes 1-6 with their outputs/margins on the training examples]
Estimate:
Question: Is there a cheaper way to compute this estimate?
[Table: per-example outcomes (Correct, Correct, Error, Correct) with associated values 0.7, 3.5, 0.1, 1.3]
Experiment:
Training Sample    | Error  | CPU-Time (sec)
Reuters (n=6451)   | 0.58%  | 32.3
WebKB (n=2092)     | 20.42% | 235.4
Ohsumed (n=10000)  | 2.56%  | 1132.3
Non-Linear Problems
Problem:
some tasks have non-linear structure
no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?
Example
Input Space:   x = (x1, x2)   (2 attributes)
Feature Space: Φ(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)   (6 attributes)
Kernels
Problem: Very many parameters! Polynomials of degree p over N attributes in input space lead to O(N^p) attributes in feature space!
Solution [Boser et al.]: The dual OP depends only on inner products => Kernel Functions K(a,b) = Φ(a)·Φ(b)
Example: For the Φ above, calculating K(a,b) = (a·b + 1)^2 computes the inner product in feature space.
=> no need to represent the feature space explicitly.
Classification: h(x) = sign( Σi αi yi K(xi, x) + b )
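To see the kernel trick in action, a short sketch (plain numpy, with arbitrarily chosen toy vectors) checking that K(a,b) = (a·b + 1)^2 equals the inner product of the explicit degree-2 feature maps:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D input (6 features)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def poly_kernel(a, b):
    """Polynomial kernel of degree 2: K(a, b) = (a.b + 1)^2."""
    return (np.dot(a, b) + 1.0) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print(poly_kernel(a, b))          # 4.0, computed in input space
print(np.dot(phi(a), phi(b)))     # same value, computed in feature space
```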
Examples of Kernels
Polynomial: K(a,b) = (a·b + 1)^p
[Figure: string-kernel example with substring features c-t, a-r, b-a, b-t, c-r, b-r over the words (cat), (car), (bat), (bar)]
Computational
Objective function has no local optima (only one global optimum)
Independent of dimensionality of feature space
Design decisions
Kernel type and parameters
Value of C
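In practice these design decisions are often made by cross-validation; a hedged sketch with scikit-learn (the parameter grid and the synthetic data are illustrative only):

```python
# Sketch: selecting kernel, kernel parameters, and C by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = [
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```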
Multi-class Classification
[Schoelkopf/Smola Book, Section 7.6]
Regression
[Schoelkopf/Smola Book, Section 1.6]
Outlier Detection
D.M.J. Tax and R.P.W. Duin, "Support vector domain
description", Pattern Recognition Letters, vol. 20, pp. 1191-1199,
1999.
Ordinal Regression and Ranking
Herbrich et al., Large Margin Rank Boundaries for Ordinal
Regression, Advances in Large Margin Classifiers, MIT Press,
1999.
Joachims, Optimizing Search Engines using Clickthrough Data,
ACM SIGKDD Conference (KDD), 2002.
Supervised Learning
Find function from input space X to output space Y
So far: classification (y ∈ {+1, -1}) and regression (y ∈ R, e.g., y = 7.3).
Now: y can be a structured object, e.g., a parse tree (with nodes such as NP, VP, Det).
Example: multi-label document classification over the categories antarctica, benelux, germany, iraq, oil, coal, trade, acquisitions, with label vector y = (-1, -1, -1, +1, +1, -1, -1, -1).
)=0.4?
threshold
threshold
)=0.4?
threshold
Example: sequence alignment.
x:  s: ABJLHBNJYAUGAI
    t: BHJKBNYGU
y:  AB-JLHBNJYAUGAI
    BHJK-BN-YGU
Related Work
Idea: each possible output y is one class => multi-class classification.
Problems:
Exponentially many classes!
How to predict efficiently?
How to learn efficiently?
[Figure: candidate outputs y1, y2, ..., yk, each a parse tree built from S, NP, VP, Det, N, V, viewed as one class each]
Hard Margin (separable): find w with minimal (1/2) w·w such that every training example is scored higher for its correct output than for every incorrect output, with margin at least 1.
Soft Margin (training error): find w with minimal (1/2) w·w + C Σi ξi, allowing margin violations measured by slack variables ξi ≥ 0.
Problems
How to predict efficiently?
How to learn efficiently?
Manageable number of parameters?
CKY Parser
f : X -> Y
The joint feature map Ψ(x,y) counts how often each grammar rule is used in the parse tree y, e.g., for x = "The dog chased the cat" and its parse y:
Ψ(x,y) = ( 1  S -> NP VP,
           0  S -> NP,
           2  NP -> Det N,
           1  VP -> V NP,
           ...,
           0  Det -> dog,
           2  Det -> the,
           1  N -> dog,
           1  V -> chased,
           1  N -> cat )
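A minimal sketch (plain Python; the nested-tuple tree encoding is my own choice, not from the slides) of computing Ψ(x,y) as rule counts from a parse tree:

```python
from collections import Counter

# Parse tree for "The dog chased the cat": (label, children...) for internal
# nodes, (label, word) for leaves.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "chased"),
               ("NP", ("Det", "the"), ("N", "cat"))))

def rule_counts(node, counts=None):
    """Count how often each grammar rule is used in the tree (this is Psi(x, y))."""
    if counts is None:
        counts = Counter()
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, children[0])] += 1                      # lexical rule, e.g. N -> dog
    else:
        counts[(label, tuple(c[0] for c in children))] += 1    # e.g. S -> NP VP
        for child in children:
            rule_counts(child, counts)
    return counts

print(rule_counts(tree))
# e.g. ('S', ('NP', 'VP')): 1, ('NP', ('Det', 'N')): 2, ('Det', 'the'): 2, ...
```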
REPEAT
  FOR each training example (xi, yi)
    compute the most violated constraint: ŷ = argmax over y of [ Δ(yi, y) + w·Ψ(xi, y) ]
    IF this constraint is violated by more than ε THEN
      add constraint to working set and re-solve the QP
    ENDIF
  ENDFOR
UNTIL no constraint was added
[Jo03] [TsoJoHoAl05]
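A hedged sketch of this working-set loop on a deliberately tiny problem, so the argmax can be brute force and the QP over the working set can be re-solved with cvxpy; all data, the feature map, the loss, C, and ε are invented for illustration (this is not SVM-struct itself):

```python
import numpy as np
import cvxpy as cp

Y = [0, 1, 2]                                        # tiny output space
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
Ytrain = [0, 1, 2]
C, eps = 1.0, 1e-3

def psi(x, y):                                       # joint feature map Psi(x, y)
    f = np.zeros(len(Y) * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def delta(y, ybar):                                  # loss Delta(y, ybar)
    return 0.0 if y == ybar else 1.0

dim = len(psi(X[0], 0))
working_set = [[] for _ in X]                        # violated outputs per example
w = np.zeros(dim)

while True:
    added = 0
    for i, (x, yi) in enumerate(zip(X, Ytrain)):
        # find most violated constraint: argmax_y Delta(yi, y) + w.Psi(x, y)
        ybar = max(Y, key=lambda y: delta(yi, y) + w @ psi(x, y))
        # current slack of example i under its working-set constraints
        slack = max([0.0] + [delta(yi, y) - w @ (psi(x, yi) - psi(x, y))
                             for y in working_set[i]])
        if delta(yi, ybar) - w @ (psi(x, yi) - psi(x, ybar)) > slack + eps:
            working_set[i].append(ybar)              # add constraint to working set
            added += 1
    if added == 0:                                   # UNTIL no constraint was added
        break
    # re-solve the soft-margin QP over all constraints in the working set
    wv = cp.Variable(dim)
    xi = cp.Variable(len(X), nonneg=True)
    cons = [wv @ (psi(X[i], Ytrain[i]) - psi(X[i], y)) >= delta(Ytrain[i], y) - xi[i]
            for i in range(len(X)) for y in working_set[i]]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(wv) + C * cp.sum(xi)), cons).solve()
    w = wv.value

print("learned w:", w)
```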
Data
Penn Treebank sentences of length at most 10 (start with POS)
Train on Sections 2-22: 4098 sentences
Test on Section 23: 163 sentences
[TsoJoHoAl05]
So far: the general algorithm.
Example task: part-of-speech tagging, where the output y is a sequence of tags (Det, N, ...).
Test accuracy (%): 96.49, 95.78, 95.75, 95.63, 95.02, 94.68
Methods compared: Brill (RBT), HMM (ACOPOST), kNN (MBT), Tree Tagger, SVM Multiclass (SVM-light), SVM-HMM (SVM-struct)
Hypothesis: Amino acid sequences fold into the structure with lowest energy.
Problem: Huge search space (> 2^100 states).
Predicting an Alignment
Protein Sequence to Structure Alignment (Threading)
Given a pair x=(s,t) of new sequence s and known
structure t, predict the alignment y.
Elements of s and t are described by features, not just
character identity.
x:
  ABJLHBNJYAUGAI   (with per-position annotations BBBLLBBLLHHHHH and 32401450143520)
  BHJKBNYGU        (with per-position annotation BBLLBBLLH)
y (alignment, with gaps '-'):
  BB-BLLBBLLHHHHH
  32-401450143520
  AB-JLHBNJYAUGAI
  BHJK-BN-YGU
  BBLL-BB-LLH
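Prediction amounts to a search for the best-scoring alignment; a minimal Needleman-Wunsch-style sketch in plain Python (the +1/-1 toy scoring is a stand-in for the scores a learned model would supply):

```python
# Sketch: score the best alignment of two sequences by dynamic programming.
# A traceback over the DP table (omitted here) would recover the alignment y.
def align_score(s, t, score, gap=-1.0):
    n, m = len(s), len(t)
    # D[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # match/substitute
                          D[i - 1][j] + gap,                            # gap in t
                          D[i][j - 1] + gap)                            # gap in s
    return D[n][m]

toy_score = lambda a, b: 1.0 if a == b else -1.0
print(align_score("ABJLHBNJYAUGAI", "BHJKBNYGU", toy_score))
```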
Scores are conventionally estimated via log-odds / a Hidden Markov Model.
Loss Function
Q loss: fraction of incorrectly aligned positions.
Q4 loss: an aligned position counts as correct if it is within a shift of at most 4 positions.
[Example: correct alignment y vs. alternate alignment y' over the fragments A B C D and B A C C; Q(y, y') = 1/3, Q4(y, y') = 0/3]
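A tiny sketch of the Q loss in plain Python; the encoding of alignments as lists of aligned index pairs and the example values are chosen just for illustration:

```python
def q_loss(y_correct, y_alt):
    """Q loss: fraction of correctly aligned pairs missed by the alternate alignment."""
    correct, alt = set(y_correct), set(y_alt)
    return sum(1 for pair in correct if pair not in alt) / len(correct)

# Hypothetical alignments encoded as (position in s, position in t) pairs.
y_correct = [(0, 0), (1, 1), (2, 2)]
y_alt     = [(0, 0), (1, 1), (2, 3)]   # one of three pairs aligned differently
print(q_loss(y_correct, y_alt))        # 0.333... i.e. Q loss = 1/3
```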
Experiment
Feature set | # Features | Training | Validation | Test
Simple      |       1020 |    26.83 |      27.79 | 39.89
Anova2      |      49634 |    42.25 |      35.58 | 44.98
Tensor      |     203280 |    52.36 |      34.79 | 42.81
Window      |     447016 |    51.26 |      38.09 | 46.30
Results: Comparison
Q4-score (Test): 70.71, 67.30; BLAST: 28.44; (85.32)
Methods:
Conclusions:
Structured Output Prediction
Generative training
Hidden-Markov models [Manning & Schuetze, 1999]
Probabilistic context-free grammars [Manning & Schuetze, 1999]
Markov random fields [Geman & Geman, 1984]
Etc.
Discriminative training
Multivariate output regression [Izenman, 1975] [Breiman & Friedman, 1997]
Kernel Dependency Estimation [Weston et al. 2003]
Conditional HMM [Krogh, 1994]
Graph transformer networks [LeCun et al., 1998]
Conditional random fields [Lafferty et al., 2001] [Sutton & McCallum, 2005]
Perceptron training of HMM [Collins, 2002]
Structural SVMs / Maximum-margin Markov networks [Taskar et al., 2003]
[Tsochantaridis et al., 2004, 2005] [Taskar 2004]