Introduction to Support Vector Machines: BTR Workshop Fall 2006

This document introduces support vector machines (SVMs) for classification. It discusses how SVMs find the optimal separating hyperplane with the largest margin between classes in the training data. It describes the soft-margin SVM, which allows some misclassified points to balance margin size and training error. The document outlines how SVMs can be extended to non-linear classification using kernels, which enable SVMs to learn in high-dimensional feature spaces without explicitly representing those spaces. Finally, it discusses how kernels can be used to apply SVMs to non-vectorial structured data like sequences, graphs, and trees.

Introduction to

Support Vector Machines


BTR Workshop
Fall 2006
Thorsten Joachims
Cornell University

Outline
Statistical Machine Learning Basics
Training error, generalization error, hypothesis space

Support Vector Machines for Classification

Optimal hyperplanes and margins


Soft-margin Support Vector Machine
Primal vs. dual optimization problem
Kernels

Support Vector Machines for Structured Outputs


Linear discriminant models
Solving exponentially-sized training problems
Example: Predicting the alignment between proteins

Supervised Learning
Find a function from input space X to output space Y such that the prediction error is low.

Example x (text document):
Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters.
→ y = 1

Example x (gene sequence):
GATACAACCTATCCCCGTATATATATTCT
ATGGGTATAGTATTAAATCAATACAACC
TATCCCCGTATATATATTCTATGGGTATA
GTATTAAATCAATACAACCTATCCCCGT
ATATATATTCTATGGGTATAGTATTAAAT
CAGATACAACCTATCCCCGTATATATAT
TCTATGGGTATAGTATTAAATCACATTTA
→ y = −1

The output need not be a class label: y can also be a real number, e.g. y = 7.3 (regression).

Example: Spam Filtering

Instance Space X:
Feature vector of word occurrences => binary features
N features (N typically > 50000)

Target Concept c:
Spam (+1) / Ham (-1)
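A minimal sketch of such a feature representation, with an invented vocabulary and message:

```python
# Minimal sketch: map a text to a binary bag-of-words feature vector.
# The vocabulary and example message are invented for illustration.
vocabulary = ["cheap", "meds", "meeting", "tomorrow", "viagra", "project"]

def to_binary_features(text, vocab):
    words = set(text.lower().split())
    # 1 if the vocabulary word occurs in the message, 0 otherwise
    return [1 if w in words else 0 for w in vocab]

x = to_binary_features("Cheap meds available tomorrow", vocabulary)
print(x)  # [1, 1, 0, 1, 0, 0]
```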

Learning as Prediction Task

Real-world Process P(X,Y)
  Training Sample Strain = (x1,y1), …, (xn,yn), drawn i.i.d. from P(X,Y)  →  Learner
  Test Sample Stest = (xn+1,yn+1), …, drawn i.i.d. from P(X,Y)

Goal: Find h with small prediction error ErrP(h) over P(X,Y).
Strategy: Find (any?) h with small error ErrStrain(h) on the training sample Strain.
  Training Error: Error ErrStrain(h) on the training sample.
  Test Error: Error ErrStest(h) on the test sample is an estimate of ErrP(h).

Linear Classification Rules

Hypotheses of the form
  unbiased:  h(x) = sign(w · x)
  biased:    h(x) = sign(w · x + b)
with parameter vector w and scalar b.

Hypothesis space H: the set of all such linear classifiers.

Notation

Optimal Hyperplanes
Linear Hard-Margin Support Vector Machine
Assumption: Training examples are linearly separable.

Margin of a Linear Classifier

Hard-Margin Separation
Goal: Find the hyperplane with the largest distance to the closest training examples.

Optimization Problem (Primal):
  minimize over w, b:   ½ w · w
  subject to:           yi (w · xi + b) ≥ 1   for all i = 1, …, n

Support Vectors: examples with minimal distance (i.e., examples that lie exactly on the margin).

Non-Separable Training Data


Limitations of hard-margin formulation
For some training data, there is no separating hyperplane.
Complete separation (i.e. zero training error) can lead to
suboptimal prediction error.

Soft-Margin Separation
Idea: Maximize margin and minimize training error.

Hard-Margin OP (Primal):
  min over w, b:      ½ w · w
  s.t.                yi (w · xi + b) ≥ 1   for all i

Soft-Margin OP (Primal):
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi  and  ξi ≥ 0   for all i

Slack variable ξi measures by how much (xi, yi) fails to achieve the margin.
Σi ξi is an upper bound on the number of training errors.
C is a parameter that controls the trade-off between margin and training error.

Controlling Soft-Margin Separation

Soft-Margin OP (Primal):
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Σi ξi is an upper bound on the number of training errors.
C controls the trade-off between margin and training error: a small C tolerates more training errors in exchange for a larger margin, a large C does the opposite.
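To see this trade-off numerically, here is a small sketch using scikit-learn's SVC (an independent SVM implementation, not the SVM-light tool referenced in these slides); the data and the C values are made up for illustration:

```python
# Sketch: effect of the soft-margin parameter C on a linear SVM.
# Uses scikit-learn; data and C values are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian classes, so no perfect separation exists.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width 2/||w||
    errors = (clf.predict(X) != y).sum()       # training errors
    print(f"C={C:>6}: margin={margin:.3f}, training errors={errors}, "
          f"support vectors={len(clf.support_)}")
```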

Example Reuters acq: Varying C

Example: Margin in High-Dimension

[Table omitted: a small training sample described by seven sparse features x1, …, x7 with labels y ∈ {+1, −1}, and six candidate hyperplanes given by weights w1, …, w7 and bias b (entries such as 0.5, −0.5, 0.95, −0.95, 0.05, −0.05); the example compares the margins the different hyperplanes achieve on this sample.]

SVM Solution as Linear Combination

Primal OP:
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Theorem: The solution w* can always be written as a linear combination of the training vectors:
  w* = Σi αi yi xi   with   αi ≥ 0.

Properties:
  The factor αi indicates the influence of training example (xi, yi).
  If ξi > 0, then αi = C.
  If 0 ≤ αi < C, then ξi = 0.
  (xi, yi) is a support vector if and only if αi > 0.
  If 0 < αi < C, then yi (xi · w + b) = 1.
  SVM-light outputs the αi when given the -a option.

Dual SVM Optimization Problem

Primal Optimization Problem:
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Dual Optimization Problem:
  max over α:   Σi αi − ½ Σi Σj yi yj αi αj (xi · xj)
  s.t.          Σi yi αi = 0   and   0 ≤ αi ≤ C

Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, then
  w* = Σi αi* yi xi.

Leave-One-Out (i.e., n-fold CV)

Training Set: Strain = (x1,y1), …, (xn,yn)
Approach: Repeatedly leave one example out for testing, train on the remaining n − 1 examples.
Estimate: ErrLOO = fraction of the n held-out examples that are misclassified.
Question: Is there a cheaper way to compute this estimate?

Necessary Condition for Leave-One-Out Error

Lemma: For SVM, example i can produce a leave-one-out error only if 2 αi R² + ξi ≥ 1.
  αi : dual variable of example i
  ξi : slack variable of example i
  R  : bound on the length of the examples, ‖xi‖ ≤ R

Example: value of 2 αi R² + ξi  →  leave-one-out error?
  0.0  →  correct
  0.7  →  correct
  3.5  →  error
  0.1  →  correct
  1.3  →  correct

Case 1: Example is not a SV
  Criterion: (αi = 0)  ⇒  (ξi = 0)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct

Case 2: Example is a SV with Low Influence
  Criterion: (αi < 0.5/R² < C)  ⇒  (ξi = 0)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct

Case 3: Example has Small Training Error
  Criterion: (αi = C) and (ξi < 1 − 2CR²)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct
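The arithmetic behind Case 3, spelled out (my own one-line derivation of the criterion above):

```latex
% Case 3: \alpha_i = C and \xi_i < 1 - 2CR^2 imply the leave-one-out bound.
\[
2\alpha_i R^2 + \xi_i \;=\; 2CR^2 + \xi_i \;<\; 2CR^2 + \bigl(1 - 2CR^2\bigr) \;=\; 1 .
\]
```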

Experiment: Reuters Text Classification


Experiment Setup
6451 Training Examples
6451 Validation Examples to estimate true Prediction Error
Comparison between Leave-One-Out upper bound and error
on Validation Set (average over 10 test/validation splits)

Fast Leave-One-Out Estimation for SVMs

Lemma: Training errors are always leave-one-out errors.
Algorithm:
  (R, α, ξ) = trainSVM(Strain)
  FOR (xi, yi) ∈ Strain:
    IF ξi > 1 THEN loo++                            (training error ⇒ leave-one-out error)
    ELSE IF 2 αi R² + ξi < 1 THEN loo unchanged     (guaranteed leave-one-out correct)
    ELSE trainSVM(Strain \ {(xi, yi)}) and test (xi, yi) explicitly
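A rough sketch of this estimate on top of scikit-learn's SVC rather than SVM-light; the mapping of αi, ξi, and R onto SVC attributes is my own reading of that API, and the data are synthetic:

```python
# Sketch: count examples whose leave-one-out prediction is guaranteed correct
# because 2*alpha_i*R^2 + xi_i < 1; only the remaining examples would need
# explicit retraining. scikit-learn SVC is used for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 5) + 1.0, rng.randn(100, 5) - 1.0])
y = np.array([1] * 100 + [-1] * 100)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

R = np.linalg.norm(X, axis=1).max()              # bound on ||x_i||
alpha = np.zeros(len(X))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])  # alpha_i (dual_coef_ holds y_i * alpha_i)
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))  # slack variables

value = 2 * alpha * R**2 + xi
guaranteed_correct = (value < 1).sum()
print(f"{guaranteed_correct}/{len(X)} examples certified correct without retraining")
```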

Experiment:
  Training Sample       Retraining Steps (%)   CPU-Time (sec)
  Reuters  (n=6451)            0.58%                  32.3
  WebKB    (n=2092)           20.42%                 235.4
  Ohsumed  (n=10000)           2.56%                1132.3

Non-Linear Problems

Problem:
some tasks have non-linear structure
no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?

Extending the Hypothesis Space


Idea: add more features

Learn linear rule in feature space.


Example:

The separating hyperplane in feature space is a polynomial of degree two in input space.

Example:
  Input Space:    x = (x1, x2)                                       (2 attributes)
  Feature Space:  Φ(x) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1)        (6 attributes)

Dual SVM Optimization Problem (recap)

Primal Optimization Problem:
  min over w, b, ξ:   ½ w · w + C Σi ξi     s.t.   yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Dual Optimization Problem:
  max over α:   Σi αi − ½ Σi Σj yi yj αi αj (xi · xj)     s.t.   Σi yi αi = 0,  0 ≤ αi ≤ C

Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, then w* = Σi αi* yi xi.
Note: the dual depends on the training examples only through the inner products xi · xj.

Kernels
Problem: Very many parameters! Polynomials of degree p over N attributes in input space lead to O(N^p) attributes in feature space!
Solution [Boser et al.]: The dual OP depends only on inner products ⇒ kernel functions K(x, z) = Φ(x) · Φ(z).

Example: For Φ(x) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1), calculating K(x, z) = (x · z + 1)² computes the inner product in feature space.
⇒ no need to represent the feature space explicitly.
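A quick numerical check of this kernel trick for the degree-two example (the test vectors are made up; the feature map Φ is the one given above):

```python
# Verify that K(x, z) = (x.z + 1)^2 equals the inner product in the
# 6-dimensional feature space Phi(x) = (x1^2, x2^2, sqrt(2)x1x2, sqrt(2)x1, sqrt(2)x2, 1).
import numpy as np

def phi(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, x2**2, s*x1*x2, s*x1, s*x2, 1.0])

def K(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))   # both print 4.0
```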

SVM with Kernel

Training: solve the dual with K(xi, xj) in place of the inner products xi · xj.
Classification: h(x) = sign( Σi αi yi K(xi, x) + b )

New hypothesis spaces through new kernels:
  Linear:                  K(x, z) = x · z
  Polynomial:              K(x, z) = (x · z + 1)^d
  Radial Basis Function:   K(x, z) = exp(−γ ‖x − z‖²)
  Sigmoid:                 K(x, z) = tanh(γ (x · z) + c)

Examples of Kernels

[Figure: decision boundaries obtained with a polynomial kernel and with a radial basis function kernel.]

What is a Valid Kernel?

Definition: Let X be a nonempty set. A function K is a valid kernel on X if for all n and all x1, …, xn ∈ X it produces a Gram matrix
  Gij = K(xi, xj)
that is symmetric,
  G = Gᵀ,
and positive semi-definite,
  αᵀ G α ≥ 0 for all α ∈ ℝⁿ.
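A small numerical sketch of the definition: build the Gram matrix of an RBF kernel on a few random points and confirm symmetry and positive semi-definiteness (points and bandwidth are arbitrary):

```python
# Check symmetry and positive semi-definiteness of a Gram matrix numerically.
# The sample points and the RBF bandwidth gamma are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
gamma = 0.5

def rbf(x, z):
    return np.exp(-gamma * np.sum((x - z) ** 2))

G = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print("symmetric:", np.allclose(G, G.T))
eigvals = np.linalg.eigvalsh(G)              # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min()) # >= 0 up to numerical error
```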

How to Construct Valid Kernels

Theorem: Let K1 and K2 be valid kernels over X × X with X ⊆ ℝ^N, let a ≥ 0 and 0 ≤ λ ≤ 1, let f be a real-valued function on X, let φ: X → ℝ^m with K3 a valid kernel over ℝ^m × ℝ^m, and let B be a symmetric positive semi-definite N × N matrix. Then the following functions are valid kernels:
  K(x, z) = λ K1(x, z) + (1 − λ) K2(x, z)
  K(x, z) = a K1(x, z)
  K(x, z) = K1(x, z) K2(x, z)
  K(x, z) = f(x) f(z)
  K(x, z) = K3(φ(x), φ(z))
  K(x, z) = xᵀ B z

Kernels for Discrete and Structured Data

Kernels for Sequences: Two sequences are similar if they have many common and consecutive subsequences.
Example [Lodhi et al., 2000]: For 0 < λ ≤ 1, consider the feature space of two-character (possibly gapped) subsequences
  c-a, c-t, a-t, b-a, b-t, c-r, a-r, b-r
for the words (cat), (car), (bat), (bar); each feature value is λ raised to the length of the window the subsequence spans in the word (e.g., Φc-a(cat) = λ², Φc-t(cat) = λ³).
⇒ K(car, cat) = λ⁴; efficient computation via dynamic programming.
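A brute-force sketch of this two-character subsequence feature space with gap decay λ (illustrative only; the efficient version is the dynamic program mentioned above):

```python
# Brute-force 2-character subsequence kernel with gap decay lambda^span.
# Illustrative only; the efficient version uses dynamic programming.
from collections import defaultdict
from itertools import combinations

def features(word, lam):
    phi = defaultdict(float)
    for i, j in combinations(range(len(word)), 2):
        subseq = word[i] + "-" + word[j]
        phi[subseq] += lam ** (j - i + 1)   # weight by length of the spanned window
    return phi

def K(u, v, lam=0.5):
    fu, fv = features(u, lam), features(v, lam)
    return sum(fu[s] * fv[s] for s in fu if s in fv)

lam = 0.5
print(K("car", "cat", lam), lam ** 4)   # only "c-a" is shared: lambda^2 * lambda^2 = lambda^4
```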

Kernels for Non-Vectorial Data


Applications with Non-Vectorial Input Data
classify non-vectorial objects

Protein classification (x is string of amino acids)


Drug activity prediction (x is molecule structure)
Information extraction (x is sentence of words)
Etc.

Applications with Non-Vectorial Output Data


predict non-vectorial objects
Natural Language Parsing (y is parse tree)
Noun-Phrase Co-reference Resolution (y is clustering)
Search engines (y is ranking)

Kernels can compute inner products efficiently!

Properties of SVMs with Kernels


Expressiveness
SVMs with Kernel can represent any boolean function (for
appropriate choice of kernel)
SVMs with Kernel can represent any sufficiently smooth
function to arbitrary accuracy (for appropriate choice of
kernel)

Computational
Objective function has no local optima (only one global)
Independent of dimensionality of feature space

Design decisions
Kernel type and parameters
Value of C
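Both design decisions can be searched jointly by cross-validation; a hedged sketch with scikit-learn's GridSearchCV (an independent implementation, with arbitrary parameter grids and synthetic data):

```python
# Sketch: select kernel type/parameters and C by cross-validated grid search.
# scikit-learn is used for illustration; grids and data are arbitrary.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.sign(np.sin(X[:, 0]) + X[:, 1] ** 2 - 0.5)   # a non-linear target

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```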

Reading: Support Vector Machines


Books
Schoelkopf, Smola, Learning with Kernels, MIT Press,
2002.
Cristianini, Shawe-Taylor. Introduction to Support Vector
Machines, Cambridge University Press, 2000.
Cristianini, Shawe-Taylor. ???

SVMs for other Problems

Multi-class Classification
[Schoelkopf/Smola Book, Section 7.6]
Regression
[Schoelkopf/Smola Book, Section 1.6]
Outlier Detection
D.M.J. Tax and R.P.W. Duin, "Support vector domain description", Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
Ordinal Regression and Ranking
Herbrich et al., Large Margin Rank Boundaries for Ordinal
Regression, Advances in Large Margin Classifiers, MIT Press,
1999.
Joachims, Optimizing Search Engines using Clickthrough Data,
ACM SIGKDD Conference (KDD), 2001.


Examples of Complex Output Spaces

Natural Language Parsing
  Given a sequence of words x, predict the parse tree y.
  Dependencies from structural constraints, since y has to be a tree.

  x: The dog chased the cat
  y: [parse tree over x with nonterminals S, NP, VP, Det, N, V]

Examples of Complex Output Spaces

Multi-Label Classification
  Given a (bag-of-words) document x, predict a set of labels y.
  Dependencies between labels from correlations between labels ("iraq" and "oil" in a newswire corpus).

  x: Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with …
  y: antarctica −1, benelux −1, germany −1, iraq +1, oil +1, coal −1, trade −1, acquisitions −1

Examples of Complex Output Spaces

Non-Standard Performance Measures (e.g., F1-score, Lift)
  F1-score: harmonic average of precision and recall.
  New example vector x8: predict y8 = 1 if P(y8 = 1 | x8) = 0.4? It depends on the other examples!

  [Figure: two sets of examples with predicted probabilities p and labels y, together with the threshold that maximizes the F1-score in each set; the best prediction for the new example differs between the two sets.]

Examples of Complex Output Spaces

Information Retrieval
  Given a query x, predict a ranking y.
  Dependencies between results (e.g., avoid redundant hits).
  Loss function over rankings (e.g., AvgPrec).

  x: SVM
  y: 1. Kernel-Machines
     2. SVM-Light
     3. Learning with Kernels
     4. SV Meppen Fan Club
     5. Service Master & Co.
     6. School of Volunteer Management
     7. SV Mattersburg Online

Examples of Complex Output Spaces

Noun-Phrase Co-reference
  Given a set of noun phrases x, predict a clustering y.
  Structural dependencies, since the prediction has to be an equivalence relation.
  Correlation dependencies from interactions.

  x: "The policeman fed the cat. He did not know that he was late. The cat is called Peter."
  y: [clustering of the noun phrases, e.g. {The policeman, He, he} and {the cat, The cat, Peter}]

Examples of Complex Output Spaces

Protein Sequence Alignment
  Given two sequences x = (s, t), predict an alignment y.
  Structural dependencies, since the prediction has to be a valid global/local alignment.

  x:  s: ABJLHBNJYAUGAI
      t: BHJKBNYGU
  y:  AB-JLHBNJYAUGAI
      BHJK-BN-YGU

Outline: Structured Output Prediction with SVMs
Task: Learning to predict complex outputs
SVM algorithm for complex outputs
Formulation as convex quadratic program
General algorithm
Sparsity bound

Example 1: Learning to parse natural language


Learning weighted context free grammar

Example 2: Learning to align proteins


Learning to predict optimal alignment of homologous proteins
for comparative modelling

Why do we Need Research on Complex Outputs?

Important applications for which conventional methods don't fit!
  Noun-phrase co-reference: two-step approaches of pair-wise classification and clustering as postprocessing, e.g. [Ng & Cardie, 2002]
  Directly optimize complex loss functions (e.g. F1, AvgPrec)
Improve upon existing methods!
  Natural language parsing: generative models like probabilistic context-free grammars
  SVM outperforms naive Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]
More flexible models!
  Avoid generative (independence) assumptions
  Kernels for structured input spaces and non-linear functions
Transfer what we learned for classification and regression!
  Boosting
  Bagging
  Support Vector Machines

Related Work

Generative training (i.e. learn P(Y,X))


Hidden-Markov models
Probabilistic context-free grammars
Markov random fields
Etc.
Discriminative training (i.e. learn P(Y|X))
Multivariate output regression [Izeman, 1975] [Breiman & Friedman,
1997]
Kernel Dependency Estimation [Weston et al. 2003]
Conditional HMM [Krogh, 1994]
Transformer networks [LeCun et al, 1998]
Conditional random fields [Lafferty et al., 2001]
Perceptron training of HMM [Collins, 2002]
Maximum-margin Markov networks [Taskar et al., 2003]

Challenges in Discriminative Learning with Complex Outputs

Approach: view the task as multi-class classification, where every complex output y is one class.

Problems:
  Exponentially many classes!
    How to predict efficiently?
    How to learn efficiently?
  Potentially huge model!
    Manageable number of features?

[Figure: the sentence x = "The dog chased the cat" with candidate parse trees y1, y2, …, yk, each built from the nonterminals S, NP, VP, Det, N, V and each treated as its own class.]

Support Vector Machine [Vapnik et al.]

Training Examples: (x1, y1), …, (xn, yn) with yi ∈ {−1, +1}
Hypothesis Space: h(x) = sign(w · x + b)

Training: Find the hyperplane with minimal ‖w‖, i.e. with maximal margin.
Optimization Problem:
  Hard Margin (separable):       min ½ w · w             s.t.  yi (w · xi + b) ≥ 1
  Soft Margin (training error):  min ½ w · w + C Σi ξi   s.t.  yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Multi-Class SVM [Crammer & Singer]

Training Examples: (x1, y1), …, (xn, yn), where each yi is one of k classes
Hypothesis Space: h(x) = argmax over y of  wy · x   (one weight vector wy per class)

[Figure: the sentence x = "The dog chased the cat" with its candidate parse trees y1, y2, … treated as the classes.]

Multi-Class SVM [Crammer & Singer]

Training: Find w1, …, wk that solve
  min ½ Σy wy · wy     s.t.  for all i and all y ≠ yi:   wyi · xi − wy · xi ≥ 1

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as before, the candidate parse trees of "The dog chased the cat" viewed as classes.]

Joint Feature Map

Feature vector Φ(x, y) that describes the match between x and y.
Learn a single weight vector w and rank outputs by w · Φ(x, y).

[Figure: the candidate parse trees y1, y2, … of x = "The dog chased the cat", each scored by w · Φ(x, y).]

Joint Feature Map

Feature vector Φ(x, y) that describes the match between x and y.
Learn a single weight vector w and rank outputs by w · Φ(x, y).

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as before.]

Joint Feature Map for Trees

Weighted Context-Free Grammar
  Each rule (e.g., S → NP VP) has a weight.
  The score of a tree is the sum of the weights of the rules it uses.
  The CKY parser finds the highest-scoring tree: f : X → Y.

Example: x = "The dog chased the cat", y = its parse tree.
Φ(x, y) counts how often each rule of the grammar is used in y:
  1  S → NP VP
  0  S → NP
  2  NP → Det N
  1  VP → V NP
  0  Det → dog
  2  Det → the
  1  N → dog
  1  V → chased
  1  N → cat

Joint Feature Map for Trees

Weighted Context-Free Grammar: each rule has a weight, the score of a tree is the sum of its rule weights, and the CKY parser finds the highest-scoring tree f : X → Y.

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as above, Φ(x, y) given by the rule counts for "The dog chased the cat".]
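A small sketch of this rule-count representation and the resulting tree score; the rule weights below are invented for illustration, only the counts come from the example above:

```python
# Sketch: joint feature map Phi(x, y) as grammar-rule counts, and the
# tree score w . Phi(x, y) as the sum of the used rules' weights.
# The weights here are invented for illustration.
from collections import Counter

# Rules used by the parse tree of "The dog chased the cat"
tree_rules = [
    "S -> NP VP",
    "NP -> Det N", "NP -> Det N",
    "VP -> V NP",
    "Det -> the", "Det -> the",
    "N -> dog", "N -> cat",
    "V -> chased",
]

phi = Counter(tree_rules)            # Phi(x, y): rule -> count

w = {                                # hypothetical learned rule weights
    "S -> NP VP": 1.2, "S -> NP": -0.3,
    "NP -> Det N": 0.8, "VP -> V NP": 0.9,
    "Det -> the": 0.4, "Det -> dog": -1.0,
    "N -> dog": 0.6, "N -> cat": 0.5, "V -> chased": 0.7,
}

score = sum(w.get(rule, 0.0) * count for rule, count in phi.items())
print(phi["NP -> Det N"], score)     # count 2, and score 1.2 + 2*0.8 + 0.9 + 2*0.4 + 0.6 + 0.5 + 0.7
```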

Structural Support Vector Machine

Joint features Φ(x, y) describe the match between x and y.
Learn weights w so that w · Φ(x, y) is maximal for the correct y.

Hard-margin optimization problem:
  min over w:   ½ w · w
  s.t.          for all i and all y ≠ yi:   w · Φ(xi, yi) − w · Φ(xi, y) ≥ 1

Loss Functions: Soft-Margin Struct SVM

Loss function Δ(yi, y) measures how much the prediction y differs from the target yi.
Soft-margin optimization problem: relax the margin constraints with slack variables ξi; one common formulation is sketched below.
Lemma: The training loss is upper bounded by the sum of the slacks Σi ξi*.
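For reference, one common way to write the soft-margin structural SVM is the margin-rescaling formulation of Tsochantaridis et al.; this is a reconstruction consistent with the slack-bound lemma above, not necessarily the exact formula on the original slide:

```latex
% Soft-margin structural SVM, margin-rescaling variant
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}\ge 0}\quad
  & \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} \;+\; \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad
  & \forall i,\ \forall y \in Y \setminus \{y_i\}:\quad
    \mathbf{w}\cdot\Phi(x_i,y_i) \;-\; \mathbf{w}\cdot\Phi(x_i,y)
    \;\ge\; \Delta(y_i,y) \;-\; \xi_i
\end{aligned}
% Then \Delta(y_i, h(x_i)) \le \xi_i^*, so the training loss is bounded by the slacks.
```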

Sparse Approximation Algorithm for Structural SVM

Input: training examples (x1, y1), …, (xn, yn); parameters C and ε
Working set W = {} (empty)
REPEAT
  FOR i = 1, …, n
    compute the most violated constraint for example i
    IF it is violated by more than ε THEN
      add the constraint to the working set W
      optimize the StructSVM objective over W
    ENDIF
  ENDFOR
UNTIL W has not changed during the iteration

Polynomial Sparsity Bound

Theorem: The sparse-approximation algorithm finds a solution to the soft-margin optimization problem after adding at most a polynomially bounded number of constraints to the working set W, so that the Kuhn-Tucker conditions are fulfilled up to a precision ε. The loss has to be bounded, and the feature vectors Φ(xi, y) have to have bounded norm.

[Jo03] [TsoJoHoAl05]

Polynomial Sparsity Bound (continued)

[The slide repeats the theorem above and revisits the open problems: how to predict efficiently, how to learn efficiently, and whether the number of parameters stays manageable.]

[Jo03] [TsoJoHoAl05]

Experiment: Natural Language Parsing

Implementation
  Implemented the sparse-approximation algorithm in SVM-light
  Incorporated a modified version of Mark Johnson's CKY parser
  Learned a weighted CFG

Data
  Penn Treebank sentences of length at most 10 (start with POS tags)
  Train on sections 2-22: 4098 sentences
  Test on section 23: 163 sentences

[TsoJoHoAl05]

More Expressive Features


Linear composition:

So far:

General:

Example:

Experiment: Part-of-Speech Tagging

Task
  Given a sequence of words x, predict the sequence of tags y.
    x: The dog chased the cat   →   y: Det N V Det N
  Dependencies from tag-tag transitions in a Markov model.

Model
  Markov model with one state per tag and words as emissions
  Each word described by a ~250,000-dimensional feature vector (all word suffixes/prefixes, word length, capitalization, …)

Experiment (by Dan Fleisher)
  Train/test on 7966/1700 sentences from the Penn Treebank
  [Bar chart: test accuracy (%) of Brill (RBT), HMM (ACOPOST), kNN (MBT), Tree Tagger, SVM Multiclass (SVM-light), and SVM-HMM (SVM-struct); the values shown are 94.68, 95.02, 95.63, 95.75, 95.78, and 96.49.]

Applying StructSVM to a New Problem

Basic algorithm implemented in SVM-struct: http://svmlight.joachims.org
Application specific:
  Loss function Δ(y, ŷ)
  Representation Φ(x, y)
  Algorithms to compute the prediction argmax and the most violated constraint
Generic structure that covers OMM, MPD, Finite-State Transducers, MRF, etc. (polynomial time inference)

Outline: Structured Output Prediction with SVMs
Task: Learning to predict complex outputs
SVM algorithm for complex outputs
Formulation as convex quadratic program
General algorithm
Sparsity bound

Example 1: Learning to parse natural language


Learning weighted context free grammar

Example 2: Learning to align proteins


Learning to predict optimal alignment of homologous proteins
for comparative modeling

Comparative Modeling of Protein Structure

Goal: Predict structure from sequence:  h(APPGEAYLQV) → [3-D structure]
Hypothesis: Amino acid sequences fold into the structure with the lowest energy.
Problem: Huge search space (> 2^100 states)

Approach: Comparative Modeling
  Similar protein sequences fold into similar shapes → use known shapes as templates.
  Task 1: Find a similar known protein for a new protein:  h(APPGEAYLQV, [template structure]) → yes/no
  Task 2: Map the new protein into the known structure:  h(APPGEAYLQV, [template structure]) → [A3, P4, P7, …]

Predicting an Alignment

Protein Sequence to Structure Alignment (Threading)
  Given a pair x = (s, t) of a new sequence s and a known structure t, predict the alignment y.
  Elements of s and t are described by features, not just character identity.

  x:  s: ABJLHBNJYAUGAI   (with per-position annotations BBBLLBBLLHHHHH and 32401450143520)
      t: BHJKBNYGU        (with per-position annotation BBLLBBLLH)

  y:  AB-JLHBNJYAUGAI     (annotations aligned: BB-BLLBBLLHHHHH and 32-401450143520)
      BHJK-BN-YGU         (annotation aligned: BBLL-BB-LLH)

Linear Score Sequence Alignment

Method: Find the alignment y that maximizes the linear score.
Example:
  Sequences:  s = (A B C D),   t = (B A C C)
  Substitution/gap scores:
         A    B    C    D    -
    A   10    0   -5  -10   -5
    B    0   10    5  -10   -5
    C   -5    5   10  -10   -5
    D  -10  -10  -10   10   -5
    -   -5   -5   -5   -5   -5
  Alignment y1:
    A B C D
    B A C C          score = 0 + 0 + 10 − 10 = 0
  Alignment y2:
    - A B C D
    B A C C -        score = −5 + 10 + 5 + 10 − 5 = 15
Algorithm: Dynamic programming
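A compact sketch of that dynamic program (global alignment in the Needleman-Wunsch style), using the scores from the example above; it recovers the optimal score of 15 achieved by alignment y2:

```python
# Global sequence alignment by dynamic programming with a linear score.
# Substitution and gap scores are taken from the example above.
sub = {
    ("A","A"):10, ("A","B"):0,  ("A","C"):-5,  ("A","D"):-10,
    ("B","A"):0,  ("B","B"):10, ("B","C"):5,   ("B","D"):-10,
    ("C","A"):-5, ("C","B"):5,  ("C","C"):10,  ("C","D"):-10,
    ("D","A"):-10,("D","B"):-10,("D","C"):-10, ("D","D"):10,
}
GAP = -5

def align_score(s, t):
    n, m = len(s), len(t)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i-1][j-1] + sub[(s[i-1], t[j-1])],  # match/mismatch
                          F[i-1][j] + GAP,                      # gap in t
                          F[i][j-1] + GAP)                      # gap in s
    return F[n][m]

print(align_score("ABCD", "BACC"))   # 15, the score of alignment y2
```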

How to Estimate the Scores?

General form of linear scoring function: each match/gap score is a linear function w · φ of features of the aligned positions, and the score of an alignment is their sum.
Estimation:
  Generative estimation via log-odds or a hidden Markov model.
  Discriminative estimation of complex models via SVM: the match/gap score can be an arbitrary linear function.

Expressive Scoring Functions


Conventional substitution matrix
Poor performance at low sequence similarity, if only amino
acid identity is considered
Difficult to design generative models that take care of the
dependencies between different features.
Would like to make use of structural features like secondary
structures, exposed surface area, and take into account the
interactions between these features

General feature-based scoring function


Allows us to describe each character by feature vector (e.g.
secondary structure, exposed surface area, contact profile)
Learn w vector of parameters
Computation of argmax still tractable via dynamic program

Loss Function

Q loss: fraction of incorrectly aligned characters.
  Correct alignment y:     A B C D
                           B A C C
  Alternate alignment y':  A - B C D
                           B A C C -
  Q(y, y') = 1/3

Q4 loss: fraction of incorrect alignments outside a window of 4 positions (small shifts are not counted as errors).
  For the same alignments, Q4(y, y') = 0/3.

Model how bad different types of mistakes are for structural modelling.

Experiment

Train set [Qiu & Elber]:
  5119 structural alignments for training, 5169 structural alignments for validation of the regularization parameter C
Test set:
  29764 structural alignments from new deposits to PDB from June 2005 to June 2006.
  All structural alignments produced by the program CE by superposing the 3D coordinates of the protein structures. All alignments have CE Z-score greater than 4.5.
Features (known for the structure, predicted for the sequence):
  Amino acid identity (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
  Secondary structure (α, β, loop)
  Exposed surface area (0, 1, 2, 3, 4, 5)

Results: Model Complexity

Feature vectors:
  Simple:  Φ(s, t, yi) → (A|A; A|C; …; -|Y; …; 0|0; 0|1; …)
  Anova2:  Φ(s, t, yi) → (A|A; 0|0; A0|A0; …)
  Tensor:  Φ(s, t, yi) → (A0|A0; A0|A1; …)
  Window:  Φ(s, t, yi) → (AAA|AAA; …; 00000|00000; …)

Q-score when optimizing to Q-loss:
  Feature vector   # Features   Training   Validation   Test
  Simple                 1020      26.83        27.79   39.89
  Anova2                49634      42.25        35.58   44.98
  Tensor               203280      52.36        34.79   42.81
  Window               447016      51.26        38.09   46.30

Results: Comparison

  Method                          Q4-score (Test)
  SVM (Window, Q4-loss)                 70.71
  SSALN [Qiu & Elber]                   67.30
  BLAST                                 28.44
  TM-align [Zhang & Skolnick]          (85.32)

Methods:
  SVM: trained on the Window feature vector with Q4-loss
  SSALN: generative method using the same training data
  BLAST: lower baseline
  TM-align: upper baseline (disagreement between two structural alignment methods)

Conclusions:
Structured Output Prediction

Learning to predict complex output


Predict structured objects
Optimize loss functions over multivariate predictions
An SVM method for learning with complex outputs
Learning to predict trees (natural language parsing) [Tsochantaridis et
al. 2004 (ICML), 2005 (JMLR)] [Taskar et al., 2004 (ACL)]
Optimize to non-standard performance measures (imbalanced classes)
[Joachims, 2005 (ICML)]
Learning to cluster (noun-phrase coreference resolution) [Finley,
Joachims, 2005 (ICML)]
Learning to align proteins [Yu et al., 2005 (ICML Workshop)]
Software: SVMstruct
http://svmlight.joachims.org/

Reading: Structured Output Prediction

Generative training
Hidden-Markov models [Manning & Schuetze, 1999]
Probabilistic context-free grammars [Manning & Schuetze, 1999]
Markov random fields [Geman & Geman, 1984]
Etc.
Discriminative training
Multivariate output regression [Izeman, 1975] [Breiman & Friedman, 1997]
Kernel Dependency Estimation [Weston et al. 2003]
Conditional HMM [Krogh, 1994]
Transformer networks [LeCun et al, 1998]
Conditional random fields [Lafferty et al., 2001] [Sutton & McCallum, 2005]
Perceptron training of HMM [Collins, 2002]
Structural SVMs / Maximum-margin Markov networks [Taskar et al., 2003]
[Tsochantaridis et al., 2004, 2005] [Taskar 2004]


Why do we Need Research on Complex Outputs?

Important applications for which conventional methods don't fit!
  Noun-phrase co-reference: two-step approaches of pair-wise classification and clustering as postprocessing, e.g. [Ng & Cardie, 2002]
  Directly optimize complex loss functions (e.g. F1, AvgPrec)
Improve upon existing methods!
  Natural language parsing: generative models like probabilistic context-free grammars
  SVM outperforms naive Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]

    Precision/Recall Break-Even Point    Naive Bayes    Linear SVM
    Reuters                                     72.1          87.5
    WebKB                                       82.0          90.3
    Ohsumed                                     62.4          71.6

More flexible models!
  Avoid generative (independence) assumptions
  Kernels for structured input spaces and non-linear functions
Transfer what we learned for classification and regression!
  Boosting
  Bagging
  Support Vector Machines
