Introduction to Support Vector Machines: BTR Workshop Fall 2006

This document introduces support vector machines (SVMs) for classification. It discusses how SVMs find the optimal separating hyperplane with the largest margin between classes in the training data. It describes the soft-margin SVM, which allows some misclassified points to balance margin size and training error. The document outlines how SVMs can be extended to non-linear classification using kernels, which enable SVMs to learn in high-dimensional feature spaces without explicitly representing those spaces. Finally, it discusses how kernels can be used to apply SVMs to non-vectorial structured data like sequences, graphs, and trees.

Introduction to

Support Vector Machines


BTR Workshop
Fall 2006
Thorsten Joachims
Cornell University

Outline
Statistical Machine Learning Basics
Training error, generalization error, hypothesis space

Support Vector Machines for Classification

Optimal hyperplanes and margins


Soft-margin Support Vector Machine
Primal vs. dual optimization problem
Kernels

Support Vector Machines for Structured Outputs


Linear discriminant models
Solving exponentially-sized training problems
Example: Predicting the alignment between proteins

Supervised Learning
Find a function from input space X to output space Y such that the prediction error is low.

Example x (text document):
Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters.
→ y = 1

Example x (gene sequence):
GATACAACCTATCCCCGTATATATATTCT
ATGGGTATAGTATTAAATCAATACAACC
TATCCCCGTATATATATTCTATGGGTATA
GTATTAAATCAATACAACCTATCCCCGT
ATATATATTCTATGGGTATAGTATTAAAT
CAGATACAACCTATCCCCGTATATATAT
TCTATGGGTATAGTATTAAATCACATTTA
→ y = −1

The output need not be a class label: y can also be a real number, e.g. y = 7.3 (regression).

Example: Spam Filtering

Instance Space X:
Feature vector of word occurrences => binary features
N features (N typically > 50000)

Target Concept c:
Spam (+1) / Ham (-1)
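A minimal sketch of such a feature representation, with an invented vocabulary and message:

```python
# Minimal sketch: map a text to a binary bag-of-words feature vector.
# The vocabulary and example message are invented for illustration.
vocabulary = ["cheap", "meds", "meeting", "tomorrow", "viagra", "project"]

def to_binary_features(text, vocab):
    words = set(text.lower().split())
    # 1 if the vocabulary word occurs in the message, 0 otherwise
    return [1 if w in words else 0 for w in vocab]

x = to_binary_features("Cheap meds available tomorrow", vocabulary)
print(x)  # [1, 1, 0, 1, 0, 0]
```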

Learning as Prediction Task

Real-world Process P(X,Y)
  Training Sample Strain = (x1,y1), …, (xn,yn), drawn i.i.d. from P(X,Y)  →  Learner
  Test Sample Stest = (xn+1,yn+1), …, drawn i.i.d. from P(X,Y)

Goal: Find h with small prediction error ErrP(h) over P(X,Y).
Strategy: Find (any?) h with small error ErrStrain(h) on the training sample Strain.
  Training Error: Error ErrStrain(h) on the training sample.
  Test Error: Error ErrStest(h) on the test sample is an estimate of ErrP(h).

Linear Classification Rules

Hypotheses of the form
  unbiased:  h(x) = sign(w · x)
  biased:    h(x) = sign(w · x + b)
with parameter vector w and scalar b.

Hypothesis space H: the set of all such linear classifiers.

Notation

Optimal Hyperplanes
Linear Hard-Margin Support Vector Machine
Assumption: Training examples are linearly separable.

Margin of a Linear Classifier

Hard-Margin Separation
Goal: Find the hyperplane with the largest distance to the closest training examples.

Optimization Problem (Primal):
  minimize over w, b:   ½ w · w
  subject to:           yi (w · xi + b) ≥ 1   for all i = 1, …, n

Support Vectors: examples with minimal distance (i.e., examples that lie exactly on the margin).

Non-Separable Training Data


Limitations of hard-margin formulation
For some training data, there is no separating hyperplane.
Complete separation (i.e. zero training error) can lead to
suboptimal prediction error.

Soft-Margin Separation
Idea: Maximize margin and minimize training error.

Hard-Margin OP (Primal):
  min over w, b:      ½ w · w
  s.t.                yi (w · xi + b) ≥ 1   for all i

Soft-Margin OP (Primal):
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi  and  ξi ≥ 0   for all i

Slack variable ξi measures by how much (xi, yi) fails to achieve the margin.
Σi ξi is an upper bound on the number of training errors.
C is a parameter that controls the trade-off between margin and training error.

Controlling Soft-Margin Separation

Soft-Margin OP (Primal):
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Σi ξi is an upper bound on the number of training errors.
C controls the trade-off between margin and training error: a small C tolerates more training errors in exchange for a larger margin, a large C does the opposite.
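To see this trade-off numerically, here is a small sketch using scikit-learn's SVC (an independent SVM implementation, not the SVM-light tool referenced in these slides); the data and the C values are made up for illustration:

```python
# Sketch: effect of the soft-margin parameter C on a linear SVM.
# Uses scikit-learn; data and C values are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian classes, so no perfect separation exists.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin width 2/||w||
    errors = (clf.predict(X) != y).sum()       # training errors
    print(f"C={C:>6}: margin={margin:.3f}, training errors={errors}, "
          f"support vectors={len(clf.support_)}")
```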

Example Reuters acq: Varying C

Example: Margin in High-Dimension

[Table omitted: a small training sample described by seven sparse features x1, …, x7 with labels y ∈ {+1, −1}, and six candidate hyperplanes given by weights w1, …, w7 and bias b (entries such as 0.5, −0.5, 0.95, −0.95, 0.05, −0.05); the example compares the margins the different hyperplanes achieve on this sample.]

SVM Solution as Linear Combination

Primal OP:
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Theorem: The solution w* can always be written as a linear combination of the training vectors:
  w* = Σi αi yi xi   with   αi ≥ 0.

Properties:
  The factor αi indicates the influence of training example (xi, yi).
  If ξi > 0, then αi = C.
  If 0 ≤ αi < C, then ξi = 0.
  (xi, yi) is a support vector if and only if αi > 0.
  If 0 < αi < C, then yi (xi · w + b) = 1.
  SVM-light outputs the αi when given the -a option.

Dual SVM Optimization Problem

Primal Optimization Problem:
  min over w, b, ξ:   ½ w · w + C Σi ξi
  s.t.                yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Dual Optimization Problem:
  max over α:   Σi αi − ½ Σi Σj yi yj αi αj (xi · xj)
  s.t.          Σi yi αi = 0   and   0 ≤ αi ≤ C

Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, then
  w* = Σi αi* yi xi.

Leave-One-Out (i.e., n-fold CV)

Training Set: Strain = (x1,y1), …, (xn,yn)
Approach: Repeatedly leave one example out for testing, train on the remaining n − 1 examples.
Estimate: ErrLOO = fraction of the n held-out examples that are misclassified.
Question: Is there a cheaper way to compute this estimate?

Necessary Condition for Leave-One-Out Error

Lemma: For SVM, example i can produce a leave-one-out error only if 2 αi R² + ξi ≥ 1.
  αi : dual variable of example i
  ξi : slack variable of example i
  R  : bound on the length of the examples, ‖xi‖ ≤ R

Example: value of 2 αi R² + ξi  →  leave-one-out error?
  0.0  →  correct
  0.7  →  correct
  3.5  →  error
  0.1  →  correct
  1.3  →  correct

Case 1: Example is not a SV
  Criterion: (αi = 0)  ⇒  (ξi = 0)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct

Case 2: Example is a SV with Low Influence
  Criterion: (αi < 0.5/R² < C)  ⇒  (ξi = 0)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct

Case 3: Example has Small Training Error
  Criterion: (αi = C) and (ξi < 1 − 2CR²)  ⇒  (2 αi R² + ξi < 1)  ⇒  correct
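The arithmetic behind Case 3, spelled out (my own one-line derivation of the criterion above):

```latex
% Case 3: \alpha_i = C and \xi_i < 1 - 2CR^2 imply the leave-one-out bound.
\[
2\alpha_i R^2 + \xi_i \;=\; 2CR^2 + \xi_i \;<\; 2CR^2 + \bigl(1 - 2CR^2\bigr) \;=\; 1 .
\]
```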

Experiment: Reuters Text Classification


Experiment Setup
6451 Training Examples
6451 Validation Examples to estimate true Prediction Error
Comparison between Leave-One-Out upper bound and error
on Validation Set (average over 10 test/validation splits)

Fast Leave-One-Out Estimation for SVMs

Lemma: Training errors are always leave-one-out errors.
Algorithm:
  (R, α, ξ) = trainSVM(Strain)
  FOR (xi, yi) ∈ Strain:
    IF ξi > 1 THEN loo++                            (training error ⇒ leave-one-out error)
    ELSE IF 2 αi R² + ξi < 1 THEN loo unchanged     (guaranteed leave-one-out correct)
    ELSE trainSVM(Strain \ {(xi, yi)}) and test (xi, yi) explicitly
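A rough sketch of this estimate on top of scikit-learn's SVC rather than SVM-light; the mapping of αi, ξi, and R onto SVC attributes is my own reading of that API, and the data are synthetic:

```python
# Sketch: count examples whose leave-one-out prediction is guaranteed correct
# because 2*alpha_i*R^2 + xi_i < 1; only the remaining examples would need
# explicit retraining. scikit-learn SVC is used for illustration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 5) + 1.0, rng.randn(100, 5) - 1.0])
y = np.array([1] * 100 + [-1] * 100)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

R = np.linalg.norm(X, axis=1).max()              # bound on ||x_i||
alpha = np.zeros(len(X))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])  # alpha_i (dual_coef_ holds y_i * alpha_i)
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))  # slack variables

value = 2 * alpha * R**2 + xi
guaranteed_correct = (value < 1).sum()
print(f"{guaranteed_correct}/{len(X)} examples certified correct without retraining")
```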

Experiment:
  Training Sample       Retraining Steps (%)   CPU-Time (sec)
  Reuters  (n=6451)            0.58%                  32.3
  WebKB    (n=2092)           20.42%                 235.4
  Ohsumed  (n=10000)           2.56%                1132.3

Non-Linear Problems

Problem:
some tasks have non-linear structure
no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?

Extending the Hypothesis Space


Idea: add more features

Learn linear rule in feature space.


Example:

The separating hyperplane in feature space is a polynomial of degree two in input space.

Example:
  Input Space:    x = (x1, x2)                                       (2 attributes)
  Feature Space:  Φ(x) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1)        (6 attributes)

Dual SVM Optimization Problem (recap)

Primal Optimization Problem:
  min over w, b, ξ:   ½ w · w + C Σi ξi     s.t.   yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Dual Optimization Problem:
  max over α:   Σi αi − ½ Σi Σj yi yj αi αj (xi · xj)     s.t.   Σi yi αi = 0,  0 ≤ αi ≤ C

Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, then w* = Σi αi* yi xi.
Note: the dual depends on the training examples only through the inner products xi · xj.

Kernels
Problem: Very many parameters! Polynomials of degree p over N attributes in input space lead to O(N^p) attributes in feature space!
Solution [Boser et al.]: The dual OP depends only on inner products ⇒ kernel functions K(x, z) = Φ(x) · Φ(z).

Example: For Φ(x) = (x1², x2², √2 x1 x2, √2 x1, √2 x2, 1), calculating K(x, z) = (x · z + 1)² computes the inner product in feature space.
⇒ no need to represent the feature space explicitly.
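A quick numerical check of this kernel trick for the degree-two example (the test vectors are made up; the feature map Φ is the one given above):

```python
# Verify that K(x, z) = (x.z + 1)^2 equals the inner product in the
# 6-dimensional feature space Phi(x) = (x1^2, x2^2, sqrt(2)x1x2, sqrt(2)x1, sqrt(2)x2, 1).
import numpy as np

def phi(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, x2**2, s*x1*x2, s*x1, s*x2, 1.0])

def K(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))   # both print 4.0
```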

SVM with Kernel

Training: solve the dual with K(xi, xj) in place of the inner products xi · xj.
Classification: h(x) = sign( Σi αi yi K(xi, x) + b )

New hypothesis spaces through new kernels:
  Linear:                  K(x, z) = x · z
  Polynomial:              K(x, z) = (x · z + 1)^d
  Radial Basis Function:   K(x, z) = exp(−γ ‖x − z‖²)
  Sigmoid:                 K(x, z) = tanh(γ (x · z) + c)

Examples of Kernels

[Figure: decision boundaries obtained with a polynomial kernel and with a radial basis function kernel.]

What is a Valid Kernel?

Definition: Let X be a nonempty set. A function K is a valid kernel on X if for all n and all x1, …, xn ∈ X it produces a Gram matrix
  Gij = K(xi, xj)
that is symmetric,
  G = Gᵀ,
and positive semi-definite,
  αᵀ G α ≥ 0 for all α ∈ ℝⁿ.
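A small numerical sketch of the definition: build the Gram matrix of an RBF kernel on a few random points and confirm symmetry and positive semi-definiteness (points and bandwidth are arbitrary):

```python
# Check symmetry and positive semi-definiteness of a Gram matrix numerically.
# The sample points and the RBF bandwidth gamma are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
gamma = 0.5

def rbf(x, z):
    return np.exp(-gamma * np.sum((x - z) ** 2))

G = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print("symmetric:", np.allclose(G, G.T))
eigvals = np.linalg.eigvalsh(G)              # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min()) # >= 0 up to numerical error
```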

How to Construct Valid Kernels

Theorem: Let K1 and K2 be valid kernels over X × X with X ⊆ ℝ^N, let a ≥ 0 and 0 ≤ λ ≤ 1, let f be a real-valued function on X, let φ: X → ℝ^m with K3 a valid kernel over ℝ^m × ℝ^m, and let B be a symmetric positive semi-definite N × N matrix. Then the following functions are valid kernels:
  K(x, z) = λ K1(x, z) + (1 − λ) K2(x, z)
  K(x, z) = a K1(x, z)
  K(x, z) = K1(x, z) K2(x, z)
  K(x, z) = f(x) f(z)
  K(x, z) = K3(φ(x), φ(z))
  K(x, z) = xᵀ B z

Kernels for Discrete and Structured Data

Kernels for Sequences: Two sequences are similar if they have many common and consecutive subsequences.
Example [Lodhi et al., 2000]: For 0 < λ ≤ 1, consider the feature space of two-character (possibly gapped) subsequences
  c-a, c-t, a-t, b-a, b-t, c-r, a-r, b-r
for the words (cat), (car), (bat), (bar); each feature value is λ raised to the length of the window the subsequence spans in the word (e.g., Φc-a(cat) = λ², Φc-t(cat) = λ³).
⇒ K(car, cat) = λ⁴; efficient computation via dynamic programming.
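A brute-force sketch of this two-character subsequence feature space with gap decay λ (illustrative only; the efficient version is the dynamic program mentioned above):

```python
# Brute-force 2-character subsequence kernel with gap decay lambda^span.
# Illustrative only; the efficient version uses dynamic programming.
from collections import defaultdict
from itertools import combinations

def features(word, lam):
    phi = defaultdict(float)
    for i, j in combinations(range(len(word)), 2):
        subseq = word[i] + "-" + word[j]
        phi[subseq] += lam ** (j - i + 1)   # weight by length of the spanned window
    return phi

def K(u, v, lam=0.5):
    fu, fv = features(u, lam), features(v, lam)
    return sum(fu[s] * fv[s] for s in fu if s in fv)

lam = 0.5
print(K("car", "cat", lam), lam ** 4)   # only "c-a" is shared: lambda^2 * lambda^2 = lambda^4
```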

Kernels for Non-Vectorial Data


Applications with Non-Vectorial Input Data
classify non-vectorial objects

Protein classification (x is string of amino acids)


Drug activity prediction (x is molecule structure)
Information extraction (x is sentence of words)
Etc.

Applications with Non-Vectorial Output Data


predict non-vectorial objects
Natural Language Parsing (y is parse tree)
Noun-Phrase Co-reference Resolution (y is clustering)
Search engines (y is ranking)

Kernels can compute inner products efficiently!

Properties of SVMs with Kernels


Expressiveness
SVMs with Kernel can represent any boolean function (for
appropriate choice of kernel)
SVMs with Kernel can represent any sufficiently smooth
function to arbitrary accuracy (for appropriate choice of
kernel)

Computational
Objective function has no local optima (only one global)
Independent of dimensionality of feature space

Design decisions
Kernel type and parameters
Value of C
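Both design decisions can be searched jointly by cross-validation; a hedged sketch with scikit-learn's GridSearchCV (an independent implementation, with arbitrary parameter grids and synthetic data):

```python
# Sketch: select kernel type/parameters and C by cross-validated grid search.
# scikit-learn is used for illustration; grids and data are arbitrary.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.sign(np.sin(X[:, 0]) + X[:, 1] ** 2 - 0.5)   # a non-linear target

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```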

Reading: Support Vector Machines


Books
Schoelkopf, Smola, Learning with Kernels, MIT Press,
2002.
Cristianini, Shawe-Taylor. Introduction to Support Vector
Machines, Cambridge University Press, 2000.
Cristianini, Shawe-Taylor. ???

SVMs for other Problems

Multi-class Classification
[Schoelkopf/Smola Book, Section 7.6]
Regression
[Schoelkopf/Smola Book, Section 1.6]
Outlier Detection
D.M.J. Tax and R.P.W. Duin, "Support vector domain description", Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
Ordinal Regression and Ranking
Herbrich et al., Large Margin Rank Boundaries for Ordinal
Regression, Advances in Large Margin Classifiers, MIT Press,
1999.
Joachims, Optimizing Search Engines using Clickthrough Data,
ACM SIGKDD Conference (KDD), 2001.


Examples of Complex Output Spaces

Natural Language Parsing
  Given a sequence of words x, predict the parse tree y.
  Dependencies from structural constraints, since y has to be a tree.

  x: The dog chased the cat
  y: [parse tree over x with nonterminals S, NP, VP, Det, N, V]

Examples of Complex Output Spaces

Multi-Label Classification
  Given a (bag-of-words) document x, predict a set of labels y.
  Dependencies between labels from correlations between labels ("iraq" and "oil" in a newswire corpus).

  x: Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with …
  y: antarctica −1, benelux −1, germany −1, iraq +1, oil +1, coal −1, trade −1, acquisitions −1

Examples of Complex Output Spaces

Non-Standard Performance Measures (e.g., F1-score, Lift)
  F1-score: harmonic average of precision and recall.
  New example vector x8: predict y8 = 1 if P(y8 = 1 | x8) = 0.4? It depends on the other examples!

  [Figure: two sets of examples with predicted probabilities p and labels y, together with the threshold that maximizes the F1-score in each set; the best prediction for the new example differs between the two sets.]

Examples of Complex Output Spaces

Information Retrieval
  Given a query x, predict a ranking y.
  Dependencies between results (e.g., avoid redundant hits).
  Loss function over rankings (e.g., AvgPrec).

  x: SVM
  y: 1. Kernel-Machines
     2. SVM-Light
     3. Learning with Kernels
     4. SV Meppen Fan Club
     5. Service Master & Co.
     6. School of Volunteer Management
     7. SV Mattersburg Online

Examples of Complex Output Spaces

Noun-Phrase Co-reference
  Given a set of noun phrases x, predict a clustering y.
  Structural dependencies, since the prediction has to be an equivalence relation.
  Correlation dependencies from interactions.

  x: "The policeman fed the cat. He did not know that he was late. The cat is called Peter."
  y: [clustering of the noun phrases, e.g. {The policeman, He, he} and {the cat, The cat, Peter}]

Examples of Complex Output Spaces

Protein Sequence Alignment
  Given two sequences x = (s, t), predict an alignment y.
  Structural dependencies, since the prediction has to be a valid global/local alignment.

  x:  s: ABJLHBNJYAUGAI
      t: BHJKBNYGU
  y:  AB-JLHBNJYAUGAI
      BHJK-BN-YGU

Outline: Structured Output Prediction with SVMs
Task: Learning to predict complex outputs
SVM algorithm for complex outputs
Formulation as convex quadratic program
General algorithm
Sparsity bound

Example 1: Learning to parse natural language


Learning weighted context free grammar

Example 2: Learning to align proteins


Learning to predict optimal alignment of homologous proteins
for comparative modelling

Why do we Need Research on Complex Outputs?

Important applications for which conventional methods don't fit!
  Noun-phrase co-reference: two-step approaches of pair-wise classification and clustering as postprocessing, e.g. [Ng & Cardie, 2002]
  Directly optimize complex loss functions (e.g. F1, AvgPrec)
Improve upon existing methods!
  Natural language parsing: generative models like probabilistic context-free grammars
  SVM outperforms naive Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]
More flexible models!
  Avoid generative (independence) assumptions
  Kernels for structured input spaces and non-linear functions
Transfer what we learned for classification and regression!
  Boosting
  Bagging
  Support Vector Machines

Related Work

Generative training (i.e. learn P(Y,X))


Hidden-Markov models
Probabilistic context-free grammars
Markov random fields
Etc.
Discriminative training (i.e. learn P(Y|X))
Multivariate output regression [Izeman, 1975] [Breiman & Friedman,
1997]
Kernel Dependency Estimation [Weston et al. 2003]
Conditional HMM [Krogh, 1994]
Transformer networks [LeCun et al, 1998]
Conditional random fields [Lafferty et al., 2001]
Perceptron training of HMM [Collins, 2002]
Maximum-margin Markov networks [Taskar et al., 2003]

Challenges in Discriminative Learning with Complex Outputs

Approach: view the task as multi-class classification, where every complex output y is one class.

Problems:
  Exponentially many classes!
    How to predict efficiently?
    How to learn efficiently?
  Potentially huge model!
    Manageable number of features?

[Figure: the sentence x = "The dog chased the cat" with candidate parse trees y1, y2, …, yk, each built from the nonterminals S, NP, VP, Det, N, V and each treated as its own class.]

Support Vector Machine [Vapnik et al.]

Training Examples: (x1, y1), …, (xn, yn) with yi ∈ {−1, +1}
Hypothesis Space: h(x) = sign(w · x + b)

Training: Find the hyperplane with minimal ‖w‖, i.e. with maximal margin.
Optimization Problem:
  Hard Margin (separable):       min ½ w · w             s.t.  yi (w · xi + b) ≥ 1
  Soft Margin (training error):  min ½ w · w + C Σi ξi   s.t.  yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0

Multi-Class SVM [Crammer & Singer]

Training Examples: (x1, y1), …, (xn, yn), where each yi is one of k classes
Hypothesis Space: h(x) = argmax over y of  wy · x   (one weight vector wy per class)

[Figure: the sentence x = "The dog chased the cat" with its candidate parse trees y1, y2, … treated as the classes.]

Multi-Class SVM [Crammer & Singer]

Training: Find w1, …, wk that solve
  min ½ Σy wy · wy     s.t.  for all i and all y ≠ yi:   wyi · xi − wy · xi ≥ 1

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as before, the candidate parse trees of "The dog chased the cat" viewed as classes.]

Joint Feature Map

Feature vector Φ(x, y) that describes the match between x and y.
Learn a single weight vector w and rank outputs by w · Φ(x, y).

[Figure: the candidate parse trees y1, y2, … of x = "The dog chased the cat", each scored by w · Φ(x, y).]

Joint Feature Map

Feature vector Φ(x, y) that describes the match between x and y.
Learn a single weight vector w and rank outputs by w · Φ(x, y).

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as before.]

Joint Feature Map for Trees

Weighted Context-Free Grammar
  Each rule (e.g., S → NP VP) has a weight.
  The score of a tree is the sum of the weights of the rules it uses.
  The CKY parser finds the highest-scoring tree: f : X → Y.

Example: x = "The dog chased the cat", y = its parse tree.
Φ(x, y) counts how often each rule of the grammar is used in y:
  1  S → NP VP
  0  S → NP
  2  NP → Det N
  1  VP → V NP
  0  Det → dog
  2  Det → the
  1  N → dog
  1  V → chased
  1  N → cat

Joint Feature Map for Trees

Weighted Context-Free Grammar: each rule has a weight, the score of a tree is the sum of its rule weights, and the CKY parser finds the highest-scoring tree f : X → Y.

Problems:
  How to predict efficiently?
  How to learn efficiently?
  Manageable number of parameters?

[Figure: as above, Φ(x, y) given by the rule counts for "The dog chased the cat".]
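A small sketch of this rule-count representation and the resulting tree score; the rule weights below are invented for illustration, only the counts come from the example above:

```python
# Sketch: joint feature map Phi(x, y) as grammar-rule counts, and the
# tree score w . Phi(x, y) as the sum of the used rules' weights.
# The weights here are invented for illustration.
from collections import Counter

# Rules used by the parse tree of "The dog chased the cat"
tree_rules = [
    "S -> NP VP",
    "NP -> Det N", "NP -> Det N",
    "VP -> V NP",
    "Det -> the", "Det -> the",
    "N -> dog", "N -> cat",
    "V -> chased",
]

phi = Counter(tree_rules)            # Phi(x, y): rule -> count

w = {                                # hypothetical learned rule weights
    "S -> NP VP": 1.2, "S -> NP": -0.3,
    "NP -> Det N": 0.8, "VP -> V NP": 0.9,
    "Det -> the": 0.4, "Det -> dog": -1.0,
    "N -> dog": 0.6, "N -> cat": 0.5, "V -> chased": 0.7,
}

score = sum(w.get(rule, 0.0) * count for rule, count in phi.items())
print(phi["NP -> Det N"], score)     # count 2, and score 1.2 + 2*0.8 + 0.9 + 2*0.4 + 0.6 + 0.5 + 0.7
```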

Structural Support Vector Machine

Joint features Φ(x, y) describe the match between x and y.
Learn weights w so that w · Φ(x, y) is maximal for the correct y.

Hard-margin optimization problem:
  min over w:   ½ w · w
  s.t.          for all i and all y ≠ yi:   w · Φ(xi, yi) − w · Φ(xi, y) ≥ 1

Loss Functions: Soft-Margin Struct SVM

Loss function Δ(yi, y) measures how much the prediction y differs from the target yi.
Soft-margin optimization problem: relax the margin constraints with slack variables ξi; one common formulation is sketched below.
Lemma: The training loss is upper bounded by the sum of the slacks Σi ξi*.
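For reference, one common way to write the soft-margin structural SVM is the margin-rescaling formulation of Tsochantaridis et al.; this is a reconstruction consistent with the slack-bound lemma above, not necessarily the exact formula on the original slide:

```latex
% Soft-margin structural SVM, margin-rescaling variant
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}\ge 0}\quad
  & \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} \;+\; \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad
  & \forall i,\ \forall y \in Y \setminus \{y_i\}:\quad
    \mathbf{w}\cdot\Phi(x_i,y_i) \;-\; \mathbf{w}\cdot\Phi(x_i,y)
    \;\ge\; \Delta(y_i,y) \;-\; \xi_i
\end{aligned}
% Then \Delta(y_i, h(x_i)) \le \xi_i^*, so the training loss is bounded by the slacks.
```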

Sparse Approximation Algorithm for Structural SVM

Input: training examples (x1, y1), …, (xn, yn); parameters C and ε
Working set W = {} (empty)
REPEAT
  FOR i = 1, …, n
    compute the most violated constraint for example i
    IF it is violated by more than ε THEN
      add the constraint to the working set W
      optimize the StructSVM objective over W
    ENDIF
  ENDFOR
UNTIL W has not changed during the iteration

Polynomial Sparsity Bound

Theorem: The sparse-approximation algorithm finds a solution to the soft-margin optimization problem after adding at most a polynomially bounded number of constraints to the working set W, so that the Kuhn-Tucker conditions are fulfilled up to a precision ε. The loss has to be bounded, and the feature vectors Φ(xi, y) have to have bounded norm.

[Jo03] [TsoJoHoAl05]

Polynomial Sparsity Bound (continued)

[The slide repeats the theorem above and revisits the open problems: how to predict efficiently, how to learn efficiently, and whether the number of parameters stays manageable.]

[Jo03] [TsoJoHoAl05]

Experiment: Natural Language Parsing

Implementation
  Implemented the sparse-approximation algorithm in SVM-light
  Incorporated a modified version of Mark Johnson's CKY parser
  Learned a weighted CFG

Data
  Penn Treebank sentences of length at most 10 (start with POS tags)
  Train on sections 2-22: 4098 sentences
  Test on section 23: 163 sentences

[TsoJoHoAl05]

More Expressive Features


Linear composition:

So far:

General:

Example:

Experiment: Part-of-Speech Tagging

Task
  Given a sequence of words x, predict the sequence of tags y.
    x: The dog chased the cat   →   y: Det N V Det N
  Dependencies from tag-tag transitions in a Markov model.

Model
  Markov model with one state per tag and words as emissions
  Each word described by a ~250,000-dimensional feature vector (all word suffixes/prefixes, word length, capitalization, …)

Experiment (by Dan Fleisher)
  Train/test on 7966/1700 sentences from the Penn Treebank
  [Bar chart: test accuracy (%) of Brill (RBT), HMM (ACOPOST), kNN (MBT), Tree Tagger, SVM Multiclass (SVM-light), and SVM-HMM (SVM-struct); the values shown are 94.68, 95.02, 95.63, 95.75, 95.78, and 96.49.]

Applying StructSVM to a New Problem

Basic algorithm implemented in SVM-struct: http://svmlight.joachims.org
Application specific:
  Loss function Δ(y, ŷ)
  Representation Φ(x, y)
  Algorithms to compute the prediction argmax and the most violated constraint
Generic structure that covers OMM, MPD, Finite-State Transducers, MRF, etc. (polynomial time inference)

Outline: Structured Output Prediction with SVMs
Task: Learning to predict complex outputs
SVM algorithm for complex outputs
Formulation as convex quadratic program
General algorithm
Sparsity bound

Example 1: Learning to parse natural language


Learning weighted context free grammar

Example 2: Learning to align proteins


Learning to predict optimal alignment of homologous proteins
for comparative modeling

Comparative Modeling of Protein Structure

Goal: Predict structure from sequence:  h(APPGEAYLQV) → [3-D structure]
Hypothesis: Amino acid sequences fold into the structure with the lowest energy.
Problem: Huge search space (> 2^100 states)

Approach: Comparative Modeling
  Similar protein sequences fold into similar shapes → use known shapes as templates.
  Task 1: Find a similar known protein for a new protein:  h(APPGEAYLQV, [template structure]) → yes/no
  Task 2: Map the new protein into the known structure:  h(APPGEAYLQV, [template structure]) → [A3, P4, P7, …]

Predicting an Alignment

Protein Sequence to Structure Alignment (Threading)
  Given a pair x = (s, t) of a new sequence s and a known structure t, predict the alignment y.
  Elements of s and t are described by features, not just character identity.

  x:  s: ABJLHBNJYAUGAI   (with per-position annotations BBBLLBBLLHHHHH and 32401450143520)
      t: BHJKBNYGU        (with per-position annotation BBLLBBLLH)

  y:  AB-JLHBNJYAUGAI     (annotations aligned: BB-BLLBBLLHHHHH and 32-401450143520)
      BHJK-BN-YGU         (annotation aligned: BBLL-BB-LLH)

Linear Score Sequence Alignment

Method: Find the alignment y that maximizes the linear score.
Example:
  Sequences:  s = (A B C D),   t = (B A C C)
  Substitution/gap scores:
         A    B    C    D    -
    A   10    0   -5  -10   -5
    B    0   10    5  -10   -5
    C   -5    5   10  -10   -5
    D  -10  -10  -10   10   -5
    -   -5   -5   -5   -5   -5
  Alignment y1:
    A B C D
    B A C C          score = 0 + 0 + 10 − 10 = 0
  Alignment y2:
    - A B C D
    B A C C -        score = −5 + 10 + 5 + 10 − 5 = 15
Algorithm: Dynamic programming
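A compact sketch of that dynamic program (global alignment in the Needleman-Wunsch style), using the scores from the example above; it recovers the optimal score of 15 achieved by alignment y2:

```python
# Global sequence alignment by dynamic programming with a linear score.
# Substitution and gap scores are taken from the example above.
sub = {
    ("A","A"):10, ("A","B"):0,  ("A","C"):-5,  ("A","D"):-10,
    ("B","A"):0,  ("B","B"):10, ("B","C"):5,   ("B","D"):-10,
    ("C","A"):-5, ("C","B"):5,  ("C","C"):10,  ("C","D"):-10,
    ("D","A"):-10,("D","B"):-10,("D","C"):-10, ("D","D"):10,
}
GAP = -5

def align_score(s, t):
    n, m = len(s), len(t)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i-1][j-1] + sub[(s[i-1], t[j-1])],  # match/mismatch
                          F[i-1][j] + GAP,                      # gap in t
                          F[i][j-1] + GAP)                      # gap in s
    return F[n][m]

print(align_score("ABCD", "BACC"))   # 15, the score of alignment y2
```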

How to Estimate the Scores?

General form of linear scoring function: each match/gap score is a linear function w · φ of features of the aligned positions, and the score of an alignment is their sum.
Estimation:
  Generative estimation via log-odds or a hidden Markov model.
  Discriminative estimation of complex models via SVM: the match/gap score can be an arbitrary linear function.

Expressive Scoring Functions


Conventional substitution matrix
Poor performance at low sequence similarity, if only amino
acid identity is considered
Difficult to design generative models that take care of the
dependencies between different features.
Would like to make use of structural features like secondary
structures, exposed surface area, and take into account the
interactions between these features

General feature-based scoring function


Allows us to describe each character by feature vector (e.g.
secondary structure, exposed surface area, contact profile)
Learn w vector of parameters
Computation of argmax still tractable via dynamic program

Loss Function

Q loss: fraction of incorrectly aligned characters.
  Correct alignment y:     A B C D
                           B A C C
  Alternate alignment y':  A - B C D
                           B A C C -
  Q(y, y') = 1/3

Q4 loss: fraction of incorrect alignments outside a window of 4 positions (small shifts are not counted as errors).
  For the same alignments, Q4(y, y') = 0/3.

Model how bad different types of mistakes are for structural modelling.

Experiment

Train set [Qiu & Elber]:
  5119 structural alignments for training, 5169 structural alignments for validation of the regularization parameter C
Test set:
  29764 structural alignments from new deposits to PDB from June 2005 to June 2006.
  All structural alignments produced by the program CE by superposing the 3D coordinates of the protein structures. All alignments have CE Z-score greater than 4.5.
Features (known for the structure, predicted for the sequence):
  Amino acid identity (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
  Secondary structure (α, β, loop)
  Exposed surface area (0, 1, 2, 3, 4, 5)

Results: Model Complexity

Feature vectors:
  Simple:  Φ(s, t, yi) → (A|A; A|C; …; -|Y; …; 0|0; 0|1; …)
  Anova2:  Φ(s, t, yi) → (A|A; 0|0; A0|A0; …)
  Tensor:  Φ(s, t, yi) → (A0|A0; A0|A1; …)
  Window:  Φ(s, t, yi) → (AAA|AAA; …; 00000|00000; …)

Q-score when optimizing to Q-loss:
  Feature vector   # Features   Training   Validation   Test
  Simple                 1020      26.83        27.79   39.89
  Anova2                49634      42.25        35.58   44.98
  Tensor               203280      52.36        34.79   42.81
  Window               447016      51.26        38.09   46.30

Results: Comparison

  Method                          Q4-score (Test)
  SVM (Window, Q4-loss)                 70.71
  SSALN [Qiu & Elber]                   67.30
  BLAST                                 28.44
  TM-align [Zhang & Skolnick]          (85.32)

Methods:
  SVM: trained on the Window feature vector with Q4-loss
  SSALN: generative method using the same training data
  BLAST: lower baseline
  TM-align: upper baseline (disagreement between two structural alignment methods)

Conclusions:
Structured Output Prediction

Learning to predict complex output


Predict structured objects
Optimize loss functions over multivariate predictions
An SVM method for learning with complex outputs
Learning to predict trees (natural language parsing) [Tsochantaridis et
al. 2004 (ICML), 2005 (JMLR)] [Taskar et al., 2004 (ACL)]
Optimize to non-standard performance measures (imbalanced classes)
[Joachims, 2005 (ICML)]
Learning to cluster (noun-phrase coreference resolution) [Finley,
Joachims, 2005 (ICML)]
Learning to align proteins [Yu et al., 2005 (ICML Workshop)]
Software: SVMstruct
http://svmlight.joachims.org/

Reading: Structured Output Prediction

Generative training
Hidden-Markov models [Manning & Schuetze, 1999]
Probabilistic context-free grammars [Manning & Schuetze, 1999]
Markov random fields [Geman & Geman, 1984]
Etc.
Discriminative training
Multivariate output regression [Izeman, 1975] [Breiman & Friedman, 1997]
Kernel Dependency Estimation [Weston et al. 2003]
Conditional HMM [Krogh, 1994]
Transformer networks [LeCun et al, 1998]
Conditional random fields [Lafferty et al., 2001] [Sutton & McCallum, 2005]
Perceptron training of HMM [Collins, 2002]
Structural SVMs / Maximum-margin Markov networks [Taskar et al., 2003]
[Tsochantaridis et al., 2004, 2005] [Taskar 2004]


Why do we Need Research on Complex Outputs?

Important applications for which conventional methods don't fit!
  Noun-phrase co-reference: two-step approaches of pair-wise classification and clustering as postprocessing, e.g. [Ng & Cardie, 2002]
  Directly optimize complex loss functions (e.g. F1, AvgPrec)
Improve upon existing methods!
  Natural language parsing: generative models like probabilistic context-free grammars
  SVM outperforms naive Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]

    Precision/Recall Break-Even Point    Naive Bayes    Linear SVM
    Reuters                                     72.1          87.5
    WebKB                                       82.0          90.3
    Ohsumed                                     62.4          71.6

More flexible models!
  Avoid generative (independence) assumptions
  Kernels for structured input spaces and non-linear functions
Transfer what we learned for classification and regression!
  Boosting
  Bagging
  Support Vector Machines
