Structured Prediction
Learning Objectives:
• Recognize when a problem should be solved using a structured prediction technique.
• Implement the structured perceptron algorithm for sequence labeling.
• Map “argmax” problems to integer linear programs.
• Augment the structured perceptron with losses to derive structured SVMs.

It is often the case that instead of predicting a single output, you need to predict multiple, correlated outputs simultaneously. In natural language processing, you might want to assign a syntactic label (like noun, verb, adjective, etc.) to words in a sentence: there is clear correlation among these labels. In computer vision, you might want to label regions in an image with object categories; again, there is correlation among these regions. The branch of machine learning that studies such questions is structured prediction.
In this chapter, we will cover two of the most common algorithms for structured prediction: the structured perceptron and the structured support vector machine. We will consider two types of structure. The first is the “sequence labeling” problem, typified by the natural language processing example above, but also common in computational biology (labeling nucleotides in DNA sequences) and robotics (labeling actions in a sequence). For this, we will develop specialized prediction algorithms that take advantage of the sequential nature of the task. We will also consider more general structures beyond sequences, and discuss how to cast them in a generic optimization framework: integer linear programming (or ILP).
The general framework we will explore is that of jointly scoring input/output configurations. We will construct algorithms that learn a function s(x, ŷ) (s for “score”), where x is an input (like an image) and ŷ is some predicted output (like a segmentation of that image). For any given image, there are a lot of possible segmentations (i.e., a lot of possible ŷs), and the goal of s is to rank them in order of “how good” they are: how compatible they are with the input x. The most important thing is that the scoring function s ranks the true segmentation y higher than any other imposter segmentation ŷ. That is, we want to ensure that s(x, y) > s(x, ŷ) for all ŷ ≠ y. The main challenge we will face is how to do this efficiently, given that there are so many imposter ŷs.
196 a course in machine learning
s(x, y) = w · φ(x, y)    (17.1)
If this predicted output is correct (i.e., ŷ = y), then, per the normal perceptron, we will do nothing. Suppose that ŷ ≠ y. This means that the score of ŷ is greater than the score of y, so we want to update w so that the score of ŷ is decreased and the score of y is increased. We do this by:
w ← w + φ(x, y) − φ(x, ŷ)    (17.5)
To make sure this is doing what we expect, let’s consider what would happen if we computed scores under the updated value of w. To make the notation clear, let’s say w^(old) are the weights before the update and w^(new) the weights after. Then:

w^(new) · φ(x, y)
  = (w^(old) + φ(x, y) − φ(x, ŷ)) · φ(x, y)    (17.6)
  = w^(old) · φ(x, y) + φ(x, y) · φ(x, y) − φ(x, ŷ) · φ(x, y)    (17.7)

Here, the first term is the old prediction. The second term is of the form a · a, which is non-negative (and, unless φ(x, y) is the zero vector, positive). The third term is the dot product between φ for two different labels, which by definition of φ is zero (see Eq (17.2)). (Verify that the score of ŷ, w^(new) · φ(x, ŷ), decreases after an update, as we would want.) This gives rise to the updated multiclass perceptron specified in Algorithm 17.1. As with the normal perceptron, the generalization performance of the multiclass perceptron increases dramatically if you do weight averaging.
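Since Algorithm 17.1 itself is not reproduced in this excerpt, here is a minimal sketch of multiclass perceptron training. The joint feature map `phi` used here (a class-specific copy of x, as in the block construction below) and all names are illustrative choices, not the book’s code:

```python
import numpy as np

def phi(x, y, num_classes):
    """Joint feature map: place a copy of x in the block for class y
    (one common choice; blocks for other classes stay zero)."""
    out = np.zeros(num_classes * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def multiclass_perceptron_train(data, num_classes, dim, epochs=10):
    """Sketch of the multiclass perceptron: predict by argmax of the
    joint score w . phi(x, k); on a mistake, move w toward phi(x, y)
    and away from phi(x, y_hat), exactly as in Eq (17.5)."""
    w = np.zeros(num_classes * dim)
    for _ in range(epochs):
        for x, y in data:
            scores = [w @ phi(x, k, num_classes) for k in range(num_classes)]
            y_hat = int(np.argmax(scores))
            if y_hat != y:  # mistake: raise score of y, lower score of y_hat
                w += phi(x, y, num_classes) - phi(x, y_hat, num_classes)
    return w
```

For weight averaging, one would additionally accumulate a running sum of w after every example and return the average; that bookkeeping is omitted here for brevity.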
An important note is that MulticlassPerceptronTrain is actually more powerful than suggested so far. For instance, suppose that you have three categories, but believe that two of them are tightly related, while the third is very different. For instance, the categories might be {music, movies, oncology}. You can encode this relatedness by defining a feature expansion φ that reflects this:

φ(x, music) = ⟨x, 0, 0, x⟩    (17.9)
φ(x, movies) = ⟨0, x, 0, x⟩    (17.10)
φ(x, oncology) = ⟨0, 0, x, 0⟩    (17.11)
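Eqs (17.9)–(17.11) can be sketched directly in code. The point is that music and movies share the fourth block, so any weight learned there transfers between the two related classes (the function name is mine):

```python
import numpy as np

def phi_related(x, label):
    """Feature expansion from Eqs (17.9)-(17.11): music and movies
    share the fourth block, so they share those learned weights."""
    z = np.zeros_like(x)
    blocks = {
        "music":    [x, z, z, x],   # Eq (17.9)
        "movies":   [z, x, z, x],   # Eq (17.10)
        "oncology": [z, z, x, z],   # Eq (17.11)
    }
    return np.concatenate(blocks[label])
```

A document labeled music thus updates both its private block and the shared block, nudging the score of movies in the same direction.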
• the number of times word w has been labeled with tag l, for all words w and all syntactic tags l
• the number of times tag l is adjacent to tag l′ in the output, for all tags l and l′

The first set of features is often called unary features, because they talk only about the relationship between the input (sentence) and a single (unit) label in the output sequence. The second set of features is often called Markov features, because they talk about adjacent labels in the output sequence, which is reminiscent of Markov models, which only have short-term memory.
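The two feature sets above can be sketched as a sparse feature extractor; this is an illustrative implementation, not the book’s:

```python
from collections import Counter

def sequence_features(words, tags):
    """Sparse phi(x, y) for sequence labeling: unary (word, tag) counts
    plus Markov (tag, next_tag) counts over adjacent output labels."""
    feats = Counter()
    for word, tag in zip(words, tags):
        feats[("unary", word, tag)] += 1       # word w labeled with tag l
    for prev, nxt in zip(tags, tags[1:]):
        feats[("markov", prev, nxt)] += 1      # tag l adjacent to tag l'
    return feats
```

The score of a candidate tagging is then just the dot product of these counts with the corresponding weights.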
Note that for a given input x of length L (in the example, L = 4), the number of possible outputs is K^L, where K is the number of syntactic tags. This means that the number of possible outputs grows exponentially in the length of the input. In general, we write Y(x) to mean “the set of all possible structured outputs for the input x”. We have just seen that |Y(x)| = K^len(x).
Despite the fact that the inputs and outputs have variable length, the size of the feature representation is constant. If there are V words in your vocabulary and K labels for a given word, then the number of unary features is VK and the number of Markov features is K², so the total number of features is VK + K².
What this means is that we can build a graph like that in Figure ??, with one vertical slice per time step (l = 1 . . . L).² Each edge in this graph will receive a weight, constructed in such a way that if you take a complete path through the lattice, and add up all the weights, this will correspond exactly to w · φ(x, y).

² A graph of this sort is called a trellis, and sometimes a lattice in the literature.
To complete the construction, let φ_l(x, · · · ◦ y ◦ y′) denote the unary features at position l together with the Markov features that end at position l.
ℓ^(Ham)(y, ŷ) = ∑_{l=1}^{L} 1[y_l ≠ ŷ_l]    (17.17)
min_{w,ξ}  ½ ||w||²  +  C ∑_n ξ_n    (17.18)
          (large margin)  (small slack)
subj. to  y_n (w · x_n + b) ≥ 1 − ξ_n   (∀n)
          ξ_n ≥ 0   (∀n)
min_w  ½ ||w||²  +  C ∑_n ℓ^(hin)(y_n, w · x_n + b)    (17.19)
      (large margin)  (small slack)

In the structured case, we want the score of the true output to beat the score of every imposter by at least that imposter’s loss, which gives one slack variable per (example, imposter) pair:

min_{w,ξ}  ½ ||w||²  +  C ∑_{n,ŷ} ξ_{n,ŷ}    (17.21)
subj. to  s_w(x_n, y_n) − s_w(x_n, ŷ) ≥ ℓ^(Ham)(y_n, ŷ) − ξ_{n,ŷ}   (∀n, ∀ŷ ∈ Y(x_n))
          ξ_{n,ŷ} ≥ 0   (∀n, ∀ŷ ∈ Y(x_n))
This optimization problem asks for a large margin and small slack, where there is a slack variable for every training example and every possible incorrect output associated with that training example. In general, this is way too many slack variables and way too many constraints!
There is a very useful, general trick we can apply. If you focus on the first constraint, it roughly says (letting s() denote score): s(y) ≥ [s(ŷ) + ℓ(y, ŷ)] for all ŷ, modulo slack. We’ll refer to the thing in brackets as the “loss-augmented score.” But if we want to guarantee that the score of the true y beats the loss-augmented score of all ŷ, it’s enough to ensure that it beats the loss-augmented score of the most confusing imposter. Namely, it is sufficient to require that s(y) ≥ max_ŷ [s(ŷ) + ℓ(y, ŷ)], modulo slack. Expanding out the definition of s() and adding slack back in, we can replace the exponentially large number of constraints in Eq (17.21) with the simpler set of constraints:

s_w(x_n, y_n) ≥ max_{ŷ ∈ Y(x_n)} [s_w(x_n, ŷ) + ℓ^(Ham)(y_n, ŷ)] − ξ_n   (∀n)
We can now apply the same trick as before to remove ξ_n from the analysis. In particular, because ξ_n is constrained to be ≥ 0 and because we are trying to minimize its sum, we can figure out that at the optimum, it will be the case that:

ξ_n = max{ 0, max_{ŷ ∈ Y(x_n)} [s_w(x_n, ŷ) + ℓ^(Ham)(y_n, ŷ)] − s_w(x_n, y_n) }    (17.22)
    = ℓ^(s-h)(y_n, x_n, w)    (17.23)
In particular, if the score of the true output beats the score of the best imposter by at least its loss, then ξ_n will be zero. On the other hand, if some imposter (plus its loss) beats the true output, the loss scales linearly as a function of the difference. At this point, there is nothing special about Hamming loss, so we will replace it with some arbitrary structured loss ℓ.
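Eqs (17.17) and (17.22) can be sketched directly. For clarity this enumerates Y(x) by brute force, which is only feasible for tiny outputs; a real implementation would use the loss-augmented search described later. The `score` function is an arbitrary stand-in for s_w:

```python
import itertools

def hamming(y, y_hat):
    """Eq (17.17): number of positions where the sequences disagree."""
    return sum(a != b for a, b in zip(y, y_hat))

def structured_hinge(score, x, y, labels):
    """Eq (17.22): slack = max(0, max_yhat [s(x,yhat) + Ham(y,yhat)] - s(x,y)).
    Brute-force enumeration of Y(x) -- exponential, illustration only."""
    best = max(score(x, y_hat) + hamming(y, y_hat)
               for y_hat in itertools.product(labels, repeat=len(y)))
    return max(0.0, best - score(x, y))
```

Note that the true y always participates in the inner max with loss 0, so the hinge is zero exactly when y itself attains the loss-augmented maximum.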
Plugging this back into the objective function of Eq (17.21), we can write the structured SVM as an unconstrained optimization problem, akin to Eq (17.19), as:

min_w  ½ ||w||²  +  C ∑_n ℓ^(s-h)(y_n, x_n, w)    (17.24)
If we define ŷ_n to be the output that attains the maximum above, then we can rearrange and take the gradient:

∇_w ℓ^(s-h)(y_n, x_n, w) = ∇_w [ w · φ(x_n, ŷ_n) − w · φ(x_n, y_n) + ℓ(y_n, ŷ_n) ]    (17.27)
                         = φ(x_n, ŷ_n) − φ(x_n, y_n)    (17.28)
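One stochastic subgradient step built from Eq (17.28) can be sketched as follows. The toy unary `phi`, brute-force loss-augmented argmax, and all names are illustrative; the regularizer’s contribution to the gradient (w itself) is omitted from this sketch:

```python
import itertools
from collections import Counter

def phi(x, y):
    """Toy sparse joint features: unary (word, tag) counts only."""
    return Counter(zip(x, y))

def score(w, x, y):
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def hamming(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat))

def subgradient_step(w, x, y, labels, lr=1.0):
    """Find the loss-augmented argmax y_hat (brute force over a tiny Y(x)),
    then apply Eq (17.28): move w toward phi(x, y), away from phi(x, y_hat)."""
    y_hat = max(itertools.product(labels, repeat=len(y)),
                key=lambda c: score(w, x, c) + hamming(y, c))
    if y_hat != y:
        for f, v in phi(x, y).items():      # raise the truth's score
            w[f] = w.get(f, 0.0) + lr * v
        for f, v in phi(x, y_hat).items():  # lower the imposter's score
            w[f] = w.get(f, 0.0) - lr * v
    return w
```

When the true y itself attains the loss-augmented maximum, the subgradient is zero and no update occurs, mirroring the perceptron’s mistake-driven behavior.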
w · φ_l(x, · · · ◦ y ◦ y′) + 1[y′ ≠ y_l]    (17.30)
(edge score, as before)   (+1 for mispredictions)
Once this loss-augmented graph has been constructed, the same max-
weight path algorithm can be run to find the loss-augmented argmax
sequence.
These special cases are often very useful, and many problems can be
cast in one of these frameworks. However, it is often the case that you
need a more general solution.
One of the most generally useful solutions is to cast the argmax
problem as an integer linear program, or ILP. ILPs are a specific
type of mathematical program/optimization problem, in which the
objective function being optimized is linear and the constraints are
linear. However, unlike “normal” linear programs, in an ILP you are
allowed to have integer constraints and disallow fractional values.
The general form of an ILP is, for a fixed vector a: maximize a · z over z, subject to linear constraints on z, where the components of z are required to be integers.

a_{l,k′,k} = w · φ_l(x, ⟨. . . , k′, k⟩)    (17.34)
1. That all the zs are binary. That’s easy: just say z_{l,k′,k} ∈ {0, 1}, for all l, k′, k.

2. That the chosen edges form a connected path through the trellis; we need constraints to enforce this. We can do this as: ∑_{k′} z_{l,k′,k} = ∑_{k″} z_{l+1,k,k″} for all l, k. Effectively what this is saying is that z_{5,?,verb} = z_{6,verb,?}, where the “?” means “sum over all possibilities.”
This fully specifies an ILP that you can relatively easily implement
(arguably more easily than the dynamic program in Algorithm 17.7)
and which will solve the argmax problem for you. Will it be efficient?
In this case, probably yes. Will it be as efficient as the dynamic pro-
gram? Probably not.
It takes a bit of effort and time to get used to casting optimization
problems as ILPs, and certainly not all can be, but most can and it’s a
very nice alternative.
In the case of loss-augmented search for structured SVMs (as
opposed to structured perceptron), the objective function of the ILP
will need to be modified to include terms corresponding to the loss.
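To see that the constraints above really do carve out exactly the label paths, here is a brute-force sanity check on a tiny instance. It enumerates all binary assignments to z, keeps those satisfying one-edge-per-slice and the flow constraint from the text, and maximizes the linear objective; a real system would instead hand a and the constraints to an ILP solver. The encoding details (one edge per slice) are one illustrative choice, not the book’s exact formulation:

```python
import itertools

def best_path_via_ilp_constraints(L, K, a):
    """Brute-force the ILP: binary z[(l, k_prev, k)] over edge slices
    l = 0..L-2, subject to (i) exactly one edge chosen per slice and
    (ii) flow consistency sum_k' z[l,k',k] == sum_k'' z[l+1,k,k''].
    Returns the best objective sum_{l,k',k} a[l][k'][k] * z[l,k',k]."""
    n_edges = (L - 1) * K * K
    best = None
    for bits in itertools.product([0, 1], repeat=n_edges):
        it = iter(bits)
        z = {(l, kp, k): next(it)
             for l in range(L - 1) for kp in range(K) for k in range(K)}
        # (i) exactly one edge per slice
        if any(sum(z[(l, kp, k)] for kp in range(K) for k in range(K)) != 1
               for l in range(L - 1)):
            continue
        # (ii) flow consistency between adjacent slices
        if not all(sum(z[(l, kp, k)] for kp in range(K)) ==
                   sum(z[(l + 1, k, kn)] for kn in range(K))
                   for l in range(L - 2) for k in range(K)):
            continue
        obj = sum(a[l][kp][k] * z[(l, kp, k)]
                  for l in range(L - 1) for kp in range(K) for k in range(K))
        best = obj if best is None else max(best, obj)
    return best
```

On a small trellis, the feasible zs are in one-to-one correspondence with label sequences, so the ILP optimum matches a direct search over paths.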
w · φ(x, y) = w · ∑_{l=1}^{L} φ_l(x, y)    decomposition of structure    (17.35)
            = ∑_{l=1}^{L} w · φ_l(x, y)    linearity of the dot product    (17.36)
This means that (a) the score for the prefix ending at position 3 labeled as adjective is 12.7, and (b) the “winning” previous label was “verb”. We will need to record these winning previous labels so that we can extract the best path at the end. Let’s denote by ζ_{l,k} the label at position l − 1 that achieves the max.
From here, we can formally compute the αs recursively. The main observation that will be necessary is that, because we have limited ourselves to Markov features, φ_{l+1}(x, ⟨y_1, y_2, . . . , y_l, y_{l+1}⟩) depends only on the last two terms of y, and does not depend on y_1, y_2, . . . , y_{l−1}. The full recursion is derived as:
α_{0,k} = 0    ∀k    (the score for any empty sequence is zero)    (17.41)
ζ_{0,k} = ∅    ∀k    (17.42)
α_{l+1,k} = max_{ŷ_{1:l}} w · φ_{1:l+1}(x, ŷ ◦ k)    (17.43)
Algorithm 42 ArgmaxForSequences(x, w)
1: L ← len(x)
2: α_{0,k} ← 0 and ζ_{0,k} ← ∅, for all k    // initialization
3: for l = 0 . . . L − 1 do
4:   for k = 1 . . . K do
5:     α_{l+1,k} ← max_{k′} [α_{l,k′} + w · φ_{l+1}(x, ⟨. . . , k′, k⟩)]    // recursion
       // here, φ_{l+1}(. . . , k′, k, . . .) is the set of features associated with
       // output position l + 1 and two adjacent labels k′ and k at that position
6:     ζ_{l+1,k} ← the k′ that achieves the maximum above    // store backpointer
7:   end for
8: end for
At the end, we can take max_k α_{L,k} as the score of the best output sequence. To extract the final sequence, we know that the best label for the last word is argmax_k α_{L,k}. Let’s call this ŷ_L. Once we know that, the best previous label is ζ_{L,ŷ_L}. We can then follow a path through ζ back to the beginning. Putting this all together gives Algorithm 17.7.
The main benefit of Algorithm 17.7 is that it is guaranteed to exactly compute the argmax output for sequences required in the structured perceptron algorithm, efficiently. In particular, its runtime is O(LK²), which is an exponential improvement on the naive O(K^L) runtime if one were to enumerate every possible output sequence.
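The dynamic program can be sketched as follows. It abstracts the weighted features into an `edge_score(l, k_prev, k)` callback standing in for w · φ_{l+1}(x, ⟨. . . , k′, k⟩); at l = 0 the callback should ignore k_prev, since there is no previous label. The function name and interface are illustrative:

```python
def viterbi(L, K, edge_score):
    """Dynamic program of Algorithm 17.7. alpha[l][k] is the best score of
    any length-l prefix ending in label k; zeta[l][k] is the backpointer.
    Runtime is O(L K^2) instead of the naive O(K^L)."""
    alpha = [[0.0] * K for _ in range(L + 1)]   # alpha[0][k] = 0 for all k
    zeta = [[None] * K for _ in range(L + 1)]
    for l in range(L):
        for k in range(K):
            scores = [alpha[l][kp] + edge_score(l, kp, k) for kp in range(K)]
            best_kp = max(range(K), key=lambda kp: scores[kp])
            alpha[l + 1][k] = scores[best_kp]
            zeta[l + 1][k] = best_kp            # store backpointer
    # extract the best sequence by walking backpointers from the best last label
    k = max(range(K), key=lambda k: alpha[L][k])
    best_score, path = alpha[L][k], [k]
    for l in range(L, 1, -1):
        k = zeta[l][k]
        path.append(k)
    path.reverse()
    return best_score, path
```

For the loss-augmented search described later, the only change would be to add 1 to `edge_score(l, kp, k)` whenever k disagrees with the true label at position l + 1.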
The algorithm can be naturally extended to handle “higher order” Markov assumptions, where features depend on triples or quadruples of the output. The memoization becomes notationally cumbersome, but the algorithm remains essentially the same. In order to handle length-M Markov features, the resulting algorithm will take O(LK^M) time. In practice, it’s rare that M > 3 is necessary or useful.
In the case of loss-augmented search for structured SVMs (as opposed to structured perceptron), we need to include the scores coming from the loss augmentation in the dynamic program. The only thing that changes relative to the standard argmax solution (Algorithm 17.7, and derivation in Eq (17.48)) is that any time an incorrect label is used, the (loss-augmented) score increases by one. Recall that in the non-loss-augmented case, we have the α recursion as:

α_{l+1,k} = max_{k′} [α_{l,k′} + w · φ_{l+1}(x, ⟨. . . , k′, k⟩)]

In the loss-augmented case, the edge score simply picks up an extra 1[k ≠ y_{l+1}] term, as in Eq (17.30).