8 CRF
1 Log-linear models
Let x be an example, and let y be a possible label for it. A log-linear model
assumes that
$$p(y \mid x; w) = \frac{\exp \sum_j w_j F_j(x, y)}{Z(x, w)} \qquad (1)$$
Each expression Fj (x, y) is called a feature-function. In general, a feature-function
can be any real-valued function of both the data space X and the label space Y .
Formally, a feature-function is any mapping $F_j : X \times Y \to \mathbb{R}$.
Often, a feature-function is zero for all values of y except one particular value.
Given some attribute of x, we can have a different weight for this attribute and
each different label. The weights for these feature-functions can then capture the
affinity of this attribute-value for each label. Often, feature-functions are pres-
ence/absence indicators, so the value of the feature-function is either 0 or 1. If we
have a conventional attribute a(x) with k alternative values, and n classes, we can
make kn different features as defined above. With log-linear models, anything
and the kitchen sink can be a feature. We can have lots of classes, lots of features,
and we can pay attention to different features for different classes.
Feature-functions can overlap in arbitrary ways. For example, if x is a word,
different feature-functions can use attributes of x such as “starts with a capital
letter,” “starts with G,” “is Graham,” and “is six letters long.” Generally we can encode
suffixes, prefixes, facts from a lexicon, preceding/following punctuation, etc., as
features.
Mathematically, log-linear models are very simple: there is one real-valued
weight for each feature, no more and no fewer. There are several possible justifications
for the form of the expression (1). First, a linear combination $\sum_j w_j F_j(x, y)$
can take any positive or negative real value; the exponential makes it positive,
like a valid probability. Second, the division by Z(x, w) makes the results lie between 0 and 1,
i.e. makes them valid probabilities. Third, the ranking of the probabilities will
be the same as the ranking of the linear values.
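Expression (1) can be sketched in a few lines of Python. This is only an illustration: the label names, the three indicator feature-functions, and the weights are made up for the example and do not come from the text.

```python
import math

def loglinear_probs(x, labels, feature_fns, w):
    """Return {y: p(y|x; w)} for a log-linear model, as in expression (1)."""
    # Linear score sum_j w_j F_j(x, y) for every candidate label y.
    scores = {y: sum(wj * F(x, y) for wj, F in zip(w, feature_fns)) for y in labels}
    z = sum(math.exp(s) for s in scores.values())   # normalizer Z(x, w)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Hypothetical presence/absence feature-functions for labeling a single word.
feature_fns = [
    lambda x, y: 1.0 if x[0].isupper() and y == "NAME" else 0.0,
    lambda x, y: 1.0 if len(x) == 6 and y == "NAME" else 0.0,
    lambda x, y: 1.0 if x[0].islower() and y == "OTHER" else 0.0,
]
w = [1.5, 0.5, 1.0]   # one weight per feature-function (illustrative values)
probs = loglinear_probs("Graham", ["NAME", "OTHER"], feature_fns, w)
```

Here the linear score is 2.0 for NAME and 0.0 for OTHER, so NAME receives the larger probability, in line with the third justification above.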
A function of the form
$$b_k = \frac{\exp a_k}{\sum_{k'} \exp a_{k'}}$$
is called a softmax function because the exponentials enlarge the bigger $a_k$ values
compared to the smaller $a_k$ values. Other functions have the same property of
being similar to the maximum function, but differentiable. Softmax is widely
used now, perhaps because its derivative is especially simple; see Section 4 below.
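A minimal sketch of the softmax function follows. Subtracting the largest input before exponentiating is a standard numerical safeguard that is not discussed in the text; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(a):
    """Map real values a_k to b_k = exp(a_k) / sum_k' exp(a_k')."""
    m = max(a)                              # shift for numerical stability only
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

b = softmax([1.0, 2.0, 5.0])
big = softmax([1000.0, 1001.0])             # would overflow without the shift
```

The largest input dominates the output, but every output stays strictly positive, which is what makes the function a smooth stand-in for the maximum.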
etc. There is a fixed known set of these part-of-speech (POS) tags. Each sentence
is a separate training or test example. We will represent a sentence by feature-
functions based on its words. Feature-functions can be very varied:
• Some feature-functions can look just at one word, e.g. at its prefixes or
suffixes.
• Some features can also use the words one to the left, one to the right, two to
the left etc., up to the whole sentence.
Assume that each feature-function Fj is actually a sum along the sentence, for
i = 1 to i = n where n is the length of x̄:
$$F_j(\bar{x}, \bar{y}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \bar{x}, i).$$
This notation means that each low-level feature-function fj can depend on the
whole sentence, the current tag and the previous tag, and the current position i
within the sentence. A feature-function fj may depend on only a subset of these
four possible influences. Examples of features are “the current tag is NOUN and
the current word is capitalized,” “the word at the start of the sentence is Mr.” and
“the previous tag was SALUTATION.”
Summing each fj over all positions i means that we can have a fixed set of
feature-functions Fj for log-linear training even though the training examples are
not fixed-length.
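The summation over positions can be sketched as below. The code assumes a special START value for the tag before the first word and uses 0-based positions; the example low-level feature-function is hypothetical.

```python
def global_feature(f, x_bar, y_bar, start="START"):
    """F_j(x_bar, y_bar): sum of the low-level f_j over all positions of the sentence."""
    total = 0.0
    prev = start                      # assumed tag preceding the first position
    for i, tag in enumerate(y_bar):   # positions are 0-based in this sketch
        total += f(prev, tag, x_bar, i)
        prev = tag
    return total

# Hypothetical low-level feature: current tag is NOUN and current word is capitalized.
def f_noun_cap(y_prev, y_cur, x_bar, i):
    return 1.0 if y_cur == "NOUN" and x_bar[i][0].isupper() else 0.0

short = global_feature(f_noun_cap, ["Mr.", "Smith", "runs"], ["SALUTATION", "NOUN", "VERB"])
longer = global_feature(f_noun_cap, ["Alice", "met", "Bob", "today"], ["NOUN", "VERB", "NOUN", "ADVERB"])
```

The same function handles both sentences, which is the point of summing: one fixed feature-function applies to inputs of any length.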
Training a CRF means finding the weight vector w that gives the best possible
prediction
$$\bar{y}^* = \operatorname{argmax}_{\bar{y}} \; p(\bar{y} \mid \bar{x}; w) \qquad (2)$$
for each training example x̄. However, before we can talk about training there are
two major inference problems to solve. First, how can we do the argmax compu-
tation in Equation 2 efficiently, for any x̄ and any weights w? This computation is
difficult since the number of alternative tag sequences ȳ is exponential.
Second, given any x̄ and ȳ we want to evaluate
$$p(\bar{y} \mid \bar{x}; w) = \frac{1}{Z(\bar{x}, w)} \exp \sum_j w_j F_j(\bar{x}, \bar{y}).$$
The difficulty here is that the denominator again ranges over all tag sequences $\bar{y}'$:
$Z(\bar{x}, w) = \sum_{\bar{y}'} \exp \sum_j w_j F_j(\bar{x}, \bar{y}')$. For both these tasks, we will need tricks to
account for all possible ȳ efficiently, without enumerating all possible ȳ. The fact
that feature-functions can depend on at most two tags, which must be adjacent,
is what makes these tricks possible.
where $g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, \bar{x}, i)$. Note that the x̄ and i arguments of fj
have been dropped in the definition of gi . Each gi is a different function for each
i, and depends on w as well as on x̄ and i.
Remember that each entry of the ȳ vector is one of a finite set of tags. Given
x̄, w, and i the function gi can be represented as an m by m matrix where m is the
cardinality of the set of tags.
Let v range over the tags. Define U (k, v) to be the score of the best sequence
of tags from 1 to k, where tag k is required to be v. This is a maximization over
k − 1 tags because tag number k is fixed to have value v. Formally,
$$U(k, v) = \max_{y_1, \ldots, y_{k-1}} \Big[ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \Big].$$
Now we can write down a recurrence that lets us compute U (k, v) efficiently:
$$U(k, v) = \max_{u} \, [\, U(k-1, u) + g_k(u, v) \,].$$
With this recurrence we can compute the best ȳ for any x̄ in $O(m^2 n)$ time, where n is
the length of x̄ and m is the cardinality of the set of tags. This algorithm is
a variation of the Viterbi algorithm for computing the highest-probability path
through a hidden Markov model. The base case of the recurrence is an exercise
for the reader.
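The dynamic program can be sketched as follows. The base case used here, U(1, v) = g_1(START, v), is an assumption of this sketch, as are the function names and the randomly generated score table standing in for the g_i matrices.

```python
import random

def viterbi(n, tags, g, start="START"):
    """Return (best_score, best_tags) maximizing sum_{i=1..n} g(i, y_{i-1}, y_i)."""
    # Assumed base case: U(1, v) = g_1(START, v).
    U = {v: g(1, start, v) for v in tags}
    back = []                                   # one backpointer table per position k >= 2
    for k in range(2, n + 1):
        new_U, ptr = {}, {}
        for v in tags:
            # U(k, v) = max_u [U(k-1, u) + g_k(u, v)]
            best_u = max(tags, key=lambda u: U[u] + g(k, u, v))
            new_U[v] = U[best_u] + g(k, best_u, v)
            ptr[v] = best_u
        U = new_U
        back.append(ptr)
    last = max(tags, key=lambda v: U[v])
    path = [last]
    for ptr in reversed(back):                  # follow backpointers from position n to 1
        path.append(ptr[path[-1]])
    path.reverse()
    return U[last], path

# Toy scores: a random table plays the role of the m-by-m matrices g_i.
rng = random.Random(0)
tags = ["A", "B", "C"]
n = 5
table = {(i, u, v): rng.uniform(-1, 1)
         for i in range(1, n + 1) for u in tags + ["START"] for v in tags}
g = lambda i, u, v: table[(i, u, v)]
score, path = viterbi(n, tags, g)
```

Each of the n positions does m × m work, which matches the O(m²n) bound stated above.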
The second fundamental computational problem is to compute the denomina-
tor of the probability formula. This denominator is called the partition function:
$$Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_j w_j F_j(\bar{x}, \bar{y}).$$
Remember that
$$\sum_j w_j F_j(\bar{x}, \bar{y}) = \sum_i g_i(y_{i-1}, y_i),$$
where i ranges over all positions 1 to n of the input sequence x̄, so we can write
$$Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_i g_i(y_{i-1}, y_i) = \sum_{\bar{y}} \prod_i \exp g_i(y_{i-1}, y_i).$$
while M1 (u, v) is defined only for u = START and Mn+1 (u, v) is defined only for
v = STOP.
Consider multiplying M1 and M2 . We have
$$M_{12}(\mathrm{START}, w) = \sum_v M_1(\mathrm{START}, v) M_2(v, w) = \sum_v [\exp g_1(\mathrm{START}, v)][\exp g_2(v, w)].$$
Similarly,
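$$\begin{aligned}
M_{123}(\mathrm{START}, x) &= \sum_w M_{12}(\mathrm{START}, w) M_3(w, x) \\
&= \sum_w \Big[ \sum_v M_1(\mathrm{START}, v) M_2(v, w) \Big] M_3(w, x) \\
&= \sum_{v, w} M_1(\mathrm{START}, v) M_2(v, w) M_3(w, x)
\end{aligned}$$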
and so on. Consider the ⟨START, STOP⟩ entry of the entire product $M_{123 \ldots n+1}$. This
is
$$M_{123 \ldots n+1}(\mathrm{START}, \mathrm{STOP}) = T = \sum_{\bar{y}} M_1(\mathrm{START}, y_1) M_2(y_1, y_2) \cdots M_{n+1}(y_n, \mathrm{STOP}).$$
We have
$$\begin{aligned}
T &= \sum_{\bar{y}} \exp[g_1(\mathrm{START}, y_1)] \exp[g_2(y_1, y_2)] \cdots \exp[g_{n+1}(y_n, \mathrm{STOP})] \\
&= \sum_{\bar{y}} \prod_i \exp[g_i(y_{i-1}, y_i)]
\end{aligned}$$
The matrix multiplication method for computing the partition function is called
a forward-backward algorithm. A similar algorithm can be used to compute any
function of the form $\sum_{\bar{y}} h_i(y_{i-1}, y_i)$.
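The matrix-product computation can be sketched and checked against brute-force enumeration on a tiny example. The sketch assumes that each matrix entry is M_t(u, v) = exp g_t(u, v), as in the M1 and M2 product above; the function names and the toy score function g are made up for the illustration.

```python
import math
import itertools

def partition_function(n, tags, g, start="START", stop="STOP"):
    """Z as the (START, STOP) entry of M_1 M_2 ... M_{n+1}, with M_t(u, v) = exp g(t, u, v)."""
    # Row vector M_1(START, v).
    alpha = {v: math.exp(g(1, start, v)) for v in tags}
    for t in range(2, n + 1):
        # Multiply the row vector by M_t.
        alpha = {v: sum(alpha[u] * math.exp(g(t, u, v)) for u in tags) for v in tags}
    # Final multiplication by the STOP column of M_{n+1}.
    return sum(alpha[u] * math.exp(g(n + 1, u, stop)) for u in tags)

def partition_brute_force(n, tags, g, start="START", stop="STOP"):
    """Same quantity by enumerating all m^n tag sequences (feasible only for tiny n)."""
    total = 0.0
    for y in itertools.product(tags, repeat=n):
        seq = (start,) + y + (stop,)
        total += math.exp(sum(g(i, seq[i - 1], seq[i]) for i in range(1, n + 2)))
    return total

tags = ["A", "B", "C"]
def g(i, u, v):                       # arbitrary toy scores
    return 0.3 * i if u == v else -0.1 * len(v)

Z = partition_function(4, tags, g)
Z_check = partition_brute_force(4, tags, g)
```

Multiplying a row vector through the chain keeps the cost at O(m²n) rather than forming full matrix-matrix products. For long sequences the exponentials can overflow, so a practical version would likely work with logarithms.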
Some extensions to the basic linear-chain CRF are not difficult. The output ȳ
must be a sequence, but the input x̄ is treated as a unit, so it does not have to be
a sequence. It could be an image for example, or a collection of separate items,
e.g. telephone customers.
In general, what is fundamental for making a log-linear model tractable is that
the set of possible labels ȳ should either be small, or have some structure. In order
to have structure, ȳ should be made up of parts (e.g. tags) such that only small
subsets of parts interact directly with each other. Here, every interacting subset
of tags is a pair. Often, the real-world reason interacting subsets are small is that
interactions between parts are short-distance.
for each feature-function, so we use j to range over weights.) Start with
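$$\begin{aligned}
\frac{\partial}{\partial w_j} \log p(y \mid x; w)
&= F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w) \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \frac{\partial}{\partial w_j} \exp \sum_{j'} w_{j'} F_{j'}(x, y') \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \Big[ \exp \sum_{j'} w_{j'} F_{j'}(x, y') \Big] F_j(x, y') \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \frac{\exp \sum_{j'} w_{j'} F_{j'}(x, y')}{\sum_{y''} \exp \sum_{j''} w_{j''} F_{j''}(x, y'')} \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \, p(y' \mid x; w)
\end{aligned}$$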
This equality is true only for the whole training set, not for training examples
individually.
The left side above is the total value of feature-function j on the whole training
set. The right side is the total value of feature-function j predicted by the model.
For each feature-function, the trained model will spread out over all labels of all
examples as much mass as the training data has just on those examples for which
the feature-function is nonzero.
For any particular application of log-linear modeling, we have to write code to
evaluate numerically the symbolic derivatives. Then we can invoke an optimiza-
tion routine to find the optimal parameter values. There are two ways that we can
verify correctness. First, check for each feature-function Fj that
$$\sum_{\langle x, y \rangle \in T} F_j(x, y) = \sum_{\langle x, \cdot \rangle \in T} \sum_{y'} p(y' \mid x; w) F_j(x, y').$$
Second, check that each partial derivative is correct by comparing it numerically
to the value obtained by finite differencing of the CLL objective function.
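The second check can be sketched for a single training example, whose log conditional likelihood is one term of the CLL objective. The feature-functions, labels, and weights below are invented for the illustration, and central differences with step `eps` are one reasonable choice among several.

```python
import math

def log_p(x, y, labels, feature_fns, w):
    """log p(y|x; w) for a log-linear model."""
    scores = {c: sum(wj * F(x, c) for wj, F in zip(w, feature_fns)) for c in labels}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[y] - log_z

def grad_log_p(x, y, labels, feature_fns, w):
    """Analytic partial derivatives: F_j(x, y) - sum_y' p(y'|x; w) F_j(x, y')."""
    probs = {c: math.exp(log_p(x, c, labels, feature_fns, w)) for c in labels}
    return [F(x, y) - sum(probs[c] * F(x, c) for c in labels) for F in feature_fns]

def finite_diff_grad(x, y, labels, feature_fns, w, eps=1e-6):
    """Numerical partial derivatives by central differences."""
    grad = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + eps] + w[j + 1:]
        down = w[:j] + [w[j] - eps] + w[j + 1:]
        diff = log_p(x, y, labels, feature_fns, up) - log_p(x, y, labels, feature_fns, down)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical setup: x is a real number, and each feature fires for one label.
labels = ["pos", "neg", "neu"]
feature_fns = [
    lambda x, y: x if y == "pos" else 0.0,
    lambda x, y: 1.0 if y == "neg" else 0.0,
    lambda x, y: x * x if y == "neu" else 0.0,
]
w = [0.4, -0.2, 0.1]
analytic = grad_log_p(1.5, "pos", labels, feature_fns, w)
numeric = finite_diff_grad(1.5, "pos", labels, feature_fns, w)
```

If the two vectors disagree by much more than the differencing error, the derivative code is the first place to look for a bug.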
Suppose that every feature-function Fj is the product of an attribute value
aj (x) that is a function of x only, and a label function bj (y) that is a function
of y only, i.e. $F_j(x, y) = a_j(x) b_j(y)$. Then $\frac{\partial}{\partial w_j} \log p(y \mid x; w) = 0$ if $a_j(x) = 0$,
regardless of y. This implies that, with online gradient ascent on a given example x,
a weight needs to be updated only for those feature-functions for
which the corresponding attribute aj (x) is non-zero, which can be a great saving
of computational effort. In other words, the entire gradient with respect to a single
training example is typically a sparse vector, just as the vector of all Fj (x, y)
values is sparse for a single training example. A similar saving is possible when
computing the gradient with respect to the whole training set. Note that the gra-
dient with respect to the whole training set is a single vector that is the sum of
one vector for each training example. Typically these vectors being summed are
sparse, but their sum is not.
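When maximizing the conditional log-likelihood by online gradient ascent,
the update to weight wj is
$$w_j := w_j + \alpha \big( F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] \big) \qquad (3)$$
where α is a learning rate.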
The first term $F_j(\bar{x}, \bar{y})$ is fast to compute because x̄ and its training label ȳ are
fixed. Section 3 shows how to compute $Z(\bar{x}, w)$ efficiently. The remaining difficulty
is to compute $\sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \exp \sum_{j'} w_{j'} F_{j'}(\bar{x}, \bar{y}')$.
If the set of alternative labels {y} is large, then it is computationally expensive
to evaluate the expectation $E_{y' \sim p(y' \mid x; w)}[F_j(x, y')]$. We can find approximations
to this expectation by finding approximations to the distribution p(y|x; w). In
this section we discuss three different approximations. The corresponding training
methods are called the Collins perceptron, Gibbs sampling, and contrastive
divergence.
The Collins perceptron. Suppose we place all the probability mass on the
most likely y value, i.e. we use the approximation p̂(y|x; w) = I(y = ŷ) where
ŷ = argmaxy p(y|x; w) as before. Then the update rule (3) simplifies to the
following rule:
$$w_j := w_j + \alpha F_j(x, y)$$
$$w_j := w_j - \alpha F_j(x, \hat{y}).$$
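A sketch of one such update for a small label set is given below. The argmax is found by enumerating labels, which works only when there are few of them; for tag sequences it would presumably be computed with the Viterbi-style recurrence instead. The function name, the indicator features, and the tie-breaking behavior (the first label wins) are assumptions of this sketch.

```python
def perceptron_update(w, x, y, labels, feature_fns, alpha=1.0):
    """One Collins perceptron step; returns (updated weights, predicted label)."""
    def score(c):
        return sum(wj * F(x, c) for wj, F in zip(w, feature_fns))
    # The most probable label has the largest linear score, since Z(x, w) does not depend on y.
    y_hat = max(labels, key=score)
    # Add the features of the true label, subtract those of the prediction.
    new_w = [wj + alpha * F(x, y) - alpha * F(x, y_hat) for wj, F in zip(w, feature_fns)]
    return new_w, y_hat

# Hypothetical indicator features, one per label.
labels = ["A", "B"]
feature_fns = [
    lambda x, y: 1.0 if y == "A" else 0.0,
    lambda x, y: 1.0 if y == "B" else 0.0,
]
w1, first_guess = perceptron_update([0.0, 0.0], "example", "B", labels, feature_fns)
w2, second_guess = perceptron_update(w1, "example", "B", labels, feature_fns)
```

When the prediction is already correct, the two terms cancel and the weights do not change, so updates happen only on mistakes.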
can be evaluated numerically in an efficient way for every i. Then we can get a
stream of samples by the following process:
• draw $y_2'$ according to $p(y_2 \mid x, y_1', y_3, \ldots, y_n; w)$;
It can be proved that if Step (2) is repeated an infinite number of times, then
the distribution of $y = \{y_1', \ldots, y_n'\}$ converges to the true distribution p(y|x; w)
regardless of the starting point. In practice, we do Step (2) some number of times
(say 1000) to come close to convergence, and then take several samples
$y = \{y_1', \ldots, y_n'\}$. Between each sample we repeat Step (2) a smaller number of times
(say 100) to make the samples almost independent of each other.
Using Gibbs sampling to estimate the expectation $E_{y \sim p(y \mid x; w)}[F_j(x, y)]$ is
computationally intensive because the accuracy of the estimate increases only very
slowly as the number s of samples increases. Specifically, the variance decreases
in proportion to 1/s.
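As a worked illustration of the 1/s rate, assume for simplicity that the s samples $y^{(1)}, \ldots, y^{(s)}$ are independent draws from p(y|x; w) (successive Gibbs samples are only approximately so), and write $\sigma^2$ for the variance of $F_j(x, y)$ under that distribution. The symbols $\hat{E}_s$ and $\sigma^2$ are introduced here and are not used elsewhere in the text.

```latex
\hat{E}_s = \frac{1}{s} \sum_{t=1}^{s} F_j(x, y^{(t)}), \qquad
\operatorname{Var}\big[\hat{E}_s\big] = \frac{\sigma^2}{s}, \qquad
\operatorname{sd}\big[\hat{E}_s\big] = \frac{\sigma}{\sqrt{s}}
```

Under these assumptions, halving the standard error of the estimate takes about four times as many samples.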
Contrastive divergence. A third training option is to choose a single $y^*$ value
that is somehow similar to the training label y, but also has high probability according
to p(y|x; w). Compared to the “impostor” ŷ, the “evil twin” $y^*$ will have
lower probability, but will be more similar to y.
The idea of contrastive divergence is to obtain a single value $y^* = \langle y_1^*, \ldots, y_n^* \rangle$
by doing only a few iterations of Gibbs sampling (often only one), but starting at
the training label y instead of at a random guess.
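The idea can be sketched as follows, assuming that the weight update has the same form as the perceptron rule with y* in place of ŷ. The sampler is passed in as a callable `gibbs_sweep(x, y, w)`, a hypothetical interface; the stand-in used in the usage lines merely flips the last tag and is not a real Gibbs sweep.

```python
def contrastive_divergence_update(w, x, y, feature_fns, gibbs_sweep, alpha=0.1, sweeps=1):
    """One contrastive-divergence step; returns (updated weights, the 'evil twin' y*)."""
    y_star = list(y)                      # start at the training label, not a random guess
    for _ in range(sweeps):               # only a few sweeps, often just one
        y_star = gibbs_sweep(x, y_star, w)
    # Assumed update: w_j := w_j + alpha * (F_j(x, y) - F_j(x, y*)).
    new_w = [wj + alpha * (F(x, y) - F(x, y_star)) for wj, F in zip(w, feature_fns)]
    return new_w, y_star

# Stand-in sampler for illustration only: it deterministically flips the last tag.
def fake_sweep(x, y, w):
    return y[:-1] + ["VERB" if y[-1] == "NOUN" else "NOUN"]

count_nouns = lambda x, y: float(sum(tag == "NOUN" for tag in y))
new_w, y_star = contrastive_divergence_update(
    [0.0], ["dogs", "bark"], ["NOUN", "VERB"], [count_nouns], fake_sweep, alpha=0.5)
```

Because y* starts at the training label and moves only a little, it usually differs from y in a few positions, so most feature differences in the update are zero.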
How to do Gibbs sampling. Gibbs sampling relies on drawing samples
efficiently from conditional distributions. Let $y_{-i}$ be an abbreviation for the set
$\{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. We need to draw values according to the distribution
$p(y_i \mid x, y_{-i}; w)$. The straightforward way to do this is to evaluate $p(v \mid x, y_{-i}; w)$
numerically for each possible value v of yi . In typical applications the number of
alternative values v is small, so this approach is feasible, provided that $p(v \mid x, y_{-i}; w)$ can be
computed.
Suppose the entire conditional distribution is a Markov random field
$$p(y \mid x; w) \propto \prod_{m=1}^{M} \phi_m(y_m \mid x; w) \qquad (4)$$
tion (4). In this case
$$p(y_i \mid x, y_{-i}; w) \propto \prod_{m \in C} \phi_m(y_m \mid x; w) \qquad (5)$$
where C indexes those potential functions whose argument $y_m$ includes the part $y_i$. To compute
$p(y_i \mid x, y_{-i}; w)$ we evaluate the product (5) for all values of $y_i$, with the given fixed
values of $y_{-i} = \{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. We then normalize using
$$Z(x, y_{-i}; w) = \sum_v \prod_{m \in C} \phi_m(y_m \mid x; w)$$
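For a linear chain this computation can be sketched concretely. The sketch assumes pairwise potentials of the form exp g(i, y_{i-1}, y_i), so that only two potentials involve y_i; it uses 0-based positions, a START value before the first tag, and no STOP potential. The function names and the toy score function are hypothetical.

```python
import math
import random

def gibbs_conditional(i, y, tags, g, start="START"):
    """Return [p(y_i = v | x, y_{-i}; w) for v in tags] for a linear chain."""
    n = len(y)
    prev = y[i - 1] if i > 0 else start
    weights = []
    for v in tags:
        s = g(i, prev, v)                     # potential linking y_{i-1} and y_i
        if i + 1 < n:
            s += g(i + 1, v, y[i + 1])        # potential linking y_i and y_{i+1}
        weights.append(math.exp(s))           # product (5), evaluated for y_i = v
    z = sum(weights)                          # normalizer: sum over all values v
    return [wt / z for wt in weights]

def gibbs_sweep(y, tags, g, rng):
    """Resample every position once, left to right, and return the new sequence."""
    y = list(y)
    for i in range(len(y)):
        probs = gibbs_conditional(i, y, tags, g)
        y[i] = rng.choices(tags, weights=probs)[0]
    return y

tags = ["A", "B"]
def g(i, u, v):                               # arbitrary toy scores
    return 0.5 * i if u == v else -0.3

y = ["A", "B", "A"]
cond = gibbs_conditional(1, y, tags, g)
sample = gibbs_sweep(y, tags, g, random.Random(0))
```

Potentials that do not involve y_i cancel in the normalization, which is why each draw costs time proportional to the number of tags rather than to the length of the sequence.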