
Log-Linear Models and Conditional Random Fields


Charles Elkan
[email protected]
December 6, 2007

This document describes log-linear models, which are a far-reaching extension
of logistic regression, and conditional random fields (CRFs), which are a special
case of log-linear models.
Section 1 explains what a log-linear model is, and introduces feature functions.
Section 2 then presents linear-chain CRFs as an example of log-linear models,
and Section 3 explains the special algorithms that make inference tractable for
these CRFs. Section 4 gives a general derivation of the gradient of a log-linear
model; this is the foundation of all log-linear training algorithms. Finally,
Section 5 presents two special CRF training algorithms, one that is a variant of
the perceptron method and another one called contrastive divergence.

1 Log-linear models
Let x be an example, and let y be a possible label for it. A log-linear model
assumes that
$$p(y|x; w) = \frac{\exp \sum_j w_j F_j(x, y)}{Z(x, w)} \qquad (1)$$
where the partition function $Z(x, w) = \sum_{y'} \exp \sum_j w_j F_j(x, y')$.
Therefore, given x, the label predicted by the model is
$$\hat{y} = \operatorname{argmax}_y p(y|x; w) = \operatorname{argmax}_y \sum_j w_j F_j(x, y).$$
Each expression Fj(x, y) is called a feature-function. In general, a feature-function
can be any real-valued function of both the data space X and the label space Y.
Formally, a feature-function is any mapping $F_j : X \times Y \to \mathbb{R}$.
Often, a feature-function is zero for all values of y except one particular value.
Given some attribute of x, we can have a different weight for this attribute paired
with each different label. The weights of these feature-functions can then capture
the affinity of this attribute-value for each label. Often, feature-functions are
presence/absence indicators, so the value of the feature-function is either 0 or 1.
If we have a conventional attribute a(x) with k alternative values, and n classes,
we can make kn different features as defined above. With log-linear models, anything
and the kitchen sink can be a feature. We can have lots of classes, lots of features,
and we can pay attention to different features for different classes.
Feature-functions can overlap in arbitrary ways. For example, if x is a word,
different feature-functions can use attributes of x such as "starts with a capital
letter," "starts with G," "is Graham," and "is six letters long." Generally we can
encode suffixes, prefixes, facts from a lexicon, preceding/following punctuation,
etc., as features.
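To make this concrete, here is a minimal Python sketch of indicator feature-functions built from attribute-label pairs; the attributes, labels, and function names are invented for illustration, not taken from these notes.

```python
def attributes(x):
    """Binary attributes of a word x (invented for illustration)."""
    return {
        "starts_capital": x[0].isupper(),
        "starts_G": x.startswith("G"),
        "is_Graham": x == "Graham",
        "six_letters": len(x) == 6,
    }

def feature_vector(x, y, labels):
    """One indicator feature-function per (attribute, label) pair:
    the feature for (a, lab) is 1 when attribute a holds for x and y == lab."""
    return {(a, lab): float(on and y == lab)
            for a, on in attributes(x).items()
            for lab in labels}

labels = ["PERSON", "OTHER"]
print(feature_vector("Graham", "PERSON", labels))
```

With k attribute values and n labels this yields the kn indicator features described above.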
Mathematically, log-linear models are very simple: there is one real-valued
weight for each feature, no more and no fewer. There are several possible
justifications for the form of the expression (1). First, a linear combination
$\sum_j w_j F_j(x, y)$ can take any positive or negative real value; the
exponential makes it positive, like a valid probability. Second, the division
makes the results lie between 0 and 1, i.e. makes them valid probabilities.
Third, the ranking of the probabilities is the same as the ranking of the
linear scores.
A function of the form
$$b_k = \frac{\exp a_k}{\sum_{k'} \exp a_{k'}}$$
is called a softmax function because the exponentials enlarge the bigger ak values
compared to the smaller ak values. Other functions have the same property of
being similar to the maximum function, but differentiable. Softmax is widely
used now, perhaps because its derivative is especially simple; see Section 4 below.
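The following small Python sketch illustrates these properties of softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something discussed above.

```python
import math

def softmax(a):
    """Softmax of a list of scores a_k. Subtracting max(a) avoids
    overflow and does not change the result."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 1.0, -1.0]
probs = softmax(scores)
print(probs, sum(probs))      # valid probabilities, summing to 1 (up to rounding)
assert probs[0] > probs[1] > probs[2]   # ranking of scores is preserved
```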

2 Conditional random fields


A conditional random field (CRF) is an important special case of a log-linear
model. First, consider an example of a learning task for which a CRF is useful.
Given a sentence, the task is to tag each word as noun, verb, adjective, preposition,
etc. There is a fixed known set of these part-of-speech (POS) tags. Each sentence
is a separate training or test example. We will represent a sentence by feature-
functions based on its words. Feature-functions can be very varied:

• Some feature-functions can be position-specific, e.g. specific to the beginning
or to the end of a sentence, while others can be sums over all positions in a
sentence.

• Some feature-functions can look just at one word, e.g. at its prefixes or
suffixes.

• Some features can also use the words one to the left, one to the right, two to
the left etc., up to the whole sentence.

The highest-accuracy POS taggers currently use over 100,000 feature-functions.
An important restriction (that will be explained and justified below) is that each
feature-function can depend on only one tag, or on two neighboring tags.
POS tagging is an example of what is called a structured prediction task. The
goal is to predict a complex label (a sequence of POS tags) for a complex input (an
entire sentence). This task is difficult, and significantly different from a standard
classifier learning task. There are at least three important sources of difficulty.
First, too much information would be lost by learning just a per-word classifier.
Influences between neighboring tags must be taken into account. Second, different
sentences have different lengths, so it is not obvious how to represent all sentences
by vectors of the same fixed length. Third, the set of all possible sequences of tags
constitutes an exponentially large set of labels.
A linear-chain conditional random field is a way to apply a log-linear model to this
type of task. Use the bar notation for sequences, so x̄ means a sequence of variable
length. Specifically, let x̄ be a sequence of n words and let ȳ be a corresponding
sequence of n tags. Define the log-linear model
$$p(\bar{y}|\bar{x}; w) = \frac{1}{Z(\bar{x}, w)} \exp \sum_j w_j F_j(\bar{x}, \bar{y}).$$
Assume that each feature-function Fj is actually a sum along the sentence, for
i = 1 to i = n where n is the length of x̄:
$$F_j(\bar{x}, \bar{y}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \bar{x}, i).$$
This notation means that each low-level feature-function fj can depend on the
whole sentence, the current tag and the previous tag, and the current position i
within the sentence. A feature-function fj may depend on only a subset of these
four possible influences. Examples of features are “the current tag is NOUN and
the current word is capitalized,” “the word at the start of the sentence is Mr.” and
“the previous tag was SALUTATION.”
Summing each fj over all positions i means that we can have a fixed set of
feature-functions Fj for log-linear training even though the training examples are
not fixed-length.
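The following hedged Python sketch shows how summing low-level feature-functions over positions yields fixed feature-functions Fj for variable-length input; the two low-level features and all names are illustrative inventions.

```python
def f_cap_noun(y_prev, y_curr, xbar, i):
    """1 if the current tag is NOUN and the current word is capitalized."""
    return float(y_curr == "NOUN" and xbar[i][0].isupper())

def f_det_noun(y_prev, y_curr, xbar, i):
    """1 if the previous tag is DET and the current tag is NOUN."""
    return float(y_prev == "DET" and y_curr == "NOUN")

def F(fj, xbar, ybar, start="START"):
    """F_j(xbar, ybar) = sum over positions i of f_j(y_{i-1}, y_i, xbar, i).
    Positions are 0-indexed here; the tag before position 0 is START."""
    tags = [start] + list(ybar)
    return sum(fj(tags[i], tags[i + 1], xbar, i) for i in range(len(xbar)))

xbar = ["The", "Dog", "barks"]
ybar = ["DET", "NOUN", "VERB"]
print(F(f_cap_noun, xbar, ybar), F(f_det_noun, xbar, ybar))   # 1.0 1.0
```

Both sentences of length 3 and length 30 produce the same fixed-length vector of F values, which is the point of the construction.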
Training a CRF means finding the weight vector w that gives the best possible
prediction
$$\bar{y}^* = \operatorname{argmax}_{\bar{y}} p(\bar{y}|\bar{x}; w) \qquad (2)$$
for each training example x̄. However, before we can talk about training there are
two major inference problems to solve. First, how can we do the argmax computation
in Equation 2 efficiently, for any x̄ and any weights w? This computation is
difficult since the number of alternative tag sequences ȳ is exponential.
Second, given any x̄ and ȳ we want to evaluate
$$p(\bar{y}|\bar{x}; w) = \frac{1}{Z(\bar{x}, w)} \exp \sum_j w_j F_j(\bar{x}, \bar{y}).$$
The difficulty here is that the denominator again ranges over all tag sequences ȳ:
$Z(\bar{x}, w) = \sum_{\bar{y}'} \exp \sum_j w_j F_j(\bar{x}, \bar{y}')$. For both
these tasks, we will need tricks to account for all possible ȳ efficiently, without
enumerating all possible ȳ. The fact that feature-functions can depend on at most
two tags, which must be adjacent, is what makes these tricks possible.

3 Inference algorithms for linear-chain CRFs


Let's solve the first problem above efficiently. First note that we can ignore the
denominator, and also the exponential inside the numerator. We want to compute
$$\bar{y}^* = \operatorname{argmax}_{\bar{y}} p(\bar{y}|\bar{x}; w) = \operatorname{argmax}_{\bar{y}} \sum_j w_j F_j(\bar{x}, \bar{y}).$$
Use the definition of Fj to get
$$\bar{y}^* = \operatorname{argmax}_{\bar{y}} \sum_j w_j \sum_i f_j(y_{i-1}, y_i, \bar{x}, i) = \operatorname{argmax}_{\bar{y}} \sum_i g_i(y_{i-1}, y_i)$$
where $g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, \bar{x}, i)$. Note that the
x̄ and i arguments of fj have been dropped in the definition of gi. Each gi is a
different function for each i, and depends on w as well as on x̄ and i.
Remember that each entry of the ȳ vector is one of a finite set of tags. Given
x̄, w, and i the function gi can be represented as an m by m matrix where m is the
cardinality of the set of tags.
Let v range over the tags. Define U(k, v) to be the score of the best sequence
of tags from positions 1 to k, where tag number k is required to be v. This is a
maximization over k − 1 tags because tag number k is fixed to have value v. Formally,
$$U(k, v) = \max_{y_1, \ldots, y_{k-1}} \Big[ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \Big].$$

Now we can write down a recurrence that lets us compute U(k, v) efficiently:
$$U(k, v) = \max_{y_{k-1}} \big[ U(k-1, y_{k-1}) + g_k(y_{k-1}, v) \big].$$
With this recurrence we can compute $\bar{y}^*$ for any x̄ in $O(m^2 n)$ time,
where n is the length of x̄ and m is the cardinality of the set of tags. This
algorithm is a variation of the Viterbi algorithm for computing the highest-probability
path through a hidden Markov model. The base case of the recurrence is an exercise
for the reader.
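Here is a minimal Python sketch of this recurrence (an illustrative implementation, not code from these notes). It adopts one natural base case, U(1, v) = g1(START, v), and uses invented g scores for the demonstration; the running time is O(m²n) as claimed.

```python
def viterbi(tags, n, g, start="START"):
    """Compute argmax over tag sequences of sum_i g_i(y_{i-1}, y_i).
    g[k][(u, v)] holds g_k(u, v) for positions k = 1..n; y_0 is START."""
    # One natural base case: U(1, v) = g_1(START, v).
    U = {v: g[1][(start, v)] for v in tags}
    back = []
    for k in range(2, n + 1):
        U_new, ptr = {}, {}
        for v in tags:
            # Recurrence: U(k, v) = max_u [ U(k-1, u) + g_k(u, v) ].
            best_u = max(tags, key=lambda u: U[u] + g[k][(u, v)])
            U_new[v], ptr[v] = U[best_u] + g[k][(best_u, v)], best_u
        U = U_new
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    y = [max(tags, key=lambda v: U[v])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return list(reversed(y))

tags = ["N", "V"]
g = {1: {("START", "N"): 1.0, ("START", "V"): 0.0},
     2: {("N", "N"): 0.0, ("N", "V"): 2.0, ("V", "N"): 0.0, ("V", "V"): 0.5}}
print(viterbi(tags, 2, g))   # -> ['N', 'V'], with score 3.0
```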
The second fundamental computational problem is to compute the denominator
of the probability formula. This denominator is called the partition function:
$$Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_j w_j F_j(\bar{x}, \bar{y}).$$
Remember that
$$\sum_j w_j F_j(\bar{x}, \bar{y}) = \sum_i g_i(y_{i-1}, y_i),$$
where i ranges over all positions 1 to n of the input sequence x̄, so we can write
$$Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_i g_i(y_{i-1}, y_i) = \sum_{\bar{y}} \prod_i \exp g_i(y_{i-1}, y_i).$$

We can compute the expression above efficiently by matrix multiplication. For
t = 1 to t = n + 1 let $M_t$ be a square m by m matrix such that
$M_t(u, v) = \exp g_t(u, v)$ for any two tag values u and v. Note that $M_2$ to
$M_n$ are fully defined, while $M_1(u, v)$ is defined only for u = START and
$M_{n+1}(u, v)$ is defined only for v = STOP.
Consider multiplying $M_1$ and $M_2$. We have¹
$$M_{12}(\text{START}, w) = \sum_v M_1(\text{START}, v) M_2(v, w) = \sum_v [\exp g_1(\text{START}, v)][\exp g_2(v, w)].$$
Similarly,
\begin{align*}
M_{123}(\text{START}, x) &= \sum_w M_{12}(\text{START}, w) M_3(w, x) \\
&= \sum_w \Big[ \sum_v M_1(\text{START}, v) M_2(v, w) \Big] M_3(w, x) \\
&= \sum_{v, w} M_1(\text{START}, v) M_2(v, w) M_3(w, x)
\end{align*}
and so on. Consider the ⟨START, STOP⟩ entry of the entire product $M_{123 \cdots n+1}$.
This is
$$M_{123 \cdots n+1}(\text{START}, \text{STOP}) = T = \sum_{y_1, \ldots, y_n} M_1(\text{START}, y_1) M_2(y_1, y_2) \cdots M_{n+1}(y_n, \text{STOP}).$$
We have
$$T = \sum_{\bar{y}} \exp[g_1(\text{START}, y_1)] \exp[g_2(y_1, y_2)] \cdots \exp[g_{n+1}(y_n, \text{STOP})] = \sum_{\bar{y}} \prod_i \exp[g_i(y_{i-1}, y_i)],$$
which is exactly what we need.


Computational complexity: Each matrix is m by m where m is the cardinality
of the tag set. Each matrix multiplication requires $O(m^3)$ time, so the total
time is $O(nm^3)$. We have reduced a sum over an exponential number of alternatives
to a polynomial-time computation. However, even though polynomial, this is
worse than the time needed by the Viterbi algorithm. An interesting question is
whether computing the partition function is harder in some fundamental way than
computing the most likely label sequence.
¹Note on notation: u, v, w, and x here are all single tags; w is not a weight and x is not a component of x̄.
The matrix multiplication method for computing the partition function is called
a forward-backward algorithm. A similar algorithm can be used to compute any
function of the form $\sum_{\bar{y}} \prod_i h_i(y_{i-1}, y_i)$.
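The following hedged Python sketch computes Z(x̄, w) by multiplying the Mt matrices left to right. Carrying along only the START row turns each step into an O(m²) vector-matrix product (the forward pass), a standard refinement of the full O(m³) matrix products described above; the tags and scores are invented, and a brute-force sum checks the result.

```python
import math

def partition_function(matrices, start="START", stop="STOP"):
    """matrices[t-1][u][v] holds M_t(u, v) = exp g_t(u, v). The first matrix
    has only the START row and the last only the STOP column. Returns the
    (START, STOP) entry of the full product, i.e. Z(xbar, w)."""
    alpha = dict(matrices[0][start])              # alpha[v] = M_1(START, v)
    for M in matrices[1:]:
        cols = M[next(iter(M))].keys()            # column labels of M
        alpha = {v: sum(alpha[u] * M[u][v] for u in alpha) for v in cols}
    return alpha[stop]

# Toy check with n = 2 and tags {N, V}; the g scores are invented.
g1 = {"N": 1.0, "V": 0.5}                         # g_1(START, v)
g2 = {("N", "N"): 0.2, ("N", "V"): 1.5, ("V", "N"): 0.0, ("V", "V"): 0.3}
M = [{"START": {v: math.exp(s) for v, s in g1.items()}},
     {u: {v: math.exp(g2[(u, v)]) for v in "NV"} for u in "NV"},
     {u: {"STOP": 1.0} for u in "NV"}]            # take g_{n+1} = 0
brute = sum(math.exp(g1[a] + g2[(a, b)]) for a in "NV" for b in "NV")
print(partition_function(M), brute)               # the two values agree
```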
Some extensions to the basic linear-chain CRF are not difficult. The output ȳ
must be a sequence, but the input x̄ is treated as a unit, so it does not have to be
a sequence. It could be an image for example, or a collection of separate items,
e.g. telephone customers.
In general, what is fundamental for making a log-linear model tractable is that
the set of possible labels ȳ should either be small, or have some structure. In order
to have structure, ȳ should be made up of parts (e.g. tags) such that only small
subsets of parts interact directly with each other. Here, every interacting subset
of tags is a pair. Often, the real-world reason interacting subsets are small is that
interactions between parts are short-distance.

4 Training by gradient ascent


The learning task for a log-linear model is to choose values for the weights (also
called parameters). Given a set of training examples, our goal is to choose the
parameter values wj that maximize the conditional probability of the training
examples.
To learn values for the parameters wj by gradient-following we need to be
able to evaluate the objective function and its gradient. The standard objective
function is the conditional log-likelihood (CLL). We want to maximize CLL, so
we do gradient ascent as opposed to descent.
The objective function used for training is not the same one that we really want
to maximize on test data. Instead of maximizing CLL we could maximize (all on
training data): yes/no accuracy of the entire predicted ȳ, or pointwise conditional
log likelihood, or we could minimize mean-squared error if tags are numerical, or
some other measure of distance between true and predicted tags.
A fundamental issue is whether we want to maximize a pointwise objective.
For a long sequence, we may have a vanishing chance of predicting the entire
tag sequence correctly. The single sequence with highest probability may be very
different from the most probable tag at each position.
For online gradient ascent (also called stochastic gradient ascent) we update
parameters based on single training examples. Therefore, we evaluate the partial
derivative of the CLL for a single training example, for each wj. (There is one
weight for each feature-function, so we use j to range over weights.) Start with
\begin{align*}
\frac{\partial}{\partial w_j} \log p(y|x; w) &= F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w) \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \frac{\partial}{\partial w_j} \exp \sum_{j'} w_{j'} F_{j'}(x, y') \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \Big[ \exp \sum_{j'} w_{j'} F_{j'}(x, y') \Big] F_j(x, y') \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \frac{\exp \sum_{j'} w_{j'} F_{j'}(x, y')}{\sum_{y''} \exp \sum_{j''} w_{j''} F_{j''}(x, y'')} \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \, p(y'|x; w) \\
&= F_j(x, y) - E_{y' \sim p(y'|x;w)}[F_j(x, y')].
\end{align*}


In words, the partial derivative with respect to weight number j is the value of
feature-function j for the true training label y, minus the expected value of
feature-function j over all possible labels y′ under the model. Note that this
derivation allows feature-functions to be real-valued, not just zero or one.
The gradient of the CLL given the entire training set T is the sum of the gradients
for each training example. At the global maximum this entire gradient is
zero, so we have
$$\sum_{\langle x, y \rangle \in T} F_j(x, y) = \sum_{\langle x, \cdot \rangle \in T} E_{y \sim p(y|x;w)}[F_j(x, y)].$$

This equality is true only for the whole training set, not for training examples
individually.
The left side above is the total value of feature-function j on the whole training
set. The right side is the total value of feature-function j predicted by the model.
For each feature-function, the trained model will spread out over all labels of all
examples as much mass as the training data has just on those examples for which
the feature-function is nonzero.
For any particular application of log-linear modeling, we have to write code to
evaluate numerically the symbolic derivatives. Then we can invoke an optimization
routine to find the optimal parameter values. There are two ways that we can
verify correctness. First, check for each feature-function Fj that
$$\sum_{\langle x, y \rangle \in T} F_j(x, y) = \sum_{\langle x, \cdot \rangle \in T} \sum_{y'} p(y'|x; w) F_j(x, y').$$
Second, check that each partial derivative is correct by comparing it numerically
to the value obtained by finite differencing of the CLL objective function.
Suppose that every feature-function Fj is the product of an attribute value aj(x)
that is a function of x only, and a label function bj(y) that is a function of y
only, i.e. $F_j(x, y) = a_j(x) b_j(y)$. Then
$\frac{\partial}{\partial w_j} \log p(y|x; w) = 0$ whenever $a_j(x) = 0$,
regardless of y. This implies that, given example x, online gradient ascent must
update the weight wj only for those feature-functions whose attribute aj(x) is
non-zero, which can be a great saving of computational effort. In other words,
the entire gradient with respect to a single training example is typically a
sparse vector, just like the vector of all Fj(x, y) values is sparse for a single
training example. A similar saving is possible when computing the gradient with
respect to the whole training set. Note that the gradient with respect to the
whole training set is a single vector that is the sum of one vector for each
training example. Typically the vectors being summed are sparse, but their sum
is not.
When maximizing the conditional log-likelihood by online gradient ascent,
the update to weight wj is

$$w_j := w_j + \alpha \big( F_j(x, y) - E_{y' \sim p(y'|x;w)}[F_j(x, y')] \big) \qquad (3)$$

where α is a learning rate parameter.
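As a sanity check of update (3), here is a hedged Python sketch for the simple case where the label set is small enough to enumerate, so the expectation is computed exactly (for CRFs it would instead come from the algorithms of Section 3). All names and the toy feature are illustrative; notice that the update touches only weights whose feature values are non-zero, matching the sparsity discussion above.

```python
import math

def online_update(w, F, x, y, labels, alpha=0.1):
    """One step of update (3); feature vectors are sparse dicts j -> F_j(x, y)."""
    score = lambda yp: sum(w.get(j, 0.0) * v for j, v in F(x, yp).items())
    scores = {yp: score(yp) for yp in labels}
    m = max(scores.values())                       # stabilize the exponentials
    Z = sum(math.exp(s - m) for s in scores.values())
    p = {yp: math.exp(scores[yp] - m) / Z for yp in labels}
    grad = dict(F(x, y))                           # F_j(x, y) ...
    for yp in labels:                              # ... minus E[F_j(x, y')]
        for j, v in F(x, yp).items():
            grad[j] = grad.get(j, 0.0) - p[yp] * v
    for j, gj in grad.items():                     # gradient ascent step
        w[j] = w.get(j, 0.0) + alpha * gj

# Toy usage with one invented indicator feature per label.
F = lambda x, y: {("len>3", y): float(len(x) > 3)}
w = {}
online_update(w, F, "Graham", "PERSON", ["PERSON", "OTHER"])
print(w)   # weight for the true label rises, for the other label falls
```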

5 Efficient CRF training


The partial derivative for stochastic gradient training of a CRF model is
\begin{align*}
\frac{\partial}{\partial w_j} \log p(\bar{y}|\bar{x}; w) &= F_j(\bar{x}, \bar{y}) - \sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \, p(\bar{y}'|\bar{x}; w) \\
&= F_j(\bar{x}, \bar{y}) - \sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \, \frac{\exp \sum_{j'} w_{j'} F_{j'}(\bar{x}, \bar{y}')}{Z(\bar{x}, w)}.
\end{align*}
The first term $F_j(\bar{x}, \bar{y})$ is fast to compute because x̄ and its training
label ȳ are fixed. Section 3 shows how to compute $Z(\bar{x}, w)$ efficiently. The
remaining difficulty is to compute
$\sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \exp \sum_{j'} w_{j'} F_{j'}(\bar{x}, \bar{y}')$.
If the set of alternative labels {y} is large, then it is computationally expensive
to evaluate the expectation $E_{y' \sim p(y'|x;w)}[F_j(x, y')]$. We can find
approximations to this expectation by finding approximations to the distribution
p(y|x; w). In this section we discuss three different approximations. The
corresponding training methods are called the Collins perceptron, Gibbs sampling,
and contrastive divergence.
The Collins perceptron. Suppose we place all the probability mass on the
most likely y value, i.e. we use the approximation $\hat{p}(y|x; w) = I(y = \hat{y})$
where $\hat{y} = \operatorname{argmax}_y p(y|x; w)$ as before. Then the update rule
(3) simplifies to the following pair of updates:
$$w_j := w_j + \alpha F_j(x, y)$$
$$w_j := w_j - \alpha F_j(x, \hat{y}).$$

Given a training example x, the label ŷ can be thought of as an "impostor" compared
to the genuine label y. The concept to be learned is those vectors of feature-function
values ⟨F1(x, y), . . .⟩ that correspond to correct ⟨x, y⟩ pairs. The vector
⟨F1(x, y), . . .⟩, where ⟨x, y⟩ is a training example, is a positive example of this
concept. The vector ⟨F1(x, ŷ), . . .⟩ is a negative example of the same concept.
Hence, the two updates above are perceptron updates: the first for a positive
example and the second for a negative example.
The perceptron method causes a net increase in wj for features Fj whose value
is higher for y than for ŷ. It thus modifies the weights to directly increase the
probability of y compared to the probability of ŷ.
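A minimal Python sketch of the two perceptron updates, again assuming an enumerable label set so that ŷ can be found by direct search; for linear-chain CRFs ŷ would instead come from the Viterbi recurrence of Section 3. Names are illustrative.

```python
def perceptron_update(w, F, x, y, labels, alpha=1.0):
    """One Collins perceptron step on sparse feature dicts j -> F_j(x, y)."""
    score = lambda yp: sum(w.get(j, 0.0) * v for j, v in F(x, yp).items())
    y_hat = max(labels, key=score)                # the "impostor"
    if y_hat == y:
        return                                    # the two updates cancel exactly
    for j, v in F(x, y).items():                  # positive example: move toward y
        w[j] = w.get(j, 0.0) + alpha * v
    for j, v in F(x, y_hat).items():              # negative example: move away from ŷ
        w[j] = w.get(j, 0.0) - alpha * v
```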
Gibbs sampling. Computing the most likely label ŷ does not require computing
the partition function Z(x, w). Nevertheless, sometimes identifying ŷ is still
too difficult. In this case one option for training is to estimate
$E_{y \sim p(y|x;w)}[F_j(x, y)]$ approximately by sampling y values from the
distribution p(y|x; w).
A method known as Gibbs sampling can be used to find the needed samples of
y. Gibbs sampling is the following algorithm. Suppose the entire label y can be
written as a set of parts y = {y1 , . . . , yn }. For example, if y is the part-of-speech
sequence that is the label of an input sentence x, then each yi can be the tag of one
word in the sentence. Suppose the marginal distribution

$$p(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n; w)$$

can be evaluated numerically in an efficient way for every i. Then we can get a
stream of samples by the following process:

(1) Select an arbitrary initial guess $\langle y_1, \ldots, y_n \rangle$.

(2) Draw $y_1'$ according to $p(y_1 \mid x, y_2, \ldots, y_n; w)$;

• draw $y_2'$ according to $p(y_2 \mid x, y_1', y_3, \ldots, y_n; w)$;

• draw $y_3'$ according to $p(y_3 \mid x, y_1', y_2', y_4, \ldots, y_n; w)$;

• and so on until $y_n'$.

(3) Set $\{y_1, \ldots, y_n\} := \{y_1', \ldots, y_n'\}$ and repeat from (2).

It can be proved that if Step (2) is repeated an infinite number of times, then
the distribution of $y = \{y_1', \ldots, y_n'\}$ converges to the true distribution
p(y|x; w) regardless of the starting point. In practice, we do Step (2) some number
of times (say 1000) to come close to convergence, and then take several samples
$y = \{y_1', \ldots, y_n'\}$. Between each sample we repeat Step (2) a smaller
number of times (say 100) to make the samples almost independent of each other.
Using Gibbs sampling to estimate the expectation $E_{y \sim p(y|x;w)}[F_j(x, y)]$
is computationally intensive because the accuracy of the estimate increases only
slowly as the number s of samples increases. Specifically, the variance of the
estimate decreases in proportion to 1/s.
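The procedure above translates directly into code. Below is a hedged Python sketch: draw_part is an assumed helper that samples one part from its marginal distribution (one linear-chain implementation is sketched at the end of this section), and the coin-flip usage at the end is a purely illustrative check.

```python
import random

def gibbs_estimate(Fj, x, y0, draw_part, burn_in=1000, gap=100, samples=10):
    """Estimate E_{y ~ p(y|x;w)}[F_j(x, y)] by Gibbs sampling.
    draw_part(x, y, i) must return a sample from p(y_i | x, y_-i; w)."""
    def sweep(y):                                  # step (2): resample each part
        for i in range(len(y)):
            y[i] = draw_part(x, y, i)
    y = list(y0)                                   # step (1): arbitrary initial guess
    for _ in range(burn_in):                       # come close to convergence
        sweep(y)
    total = 0.0
    for _ in range(samples):
        for _ in range(gap):                       # keep samples nearly independent
            sweep(y)
        total += Fj(x, y)
    return total / samples

# Dummy usage: parts are independent fair coins, so E[sum(y)] is n/2 = 2.
coin = lambda x, y, i: random.randint(0, 1)
print(gibbs_estimate(lambda x, y: float(sum(y)), None, [0, 0, 0, 0], coin))
```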
Contrastive divergence. A third training option is to choose a single y* value
that is somehow similar to the training label y, but also has high probability
according to p(y|x; w). Compared to the "impostor" ŷ, the "evil twin" y* will have
lower probability, but will be more similar to y.
The idea of contrastive divergence is to obtain a single value
$y^* = \langle y_1^*, \ldots, y_n^* \rangle$ by doing only a few iterations of
Gibbs sampling (often only one), but starting at the training label y instead of
at a random guess.
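In code, the change relative to the Gibbs sketch above is small; the sketch below assumes the same hypothetical draw_part helper and is an illustration, not the definitive method.

```python
def cd_sample(x, ybar, draw_part, sweeps=1):
    """Contrastive divergence: start Gibbs at the training label ybar and do
    only a few sweeps (often just one). The returned y* stands in for the
    exact expectation in update (3)."""
    y = list(ybar)
    for _ in range(sweeps):
        for i in range(len(y)):
            y[i] = draw_part(x, y, i)
    return y
```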
How to do Gibbs sampling. Gibbs sampling relies on drawing samples
efficiently from marginal distributions. Let $y_{-i}$ be an abbreviation for the
set $\{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. We need to draw values
according to the distribution $p(y_i \mid x, y_{-i}; w)$. The straightforward way
to do this is to evaluate $p(v \mid x, y_{-i}; w)$ numerically for each possible
value v of $y_i$. In typical applications the number of alternative values v is
small, so this approach is feasible, provided that $p(v \mid x, y_{-i}; w)$ can be
computed.
Suppose the entire conditional distribution is a Markov random field
$$p(y|x; w) \propto \prod_{m=1}^{M} \phi_m(y^m \mid x; w) \qquad (4)$$
where each $\phi_m$ is a potential function that depends on just a subset $y^m$
of the components of y. Linear-chain conditional random fields are a special case
of Equation (4). In this case
$$p(y_i \mid x, y_{-i}; w) \propto \prod_{m \in C} \phi_m(y^m \mid x; w) \qquad (5)$$

where C indexes those potential functions $\phi_m$ whose subset $y^m$ includes the
part $y_i$. To compute $p(y_i \mid x, y_{-i}; w)$ we evaluate the product (5) for
each value of $y_i$, with the given fixed values of
$y_{-i} = \{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. We then normalize using
$$Z(x, y_{-i}; w) = \sum_{v} \prod_{m \in C} \phi_m(y^m \mid x; w)$$
where v ranges over the possible values of $y_i$.
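For a linear chain, the product (5) involves only the two potentials that share yi with its neighbors. The hedged Python sketch below implements this using the g scores of Section 3; it is one possible implementation of the draw_part helper assumed in the Gibbs sketch (g, tags, and n replace the input x̄, which g already encodes).

```python
import math, random

def draw_part(g, tags, y, i, n):
    """Sample y_i from Equation (5) in a linear chain: the only potentials
    containing y_i are exp g_i(y_{i-1}, y_i) and exp g_{i+1}(y_i, y_{i+1}).
    Here y is a list indexed 1..n with y[0] = START, and g[k][(u, v)] holds
    g_k(u, v) as in Section 3."""
    def local(v):
        s = g[i][(y[i - 1], v)]                   # potential shared with the left neighbor
        if i < n:
            s += g[i + 1][(v, y[i + 1])]          # potential shared with the right neighbor
        return s
    weights = [math.exp(local(v)) for v in tags]  # evaluate (5) for every value v
    # random.choices normalizes the weights, i.e. divides by Z(x, y_-i; w).
    return random.choices(tags, weights=weights)[0]
```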
