
Log-Linear Models

Michael Collins
1 Introduction
This note describes log-linear models, which are very widely used in natural language processing. A key advantage of log-linear models is their flexibility: as we will see, they allow a very rich set of features to be used in a model, arguably much richer representations than the simple estimation techniques we have seen earlier in the course (e.g., the smoothing methods that we initially introduced for language modeling, and which were later applied to other models such as HMMs for tagging, and PCFGs for parsing). In this note we will give motivation for log-linear models, give basic definitions, and describe how parameters can be estimated in these models. In subsequent classes we will see how these models can be applied to a number of natural language processing problems.
2 Motivation
As a motivating example, consider again the language modeling problem, where the task is to derive an estimate of the conditional probability

$$P(W_i = w_i \mid W_1 = w_1 \ldots W_{i-1} = w_{i-1}) = p(w_i \mid w_1 \ldots w_{i-1})$$

for any sequence of words $w_1 \ldots w_i$, where $i$ can be any positive integer. Here $w_i$ is the $i$th word in a document: our task is to model the distribution over the word $w_i$, conditioned on the previous sequence of words $w_1 \ldots w_{i-1}$.
In trigram language models, we assumed that

$$p(w_i \mid w_1 \ldots w_{i-1}) = q(w_i \mid w_{i-2}, w_{i-1})$$

where $q(w \mid u, v)$ for any trigram $(u, v, w)$ is a parameter of the model. We studied a variety of ways of estimating the $q$ parameters; as one example, we studied linear interpolation, where

$$q(w \mid u, v) = \lambda_1 q_{ML}(w \mid u, v) + \lambda_2 q_{ML}(w \mid v) + \lambda_3 q_{ML}(w) \quad (1)$$

Here each $q_{ML}$ is a maximum-likelihood estimate, and $\lambda_1, \lambda_2, \lambda_3$ are parameters dictating the weight assigned to each estimate (recall that we had the constraints that $\lambda_1 + \lambda_2 + \lambda_3 = 1$, and $\lambda_i \geq 0$ for all $i$).
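As a concrete illustration, the interpolated estimate in Eq. 1 can be sketched in a few lines of Python. This is a minimal sketch only: it assumes the maximum-likelihood estimates have already been computed and stored in dictionaries keyed by n-grams, and the dictionary and parameter names are illustrative rather than part of the original note.

```python
# Sketch of the interpolated estimate in Eq. 1, assuming q_ml_tri, q_ml_bi and
# q_ml_uni hold maximum-likelihood estimates keyed by (u, v, w), (v, w) and w.

def interpolated_q(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni,
                   lambda1=0.5, lambda2=0.3, lambda3=0.2):
    """Return lambda1*q_ML(w|u,v) + lambda2*q_ML(w|v) + lambda3*q_ML(w)."""
    assert abs(lambda1 + lambda2 + lambda3 - 1.0) < 1e-9   # weights sum to one
    return (lambda1 * q_ml_tri.get((u, v, w), 0.0)
            + lambda2 * q_ml_bi.get((v, w), 0.0)
            + lambda3 * q_ml_uni.get(w, 0.0))
```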
Trigram language models are quite effective, but they make relatively narrow use of the context $w_1 \ldots w_{i-1}$. Consider, for example, the case where the context $w_1 \ldots w_{i-1}$ is the following sequence of words:

Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
Assume in addition that we'd like to estimate the probability of the word model appearing as word $w_i$, i.e., we'd like to estimate

$$P(W_i = \text{model} \mid W_1 = w_1 \ldots W_{i-1} = w_{i-1})$$

In addition to the previous two words in the document (as used in trigram language models), we could imagine conditioning on all kinds of features of the context, which might be useful evidence in estimating the probability of seeing model as the next word. For example, we might consider the probability of model conditioned on word $w_{i-2}$, ignoring $w_{i-1}$ completely:

$$P(W_i = \text{model} \mid W_{i-2} = \text{any})$$

We might condition on the fact that the previous word is an adjective

$$P(W_i = \text{model} \mid \text{pos}(W_{i-1}) = \text{adjective})$$

here pos is a function that maps a word to its part of speech. (For simplicity we assume that this is a deterministic function, i.e., the mapping from a word to its underlying part-of-speech is unambiguous.) We might condition on the fact that the previous word's suffix is "ical":

$$P(W_i = \text{model} \mid \text{suff4}(W_{i-1}) = \text{ical})$$

(here suff4 is a function that maps a word to its last four characters). We might condition on the fact that the word model does not appear in the context:

$$P(W_i = \text{model} \mid W_j \neq \text{model} \text{ for } j \in \{1 \ldots (i-1)\})$$

or we might condition on the fact that the word grammatical does appear in the context:

$$P(W_i = \text{model} \mid W_j = \text{grammatical} \text{ for some } j \in \{1 \ldots (i-1)\})$$

In short, all kinds of information in the context might be useful in estimating the probability of a particular word (e.g., model) in that context.
A naive way to use this information would be to simply extend the methods that we saw for trigram language models. Rather than combining three estimates, based on trigram, bigram, and unigram estimates, we would combine a much larger set of estimates. We would again estimate parameters reflecting the importance or weight of each estimate. The resulting estimator would take something like the following form (this is intended as a sketch only):

$$\begin{aligned}
p(\text{model} \mid w_1, \ldots w_{i-1}) =\ & \lambda_1 \times q_{ML}(\text{model} \mid w_{i-2} = \text{any}, w_{i-1} = \text{statistical})\ + \\
& \lambda_2 \times q_{ML}(\text{model} \mid w_{i-1} = \text{statistical})\ + \\
& \lambda_3 \times q_{ML}(\text{model})\ + \\
& \lambda_4 \times q_{ML}(\text{model} \mid w_{i-2} = \text{any})\ + \\
& \lambda_5 \times q_{ML}(\text{model} \mid w_{i-1} \text{ is an adjective})\ + \\
& \lambda_6 \times q_{ML}(\text{model} \mid w_{i-1} \text{ ends in ical})\ + \\
& \lambda_7 \times q_{ML}(\text{model} \mid \text{model does not occur somewhere in } w_1, \ldots w_{i-1})\ + \\
& \lambda_8 \times q_{ML}(\text{model} \mid \text{grammatical occurs somewhere in } w_1, \ldots w_{i-1})\ + \\
& \ldots
\end{aligned}$$
The problem is that the linear interpolation approach becomes extremely unwieldy as we add more and more pieces of conditioning information. In practice, it is very difficult to extend this approach beyond the case where we have a small number of estimates that fall into a natural hierarchy (e.g., unigram, bigram, trigram estimates). In contrast, we will see that log-linear models offer a much more satisfactory method for incorporating multiple pieces of contextual information.
3 A Second Example: Part-of-speech Tagging
Our second example concerns part-of-speech tagging. Consider the problem where the context is a sequence of words $w_1 \ldots w_n$, together with a sequence of tags, $t_1 \ldots t_{i-1}$ (here $i < n$), and our task is to model the conditional distribution over the $i$th tag in the sequence. That is, we wish to model the conditional distribution

$$P(T_i = t_i \mid T_1 = t_1 \ldots T_{i-1} = t_{i-1}, W_1 = w_1 \ldots W_n = w_n)$$
As an example, we might have the following context:

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base from which Spain expanded its empire into the rest of the Western Hemisphere .

Here $w_1 \ldots w_n$ is the sentence Hispaniola quickly . . . Hemisphere ., and the previous sequence of tags is $t_1 \ldots t_5 = $ NNP RB VB DT JJ. We have $i = 6$, and our task is to model the distribution

$$P(T_6 = t_6 \mid W_1 \ldots W_n = \text{Hispaniola quickly} \ldots \text{Hemisphere .},\; T_1 \ldots T_5 = \text{NNP RB VB DT JJ})$$

i.e., our task is to model the distribution over tags for the 6th word, base, in the sentence.
In this case there are again many pieces of contextual information that might be useful in estimating the distribution over values for $t_i$. To be concrete, consider estimating the probability that the tag for base is V (i.e., $T_6 = \text{V}$). We might consider the probability conditioned on the identity of the $i$th word:

$$P(T_6 = \text{V} \mid W_6 = \text{base})$$

and we might also consider the probability conditioned on the previous one or two tags:

$$P(T_6 = \text{V} \mid T_5 = \text{JJ})$$

$$P(T_6 = \text{V} \mid T_4 = \text{DT}, T_5 = \text{JJ})$$

We might consider the probability conditioned on the previous word in the sentence

$$P(T_6 = \text{V} \mid W_5 = \text{important})$$

or the probability conditioned on the next word in the sentence

$$P(T_6 = \text{V} \mid W_7 = \text{from})$$

We might also consider the probability based on spelling features of the word $w_6$, for example the last two letters of $w_6$:

$$P(T_6 = \text{V} \mid \text{suff2}(W_6) = \text{se})$$

(here suff2 is a function that maps a word to its last two letters).

In short, we again have a scenario where a whole variety of contextual features might be useful in modeling the distribution over the random variable of interest (in this case the identity of the $i$th tag). Again, a naive approach based on an extension of linear interpolation would unfortunately fail badly when faced with this estimation problem.
4 Log-Linear Models
We now describe how log-linear models can be applied to problems of the above
form.
4.1 Basic Definitions
The abstract problem is as follows. We have some set of possible inputs, $\mathcal{X}$, and a set of possible labels, $\mathcal{Y}$. Our task is to model the conditional probability

$$p(y \mid x)$$

for any pair $(x, y)$ such that $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

For example, in the language modeling task we have some finite set of possible words in the language, call this set $\mathcal{V}$. The set $\mathcal{Y}$ is simply equal to $\mathcal{V}$. The set $\mathcal{X}$ is the set of possible sequences $w_1 \ldots w_{i-1}$ such that $i \geq 1$, and $w_j \in \mathcal{V}$ for $j \in \{1 \ldots (i-1)\}$.

In the part-of-speech tagging example, we have some set $\mathcal{V}$ of possible words, and a set $\mathcal{T}$ of possible tags. The set $\mathcal{Y}$ is simply equal to $\mathcal{T}$. The set $\mathcal{X}$ is the set of contexts of the form

$$\langle w_1 w_2 \ldots w_n,\; t_1 t_2 \ldots t_{i-1} \rangle$$

where $n \geq 1$ is an integer specifying the length of the input sentence, $w_j \in \mathcal{V}$ for $j \in \{1 \ldots n\}$, $i \in \{1 \ldots (n-1)\}$, and $t_j \in \mathcal{T}$ for $j \in \{1 \ldots (i-1)\}$.

We will assume throughout that $\mathcal{Y}$ is a finite set. The set $\mathcal{X}$ could be finite, countably infinite, or even uncountably infinite.
Log-linear models are then defined as follows:

Definition 1 (Log-linear Models) A log-linear model consists of the following components:

• A set $\mathcal{X}$ of possible inputs.

• A set $\mathcal{Y}$ of possible labels. The set $\mathcal{Y}$ is assumed to be finite.

• A positive integer $d$ specifying the number of features and parameters in the model.

• A function $f : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$ that maps any $(x, y)$ pair to a feature-vector $f(x, y)$.

• A parameter vector $v \in \mathbb{R}^d$.

For any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the model defines a conditional probability

$$p(y \mid x; v) = \frac{\exp\left(v \cdot f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)}$$

Here $\exp(x) = e^x$, and $v \cdot f(x, y) = \sum_{k=1}^{d} v_k f_k(x, y)$ is the inner product between $v$ and $f(x, y)$. The term $p(y \mid x; v)$ is intended to be read as the probability of $y$ conditioned on $x$, under parameter values $v$.

We now describe the components of the model in more detail, first focusing on the feature-vector definitions $f(x, y)$, then giving intuition behind the model form

$$p(y \mid x; v) = \frac{\exp\left(v \cdot f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)}$$
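To make Definition 1 concrete, the following minimal sketch computes $p(y \mid x; v)$ for a small label set. It assumes a sparse representation in which the feature function returns a `{feature_index: value}` dictionary; the function names and representation are illustrative assumptions, not part of the original note.

```python
import math

def dot(v, feats):
    """Inner product v . f(x, y), with f(x, y) given as a sparse {index: value} dict."""
    return sum(v[k] * val for k, val in feats.items())

def conditional_probability(x, y, labels, f, v):
    """p(y | x; v) = exp(v . f(x, y)) / sum_{y'} exp(v . f(x, y'))."""
    scores = {label: dot(v, f(x, label)) for label in labels}
    m = max(scores.values())                      # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(exp_scores.values())                  # the normalization term
    return exp_scores[y] / z
```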
5 Features
As described in the previous section, for any pair $(x, y)$, $f(x, y) \in \mathbb{R}^d$ is a feature vector representing that pair. Each component $f_k(x, y)$ for $k = 1 \ldots d$ in this vector is referred to as a feature. The features allow us to represent different properties of the input $x$, in conjunction with the label $y$. Each feature has an associated parameter, $v_k$, whose value is estimated using a set of training examples. The training set consists of a sequence of examples $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots n$, where each $x^{(i)} \in \mathcal{X}$, and each $y^{(i)} \in \mathcal{Y}$.

In this section we first give an example of how features can be constructed for the language modeling problem, as introduced earlier in this note; we then describe some practical issues in defining features.
5.1 Features for the Language Modeling Example
Consider again the language modeling problem, where the input $x$ is a sequence of words $w_1 w_2 \ldots w_{i-1}$, and the label $y$ is a word. Figure 1 shows a set of example features for this problem. Each feature is an indicator function: that is, each feature is a function that returns either 1 or 0. It is extremely common in NLP applications to have indicator functions as features. Each feature returns the value of 1 if some property of the input $x$ conjoined with the label $y$ is true, and 0 otherwise.

The first three features, $f_1$, $f_2$, and $f_3$, are analogous to unigram, bigram, and trigram features in a regular trigram language model. The first feature returns 1 if the label $y$ is equal to the word model, and 0 otherwise. The second feature returns 1 if the bigram $w_{i-1}\; y$ is equal to "statistical model", and 0 otherwise. The third feature returns 1 if the trigram $w_{i-2}\; w_{i-1}\; y$ is equal to "any statistical model",
and 0 otherwise. Recall that each of these features will have a parameter, $v_1$, $v_2$, or $v_3$; these parameters will play a similar role to the parameters in a regular trigram language model.

$$f_1(x, y) = \begin{cases} 1 & \text{if } y = \text{model} \\ 0 & \text{otherwise} \end{cases}$$

$$f_2(x, y) = \begin{cases} 1 & \text{if } y = \text{model and } w_{i-1} = \text{statistical} \\ 0 & \text{otherwise} \end{cases}$$

$$f_3(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, w_{i-2} = \text{any}, w_{i-1} = \text{statistical} \\ 0 & \text{otherwise} \end{cases}$$

$$f_4(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, w_{i-2} = \text{any} \\ 0 & \text{otherwise} \end{cases}$$

$$f_5(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, w_{i-1} \text{ is an adjective} \\ 0 & \text{otherwise} \end{cases}$$

$$f_6(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, w_{i-1} \text{ ends in ical} \\ 0 & \text{otherwise} \end{cases}$$

$$f_7(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, \text{model is not in } w_1, \ldots w_{i-1} \\ 0 & \text{otherwise} \end{cases}$$

$$f_8(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, \text{grammatical is in } w_1, \ldots w_{i-1} \\ 0 & \text{otherwise} \end{cases}$$

Figure 1: Example features for the language modeling problem, where the input $x$ is a sequence of words $w_1 w_2 \ldots w_{i-1}$, and the label $y$ is a word.
The features $f_4 \ldots f_8$ in Figure 1 consider properties that go beyond unigram, bigram, and trigram features. The feature $f_4$ considers word $w_{i-2}$ in conjunction with the label $y$, ignoring the word $w_{i-1}$; this type of feature is often referred to as a "skip bigram". Feature $f_5$ considers the part-of-speech of the previous word (assume again that the part-of-speech for the previous word is available, for example through a deterministic mapping from words to their part-of-speech, or perhaps through a POS tagger's output on words $w_1 \ldots w_{i-1}$). Feature $f_6$ considers the suffix of the previous word, and features $f_7$ and $f_8$ consider various other features of the input $x = w_1 \ldots w_{i-1}$.

From this example we can see that it is possible to incorporate a broad set of contextual information into the language modeling problem, using features which are indicator functions.
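A few of the indicator features in Figure 1 can be written directly as functions. In this sketch the context $x$ is assumed to be a Python list holding $w_1 \ldots w_{i-1}$ and the label $y$ a string; that representation is an assumption of the sketch, not something fixed by the note.

```python
# Indicator features f1, f2, f4 and f6 from Figure 1, with x = [w_1, ..., w_{i-1}].

def f1(x, y):
    return 1 if y == "model" else 0

def f2(x, y):
    return 1 if y == "model" and len(x) >= 1 and x[-1] == "statistical" else 0

def f4(x, y):   # "skip bigram": looks at w_{i-2} and ignores w_{i-1}
    return 1 if y == "model" and len(x) >= 2 and x[-2] == "any" else 0

def f6(x, y):   # suffix of the previous word
    return 1 if y == "model" and len(x) >= 1 and x[-1].endswith("ical") else 0
```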
5.2 Feature Templates
We now discuss some practical issues in defining features. In practice, a key idea in defining features is that of feature templates. We introduce this idea in this section.

Recall that our first three features in the previous example were as follows:

$$f_1(x, y) = \begin{cases} 1 & \text{if } y = \text{model} \\ 0 & \text{otherwise} \end{cases}$$

$$f_2(x, y) = \begin{cases} 1 & \text{if } y = \text{model and } w_{i-1} = \text{statistical} \\ 0 & \text{otherwise} \end{cases}$$

$$f_3(x, y) = \begin{cases} 1 & \text{if } y = \text{model}, w_{i-2} = \text{any}, w_{i-1} = \text{statistical} \\ 0 & \text{otherwise} \end{cases}$$

These features track the unigram "model", the bigram "statistical model", and the trigram "any statistical model".

Each of these features is specific to a particular unigram, bigram or trigram. In practice, we would like to define a much larger class of features, which consider all possible unigrams, bigrams or trigrams seen in the training data. To do this, we use feature templates to generate large sets of features.

As one example, here is a feature template for trigrams:
Definition 2 (Trigram feature template) For any trigram $(u, v, w)$ seen in training data, create a feature

$$f_{N(u,v,w)}(x, y) = \begin{cases} 1 & \text{if } y = w, w_{i-2} = u, w_{i-1} = v \\ 0 & \text{otherwise} \end{cases}$$

where $N(u, v, w)$ is a function that maps each trigram in the training data to a unique integer.
A couple of notes on this definition:

• Note that the template only generates trigram features for those trigrams seen in training data. There are two reasons for this restriction. First, it is not feasible to generate a feature for every possible trigram, even those not seen in training data: this would lead to $V^3$ features, where $V$ is the number of words in the vocabulary, which is a very large set of features. Second, for any trigram $(u, v, w)$ not seen in training data, we do not have evidence to estimate the associated parameter value, so there is no point including it in any case.¹

• The function $N(u, v, w)$ maps each trigram to a unique integer: that is, it is a function such that for any trigrams $(u, v, w)$ and $(u', v', w')$ such that $u \neq u'$, $v \neq v'$, or $w \neq w'$, we have

$$N(u, v, w) \neq N(u', v', w')$$

In practice, in implementations of feature templates, the function $N$ is implemented through a hash function. For example, we could use a hash table to hash strings such as "trigram=any statistical model" to integers. Each distinct string is hashed to a different integer.
Continuing with the example, we can also define bigram and unigram feature templates:

Definition 3 (Bigram feature template) For any bigram $(v, w)$ seen in training data, create a feature

$$f_{N(v,w)}(x, y) = \begin{cases} 1 & \text{if } y = w, w_{i-1} = v \\ 0 & \text{otherwise} \end{cases}$$

where $N(v, w)$ maps each bigram to a unique integer.
¹This isn't quite accurate: there may in fact be reasons for including features for trigrams $(u, v, w)$ where the bigram $(u, v)$ is observed in the training data, but the trigram $(u, v, w)$ is not observed in the training data. We defer discussion of this until later.
Definition 4 (Unigram feature template) For any unigram $(w)$ seen in training data, create a feature

$$f_{N(w)}(x, y) = \begin{cases} 1 & \text{if } y = w \\ 0 & \text{otherwise} \end{cases}$$

where $N(w)$ maps each unigram to a unique integer.
We actually need to be slightly more careful with these definitions, to avoid overlap between trigram, bigram, and unigram features. Define $T$, $B$ and $U$ to be the set of trigrams, bigrams, and unigrams seen in the training data. Define

$$N_t = \{i : \exists (u, v, w) \in T \text{ such that } N(u, v, w) = i\}$$

$$N_b = \{i : \exists (v, w) \in B \text{ such that } N(v, w) = i\}$$

$$N_u = \{i : \exists (w) \in U \text{ such that } N(w) = i\}$$

Then we need to make sure that there is no overlap between these sets; otherwise, two different n-grams would be mapped to the same feature. More formally, we need

$$N_t \cap N_b = N_t \cap N_u = N_b \cap N_u = \emptyset \quad (2)$$

In practice, it is easy to ensure this when implementing log-linear models, using a single hash table to hash strings such as "trigram=any statistical model", "bigram=statistical model", "unigram=model", to distinct integers.
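The single-hash-table scheme just described can be sketched as follows, with a Python dictionary playing the role of the hash table. The function names, the boundary symbol `*`, and the representation of training examples as (context, label) pairs are illustrative assumptions of this sketch.

```python
feature_index = {}   # single hash table mapping feature strings to distinct integers

def index_of(feature_string):
    """Return the unique integer for a feature string, assigning a fresh one if unseen."""
    if feature_string not in feature_index:
        feature_index[feature_string] = len(feature_index)
    return feature_index[feature_string]

def register_ngram_features(training_data):
    """training_data: list of (x, y) pairs with x = [w_1, ..., w_{i-1}] and y = w_i.

    The string prefixes "trigram=", "bigram=" and "unigram=" guarantee that the
    three templates never share an integer, which is the condition in Eq. 2.
    """
    for x, y in training_data:
        u = x[-2] if len(x) >= 2 else "*"   # "*" is a boundary symbol (an assumption here)
        v = x[-1] if len(x) >= 1 else "*"
        index_of("trigram=%s %s %s" % (u, v, y))
        index_of("bigram=%s %s" % (v, y))
        index_of("unigram=%s" % y)
```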
We could of course define additional templates. For example, the following is a template which tracks the length-4 suffix of the previous word, in conjunction with the label $y$:

Definition 5 (Length-4 Suffix Template) For any pair $(v, w)$ seen in training data, where $v = \text{suff4}(w_{i-1})$, and $w = y$, create a feature

$$f_{N(\text{suff4}=v,\, w)}(x, y) = \begin{cases} 1 & \text{if } y = w \text{ and } \text{suff4}(w_{i-1}) = v \\ 0 & \text{otherwise} \end{cases}$$

where $N(\text{suff4} = v, w)$ maps each pair $(v, w)$ to a unique integer, with no overlap with the other feature templates used in the model (where overlap is defined analogously to Eq. 2 above).
5.3 Feature Sparsity
A very important property of the features we have defined above is feature sparsity. The number of features, $d$, in many NLP applications can be extremely large. For example, with just the trigram template defined above, we would have one feature for each trigram seen in training data. It is not untypical to see models with hundreds of thousands or even millions of features.

This raises obvious concerns with efficiency of the resulting models. However, we describe in this section how feature sparsity can lead to efficient models.

The key observation is the following: for any given pair $(x, y)$, the number of values for $k$ in $\{1 \ldots d\}$ such that

$$f_k(x, y) = 1$$

is often very small, and is typically much smaller than the total number of features, $d$. Thus all but a very small subset of the features are 0: the feature vector $f(x, y)$ is a very sparse bit-string, where almost all features $f_k(x, y)$ are equal to 0, and only a few features are equal to 1.
As one example, consider the language modeling example where we use only the trigram, bigram and unigram templates, as described above. The number of features in this model is large (it is equal to the number of distinct trigrams, bigrams and unigrams seen in training data). However, it can be seen immediately that for any pair $(x, y)$, at most three features are non-zero (in the worst case, the pair $(x, y)$ contains trigram, bigram and unigram features which are all seen in the training data, giving three non-zero features in total).

When implementing log-linear models, models with sparse features can be quite efficient, because there is no need to explicitly represent and manipulate $d$-dimensional feature vectors $f(x, y)$. Instead, it is generally much more efficient to implement a function (typically through hash tables) that for any pair $(x, y)$ computes the indices of the non-zero features: i.e., a function that computes the set

$$Z(x, y) = \{k : f_k(x, y) = 1\}$$

This set is small in sparse feature spaces; for example with unigram/bigram/trigram features alone, it would be of size at most 3. In general, it is straightforward to implement a function that computes $Z(x, y)$ in $O(|Z(x, y)|)$ time, using hash functions. Note that $|Z(x, y)| \ll d$, so this is much more efficient than explicitly computing all $d$ features, which would take $O(d)$ time.

As one example of how efficient computation of $Z(x, y)$ can be very helpful, consider computation of the inner product

$$v \cdot f(x, y) = \sum_{k=1}^{d} v_k f_k(x, y)$$

This computation is central in log-linear models. A naive method would iterate over each of the $d$ features in turn, and would take $O(d)$ time. In contrast, if we make use of the identity

$$\sum_{k=1}^{d} v_k f_k(x, y) = \sum_{k \in Z(x,y)} v_k$$

hence looking at only the non-zero features, we can compute the inner product in $O(|Z(x, y)|)$ time.
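Continuing the illustrative hash-table sketch from Section 5.2, the set $Z(x, y)$ and the sparse inner product can be computed as follows. Here `feature_index` is the assumed string-to-integer table from that sketch, and all features are taken to be 0/1-valued.

```python
def active_features(x, y, feature_index):
    """Z(x, y): indices of the (at most three) n-gram indicator features that fire.

    Strings never seen in training data have no entry in feature_index, so the
    corresponding features simply do not fire.
    """
    u = x[-2] if len(x) >= 2 else "*"
    v = x[-1] if len(x) >= 1 else "*"
    strings = ["trigram=%s %s %s" % (u, v, y),
               "bigram=%s %s" % (v, y),
               "unigram=%s" % y]
    return [feature_index[s] for s in strings if s in feature_index]

def sparse_inner_product(weights, x, y, feature_index):
    """Compute v . f(x, y) in O(|Z(x, y)|) time for indicator features."""
    return sum(weights[k] for k in active_features(x, y, feature_index))
```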
6 The Model form for Log-Linear Models
We now describe the model form for log-linear models in more detail. Recall that for any pair $(x, y)$ such that $x \in \mathcal{X}$, and $y \in \mathcal{Y}$, the conditional probability under the model is

$$p(y \mid x; v) = \frac{\exp\left(v \cdot f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)}$$

The inner products

$$v \cdot f(x, y)$$

play a key role in this expression. Again, for illustration consider our language-modeling example where the input $x = w_1 \ldots w_{i-1}$ is the following sequence of words:

Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical

The first step in calculating the probability distribution over the next word in the document, conditioned on $x$, is to calculate the inner product $v \cdot f(x, y)$ for each possible label $y$ (i.e., for each possible word in the vocabulary). We might, for example, find the following values (we show the values for just a few possible words; in reality we would compute an inner product for each possible word):

$$v \cdot f(x, \text{model}) = 5.6 \qquad v \cdot f(x, \text{the}) = 3.2$$

$$v \cdot f(x, \text{is}) = 1.5 \qquad v \cdot f(x, \text{of}) = 1.3$$

$$v \cdot f(x, \text{models}) = 4.5 \qquad \ldots$$
Note that the inner products can take any value in the reals, positive or negative. Intuitively, if the inner product $v \cdot f(x, y)$ for a given word $y$ is high, this indicates that the word has high probability given the context $x$. Conversely, if $v \cdot f(x, y)$ is low, it indicates that $y$ has low probability in this context.

The inner products $v \cdot f(x, y)$ can take any value in the reals; our goal, however, is to define a conditional distribution $p(y \mid x)$. If we take

$$\exp\left(v \cdot f(x, y)\right)$$

for any label $y$, we now have a value that is greater than 0. If $v \cdot f(x, y)$ is high, this value will be high; if $v \cdot f(x, y)$ is low, for example if it is strongly negative, this value will be low (close to zero).

Next, if we divide the above quantity by

$$\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)$$

giving

$$p(y \mid x; v) = \frac{\exp\left(v \cdot f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)} \quad (3)$$

then it is easy to verify that we have a well-formed distribution: that is,

$$\sum_{y \in \mathcal{Y}} p(y \mid x; v) = 1$$

Thus the denominator in Eq. 3 is a normalization term, which ensures that we have a distribution that sums to one. In summary, the function

$$\frac{\exp\left(v \cdot f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)}$$

performs a transformation which takes as input a set of values $\{v \cdot f(x, y) : y \in \mathcal{Y}\}$, where each $v \cdot f(x, y)$ can take any value in the reals, and as output produces a probability distribution over the labels $y \in \mathcal{Y}$.
Finally, we consider where the name log-linear models originates from. It follows from the above definitions that

$$\log p(y \mid x; v) = v \cdot f(x, y) - \log \sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right) = v \cdot f(x, y) - g(x)$$

where

$$g(x) = \log \sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x, y')\right)$$

The first term, $v \cdot f(x, y)$, is linear in the features $f(x, y)$. The second term, $g(x)$, depends only on $x$, and does not depend on the label $y$. Hence the log probability $\log p(y \mid x; v)$ is a linear function in the features $f(x, y)$, as long as we hold $x$ fixed; this justifies the term "log-linear".
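The decomposition $\log p(y \mid x; v) = v \cdot f(x, y) - g(x)$ also suggests a numerically stable way to compute log probabilities, using the standard log-sum-exp trick. In this sketch, `score(x, y)` stands in for the inner product $v \cdot f(x, y)$ and is an assumption of the example.

```python
import math

def log_probability(x, y, labels, score):
    """log p(y | x; v) = score(x, y) - g(x), with g(x) computed via log-sum-exp."""
    scores = {label: score(x, label) for label in labels}
    m = max(scores.values())
    g_x = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[y] - g_x
```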
7 Parameter Estimation in Log-Linear Models
7.1 The Log-Likelihood Function, and Regularization
We now consider the problem of parameter estimation in log-linear models. We assume that we have a training set, consisting of examples $(x^{(i)}, y^{(i)})$ for $i \in \{1 \ldots n\}$, where each $x^{(i)} \in \mathcal{X}$, and each $y^{(i)} \in \mathcal{Y}$.

Given parameter values $v$, for any example $i$, we can calculate the log conditional probability

$$\log p(y^{(i)} \mid x^{(i)}; v)$$

under the model. Intuitively, the higher this value, the better the model fits this particular example. The log-likelihood considers the sum of log probabilities of examples in the training data:

$$L(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v) \quad (4)$$

This is a function of the parameters $v$. For any parameter vector $v$, the value of $L(v)$ can be interpreted as a measure of how well the parameter vector fits the training examples.

The first estimation method we will consider is maximum-likelihood estimation, where we choose our parameters as

$$v_{ML} = \arg\max_{v \in \mathbb{R}^d} L(v)$$

In the next section we describe how the parameters $v_{ML}$ can be found efficiently. Intuitively, this estimation method finds the parameters which fit the data as well as possible.
The maximum-likelihood estimates can run into problems, in particular in cases where the number of features in the model is very large. To illustrate, consider the language-modeling problem again, and assume that we have trigram, bigram and unigram features. Now assume that we have some trigram $(u, v, w)$ which is seen only once in the training data; to be concrete, assume that the trigram is "any statistical model", and assume that this trigram is seen on the 100th training example alone. More precisely, we assume that

$$f_{N(\text{any},\text{statistical},\text{model})}(x^{(100)}, y^{(100)}) = 1$$

In addition, assume that this is the only trigram $(u, v, w)$ in training data with $u = $ any, and $v = $ statistical. In this case, it can be shown that the maximum-likelihood parameter estimate for $v_{100}$ is $+\infty$,² which gives

$$p(y^{(100)} \mid x^{(100)}; v) = 1$$

In fact, we have a very similar situation to the case in maximum-likelihood estimates for regular trigram models, where we would have

$$q_{ML}(\text{model} \mid \text{any, statistical}) = 1$$

for this trigram. As discussed earlier in the class, this model is clearly undersmoothed, and it will generalize badly to new test examples. It is unreasonable to assign

$$P(W_i = \text{model} \mid W_{i-2}, W_{i-1} = \text{any, statistical}) = 1$$

based on the evidence that the bigram "any statistical" is seen once, and on that one instance the bigram is followed by the word model.
A very common solution for log-linear models is to modify the objective function in Eq. 4 to include a regularization term, which prevents parameter values from becoming too large (and in particular, prevents parameter values from diverging to infinity). A common regularization term is the 2-norm of the parameter values, that is,

$$||v||^2 = \sum_{k} v_k^2$$

(here $||v||$ is simply the length, or Euclidean norm, of a vector $v$; i.e., $||v|| = \sqrt{\sum_k v_k^2}$). The modified objective function is

$$L'(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v) - \frac{\lambda}{2} \sum_{k} v_k^2 \quad (5)$$
²It is relatively easy to prove that $v_{100}$ can diverge to $\infty$. To give a sketch: under the above assumptions, the feature $f_{N(\text{any},\text{statistical},\text{model})}(x, y)$ is equal to 1 on only a single pair $(x^{(i)}, y)$ where $i \in \{1 \ldots n\}$, and $y \in \mathcal{Y}$, namely the pair $(x^{(100)}, y^{(100)})$. Because of this, as $v_{100} \rightarrow \infty$, we will have $p(y^{(100)} \mid x^{(100)}; v)$ tending closer and closer to a value of 1, with all other values $p(y^{(i)} \mid x^{(i)}; v)$ remaining unchanged. Thus we can use this one parameter to maximize the value for $\log p(y^{(100)} \mid x^{(100)}; v)$, independently of the probability of all other examples in the training set.
where $\lambda > 0$ is a parameter, which is typically chosen by validation on some held-out dataset. We again choose the parameter values to maximize the objective function: that is, our optimal parameter values are

$$v^* = \arg\max_{v} L'(v)$$

The key idea behind the modified objective in Eq. 5 is that we now balance two separate terms. The first term is the log-likelihood on the training data, and can be interpreted as a measure of how well the parameters $v$ fit the training examples. The second term is a penalty on large parameter values: it encourages parameter values to be as close to zero as possible. The parameter $\lambda$ defines the relative weighting of the two terms. In practice, the final parameters $v^*$ will be a compromise between fitting the data as well as is possible, and keeping their values as small as possible. In practice, this use of regularization is very effective in smoothing of log-linear models.
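The two objective functions in Eqs. 4 and 5 are straightforward to compute given a routine for log probabilities. In this sketch, `log_prob(x, y, v)` is an assumed helper returning $\log p(y \mid x; v)$ (for example, a variant of the log-sum-exp sketch in Section 6), and `data` is the list of training pairs $(x^{(i)}, y^{(i)})$.

```python
def log_likelihood(v, data, log_prob):
    """L(v) in Eq. 4: sum of log p(y_i | x_i; v) over the training examples."""
    return sum(log_prob(x, y, v) for x, y in data)

def regularized_log_likelihood(v, data, log_prob, lam):
    """L'(v) in Eq. 5: the log-likelihood minus (lam / 2) * sum_k v_k^2."""
    penalty = 0.5 * lam * sum(vk * vk for vk in v)
    return log_likelihood(v, data, log_prob) - penalty
```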
7.2 Finding the Optimal Parameters
First, consider finding the maximum-likelihood parameter estimates: that is, the problem of finding

$$v_{ML} = \arg\max_{v \in \mathbb{R}^d} L(v)$$

where

$$L(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v)$$

The bad news is that in the general case, there is no closed-form solution for the maximum-likelihood parameters $v_{ML}$. The good news is that finding $\arg\max_v L(v)$ is a relatively easy problem, because $L(v)$ can be shown to be a convex function. This means that simple gradient-ascent-style methods will find the optimal parameters $v_{ML}$ relatively quickly.
Figure 2 gives a sketch of a gradient-based algorithm for optimization of $L(v)$. The parameter vector is initialized to the vector of all zeros. At each iteration we first calculate the gradients $\delta_k$ for $k = 1 \ldots d$. We then move in the direction of the gradient: more precisely, we set $v \leftarrow v + \beta^* \delta$ where $\beta^*$ is chosen to give the optimal improvement in the objective function. This is a hill-climbing technique where at each point we compute the steepest direction to move in (i.e., the direction of the gradient); we then move the distance in that direction which gives the greatest value for $L(v)$.

Initialization: $v = 0$

Iterate until convergence:

• Calculate $\delta_k = \frac{dL(v)}{dv_k}$ for $k = 1 \ldots d$

• Calculate $\beta^* = \arg\max_{\beta \in \mathbb{R}} L(v + \beta\delta)$ where $\delta$ is the vector with components $\delta_k$ for $k = 1 \ldots d$ (this step is performed using some type of line search)

• Set $v \leftarrow v + \beta^*\delta$

Figure 2: A gradient ascent algorithm for optimization of $L(v)$.
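A minimal version of the procedure in Figure 2 is sketched below. For simplicity the line search is replaced by a fixed step size, which is a deliberate simplification of the algorithm in the figure; `gradient(v)` is an assumed helper returning the vector of partial derivatives $dL(v)/dv_k$ (see Section 7.3).

```python
def gradient_ascent(d, gradient, step_size=0.1, iterations=1000, tol=1e-6):
    """Maximize L(v) by repeatedly stepping in the direction of the gradient."""
    v = [0.0] * d                                # Initialization: v = 0
    for _ in range(iterations):
        g = gradient(v)
        v = [vk + step_size * gk for vk, gk in zip(v, g)]
        if max(abs(gk) for gk in g) < tol:       # crude convergence test
            break
    return v
```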
Simple gradient ascent, as shown in Figure 2, can be rather slow to converge. Fortunately there are many standard packages for gradient-based optimization, which use more sophisticated algorithms, and which give considerably faster convergence. As one example, a commonly used method for parameter estimation in log-linear models is LBFGS. LBFGS is again a gradient method, but it makes a more intelligent choice of search direction at each step. It does however rely on the computation of $L(v)$ and $\frac{dL(v)}{dv_k}$ for $k = 1 \ldots d$ at each step; in fact this is the only information it requires about the function being optimized. In summary, if we can compute $L(v)$ and $\frac{dL(v)}{dv_k}$ efficiently, then it is simple to use an existing gradient-based optimization package (e.g., based on LBFGS) to find the maximum-likelihood estimates.
Optimization of the regularized objective function,

$$L'(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v) - \frac{\lambda}{2} \sum_{k} v_k^2$$

can be performed in a very similar manner, using gradient-based methods. $L'(v)$ is also a convex function, so a gradient-based method will find the global optimum of the parameter estimates.
The one remaining step is to describe how the gradients

$$\frac{dL(v)}{dv_k} \quad \text{and} \quad \frac{dL'(v)}{dv_k}$$

can be calculated. This is the topic of the next section.
7.3 Gradients
We first consider the derivatives

$$\frac{dL(v)}{dv_k}$$

where

$$L(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v)$$

It is relatively easy to show (see the appendix of this note) that for any $k \in \{1 \ldots d\}$,

$$\frac{dL(v)}{dv_k} = \sum_{i=1}^{n} f_k(x^{(i)}, y^{(i)}) - \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} p(y \mid x^{(i)}; v) f_k(x^{(i)}, y) \quad (6)$$
where as before

$$p(y \mid x^{(i)}; v) = \frac{\exp\left(v \cdot f(x^{(i)}, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)}$$

The expression in Eq. 6 has a quite intuitive form. The first part of the expression,

$$\sum_{i=1}^{n} f_k(x^{(i)}, y^{(i)})$$

is simply the number of times that the feature $f_k$ is equal to 1 on the training examples (assuming that $f_k$ is an indicator function; i.e., assuming that $f_k(x^{(i)}, y^{(i)})$ is either 1 or 0). The second part of the expression,

$$\sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} p(y \mid x^{(i)}; v) f_k(x^{(i)}, y)$$

can be interpreted as the expected number of times the feature is equal to 1, where the expectation is taken with respect to the distribution

$$p(y \mid x^{(i)}; v) = \frac{\exp\left(v \cdot f(x^{(i)}, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)}$$

specified by the current parameters. The gradient is then the difference of these terms. It can be seen that the gradient is easily calculated.
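Eq. 6 translates directly into code: accumulate the empirical feature counts, then subtract the expected counts under the current model. This sketch assumes 0/1-valued features accessed through an assumed helper `active_features(x, y)` (the set $Z(x, y)$) and an assumed helper `cond_prob(x, y, labels, v)` returning $p(y \mid x; v)$.

```python
def gradient(v, data, labels, active_features, cond_prob):
    """Return the vector of partial derivatives dL(v)/dv_k from Eq. 6."""
    g = [0.0] * len(v)
    for x, y in data:
        for k in active_features(x, y):          # empirical count of each feature
            g[k] += 1.0
        for label in labels:                     # expected count under p(. | x; v)
            p = cond_prob(x, label, labels, v)
            for k in active_features(x, label):
                g[k] -= p
    return g
```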
The gradients

$$\frac{dL'(v)}{dv_k}$$

where

$$L'(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v) - \frac{\lambda}{2} \sum_{k} v_k^2$$

are derived in a very similar way. We have

$$\frac{d}{dv_k} \left( \sum_{k} v_k^2 \right) = 2 v_k$$

hence

$$\frac{dL'(v)}{dv_k} = \sum_{i=1}^{n} f_k(x^{(i)}, y^{(i)}) - \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} p(y \mid x^{(i)}; v) f_k(x^{(i)}, y) - \lambda v_k \quad (7)$$

Thus the only difference from the gradient in Eq. 6 is the additional term $-\lambda v_k$ in this expression.
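Given the unregularized gradient, the modification for Eq. 7 is a one-line adjustment; `gradient` here is the sketch from the previous code block.

```python
def regularized_gradient(v, data, labels, active_features, cond_prob, lam):
    """dL'(v)/dv_k from Eq. 7: the Eq. 6 gradient with an extra -lam * v_k term."""
    g = gradient(v, data, labels, active_features, cond_prob)
    return [gk - lam * vk for gk, vk in zip(g, v)]
```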
A Calculation of the Derivatives
In this appendix we show how to derive the expression for the derivatives, as given in Eq. 6. Our goal is to find an expression for

$$\frac{dL(v)}{dv_k}$$

where

$$L(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v)$$

First, consider a single term $\log p(y^{(i)} \mid x^{(i)}; v)$. Because

$$p(y^{(i)} \mid x^{(i)}; v) = \frac{\exp\left(v \cdot f(x^{(i)}, y^{(i)})\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)}$$

we have

$$\log p(y^{(i)} \mid x^{(i)}; v) = v \cdot f(x^{(i)}, y^{(i)}) - \log \sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)$$

The derivative of the first term in this expression is simple:

$$\frac{d}{dv_k} \left( v \cdot f(x^{(i)}, y^{(i)}) \right) = \frac{d}{dv_k} \left( \sum_{k} v_k f_k(x^{(i)}, y^{(i)}) \right) = f_k(x^{(i)}, y^{(i)}) \quad (8)$$
Now consider the second term. This takes the form

$$\log g(v)$$

where

$$g(v) = \sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)$$

By the usual rules of differentiation,

$$\frac{d}{dv_k} \log g(v) = \frac{\frac{d}{dv_k} g(v)}{g(v)}$$

In addition, it can be verified that

$$\frac{d}{dv_k} g(v) = \sum_{y' \in \mathcal{Y}} f_k(x^{(i)}, y') \exp\left(v \cdot f(x^{(i)}, y')\right)$$

hence

$$\begin{aligned}
\frac{d}{dv_k} \log g(v) &= \frac{\frac{d}{dv_k} g(v)}{g(v)} \quad &(9) \\
&= \frac{\sum_{y' \in \mathcal{Y}} f_k(x^{(i)}, y') \exp\left(v \cdot f(x^{(i)}, y')\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y')\right)} \quad &(10) \\
&= \sum_{y' \in \mathcal{Y}} \left( f_k(x^{(i)}, y') \, \frac{\exp\left(v \cdot f(x^{(i)}, y')\right)}{\sum_{y'' \in \mathcal{Y}} \exp\left(v \cdot f(x^{(i)}, y'')\right)} \right) \quad &(11) \\
&= \sum_{y' \in \mathcal{Y}} f_k(x^{(i)}, y') \, p(y' \mid x^{(i)}; v) \quad &(12)
\end{aligned}$$
Combining Eqs. 8 and 12 gives

$$\frac{dL(v)}{dv_k} = \sum_{i=1}^{n} f_k(x^{(i)}, y^{(i)}) - \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} p(y \mid x^{(i)}; v) f_k(x^{(i)}, y)$$
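A useful sanity check on an implementation of Eq. 6 is to compare the analytic gradient against a centered finite-difference approximation of $L(v)$. In this sketch, `log_likelihood(v)` and `gradient(v)` are assumed to be single-argument callables wrapping the helpers sketched earlier in the note.

```python
def finite_difference_check(v, k, log_likelihood, gradient, eps=1e-5):
    """Compare the analytic derivative dL/dv_k against a numerical estimate."""
    v_plus = list(v)
    v_plus[k] += eps
    v_minus = list(v)
    v_minus[k] -= eps
    numeric = (log_likelihood(v_plus) - log_likelihood(v_minus)) / (2.0 * eps)
    analytic = gradient(v)[k]
    return abs(numeric - analytic)   # should be close to zero for a correct gradient
```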