Log-Linear Models

and Conditional Random Fields

Charles Elkan
[email protected]
December 6, 2007

This document describes log-linear models, which are a far-reaching extension
of logistic regression, and conditional random fields (CRFs), which are a special
case of log-linear models.
Section 1 explains what a log-linear model is, and introduces feature func-
tions. Section 2 then presents linear-chain CRFs as an example of log-linear mod-
els, and Section 3 explains the special algorithms that make inference tractable
for these CRFs. Section 4 gives a general derivation of the gradient of a log-linear
model; this is the foundation of all log-linear training algorithms. Finally Sec-
tion 5 presents two special CRF training algorithms, one that is a variant of the
perceptron method and another one called contrastive divergence.

1 Log-linear models
Let x be an example, and let y be a possible label for it. A log-linear model
assumes that
\[ p(y \mid x; w) = \frac{\exp \sum_j w_j F_j(x, y)}{Z(x, w)} \tag{1} \]
where the partition function is $Z(x, w) = \sum_{y'} \exp \sum_j w_j F_j(x, y')$.
Therefore, given x, the label predicted by the model is
\[ \hat{y} = \operatorname{argmax}_y \, p(y \mid x; w) = \operatorname{argmax}_y \sum_j w_j F_j(x, y). \]
Each expression Fj (x, y) is called a feature-function. In general, a feature-function
can be any real-valued function of both the data space X and the label space Y .
Formally, a feature-function is any mapping $F_j : X \times Y \to \mathbb{R}$.
Often, a feature-function is zero for all values of y except one particular value.
Given some attribute of x, we can have a different weight for this attribute and
each different label. The weights for these feature-functions can then capture the
affinity of this attribute-value for each label. Often, feature-functions are pres-
ence/absence indicators, so the value of the feature-function is either 0 or 1. If we
have a conventional attribute a(x) with k alternative values, and n classes, we can
make kn different features as defined above. With log-linear models, anything
and the kitchen sink can be a feature. We can have lots of classes, lots of features,
and we can pay attention to different features for different classes.
Feature-functions can overlap in arbitrary ways. For example, if x is a word,
different feature-functions can use attributes of x such as “starts with a capital
letter,” “starts with G,” “is Graham,” and “is six letters long.” Generally we can encode
suffixes, prefixes, facts from a lexicon, preceding/following punctuation, etc., as
features.
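As a minimal sketch of the ideas above, the following Python code defines a few indicator feature-functions for a single word and evaluates Equation (1). The three feature-functions, the two-tag label set, and the weight values are all hypothetical choices made for illustration; they are not taken from any real tagger.

```python
import math

LABELS = ["NOUN", "VERB"]  # hypothetical label set

# Hypothetical indicator feature-functions F_j(x, y): each one pairs an
# attribute of the word x with one particular label y.
FEATURES = [
    lambda x, y: 1.0 if x[:1].isupper() and y == "NOUN" else 0.0,
    lambda x, y: 1.0 if x.endswith("ing") and y == "VERB" else 0.0,
    lambda x, y: 1.0 if len(x) == 6 and y == "NOUN" else 0.0,
]

def score(x, y, w):
    """Linear score sum_j w_j F_j(x, y)."""
    return sum(wj * F(x, y) for wj, F in zip(w, FEATURES))

def predict_proba(x, w):
    """p(y | x; w) for every label y, following Equation (1)."""
    scores = {y: score(x, y, w) for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())  # partition function Z(x, w)
    return {y: math.exp(s) / z for y, s in scores.items()}

w = [1.0, 2.0, 0.5]  # illustrative weights, one per feature-function
probs = predict_proba("Graham", w)
```

For the word "Graham" the first and third feature-functions fire for the label NOUN and none fires for VERB, so the model prefers NOUN.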
Mathematically, log-linear models are very simple: there is one real-valued
weight for each feature, no more and no fewer. There are several possible
justifications for the form of expression (1). First, a linear combination
$\sum_j w_j F_j(x, y)$ can take any positive or negative real value; the
exponential makes it positive, like a valid probability. Second, the division
makes the results between 0 and 1, i.e. makes them valid probabilities. Third,
the ranking of the probabilities will be the same as the ranking of the linear values.
A function of the form
\[ b_k = \frac{\exp a_k}{\sum_{k'} \exp a_{k'}} \]
is called a softmax function because the exponentials enlarge the bigger ak values
compared to the smaller ak values. Other functions have the same property of
being similar to the maximum function, but differentiable. Softmax is widely
used now, perhaps because its derivative is especially simple; see Section 4 below.
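A short sketch of the softmax function follows. Subtracting the maximum before exponentiating is a standard numerical precaution that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(a):
    """Map real values a_k to b_k = exp(a_k) / sum_k' exp(a_k')."""
    m = max(a)                                # shift for numerical stability
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]

b = softmax([1.0, 2.0, 3.0])
big = softmax([1000.0, 1000.0])               # would overflow without the shift
```

The outputs are positive, sum to one, and keep the ordering of the inputs, while the largest input receives a disproportionately large share.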

2 Conditional random fields

A conditional random field (CRF) is an important special case of a log-linear
model. First, consider an example of a learning task for which a CRF is useful.
Given a sentence, the task is to tag each word as noun, verb, adjective, preposition,
etc. There is a fixed known set of these part-of-speech (POS) tags. Each sentence
is a separate training or test example. We will represent a sentence by feature-
functions based on its words. Feature-functions can be very varied:

• Some feature-functions can be position-specific, e.g. to the beginning or
to the end of a sentence, while others can be sums over all positions in a
sentence.

• Some feature-functions can look just at one word, e.g. at its prefixes or
suffixes.

• Some features can also use the words one to the left, one to the right, two to
the left etc., up to the whole sentence.

The highest-accuracy POS taggers currently use over 100,000 feature-functions.

An important restriction (that will be explained and justified below) is that each
feature-function can depend on only one tag, or on two neighboring tags.
POS tagging is an example of what is called a structured prediction task. The
goal is to predict a complex label (a sequence of POS tags) for a complex input (an
entire sentence). This task is difficult, and significantly different from a standard
classifier learning task. There are at least three important sources of difficulty.
First, too much information would be lost by learning just a per-word classifier.
Influences between neighboring tags must be taken into account. Second, different
sentences have different lengths, so it is not obvious how to represent all sentences
by vectors of the same fixed length. Third, the set of all possible sequences of tags
constitutes an exponentially large set of labels.
A linear conditional random field is a way to apply a log-linear model to this
type of task. Use the bar notation for sequences, so x̄ means a sequence of variable
length. Specifically, let x̄ be a sequence of n words and let ȳ be a corresponding
sequence of n tags. Define the log-linear model
\[ p(\bar{y} \mid \bar{x}; w) = \frac{1}{Z(\bar{x}, w)} \exp \sum_j w_j F_j(\bar{x}, \bar{y}). \]

Assume that each feature-function $F_j$ is actually a sum along the sentence, for
i = 1 to i = n where n is the length of $\bar{x}$:
\[ F_j(\bar{x}, \bar{y}) = \sum_i f_j(y_{i-1}, y_i, \bar{x}, i). \]
This notation means that each low-level feature-function fj can depend on the
whole sentence, the current tag and the previous tag, and the current position i
within the sentence. A feature-function fj may depend on only a subset of these
four possible influences. Examples of features are “the current tag is NOUN and
the current word is capitalized,” “the word at the start of the sentence is Mr.” and
“the previous tag was SALUTATION.”
Summing each fj over all positions i means that we can have a fixed set of
feature-functions Fj for log-linear training even though the training examples are
not fixed-length.
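The summation over positions can be sketched as follows. The two low-level feature-functions and the START placeholder used for the tag before the first word are illustrative assumptions, and positions are 0-indexed here whereas the text counts from 1.

```python
START = "START"  # assumed placeholder for the tag preceding the first word

def f_noun_capitalized(y_prev, y_cur, words, i):
    """Hypothetical f_j: current tag is NOUN and current word is capitalized."""
    return 1.0 if y_cur == "NOUN" and words[i][:1].isupper() else 0.0

def f_det_then_noun(y_prev, y_cur, words, i):
    """Hypothetical f_j: previous tag is DET and current tag is NOUN."""
    return 1.0 if y_prev == "DET" and y_cur == "NOUN" else 0.0

def sentence_feature(f, words, tags):
    """F_j(x, y) = sum over positions i of f_j(y_{i-1}, y_i, x, i)."""
    total = 0.0
    for i in range(len(words)):
        y_prev = tags[i - 1] if i > 0 else START
        total += f(y_prev, tags[i], words, i)
    return total

short = sentence_feature(f_det_then_noun, ["the", "dog"], ["DET", "NOUN"])
words = ["the", "dog", "saw", "a", "Cat"]
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
longer = sentence_feature(f_det_then_noun, words, tags)
caps = sentence_feature(f_noun_capitalized, words, tags)
```

Both sentences are described by the same fixed set of feature-functions even though their lengths differ; only the summed values change.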
Training a CRF means finding the weight vector w that gives the best possible
prediction
\[ \bar{y}^* = \operatorname{argmax}_{\bar{y}} \, p(\bar{y} \mid \bar{x}; w) \tag{2} \]
for each training example x̄. However, before we can talk about training there are
two major inference problems to solve. First, how can we do the argmax compu-
tation in Equation 2 efficiently, for any x̄ and any weights w? This computation is
difficult since the number of alternative tag sequences ȳ is exponential.
Second, given any $\bar{x}$ and $\bar{y}$ we want to evaluate
\[ p(\bar{y} \mid \bar{x}; w) = \frac{1}{Z(\bar{x}, w)} \exp \sum_j w_j F_j(\bar{x}, \bar{y}). \]

The difficulty here is that the denominator again ranges over all tag sequences:
$Z(\bar{x}, w) = \sum_{\bar{y}'} \exp \sum_j w_j F_j(\bar{x}, \bar{y}')$. For both
these tasks, we will need tricks to account for all possible $\bar{y}$ efficiently,
without enumerating all possible $\bar{y}$. The fact that feature-functions can
depend on at most two tags, which must be adjacent, is what makes these tricks
possible.

3 Inference algorithms for linear-chain CRFs

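Let’s solve the first problem above efficiently. First note that we can ignore the
denominator, and also the exponential inside the numerator: the denominator does
not depend on $\bar{y}$, and the exponential is monotonically increasing. We want
to compute
\[ \bar{y}^* = \operatorname{argmax}_{\bar{y}} \, p(\bar{y} \mid \bar{x}; w) = \operatorname{argmax}_{\bar{y}} \sum_j w_j F_j(\bar{x}, \bar{y}). \]

Use the definition of $F_j$ to get

\[ \bar{y}^* = \operatorname{argmax}_{\bar{y}} \sum_j w_j \sum_i f_j(y_{i-1}, y_i, \bar{x}, i) = \operatorname{argmax}_{\bar{y}} \sum_i g_i(y_{i-1}, y_i) \]

where $g_i(y_{i-1}, y_i) = \sum_j w_j f_j(y_{i-1}, y_i, \bar{x}, i)$. Note that the
$\bar{x}$ and i arguments of $f_j$ have been dropped in the definition of $g_i$.
Each $g_i$ is a different function for each i, and depends on w as well as on
$\bar{x}$ and i.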
Remember that each entry of the ȳ vector is one of a finite set of tags. Given
x̄, w, and i the function gi can be represented as an m by m matrix where m is the
cardinality of the set of tags.
Let v range over the tags. Define U (k, v) to be the score of the best sequence
of tags from 1 to k, where tag k is required to be v. This is a maximization over
k − 1 tags because tag number k is fixed to have value v. Formally,

\[ U(k, v) = \max_{y_1, \ldots, y_{k-1}} \Big[ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \Big]. \]

Now we can write down a recurrence that lets us compute U(k, v) efficiently:

\[ U(k, v) = \max_{y_{k-1}} \, [U(k-1, y_{k-1}) + g_k(y_{k-1}, v)]. \]

With this recurrence we can compute $\bar{y}^*$ for any $\bar{x}$ in $O(m^2 n)$ time, where n is
the length of x̄ and m is the cardinality of the set of tags. This algorithm is
a variation of the Viterbi algorithm for computing the highest-probability path
through a hidden Markov model. The base case of the recurrence is an exercise
for the reader.
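The recurrence can be sketched in Python as below. The data layout is an assumption made for illustration: `g_first[v]` holds $g_1(\text{START}, v)$ and `g_rest[k][u][v]` holds $g_{k+2}(u, v)$, with tags encoded as integers 0 to m−1. The base case used here, $U(1, v) = g_1(\text{START}, v)$, is one natural choice rather than something the text specifies, and no STOP transition is scored.

```python
def viterbi(g_first, g_rest):
    """Return (best tag sequence, its score) for a linear chain.

    g_first[v]      = g_1(START, v)
    g_rest[k][u][v] = g_{k+2}(u, v)
    """
    m = len(g_first)
    U = [list(g_first)]          # assumed base case: U(1, v) = g_1(START, v)
    back = []                    # back[k][v] = best previous tag for tag v
    for g in g_rest:
        prev = U[-1]
        row, ptr = [], []
        for v in range(m):
            best_u = max(range(m), key=lambda u: prev[u] + g[u][v])
            row.append(prev[best_u] + g[best_u][v])
            ptr.append(best_u)
        U.append(row)
        back.append(ptr)
    # Pick the best final tag, then follow the back-pointers.
    v = max(range(m), key=lambda t: U[-1][t])
    best_score = U[-1][v]
    path = [v]
    for ptr in reversed(back):
        v = ptr[v]
        path.append(v)
    path.reverse()
    return path, best_score

g_first = [1.0, 0.0]
g_rest = [[[0.0, 2.0], [1.0, 0.0]], [[0.5, 0.0], [0.0, 3.0]]]
path, best = viterbi(g_first, g_rest)
```

Each position does a maximization over m previous tags for each of m current tags, which is where the $O(m^2 n)$ running time comes from.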
The second fundamental computational problem is to compute the denomina-
tor of the probability formula. This denominator is called the partition function:
\[ Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_j w_j F_j(\bar{x}, \bar{y}). \]

Remember that
\[ \sum_j w_j F_j(\bar{x}, \bar{y}) = \sum_i g_i(y_{i-1}, y_i), \]

where i ranges over all positions 1 to n of the input sequence $\bar{x}$, so we can write
\[ Z(\bar{x}, w) = \sum_{\bar{y}} \exp \sum_i g_i(y_{i-1}, y_i) = \sum_{\bar{y}} \prod_i \exp g_i(y_{i-1}, y_i). \]

We can compute the expression above efficiently by matrix multiplication. For
t = 1 to t = n + 1 let $M_t$ be a square m by m matrix such that
$M_t(u, v) = \exp g_t(u, v)$ for any two tag values u and v. Note that $M_2$ to
$M_n$ are fully defined, while $M_1(u, v)$ is defined only for u = START and
$M_{n+1}(u, v)$ is defined only for v = STOP.
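Consider multiplying $M_1$ and $M_2$. We have [1]
\[ M_{12}(\text{START}, w) = \sum_v M_1(\text{START}, v) M_2(v, w) = \sum_v [\exp g_1(\text{START}, v)][\exp g_2(v, w)]. \]

Similarly,
\[
\begin{aligned}
M_{123}(\text{START}, x) &= \sum_w M_{12}(\text{START}, w) M_3(w, x) \\
&= \sum_w \Big[ \sum_v M_1(\text{START}, v) M_2(v, w) \Big] M_3(w, x) \\
&= \sum_{v,w} M_1(\text{START}, v) M_2(v, w) M_3(w, x)
\end{aligned}
\]

and so on. Consider the ⟨START, STOP⟩ entry of the entire product $M_{123 \ldots n+1}$. This
is
\[ M_{123 \ldots n+1}(\text{START}, \text{STOP}) = T = \sum_{\bar{y}} M_1(\text{START}, y_1) M_2(y_1, y_2) \cdots M_{n+1}(y_n, \text{STOP}). \]

We have
\[
\begin{aligned}
T &= \sum_{\bar{y}} \exp[g_1(\text{START}, y_1)] \exp[g_2(y_1, y_2)] \cdots \exp[g_{n+1}(y_n, \text{STOP})] \\
&= \sum_{\bar{y}} \prod_i \exp[g_i(y_{i-1}, y_i)]
\end{aligned}
\]

which is exactly what we need.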

Computational complexity: Each matrix is m by m where m is the cardinality
of the tag set. Each matrix multiplication requires O(m^3) time, so the total time
is O(nm^3). We have reduced a sum over an exponential number of alternatives
to a polynomial-time computation. However, even though polynomial, this is
worse than the time needed by the Viterbi algorithm. An interesting question is
whether computing the partition function is harder in some fundamental way than
computing the most likely label sequence.
[1] Note on notation: u, v, w, and x here are all single tags; w is not a weight
and x is not a component of $\bar{x}$.
The matrix multiplication method for computing the partition function is called
a forward-backward algorithm. A similar algorithm can be used to compute any
function of the form $\sum_{\bar{y}} h_i(y_{i-1}, y_i)$.
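A minimal sketch of the matrix-product computation of the partition function follows, checked against brute-force enumeration on a tiny chain. The data layout is assumed for illustration: `g_first[v]` is $g_1(\text{START}, v)$, `g_rest[k][u][v]` is $g_{k+2}(u, v)$, and `g_stop[v]` is $g_{n+1}(v, \text{STOP})$. Because only the START row of the product is needed, this sketch propagates a row vector, so each step is a vector-matrix product costing $O(m^2)$ rather than a full matrix-matrix product.

```python
import itertools
import math

def partition_function(g_first, g_rest, g_stop):
    """Z = sum over tag sequences of prod_i exp g_i(y_{i-1}, y_i)."""
    m = len(g_first)
    # Row vector M_1(START, v) = exp g_1(START, v).
    row = [math.exp(s) for s in g_first]
    for g in g_rest:
        # Multiply the row vector by M_t, where M_t(u, v) = exp g_t(u, v).
        row = [sum(row[u] * math.exp(g[u][v]) for u in range(m))
               for v in range(m)]
    # Final column M_{n+1}(v, STOP) = exp g_{n+1}(v, STOP).
    return sum(row[v] * math.exp(g_stop[v]) for v in range(m))

def brute_force_z(g_first, g_rest, g_stop):
    """Same quantity by enumerating all m**n tag sequences (tiny inputs only)."""
    m, n = len(g_first), len(g_rest) + 1
    total = 0.0
    for ys in itertools.product(range(m), repeat=n):
        s = g_first[ys[0]] + g_stop[ys[-1]]
        s += sum(g_rest[k][ys[k]][ys[k + 1]] for k in range(n - 1))
        total += math.exp(s)
    return total

g_first = [1.0, 0.0]
g_rest = [[[0.0, 2.0], [1.0, 0.0]], [[0.5, 0.0], [0.0, 3.0]]]
g_stop = [0.0, 0.5]
z_fast = partition_function(g_first, g_rest, g_stop)
z_slow = brute_force_z(g_first, g_rest, g_stop)
```

For long sequences the products of exponentials can overflow or underflow, so a practical implementation would likely work with logarithms; that refinement is omitted here.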
Some extensions to the basic linear-chain CRF are not difficult. The output ȳ
must be a sequence, but the input x̄ is treated as a unit, so it does not have to be
a sequence. It could be an image for example, or a collection of separate items,
e.g. telephone customers.
In general, what is fundamental for making a log-linear model tractable is that
the set of possible labels ȳ should either be small, or have some structure. In order
to have structure, ȳ should be made up of parts (e.g. tags) such that only small
subsets of parts interact directly with each other. Here, every interacting subset
of tags is a pair. Often, the real-world reason interacting subsets are small is that
interactions between parts are short-distance.

4 Training by gradient ascent

The learning task for a log-linear model is to choose values for the weights (also
called parameters). Given a set of training examples, our goal is to choose
parameter values wj to maximize the conditional probability of the training examples.
To learn values for the parameters wj by gradient-following we need to be
able to evaluate the objective function and its gradient. The standard objective
function is the conditional log-likelihood (CLL). We want to maximize CLL, so
we do gradient ascent as opposed to descent.
The objective function used for training is not the same one that we really want
to maximize on test data. Instead of maximizing CLL we could maximize (all on
training data): yes/no accuracy of the entire predicted ȳ, or pointwise conditional
log likelihood, or we could minimize mean-squared error if tags are numerical, or
some other measure of distance between true and predicted tags.
A fundamental issue is whether we want to maximize a pointwise objective.
For a long sequence, we may have a vanishing chance of predicting the entire
tag sequence correctly. The single sequence with highest probability may be very
different from the most probable tag at each position.
For online gradient ascent (also called stochastic gradient ascent) we update
parameters based on single training examples. Therefore, we evaluate the partial
derivative of CLL for a single training example, for each $w_j$. (There is one weight
for each feature-function, so we use j to range over weights.) Start with
\[
\begin{aligned}
\frac{\partial}{\partial w_j} \log p(y \mid x; w)
&= F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w) \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \frac{\partial}{\partial w_j} \exp \sum_{j'} w_{j'} F_{j'}(x, y') \\
&= F_j(x, y) - \frac{1}{Z(x, w)} \sum_{y'} \Big[ \exp \sum_{j'} w_{j'} F_{j'}(x, y') \Big] F_j(x, y') \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \frac{\exp \sum_{j'} w_{j'} F_{j'}(x, y')}{\sum_{y''} \exp \sum_{j''} w_{j''} F_{j''}(x, y'')} \\
&= F_j(x, y) - \sum_{y'} F_j(x, y') \, p(y' \mid x; w) \\
&= F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')].
\end{aligned}
\]

In words, the partial derivative with respect to weight number j is the value of
feature-function j for the true training label y, minus the expected value of the
feature-function over all possible labels y', weighted by their probabilities
under the model. Note that this derivation allows feature-functions to be
real-valued, not just zero or one.
The gradient of the CLL given the entire training set T is the sum of the gra-
dients for each training example. At the global maximum this entire gradient is
zero, so we have
\[ \sum_{\langle x, y \rangle \in T} F_j(x, y) = \sum_{\langle x, \cdot \rangle \in T} E_{y \sim p(y \mid x; w)}[F_j(x, y)]. \]

This equality is true only for the whole training set, not for training examples
individually.
The left side above is the total value of feature-function j on the whole training
set. The right side is the total value of feature-function j predicted by the model.
For each feature-function, the trained model will spread out over all labels of all
examples as much mass as the training data has just on those examples for which
the feature-function is nonzero.
For any particular application of log-linear modeling, we have to write code to
evaluate numerically the symbolic derivatives. Then we can invoke an optimiza-
tion routine to find the optimal parameter values. There are two ways that we can
verify correctness. First, check for each feature-function Fj that
\[ \sum_{\langle x, y \rangle \in T} F_j(x, y) = \sum_{\langle x, \cdot \rangle \in T} \sum_{y'} p(y' \mid x; w) F_j(x, y'). \]
Second, check that each partial derivative is correct by comparing it numerically
to the value obtained by finite differencing of the CLL objective function.
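The second check can be sketched as follows for a small multiclass model. The feature map (one copy of the attribute vector per label), the example, and the weights are hypothetical; central differences with step `eps` are one common choice of finite-differencing scheme.

```python
import math

LABELS = [0, 1, 2]

def features(x, y):
    """Hypothetical F(x, y): the attribute vector x placed in the block for y."""
    vec = [0.0] * (len(x) * len(LABELS))
    for k, xk in enumerate(x):
        vec[y * len(x) + k] = xk
    return vec

def log_prob(x, y, w):
    """log p(y | x; w), with the usual max-shift for numerical stability."""
    scores = [sum(wj * fj for wj, fj in zip(w, features(x, lab)))
              for lab in LABELS]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[LABELS.index(y)] - log_z

def gradient(x, y, w):
    """Analytic gradient: F_j(x, y) - E_{y'}[F_j(x, y')] for every j."""
    grad = features(x, y)
    for lab in LABELS:
        p = math.exp(log_prob(x, lab, w))
        for j, fj in enumerate(features(x, lab)):
            grad[j] -= p * fj
    return grad

def numeric_gradient(x, y, w, eps=1e-6):
    """Central finite differences of the per-example log-likelihood."""
    out = []
    for j in range(len(w)):
        hi = w[:j] + [w[j] + eps] + w[j + 1:]
        lo = w[:j] + [w[j] - eps] + w[j + 1:]
        out.append((log_prob(x, y, hi) - log_prob(x, y, lo)) / (2 * eps))
    return out

x, y = [1.0, -2.0], 1
w = [0.1, -0.3, 0.2, 0.4, -0.5, 0.0]
analytic = gradient(x, y, w)
numeric = numeric_gradient(x, y, w)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

If the analytic derivative code is correct, `max_gap` should be tiny; a gap of the same order as the gradient entries themselves usually points to a bug in one of the two functions.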
Suppose that every feature-function $F_j$ is the product of an attribute value
$a_j(x)$ that is a function of x only, and a label function $b_j(y)$ that is a function
of y only, i.e. $F_j(x, y) = a_j(x) b_j(y)$. Then
$\frac{\partial}{\partial w_j} \log p(y \mid x; w) = 0$ if $a_j(x) = 0$,
regardless of y. This implies that given example x with online gradient ascent,
the weight for a feature-function must be updated only for feature-functions for
which the corresponding attribute aj (x) is non-zero, which can be a great saving
of computational effort. In other words, the entire gradient with respect to a sin-
gle training example is typically a sparse vector, just like the vector of all Fj (x, y)
values is sparse for a single training example. A similar savings is possible when
computing the gradient with respect to the whole training set. Note that the gra-
dient with respect to the whole training set is a single vector that is the sum of
one vector for each training example. Typically these vectors being summed are
sparse, but their sum is not.
When maximizing the conditional log-likelihood by online gradient ascent,
the update to weight wj is

\[ w_j := w_j + \alpha \big( F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] \big) \tag{3} \]

where α is a learning rate parameter.
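Update rule (3) can be sketched as a small training loop. The toy data set, the two-label feature map, the learning rate of 0.5, and the 50 passes over the data are arbitrary illustrative choices, not recommendations.

```python
import math
import random

LABELS = [0, 1]

def features(x, y):
    """Hypothetical F(x, y): the attribute vector x placed in the block for y."""
    vec = [0.0] * (len(x) * len(LABELS))
    vec[y * len(x):(y + 1) * len(x)] = x
    return vec

def probabilities(x, w):
    """p(y | x; w) for each label, computed with a max-shifted softmax."""
    scores = [sum(wj * fj for wj, fj in zip(w, features(x, lab)))
              for lab in LABELS]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sgd_epoch(data, w, alpha):
    """One pass of online gradient ascent; w is updated in place."""
    for x, y in data:
        probs = probabilities(x, w)
        observed = features(x, y)
        expected = [0.0] * len(w)
        for lab, p in zip(LABELS, probs):
            for j, fj in enumerate(features(x, lab)):
                expected[j] += p * fj
        for j in range(len(w)):
            w[j] += alpha * (observed[j] - expected[j])   # update rule (3)
    return w

def cll(data, w):
    """Conditional log-likelihood of the data set."""
    return sum(math.log(probabilities(x, w)[y]) for x, y in data)

data = [([1.0, 0.0], 0), ([0.9, 0.2], 0), ([0.1, 1.0], 1), ([0.0, 0.8], 1)]
w = [0.0] * 4
before = cll(data, w)
rng = random.Random(0)          # fixed seed so the run is repeatable
for _ in range(50):
    rng.shuffle(data)
    sgd_epoch(data, w, alpha=0.5)
after = cll(data, w)
```

Starting from all-zero weights the model assigns probability 1/2 to each label, and the conditional log-likelihood rises as the updates accumulate.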

5 Efficient CRF training

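The partial derivative for stochastic gradient training of a CRF model is
\[
\begin{aligned}
\frac{\partial}{\partial w_j} \log p(\bar{y} \mid \bar{x}; w)
&= F_j(\bar{x}, \bar{y}) - \sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \, p(\bar{y}' \mid \bar{x}; w) \\
&= F_j(\bar{x}, \bar{y}) - \sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \frac{\exp \sum_{j'} w_{j'} F_{j'}(\bar{x}, \bar{y}')}{Z(\bar{x}, w)}.
\end{aligned}
\]

The first term $F_j(\bar{x}, \bar{y})$ is fast to compute because $\bar{x}$ and its
training label $\bar{y}$ are fixed. Section 3 shows how to compute $Z(\bar{x}, w)$
efficiently. The remaining difficulty is to compute
$\sum_{\bar{y}'} F_j(\bar{x}, \bar{y}') \exp \sum_{j'} w_{j'} F_{j'}(\bar{x}, \bar{y}')$.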
If the set of alternative labels {y} is large, then it is computationally expensive
to evaluate the expectation $E_{y' \sim p(y' \mid x; w)}[F_j(x, y')]$. We can find
approximations to this expectation by finding approximations to the distribution
p(y|x; w). In this section we discuss three different approximations. The
corresponding training methods are called the Collins perceptron, Gibbs sampling,
and contrastive divergence.
The Collins perceptron. Suppose we place all the probability mass on the
most likely y value, i.e. we use the approximation $\hat{p}(y \mid x; w) = I(y = \hat{y})$
where $\hat{y} = \operatorname{argmax}_y \, p(y \mid x; w)$ as before. Then the update
rule (3) simplifies to the following rule:

\[ w_j := w_j + \alpha F_j(x, y) \]
\[ w_j := w_j - \alpha F_j(x, \hat{y}). \]

Given a training example x, the label $\hat{y}$ can be thought of as an “impostor”
compared to the genuine label y. The concept to be learned is those vectors of
feature-function values $\langle F_1(x, y), \ldots \rangle$ that correspond to correct
$\langle x, y \rangle$ pairs. The vector $\langle F_1(x, y), \ldots \rangle$, where
$\langle x, y \rangle$ is a training example, is a positive example of this
concept. The vector $\langle F_1(x, \hat{y}), \ldots \rangle$ is a negative example of
the same concept. Hence, the two updates above are perceptron updates: the first
for a positive example and the second for a negative example.
The perceptron method causes a net increase in wj for features Fj whose value
is higher for y than for ŷ. It thus modifies the weights to directly increase the
probability of y compared to the probability of ŷ.
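The two perceptron updates can be sketched for a single word as below. The three indicator feature-functions and the two-tag label set are hypothetical, and a plain classification example is used instead of a tag sequence to keep the sketch short; for a sequence, the prediction step would be the Viterbi computation from Section 3.

```python
LABELS = ["NOUN", "VERB"]

def features(x, y):
    """Hypothetical indicator feature-functions for a word x and a tag y."""
    return [
        1.0 if x[:1].isupper() and y == "NOUN" else 0.0,
        1.0 if x.endswith("s") and y == "VERB" else 0.0,
        1.0 if x.endswith("s") and y == "NOUN" else 0.0,
    ]

def predict(x, w):
    """The 'impostor' candidate: the label with the highest linear score."""
    return max(LABELS,
               key=lambda y: sum(wj * fj for wj, fj in zip(w, features(x, y))))

def perceptron_update(x, y, w, alpha=1.0):
    """Apply both updates in place and return the prediction that was used."""
    y_hat = predict(x, w)
    f_true, f_imp = features(x, y), features(x, y_hat)
    for j in range(len(w)):
        w[j] += alpha * f_true[j]   # positive example: the genuine label y
        w[j] -= alpha * f_imp[j]    # negative example: the impostor y_hat
    return y_hat

w = [0.0, 0.0, 0.0]
first_guess = perceptron_update("runs", "VERB", w)
second_guess = predict("runs", w)
```

When the prediction already equals the true label, the two updates cancel and the weights are unchanged, so the weights move only on mistakes.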
Gibbs sampling. Computing the most likely label ŷ does not require comput-
ing the partition function Z(x, w). Nevertheless, sometimes identifying ŷ is still
too difficult. In this case one option for training is to estimate Ey∼p(y|x;w) [Fj (x, y)]
approximately by sampling y values from the distribution p(y|x; w).
A method known as Gibbs sampling can be used to find the needed samples of
y. Gibbs sampling is the following algorithm. Suppose the entire label y can be
written as a set of parts y = {y1 , . . . , yn }. For example, if y is the part-of-speech
sequence that is the label of an input sentence x, then each yi can be the tag of one
word in the sentence. Suppose the conditional distribution

\[ p(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n; w) \]

can be evaluated numerically in an efficient way for every i. Then we can get a
stream of samples by the following process:

(1) Select an arbitrary initial guess $\langle y_1, \ldots, y_n \rangle$.

(2) Draw $y'_1$ according to $p(y_1 \mid x, y_2, \ldots, y_n; w)$;

• draw $y'_2$ according to $p(y_2 \mid x, y'_1, y_3, \ldots, y_n; w)$;

• draw $y'_3$ according to $p(y_3 \mid x, y'_1, y'_2, y_4, \ldots, y_n; w)$;

• and so on until $y'_n$.

(3) Set $\{y_1, \ldots, y_n\} := \{y'_1, \ldots, y'_n\}$ and repeat from (2).

It can be proved that if Step (2) is repeated an infinite number of times, then
the distribution of $y = \{y'_1, \ldots, y'_n\}$ converges to the true distribution
p(y|x; w) regardless of the starting point. In practice, we do Step (2) some number
of times (say 1000) to come close to convergence, and then take several samples
$y = \{y'_1, \ldots, y'_n\}$. Between consecutive samples we repeat Step (2) a smaller
number of times (say 100) to make the samples almost independent of each other.
Using Gibbs sampling to estimate the expectation $E_{y \sim p(y \mid x; w)}[F_j(x, y)]$
is computationally intensive because the accuracy of the estimate only increases
very slowly as the number s of samples increases. Specifically, the variance of the
estimate decreases in proportion to 1/s, so its standard error shrinks only as
$1/\sqrt{s}$.
Contrastive divergence. A third training option is to choose a single $y^*$ value
that is somehow similar to the training label y, but also has high probability
according to p(y|x; w). Compared to the “impostor” $\hat{y}$, the “evil twin” $y^*$
will have lower probability, but will be more similar to y.
The idea of contrastive divergence is to obtain a single value
$y^* = \langle y^*_1, \ldots, y^*_n \rangle$ by doing only a few iterations of Gibbs
sampling (often only one), but starting at the training label y instead of at a
random guess.
How to do Gibbs sampling. Gibbs sampling relies on drawing samples
efficiently from the conditional distribution of one part given all the others.
Let $y_{-i}$ be an abbreviation for the set $\{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$.
We need to draw values according to the distribution $p(y_i \mid x, y_{-i}; w)$. The
straightforward way to do this is to evaluate $p(v \mid x, y_{-i}; w)$ numerically for
each possible value v of $y_i$. In typical applications the number of alternative
values v is small, so this approach is feasible, if $p(v \mid x, y_{-i}; w)$ can be
computed.
Suppose the entire conditional distribution is a Markov random field
\[ p(y \mid x; w) \propto \prod_{m=1}^{M} \phi_m(y^m \mid x; w) \tag{4} \]

where each $\phi_m$ is a potential function that depends on just a subset $y^m$ of
components of y. Linear-chain conditional random fields are a special case of
Equation (4). In this case
\[ p(y_i \mid x, y_{-i}; w) \propto \prod_{m \in C} \phi_m(y^m \mid x; w) \tag{5} \]

where C indexes those potential functions $\phi_m$ whose subsets $y^m$ include the
part $y_i$. To compute $p(y_i \mid x, y_{-i}; w)$ we evaluate the product (5) for all
values of $y_i$, with the given fixed values of
$y_{-i} = \{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. We then normalize using
\[ Z(x, y_{-i}; w) = \sum_v \prod_{m \in C} \phi_m(y^m \mid x; w) \]

where v ranges over the possible values of $y_i$.
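For a linear chain, the only potentials that involve position i are the ones linking it to its left and right neighbors, so the product (5) has at most two factors. The following sketch uses that fact to implement one sweep of Gibbs sampling. The layout is assumed for illustration: `g_first[v]` is the score of tag v at the first position, `g_rest[k][u][v]` is the score of the tag pair (u, v) at positions k and k+1 (0-indexed), potentials are the exponentials of these scores, and no STOP transition is modeled. The brute-force marginal is included only to sanity-check the sampler on a tiny chain.

```python
import itertools
import math
import random

def conditional(i, y, g_first, g_rest):
    """p(y_i = v | y_{-i}) from only the potentials that involve position i."""
    m, n = len(g_first), len(y)
    weights = []
    for v in range(m):
        s = g_first[v] if i == 0 else g_rest[i - 1][y[i - 1]][v]
        if i + 1 < n:
            s += g_rest[i][v][y[i + 1]]
        weights.append(math.exp(s))
    z = sum(weights)                  # local normalizer Z(x, y_{-i}; w)
    return [wt / z for wt in weights]

def gibbs_sweep(y, g_first, g_rest, rng):
    """Resample every position once, left to right, in place."""
    for i in range(len(y)):
        probs = conditional(i, y, g_first, g_rest)
        y[i] = rng.choices(range(len(probs)), weights=probs)[0]
    return y

def exact_marginal_first(g_first, g_rest):
    """Exact p(y_1 = v) by enumeration (feasible only for tiny chains)."""
    m, n = len(g_first), len(g_rest) + 1
    mass = [0.0] * m
    for ys in itertools.product(range(m), repeat=n):
        s = g_first[ys[0]]
        s += sum(g_rest[k][ys[k]][ys[k + 1]] for k in range(n - 1))
        mass[ys[0]] += math.exp(s)
    z = sum(mass)
    return [p / z for p in mass]

rng = random.Random(0)
g_first = [0.5, 0.0]
g_rest = [[[0.3, 0.0], [0.0, 0.6]], [[0.0, 0.4], [0.2, 0.0]]]

y = [0, 0, 0]                         # arbitrary initial guess
for _ in range(200):                  # burn-in sweeps
    gibbs_sweep(y, g_first, g_rest, rng)
counts = [0, 0]
for _ in range(20000):
    gibbs_sweep(y, g_first, g_rest, rng)
    counts[y[0]] += 1
estimate = [c / 20000 for c in counts]
exact = exact_marginal_first(g_first, g_rest)

# Contrastive-divergence style: one sweep started at a (hypothetical) training label.
y_cd = gibbs_sweep([1, 1, 0], g_first, g_rest, rng)
```

The estimated marginal should be close to the exact one, and the last line shows how the same sweep yields the contrastive-divergence sample when it starts from the training label rather than from an arbitrary guess.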
