Conditional Random Fields: An Introduction
Hanna M. Wallach
University of Pennsylvania CIS Technical Report MS-CIS-04-21

1 Labeling Sequential Data
Consider, for example, the natural language processing task of labeling the words in a sentence with their corresponding part-of-speech (POS) tags, as in the following annotated sentence:

(1) [PRP He] [VBZ reckons] [DT the] [JJ current] [NN account] [NN deficit] [MD will] [VB narrow] [TO to] [RB only] [# #] [CD 1.8] [CD billion] [IN in] [NNP September] [. .]
Labeling sentences in this way is a useful preprocessing step for higher-level natural language processing tasks: POS tags augment the information contained within words alone by explicitly indicating some of the structure inherent in language.
One of the most common methods for performing such labeling and segmentation tasks is to employ hidden Markov models [13] (HMMs) or probabilistic finite-state automata to identify the most likely sequence of labels for the words in any given sentence. HMMs are a form of generative model that defines a joint probability distribution p(X, Y), where X and Y are random variables ranging over observation sequences and their corresponding label sequences, respectively. In order to define a joint distribution of this nature, generative models must enumerate all possible observation sequences, a task which, for most domains, is intractable unless observation elements are represented as isolated units, independent of the other elements in an observation sequence. More precisely, the observation element at any given instant in time may directly depend only on the state, or label, at that time. This is an appropriate assumption for a few simple data sets; however, most real-world observation sequences are best represented in terms of multiple interacting features and long-range dependencies between observation elements.
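
To make these independence assumptions concrete, the following sketch evaluates the standard first-order HMM factorization p(x, y) = Π_i p(y_i | y_{i−1}) p(x_i | y_i) for a prefix of sentence (1). The probability tables are made-up, illustrative values, not estimates from any data; the point is that each word x_i enters only through p(x_i | y_i), i.e. through the label at the same position.

# Toy HMM joint probability: p(x, y) = prod_i p(y_i | y_{i-1}) * p(x_i | y_i).
# All probability values below are illustrative only.

transition = {  # p(current label | previous label); "START" marks the sequence start
    ("START", "DT"): 0.6, ("START", "PRP"): 0.4,
    ("DT", "NN"): 0.9, ("DT", "JJ"): 0.1,
    ("PRP", "VBZ"): 0.8, ("PRP", "NN"): 0.2,
    ("VBZ", "DT"): 0.7, ("VBZ", "NN"): 0.3,
    ("JJ", "NN"): 1.0, ("NN", "NN"): 0.5, ("NN", "VBZ"): 0.5,
}
emission = {  # p(word | label)
    ("PRP", "he"): 0.5, ("VBZ", "reckons"): 0.1,
    ("DT", "the"): 0.7, ("JJ", "current"): 0.05,
    ("NN", "account"): 0.01, ("NN", "deficit"): 0.01,
}

def hmm_joint_probability(words, labels):
    """p(x, y) under the first-order HMM factorization."""
    prob = 1.0
    previous = "START"
    for word, label in zip(words, labels):
        prob *= transition.get((previous, label), 0.0)   # p(y_i | y_{i-1})
        prob *= emission.get((label, word), 0.0)         # p(x_i | y_i): x_i depends only on y_i
        previous = label
    return prob

print(hmm_joint_probability(
    ["he", "reckons", "the", "current", "account", "deficit"],
    ["PRP", "VBZ", "DT", "JJ", "NN", "NN"]))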
Conditional random fields [8] (CRFs) are a probabilistic framework for labeling and segmenting sequential data, based on a conditional approach: rather than modeling the joint distribution p(X, Y), a CRF models the conditional distribution p(Y|X) directly. A CRF is a form of undirected graphical model that defines a single log-linear distribution over label sequences given a particular observation sequence. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem [8], a weakness exhibited by maximum entropy Markov models [9] (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks [8, 11, 15].
2 Undirected Graphical Models

In what follows, X is a random variable over observation sequences to be labeled and Y is a random variable over the corresponding label sequences; the components Yi of Y are assumed to form a simple first-order chain, as illustrated in Figure 1.

Figure 1: Graphical structure of a chain-structured CRF: the label variables Y1, Y2, . . . , Yn−1, Yn form a first-order chain, conditioned on the observation sequence X = X1, . . . , Xn−1, Xn.
It is worth noting that an isolated potential function does not have a direct probabilistic interpretation, but instead represents constraints on the configurations of the random variables on which the function is defined. This in turn affects the probability of global configurations: a global configuration with a high probability is likely to have satisfied more of these constraints than a global configuration with a low probability.
Footnote 1: The product of a set of strictly positive, real-valued functions is not guaranteed to satisfy the axioms of probability. A normalization factor is therefore introduced to ensure that the product of potential functions is a valid probability distribution over the random variables represented by the vertices of G.
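
As a minimal sketch of how a normalized product of potential functions defines a distribution, the following code assigns a strictly positive potential to each pair of adjacent labels in a short chain, sums the product of potentials over every configuration to obtain the normalization factor, and then scores two global configurations. The chain length, binary labels, and potential values are arbitrary assumptions made only for illustration.

from itertools import product

# A minimal sketch of an undirected chain model over binary labels.
# psi is a strictly positive potential on each edge (Y_i, Y_{i+1});
# the values are arbitrary and encode a preference for equal neighbours.
def psi(y_prev, y_curr):
    return 2.0 if y_prev == y_curr else 1.0

labels = [0, 1]
chain_length = 4

def unnormalized_score(y):
    score = 1.0
    for i in range(1, len(y)):
        score *= psi(y[i - 1], y[i])   # product of potentials over the cliques (edges)
    return score

# Normalization factor Z: sum of the product of potentials over all configurations.
Z = sum(unnormalized_score(y) for y in product(labels, repeat=chain_length))

# Probability of a global configuration; configurations satisfying more of the
# "constraints" (equal neighbours here) receive higher probability.
print(unnormalized_score((0, 0, 0, 0)) / Z)   # high-probability configuration
print(unnormalized_score((0, 1, 0, 1)) / Z)   # low-probability configuration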
3 Conditional Random Fields
Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

\exp\left(\sum_j \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k\, s_k(y_i, x, i)\right), \qquad (2)

where t_j(y_{i−1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i−1 in the label sequence, s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence, and λ_j and µ_k are parameters to be estimated from training data.
Each feature function takes on the value of a real-valued observation feature b(x, i) if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

t_j(y_{i-1}, y_i, x, i) =
  \begin{cases}
    b(x, i) & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\
    0       & \text{otherwise.}
  \end{cases}
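
The following sketch shows one way such feature functions might look in code; the observation feature b (a capitalization test), the particular tags, and the weights λ and µ are illustrative assumptions rather than values from the text. It evaluates the exponential factor in (2) at a single position i.

import math

# Illustrative observation feature: fires when the word at position i is capitalized.
def b(x, i):
    return 1.0 if x[i][0].isupper() else 0.0

# Transition feature function: takes the value b(x, i) only for a particular label pair.
def t_in_nnp(y_prev, y_curr, x, i):
    return b(x, i) if (y_prev == "IN" and y_curr == "NNP") else 0.0

# State feature function: takes the value b(x, i) only for a particular current label.
def s_nnp(y_curr, x, i):
    return b(x, i) if y_curr == "NNP" else 0.0

transition_features = [t_in_nnp]
state_features = [s_nnp]
lambdas = [1.5]   # one weight per transition feature (illustrative values)
mus = [0.8]       # one weight per state feature

def local_factor(y_prev, y_curr, x, i):
    """exp( sum_j lambda_j t_j(y_{i-1}, y_i, x, i) + sum_k mu_k s_k(y_i, x, i) )."""
    total = sum(l * t(y_prev, y_curr, x, i) for l, t in zip(lambdas, transition_features))
    total += sum(m * s(y_curr, x, i) for m, s in zip(mus, state_features))
    return math.exp(total)

x = ["narrow", "to", "only", "#", "1.8", "billion", "in", "September"]
print(local_factor("IN", "NNP", x, 7))   # position of "September": both features fire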
In the remainder of this report, notation is simplified by writing

s(y_i, x, i) = s(y_{i-1}, y_i, x, i)

and

F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i),

where each f_j(y_{i−1}, y_i, x, i) is either a state function s(y_{i−1}, y_i, x, i) or a transition function t_j(y_{i−1}, y_i, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left(\sum_j \lambda_j F_j(y, x)\right), \qquad (3)

where Z(x) is a normalization factor.
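
For very short sequences and small label sets, (3) can be evaluated by brute force, which makes the role of the global features F_j and of Z(x) explicit. The sketch below (with illustrative features, weights, and label set) computes Z(x) by enumerating every possible label sequence; it is a toy illustration only, since realistic inference uses the dynamic programming discussed later.

import math
from itertools import product

LABELS = ["IN", "NNP", "OTHER"]

# Two illustrative local feature functions f_j(y_{i-1}, y_i, x, i).
def f_capitalized_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_curr == "NNP" and x[i][0].isupper()) else 0.0

def f_in_to_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_prev == "IN" and y_curr == "NNP") else 0.0

features = [f_capitalized_nnp, f_in_to_nnp]
weights = [1.2, 0.7]   # lambda_j, illustrative values

def global_feature(f, y, x):
    """F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i), with a dummy start label."""
    return sum(f(y[i - 1] if i > 0 else "START", y[i], x, i) for i in range(len(y)))

def score(y, x):
    return math.exp(sum(w * global_feature(f, y, x) for w, f in zip(weights, features)))

def probability(y, x):
    """p(y | x, lambda) = exp(sum_j lambda_j F_j(y, x)) / Z(x), with Z(x) by enumeration."""
    Z = sum(score(y_prime, x) for y_prime in product(LABELS, repeat=len(x)))
    return score(y, x) / Z

x = ["in", "September"]
print(probability(("IN", "NNP"), x))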
4 Maximum Entropy
Assuming the training data {(x^(k), y^(k))} are independently and identically distributed, the product of (3) over all training sequences, as a function of the parameters λ, is known as the likelihood, denoted by p({y^(k)} | {x^(k)}, λ). Maximum likelihood training chooses parameter values such that the logarithm of the likelihood, known as the log-likelihood, is maximized. For a CRF, the log-likelihood is given by

L(\lambda) = \sum_k \left[ \log \frac{1}{Z(x^{(k)})} + \sum_j \lambda_j F_j(y^{(k)}, x^{(k)}) \right].
Differentiating this log-likelihood with respect to parameter λ_j gives

\frac{\partial L(\lambda)}{\partial \lambda_j} = E_{\tilde{p}(Y, X)}\big[F_j(Y, X)\big] - \sum_k E_{p(Y \mid x^{(k)}, \lambda)}\big[F_j(Y, x^{(k)})\big],

where p̃(Y, X) is the empirical distribution of the training data and E_p[·] denotes expectation with respect to distribution p. Note that setting this derivative to zero yields the maximum entropy model constraint: the expectation of each feature with respect to the model distribution is equal to its expected value under the empirical distribution of the training data.
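
The sketch below illustrates these quantities on a toy training set; the features, weights, and data are all illustrative assumptions, and the expectations are again computed by brute-force enumeration. It evaluates the log-likelihood above, and the gradient component for one parameter as the empirical count of F_j minus its expectation under the model distribution.

import math
from itertools import product

LABELS = ["IN", "NNP", "OTHER"]

# Illustrative local feature functions f_j(y_{i-1}, y_i, x, i).
def f_cap_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_curr == "NNP" and x[i][0].isupper()) else 0.0

def f_in_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_prev == "IN" and y_curr == "NNP") else 0.0

features = [f_cap_nnp, f_in_nnp]

def F(j, y, x):
    # Global feature F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i), with a dummy start label.
    return sum(features[j](y[i - 1] if i > 0 else "START", y[i], x, i) for i in range(len(y)))

def score(y, x, lam):
    return math.exp(sum(lam[j] * F(j, y, x) for j in range(len(features))))

def Z(x, lam):
    return sum(score(y, x, lam) for y in product(LABELS, repeat=len(x)))

# A toy training set of (observation sequence, label sequence) pairs.
data = [(["in", "September"], ("IN", "NNP")), (["only", "May"], ("OTHER", "NNP"))]

def log_likelihood(lam):
    # L(lambda) = sum_k [ log(1 / Z(x^(k))) + sum_j lambda_j F_j(y^(k), x^(k)) ]
    return sum(-math.log(Z(x, lam)) + sum(lam[j] * F(j, y, x) for j in range(len(features)))
               for x, y in data)

def gradient(j, lam):
    # Empirical count of F_j minus its expectation under the model distribution p(Y | x, lambda).
    total = 0.0
    for x, y in data:
        total += F(j, y, x)
        z = Z(x, lam)
        total -= sum(score(yp, x, lam) / z * F(j, yp, x) for yp in product(LABELS, repeat=len(x)))
    return total

lam = [0.0, 0.0]
print(log_likelihood(lam), gradient(0, lam))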
Letting Y be the alphabet from which labels are drawn, and y and y′ be labels drawn from this alphabet, we define a set of n + 1 matrices {M_i(x) | i = 1, . . . , n + 1}, where each M_i(x) is a |Y| × |Y| matrix with elements of the form

M_i(y', y \mid x) = \exp\left(\sum_j \lambda_j f_j(y', y, x, i)\right).
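
A sketch of this matrix construction is given below, using numpy and illustrative feature functions and weights. The label alphabet is augmented with special start and stop labels, and the normalization factor Z(x) is read off as the (start, stop) entry of the product M_1(x) M_2(x) · · · M_{n+1}(x); this identity for Z(x) is taken from Lafferty et al. [8].

import math
import numpy as np

LABELS = ["IN", "NNP", "OTHER"]
AUGMENTED = ["start"] + LABELS + ["stop"]       # special start/stop labels, as in [8]
INDEX = {label: k for k, label in enumerate(AUGMENTED)}

# Illustrative local feature functions f_j(y', y, x, i) and weights lambda_j.
def f_capitalized_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_curr == "NNP" and 0 <= i < len(x) and x[i][0].isupper()) else 0.0

def f_in_to_nnp(y_prev, y_curr, x, i):
    return 1.0 if (y_prev == "IN" and y_curr == "NNP") else 0.0

features = [f_capitalized_nnp, f_in_to_nnp]
weights = [1.2, 0.7]

def M(i, x):
    """Matrix M_i(x) with entries M_i(y', y | x) = exp(sum_j lambda_j f_j(y', y, x, i))."""
    m = np.zeros((len(AUGMENTED), len(AUGMENTED)))
    for y_prev in AUGMENTED:
        for y_curr in AUGMENTED:
            m[INDEX[y_prev], INDEX[y_curr]] = math.exp(
                sum(w * f(y_prev, y_curr, x, i - 1) for w, f in zip(weights, features)))
    return m

x = ["in", "September"]
n = len(x)

# Product of the n + 1 matrices; the (start, stop) entry gives the
# normalization factor Z(x) (construction from Lafferty et al. [8]).
prod = np.linalg.multi_dot([M(i, x) for i in range(1, n + 2)])
Z = prod[INDEX["start"], INDEX["stop"]]
print(Z)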
7 Dynamic Programming
References
[1] A. L. Berger. The improved iterative scaling algorithm: A gentle introduction, 1997.
[2] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[3] P. Clifford. Markov random fields in statistics. In G. Grimmett and D. Welsh, editors, Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pages 19–32. Oxford University Press, 1990.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press/McGraw-Hill, 1990.
[5] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.
[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[7] E. T. Jaynes. Information theory and statistical mechanics. The Physical Review, 106(4):620–630, May 1957.
[8] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
[9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning, 2000.
[10] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. Technical Report CMU-CS-95-144, Carnegie Mellon University, 1995.
[11] D. Pinto, A. McCallum, X. Wei, and W. B. Croft. Table extraction using conditional random fields. In Proceedings of ACM SIGIR, 2003.
[12] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series. Prentice-Hall, Inc., 1993.
[13] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–285, 1989.
[14] A. Ratnaparkhi. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, 1997.
[15] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.
[16] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, 1948.
[17] H. M. Wallach. Efficient training of conditional random fields. Master's thesis, University of Edinburgh, 2002.