CRF
Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016
Table of contents
1 Motivation
2 Conditional Random Fields
3 Efficient computation
4 Features
Motivation
Generative classifiers compute the probability of a class y given an input x via Bayes' rule:

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} = \frac{P(x, y)}{P(x)}$$

The HMM is a generative classifier:

$$P(q \mid o) = \frac{P(o, q)}{P(o)}$$

For the classification, we have to compute

$$\arg\max_{q \in Q^n} P(o, q) = \arg\max_{q \in Q^n} P(o \mid q)\,P(q)$$

This means modeling the joint distribution P(o, q), including the distribution of the input itself. CRFs, in contrast, are discriminative: they model the conditional probability of the classes given the input directly.
Conditional Random Fields
Goal: determine the best sequence $y \in C^n$ of classes, given an input sequence $x$ of length $n$.
CRF Applications
Sample applications are:
POS tagging Ratnaparkhi (1997)
shallow parsing Sha & Pereira (2003)
Named Entity Recognition (Stanford NER, http://nlp.stanford.edu/software/CRF-NER.shtml) Finkel et al. (2005)
language identification Samih & Maier (2016)
Sample features
$$t(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``September''}, y_{i-1} = \text{IN and } y_i = \text{NNP}\\ 0 & \text{otherwise} \end{cases}$$

(taken from Wallach (2004))

$$s(y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``to'' and } y_i = \text{TO}\\ 0 & \text{otherwise} \end{cases}$$
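A minimal Python sketch of these feature functions (the sentence and the function names are my illustration, not from the slides):

```python
# Transition feature t(y_{i-1}, y_i, x, i) and state feature s(y_i, x, i)
# from Wallach (2004); each returns 1 if its condition holds, else 0.

def t_september(y_prev, y_cur, x, i):
    return 1 if x[i] == "September" and y_prev == "IN" and y_cur == "NNP" else 0

def s_to(y_cur, x, i):
    return 1 if x[i] == "to" and y_cur == "TO" else 0

x = ["He", "flew", "to", "Boston", "in", "September"]
print(t_september("IN", "NNP", x, 5))  # 1: "September" preceded by IN, tagged NNP
print(s_to("TO", x, 2))                # 1: "to" tagged TO
```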
CRF graph
If we have only transition features applying to $y_{i-1}, y_i, x, i$ and state features applying to $y_i, x, i$, we get a chain-structured CRF:

[Figure: linear-chain CRF; class nodes $y_1, y_2, \dots, y_n$ form a chain, and each $y_i$ is connected to the observed input $x_1, \dots, x_n$]
For a fixed feature $f_i$, the feature values at different input positions all receive the same weight $\lambda_i$; we can therefore sum them over positions before weighting:

$$F_i(y, x) = \sum_{j=1}^{n} f_i(y_{j-1}, y_j, x, j)$$

$$P(y \mid x) = \frac{e^{\sum_{i=1}^{k} \lambda_i F_i(y,x)}}{\sum_{y' \in C^n} e^{\sum_{i=1}^{k} \lambda_i F_i(y',x)}}$$
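A brute-force sketch of this definition, assuming nothing beyond the formula above (all names are mine); the normalisation enumerates all $|C|^n$ class sequences, so this is for illustration only:

```python
from itertools import product
from math import exp

def global_score(feats, weights, y, x):
    """sum_i lambda_i * F_i(y, x) with F_i(y, x) = sum_j f_i(y_{j-1}, y_j, x, j);
    a dummy 'start' class is prepended as y_0."""
    y = ["start"] + list(y)
    return sum(w * sum(f(y[j], y[j + 1], x, j) for j in range(len(x)))
               for f, w in zip(feats, weights))

def crf_prob(feats, weights, y, x, classes):
    """P(y | x): exponentiated global score, normalised over all sequences."""
    Z = sum(exp(global_score(feats, weights, yp, x))
            for yp in product(classes, repeat=len(x)))
    return exp(global_score(feats, weights, y, x)) / Z

# hypothetical usage with two toy features
feats = [lambda yp, yc, x, i: int(x[i] == "to" and yc == "TO"),
         lambda yp, yc, x, i: int(yp == "TO" and yc == "VB")]
print(crf_prob(feats, [1.5, 0.8], ["TO", "VB"], ["to", "go"], ["TO", "VB", "NN"]))  # ≈ 0.40
```

The efficient computation below replaces this exponential enumeration with the forward algorithm.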
CRF – POS tagging
Assume that we have POS tags Det, N, Adj, V and features

$$f_1(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``chief''}, y_{i-1} = \text{Det and } y_i = \text{Adj}\\ 0 & \text{otherwise} \end{cases}$$

$$f_2(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``chief'' and } y_i = \text{N}\\ 0 & \text{otherwise} \end{cases}$$

$$f_3(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``talks''}, y_{i-1} = \text{Det and } y_i = \text{N}\\ 0 & \text{otherwise} \end{cases}$$

$$f_4(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``talks''}, y_{i-1} = \text{Adj and } y_i = \text{N}\\ 0 & \text{otherwise} \end{cases}$$

$$f_5(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``talks''}, y_{i-1} = \text{N and } y_i = \text{V}\\ 0 & \text{otherwise} \end{cases}$$

$$f_6(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } x_i = \text{``the'' and } y_i = \text{Det}\\ 0 & \text{otherwise} \end{cases}$$

Weights: $\lambda_1 = 2$, $\lambda_2 = 5$, $\lambda_3 = 9$, $\lambda_4 = 8$, $\lambda_5 = 7$, $\lambda_6 = 20$.
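Scoring an input such as "the chief talks" under these features by brute force (the sentence and the code are my illustration, not from the slides):

```python
from itertools import product
from math import exp

# the six features above as tuples (word, y_prev, y_cur, weight);
# y_prev = None means the feature ignores the previous tag
FEATURES = [("chief", "Det", "Adj", 2), ("chief", None, "N", 5),
            ("talks", "Det", "N", 9), ("talks", "Adj", "N", 8),
            ("talks", "N", "V", 7), ("the", None, "Det", 20)]
TAGS = ["Det", "N", "Adj", "V"]

def score(y, x):
    """Sum of the weights of all features that fire on (y, x)."""
    total = 0
    for i, word in enumerate(x):
        prev = y[i - 1] if i > 0 else "start"
        for w, yp, yc, lam in FEATURES:
            if word == w and y[i] == yc and yp in (None, prev):
                total += lam
    return total

x = ["the", "chief", "talks"]
Z = sum(exp(score(y, x)) for y in product(TAGS, repeat=3))
best = max(product(TAGS, repeat=3), key=lambda y: score(y, x))
print(best, exp(score(best, x)) / Z)  # ('Det', 'N', 'V') with P ≈ 0.84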
Efficient computation
In a linear-chain CRF, $P(y \mid x)$ factors into a product of per-position transition matrices:

$$P(y \mid x) = \frac{1}{Z} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i), \qquad M_i(y', y) = e^{\sum_k \lambda_k f_k(y', y, x, i)}$$

with $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.
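A sketch of this matrix form (the function names are mine; I take the final stop transition as uniform, so $Z$ is simply the summed mass after the last matrix):

```python
import numpy as np

def transition_matrices(feats, weights, x, labels):
    """M_i[y', y] = exp(sum_k lambda_k f_k(y', y, x, i));
    labels[0] is the dummy 'start' label."""
    mats = []
    for i in range(len(x)):
        M = np.empty((len(labels), len(labels)))
        for a, yp in enumerate(labels):
            for b, yc in enumerate(labels):
                M[a, b] = np.exp(sum(w * f(yp, yc, x, i)
                                     for f, w in zip(feats, weights)))
        mats.append(M)
    return mats

def partition(mats):
    """Z via v <- v M_1 ... M_n, starting from the 'start' unit vector;
    mass on the dummy label is zeroed, since no real position carries it."""
    v = np.zeros(mats[0].shape[0]); v[0] = 1.0
    for M in mats:
        v = v @ M
        v[0] = 0.0
    return v.sum()
```

Multiplying the matrices left to right is exactly the forward computation carried out in the next example.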
Probability calculation in CRF
$C = \{A, B\}$, $x = abb$. Features $f_{c,c',x}$ and their weights:

f_{start,A,a}: 2     f_{A,A,b}: 1     f_{B,A,b}: 0.3
f_{start,B,a}: 0.5   f_{A,B,b}: 2     f_{B,B,b}: 4

Forward values α_i(c):

        i = 0   i = 1     i = 2                                i = 3
start   1
A       0       e^2       e^2 e^1 + e^{0.5} e^{0.3} = 22.31    22.31 e^1 + 144.62 e^{0.3}
B       0       e^{0.5}   e^2 e^2 + e^{0.5} e^4 = 144.62       22.31 e^2 + 144.62 e^4
The probability of the best class sequence ending in $c$ is then

$$\frac{v_n(c)}{Z} = \frac{v_n(c)}{\sum_{c \in C} \alpha_n(c)}$$
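The forward pass of this example as a short sketch (names mine); it reproduces the α values in the table:

```python
from math import exp

# feature weights from the example: (y_prev, y_cur, input symbol) -> weight
W = {("start", "A", "a"): 2.0, ("A", "A", "b"): 1.0, ("B", "A", "b"): 0.3,
     ("start", "B", "a"): 0.5, ("A", "B", "b"): 2.0, ("B", "B", "b"): 4.0}

def forward(x, classes=("A", "B")):
    """alpha_i(c): summed scores of all class sequences ending in c at i."""
    alpha = {"start": 1.0}
    for sym in x:
        alpha = {c: sum(a * exp(W.get((p, c, sym), 0.0))
                        for p, a in alpha.items())
                 for c in classes}
    return alpha

alpha = forward("abb")
print(alpha)                # {'A': ≈255.9, 'B': ≈8060.6}
print(sum(alpha.values()))  # Z ≈ 8316.4 (the slide's 8316.69 rounds the
                            # intermediate values 22.31 and 144.62)
```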
Classification in CRF
$C = \{A, B\}$, $x = abb$. Features $f_{c,c',x}$ and their weights:

f_{start,A,a}: 2     f_{A,A,b}: 1     f_{B,A,b}: 0.3
f_{start,B,a}: 0.5   f_{A,B,b}: 2     f_{B,B,b}: 4

Viterbi values v_i(c) with backpointers:

        i = 0   i = 1            i = 2             i = 3
start   1
A       0       e^2, start       e^2 e^1, A        e^{4.5} e^{0.3}, B
B       0       e^{0.5}, start   e^{0.5} e^4, B    e^{4.5} e^4, B

Best class sequence: BBB, with probability $\frac{e^{8.5}}{8316.69} = 0.59$
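The matching Viterbi sketch (names mine): identical to the forward pass, but taking the max over predecessors instead of the sum:

```python
from math import exp

W = {("start", "A", "a"): 2.0, ("A", "A", "b"): 1.0, ("B", "A", "b"): 0.3,
     ("start", "B", "a"): 0.5, ("A", "B", "b"): 2.0, ("B", "B", "b"): 4.0}

def viterbi(x, classes=("A", "B")):
    """v_i(c): best score of a class sequence ending in c, plus that sequence."""
    v = {"start": (1.0, [])}
    for sym in x:
        v = {c: max(((s * exp(W.get((p, c, sym), 0.0)), seq + [c])
                     for p, (s, seq) in v.items()), key=lambda t: t[0])
             for c in classes}
    return max(v.values(), key=lambda t: t[0])

score, seq = viterbi("abb")
print(seq, score)  # ['B', 'B', 'B'] 4914.77 (= e^8.5); dividing by Z gives 0.59
```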
Features
In general, we can have any type of feature, depending on the entire input sequence, the position $i$, and the classes $y_{i-1}$ and $y_i$. This relaxes the independence assumptions compared to an HMM.
Shallow parsing Sha & Pereira (2003)
Sample predicates $p(x, i)$:
$x_{i+2} = w$
$x_{i-1} = w$, $x_i = w'$
$\mathrm{POS}(x_{i-1}) = t$, $\mathrm{POS}(x_i) = t'$
...
Sample predicates $q(y_{i-1}, y_i)$:
$y_i = c\,c'$ ($c, c'$ are the labels assigned by the chunker)
$y_{i-1} = c\,c'$, $y_i = c'\,c''$
$y_i = x\,c'$ where $x$ can be any label
...
Each CRF feature combines an input predicate with a label predicate, as sketched below.
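A sketch of this template scheme, assuming each feature is the product of one input predicate and one label predicate (the concrete predicates and chunk labels below are hypothetical):

```python
def make_feature(p, q):
    """Build f(y_prev, y_cur, x, i) = p(x, i) * q(y_prev, y_cur)."""
    return lambda y_prev, y_cur, x, i: int(p(x, i) and q(y_prev, y_cur))

# hypothetical instances of the templates above
p = lambda x, i: i + 2 < len(x) and x[i + 2] == "of"   # x_{i+2} = "of"
q = lambda y_prev, y_cur: y_cur == "B-NP"              # y_i = B-NP
f = make_feature(p, q)
print(f("O", "B-NP", ["the", "end", "of", "it"], 0))   # 1
```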
References
Finkel, Jenny Rose, Trond Grenager & Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), 363–370. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1219840.1219885.
Lafferty, John, Andrew McCallum & Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Tech. Rep. 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Samih, Younes & Wolfgang Maier. 2016. Detecting code-switching in Moroccan Arabic. In Proceedings of SocialNLP @ IJCAI-2016, New York. To appear.
Sha, Fei & Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology (HLT-NAACL).
Wallach, Hanna M. 2002. Efficient training of conditional random fields: University of Edinburgh dissertation.
Wallach, Hanna M. 2004. Conditional random fields: An introduction. Tech. Rep. (CIS) Paper 22, University of Pennsylvania.