
Machine Learning for natural language processing

Conditional Random Fields

Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016

1 / 21
Introduction

HMM is a generative sequence classifier.
MaxEnt is a discriminative classifier.
Today: a discriminative sequence classifier that combines ideas from HMM and MaxEnt.

Lafferty et al. (2001); Sha & Pereira (2003); Wallach (2002, 2004)

2 / 21
Table of contents

1 Motivation

2 Conditional Random Fields

3 Efficient computation

4 Features

3 / 21
Motivation

Naive Bayes is a generative classifier assigning a single class to a single input.
MaxEnt is a discriminative classifier assigning a single class to a single input.
HMM is a generative sequence classifier assigning sequences of classes to sequences of input symbols.
CRF is a discriminative sequence classifier assigning sequences of classes to sequences of input symbols.

                    single classification    sequence classification
    generative      naive Bayes              HMM
    discriminative  MaxEnt                   CRF

4 / 21
Motivation
Generative classifiers compute the probability of a class y given an input x as

    P(y∣x) = P(x∣y)P(y) / P(x) = P(x, y) / P(x)

HMM is a generative classifier:

    P(q∣o) = P(o, q) / P(o)

For the classification, we have to compute

    arg max_{q ∈ Qⁿ} P(o, q) = arg max_{q ∈ Qⁿ} P(o∣q)P(q)

The computation of the joint probability P(x, y) (here P(o, q)) is a complex task.
In contrast, MaxEnt classifiers directly compute the conditional probability P(y∣x) that has to be maximized.
5 / 21
Motivation

The move from generative to discriminative sequence classification (HMM to CRF) has several advantages:

We do not need to compute the joint probability any longer.
The strong independence assumptions of HMMs can be relaxed, since features in a discriminative approach can capture dependencies that are less local than the n-gram based features of HMMs.
Feature weights need not be probabilities, i.e., they can have values below 0 or above 1.

6 / 21
Conditional Random Fields
Goal: determine the best sequence y ∈ Cⁿ of classes, given an input sequence x of length n:

    ŷ = arg max_{y ∈ Cⁿ} P(y∣x)

CRF Applications
Sample applications are:
POS tagging: Ratnaparkhi (1997)
shallow parsing: Sha & Pereira (2003)
Named Entity Recognition (Stanford NER, http://nlp.stanford.edu/software/CRF-NER.shtml): Finkel et al. (2005)
language identification: Samih & Maier (2016)

7 / 21
Conditional Random Fields

The probability of a class sequence for an input sequence depends on features (so-called potential functions).
Features refer to the potential class of some input symbol xi and to the classes of some other input symbols.
Features are usually indicator functions that will be weighted.

Sample features

    t(yi−1, yi, x, i) = 1 if xi = “September”, yi−1 = IN and yi = NNP; 0 otherwise
    (taken from Wallach (2004))

    s(yi, x, i) = 1 if xi = “to” and yi = TO; 0 otherwise
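As a quick illustration (our own sketch, not part of the slides), such indicator features can be written directly as small functions; the example sentence and the argument conventions below are made up for demonstration.

    # Minimal sketch of the two sample indicator features above.
    # x is the input sentence as a list of tokens, i the current position,
    # y_prev and y_curr the candidate labels y_{i-1} and y_i.

    def t_september(y_prev, y_curr, x, i):
        """Transition feature: fires for "September" tagged NNP right after an IN."""
        return 1 if x[i] == "September" and y_prev == "IN" and y_curr == "NNP" else 0

    def s_to(y_curr, x, i):
        """State feature: fires for the word "to" tagged TO."""
        return 1 if x[i] == "to" and y_curr == "TO" else 0

    # Hypothetical example sentence
    x = ["submitted", "to", "the", "board", "in", "September"]
    print(s_to("TO", x, 1))                # 1
    print(t_september("IN", "NNP", x, 5))  # 1
    print(t_september("DT", "NNP", x, 5))  # 0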

8 / 21
Conditional Random Fields

The dependencies that are expressed within the features can be captured in a graph.

CRF graph
If we have only transition features applying to yi−1, yi, x, i and state features applying to yi, x, i, we get a chain-structured CRF:

[Figure: chain-structured CRF with label nodes y1, y2, y3, . . . , yn−1, yn linked in a chain, each label node connected to the observation sequence x1 . . . xn]

9 / 21
Conditional Random Fields

In order to compute the probability of a class sequence for an input sequence, we
extract the corresponding features,
combine them linearly (i.e., multiply each by a weight and add them up),
and then apply a function to this linear combination, exactly as in the MaxEnt case.

In the following, we assume that we have only transition features and state features, where the latter can also be considered transition features (that give the same value for all preceding states). I.e., every feature has the form f(yi−1, yi, x, i).

10 / 21
Conditional Random Fields
For a fixed feature f, the values at the different input positions all receive the same weight. Therefore we can sum them over the positions before weighting.

From class features to sequence features

    F(y, x) = ∑_{i=1}^{n} f(yi−1, yi, x, i)

Assume that we have features f1, . . . , fk, which yield sequence features F1, . . . , Fk.
We weight these and apply them exactly as in the MaxEnt case:

Conditional class sequence probability
Let λj be the weight of feature Fj. Then

    P(y∣x) = exp(∑_{j=1}^{k} λj Fj(y, x)) / ∑_{y′ ∈ Cⁿ} exp(∑_{j=1}^{k} λj Fj(y′, x))
11 / 21
Conditional Random Fields
CRF – POS tagging
Assume that we have POS tags Det, N, Adj, V and features
f1(yi−1, yi, x, i) = 1 if xi = “chief”, yi−1 = Det and yi = Adj; 0 otherwise
f2(yi−1, yi, x, i) = 1 if xi = “chief” and yi = N; 0 otherwise
f3(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = Det and yi = N; 0 otherwise
f4(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = Adj and yi = N; 0 otherwise
f5(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = N and yi = V; 0 otherwise
f6(yi−1, yi, x, i) = 1 if xi = “the” and yi = Det; 0 otherwise

Weights: λ1 = 2, λ2 = 5, λ3 = 9, λ4 = 8, λ5 = 7, λ6 = 20.
12 / 21
Conditional Random Fields

CRF – POS tagging


Assume that we have a sequence “the chief talks”. Which of the
following probabilities is higher? P(Det N V ∣ the chief talks) or
P(Det Adj N ∣ the chief talks)?
Weighted feature sums for both:
1 Det N V : 20 + 5 + 7 = 32
2 Det Adj N : 20 + 2 + 8 = 30
Consequently, Det N V has a slightly higher probability.

(In real applications, we have of course many more features.)
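As a small illustration (our own sketch, not part of the slides), the weighted feature sums above can be reproduced with a few lines of code; representing each feature as a (word, previous tag, current tag) condition is just a convenient choice for this toy example.

    # Toy scoring sketch for the POS example above: each feature is a condition on
    # (current word, required previous tag, required current tag); None = "don't care".
    features = [
        (("chief", "Det", "Adj"), 2),   # f1, lambda_1 = 2
        (("chief", None,  "N"),   5),   # f2, lambda_2 = 5
        (("talks", "Det", "N"),   9),   # f3, lambda_3 = 9
        (("talks", "Adj", "N"),   8),   # f4, lambda_4 = 8
        (("talks", "N",   "V"),   7),   # f5, lambda_5 = 7
        (("the",   None,  "Det"), 20),  # f6, lambda_6 = 20
    ]

    def score(words, tags):
        """Weighted feature sum, i.e. sum_j lambda_j * F_j(y, x), for one tag sequence."""
        total, prev = 0, "start"
        for word, tag in zip(words, tags):
            for (w, p, t), weight in features:
                if word == w and tag == t and (p is None or prev == p):
                    total += weight
            prev = tag
        return total

    words = ["the", "chief", "talks"]
    print(score(words, ["Det", "N", "V"]))    # 32
    print(score(words, ["Det", "Adj", "N"]))  # 30

Exponentiating and normalizing these sums over all tag sequences would give the conditional probabilities, exactly as in the formula on the previous slide.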

13 / 21
Efficient computation

In the following, we assume a chain-structured CRF (see example). Let x be our input sequence of length n. We have features f1, . . . , fk. Each label sequence is augmented with an initial start and a final end.
Let C be the set of class labels.

We define a set of C × C matrices M1(x), M2(x), . . . , Mn+1(x) where for all i, 1 ≤ i ≤ n + 1, and all classes c, c′:

    Mi(c, c′) = exp(∑_{j=1}^{k} λj fj(c, c′, x, i))

With this, we can compute

    P(y∣x) = (1/Z) ∏_{i=1}^{n+1} Mi(yi−1, yi)

14 / 21
Efficient computation

In order to obtain the probability, we have to compute

    Z = ∑_{y′ ∈ Cⁿ} exp(∑_{j=1}^{k} λj Fj(y′, x))

As in the HMM case, we can use the forward-backward algorithm in order to compute this efficiently.
We compute αi(c) = ∑_{y′ ∈ C^{i−1}} exp(∑_{j=1}^{k} λj Fj(y′c, x)), i.e., the total unnormalized score of all label prefixes of length i ending in class c (with Fj restricted to the first i positions), in a way similar to the HMM forward computation:


Forward computation
1 α0(c) = 1 if c = start, else α0(c) = 0.
2 αi(c) = ∑_{c′ ∈ C} αi−1(c′) · Mi(c′, c) for 1 ≤ i ≤ n
3 Z = ∑_{c ∈ C} αn(c)

15 / 21
Efficient computation
Probability calculation in CRF
C = {A, B}, x = abb. Features f_{c,c′,x} (previous class c, current class c′, input symbol x; s = start) and their weights:

    f_{s,A,a}: 2      f_{A,A,b}: 1      f_{B,A,b}: 0.3
    f_{s,B,a}: 0.5    f_{A,B,b}: 2      f_{B,B,b}: 4

Forward table αi(c):

              i=0    i=1      i=2                                i=3
    start     1
    A         0      e^2      e^2·e^1 + e^0.5·e^0.3 = 22.31      22.31·e^1 + 144.62·e^0.3 = 255.86
    B         0      e^0.5    e^2·e^2 + e^0.5·e^4 = 144.62       22.31·e^2 + 144.62·e^4 = 8060.83

Z = 255.86 + 8060.83 = 8316.69

P(AAA∣abb) = e^{2+1+1}/8316.69 = 0.0066      P(BBB∣abb) = e^{0.5+4+4}/8316.69 = 0.59
P(ABB∣abb) = e^{2+2+4}/8316.69 = 0.36        P(BAB∣abb) = e^{0.5+0.3+2}/8316.69 = 0.002
P(BBA∣abb) = e^{0.5+4+0.3}/8316.69 = 0.015   P(AAB∣abb) = e^{2+1+2}/8316.69 = 0.018
P(ABA∣abb) = e^{2+2+0.3}/8316.69 = 0.009     P(BAA∣abb) = e^{0.5+0.3+1}/8316.69 = 0.0007
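A short sketch of the forward computation for this toy example (our own code, not from the slides): it builds the matrices Mi from the previous slide and runs the recursion. The exact value of Z differs slightly from 8316.69 because the slide rounds intermediate results.

    import math

    # Toy CRF from the example: C = {A, B}, x = "abb".
    # Feature weights indexed by (previous class, current class, input symbol);
    # "start" plays the role of s, missing entries contribute weight 0.
    weights = {
        ("start", "A", "a"): 2,   ("start", "B", "a"): 0.5,
        ("A", "A", "b"): 1,       ("A", "B", "b"): 2,
        ("B", "A", "b"): 0.3,     ("B", "B", "b"): 4,
    }
    classes = ["A", "B"]
    x = "abb"

    def M(i, c_prev, c_curr):
        """Matrix entry M_i(c, c') = exp(sum of the weights of the firing features)."""
        return math.exp(weights.get((c_prev, c_curr, x[i - 1]), 0.0))

    # Forward recursion: alpha_0(start) = 1, alpha_i(c) = sum_c' alpha_{i-1}(c') * M_i(c', c).
    alpha = {"start": 1.0, "A": 0.0, "B": 0.0}
    for i in range(1, len(x) + 1):
        alpha = {c: sum(alpha[cp] * M(i, cp, c) for cp in alpha) for c in classes}

    Z = sum(alpha.values())
    print(round(Z, 2))  # 8316.44, i.e. the slide's 8316.69 up to rounding

    # Probability of a complete label sequence, e.g. BBB:
    def prob(y):
        score = sum(weights.get((cp, c, sym), 0.0)
                    for cp, c, sym in zip(["start"] + list(y), y, x))
        return math.exp(score) / Z

    print(round(prob("BBB"), 2))  # 0.59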
16 / 21
Efficient computation
In order to obtain the best class sequence, we can use the Viterbi algorithm:

Viterbi for CRF
1 v0(c) = 1 if c = start, else v0(c) = 0.
2 vi(c) = max_{c′ ∈ C} (vi−1(c′) · Mi(c′, c)) for 1 ≤ i ≤ n

If we keep additional backpointers to the c′ that has led to the maximal value, we can read off the best class sequence, starting from the maximal value vn(c) we have for any of the classes c. If this best class for n is c, the probability of the best class sequence is

    vn(c) / Z = vn(c) / ∑_{c ∈ C} αn(c)

17 / 21
Efficient computation

Classification in CRF
C = {A, B}, x = abb. Features f_{c,c′,x} and their weights as above:

    f_{s,A,a}: 2      f_{A,A,b}: 1      f_{B,A,b}: 0.3
    f_{s,B,a}: 0.5    f_{A,B,b}: 2      f_{B,B,b}: 4

Viterbi table vi(c) with backpointers:

              i=0    i=1              i=2                  i=3
    start     1
    A         0      e^2 (start)      e^2·e^1 (A)          e^4.5·e^0.3 (B)
    B         0      e^0.5 (start)    e^0.5·e^4 (B)        e^4.5·e^4 (B)

Best class sequence: BBB, probability e^8.5/8316.69 = 0.59
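A corresponding Viterbi sketch (again our own illustration, reusing the toy weights from the forward example) recovers the best sequence BBB and its unnormalized score e^8.5 ≈ 4914.77.

    import math

    # Same toy CRF as in the forward sketch: C = {A, B}, x = "abb".
    weights = {
        ("start", "A", "a"): 2,   ("start", "B", "a"): 0.5,
        ("A", "A", "b"): 1,       ("A", "B", "b"): 2,
        ("B", "A", "b"): 0.3,     ("B", "B", "b"): 4,
    }
    classes = ["A", "B"]
    x = "abb"

    def M(i, c_prev, c_curr):
        return math.exp(weights.get((c_prev, c_curr, x[i - 1]), 0.0))

    # Viterbi recursion v_i(c) = max_c' v_{i-1}(c') * M_i(c', c), with backpointers.
    v = {"start": 1.0}
    backpointers = []
    for i in range(1, len(x) + 1):
        new_v, bp = {}, {}
        for c in classes:
            best_prev = max(v, key=lambda cp: v[cp] * M(i, cp, c))
            new_v[c] = v[best_prev] * M(i, best_prev, c)
            bp[c] = best_prev
        v = new_v
        backpointers.append(bp)

    # Follow the backpointers from the best final class to read off the sequence.
    best = max(v, key=v.get)
    path = [best]
    for bp in reversed(backpointers[1:]):
        path.append(bp[path[-1]])
    path.reverse()

    print("".join(path), round(v[best], 2))  # BBB 4914.77 (= e^8.5)
    # Dividing v[best] by Z from the forward sketch gives the probability 0.59.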

18 / 21
Features
In general, we can have any type of feature depending on the entire input sequence, the position i, and the classes of xi and xi−1.
This relaxes the strong independence assumptions of the HMM.

Shallow parsing Sha & Pereira (2003)


Shallow parsing: identify chunks (= non-recursive noun phrases)
without analyzing their internal structure.
The chunker assigns a chunk label B, I or O (beginning of a chunk, inside a chunk, outside any chunk) to each word.
The CRF classifier assigns a pair of chunk labels to a word xi ,
namely the concatenation of the chunk labels of xi−1 and of xi .
All features f (yi−1 , yi , x, i) are indicator functions, indicating that
some predicate p(x, i) holds and
some predicate q(yi−1 , yi ) holds.

19 / 21
Features
Shallow parsing Sha & Pereira (2003)
Sample predicates p(x, i):
xi+2 = w,
xi−1 = w, xi = w′,
POS(xi−1) = t, POS(xi) = t′,
...
Sample predicates q(yi−1, yi):
yi = cc′ (c, c′ are the labels assigned by the chunker),
yi−1 = cc′, yi = c′c′′,
yi = ∗c′, where ∗ can be any label,
...

(w, w′ are specific words, t, t′ specific POS tags and c, c′, c′′ specific labels from {B, I, O}.)

In total, Sha & Pereira (2003) use 3.8 million features.
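The size of such feature sets comes from crossing input predicates with label predicates. The following sketch is purely illustrative (the predicates, names and counts are our own, not Sha & Pereira's actual feature templates); it shows how even two input predicates already yield 180 indicator features.

    from itertools import product

    chunk_labels = ["B", "I", "O"]
    # Pair labels y_i: concatenation of the chunk labels of x_{i-1} and x_i.
    pair_labels = [a + b for a, b in product(chunk_labels, repeat=2)]

    def make_feature(p, q):
        """Combine an input predicate p(x, i) and a label predicate q(y_prev, y_curr)."""
        return lambda y_prev, y_curr, x, i: 1 if p(x, i) and q(y_prev, y_curr) else 0

    # Two sample input predicates p(x, i); x is a list of (word, POS) pairs.
    input_predicates = [
        lambda x, i: i + 2 < len(x) and x[i + 2][0] == "of",             # x_{i+2} = "of"
        lambda x, i: i > 0 and x[i - 1][1] == "DT" and x[i][1] == "NN",  # POS bigram
    ]
    # Label predicates q(y_{i-1}, y_i): one per pair label plus one per label bigram.
    label_predicates = [lambda yp, yc, l=l: yc == l for l in pair_labels]
    label_predicates += [lambda yp, yc, a=a, b=b: yp == a and yc == b
                         for a, b in product(pair_labels, repeat=2)]

    features = [make_feature(p, q) for p, q in product(input_predicates, label_predicates)]
    print(len(features))  # 2 * (9 + 81) = 180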

20 / 21
References
Finkel, Jenny Rose, Trond Grenager & Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), 363–370. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1219840.1219885. http://dx.doi.org/10.3115/1219840.1219885.
Lafferty, John, Andrew McCallum & Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML).
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Tech. Rep. 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Samih, Younes & Wolfgang Maier. 2016. Detecting code-switching in Moroccan Arabic. In Proceedings of SocialNLP @ IJCAI-2016, New York. To appear.
Sha, Fei & Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, NAACL.
Wallach, Hanna M. 2002. Efficient training of conditional random fields: University of Edinburgh dissertation.
Wallach, Hanna M. 2004. Conditional random fields: An introduction. Tech. rep., University of Pennsylvania. Technical Report (CIS), Paper 22.
21 / 21
