
Machine Learning for natural language processing

Conditional Random Fields

Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016

1 / 21
Introduction

HMM is a generative sequence classifier.
MaxEnt is a discriminative classifier.
Today: a discriminative sequence classifier that combines ideas from HMM and MaxEnt.

Lafferty et al. (2001); Sha & Pereira (2003); Wallach (2002, 2004)

2 / 21
Table of contents

1 Motivation

2 Conditional Random Fields

3 Efficient computation

4 Features

3 / 21
Motivation

Naive Bayes is a generative classifier assigning a single class to a single input.
MaxEnt is a discriminative classifier assigning a single class to a single input.
HMM is a generative sequence classifier assigning sequences of classes to sequences of input symbols.
CRF is a discriminative sequence classifier assigning sequences of classes to sequences of input symbols.

                    single classification    sequence classification
    generative      naive Bayes              HMM
    discriminative  MaxEnt                   CRF

4 / 21
Motivation
Generative classifiers compute the probability of a class y given an input x as

    P(y∣x) = P(x∣y)P(y) / P(x) = P(x, y) / P(x)

HMM is a generative classifier:

    P(q∣o) = P(o, q) / P(o)

For the classification, we have to compute

    arg max_{q ∈ Qⁿ} P(o, q) = arg max_{q ∈ Qⁿ} P(o∣q)P(q)

The computation of the joint probability P(x, y) (here P(o, q)) is a complex task.
In contrast, MaxEnt classifiers directly compute the conditional probability P(y∣x) that has to be maximized.
5 / 21
Motivation

The move from generative to discriminative sequence classification (HMM to CRF) has several advantages:

We do not need to compute the joint probability any longer.
The strong independence assumptions of HMMs can be relaxed, since features in a discriminative approach can capture dependencies that are less local than the n-gram based features of HMMs.
Feature weights need not be probabilities, i.e., they can have values below 0 or above 1.

6 / 21
Conditional Random Fields
Goal: determine the best sequence y ∈ Cⁿ of classes, given an input sequence x of length n:

    ŷ = arg max_{y ∈ Cⁿ} P(y∣x)

CRF Applications
Sample applications are:
POS tagging: Ratnaparkhi (1997)
shallow parsing: Sha & Pereira (2003)
Named Entity Recognition (Stanford NER, http://nlp.stanford.edu/software/CRF-NER.shtml): Finkel et al. (2005)
language identification: Samih & Maier (2016)

7 / 21
Conditional Random Fields

The probability of a class sequence for an input sequence depends on features (so-called potential functions).
Features refer to the potential class of some input symbol xi and to the classes of some other input symbols.
Features are usually indicator functions that will be weighted.

Sample features

    t(yi−1, yi, x, i) = 1 if xi = “September”, yi−1 = IN and yi = NNP; 0 otherwise
    (taken from Wallach (2004))

    s(yi, x, i) = 1 if xi = “to” and yi = TO; 0 otherwise
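As a quick illustration (our own sketch, not part of the slides), such indicator features can be written directly as small functions; the example sentence and the argument conventions below are made up for demonstration.

    # Minimal sketch of the two sample indicator features above.
    # x is the input sentence as a list of tokens, i the current position,
    # y_prev and y_curr the candidate labels y_{i-1} and y_i.

    def t_september(y_prev, y_curr, x, i):
        """Transition feature: fires for "September" tagged NNP right after an IN."""
        return 1 if x[i] == "September" and y_prev == "IN" and y_curr == "NNP" else 0

    def s_to(y_curr, x, i):
        """State feature: fires for the word "to" tagged TO."""
        return 1 if x[i] == "to" and y_curr == "TO" else 0

    # Hypothetical example sentence
    x = ["submitted", "to", "the", "board", "in", "September"]
    print(s_to("TO", x, 1))                # 1
    print(t_september("IN", "NNP", x, 5))  # 1
    print(t_september("DT", "NNP", x, 5))  # 0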

8 / 21
Conditional Random Fields

The dependencies that are expressed within the features can be captured in a graph.

CRF graph
If we have only transition features applying to yi−1, yi, x, i and state features applying to yi, x, i, we get a chain-structured CRF:

[Figure: chain-structured CRF with label nodes y1, y2, y3, . . . , yn−1, yn linked in a chain, each label node connected to the observation sequence x1 . . . xn]

9 / 21
Conditional Random Fields

In order to compute the probability of a class sequence for an input sequence, we
extract the corresponding features,
combine them linearly (i.e., multiply each by a weight and add them up),
and then apply a function to this linear combination, exactly as in the MaxEnt case.

In the following, we assume that we have only transition features and state features, where the latter can also be considered transition features (that give the same value for all preceding states). I.e., every feature has the form f(yi−1, yi, x, i).

10 / 21
Conditional Random Fields
For a fixed feature f, the values at the different input positions all receive the same weight. Therefore we can sum them over the positions before weighting.

From class features to sequence features

    F(y, x) = ∑_{i=1}^{n} f(yi−1, yi, x, i)

Assume that we have features f1, . . . , fk, which yield sequence features F1, . . . , Fk.
We weight these and apply them exactly as in the MaxEnt case:

Conditional class sequence probability
Let λj be the weight of feature Fj. Then

    P(y∣x) = exp(∑_{j=1}^{k} λj Fj(y, x)) / ∑_{y′ ∈ Cⁿ} exp(∑_{j=1}^{k} λj Fj(y′, x))
11 / 21
Conditional Random Fields
CRF – POS tagging
Assume that we have POS tags Det, N, Adj, V and features
f1(yi−1, yi, x, i) = 1 if xi = “chief”, yi−1 = Det and yi = Adj; 0 otherwise
f2(yi−1, yi, x, i) = 1 if xi = “chief” and yi = N; 0 otherwise
f3(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = Det and yi = N; 0 otherwise
f4(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = Adj and yi = N; 0 otherwise
f5(yi−1, yi, x, i) = 1 if xi = “talks”, yi−1 = N and yi = V; 0 otherwise
f6(yi−1, yi, x, i) = 1 if xi = “the” and yi = Det; 0 otherwise

Weights: λ1 = 2, λ2 = 5, λ3 = 9, λ4 = 8, λ5 = 7, λ6 = 20.
12 / 21
Conditional Random Fields

CRF – POS tagging


Assume that we have a sequence “the chief talks”. Which of the
following probabilities is higher? P(Det N V ∣ the chief talks) or
P(Det Adj N ∣ the chief talks)?
Weighted feature sums for both:
1 Det N V : 20 + 5 + 7 = 32
2 Det Adj N : 20 + 2 + 8 = 30
Consequently, Det N V has a slightly higher probability.

(In real applications, we have of course many more features.)
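As a small illustration (our own sketch, not part of the slides), the weighted feature sums above can be reproduced with a few lines of code; representing each feature as a (word, previous tag, current tag) condition is just a convenient choice for this toy example.

    # Toy scoring sketch for the POS example above: each feature is a condition on
    # (current word, required previous tag, required current tag); None = "don't care".
    features = [
        (("chief", "Det", "Adj"), 2),   # f1, lambda_1 = 2
        (("chief", None,  "N"),   5),   # f2, lambda_2 = 5
        (("talks", "Det", "N"),   9),   # f3, lambda_3 = 9
        (("talks", "Adj", "N"),   8),   # f4, lambda_4 = 8
        (("talks", "N",   "V"),   7),   # f5, lambda_5 = 7
        (("the",   None,  "Det"), 20),  # f6, lambda_6 = 20
    ]

    def score(words, tags):
        """Weighted feature sum, i.e. sum_j lambda_j * F_j(y, x), for one tag sequence."""
        total, prev = 0, "start"
        for word, tag in zip(words, tags):
            for (w, p, t), weight in features:
                if word == w and tag == t and (p is None or prev == p):
                    total += weight
            prev = tag
        return total

    words = ["the", "chief", "talks"]
    print(score(words, ["Det", "N", "V"]))    # 32
    print(score(words, ["Det", "Adj", "N"]))  # 30

Exponentiating and normalizing these sums over all tag sequences would give the conditional probabilities, exactly as in the formula on the previous slide.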

13 / 21
Efficient computation

In the following, we assume a chain-structured CRF (see example). Let x be our input sequence of length n. We have features f1, . . . , fk. Each label sequence is augmented with an initial start and a final end.
Let C be the set of class labels.

We define a set of C × C matrices M1(x), M2(x), . . . , Mn+1(x) where for all i, 1 ≤ i ≤ n + 1, and all classes c, c′:

    Mi(c, c′) = exp(∑_{j=1}^{k} λj fj(c, c′, x, i))

With this, we can compute

    P(y∣x) = (1/Z) ∏_{i=1}^{n+1} Mi(yi−1, yi)

14 / 21
Efficient computation

In order to obtain the probability, we have to compute

    Z = ∑_{y′ ∈ Cⁿ} exp(∑_{j=1}^{k} λj Fj(y′, x))

As in the HMM case, we can use the forward-backward algorithm in order to compute this efficiently.
We compute αi(c) = ∑_{y′ ∈ C^{i−1}} exp(∑_{j=1}^{k} λj Fj(y′c, x)), i.e., the total unnormalized score of all label prefixes of length i ending in class c (with Fj restricted to the first i positions), in a way similar to the HMM forward computation:


Forward computation
1 α0(c) = 1 if c = start, else α0(c) = 0.
2 αi(c) = ∑_{c′ ∈ C} αi−1(c′) · Mi(c′, c) for 1 ≤ i ≤ n
3 Z = ∑_{c ∈ C} αn(c)

15 / 21
Efficient computation
Probability calculation in CRF
C = {A, B}, x = abb. Features f_{c,c′,x} (previous class c, current class c′, input symbol x; s = start) and their weights:

    f_{s,A,a}: 2      f_{A,A,b}: 1      f_{B,A,b}: 0.3
    f_{s,B,a}: 0.5    f_{A,B,b}: 2      f_{B,B,b}: 4

Forward table αi(c):

              i=0    i=1      i=2                                i=3
    start     1
    A         0      e^2      e^2·e^1 + e^0.5·e^0.3 = 22.31      22.31·e^1 + 144.62·e^0.3 = 255.86
    B         0      e^0.5    e^2·e^2 + e^0.5·e^4 = 144.62       22.31·e^2 + 144.62·e^4 = 8060.83

Z = 255.86 + 8060.83 = 8316.69

P(AAA∣abb) = e^{2+1+1}/8316.69 = 0.0066      P(BBB∣abb) = e^{0.5+4+4}/8316.69 = 0.59
P(ABB∣abb) = e^{2+2+4}/8316.69 = 0.36        P(BAB∣abb) = e^{0.5+0.3+2}/8316.69 = 0.002
P(BBA∣abb) = e^{0.5+4+0.3}/8316.69 = 0.015   P(AAB∣abb) = e^{2+1+2}/8316.69 = 0.018
P(ABA∣abb) = e^{2+2+0.3}/8316.69 = 0.009     P(BAA∣abb) = e^{0.5+0.3+1}/8316.69 = 0.0007
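A short sketch of the forward computation for this toy example (our own code, not from the slides): it builds the matrices Mi from the previous slide and runs the recursion. The exact value of Z differs slightly from 8316.69 because the slide rounds intermediate results.

    import math

    # Toy CRF from the example: C = {A, B}, x = "abb".
    # Feature weights indexed by (previous class, current class, input symbol);
    # "start" plays the role of s, missing entries contribute weight 0.
    weights = {
        ("start", "A", "a"): 2,   ("start", "B", "a"): 0.5,
        ("A", "A", "b"): 1,       ("A", "B", "b"): 2,
        ("B", "A", "b"): 0.3,     ("B", "B", "b"): 4,
    }
    classes = ["A", "B"]
    x = "abb"

    def M(i, c_prev, c_curr):
        """Matrix entry M_i(c, c') = exp(sum of the weights of the firing features)."""
        return math.exp(weights.get((c_prev, c_curr, x[i - 1]), 0.0))

    # Forward recursion: alpha_0(start) = 1, alpha_i(c) = sum_c' alpha_{i-1}(c') * M_i(c', c).
    alpha = {"start": 1.0, "A": 0.0, "B": 0.0}
    for i in range(1, len(x) + 1):
        alpha = {c: sum(alpha[cp] * M(i, cp, c) for cp in alpha) for c in classes}

    Z = sum(alpha.values())
    print(round(Z, 2))  # 8316.44, i.e. the slide's 8316.69 up to rounding

    # Probability of a complete label sequence, e.g. BBB:
    def prob(y):
        score = sum(weights.get((cp, c, sym), 0.0)
                    for cp, c, sym in zip(["start"] + list(y), y, x))
        return math.exp(score) / Z

    print(round(prob("BBB"), 2))  # 0.59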
16 / 21
Efficient computation
In order to obtain the best class sequence, we can use the Viterbi algorithm:

Viterbi for CRF
1 v0(c) = 1 if c = start, else v0(c) = 0.
2 vi(c) = max_{c′ ∈ C} (vi−1(c′) · Mi(c′, c)) for 1 ≤ i ≤ n

If we keep additional backpointers to the c′ that has led to the maximal value, we can read off the best class sequence, starting from the maximal value vn(c) we have for any of the classes c. If this best class for n is c, the probability of the best class sequence is

    vn(c) / Z = vn(c) / ∑_{c ∈ C} αn(c)

17 / 21
Efficient computation

Classification in CRF
C = {A, B}, x = abb. Features f_{c,c′,x} and their weights as above:

    f_{s,A,a}: 2      f_{A,A,b}: 1      f_{B,A,b}: 0.3
    f_{s,B,a}: 0.5    f_{A,B,b}: 2      f_{B,B,b}: 4

Viterbi table vi(c) with backpointers:

              i=0    i=1              i=2                  i=3
    start     1
    A         0      e^2 (start)      e^2·e^1 (A)          e^4.5·e^0.3 (B)
    B         0      e^0.5 (start)    e^0.5·e^4 (B)        e^4.5·e^4 (B)

Best class sequence: BBB, probability e^8.5/8316.69 = 0.59
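A corresponding Viterbi sketch (again our own illustration, reusing the toy weights from the forward example) recovers the best sequence BBB and its unnormalized score e^8.5 ≈ 4914.77.

    import math

    # Same toy CRF as in the forward sketch: C = {A, B}, x = "abb".
    weights = {
        ("start", "A", "a"): 2,   ("start", "B", "a"): 0.5,
        ("A", "A", "b"): 1,       ("A", "B", "b"): 2,
        ("B", "A", "b"): 0.3,     ("B", "B", "b"): 4,
    }
    classes = ["A", "B"]
    x = "abb"

    def M(i, c_prev, c_curr):
        return math.exp(weights.get((c_prev, c_curr, x[i - 1]), 0.0))

    # Viterbi recursion v_i(c) = max_c' v_{i-1}(c') * M_i(c', c), with backpointers.
    v = {"start": 1.0}
    backpointers = []
    for i in range(1, len(x) + 1):
        new_v, bp = {}, {}
        for c in classes:
            best_prev = max(v, key=lambda cp: v[cp] * M(i, cp, c))
            new_v[c] = v[best_prev] * M(i, best_prev, c)
            bp[c] = best_prev
        v = new_v
        backpointers.append(bp)

    # Follow the backpointers from the best final class to read off the sequence.
    best = max(v, key=v.get)
    path = [best]
    for bp in reversed(backpointers[1:]):
        path.append(bp[path[-1]])
    path.reverse()

    print("".join(path), round(v[best], 2))  # BBB 4914.77 (= e^8.5)
    # Dividing v[best] by Z from the forward sketch gives the probability 0.59.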

18 / 21
Features
In general, we can have any type of feature depending on the entire input sequence, the position i, and the classes of xi and xi−1.
This relaxes the strong independence assumptions of the HMM.

Shallow parsing Sha & Pereira (2003)


Shallow parsing: identify chunks (= non-recursive noun phrases)
without analyzing their internal structure.
The chunker assigns a chunk label B, I or O (beginning of a chunk, inside a chunk, outside any chunk) to each word.
The CRF classifier assigns a pair of chunk labels to a word xi ,
namely the concatenation of the chunk labels of xi−1 and of xi .
All features f (yi−1 , yi , x, i) are indicator functions, indicating that
some predicate p(x, i) holds and
some predicate q(yi−1 , yi ) holds.

19 / 21
Features
Shallow parsing Sha & Pereira (2003)
Sample predicates p(x, i):
xi+2 = w,
xi−1 = w, xi = w′,
POS(xi−1) = t, POS(xi) = t′,
...
Sample predicates q(yi−1, yi):
yi = cc′ (c, c′ are the labels assigned by the chunker),
yi−1 = cc′, yi = c′c′′,
yi = ∗c′, where ∗ can be any label,
...

(w, w′ are specific words, t, t′ specific POS tags and c, c′, c′′ specific labels from {B, I, O}.)

In total, Sha & Pereira (2003) use 3.8 million features.
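The size of such feature sets comes from crossing input predicates with label predicates. The following sketch is purely illustrative (the predicates, names and counts are our own, not Sha & Pereira's actual feature templates); it shows how even two input predicates already yield 180 indicator features.

    from itertools import product

    chunk_labels = ["B", "I", "O"]
    # Pair labels y_i: concatenation of the chunk labels of x_{i-1} and x_i.
    pair_labels = [a + b for a, b in product(chunk_labels, repeat=2)]

    def make_feature(p, q):
        """Combine an input predicate p(x, i) and a label predicate q(y_prev, y_curr)."""
        return lambda y_prev, y_curr, x, i: 1 if p(x, i) and q(y_prev, y_curr) else 0

    # Two sample input predicates p(x, i); x is a list of (word, POS) pairs.
    input_predicates = [
        lambda x, i: i + 2 < len(x) and x[i + 2][0] == "of",             # x_{i+2} = "of"
        lambda x, i: i > 0 and x[i - 1][1] == "DT" and x[i][1] == "NN",  # POS bigram
    ]
    # Label predicates q(y_{i-1}, y_i): one per pair label plus one per label bigram.
    label_predicates = [lambda yp, yc, l=l: yc == l for l in pair_labels]
    label_predicates += [lambda yp, yc, a=a, b=b: yp == a and yc == b
                         for a, b in product(pair_labels, repeat=2)]

    features = [make_feature(p, q) for p, q in product(input_predicates, label_predicates)]
    print(len(features))  # 2 * (9 + 81) = 180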

20 / 21
References
Finkel, Jenny Rose, Trond Grenager & Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), 363–370. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1219840.1219885. http://dx.doi.org/10.3115/1219840.1219885.
Lafferty, John, Andrew McCallum & Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML).
Ratnaparkhi, Adwait. 1997. A simple introduction to maximum entropy models for natural language processing. Tech. Rep. 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
Samih, Younes & Wolfgang Maier. 2016. Detecting code-switching in Moroccan Arabic. In Proceedings of SocialNLP @ IJCAI-2016, New York. To appear.
Sha, Fei & Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, NAACL.
Wallach, Hanna M. 2002. Efficient training of conditional random fields: University of Edinburgh dissertation.
Wallach, Hanna M. 2004. Conditional random fields: An introduction. Tech. rep., University of Pennsylvania. Technical Report (CIS), Paper 22.
21 / 21
