Noun Phrase Extraction: A Description of Current Techniques
What is a noun phrase?
A phrase whose head is a noun or pronoun
optionally accompanied by a set of modifiers
Determiners:
• Articles: a, an, the
• Demonstratives: this, that, those
• Numerals: one, two, three
• Possessives: my, their, whose
• Quantifiers: some, many
Adjectives: the red ball
Relative clauses: the books that I bought yesterday
Prepositional phrases: the man with the black hat
Is that really what we want?
POS tagging already identifies pronouns and
nouns by themselves
The man whose red hat I borrowed yesterday in
the street that is next to my house lives next
door.
[The man [whose red hat [I borrowed yesterday]RC ]RC
[in the street]PP [that is next to my house]RC ]NP lives
[next door]NP.
Base Noun Phrases
[The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
How Prevalent is this Problem?
Simple rule NPE example
[Figure: a finite-state automaton with states S0, S1, and an accepting NP state; transitions labelled determiner/adjective, noun/pronoun, adjective, and relative clause/prepositional phrase/noun. A sketch of the rule follows below.]
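A minimal sketch of the simple rule as code, assuming one reading of the diagram: an optional determiner, any number of adjectives, then one or more nouns or pronouns form a base NP (the tag names here are illustrative).

# Sketch of the FSA rule: DT? ADJ* (NN|PRP)+ forms a base NP.
def chunk_base_nps(tagged):
    """tagged: list of (word, tag) pairs; returns a list of NP word lists."""
    nps, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == 'DT':              # S0 -> S1 on a determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == 'ADJ':          # stay in S1 on adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in ('NN', 'PRP'):  # S1 -> NP on nouns/pronouns
            k += 1
        if k > j:                          # reached the accepting NP state
            nps.append([w for w, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return nps

print(chunk_base_nps([('the', 'DT'), ('red', 'ADJ'), ('ball', 'NN'), ('bounced', 'VB')]))
# [['the', 'red', 'ball']]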
“Contextualization” and “lexicalization”
The ratio between the number of occurrences of a POS tag inside a chunk and the total number of occurrences of that POS tag in the training corpus (sketched below)
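A minimal sketch of computing that ratio per tag, assuming the training data is available as (tag, inside-chunk) observations; the function name and data layout are illustrative.

from collections import Counter

def chunk_ratio(observations):
    """observations: iterable of (pos_tag, inside_chunk) pairs from the training corpus.
    Returns, per tag: (# occurrences inside an NP chunk) / (total # occurrences)."""
    in_chunk, total = Counter(), Counter()
    for tag, inside in observations:
        total[tag] += 1
        if inside:
            in_chunk[tag] += 1
    return {tag: in_chunk[tag] / total[tag] for tag in total}

print(chunk_ratio([('DT', True), ('DT', True), ('VB', False), ('NN', True)]))
# {'DT': 1.0, 'VB': 0.0, 'NN': 1.0}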
Parsing FSAs, grammars, regular expressions: LR(k) Parsing
The L means a left-to-right scan of the input tokens; the R means a rightmost derivation (constructed in reverse); k is the number of lookahead tokens
1. E -> E + T
2. E -> E - T
3. E -> T
4. T -> T * F
5. T -> T / F
6. T -> F
7. F -> (E)
8. F -> i
LR Table for Exp Grammar
An LR(1) NPE Example
Grammar:
1. S  -> NP VP
2. NP -> Det N
3. NP -> N
4. VP -> V NP

Stack        Input    Action
[]           N V N    SH N
[N]          V N      RE 3) NP -> N
[NP]         V N      SH V
[NP V]       N        SH N
[NP V N]              RE 3) NP -> N
[NP V NP]             RE 4) VP -> V NP
[NP VP]               RE 1) S -> NP VP
[S]                   Accept!
(Abney, 1991)
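A toy shift-reduce sketch reproducing the trace above. This greedy reducer is not a full LR(1) parser (no action/goto table, no lookahead), but it makes the same SHIFT/REDUCE moves on the input N V N.

# Greedy shift-reduce sketch of the trace above. A real LR(1) parser would
# consult an action/goto table built from the grammar; here we simply try the
# rules against the top of the stack after every step.
RULES = [('S', ('NP', 'VP')),   # 1. S  -> NP VP
         ('NP', ('Det', 'N')),  # 2. NP -> Det N
         ('NP', ('N',)),        # 3. NP -> N
         ('VP', ('V', 'NP'))]   # 4. VP -> V NP

def parse(tokens):
    stack, buf = [], list(tokens)
    while buf or stack != ['S']:
        for lhs, rhs in RULES:
            if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]           # REDUCE by this rule
                stack.append(lhs)
                break
        else:
            if not buf:
                return None                     # stuck: no parse
            stack.append(buf.pop(0))            # SHIFT the next token
    return stack                                # ['S'] on success

print(parse(['N', 'V', 'N']))                   # ['S']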
Why isn’t this enough?
Unanticipated rules
Difficulty finding non-recursive, base NP’s
Structural ambiguity
Structural Ambiguity
“I saw the man with the telescope.”
[Two parse trees for the sentence: in one, the PP "with the telescope" attaches to the NP "the man" (the man has the telescope); in the other, it attaches to the VP headed by "saw" (the seeing is done with the telescope).]
What are the more current
solutions?
Machine Learning
Transformation-based Learning
Memory-based Learning
Maximum Entropy Model
Hidden Markov Model
Conditional Random Field
Support Vector Machines
Machine Learning means
TRAINING!
Corpus: a large, structured set of texts
Establish usage statistics
Learn linguistic rules
The Brown Corpus
American English, roughly 1 million words
Tagged with the parts of speech
http://www.edict.com.hk/concordance/WWWConcappE.htm
Transformation-based Machine
Learning
An ‘error-driven’ approach for learning an
ordered set of rules
1. Generate all rules that correct at least one error.
2. For each rule:
(a) Apply to a copy of the most recent
state of the training set.
(b) Score the result using the objective
function.
3. Select the rule with the best score.
4. Update the training set by applying the selected
rule.
5. Stop if the score is smaller than some pre-set
threshold T; otherwise repeat from step 1.
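A compressed sketch of that loop. The rule generator, the objective function, and the data layout are stand-ins, not a particular published system; a real learner (e.g. Brill-style) derives candidate rules from templates over tags and context.

def transformation_based_learning(train, gold, generate_rules, score, threshold):
    """train: current labelling of the training set (a mutable list);
    gold: the correct labelling;
    generate_rules(train, gold): candidate rules (callables) fixing >= 1 error;
    score(labelling, gold): objective function, higher is better;
    threshold: stop when the best rule scores below this value."""
    learned = []
    while True:
        candidates = generate_rules(train, gold)              # step 1
        if not candidates:
            return learned
        scored = [(score(rule(list(train)), gold), rule)      # steps 2a-2b
                  for rule in candidates]
        best_score, best_rule = max(scored, key=lambda s: s[0])
        if best_score < threshold:                            # step 5
            return learned
        learned.append(best_rule)                             # step 3: select the rule
        train[:] = best_rule(list(train))                     # step 4: update the set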
Transformation-based NPE
example
Input:
“Whitney/NN currently/ADV has/VB the/DT right/ADJ idea/NN.”
Expected output:
“[NP Whitney] [ADV currently] [VB has] [NP the right idea].”
Daelemans, 2005
Memory-based Machine Learning
Continued
Distance…?
The Overlapping Function: Count the number of
mismatching features
The Modified Value Difference Metric (MVDM): estimate a numeric distance between two “rules”
The distance between two N-dimensional vectors A, B with discrete (for example symbolic) elements, in a K-class problem, is computed using conditional probabilities:
d(A,B) = Σ j=1..N Σ i=1..K | P(Ci | Aj) - P(Ci | Bj) |
where P(Ci | Aj) is estimated by counting the number Ni(Aj) of times feature value Aj occurred in vectors belonging to class Ci and dividing it by the number of times Aj occurred for any class
Dusch, 1998
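A sketch of both metrics, with P(Ci | Aj) estimated from feature/class counts as described above; the data layout and names are illustrative.

from collections import Counter, defaultdict

def overlap(a, b):
    """Overlap function: count the mismatching features of two vectors."""
    return sum(x != y for x, y in zip(a, b))

def mvdm_tables(examples):
    """examples: list of (feature_vector, class_label) pairs.
    Returns an estimator p(c, j, v) = P(class c | value v at position j)."""
    value_counts = defaultdict(Counter)   # position -> value -> count
    class_counts = defaultdict(Counter)   # (position, value) -> class -> count
    for vec, label in examples:
        for j, v in enumerate(vec):
            value_counts[j][v] += 1
            class_counts[(j, v)][label] += 1
    def p(c, j, v):
        n = value_counts[j][v]
        return class_counts[(j, v)][c] / n if n else 0.0
    return p

def mvdm(a, b, p, classes):
    """d(A,B) = sum over positions j and classes i of |P(Ci|Aj) - P(Ci|Bj)|."""
    return sum(abs(p(c, j, a[j]) - p(c, j, b[j]))
               for j in range(len(a)) for c in classes)

examples = [(('DT', 'NN'), 'NP'), (('VB', 'NN'), 'VP'), (('DT', 'JJ'), 'NP')]
p = mvdm_tables(examples)
print(overlap(('DT', 'NN'), ('DT', 'JJ')))                 # 1
print(mvdm(('DT', 'NN'), ('VB', 'NN'), p, ['NP', 'VP']))   # 2.0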
Memory-based NPE example
Suppose we have the following candidate
sequence:
DT ADJ ADJ NN NN
• “The beautiful, intelligent summer intern”
In our rule set we have:
DT ADJ ADJ NN NNP
DT ADJ NN NN
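Under the overlap metric the first stored sequence is the nearer neighbour (one mismatch versus two, counting the length difference as a mismatch); a self-contained sketch:

candidate = ['DT', 'ADJ', 'ADJ', 'NN', 'NN']       # "The beautiful, intelligent summer intern"
memory    = [['DT', 'ADJ', 'ADJ', 'NN', 'NNP'],    # stored rule 1
             ['DT', 'ADJ', 'NN', 'NN']]            # stored rule 2

def overlap(a, b):
    """Mismatch count; the shorter sequence is padded so the extra
    positions of the longer one also count as mismatches."""
    n = max(len(a), len(b))
    a = list(a) + ['_'] * (n - len(a))
    b = list(b) + ['_'] * (n - len(b))
    return sum(x != y for x, y in zip(a, b))

nearest = min(memory, key=lambda m: overlap(candidate, m))
print(nearest, overlap(candidate, nearest))
# ['DT', 'ADJ', 'ADJ', 'NN', 'NNP'] 1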
Maximum Entropy
The least biased probability distribution consistent with the known information is the one that maximizes the information entropy, that is, the measure of uncertainty associated with a random variable.
Consider that we have m unique propositions
The most informative distribution is one in which we
know one of the propositions is true – information
entropy is 0
The least informative distribution is one in which there
is no reason to favor any one proposition over
another – information entropy is log m
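A quick numeric check of those two boundary cases, taking m = 5 as an illustrative value:

import math

def entropy(p):
    """Information entropy H(p) = -sum p_i log p_i (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

m = 5
certain = [1.0] + [0.0] * (m - 1)      # we know which proposition is true
uniform = [1.0 / m] * m                # no reason to favour any proposition

print(entropy(certain))                # 0.0
print(entropy(uniform), math.log(m))   # both ~1.609, i.e. log m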
Maximum Entropy applied to NPE
Let’s consider several French translations of the English word “in”
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Now suppose we find that either dans or en is chosen 30% of the time. We must add that constraint to the model and choose the most uniform distribution that satisfies it:
p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
What if we now find that either dans or à is used half of the time?
p(dans) + p(en) = .3
p(dans) + p(à) = .5
Now what is the most “uniform” distribution?
Berger, 1996
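One way to answer that is numerically: maximize the entropy subject to the three constraints (the probabilities sum to 1, p(dans) + p(en) = 0.3, p(dans) + p(à) = 0.5). A sketch using scipy's general-purpose optimizer rather than the iterative scaling used in practice:

import numpy as np
from scipy.optimize import minimize

labels = ['dans', 'en', 'à', 'au cours de', 'pendant']

def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12))     # minimizing -H(p) maximizes entropy

constraints = [
    {'type': 'eq', 'fun': lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {'type': 'eq', 'fun': lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 0.3
    {'type': 'eq', 'fun': lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à)  = 0.5
]

result = minimize(neg_entropy, x0=np.full(5, 0.2), method='SLSQP',
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)

for word, prob in zip(labels, result.x):
    print(f'{word:12s} {prob:.3f}')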
Hidden Markov Model
In a statistical model of a system possessing the
Markov property…
There is a discrete (finite) set of possible states
The probability distribution of future states depends
only on the present state and is independent of past
states
These states are not directly observable in a
hidden Markov model.
The goal is to determine the hidden properties
from the observable ones.
Hidden Markov Model
[Figure: HMM state diagram - x: hidden states, y: observable states, a: transition probabilities, b: output probabilities]
HMM Example
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
In this case, the weather possesses the Markov
property
HMM as applied to NPE
In the case of noun phrase extraction, the hidden
property is the unknown grammar “rule”
Our observations are formed by our training data
Contextual probabilities represent the state transitions: given our previous two transitions, what is the likelihood of continuing, ending, or beginning a noun phrase, P(o_j | o_j-1, o_j-2)
Output probabilities: given our current state, what is the likelihood of our current word being part of, beginning, or ending a noun phrase, P(i_j | o_j)
Best tag sequence: argmax over o_1…o_T of Π j=1..T P(o_j | o_j-1, o_j-2) · P(i_j | o_j)
The Viterbi Algorithm
Now that we’ve constructed this
probabilistic representation, we need to
traverse it
Finds the most likely sequence of states
Viterbi Algorithm
Whitney gave a painfully long presentation.
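A sketch of Viterbi over the Rainy/Sunny HMM from the earlier slide; the same dynamic program applies to NP chunk tags using the contextual and output probabilities above.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # V[t][s] = (best probability of any path ending in state s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({s: max(((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                           V[-1][prev][1] + [s]) for prev in states),
                         key=lambda t: t[0])
                  for s in states})
    return max(V[-1].values(), key=lambda t: t[0])

prob, path = viterbi(('walk', 'shop', 'clean'),
                     ('Rainy', 'Sunny'),
                     {'Rainy': 0.6, 'Sunny': 0.4},
                     {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                      'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}},
                     {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                      'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}})
print(path, prob)   # ['Sunny', 'Rainy', 'Rainy'] 0.01344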
Conditional Random Fields
An undirected graphical model in which each vertex represents a
random variable whose distribution is to be inferred, and each edge
represents a dependency between two random variables. In a CRF,
the distribution of each discrete random variable Y in the graph is
conditioned on an input sequence X = x1, …, xn-1, xn
Conditional Random Fields
The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs
The transition probabilities of the HMM
have been transformed into feature
functions that are conditional upon the
input sequence
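A sketch of what such feature functions can look like: each one may inspect the whole input sequence (capitalization, suffixes, surrounding words), which is exactly what the HMM's independence assumptions rule out. Toolkits such as CRF++ or sklearn-crfsuite consume features of roughly this shape; the specific features below are illustrative.

def token_features(X, j):
    """Feature function phi(X, j): it may inspect the entire input sequence X,
    not only the current word (the key difference from an HMM emission)."""
    word = X[j]
    return {
        'word.lower': word.lower(),
        'word.is_capitalized': word[0].isupper(),
        'word.suffix3': word[-3:],
        'prev.word': X[j - 1].lower() if j > 0 else '<BOS>',
        'next.word': X[j + 1].lower() if j < len(X) - 1 else '<EOS>',
    }

sentence = ['Whitney', 'has', 'the', 'right', 'idea']
features = [token_features(sentence, j) for j in range(len(sentence))]
print(features[0]['word.is_capitalized'], features[2]['next.word'])   # True right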
Support Vector Machines
We wish to plot a number of data points of dimension p and separate those points with a (p-1)-dimensional hyperplane that maximizes the margin between the two classes of points; this ensures the best generalization
These data points represent pattern samples whose dimension is dependent
upon the number of features used to describe them
http://www.csie.ntu.edu.tw/~cjlin/libsvm/#GUI
What if our points are separated by
a nonlinear barrier?
(Bikel, 1998)
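The usual answer is the kernel trick: implicitly map the points into a higher-dimensional space where a linear separator exists. A toy sketch with scikit-learn's RBF-kernel SVM (the two-ring data set is illustrative; a real NPE system builds its feature vectors from surrounding words and tags):

import numpy as np
from sklearn.svm import SVC

# Toy data no straight line can separate: class 1 forms a ring around class 0.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel='rbf', C=1.0)   # the RBF kernel handles the nonlinear boundary
clf.fit(X, y)
print(clf.score(X, y))           # close to 1.0 on this separable toy set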
Comparison of the primary works (method; implementation; evaluation data; performance as F-measure; pros; cons):

Dejean
  Method: simple rule-based, uses XML input
  Implementation: "ALLiS", not available
  Evaluation data: CONLL 2000 task
  F-measure: 92.09
  Pros: extremely simple and quick; doesn't require a training corpus
  Cons: not very robust; difficult to improve upon; extremely difficult to generate rules

Ramshaw, Marcus
  Method: Transformation-Based Learning
  Implementation: C++, Perl; available!
  Evaluation data: Penn Treebank
  F-measure: 92.03-93
  Pros: …
  Cons: extremely dependent upon the training set and its "completeness" (how many different ways the NPs are formed); requires a fair amount of memory

Tjong Kim Sang
  Method: Memory-Based Learning
  Implementation: "TiMBL", Python; available!
  Evaluation data: Penn Treebank, CONLL 2000 task
  F-measure: 93.34, 92.5
  Pros: highly suited to the NLP task
  Cons: has no ability to intelligently weight "important" features; also cannot identify feature dependency; both problems result in a loss of accuracy

Koeling
  Method: Maximum Entropy
  Implementation: not available
  Evaluation data: CONLL 2000 task
  F-measure: 91.97
  Pros: first statistical approach, higher accuracy
  Cons: always makes the best local decision without much regard at all for position

Molina, Pla
  Method: Hidden Markov Model
  Implementation: not available
  Evaluation data: CONLL 2000 task
  F-measure: 92.19
  Pros: takes position into account
  Cons: makes conditional independence assumptions which ignore special input features such as capitalization, suffixes, surrounding words

Sha, Pereira
  Method: Conditional Random Fields
  Implementation: Java, available… sort of; CRF++ in C++ by Kudo IS AVAILABLE!
  Evaluation data: Penn Treebank, CONLL 2000 task
  F-measure: 94.38 ("no significant difference")
  Pros: can handle millions of features; handles both position and dependencies
  Cons: "overfitting"

Kudo, Matsumoto
  Method: Support Vector Machines
  Implementation: C++, Perl, Python; available!
  Evaluation data: Penn Treebank, CONLL 2000 task
  F-measure: 94.22, 93.91
  Pros: minimizes error, resulting in higher accuracy; handles tons of features
  Cons: doesn't really take position into account