Week 9

Part of Speech tagging

 Introduction
 Parts of speech tagging
 Named entity recognition
Introduction
 NLP lies at the intersection of linguistics, computer science,
and artificial intelligence.
 In linguistics, there are 8 parts of speech (POS), attributed to
Dionysius Thrax of Alexandria (c. 1st C. BCE):
noun, verb, pronoun, preposition, adverb, conjunction,
participle, article
 Part-of-speech tagging is the procedure of marking up each
word in a text (corpus) as corresponding to a particular POS,
based on both its definition and its context.

CS3TM20 © XH 2
POS tagging is useful for
 Parsing: POS tagging can improve syntactic parsing
 MT: reordering of adjectives and nouns (say from Spanish
to English)
 Sentiment or affective tasks: may want to distinguish
adjectives or other POS
 Text-to-speech (how do we pronounce “lead” or "object"?)
 Or linguistic or language-analytic computational tasks
 Need to control for POS when studying linguistic change
like creation of new words, or meaning shift
 Or control for POS in measuring meaning similarity or
differences
Two classes of words: Open vs. Closed
 Open class words
 Usually content words: Nouns, Verbs, Adjectives,
Adverbs
 Plus interjections: oh, ouch, uh-huh, yes, hello
 New nouns and verbs like iPhone or to fax
 Closed class words
 Usually function words: short, frequent words with
grammatical function
 determiners: a, an, the
 pronouns: she, he, I
 prepositions: on, under, over, near, by, …
Open class ("content") words
 Nouns — Proper: Janet, Italy; Common: cat, cats, mango
 Verbs — Main: eat, went
 Adjectives: old, green, tasty
 Adverbs: slowly, yesterday
 Interjections: Ow, hello
 Numbers: 122,312, one
 … more

Closed class ("function") words
 Determiners: the, some
 Auxiliary verbs: can, had
 Prepositions: to, with
 Conjunctions: and, or
 Particles: off, up
 Pronouns: they, its
 … more
"Universal Dependencies" tagset (Nivre et al., 2016)
Part-of-Speech Tagging
 Assigning a part-of-speech to each word in a text.
 Map from sequence x1,…,xn of words to y1,…,yn of POS tags

The Penn Treebank part-of-speech tags
 Sample "Tagged" English sentences
There/PRO were/VERB 70/NUM children/NOUN
there/ADV ./PUNC
Preliminary/ADJ findings/NOUN were/AUX
reported/VERB in/ADP today/NOUN ’s/PART
New/PROPN England/PROPN Journal/PROPN
of/ADP Medicine/PROPN

 Words often have more than one POS. E.g., book:
VERB: Book that flight
NOUN: Hand me that book
How difficult is POS tagging in English?
 Roughly 15% of word types are ambiguous
• Hence 85% of word types are unambiguous
• Janet is always PROPN, hesitantly is always ADV
 But those 15% tend to be very common, so ~60% of word
tokens are ambiguous
 E.g., back
• earnings growth took a back/ADJ seat
• a small building in the back/NOUN
• a clear majority of senators back/VERB the bill
• enable the country to buy back/PART debt
• I was twenty-one back/ADV then
Sources of information for POS tagging
Janet will back the bill
(will: AUX/NOUN/VERB?   back: NOUN/VERB?)
 Prior probabilities of word/tag
"will" is usually an AUX
 Identity of neighboring words
"the" means the next word is probably not a verb
 Morphology and wordshape:
 Prefixes: unable: un- → ADJ
 Suffixes: importantly: -ly → ADV
 Capitalization: Janet: CAP → PROPN
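As a point of comparison for these information sources, a most-frequent-tag baseline uses only the first one: the prior probability of each word/tag pair. A minimal sketch (the toy corpus and the NOUN fallback for unseen words are hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus of (word, tag) pairs.
corpus = [("Janet", "PROPN"), ("will", "AUX"), ("back", "VERB"),
          ("the", "DET"), ("bill", "NOUN"), ("will", "AUX"),
          ("back", "VERB"), ("back", "NOUN"), ("will", "NOUN")]

# Count how often each tag was seen for each word (the prior).
counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1

def most_frequent_tag(word):
    """Tag a word with its most frequent training tag."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"  # crude fallback for unseen words

tags = [most_frequent_tag(w) for w in "Janet will back the bill".split()]
print(tags)  # ['PROPN', 'AUX', 'VERB', 'DET', 'NOUN']
```

Note that this baseline has no way to use context: it tags "back" as VERB everywhere, which is exactly what the HMM tagger later in these slides improves on.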
Named Entity Recognition (NER)
 Named entity, in its core usage, means anything that can
be referred to with a proper name. Most common 4 tags:
• PER (Person): “Marie Curie”
• LOC (Location): “New York City”
• ORG (Organization): “Stanford University”
• GPE (Geo-Political Entity): "Boulder, Colorado"
 Often multi-word phrases
 But the term is also extended to things that aren't entities:
dates, times, prices
 Segmentation issues
 In POS tagging, no segmentation problem since each word
gets one tag.
 In NER we must find and segment the entities!
 Type ambiguity
 Applications:
 Sentiment analysis: consumer’s sentiment toward a particular
company or person?
 Question Answering: answer questions about an entity?
 Information Extraction: Extracting facts about entities from text.
Hidden Markov Model (HMM) POS tagging

 Markov chain
 HMM
 Viterbi POS tagging algorithm
Markov Chain
 Consider a sequence of state variables q1, q2, …, qi.
 A Markov model embodies the Markov assumption on the
probabilities of this sequence:
P(qi = a | q1 … qi−1) = P(qi = a | qi−1)
(the transition probability between states depends only on the
previous state)

 A Markov chain is specified by the following components:
 Q = {q1, …, qN}: a set of N states
 A = [aij]: a transition probability matrix, each aij representing
the probability of moving from state i to state j
 π: an initial probability distribution over the states
 Weather example: Q = {hot, cold, warm}, with transition
probabilities (rows: current state, columns: next state):

        hot   cold  warm
hot     0.6   0.1   0.3
cold    0.1   0.8   0.1
warm    0.3   0.1   0.6

 Also need initial probabilities π for hot, cold and warm,
respectively.
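Under the Markov assumption, the probability of a state sequence is just the initial probability of the first state times a product of transition entries. A sketch using the hot/cold/warm weather states above (the transition values are read off the slide's figure; the initial distribution π is an assumed example, since the slide's values did not survive):

```python
# Transition matrix A for the weather chain (rows: current state).
A = {
    "hot":  {"hot": 0.6, "cold": 0.1, "warm": 0.3},
    "cold": {"hot": 0.1, "cold": 0.8, "warm": 0.1},
    "warm": {"hot": 0.3, "cold": 0.1, "warm": 0.6},
}
pi = {"hot": 0.5, "cold": 0.2, "warm": 0.3}  # assumed initial distribution

def sequence_probability(seq):
    """P(q1..qn) = pi(q1) * prod_i A[q_{i-1}][q_i] (Markov assumption)."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["hot", "hot", "warm"]))  # 0.5 * 0.6 * 0.3 = 0.09
```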
Hidden Markov Model
 A Markov chain is useful when we need to compute a
probability for a sequence of observable events.
 In many cases, however, the events we are interested in are
hidden.
 For example, we don’t normally observe part-of-speech tags
in a text. Rather, we see words, and must infer the tags from
the word sequence.
 A hidden Markov model (HMM) allows us to talk about both
observed events (like words that we see in the input) and
hidden events (like part-of-speech tags) that we think of as
causal factors in our probabilistic model.
Hidden Markov Model
An HMM is specified by the following components:
 Q = {q1, …, qN}: a set of N states
 A = [aij]: a transition probability matrix, each aij representing
the probability of moving from state i to state j
 O = o1 … oT: a sequence of T observations, each one drawn
from a vocabulary V = {v1, …, vV}
 B = bj(ot): a sequence of observation likelihoods, also called
emission probabilities, each expressing the probability of an
observation ot being generated from a state qj
 π: an initial probability distribution over states
 We still have the Markov assumption on the probabilities of
the tag sequence:
P(qi | q1 … qi−1) = P(qi | qi−1)
(the transition probability between states (tags) depends only
on the previous state)
 Plus a second assumption, output independence:
P(oi | q1 … qi, o1 … oi−1) = P(oi | qi)
(the emission probability of a word depends only on its tag state)
 The two matrices are estimated from counts in a tagged corpus:
 A: transition probabilities  P(ti | ti−1) = C(ti−1, ti) / C(ti−1)
 B: emission probabilities   P(wi | ti) = C(ti, wi) / C(ti)
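These count ratios can be computed directly from a tagged corpus. A sketch with a hypothetical two-sentence corpus, where `<s>` marks the start of a sentence:

```python
from collections import Counter

# Hypothetical tiny tagged corpus.
sentences = [
    [("Janet", "PROPN"), ("will", "AUX"), ("back", "VERB"),
     ("the", "DET"), ("bill", "NOUN")],
    [("the", "DET"), ("bill", "NOUN"), ("will", "AUX"), ("pass", "VERB")],
]

context_count = Counter()   # C(t_{i-1}), including the start symbol <s>
bigram_count = Counter()    # C(t_{i-1}, t_i)
tag_count = Counter()       # C(t_i)
emit_count = Counter()      # C(t_i, w_i)

for sent in sentences:
    prev = "<s>"
    for word, tag in sent:
        context_count[prev] += 1
        bigram_count[(prev, tag)] += 1
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        prev = tag

def trans(prev, tag):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return bigram_count[(prev, tag)] / context_count[prev]

def emit(tag, word):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emit_count[(tag, word)] / tag_count[tag]

print(trans("<s>", "PROPN"), emit("VERB", "back"))  # 0.5 0.5
```

In practice these maximum-likelihood estimates are smoothed, since any unseen tag bigram or word/tag pair would otherwise get probability zero.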
How?
Hidden states (tags), linked by transition probabilities:
NOUN → AUX → VERB → DET → NOUN
Each tag state generates an observed word, with emission probability:
Janet  will  back  the  bill
Input: observed words
Output: tags
Viterbi algorithm
 The Viterbi algorithm is a dynamic programming
(DP) algorithm for obtaining the maximum a posteriori
probability estimate of the most likely sequence of hidden
states.
 It first sets up a probability matrix or lattice, with one column
for each observation and one row for each state in the state
graph.
 vt(j) represents the maximum probability that the HMM is in
state j after seeing the first t observations and passing through
the most probable state sequence q1 … qt−1.
 Using DP, recursively: vt(j) = max_i ( vt−1(i) · aij · bj(ot) )
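The recursion can be sketched directly, keeping a backpointer alongside each lattice cell so the best path can be recovered at the end. The two-tag model at the bottom is hypothetical, just to exercise the function:

```python
def viterbi(obs, states, pi, A, B):
    """Most likely state sequence: v_t(j) = max_i v_{t-1}(i) * A[i][j] * B[j][o_t]."""
    # First lattice column from the initial distribution.
    v = [{s: pi[s] * B[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            # Best previous state for reaching j at time t.
            i = max(states, key=lambda s: v[t - 1][s] * A[s][j])
            v[t][j] = v[t - 1][i] * A[i][j] * B[j].get(obs[t], 0.0)
            back[t][j] = i
    # Backtrace from the best final state.
    path = [max(states, key=lambda j: v[-1][j])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical two-tag toy model.
states = ["NOUN", "VERB"]
pi = {"NOUN": 0.6, "VERB": 0.4}
A = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
B = {"NOUN": {"flight": 0.6, "book": 0.4}, "VERB": {"flight": 0.1, "book": 0.9}}
print(viterbi(["book", "flight"], states, pi, A, B))  # ['VERB', 'NOUN']
```

The cost is O(T · N²) for T observations and N states, versus O(N^T) for enumerating every tag sequence.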
Viterbi algorithm
[Worked example (lattice figures not recoverable from this export):
the lattice for "Janet will back the bill" is filled in column by
column, comparing vt−1(i) · aij · bj(ot) across candidate tags
(e.g. JJ, NN, RB, VB for "back") and keeping the maximum,
e.g. 0.010446 × 0.1698 = 0.0017737308.]
In class exercise : complete finding the tags for “ the bill”

Notes on exam style question:
 You will be given a short sentence and two probability
matrices; return the tagging.
 To simplify, negative log probabilities are used.
 Instead of vt(j) = max_i ( vt−1(i) · aij · bj(ot) ),
we track Vt(j) = min_i ( Vt−1(i) + ãij + b̃j(ot) ),
where ã = −log a and b̃ = −log b.
 This makes it convenient to work with matrices of all positive
values and use additions.
Exam style question:
Consider a sentence "word1 word2 word3". The following
matrices of the Hidden Markov Model are given as negative log
probabilities of (i) transition and (ii) emission respectively. Show
your working steps of constructing the Viterbi path and the tags.

(i)    NNP  MD  VB  NN        (ii)   word1  word2  word3
<s>    12   3   4   5         NNP    16     16     3
NNP    18   4   3   6         MD     2      18     18
MD     6    5   4   2         VB     18     18     2
VB     13   5   7   8         NN     18     7      18
NN     4    3   4   4

Abbreviations — NNP: proper noun, MD: modal, VB: verb, NN: noun
Answer:
V1(t) = trans(<s> → t) + emit(t, word1):

        word1
NNP     28
MD      5
VB      22
NN      23

V1 = min[28, 5, 22, 23] = 5; tag word1 as MD.
V2(t) = min_p ( V1(p) + trans(p → t) ) + emit(t, word2); here every
minimum comes from p = MD (V1 = 5):

        word1   word2
NNP     28      27
MD      5       28
VB      22      27
NN      23      14

V2 = min[27, 28, 27, 14] = 14; tag word2 as NN.
V3(t) = min_p ( V2(p) + trans(p → t) ) + emit(t, word3); here every
minimum comes from p = NN (V2 = 14):

        word1   word2   word3
NNP     28      27      21
MD      5       28      35
VB      22      27      20
NN      23      14      36

V3 = min[21, 35, 20, 36] = 20; tag word3 as VB.
Finally: word1 (MD), word2 (NN), word3 (VB).
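The worked answer above can be checked mechanically with a min-sum Viterbi over the given negative-log matrices:

```python
# Negative-log (min-sum) Viterbi for the exam question above.
tags = ["NNP", "MD", "VB", "NN"]
trans = {  # -log transition probabilities; "<s>" is the start state
    "<s>": {"NNP": 12, "MD": 3, "VB": 4, "NN": 5},
    "NNP": {"NNP": 18, "MD": 4, "VB": 3, "NN": 6},
    "MD":  {"NNP": 6,  "MD": 5, "VB": 4, "NN": 2},
    "VB":  {"NNP": 13, "MD": 5, "VB": 7, "NN": 8},
    "NN":  {"NNP": 4,  "MD": 3, "VB": 4, "NN": 4},
}
emit = {  # -log emission probabilities
    "NNP": {"word1": 16, "word2": 16, "word3": 3},
    "MD":  {"word1": 2,  "word2": 18, "word3": 18},
    "VB":  {"word1": 18, "word2": 18, "word3": 2},
    "NN":  {"word1": 18, "word2": 7,  "word3": 18},
}

words = ["word1", "word2", "word3"]
# First column: start transition plus emission.
V = [{t: trans["<s>"][t] + emit[t][words[0]] for t in tags}]
back = [{}]
for w in words[1:]:
    prev_col, col, bp = V[-1], {}, {}
    for t in tags:
        p = min(tags, key=lambda s: prev_col[s] + trans[s][t])
        col[t] = prev_col[p] + trans[p][t] + emit[t][w]
        bp[t] = p
    V.append(col)
    back.append(bp)

# Backtrace the minimum-cost path from the best final tag.
path = [min(tags, key=lambda t: V[-1][t])]
for bp in reversed(back[1:]):
    path.append(bp[path[-1]])
path.reverse()
print([min(col.values()) for col in V], path)  # [5, 14, 20] ['MD', 'NN', 'VB']
```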
