Unit 3
1
Objective
●
Types of PoS Tagger
– Rule-Based PoS Tagging
– Stochastic PoS Tagging
– Transformation-Based Tagging
●
Hidden Markov Model
2
Reference / Reading
●
Chapter 8
Speech and Language Processing. Daniel Jurafsky
& James H. Martin
– https://web.stanford.edu/~jurafsky/slp3/old_oct19/8.pdf
3
What is PoS?
●
A category to which a word is assigned in
accordance with its syntactic functions
●
The role a word plays in a sentence denotes
what part of speech it belongs to
●
In English the main parts of speech are
– noun, pronoun, adjective, determiner, verb, adverb,
preposition, conjunction, and interjection
4
Part of Speech
●
Noun
●
Adjective
●
Adverb
●
Verb
●
Preposition
●
Pronoun
●
Conjunctions
●
Interjections
5
PoS Tagsets
●
There are many parts of speech tagsets
●
Tag types
– Coarse-grained
●
Noun, verb, adjective, …
– Fine-grained
●
noun-proper-singular, noun-proper-plural, noun-common-mass, ..
●
verb-past, verb-present-3rd, verb-base, …
●
adjective-simple, adjective-comparative, ...
6
PoS Tagsets
●
Brown tagset (87 tags)
– Brown corpus
– https://en.wikipedia.org/wiki/Brown_Corpus
●
C5 tagset (61 tags)
●
C7 tagset (146 tags!)
●
Penn TreeBank (45 tags) – most used
– A large annotated corpus of English
– https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
●
UPenn TreeBank II - 36 tags
7
PoS Tag: Challenge
●
Words often have more than one POS
●
Ambiguity in POS tags
●
Out-of-vocabulary (OOV) words
●
Complex grammatical structure of the language
●
Lack of annotated datasets
●
Inconsistencies in annotated datasets
8
Types of PoS Taggers
●
There are different algorithms for tagging.
– Rule Based Tagging
– Transformation Based Tagging
– Statistical Tagging (HMM Part-of-Speech Tagging)
9
Rule-Based POS tagging
●
The rule-based approach uses handcrafted sets
of rules to tag the input sentence
●
There are two stages in rule-based taggers:
– First Stage: Uses a dictionary to assign each word a
list of potential parts-of-speech
– Second Stage: Uses a large list of handcrafted rules
to winnow down this list to a single part-of-speech
for each word
10
Rule-Based POS tagging
●
ENGTWOL is a rule-based tagger
– In the first stage, uses a two-level lexicon transducer
– In the second stage, uses hand-crafted rules (about
1100 rules).
●
Rule-1: if (the previous tag is an article)
then eliminate all verb tags
●
Rule-2: if (the next tag is verb)
then eliminate all verb tags
11
Rule-Based POS tagging
●
Example: He had a fly.
●
The first stage:
– He → he/pronoun
– had → have/verbpast have/auxiliarypast
– a → a/article
– fly → fly/verb fly/noun
●
The second stage:
– apply rule: if (the previous tag is an article) then eliminate all verb tags
●
he → he/pronoun
●
had → have/verbpast have/auxiliarypast
●
a → a/article
●
fly → fly/noun (the verb reading is eliminated; see the sketch below)
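A minimal Python sketch of the two-stage idea above; the toy lexicon and the single constraint rule stand in for ENGTWOL's two-level lexicon and its roughly 1100 hand-crafted rules.

LEXICON = {
    "he":  ["pronoun"],
    "had": ["verb-past", "auxiliary-past"],
    "a":   ["article"],
    "fly": ["verb", "noun"],
}

def rule_based_tag(sentence):
    # Stage 1: look up every word's candidate tags in the lexicon.
    candidates = [list(LEXICON[w.lower()]) for w in sentence]
    # Stage 2: apply hand-crafted constraint rules to prune the candidates.
    for i in range(1, len(candidates)):
        # Rule 1: if the previous word is an article, eliminate all verb tags.
        if "article" in candidates[i - 1]:
            pruned = [t for t in candidates[i] if not t.startswith("verb")]
            if pruned:                      # never prune away every reading
                candidates[i] = pruned
    return list(zip(sentence, candidates))

print(rule_based_tag(["He", "had", "a", "fly"]))
# [('He', ['pronoun']), ('had', ['verb-past', 'auxiliary-past']),
#  ('a', ['article']), ('fly', ['noun'])]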
12
Transformation-based tagging
●
Transformation-based tagging is also known as
Brill Tagging.
– Brill Tagging uses transformation rules, and the rules are
learned from a tagged corpus.
– These learned rules are then used in tagging.
●
Before the rules are applied, the tagger labels
every word with its most likely tag.
– We get these most likely tags from a tagged corpus.
13
Transformation-based tagging
●
Example:
– He is expected to race tomorrow
He/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN
●
After selecting most-likely tags, we apply
transformation rules.
– Change NN to VB when the previous tag is TO
●
This rule converts race/NN into race/VB
●
This may not work for every case, e.g.,
"..... according to race", where race should stay a noun (see the sketch below)
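A minimal sketch of applying one learned transformation to the initial most-likely tagging; the sentence, initial tags, and rule are the ones shown on this slide.

words = ["He", "is", "expected", "to", "race", "tomorrow"]
tags  = ["PRP", "VBZ", "VBN", "TO", "NN", "NN"]   # initial most-likely tags

def apply_transformation(tags, from_tag, to_tag, prev_tag):
    # "Change from_tag to to_tag when the previous tag is prev_tag."
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

print(apply_transformation(tags, "NN", "VB", "TO"))
# ['PRP', 'VBZ', 'VBN', 'TO', 'VB', 'NN']  -> race/NN becomes race/VB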
14
Brill Tagger – How rules are learned?
●
We assume that we have a tagged corpus. The Brill Tagger
algorithm has three major steps.
– Tag the corpus with the most likely tag for each word (unigram model)
– Choose a transformation that deterministically replaces an existing tag
with a new tag such that the resulting tagged training corpus has the
lowest error rate out of all transformations.
– Apply the transformation to the training corpus.
●
These steps are repeated until a stopping criterion is reached
●
The result (which will be our tagger) will be:
– First, tag each word with its most-likely tag
– Then apply the learned transformations in the order they were learned (see the sketch below)
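A high-level sketch of the greedy learning loop described above. Here the training corpus is simplified to a flat list of words with gold tags, and each candidate rule is assumed to be a function that maps a tag sequence to a new tag sequence (instantiated from templates such as those on the next slide).

def tagging_error(tagged, gold):
    # Number of positions where the current tags disagree with the gold tags.
    return sum(t != g for t, g in zip(tagged, gold))

def learn_brill_rules(words, gold_tags, most_likely_tag, candidate_rules, max_rules=20):
    # Step 1: tag the corpus with the most likely (unigram) tag for each word.
    current = [most_likely_tag[w] for w in words]
    learned = []
    for _ in range(max_rules):
        # Step 2: choose the transformation that yields the lowest error rate.
        best = min(candidate_rules,
                   key=lambda rule: tagging_error(rule(current), gold_tags))
        if tagging_error(best(current), gold_tags) >= tagging_error(current, gold_tags):
            break                              # stopping criterion: no improvement
        # Step 3: apply it to the training corpus and record it, in order.
        current = best(current)
        learned.append(best)
    return learned                             # apply these, in order, at tagging time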
15
Brill Tagger – Transformation Rules?
●
Change tag a to tag b when
– The preceding (following) word is tagged z.
– The word two before (after) is tagged z.
– One of two preceding (following) words is tagged z.
– One of three preceding (following) words is tagged z.
– The preceding word is tagged z and the following word is
tagged w.
– The preceding (following) word is tagged z and the word two
before (after) is tagged w.
16
Methods of PoS Tagging
●
Stochastic (Probabilistic) tagging
– e.g., TnT [Brants, 2000]
●
Trigrams'n'Tags
●
Based on a Markov model
– Original paper: https://dl.acm.org/doi/pdf/10.3115/974147.974178
17
Hidden Markov Model based PoS
Tagging
18
Markov Chains
●
A Markov chain is a model that tells us
something about the probabilities of sequences
of states (random variables)
– A Markov chain makes a very strong assumption
that if we want to predict the future in the sequence,
all that matters is the current state (Markov
assumption)
– All states before the current state have no impact on
the future except via the current state
19
Markov Chains
●
Markov Assumption:
– Consider a sequence of state variables q1, q2, ..., qi.
A Markov model embodies the Markov assumption on the
probabilities of this sequence: when predicting the future,
the past doesn't matter, only the present:
– P(qi = a | q1, ..., qi−1) = P(qi = a | qi−1)
20
Markov Chains
●
A Markov chain is specified by the following
components:
– Q = q1, q2, ..., qN : a set of N states
– A = a11, a12, ..., aNN : a transition probability matrix, each aij representing the
probability of moving from state i to state j, with Σj aij = 1 for all i
– π = π1, π2, ..., πN : an initial probability distribution over states, with Σi πi = 1
21
Markov Chains
●
Markov chain for weather events
– Vocabulary : HOT, COLD, and WARM
●
States are represented as nodes
●
Transitions, with their probabilities, as edges
●
A start distribution π is required.
– setting π = [0.1, 0.7, 0.2] would mean a probability 0.7 of starting in state 2
(cold), probability 0.1 of starting in state 1 (hot), etc.
●
Probability of the sequence: cold - hot - hot - warm
– P(cold hot hot warm) = π2 * P(hot|cold) * P(hot|hot) * P(warm|hot)
= 0.7 * 0.1 * 0.6 * 0.3 = 0.0126 (see the sketch below)
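A short sketch that computes such sequence probabilities from the start distribution and transition matrix. Only π, P(hot|cold), P(hot|hot), and P(warm|hot) appear on this slide; the remaining transition values below are assumed from the weather figure in the cited chapter and are marked as such.

# Start distribution and transitions for the weather chain.
# Entries marked (*) are assumed from the figure in the cited chapter.
pi = {"hot": 0.1, "cold": 0.7, "warm": 0.2}
A = {
    "hot":  {"hot": 0.6, "cold": 0.1, "warm": 0.3},   # cold entry (*)
    "cold": {"hot": 0.1, "cold": 0.8, "warm": 0.1},   # cold, warm entries (*)
    "warm": {"hot": 0.3, "cold": 0.1, "warm": 0.6},   # all entries (*)
}

def sequence_probability(states):
    # P(s1, s2, ..., sn) = pi(s1) * product of A[s_{i-1}][s_i]
    p = pi[states[0]]
    for prev, curr in zip(states, states[1:]):
        p *= A[prev][curr]
    return p

print(sequence_probability(["cold", "hot", "hot", "warm"]))  # 0.7*0.1*0.6*0.3 = 0.0126
print(sequence_probability(["hot", "hot", "hot", "hot"]))    # exercise on the next slide
print(sequence_probability(["cold", "hot", "cold", "hot"]))  # exercise on the next slide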
22
Markov Chains
●
Compute the probability of each of the following
sequences:
– hot hot hot hot
– cold hot cold hot
●
What does the difference in these probabilities tell
you about a real-world weather fact encoded in the
figure?
23
Markov Chains
●
A Markov chain is useful for computing the probability of a sequence of
observable events.
– In many cases, the events we are interested in are hidden events:
●
We don’t observe hidden events directly.
●
For example we don’t normally observe part-of-speech tags in a text.
Rather, we see words, and must infer the tags from the word sequence.
●
We call the tags hidden because they are not observed.
●
A Hidden Markov model (HMM) allows us to talk about both
observed events (like words that we see in the input) and hidden
events (like part-of-speech tags) that we think of as causal factors
in our probabilistic model.
24
Hidden Markov Model
●
An HMM is specified by the following
components:
– Q = q1, q2, ..., qN : a set of N states
– A : a transition probability matrix, each aij representing the probability of
moving from state i to state j
– O = o1, o2, ..., oT : a sequence of T observations
– B = bi(ot) : observation likelihoods (emission probabilities), the probability of
observation ot being generated from state i
– π : an initial probability distribution over states
25
First-Order Hidden Markov Model
●
A first-order hidden Markov model uses two simplifying
assumptions:
1) As with a first-order Markov chain, the probability of a particular state
depends only on the previous state:
P(qi | q1, ..., qi−1) = P(qi | qi−1)
2) Output independence: the probability of an output observation oi depends
only on the state that produced it, qi:
P(oi | q1, ..., qT, o1, ..., oT) = P(oi | qi)
27
The components of an HMM tagger
●
The A transition probabilities P(ti|ti−1) represent the probability of a tag
occurring given the previous tag.
– The MLE of the transition probability is
P(ti|ti−1) = C(ti−1, ti) / C(ti−1)
– In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed by
VB 10471 times, for an MLE estimate of
P(VB|MD) = C(MD, VB) / C(MD) = 10471 / 13124 = 0.80
28
The components of an HMM tagger
●
The B emission probabilities P(wi|ti) represent the probability, given a tag
(say MD), that it will be associated with a given word (say will).
– The MLE of the emission probability is
P(wi|ti) = C(ti, wi) / C(ti)  (estimated in the sketch below)
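A minimal sketch that estimates both the A transition table and the B emission table by MLE from a tagged corpus; the corpus format used here, a list of sentences of (word, tag) pairs, is an assumption for illustration.

from collections import Counter

def estimate_hmm(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"                        # sentence-start pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1   # C(t_{i-1}, t_i)
            emit_count[(tag, word)] += 1    # C(t_i, w_i)
            tag_count[tag] += 1             # C(t_i)
            prev = tag
    # MLE:  P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    #       P(w_i | t_i)     = C(t_i, w_i)     / C(t_i)
    A = {bigram: c / tag_count[bigram[0]] for bigram, c in trans_count.items()}
    B = {pair:   c / tag_count[pair[0]]   for pair,   c in emit_count.items()}
    return A, B

# With WSJ-scale counts, e.g. C(MD, VB) = 10471 and C(MD) = 13124, this would
# give A[("MD", "VB")] = 10471 / 13124 ≈ 0.80, matching the previous slide.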
29
HMM tagger
●
The A transition probabilities and B observation likelihoods (emission
probabilities) of the HMM are illustrated for three states of an HMM
part-of-speech tagger; the full tagger would have one state for each tag
30
HMM tagger
●
States: Set of part-of-speech tags.
●
Transition Probabilities: Tag transition probabilities
– A tag transition probability P(tagb | taga) represents the probability of a tag tagb occurring given the
previous tag taga.
●
Observations: Words (Vocabulary)
– Observation Likelihoods: Emission Probabilities P(word|tag)
– An emission probability P(word | tag) represents the probability of the tag producing the word.
●
Initial Probability Distribution: First Tag Probabilities P(tag |<s>) in sentences.
31
HMM Tagging as Decoding
●
For an HMM that contains hidden variables, the task of
determining the sequence of hidden variables corresponding to
the sequence of observations is called decoding.
●
Decoding:
– Given as input an HMM λ = (A, B) (transition probabilities and observation
likelihoods) and a sequence of observations O = o1, ..., oT, find the most probable
sequence of states Q = q1, ..., qT.
●
For part-of-speech tagging, we will find the most probable
sequence of tags t1,…,tn (hidden variables) for a given
sequence of words w1,…,wn (observations).
32
HMM - Decoding
●
For PoS tagging, decoding means choosing the tag sequence t1, ..., tn that is
most probable given the observed word sequence w1, ..., wn:
– t̂1:n = argmax over t1...tn of P(t1, ..., tn | w1, ..., wn)
●
Using Bayes' rule, and dropping the denominator P(w1, ..., wn), which is the
same for every tag sequence:
– t̂1:n = argmax over t1...tn of P(w1, ..., wn | t1, ..., tn) P(t1, ..., tn)
33
HMM - Decoding
●
HMM taggers make two further simplifying
assumptions
– The first is that the probability of a word appearing depends only on its
own tag and is independent of neighboring words and tags:
P(w1, ..., wn | t1, ..., tn) ≈ ∏i P(wi | ti)
– The second (the bigram assumption) is that the probability of a tag depends
only on the previous tag:
P(t1, ..., tn) ≈ ∏i P(ti | ti−1)
34
HMM - Decoding
– Plugging in the two simplifying assumptions results in the following
equation for the most probable tag sequence from a bigram tagger:
t̂1:n = argmax over t1...tn of ∏i P(wi | ti) P(ti | ti−1)
36
Working of Viterbi Algorithm
[Figure: Viterbi trellis: the word sequence O1, O2, ... runs along one axis, the set of tags along the other, and the most probable tag sequence is traced through the trellis. See the sketch below.]
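A compact Viterbi decoder for the bigram tagger equation above; representing π, A, and B as plain dictionaries is an illustrative choice, and a real tagger would also work in log space and smooth or back off for unknown words.

def viterbi(words, tags, pi, A, B):
    # pi[t] = P(t | <s>);  A[(t_prev, t)] = P(t | t_prev);  B[(t, w)] = P(w | t)
    V = [{t: pi.get(t, 0.0) * B.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        V.append({})
        back.append({})
        for t in tags:
            # best previous tag for reaching tag t at position i
            prev, score = max(
                ((p, V[i - 1][p] * A.get((p, t), 0.0) * B.get((t, w), 0.0)) for p in tags),
                key=lambda x: x[1])
            V[i][t], back[i][t] = score, prev
    # termination: pick the best final tag, then follow the backpointers
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

Calling viterbi(["Janet", "will", "back", "the", "bill"], tagset, pi, A, B) with WSJ-trained tables would fill in the matrix worked through on the following slides.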
37
Working of Viterbi Algorithm
38
Working of Viterbi Algorithm
39
Working of Viterbi Algorithm
40
Viterbi Algorithm - Example
●
Let’s tag the sentence Janet will back the bill
41
Viterbi Algorithm - Example
●
Viterbi[NNP,Janet]
= P(NNP|<s>)*P(Janet|NNP)
= 0.2767 * 0.000032 = 0.00000885
= 8.85 × 10^-6 (checked numerically below)
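The same initialization step, checked numerically (both values are copied from this slide):

p_start_NNP = 0.2767           # pi(NNP) = P(NNP | <s>)
p_janet_given_NNP = 0.000032   # B(NNP, "Janet") = P(Janet | NNP)
print(p_start_NNP * p_janet_given_NNP)   # 8.8544e-06, i.e. about 8.85 × 10^-6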
42
Viterbi Algorithm - Example
43
Viterbi Algorithm - Example
44
Viterbi Algorithm - Example
45
Viterbi Algorithm - Example
46
Viterbi Algorithm - Example
●
Viterbi Matrix for
●
Janet will back the bill
– Janet/NNP
– will/MD
– back/VB
– the/DT
– bill/NN
47
Self Study
●
Beam search is a variant of Viterbi decoding that
maintains only a fraction of the high-scoring states, rather than
all states, during decoding.
●
Maximum Entropy Markov Model (MEMM) taggers are
another type of tagger: they train logistic regression
models to pick the best tag given a word, its context, and
its previous tags, using feature templates.
48
Reference
●
WSJ Corpus
– https://www.spsc.tugraz.at/databases-and-tools/wall-street-journal-corpus.html
– https://aclanthology.org/H92-1073.pdf
49
Thank you
50