
Unit 3

Part of Speech Tagging


Types of PoS Tagging, Hidden Markov Model

Natural Language Processing (NLP)


MDS 555

1
Objective

Types of PoS Taggers
– Rule-Based PoS Tagging
– Stochastic PoS Tagging
– Transformation-Based Tagging

Hidden Markov Model

2
Reference / Reading

Chapter 8
Speech and Language Processing. Daniel Jurafsky
& James H. Martin
– https://web.stanford.edu/~jurafsky/slp3/old_oct19/8.pdf

3
What is PoS?

A category to which a word is assigned in
accordance with its syntactic functions

The role a word plays in a sentence denotes
what part of speech it belongs to

In English the main parts of speech are
– noun, pronoun, adjective, determiner, verb, adverb,
preposition, conjunction, and interjection

4
Part of Speech

Noun

Adjective

Adverb

Verb

Preposition

Pronoun

Conjunctions

Interjections
5
PoS Tagsets

There are many parts of speech tagsets

Tag types
– Coarse-grained

Noun, verb, adjective, …
– Fine-grained

noun-proper-singular, noun-proper-plural, noun-common-mass, ..

verb-past, verb-present-3rd, verb-base, …

adjective-simple, adjective-comparative, ...

6
PoS Tagsets

Brown tagset (87 tags)
– Brown corpus
– https://en.wikipedia.org/wiki/Brown_Corpus

C5 tagset (61 tags)

C7 tagset (146 tags!)

Penn Treebank (45 tags) – most widely used
– A tagset from a large annotated corpus of English
– https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

UPenn TreeBank II - 36 tags
7
PoS Tag: Challenge

Words often have more than one POS

Ambiguity in POS tags

Out-of-vocabulary (OOV) words

Complex grammatical structure of the language

Lack of annotated dataset

Inconsistencies in annotated dataset

8
Types of PoS Taggers

There are different algorithms for tagging.
– Rule Based Tagging
– Transformation Based Tagging
– Statistical Tagging (HMM Part-of-Speech Tagging)

9
Rule-Based POS tagging

The rule-based approach uses handcrafted sets of rules to tag the input sentence

There are two stages in rule-based taggers:
– First Stage: Uses a dictionary to assign each word a
list of potential parts-of-speech
– Second Stage: Uses a large list of handcrafted rules to winnow this list down to a single part-of-speech for each word
10
Rule-Based POS tagging

ENGTWOL is a rule-based tagger
– The first stage uses a two-level lexicon transducer
– The second stage uses hand-crafted rules (about 1,100 rules).

Rule-1: if (the previous tag is an article)
then eliminate all verb tags

Rule-2: if (the next tag is verb)
then eliminate all verb tags

11
Rule-Based POS tagging

Example: He had a fly.

The first stage:
– He → he/pronoun
– had → have/verb-past have/auxiliary-past
– a → a/article
– fly → fly/verb fly/noun

The second stage:
– apply rule: if (the previous tag is an article) then eliminate all verb tags

He → he/pronoun

had → have/verb-past have/auxiliary-past

a → a/article

fly → fly/noun
12
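A minimal Python sketch of the two stages above (this is not the ENGTWOL tagger; the tiny lexicon and the single elimination rule are illustrative assumptions based on the example):

# Minimal sketch of a two-stage rule-based tagger (not ENGTWOL itself).
# The lexicon and the single rule are simplified assumptions for illustration.
lexicon = {
    "he":  ["pronoun"],
    "had": ["verb-past", "auxiliary-past"],
    "a":   ["article"],
    "fly": ["verb", "noun"],
}

def eliminate_verbs_after_article(candidates, prev_candidates):
    # Rule-1: if the previous word is (unambiguously) an article,
    # eliminate all verb tags for the current word.
    if prev_candidates == ["article"]:
        filtered = [t for t in candidates if not t.startswith("verb")]
        return filtered or candidates  # never remove every candidate
    return candidates

def tag(sentence):
    # Stage 1: assign each word its list of potential tags from the lexicon.
    candidates = [lexicon[w.lower()] for w in sentence]
    # Stage 2: apply handcrafted rules to narrow each list down.
    result = []
    for i, cands in enumerate(candidates):
        prev = candidates[i - 1] if i > 0 else []
        result.append(eliminate_verbs_after_article(cands, prev))
    return list(zip(sentence, result))

print(tag(["He", "had", "a", "fly"]))   # fly keeps only the noun reading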
Transformation-based tagging

Transformation-based tagging is also known as
Brill Tagging.
– Brill tagging uses transformation rules, which are learned from a tagged corpus.
– These learned rules are then used in tagging.

Before the rules are applied, the tagger labels
every word with its most likely tag.
– We get these most likely tags from a tagged corpus.
13
Transformation-based tagging

Example:
– He is expected to race tomorrow
he/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN

After selecting most-likely tags, we apply
transformation rules.
– Change NN to VB when the previous tag is TO

This rule converts race/NN into race/VB

This may not work for every case
– e.g., "… according to race", where race is a noun but the rule would still change NN to VB
14
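A small Python sketch of applying this kind of transformation rule to the most-likely-tag output (the rule and tags follow the slide; the helper function is an illustrative assumption):

# Apply one Brill-style transformation rule to an initially tagged sentence.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    # Change from_tag to to_tag whenever the previous word's tag is prev_tag.
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("He", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
           ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO"))
# race/NN becomes race/VB; tomorrow/NN is unchanged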
Brill Tagger – How are rules learned?

We assume that we have a tagged corpus. The Brill tagger
learning algorithm has three major steps.
– Tag the corpus with the most likely tag for each word (unigram model)
– Choose a transformation that deterministically replaces an existing tag
with a new tag such that the resulting tagged training corpus has the
lowest error rate out of all transformations.
– Apply the transformation to the training corpus.

These steps are repeated until a stopping criterion is reached

The result (which will be our tagger) will be:
– First, tag using the most likely tags
– Then, apply the learned transformations in the order they were learned.
15
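A rough Python sketch of this greedy learning loop, under the assumption that the training corpus is a flat list of words with gold tags and that most_likely_tag, candidate_rules and apply_transformation are provided elsewhere:

# Greedy Brill-style learning: repeatedly pick the transformation that most
# reduces the error rate on the training corpus, then apply it.
def learn_transformations(words, gold_tags, most_likely_tag,
                          candidate_rules, apply_transformation, max_rules=10):
    current = [most_likely_tag[w] for w in words]   # step 1: unigram tagging
    learned = []

    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    for _ in range(max_rules):
        # step 2: choose the transformation with the lowest resulting error rate
        best = min(candidate_rules,
                   key=lambda r: errors(apply_transformation(current, r)))
        if errors(apply_transformation(current, best)) >= errors(current):
            break                                        # stopping criterion
        current = apply_transformation(current, best)    # step 3: apply it
        learned.append(best)
    return learned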
Brill Tagger – Transformation Rules?

Change tag a to tag b when
– The preceding (following) word is tagged z.
– The word two before (after) is tagged z.
– One of two preceding (following) words is tagged z.
– One of three preceding (following) words is tagged z.
– The preceding word is tagged z and the following word is
tagged w.
– The preceding (following) word is tagged z and the word two
before (after) is tagged w.
16
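As a sketch, a few of these templates can be written as predicate functions over a tag sequence; each template is instantiated with concrete tags (z) during learning (the names and representation here are assumptions):

# A few Brill transformation templates as predicates over a tag sequence.
templates = {
    "prev_is":      lambda tags, i, z: i >= 1 and tags[i - 1] == z,
    "two_before":   lambda tags, i, z: i >= 2 and tags[i - 2] == z,
    "one_of_prev2": lambda tags, i, z: z in tags[max(0, i - 2):i],
    "next_is":      lambda tags, i, z: i + 1 < len(tags) and tags[i + 1] == z,
}

tags = ["TO", "NN"]
print(templates["prev_is"](tags, 1, "TO"))   # True: the tag at position 1 follows TO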
Methods of PoS Tagging

Stochastic (Probabilistic) tagging
– e.g., TnT [Brants, 2000]

Trigrams'n'Tags (TnT)

Based on a Markov model
– Original paper: https://dl.acm.org/doi/pdf/10.3115/974147.974178

17
Hidden Markov Model based PoS
Tagging

18
Markov Chains

A Markov chain is a model that tells us
something about the probabilities of sequences
of states (random variables)
– A Markov chain makes a very strong assumption
that if we want to predict the future in the sequence,
all that matters is the current state (Markov
assumption)
– All states before the current state have no impact on
the future except via the current state
19
Markov Chains

Markov Assumption:
– Consider a sequence of state variables q1, q2, ..., qi.
A Markov model embodies the Markov assumption on the probabilities of this sequence: when predicting the future, the past doesn't matter, only the present:

P(qi | q1…qi-1) = P(qi | qi-1)

20
Markov Chains

A Markov chain is specified by the following components:
– Q = q1 q2 … qN : a set of N states
– A = a11 a12 … aNN : a transition probability matrix, where each aij represents the probability of moving from state i to state j, with Σj aij = 1 for all i
– π = π1, π2, …, πN : an initial probability distribution over the states

21
Markov Chains

Markov chain for weather events
– States: HOT, COLD, and WARM

States are represented as nodes

Transitions, with their probabilities, as edges

A start distribution π is required.
– setting π = [0.1, 0.7, 0.2] would mean a probability 0.7 of starting in state 2
(cold), probability 0.1 of starting in state 1 (hot), etc.

Probability of the sequence: cold – hot – hot – warm
– P(cold hot hot warm) = π2 × P(hot|cold) × P(hot|hot) × P(warm|hot)
= 0.7 × 0.1 × 0.6 × 0.3 = 0.0126
22
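A short Python sketch of this chain-rule computation. Only the transition values quoted on the slide are filled in; the remaining entries of the transition matrix come from the weather figure and are omitted here:

# P(q1..qn) = pi[q1] * prod_i P(qi | qi-1) for a Markov chain.
pi = {"hot": 0.1, "cold": 0.7, "warm": 0.2}
A = {
    "cold": {"hot": 0.1},               # P(hot|cold), from the slide
    "hot":  {"hot": 0.6, "warm": 0.3},  # P(hot|hot) and P(warm|hot), from the slide
}

def sequence_probability(states, pi, A):
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

print(sequence_probability(["cold", "hot", "hot", "warm"], pi, A))
# 0.7 * 0.1 * 0.6 * 0.3 = 0.0126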
Markov Chains

Compute the probability of each of the following
sequences:
– hot hot hot hot
– cold hot cold hot


What does the difference in these probabilities tell you about a real-world weather fact encoded in the figure?
23
Markov Chains

A Markov chain is useful for computing the probability of a sequence of observable events.
– In many cases, the events we are interested in are hidden events:

We don’t observe hidden events directly.

For example we don’t normally observe part-of-speech tags in a text.
Rather, we see words, and must infer the tags from the word sequence.

We call the tags hidden because they are not observed.

A Hidden Markov model (HMM) allows us to talk about both
observed events (like words that we see in the input) and hidden
events (like part-of-speech tags) that we think of as causal factors
in our probabilistic model.
24
Hidden Markov Model

An HMM is specified by the following components:
– Q = q1 q2 … qN : a set of N states
– A : a transition probability matrix, where aij is the probability of moving from state i to state j
– O = o1 o2 … oT : a sequence of T observations
– B = bi(ot) : a sequence of observation likelihoods (emission probabilities), each expressing the probability of observation ot being generated from state i
– π : an initial probability distribution over the states

25
First-Order Hidden Markov Model

A first-order hidden Markov model uses two simplifying
assumptions:
1) As with a first-order Markov chain, the probability of a particular state
depends only on the previous state:

Markov Assumption: P(qi | q1…qi-1) = P(qi | qi-1)

2) The probability of an output observation oi depends only on the state that produced the observation, qi, and not on any other states or observations:

Output Independence: P(oi | q1…qi…qn, o1…oi…on) = P(oi | qi)


26
The components of an HMM tagger

An HMM has two components, the A and B
probabilities

The A matrix contains the tag transition probabilities P(ti|ti−1) which
represent the probability of a tag occurring given the previous tag.
– For example, modal verbs (MD) like will are very likely to be followed by a verb
in the base form (VB), like race, so we expect this probability to be high.
– We compute the maximum likelihood estimate of this transition probability by counting, out of the times we see the first tag in a labeled corpus, how often it is followed by the second:

P(ti | ti−1) = C(ti−1, ti) / C(ti−1)

27
The components of an HMM tagger
– In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed by VB 10471 times, for an MLE estimate of

P(VB | MD) = C(MD, VB) / C(MD) = 10471 / 13124 = 0.80

– In HMM tagging, the probabilities are estimated by counting on a tagged training corpus.
28
The components of an HMM tagger

The B emission probabilities P(wi|ti) represent the probability, given a tag (say MD), that it will be associated with a given word (say will).
– The MLE of the emission probability is

P(wi | ti) = C(ti, wi) / C(ti)

– Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046 times:

P(will | MD) = C(MD, will) / C(MD) = 4046 / 13124 = 0.31

29
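The counting above can be sketched in Python as follows; the corpus format (a list of sentences, each a list of (word, tag) pairs) is an assumption:

# MLE estimates of HMM transition (A) and emission (B) probabilities:
# P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1) and P(w_i | t_i) = C(t_i, w_i) / C(t_i).
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    tag_count = Counter()
    transitions = defaultdict(Counter)   # transitions[prev_tag][tag]
    emissions = defaultdict(Counter)     # emissions[tag][word]
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            tag_count[tag] += 1
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    A = {p: {t: c / sum(counts.values()) for t, c in counts.items()}
         for p, counts in transitions.items()}
    B = {t: {w: c / tag_count[t] for w, c in words.items()}
         for t, words in emissions.items()}
    return A, B

# With the WSJ counts quoted on the slides, these formulas give
# P(VB | MD) = 10471 / 13124 ≈ 0.80 and P(will | MD) = 4046 / 13124 ≈ 0.31.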
HMM tagger

The A transition probabilities, and B observation likelihoods (emission
probabilities) of the HMM are illustrated for three states in an HMM
part-of-speech tagger; the full tagger would have one state for each tag

30
HMM tagger

States: Set of part-of-speech tags.

Transition Probabilities: Tag transition probabilities
– A tag transition probability P(tagb | taga) represents the probability of a tag tagb occurring given the
previous tag taga.


Observations: Words (Vocabulary)
– Observation Likelihoods: Emission Probabilities P(word|tag)
– An emission probability P(word | tag) represents the probability of the tag producing the word.


Initial Probability Distribution: First Tag Probabilities P(tag |<s>) in sentences.

31
HMM Tagging as Decoding

For an HMM that contains hidden variables, the task of determining the sequence of hidden variables corresponding to the sequence of observations is called decoding.

Decoding:
– Given as input an HMM λ = (TransProbs, ObsLikelihoods) and a
sequence of observations O = o1,…,oT, find the most probable
sequence of states Q = q1,…,qT .

For part-of-speech tagging, we will find the most probable
sequence of tags t1,…,tn (hidden variables) for a given
sequence of words w1,…,wn (observations).
32
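To make "find the most probable sequence of states" concrete, here is a deliberately naive decoder that scores every possible tag sequence; it is exponential in the sentence length, which is why the Viterbi algorithm on the following slides is used instead (pi, A, B are dictionaries as in the estimation sketch above):

# Brute-force decoding: enumerate all tag sequences and keep the best-scoring one.
from itertools import product

def brute_force_decode(words, tags, pi, A, B):
    best_seq, best_prob = None, 0.0
    for seq in product(tags, repeat=len(words)):
        prob = pi.get(seq[0], 0.0) * B.get(seq[0], {}).get(words[0], 0.0)
        for i in range(1, len(words)):
            prob *= A.get(seq[i - 1], {}).get(seq[i], 0.0)
            prob *= B.get(seq[i], {}).get(words[i], 0.0)
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq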
HMM - Decoding

For POS tagging, decoding means choosing the tag sequence that is most probable given the observed word sequence:

t̂1:n = argmax t1…tn P(t1…tn | w1…wn)

Applying Bayes' rule, and dropping the denominator P(w1…wn), which is the same for every tag sequence:

t̂1:n = argmax t1…tn P(w1…wn | t1…tn) · P(t1…tn)
33
HMM - Decoding

HMM taggers make two further simplifying assumptions
– The first is that the probability of a word appearing depends only on its own tag and is independent of neighboring words and tags:

P(w1…wn | t1…tn) ≈ ∏i P(wi | ti)

– The second, the bigram assumption, is that the probability of a tag depends only on the previous tag, rather than the entire tag sequence:

P(t1…tn) ≈ ∏i P(ti | ti−1)

34
HMM - Decoding
– Plugging in these simplifying assumptions gives the following equation for the most probable tag sequence from a bigram tagger:

t̂1:n = argmax t1…tn ∏i P(wi | ti) · P(ti | ti−1)

– The two parts of the equation correspond neatly to the B emission probabilities and the A transition probabilities
35
Viterbi Algorithm

The decoding algorithm for HMMs is the Viterbi algorithm

36
Working of Viterbi Algorithm
(Trellis figure: the word sequence o1, o2, … runs along one axis and the set of tags along the other; the most probable tag sequence is a path through this trellis)

37
Working of Viterbi Algorithm

most probable path probability for the first word o1:

viterbi[s, 1] = πs · bs(o1)

where
– πs is the first-tag probability of tag s, and
– bs(o1) is the emission probability P(word o1 | tag s)

38
Working of Viterbi Algorithm

most probable path probability for the first t words:

viterbi[s, t] = max s′ ( viterbi[s′, t−1] · as′,s · bs(ot) )

where
– viterbi[s′, t−1] is the most probable path probability of the first t−1 words such that the tag of word t−1 is s′,
– as′,s is the transition probability P(tag s | tag s′), and
– bs(ot) is the emission probability P(word ot | tag s)

39
Working of Viterbi Algorithm

most probable path probability of all T words:

bestpathprob = max s viterbi[s, T]

40
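A Python sketch of the Viterbi recursion described on the preceding slides; pi, A, B are the dictionaries from the earlier sketches, and unseen word/tag pairs simply get probability 0 here (a real tagger would smooth them):

# Viterbi decoding: viterbi[s,1] = pi[s] * b_s(o1),
# viterbi[s,t] = max_s' viterbi[s',t-1] * a_{s',s} * b_s(o_t).
def viterbi_decode(words, tags, pi, A, B):
    V = [{}]         # V[t][s]: best path probability ending in tag s at step t
    backptr = [{}]
    for s in tags:   # initialization step (first word)
        V[0][s] = pi.get(s, 0.0) * B.get(s, {}).get(words[0], 0.0)
        backptr[0][s] = None
    for t in range(1, len(words)):   # recursion step
        V.append({})
        backptr.append({})
        for s in tags:
            best_prev, best_prob = tags[0], 0.0
            for sp in tags:
                prob = V[t - 1][sp] * A.get(sp, {}).get(s, 0.0)
                if prob > best_prob:
                    best_prev, best_prob = sp, prob
            V[t][s] = best_prob * B.get(s, {}).get(words[t], 0.0)
            backptr[t][s] = best_prev
    # termination: best final tag, then follow back-pointers to recover the path
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))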
Viterbi Algorithm - Example

Let’s tag the sentence Janet will back the bill

41
Viterbi Algorithm - Example

Viterbi[NNP, Janet]
= P(NNP|<s>) × P(Janet|NNP)
= 0.2767 × 0.000032 = 0.00000885
= 8.85 × 10⁻⁶

42
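As a quick sanity check of this first cell, using only the two probabilities quoted on the slide:

# viterbi[NNP, Janet] = P(NNP|<s>) * P(Janet|NNP)
print(0.2767 * 0.000032)   # 8.8544e-06, i.e. about 8.85 x 10^-6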
Viterbi Algorithm - Example

(Viterbi matrix filled in column by column for the remaining words; figures on slides 43–46)
Viterbi Algorithm - Example

Viterbi Matrix for

Janet will back the bill
– Janet /NNP
– will /MD
– back /VB
– the /DT
– bill /NN

47
Self Study

Beam search is a variant of Viterbi decoding that
maintains only a fraction of high scoring states rather than
all states during decoding.

Maximum Entropy Markov Model (MEMM) taggers are another type of tagger; they train logistic regression models to pick the best tag given a word, its context, and its previous tags, using feature templates.

48
Reference

WSJ Corpus
– https://www.spsc.tugraz.at/databases-and-tools/wall-street-journal-corpus.html
– https://aclanthology.org/H92-1073.pdf

49
Thank you

50
