
CSE440: NATURAL LANGUAGE PROCESSING II
Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 5: Sequence Learning
Outline
- Sequence tagging (SLP 8)
- Markov models (SLP Appendix A)
- Recurrent neural networks (SLP 9)
Sequences are common in languages
- Speech recognition
- Group acoustic signal into phonemes
- Group phonemes into words
- Natural language processing
- Part of speech tagging
- our running example
- Named entity recognition
- Information extraction
- Question answering
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.

Because a word’s part of speech changes with the surrounding sequence:


- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.

How big is this ambiguity issue?


Part-of-speech ambiguity

Most words in the English vocabulary are unambiguous.


Part-of-speech ambiguity

But most word tokens in running text are ambiguous! That is, the ambiguous words are the more frequently used ones.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we build a table of all words and, for ambiguous words, store the tag most commonly used with that word?
- This is called the most frequent tag baseline:
- assign each token the tag it appeared with most frequently in the training data
- 92.34% accurate on the WSJ corpus (a minimal sketch of this baseline follows below)
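
A minimal Python sketch of this baseline (function names and the fallback tag for unseen words are illustrative assumptions, not from the slides):

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    # Count (word, tag) pairs and keep the most frequent tag per word.
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]  # fallback for unseen words
    return table, default_tag

def tag_most_frequent(words, table, default_tag):
    # Look each token up in the table; back off to the overall most frequent tag.
    return [(w, table.get(w, default_tag)) for w in words]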
A big table is still a good start
- What’s the tag for cut? Counts in the training data:

  10 cut NN
  25 cut VB
  13 cut VBD
   7 cut VBN

- The baseline would always tag cut as VB (25 of its 55 occurrences).
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Maximum entropy Markov models
- Parameters estimated by logistic regression
- Recurrent neural networks
Hidden Markov Models
- Maximum entropy Markov models (MEMM)
- (Visible) Markov models for PoS tagging
- Training by counting
- Smoothing probabilities
- Handling unknown words
- Viterbi algorithm
Why POS Tagging Must Model Sequences
Our running example:

Secretariat is expected to race tomorrow.

Secretariat is ________

Race is ________

To understand context, we will predict all tags together.


Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules

You can learn these rules: see Transformation-Based Learning, https://dl.acm.org/citation.cfm?id=218367
Approach 1: Maximum entropy Markov models
- Maximum entropy = logistic regression
- Markov models
- Introduced by Andrey Markov
- Limited horizon: each prediction depends only on a bounded amount of history
- How would you implement sequence models with the logistic regression algorithm that we already know?
- Let’s assume we scan the text left to right.
Approach 1 continued
- Add the previously seen tags as features!
- Use gold tags in training
- Use predicted tags in testing
- Other common features
- Words, lemmas in a window [-k, +k]
- Casing info, prefixes, suffixes of these words
- Bigrams containing the current word

See also:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala
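
A minimal sketch of MEMM-style feature extraction for one token (feature names and the window size are illustrative assumptions; the tagger linked above uses a richer feature set):

def memm_features(words, i, prev_tags, k=2):
    # Features for predicting the tag of words[i]; prev_tags holds the gold tags
    # (training) or the already-predicted tags (testing) for words[0..i-1].
    feats = {}
    feats["word=" + words[i].lower()] = 1.0
    feats["is_capitalized"] = 1.0 if words[i][0].isupper() else 0.0
    feats["prefix2=" + words[i][:2].lower()] = 1.0
    feats["suffix3=" + words[i][-3:].lower()] = 1.0
    feats["prev_tag=" + (prev_tags[i - 1] if i > 0 else "<S>")] = 1.0
    # Words in a [-k, +k] window around the current token.
    for j in range(max(0, i - k), min(len(words), i + k + 1)):
        feats["w[%d]=%s" % (j - i, words[j].lower())] = 1.0
    return feats

These feature dictionaries can then be vectorized (e.g., with scikit-learn’s DictVectorizer) and fed to a logistic regression classifier.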
Approach 1: bidirectional MEMMs
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Approach 2: Hidden (visible) Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called (visible) Markov model
- “Visible” to distinguish it from the hidden Markov models, where the tags are
unknown
- Imagine implementing a POS tagger for an unstudied language without POS annotations
Approach 2: Hidden (visible) Markov Models

• A sentence contains n words
• t1..n – an assignment of POS tags to this sentence
• w1..n – the words in this sentence
• t̂1..n – the estimate of the optimal tag assignment
Let’s formalize this

We have four probabilities: likelihood, prior, posterior, and marginal likelihood.

- Prior: probability distribution representing our knowledge or uncertainty about a data object before observing it
- Likelihood: the probability of the observed data given a particular hypothesis (here, the words given a tag sequence)
- Posterior: conditional probability distribution representing which parameters are likely after observing the data object
- Marginal likelihood: the likelihood integrated over the parameter space; it does not affect the argmax, so it can be ignored during inference
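
Spelling this out for tagging (a sketch of the standard derivation, using the notation above):

  t̂1..n = argmax over t1..n of P(t1..n | w1..n)                      (posterior)
        = argmax over t1..n of P(w1..n | t1..n) P(t1..n) / P(w1..n)   (Bayes’ rule)
        = argmax over t1..n of P(w1..n | t1..n) P(t1..n)              (likelihood × prior)

The marginal likelihood P(w1..n) is the same for every candidate tag sequence, so it drops out of the argmax.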
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags

- A tag is dependent only on the previous tag


Replacing these in the original equation:

  t̂1..n = argmax over t1..n of ∏ i=1..n P(wi|ti) P(ti|ti-1)

where the P(wi|ti) terms are the word likelihoods and the P(ti|ti-1) terms are the tag transition probabilities.
Computing Tag Transition Probabilities
In the Brown corpus (1M words)
- DT occurs 116,454 times
- DT is followed by NN 56,509 times
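
Plugging these counts into the maximum likelihood estimate (this worked line is implied by the slide’s numbers):

  P(NN|DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 ≈ 0.49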
Computing Word Likelihoods
In the Brown corpus (1M words)
- VBZ occurs 21,627 times
- VBZ is the tag for “is” 10,073 times
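
And likewise for the word likelihood:

  P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 ≈ 0.47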
Example

Let’s see why VB is preferred for “race” in the first sentence (“Secretariat is expected to race tomorrow”)


Example
The first tag transition
- P(NN|TO) = 0.00047
- P(VB|TO) = 0.83

The word likelihood for “race”


- P(race|NN) = 0.00057
- P(race|VB) = 0.00012

The second tag transition


- P(NR|VB) = 0.0027
- P(NR|NN) = 0.0012
Example
P(VB|TO)P(NR|VB)P(race|VB) = 0.00000027

P(NN|TO)P(NR|NN)P(race|NN) = 0.00000000032

VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two steps to building and using an HMM:
- Training:
- Estimating P(ti|ti-1) and P(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (or sequence of words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities

B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
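
A minimal counting sketch in Python for estimating both types of probabilities from a tagged corpus (plain maximum likelihood estimates, no smoothing; the variable names and the "<S>" start pseudo-tag are illustrative assumptions):

from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    # A[prev][t]  ~ P(t | prev): tag transition probabilities
    # B[t][w]     ~ P(w | t):    word likelihoods (observation probabilities)
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<S>"  # beginning-of-sentence pseudo-tag
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            prev = tag
    A = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
         for p, cs in transition_counts.items()}
    B = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
         for t, cs in emission_counts.items()}
    return A, B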
Testing: Viterbi Algorithm

Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi? (The algorithm is named after Andrew Viterbi.)
Illustration of Search Space

This structure is called a trellis: one row for each state (tag), one column for each observation (word).
Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O

Output
- Most probable state sequence Q together with its probability

Both A and B are matrices with probabilities
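
A minimal Python sketch of Viterbi decoding, assuming A and B are the nested dictionaries produced by the counting sketch earlier (the "<S>" start pseudo-tag and the 0.0 default for unseen events are illustrative simplifications; a real implementation would also use log probabilities and smoothing):

def viterbi(words, tags, A, B):
    # v[t][q]: probability of the best tag sequence for words[0..t] ending in tag q
    # back[t][q]: the tag at position t-1 on that best sequence
    v = [{q: A.get("<S>", {}).get(q, 0.0) * B.get(q, {}).get(words[0], 0.0)
          for q in tags}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for q in tags:
            prev, score = max(((p, v[t - 1][p] * A.get(p, {}).get(q, 0.0)) for p in tags),
                              key=lambda x: x[1])
            v[t][q] = score * B.get(q, {}).get(words[t], 0.0)
            back[t][q] = prev
    # Follow back-pointers from the best final state.
    best = max(tags, key=lambda q: v[-1][q])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][best]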


Example of A and B matrices

A: the rows are labeled with the conditioning event, e.g., P(PPSS|VB) = 0.0070

B: same layout as A, rows are the conditioning events, e.g., P(want|NN) = 0.000054


Example Trace
Summary of Viterbi Algorithm

Each trellis cell is filled recursively: vt(j) = max over i of vt-1(i) · aij · bj(ot), where

• vt-1(i) – the previous Viterbi path probability from the previous time step t – 1 (i.e., the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., the word at position t) given the current state j (i.e., the POS tag j)
Extending the HMM Algorithm to Trigrams

Conditioning each tag on only the single previous tag, P(ti|ti-1), is pretty limiting for POS tagging.

Let’s extend it to trigrams of tags:

  t̂1..n = argmax over t1..n of [ ∏ i=1..n P(ti|ti-1, ti-2) P(wi|ti) ] · P(tn+1|tn)

This is better.

• tn+1 – the end-of-sentence tag
• We also need virtual tags, t0 and t-1, set to the beginning-of-sentence value.
TnT
- This is what the TnT (Trigrams’n’Tags) tagger does
- Probably the fastest POS tagger in the world
- Not the best, but pretty close (96% acc)
- http://www.coli.uni-saarland.de/~thorsten/tnt/
Problems with TnT
The trigram estimate P̂(ti|ti-1ti-2) is very sparse: most tag trigrams occur rarely, if ever, in the training data.

Backoff model: linear interpolation

P(ti|ti-1ti-2) = λ3 P̂(ti|ti-1ti-2) + λ2 P̂(ti|ti-1) + λ1 P̂(ti)

λ1 + λ2 + λ3 = 1, to guarantee that the result is a probability.
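
A minimal sketch of the interpolated estimate (the λ values shown are placeholders; TnT actually sets them by deleted interpolation on the training data):

def interpolated_trigram(t, t_prev, t_prev2, p_tri, p_bi, p_uni, lambdas=(0.1, 0.3, 0.6)):
    # P(t | t_prev, t_prev2) = l3*P^(t | t_prev, t_prev2) + l2*P^(t | t_prev) + l1*P^(t)
    l1, l2, l3 = lambdas  # must sum to 1 so the result is a probability
    return (l3 * p_tri.get((t_prev2, t_prev, t), 0.0)
            + l2 * p_bi.get((t_prev, t), 0.0)
            + l1 * p_uni.get(t, 0.0))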


Other Types of Smoothing
• Add one:

– Where K is the number of words with POS tag t


• Variant of add one (Charniak’s):

– Not a proper probability distribution!


Another Problem for All HMMs
- Massive multiplication of many small probabilities here causes floating-point underflow; in practice we sum log probabilities instead.
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information, e.g., estimate a tag distribution from the word’s suffix (as in TnT’s suffix analysis)
- This reduces the error rate for unknown words from 40% to 20%
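
A minimal sketch of one way to use suffixes for unknown words (a simplification of TnT-style suffix handling; the suffix length and the rare-word threshold are illustrative assumptions):

from collections import Counter, defaultdict

def train_suffix_model(tagged_sentences, suffix_len=3, max_count=10):
    # Estimate P(tag | suffix) from rare training words, which behave
    # most like the unknown words we will see at test time.
    word_freq = Counter(w for s in tagged_sentences for w, _ in s)
    suffix_tag_counts = defaultdict(Counter)
    for s in tagged_sentences:
        for w, t in s:
            if word_freq[w] <= max_count:
                suffix_tag_counts[w[-suffix_len:].lower()][t] += 1
    return {suf: {t: c / sum(cs.values()) for t, c in cs.items()}
            for suf, cs in suffix_tag_counts.items()}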
Main Disadvantage of HMMs
Hard to add features in the model
- Capitalization, hyphenated, suffixes, etc.

It’s possible, but every such feature must be encoded in P(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation, but they take longer to train
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example
Evaluation
- Accuracy does not work. Why?
- We need precision, recall, F1:
- P = TP/(TP + FP)
- R = TP/(TP + FN)
- F1 = 2PR/(P + R)
- Micro vs. macro F1 measures
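
A minimal sketch of per-class precision/recall/F1 and the micro vs. macro averages (function and variable names are illustrative):

from collections import Counter

def prf1(gold, pred):
    # Per-class counts of true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    labels = set(tp) | set(fp) | set(fn)
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
    # Macro F1: average of per-class F1 scores (each class counts equally).
    macro_f1 = sum(f for _, _, f in per_class.values()) / len(per_class)
    # Micro F1: computed from pooled counts (each instance counts equally).
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return per_class, macro_f1, micro_f1

Note that when every token receives exactly one predicted label, micro-F1 reduces to accuracy; the distinction matters most when some labels (e.g., the non-entity class in NER) are excluded from the counts.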
