Natural Language Processing II
Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 5: Sequence Learning
Outline
- Sequence tagging (SLP 8)
- Markov models (SLP Appendix A)
- Recurrent neural networks (SLP 9)
Sequences are common in languages
- Speech recognition
- Group acoustic signal into phonemes
- Group phonemes into words
- Natural language processing
- Part of speech tagging
- our running example
- Named entity recognition
- Information extraction
- Question answering
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.
But most word tokens in running text are ambiguous: ambiguous words occur more frequently in running text than unambiguous ones.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we build a table for all words and, for ambiguous words, store the most commonly used tag for that word?
- This is called the most frequent tag baseline
- assign each token the tag it appeared with most frequently in the training data.
- 92.34% accurate on the WSJ corpus.
A big table is still a good start
- What’s the tag for cut? Counts from the training data:
  count  word  tag
     10  cut   NN
     25  cut   VB
     13  cut   VBD
      7  cut   VBN
- The most frequent tag baseline picks VB, the tag with the highest count (25); see the sketch below.
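A minimal sketch of this baseline, assuming training data as lists of (word, tag) pairs; the function names and the NN fallback for unknown words are illustrative assumptions, not from the lecture:

```python
from collections import defaultdict, Counter

def train_most_frequent_tag(tagged_sentences):
    """Build the lookup table: for each word, the tag it appeared with
    most frequently in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # most_common(1) returns [(tag, count)]; keep just the tag
    return {word: ctr.most_common(1)[0][0] for word, ctr in counts.items()}

def tag_most_frequent(words, table, default_tag="NN"):
    # Unknown words fall back to a default tag (NN here, purely illustrative)
    return [table.get(w.lower(), default_tag) for w in words]

# With the counts above (10 NN, 25 VB, 13 VBD, 7 VBN for "cut"),
# the table would store: "cut" -> "VB"
```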
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Maximum entropy Markov models
- Parameters estimated by logistic regression
- Recurrent neural networks
Hidden Markov Models
- Maximum entropy Markov models (MEMM)
- (Visible) Markov models for PoS tagging
- Training by counting
- Smoothing probabilities
- Handling unknown words
- Viterbi algorithm
Why POS Tagging Must Model Sequences
Our running example:
Secretariat is ________
Race is ________
See also:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala
Approach 1: bidirectional MEMMs
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
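A minimal sketch of the stacking idea (names are illustrative, not the CoreNLP/processors API): the right-to-left pass receives the left-to-right predictions as extra features, which is also the catch hinted at above, since at test time those hints are noisy predictions rather than gold tags.

```python
def bidirectional_tag(words, ltr_tagger, rtl_tagger):
    """Stack two MEMM-style taggers that traverse the text in opposite directions.

    ltr_tagger(words) -> list of tags, left to right.
    rtl_tagger(words, hints) -> list of tags, right to left, where `hints`
    are the left-to-right predictions used as extra features.
    """
    ltr_tags = ltr_tagger(words)  # first pass: left to right
    # Second pass: right to left, conditioning on the first pass's output.
    # The second model should be trained on predicted (noisy) hints, not gold tags.
    return rtl_tagger(words, hints=ltr_tags)
```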
Approach 2: Hidden (visible) Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called (visible) Markov model
- “Visible” to distinguish it from the hidden Markov models, where the tags are
unknown
- Imagine implementing a POS tagger for an unstudied language without POS annotations
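Concretely, the model scores a whole tag sequence by multiplying tag-transition and word-likelihood probabilities; a sketch of the standard bigram factorization (notation as in SLP, with a start tag standing in for t_0):

```latex
\hat{t}_{1:n} \;=\; \operatorname*{argmax}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
\;\approx\; \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```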
Approach 2: Hidden (visible) Markov Models
P(NN|TO) P(NR|NN) P(race|NN) = 0.00000000032
This product is much smaller than the corresponding product for the VB path, so VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two steps to building an HMM:
- Training:
- Estimating p(ti|ti-1) and p(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (a sequence of words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities
B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
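A minimal sketch of estimating both tables by counting (maximum likelihood, no smoothing); the function name, the start pseudo-tag, and the data format are assumptions, not from the lecture:

```python
from collections import defaultdict, Counter

def train_hmm(tagged_sentences):
    """Estimate HMM parameters by counting.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns A[prev_tag][tag] (transition probabilities) and
    B[tag][word] (observation/word likelihoods).
    """
    transition_counts = defaultdict(Counter)  # counts for p(tag | previous tag)
    emission_counts = defaultdict(Counter)    # counts for p(word | tag)

    for sentence in tagged_sentences:
        prev_tag = "<s>"                      # sentence-start pseudo-tag
        for word, tag in sentence:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word.lower()] += 1
            prev_tag = tag

    def normalize(counts):
        return {ctx: {x: c / sum(ctr.values()) for x, c in ctr.items()}
                for ctx, ctr in counts.items()}

    return normalize(transition_counts), normalize(emission_counts)

# Toy usage:
# A, B = train_hmm([[("the", "DT"), ("race", "NN")],
#                   [("to", "TO"), ("race", "VB")]])
# A["TO"]["VB"] and B["NN"]["race"] are then simple relative frequencies.
```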
Testing: Viterbi Algorithm
Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi?
Illustration of Search Space
- This is called a trellis
- One row for each state (tag)
Output
- Most probable state sequence Q together with its probability
A (the tag transition table): the rows are labeled with the conditioning event, e.g., P(PPSS|VB) = .0070
- vt-1(i): the Viterbi path probability from the previous time step t - 1 (i.e., the previous word) ending in state i
- aij: the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
- bj(ot): the state observation likelihood of the observation symbol ot (i.e., the word at position t) given the current state j (i.e., POS tag j)
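Putting the three quantities together gives the Viterbi recursion v_t(j) = max_i v_{t-1}(i) · a_ij · b_j(o_t). A minimal decoding sketch, assuming A and B are the nested probability dictionaries from the counting sketch above (names and the start tag are illustrative):

```python
import math

def viterbi(words, tags, A, B, start="<s>"):
    """Most probable tag sequence for `words` under a bigram HMM.
    A[prev][tag]: transition probabilities; B[tag][word]: word likelihoods.
    Work in log space to avoid numerical underflow."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [dict() for _ in words]     # V[t][j]: best log-prob of a path ending in tag j at word t
    back = [dict() for _ in words]  # back[t][j]: best previous tag on that path

    # Initialization: transition out of the start state, emit the first word
    for j in tags:
        V[0][j] = logp(A.get(start, {}).get(j, 0.0)) + logp(B.get(j, {}).get(words[0], 0.0))
        back[0][j] = None

    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(words)):
        for j in tags:
            best_i = max(tags, key=lambda i: V[t - 1][i] + logp(A.get(i, {}).get(j, 0.0)))
            V[t][j] = (V[t - 1][best_i] + logp(A.get(best_i, {}).get(j, 0.0))
                       + logp(B.get(j, {}).get(words[t], 0.0)))
            back[t][j] = best_i

    # Termination: pick the best final state, then follow the backpointers
    last = max(tags, key=lambda j: V[-1][j])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```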
Extending the HMM Algorithm to Trigrams
This is better
- This reduces error rate for unknown words from 40% to 20%
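For reference, a sketch of what the trigram extension changes in the transition model: each tag is conditioned on the two previous tags, usually interpolated with lower-order estimates to handle sparsity (the λ weights are an assumption here, set in practice by deleted interpolation or similar):

```latex
P(t_i \mid t_{i-1}, t_{i-2}) \;\approx\;
  \lambda_3\,\hat{P}(t_i \mid t_{i-1}, t_{i-2})
+ \lambda_2\,\hat{P}(t_i \mid t_{i-1})
+ \lambda_1\,\hat{P}(t_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```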
Main Disadvantage of HMMs
Hard to add features to the model
- Capitalization, hyphenation, suffixes, etc.
It's possible, but every such feature must be encoded in p(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation, but they take longer to train
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example
Evaluation
- Accuracy does not work. Why?
- We need precision, recall, F1:
- P = TP/(TP + FP)
- R = TP/(TP + FN)
- F1 = 2PR/(P + R)
- Micro vs. macro F1 measures
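A minimal sketch of computing per-class precision, recall, and F1 plus the two averages, assuming token-level gold and predicted labels (function name and data format are illustrative):

```python
from collections import Counter

def f1_scores(gold, pred):
    """Per-class F1 plus micro- and macro-averaged F1.
    gold, pred: equal-length lists of labels (one per token)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    labels = set(gold) | set(pred)
    per_class_f1 = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class_f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    # Macro F1: average the per-class F1 scores (every class counts equally)
    macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

    # Micro F1: pool TP/FP/FN over all classes, then compute one P/R/F1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0

    return per_class_f1, micro_f1, macro_f1
```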