
CSE440: NATURAL LANGUAGE PROCESSING II
Farig Sadeque
Assistant Professor
Department of Computer Science and Engineering
BRAC University
Lecture 5: Sequence Learning
Outline
- Sequence tagging (SLP 8)
- Markov models (SLP Appendix A)
- Recurrent neural networks (SLP 9)
Sequences are common in languages
- Speech recognition
- Group acoustic signal into phonemes
- Group phonemes into words
- Natural language processing
- Part of speech tagging
- our running example
- Named entity recognition
- Information extraction
- Question answering
Parts-of-speech tagging
Why not just make a big table?
- badger is a NOUN, trip is a VERB, etc.

Because a word’s part of speech changes with the surrounding sequence:


- I saw a badger in the zoo.
- Don’t badger me about it!
- I saw him trip on his shoelaces.
- She said her trip to Greece was amazing.

How big is this ambiguity issue?


Part-of-speech ambiguity

Most words in the English vocabulary are unambiguous.


Part-of-speech ambiguity

But most word tokens in running text are ambiguous! That is, the ambiguous words are the more frequently used ones.
A big table is still a good start
- Only 30-40% of words in running text are unambiguous.
- What if we build a table of all words and, for ambiguous words, store the tag most commonly used with that word?
- This is called the most frequent tag baseline:
- assign each token the tag it appeared with most frequently in the training data
- 92.34% accurate on the WSJ corpus (a minimal sketch of this baseline follows below)
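
A minimal Python sketch of this baseline (function names and the fallback tag for unseen words are illustrative assumptions, not from the slides):

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    # Count (word, tag) pairs and keep the most frequent tag per word.
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]  # fallback for unseen words
    return table, default_tag

def tag_most_frequent(words, table, default_tag):
    # Look each token up in the table; back off to the overall most frequent tag.
    return [(w, table.get(w, default_tag)) for w in words]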
A big table is still a good start
- What’s the tag for cut? Counts in the training data:

  10 cut NN
  25 cut VB
  13 cut VBD
   7 cut VBN

- The baseline would always tag cut as VB (25 of its 55 occurrences).
Learning sequence taggers
- To improve over the most frequent tag baseline, we should take advantage of
the sequence.
- Some options we will cover:
- Hidden Markov models
- Parameters estimated by counting (like naïve Bayes)
- Maximum entropy Markov models
- Parameters estimated by logistic regression
- Recurrent neural networks
Hidden Markov Models
- Maximum entropy Markov models (MEMM)
- (Visible) Markov models for PoS tagging
- Training by counting
- Smoothing probabilities
- Handling unknown words
- Viterbi algorithm
Why POS Tagging Must Model Sequences
Our running example:

Secretariat is expected to race tomorrow.

Secretariat is ________

Race is ________

To understand context, we will predict all tags together.


Approach 0: Rule-based baseline
- Assign each word a list of potential POS labels using the dictionary
- Winnow down the list to a single POS label for each word using lists of
hand-written disambiguation rules

You can learn these rules: see Transformation-Based Learning, https://dl.acm.org/citation.cfm?id=218367
Approach 1: Maximum entropy Markov models
- Maximum entropy = logistic regression
- Markov models
- Introduced by Andrey Markov
- Limited horizon: each prediction depends only on a bounded amount of history
- How would you implement sequence models with the logistic regression algorithm that we already know?
- Let’s assume we scan the text left to right.
Approach 1 continued
- Add the previously seen tags as features!
- Use gold tags in training
- Use predicted tags in testing
- Other common features
- Words, lemmas in a window [-k, +k]
- Casing info, prefixes, suffixes of these words
- Bigrams containing the current word

See also:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala
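
A minimal sketch of MEMM-style feature extraction for one token (feature names and the window size are illustrative assumptions; the tagger linked above uses a richer feature set):

def memm_features(words, i, prev_tags, k=2):
    # Features for predicting the tag of words[i]; prev_tags holds the gold tags
    # (training) or the already-predicted tags (testing) for words[0..i-1].
    feats = {}
    feats["word=" + words[i].lower()] = 1.0
    feats["is_capitalized"] = 1.0 if words[i][0].isupper() else 0.0
    feats["prefix2=" + words[i][:2].lower()] = 1.0
    feats["suffix3=" + words[i][-3:].lower()] = 1.0
    feats["prev_tag=" + (prev_tags[i - 1] if i > 0 else "<S>")] = 1.0
    # Words in a [-k, +k] window around the current token.
    for j in range(max(0, i - k), min(len(words), i + k + 1)):
        feats["w[%d]=%s" % (j - i, words[j].lower())] = 1.0
    return feats

These feature dictionaries can then be vectorized (e.g., with scikit-learn’s DictVectorizer) and fed to a logistic regression classifier.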
Approach 1: bidirectional MEMMs
- You can stack MEMMs that traverse the text in opposite directions:
- Left-to-right direction (same as before)
- Right-to-left: uses the prediction(s) of the above system as features!
- What is the problem with the predictions of the left-to-right model here?
- Many state-of-the-art taggers use this approach: CoreNLP, processors,
SVMTool
Approach 2: Hidden (visible) Markov Models
- Let’s put the probability theory we covered in the previous lecture to use!
- The resulting approach is called (visible) Markov model
- “Visible” to distinguish it from the hidden Markov models, where the tags are
unknown
- Imagine implementing a POS tagger for an unstudied language without POS annotations
Approach 2: Hidden (visible) Markov Models

• A sentence contains n words
• t1..n – an assignment of POS tags to this sentence
• w1..n – the words in this sentence
• t̂1..n – the estimate of the optimal tag assignment
Let’s formalize this

We have four probabilities: likelihood, prior, posterior, and marginal likelihood.

- Prior: probability distribution representing our knowledge or uncertainty about a data object before observing it
- Likelihood: the probability of the observed data given a particular hypothesis (here, the words given a tag sequence)
- Posterior: conditional probability distribution representing which parameters are likely after observing the data object
- Marginal likelihood: the likelihood integrated over the parameter space; it does not affect the argmax, so it can be ignored during inference
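
Spelling this out for tagging (a sketch of the standard derivation, using the notation above):

  t̂1..n = argmax over t1..n of P(t1..n | w1..n)                      (posterior)
        = argmax over t1..n of P(w1..n | t1..n) P(t1..n) / P(w1..n)   (Bayes’ rule)
        = argmax over t1..n of P(w1..n | t1..n) P(t1..n)              (likelihood × prior)

The marginal likelihood P(w1..n) is the same for every candidate tag sequence, so it drops out of the argmax.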
Three Approximations
- Words are independent of the words around them
- Words depend only on their POS tags, not on the neighboring POS tags

- A tag is dependent only on the previous tag


Replacing these in the original equation:

  t̂1..n = argmax over t1..n of ∏ i=1..n P(wi|ti) P(ti|ti-1)

where the P(wi|ti) terms are the word likelihoods and the P(ti|ti-1) terms are the tag transition probabilities.
Computing Tag Transition Probabilities
In the Brown corpus (1M words)
- DT occurs 116,454 times
- DT is followed by NN 56,509 times
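
Plugging these counts into the maximum likelihood estimate (this worked line is implied by the slide’s numbers):

  P(NN|DT) = C(DT, NN) / C(DT) = 56,509 / 116,454 ≈ 0.49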
Computing Word Likelihoods
In the Brown corpus (1M words)
- VBZ occurs 21,627 times
- VBZ is the tag for “is” 10,073 times
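
And likewise for the word likelihood:

  P(is|VBZ) = C(VBZ, is) / C(VBZ) = 10,073 / 21,627 ≈ 0.47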
Example

Let’s see why VB is preferred for “race” in the first sentence (“Secretariat is expected to race tomorrow”)


Example
The first tag transition
- P(NN|TO) = 0.00047
- P(VB|TO) = 0.83

The word likelihood for “race”


- P(race|NN) = 0.00057
- P(race|VB) = 0.00012

The second tag transition


- P(NR|VB) = 0.0027
- P(NR|NN) = 0.0012
Example
P(VB|TO)P(NR|VB)P(race|VB) = 0.00000027

P(NN|TO)P(NR|NN)P(race|NN) = 0.00000000032

VB is more likely than NN, even though “race” appears more commonly as a noun!
Training/Testing an HMM
Just like with any machine learning algorithm, there are two steps to building and using an HMM:
- Training:
- Estimating P(ti|ti-1) and P(wi|ti)
- Testing (predicting):
- Estimating the best sequence of tags for a sentence (or sequence of words)
Training: Two Types of Probabilities
A: transition probabilities
- Used to compute the prior probabilities (probability of a tag)
- Often called tag transition probabilities

B: observation likelihoods
- Used to compute the likelihood probabilities (probability of a word given tag)
- Often called word likelihoods
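
A minimal counting sketch in Python for estimating both types of probabilities from a tagged corpus (plain maximum likelihood estimates, no smoothing; the variable names and the "<S>" start pseudo-tag are illustrative assumptions):

from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    # A[prev][t]  ~ P(t | prev): tag transition probabilities
    # B[t][w]     ~ P(w | t):    word likelihoods (observation probabilities)
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<S>"  # beginning-of-sentence pseudo-tag
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            prev = tag
    A = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
         for p, cs in transition_counts.items()}
    B = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
         for t, cs in emission_counts.items()}
    return A, B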
Testing: Viterbi Algorithm

Viterbi algorithm
- Computes the argmax efficiently
- Example of dynamic programming
What is a viterbi? (The algorithm is named after Andrew Viterbi.)
Illustration of Search Space

This structure is called a trellis: one row for each state (tag), one column for each observation (word).
Viterbi Algorithm
Input
- State (or tag) transition probabilities (A)
- Observation (or word) likelihoods (B)
- An observation sequence O

Output
- Most probable state sequence Q together with its probability

Both A and B are matrices with probabilities
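
A minimal Python sketch of Viterbi decoding, assuming A and B are the nested dictionaries produced by the counting sketch earlier (the "<S>" start pseudo-tag and the 0.0 default for unseen events are illustrative simplifications; a real implementation would also use log probabilities and smoothing):

def viterbi(words, tags, A, B):
    # v[t][q]: probability of the best tag sequence for words[0..t] ending in tag q
    # back[t][q]: the tag at position t-1 on that best sequence
    v = [{q: A.get("<S>", {}).get(q, 0.0) * B.get(q, {}).get(words[0], 0.0)
          for q in tags}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for q in tags:
            prev, score = max(((p, v[t - 1][p] * A.get(p, {}).get(q, 0.0)) for p in tags),
                              key=lambda x: x[1])
            v[t][q] = score * B.get(q, {}).get(words[t], 0.0)
            back[t][q] = prev
    # Follow back-pointers from the best final state.
    best = max(tags, key=lambda q: v[-1][q])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][best]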


Example of A and B matrices

A: the rows are labeled with the conditioning event, e.g., P(PPSS|VB) = 0.0070

B: same layout as A, rows are the conditioning events, e.g., P(want|NN) = 0.000054


Example Trace
Summary of Viterbi Algorithm

Each trellis cell is filled recursively: vt(j) = max over i of vt-1(i) · aij · bj(ot), where

• vt-1(i) – the previous Viterbi path probability from the previous time step t – 1 (i.e., the previous word)
• aij – the transition probability from previous state qi (i.e., the previous word having POS tag i) to current state qj (i.e., the current word having POS tag j)
• bj(ot) – the state observation likelihood of the observation symbol ot (i.e., the word at position t) given the current state j (i.e., the POS tag j)
Extending the HMM Algorithm to Trigrams

Conditioning each tag on only the single previous tag, P(ti|ti-1), is pretty limiting for POS tagging.

Let’s extend it to trigrams of tags:

  t̂1..n = argmax over t1..n of [ ∏ i=1..n P(ti|ti-1, ti-2) P(wi|ti) ] · P(tn+1|tn)

This is better.

• tn+1 – the end-of-sentence tag
• We also need virtual tags, t0 and t-1, set to the beginning-of-sentence value.
TnT
- This is what the TnT (Trigrams’n’Tags) tagger does
- Probably the fastest POS tagger in the world
- Not the best, but pretty close (96% acc)
- http://www.coli.uni-saarland.de/~thorsten/tnt/
Problems with TnT
The trigram estimate P̂(ti|ti-1ti-2) is very sparse: most tag trigrams occur rarely, if ever, in the training data.

Backoff model: linear interpolation

P(ti|ti-1ti-2) = λ3 P̂(ti|ti-1ti-2) + λ2 P̂(ti|ti-1) + λ1 P̂(ti)

λ1 + λ2 + λ3 = 1, to guarantee that the result is a probability.
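
A minimal sketch of the interpolated estimate (the λ values shown are placeholders; TnT actually sets them by deleted interpolation on the training data):

def interpolated_trigram(t, t_prev, t_prev2, p_tri, p_bi, p_uni, lambdas=(0.1, 0.3, 0.6)):
    # P(t | t_prev, t_prev2) = l3*P^(t | t_prev, t_prev2) + l2*P^(t | t_prev) + l1*P^(t)
    l1, l2, l3 = lambdas  # must sum to 1 so the result is a probability
    return (l3 * p_tri.get((t_prev2, t_prev, t), 0.0)
            + l2 * p_bi.get((t_prev, t), 0.0)
            + l1 * p_uni.get(t, 0.0))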


Other Types of Smoothing
• Add one:

– Where K is the number of words with POS tag t


• Variant of add one (Charniak’s):

– Not a proper probability distribution!


Another Problem for All HMMs
- Massive multiplication of many small probabilities here causes floating-point underflow; in practice we sum log probabilities instead.
Yet Another Problem: Unknown Words
- Solution 0 (not great): assume uniform emission probabilities (this is what
“add one” smoothing does)
- You can exclude closed-class POS tags such as…
- This does not use any lexical information such as suffixes
- Solution 1: capture lexical information, e.g., estimate a tag distribution from the word’s suffix (as in TnT’s suffix analysis)
- This reduces the error rate for unknown words from 40% to 20%
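
A minimal sketch of one way to use suffixes for unknown words (a simplification of TnT-style suffix handling; the suffix length and the rare-word threshold are illustrative assumptions):

from collections import Counter, defaultdict

def train_suffix_model(tagged_sentences, suffix_len=3, max_count=10):
    # Estimate P(tag | suffix) from rare training words, which behave
    # most like the unknown words we will see at test time.
    word_freq = Counter(w for s in tagged_sentences for w, _ in s)
    suffix_tag_counts = defaultdict(Counter)
    for s in tagged_sentences:
        for w, t in s:
            if word_freq[w] <= max_count:
                suffix_tag_counts[w[-suffix_len:].lower()][t] += 1
    return {suf: {t: c / sum(cs.values()) for t, c in cs.items()}
            for suf, cs in suffix_tag_counts.items()}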
Main Disadvantage of HMMs
Hard to add features in the model
- Capitalization, hyphenated, suffixes, etc.

It’s possible, but every such feature must be encoded in P(word|tag)
- Redesign the model for every feature!
- MEMMs avoid this limitation, but they take longer to train
Evaluation
- POS tagging accuracy = 100 x (number of correct tags) / (number of words in
dataset)
- Accuracy numbers currently reported for POS tagging are most often between
95% and 97%
- But they are much worse for “unknown” words
Evaluation example
Evaluation
- Accuracy does not work. Why?
- We need precision, recall, F1:
- P = TP/(TP + FP)
- R = TP/(TP + FN)
- F1 = 2PR/(P + R)
- Micro vs. macro F1 measures
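
A minimal sketch of per-class precision/recall/F1 and the micro vs. macro averages (function and variable names are illustrative):

from collections import Counter

def prf1(gold, pred):
    # Per-class counts of true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    labels = set(tp) | set(fp) | set(fn)
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
    # Macro F1: average of per-class F1 scores (each class counts equally).
    macro_f1 = sum(f for _, _, f in per_class.values()) / len(per_class)
    # Micro F1: computed from pooled counts (each instance counts equally).
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return per_class, macro_f1, micro_f1

Note that when every token receives exactly one predicted label, micro-F1 reduces to accuracy; the distinction matters most when some labels (e.g., the non-entity class in NER) are excluded from the counts.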
