NLP Lec 11

Sequence Models are machine learning models designed to handle sequences of data, particularly in natural language processing, speech recognition, and time-series forecasting. They can process inputs and outputs of varying lengths and are categorized into Statistical Sequence Models and Neural Network Based Sequence Models, with N-Gram Models and Hidden Markov Models being common types of the former. N-Gram Models predict the next item in a sequence based on previous items, utilizing the Markov assumption to simplify probability calculations.


7.2 Types of Sequence Models


Sequence Models are a subset of machine learning models that work with sequences of data. These models leverage the
temporal or sequential order of data to perform tasks such as natural language processing, speech recognition, time-series
forecasting, and more.
A key feature of sequence models is their ability to process inputs and outputs of varying lengths. They
are not restricted to the fixed one-to-one mapping seen in traditional ML models.
Sequence models in NLP are generally categorized into:
1. Statistical Sequence Models
2. Neural Network Based Sequence Models

7.2.1 Statistical Sequence Models

These are based on probabilistic methods and operate with hand-crafted features or word-based statistics. They were
dominant before deep learning became mainstream.
Common Types of Statistical Sequence Models:
 N-Gram Models
 Hidden Markov Models (HMM)

7.2.1.1 N-Gram Models

An N-Gram model is a probabilistic language model used to predict the next item (word, character, etc.) in
a sequence, based on the previous N−1 items. It assumes the Markov property, which means the probability
of a word depends only on the previous N−1 words.

Types of N-Grams

 Unigram (N = 1): Considers each word independently, with no context
 Bigram (N = 2): Considers one word of context
 Trigram (N = 3): Considers two words of context
 4-gram and above: More context but leads to data sparsity
Unigram Model:
A unigram model assumes that each word in a sequence is independent of the previous words. So, the model predicts the
next word based solely on its individual frequency in a training corpus.
The probability of a word w is calculated as:

P(w) = Count(w) / Total number of words in the corpus

The model predicts the most probable next word, regardless of context.
Example:
Training Corpus:
"I love NLP. I love AI. AI loves me."
Tokenized Words:
[I, love, NLP, I, love, AI, AI, loves, me]
Word Counts:
I→2
love → 2
NLP → 1
AI → 2
loves → 1
me → 1
Total Words: 9
Unigram Probabilities:
P(I) = 2/9 ≈ 0.222
P(love) = 2/9 ≈ 0.222
P(NLP) = 1/9 ≈ 0.111
P(AI) = 2/9 ≈ 0.222
P(loves) = 1/9 ≈ 0.111
P(me) = 1/9 ≈ 0.111

Predicting the Next Word Using Unigram Model


Let’s say a user typed:
“I love”
In a unigram model, context is ignored. So, the model will predict the most frequent word from the list of probabilities.
Most likely next word = “I”, “love”, or “AI” (all tied at 0.222)
The model might randomly choose among the top-scoring words, or choose the one with highest prior probability.
 The unigram model is very simple and fast.
 It does not consider context, so its predictions are often inaccurate or unnatural.
 Still useful for baseline comparisons and some low-resource scenarios.
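
A minimal sketch of this unigram counting and prediction in Python (illustrative only; the token list is hard-coded, and any real tokenization is assumed rather than shown):

from collections import Counter

tokens = ["I", "love", "NLP", "I", "love", "AI", "AI", "loves", "me"]

counts = Counter(tokens)
total = len(tokens)

# Unigram probability: P(w) = Count(w) / total number of tokens
probs = {w: c / total for w, c in counts.items()}

# Context is ignored: the predicted "next word" is simply the most probable word overall
prediction = max(probs, key=probs.get)

print(probs)       # P(I) = P(love) = P(AI) = 2/9 ≈ 0.222; the rest are 1/9 ≈ 0.111
print(prediction)  # one of the tied top words (here "I", the first encountered)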

The Markov assumption


The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can
approximate the history by just the last few words. The bigram model, for example, approximates the probability of a
word given all the previous words, P(wn | w1:n-1), by using only the conditional probability given the preceding word, P(wn |
wn-1). In other words, instead of computing the probability
P(blue | The water of Walden Pond is so beautifully)
we approximate it with the probability
P(blue | beautifully)
When we use a bigram model to predict the conditional probability of the next word, we are thus making the following
approximation:
P(wn | w1:n-1) ≈ P(wn | wn-1)
The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
Markov models are the class of probabilistic models that assume we can predict the probability of some future unit
without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram
(which looks two words into the past) and thus to the n-gram (which looks n-1 words into the past).

How to estimate probabilities


An intuitive way to estimate probabilities is called maximum likelihood estimation, or MLE. We get the MLE estimate
for the parameters of an n-gram model by getting counts from a corpus and normalizing the
counts so that they lie between 0 and 1. For probabilistic models, normalizing means dividing by some total count so that
the resulting probabilities fall between 0 and 1 and sum to 1. For example, to compute a particular bigram probability of a
word wn given a previous word wn-1, we’ll compute the count of the bigram C(wn-1 wn ) and normalize by the sum of all
the bigrams that share the same first word wn-1:

P(wn | wn-1) = C(wn-1 wn) / Σ_w C(wn-1 w)

We can simplify this equation, since the sum of all bigram counts that start with a given word wn-1 must be equal to the
unigram count for that word wn-1:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

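As a quick, illustrative check of this simplification (a sketch assuming the toy corpus used in the example below, not code from the lecture), the bigram counts sharing a given first word can be summed and compared with that word's unigram count:

from collections import Counter

sentences = [["<s>", "I", "love", "NLP", "</s>"],
             ["<s>", "I", "love", "AI", "</s>"],
             ["<s>", "AI", "loves", "me", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in sentences:
    bigrams.update(zip(sent, sent[1:]))   # count adjacent word pairs
    unigrams.update(sent)                 # count single words

prev = "love"
row_sum = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
print(row_sum, unigrams[prev])  # 2 2: bigrams starting with "love" sum to C(love)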
Let’s work through an example using a mini-corpus of three sentences.


"I love NLP. I love AI. AI loves me."

We'll represent each sentence like this:

1. <s> I love NLP </s>


2. <s> I love AI </s>
3. <s> AI loves me </s>

Flattened Token Sequence:

[<s>, I, love, NLP, </s>, <s>, I, love, AI, </s>, <s>, AI, loves, me, </s>]

Step 1: Count Bigrams

Bigram : Count
<s> → I : 2
<s> → AI : 1
I → love : 2
love → NLP : 1
love → AI : 1
NLP → </s> : 1
AI → </s> : 1
AI → loves : 1
loves → me : 1
me → </s> : 1

Step 2: Count Unigrams (for denominator)

Word : Count
<s> : 3
I : 2
love : 2
NLP : 1
AI : 2
loves : 1
me : 1
</s> : 3
Step 3: Bigram Probabilities

Let’s calculate a few sample bigram probabilities:

Example 1: Predict word after “love”
P(NLP | love) = C(love NLP) / C(love) = 1/2 = 0.5
P(AI | love) = C(love AI) / C(love) = 1/2 = 0.5

Example 2: Probability of ending a sentence after “me”
P(</s> | me) = C(me </s>) / C(me) = 1/1 = 1.0

Example 3: Probability that a sentence starts with “AI”
P(AI | <s>) = C(<s> AI) / C(<s>) = 1/3 ≈ 0.33

Suppose your current word is: “love”


You want to predict the next word.

From the above:

 P(NLP | love) = 0.5


 P(AI | love) = 0.5

Prediction: Either “NLP” or “AI” with equal probability (50%)
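
The bigram counts and MLE probabilities above can be sketched in Python as follows (an illustrative sketch, not the lecture's own code; the helper names bigram_prob and predict_next are assumptions):

from collections import Counter

sentences = [["<s>", "I", "love", "NLP", "</s>"],
             ["<s>", "I", "love", "AI", "</s>"],
             ["<s>", "AI", "loves", "me", "</s>"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    bigram_counts.update(zip(sent, sent[1:]))
    unigram_counts.update(sent)

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def predict_next(prev):
    # All words seen after `prev`, with their conditional probabilities
    candidates = {w2: c / unigram_counts[prev]
                  for (w1, w2), c in bigram_counts.items() if w1 == prev}
    return max(candidates, key=candidates.get), candidates

print(bigram_prob("love", "NLP"))  # 1/2 = 0.5
print(bigram_prob("me", "</s>"))   # 1/1 = 1.0
print(predict_next("love"))        # "NLP" and "AI" tie at 0.5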

Trigram Model Basics


A trigram model assumes that the probability of a word depends only on the two previous words:

P(wn | wn-2 wn-1) = C(wn-2 wn-1 wn) / C(wn-2 wn-1)

To estimate this, we need to count how often each triple of consecutive words occurs in the corpus.

Sentences:

1. "I love NLP."


2. "I love AI."
3. "AI loves me."

We'll add <s> <s> at the beginning of each sentence (to account for trigram context), and </s> at the end:

Tokenized Sentences with Markers:

 <s> <s> I love NLP </s>


 <s> <s> I love AI </s>
 <s> <s> AI loves me </s>

Step 1: Trigram Counts

Trigram : Count
<s> <s> I : 2
<s> <s> AI : 1
<s> I love : 2
I love NLP : 1
I love AI : 1
love NLP </s> : 1
love AI </s> : 1
<s> AI loves : 1
AI loves me : 1
loves me </s> : 1

Step 2: Bigram Counts (for denominator)

Bigram : Count
<s> <s> : 3
<s> I : 2
<s> AI : 1
I love : 2
love NLP : 1
love AI : 1
AI loves : 1
loves me : 1

Step 3: Predict Next Word Using Trigram

Suppose the previous two words are: "I love"


To predict the next word, we calculate:

P(w | I love) = C(I love w) / C(I love)

From the trigram table:

 Count(I love NLP) = 1
 Count(I love AI) = 1
 Total: Count(I love) = 2

Probabilities:
P(NLP | I love) = 1/2 = 0.5
P(AI | I love) = 1/2 = 0.5
So, if your context is “I love”, the trigram model says the next word is “NLP” or “AI” with equal probability.

Another Example:

Previous words: "AI loves"

Count(AI loves me) = 1

Count(AI loves) = 1

P(me | AI loves) = 1/1 = 1.0

The model is certain the next word after “AI loves” is “me”.
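
A matching trigram sketch, assuming the double <s> padding used above (the helper name trigram_prob is an illustrative assumption):

from collections import Counter

sentences = [["<s>", "<s>", "I", "love", "NLP", "</s>"],
             ["<s>", "<s>", "I", "love", "AI", "</s>"],
             ["<s>", "<s>", "AI", "loves", "me", "</s>"]]

trigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    trigram_counts.update(zip(sent, sent[1:], sent[2:]))  # word triples
    bigram_counts.update(zip(sent, sent[1:]))             # word pairs (denominators)

def trigram_prob(w1, w2, w3):
    # MLE estimate: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("I", "love", "NLP"))   # 1/2 = 0.5
print(trigram_prob("AI", "loves", "me"))  # 1/1 = 1.0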
