
Language model: N-gram

Lecture-6, 7, 8 (n-gram model)

Types of Language Models in NLP

Language modeling is a fundamental concept in Natural Language Processing (NLP). It involves building statistical or machine learning models that can predict the probability of a sequence of words. In essence, these models learn the patterns and structures of language, enabling machines to understand and generate human-like text.

Here's a breakdown of the key types:

1. Statistical Language Models:
These models use statistical techniques to determine the probability of word
sequences. They rely on counting word occurrences in large text corpora.
○ N-gram Models:
Traditional and simpler types of language models. They calculate the
probability of a word based on the preceding n-1 words.
Examples:
■​ Unigrams: Consider each word independently.
■​ Bigrams: Consider the previous word.
■​ Trigrams: Consider the previous two words.​
Advantages: Easy to implement, efficient for small datasets.​
Limitations: Fail to capture long-term dependencies (cannot
model meaning beyond a few words).
○​ Probabilistic & Bayesian Language Models:​
These models use probability theory to generate sequences.
■​ Markov Models: Predict sequences based on visible state
transitions (e.g., word-to-word). Simpler than HMMs, used in
early text generation.
■ Hidden Markov Model (HMM): Models a sequence of hidden states in which
each state depends only on the previous state (Markov assumption), and each
word is emitted from the current hidden state. Used in speech recognition and
part-of-speech tagging. Simple and interpretable, but not effective for
long-range dependencies.
2.​ Neural Language Models:​
These models utilize neural networks to learn representations of words and
their relationships, capturing complex patterns and semantic information.
Examples include Recurrent Neural Networks (RNNs), with subtypes like
LSTM (Long Short-Term Memory) for better memory, and Transformer
networks.
3.​ Large Language Models (LLMs):​
These are a subset of neural language models, usually treated separately
because of their massive scale. Characterized by a large number of parameters and
trained on enormous datasets, they are based on the Transformer architecture. They
excel in NLP tasks like text generation, translation, and question-answering.
Examples include GPT models and Google’s PaLM models.
○​ Applications: HMMs powered traditional POS tagging; LLMs now
handle it with deeper context.

Applications of language models-

Language models are used in a wide range of NLP applications, including:

a. Machine translation
b.​ Speech recognition
c.​ Text generation
d.​ Question answering
e.​ Text summarization

Spell correction-

Language models help in spell correction by predicting the most likely correct
word based on context and probability.

●​ A user types a word with a spelling mistake (e.g., "teh" instead of "the")
●​ The system needs to correct it based on context and word probability.

How Does a Language Model Help?

1. Uses a probability-based model to determine the most likely correct word.
2. Considers the surrounding words (context) to predict the best correction.
3. Ranks possible corrections based on how often they appear in real-world text.

Let's say a user types:

"I am goign to the market"

Possible corrections for "goign" are:

●​ "going"
●​ "gogin"
●​ "gonig"

A bigram language model estimates the probability of each candidate given the
previous word ("am"):

● P(going | am) = 0.9
● P(gogin | am) = 0.02
● P(gonig | am) = 0.08

The model chooses "going" because it has the highest probability in natural text.
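A minimal Python sketch of this ranking idea, using a hand-filled bigram table with the illustrative probabilities above (in a real system these would be estimated from a large corpus):

```python
# Rank candidate corrections by P(candidate | previous word).
# The probabilities below are the illustrative values from the example,
# not estimates learned from real data.
bigram_prob = {
    ("am", "going"): 0.9,
    ("am", "gogin"): 0.02,
    ("am", "gonig"): 0.08,
}

def best_correction(prev_word, candidates):
    # Unseen (prev_word, candidate) pairs get probability 0.
    return max(candidates, key=lambda w: bigram_prob.get((prev_word, w), 0.0))

print(best_correction("am", ["going", "gogin", "gonig"]))  # -> going
```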

N-Gram Models

A language model (LM) is a statistical model that predicts the next word in a
sequence given the previous words. It is essential in applications like speech
recognition, text generation, machine translation, and spell correction.

An N-Gram model is a type of probabilistic language model that predicts the next word based on the previous (N-1) words in a sequence.

● Unigram (1-Gram) → Predicts a word independently (no context).
● Bigram (2-Gram) → Predicts a word based on 1 previous word.
● Trigram (3-Gram) → Predicts a word based on 2 previous words.
● N-Gram (N ≥ 4) → Predicts a word based on N-1 previous words.
Unigram: “The”, “dog”, “runs” → No dependency
Bigram: P(dog | The)
Trigram: P(runs | The dog)
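As a small illustration (not part of the original notes), n-grams of any order can be extracted from a tokenized sentence with a few lines of Python:

```python
# Return the list of n-grams (as tuples) contained in a token sequence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "dog", "runs"]
print(ngrams(tokens, 1))  # unigrams: [('The',), ('dog',), ('runs',)]
print(ngrams(tokens, 2))  # bigrams:  [('The', 'dog'), ('dog', 'runs')]
print(ngrams(tokens, 3))  # trigrams: [('The', 'dog', 'runs')]
```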
Question
Given the training corpus:
<s> John drinks tea </s>
<s> She prefers tea with sugar </s>

Compute the bigram probability of the sentence <s> John drinks tea </s>.
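A worked sketch of one possible answer, assuming simple MLE (count-based) estimates and treating <s> and </s> as ordinary tokens: P(John | <s>) = 1/2, P(drinks | John) = 1/1, P(tea | drinks) = 1/1, P(</s> | tea) = 1/2, so P(<s> John drinks tea </s>) = 1/2 × 1 × 1 × 1/2 = 0.25. The same computation in Python:

```python
from collections import Counter

corpus = [
    "<s> John drinks tea </s>",
    "<s> She prefers tea with sugar </s>",
]

# Count unigrams and bigrams over the training corpus.
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    # MLE estimate: count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

sentence = "<s> John drinks tea </s>".split()
prob = 1.0
for w_prev, w in zip(sentence, sentence[1:]):
    prob *= bigram_prob(w_prev, w)

print(prob)  # 0.25 = (1/2) * 1 * 1 * (1/2)
```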
Why use Log Prob?

We use log probabilities in n-gram calculations primarily to prevent numerical underflow and to simplify probability computations.

1. Avoiding Numerical Underflow:
○​ N-gram models compute the probability of a sentence by multiplying
the probabilities of individual words.
○​ Since all probabilities lie between 0 and 1, multiplying many small
probabilities results in extremely tiny numbers, which can cause
numerical underflow (i.e., values becoming too small for the
computer to represent accurately).
○​ Using logarithms transforms the product into a sum, which avoids
this issue.
2.​ Mathematical Simplification:
○ A logarithmic identity states: log(a × b × c) = log a + log b + log c
○​ Instead of multiplying many small numbers, we can add their log
values, making computations simpler and more stable.
3.​ Efficient Computation in Machine Learning & NLP:
○​ Log probabilities allow for more efficient storage and processing in
large corpora.
○​ Many machine learning models work better with log values rather
than raw probabilities.
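A tiny illustration (not from the notes) of the underflow problem and the log-space fix:

```python
import math

# 500 word probabilities, each fairly small.
probs = [0.01] * 500

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value 1e-1000 underflows in 64-bit floats

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -2302.6, easily representable
```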

The log probability in an n-gram model represents the likelihood of a sequence of words occurring in a corpus, but expressed in the logarithmic domain instead of the standard probability domain.

What Does Log Probability Represent?


Interpreting Log Probability:

● Higher (closer to 0) log probability → more likely sequence.
● Lower (more negative) log probability → less likely sequence.

Example Interpretation

If we have two sentences:

1.​ Sentence A: "The dog chased the cat."​

○​ Log probability: -5.2​

2.​ Sentence B: "Cat the chased dog the."​

○​ Log probability: -12.8

The log probability of Sentence A is higher (less negative), meaning it is more likely based on the n-gram model.
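To connect these illustrative numbers back to ordinary probabilities (assuming natural logarithms), a short check:

```python
import math

log_prob_a = -5.2   # "The dog chased the cat."
log_prob_b = -12.8  # "Cat the chased dog the."

# Convert back to probabilities (assuming natural logs).
print(math.exp(log_prob_a))  # ~0.0055
print(math.exp(log_prob_b))  # ~0.0000028

# Sentence A is exp(-5.2 - (-12.8)) ~= 2000 times more probable under the model.
print(math.exp(log_prob_a - log_prob_b))  # ~1998
```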
Perplexity?

Perplexity is a measure of how well a language model predicts a sequence of words. Think of it as the model's "confusion level": lower perplexity means the model is less confused and better at guessing the next word. It is widely used to evaluate N-gram models (and others) by quantifying their predictive power on test data.

● Intuition: Imagine you're guessing the next word in "The cat ___." If your model strongly predicts "runs" (high probability), it is less perplexed. If it is unsure, with probability spread thinly across many words, perplexity shoots up.
● Goal: Lower perplexity = better model. It's like a score: aim for the lowest you can get!

Why Perplexity in N-gram Models?

N-gram models predict the probability of a word based on the previous N-1 words (e.g., bigrams
use 1 prior word, trigrams use 2). Perplexity tests how well these probabilities hold up on
unseen text:

● Unigram: Guesses each word independently—high perplexity (lots of uncertainty).
●​ Bigram: Uses one prior word—better, but still limited.
●​ Trigram: Uses two prior words—lower perplexity if trained well.
What Does Perplexity Mean?

● Perplexity = 2.75: The model's "effective vocabulary" is ~2.75 words, as if it were choosing between ~3 options per word on average. Lower is better; it means higher confidence.
● High Perplexity (e.g., 100): The model is effectively guessing from ~100 possibilities—terrible predictions.
●​ Perfect Model: Perplexity = 1 (probability = 1 for every word—impossible
in practice).
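For reference, the standard definition (not written out in the notes) is PP(W) = P(w1, ..., wN)^(-1/N), the inverse probability of the test sequence normalized by its length. A minimal Python sketch, assuming we already have the per-word conditional probabilities a model assigned to a short test sentence (hypothetical numbers):

```python
import math

# Hypothetical conditional probabilities P(w_i | context) for a 4-word test sentence.
word_probs = [0.5, 0.25, 0.4, 0.3]

# Perplexity = exp(-(1/N) * sum of log probabilities).
N = len(word_probs)
perplexity = math.exp(-sum(math.log(p) for p in word_probs) / N)
print(perplexity)  # about 2.86: roughly 3 effective choices per word
```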

Why Use Perplexity?

● Evaluation: Compare models—bigram (e.g., 2.75) vs. trigram (maybe 2.5)—lower wins.
●​ N-gram Limits: High perplexity on long sentences shows N-grams miss
distant context (e.g., “The cat... [20 words]... runs”).
Need for Smoothing

Smoothing Techniques-

Smoothing techniques address the issue of zero probabilities in MLE-based n-gram models by redistributing probability mass across observed and unseen word sequences. Without smoothing, any n-gram that never appears in the training corpus gets probability 0, which forces the probability of every sentence containing it to 0 as well.
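One common technique (add-one / Laplace smoothing; the lecture may cover others) replaces the MLE bigram estimate count(w_prev, w) / count(w_prev) with (count(w_prev, w) + 1) / (count(w_prev) + V), where V is the vocabulary size. A self-contained sketch on the two-sentence corpus used earlier:

```python
from collections import Counter

corpus = [
    "<s> John drinks tea </s>",
    "<s> She prefers tea with sugar </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size: 9 distinct tokens in this toy corpus

def laplace_bigram_prob(w_prev, w):
    # Add-one smoothing: every possible bigram keeps a small non-zero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(laplace_bigram_prob("John", "drinks"))  # seen:   (1 + 1) / (1 + 9) = 0.2
print(laplace_bigram_prob("John", "sugar"))   # unseen: (0 + 1) / (1 + 9) = 0.1, no longer zero
```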
Advantages & Limitations of N-Gram Models

✅ Advantages:
✔ Simple & Efficient: Easy to train and use for text generation.​
✔ Works Well for Small Datasets: Performs decently for moderate text corpora.

❌ Limitations:
❌ Data Sparsity: Large N-Grams require huge amounts of training data. As N
increases, the number of possible N-grams grows exponentially, leading to sparse
data and increased computational demands.
❌ Lack of Long-Range Context: Cannot capture dependencies beyond the previous
N-1 words.
❌ High Computational Cost: Higher-order N-Gram models require more memory.
Applications of N-Gram Models in NLP

🔹 Text Prediction (Smartphones, Keyboards - T9, SwiftKey)
🔹 Speech Recognition (Google Speech, Siri, Alexa)​
🔹 Machine Translation (Statistical MT before Deep Learning)​
🔹 Spell Checking & Auto-Correction (Grammarly, MS Word)​
🔹 Plagiarism Detection & Text Summarization
Code

https://colab.research.google.com/drive/1g5hVdk8hd6WF1LA-suTN1nOC7KWCaFPE
