Lecture Three (Chapter Two: N-gram Language Models)

This document discusses n-gram language models. It begins with an introduction that defines language models and their uses in applications like speech recognition and machine translation. It then discusses the role of n-gram models, describing them as (n-1)-order Markov models that use the previous n-1 words to predict the next word. The document provides examples of estimating probabilities from text using n-gram counts and discusses issues like handling unknown words. Finally, it discusses parameter estimation and smoothing techniques for n-gram models.


Chapter 2: N-gram Language Models
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2022)
Outline

- Introduction
- The role of language models
- Simple N-gram models
- Estimating parameters and smoothing
- Evaluating language models

Introduction

- Language models assign a probability that a sentence is a legal string in a language.

- Language models are a useful component of many NLP systems, such as:
  - Automatic Speech Recognition (ASR),
  - Optical Character Recognition (OCR), and
  - Machine Translation (MT).

Introduction …

- Language model definition:
  - It is impossible to recover the intended word sequence W from the observed signal Y successfully in all cases, because of ambiguity.
  - Instead, minimize the probability of error by choosing an estimate of W out of a number of options.
  - Choose the estimate Ŵ for which the probability given the signal Y is greatest:

    Ŵ = argmax_i p(Wi | Y)

  - A language model is a computational mechanism for obtaining these conditional probabilities.

Introduction …

- Language models answer the question:
  - How likely is it that a string of English words is good English?

- They help with reordering:
  - PLM(the house is small) > PLM(small the is house)

- They help with word choice:
  - PLM(I am going home) > PLM(I am going house)

Introduction …

- What is a statistical language model?
  - A stochastic process model for word sequences: a mechanism for computing the probability
    p(w1, . . . , wT)

- Statistical language modeling
  - Goal: create a statistical model so that one can calculate the probability of a sequence of tokens s = w1, w2, …, wn in a language.
  - General approach: estimate the model from a training corpus; given a sequence s, it outputs the probability of the observed elements, P(s).
Role of Language Models

- Why are language models interesting?
  - They are an important component of a speech recognition system:
    - They help discriminate between similar-sounding words.
    - They help reduce search costs.
  - In statistical machine translation, a language model characterizes the target language and captures fluency.
  - They are used for selecting among alternatives in summarization and generation.
  - Text classification (style, reading level, language, topic, ...)

- Language models can be used for more than just words:
  - Letter sequences (language identification)
  - Speech act sequence modeling
  - Case and punctuation restoration

Role of Language Models…

- Uses of language models:
  - Speech recognition
    - "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry".
  - OCR and handwriting recognition
    - More probable sentences are more likely to be correct readings.
  - Machine translation
    - More likely sentences are probably better translations.
  - Generation
    - More likely sentences are probably better NL generations.
  - Context-sensitive spelling correction
    - "Their are problems wit this sentence."

Role of Language Models…

- Completion prediction:
  - A language model also supports predicting the completion of a sentence.
    - Please turn off your cell _____
    - Your program does not ______
  - Predictive text input systems can guess what you are typing and give choices on how to complete it (see the sketch below).
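A minimal Python sketch of this idea: rank candidate completions with a tiny bigram table. The counts and the pick_completion helper are purely illustrative assumptions, not material from the lecture.

# Completion-prediction sketch with made-up bigram counts.
from collections import Counter

bigram_counts = {
    "cell": Counter({"phone": 120, "wall": 3, "door": 1}),
    "not": Counter({"work": 80, "compile": 40, "run": 35}),
}

def pick_completion(prev_word, k=3):
    """Return the k most likely completions after prev_word under the bigram table."""
    candidates = bigram_counts.get(prev_word, Counter())
    total = sum(candidates.values())
    return [(w, c / total) for w, c in candidates.most_common(k)] if total else []

print(pick_completion("cell"))   # [('phone', 0.967...), ('wall', ...), ('door', ...)]
print(pick_completion("not"))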

Simple N-Gram Models

- An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n-1)-order Markov model.

- Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.

- Simple n-gram models are easy to train in an unsupervised way on raw corpora and can provide useful estimates of sentence likelihood.

Simple N-Gram Models…

- Estimate the probability of each word given the prior context.
  - P(phone | Please turn off your cell)

- The number of parameters required grows exponentially with the number of words of prior context.

- An N-gram model uses only the previous N-1 words of context (see the sketch after this list).
  - Unigram: P(phone)
  - Bigram: P(phone | cell)
  - Trigram: P(phone | your cell)
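To make the unigram/bigram/trigram distinction concrete, a small sketch (the ngram_context helper is a hypothetical name, not the lecture's code) that truncates the history to the last N-1 words before a conditional probability lookup:

# An N-gram model conditions only on the last N-1 words of the history.
def ngram_context(history, n):
    """Return the part of the history an n-gram model actually uses."""
    words = history.split()
    return tuple(words[-(n - 1):]) if n > 1 else ()

history = "please turn off your cell"
print(ngram_context(history, 1))  # ()                 -> P(phone)
print(ngram_context(history, 2))  # ('cell',)          -> P(phone | cell)
print(ngram_context(history, 3))  # ('your', 'cell')   -> P(phone | your cell)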

Simple N-Gram Models…

- The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history.
  - In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N-1)-order Markov model.
  - Use the previous N-1 words in a sequence to predict the next word.

- Language Model (LM):
  - unigrams, bigrams, trigrams, 4-grams, 5-grams, ...
  - How do we train these models?
    - Using very large corpora
Simple N-Gram Models…

- N-Gram Model Formulas:
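The equations on this slide did not survive conversion; the following is a reconstruction, under the assumption that the slide showed the standard chain rule, the N-gram (Markov) approximation, and the relative-frequency estimate used later in the lecture:

P(w_1^T) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1})                      (chain rule)

P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-N+1}^{t-1})             (N-gram / Markov approximation)

P(w_t \mid w_{t-N+1}^{t-1}) = \frac{C(w_{t-N+1}^{t-1}\, w_t)}{C(w_{t-N+1}^{t-1})}   (relative-frequency estimate)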

Estimating Probabilities

- N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.

- To have a consistent probabilistic model, append a unique start symbol (<s>) and end symbol (</s>) to every sentence and treat these as additional words (see the sketch below).
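A minimal Python sketch of relative-frequency bigram estimation with <s> and </s> padding; the three-sentence toy corpus and the helper names are illustrative assumptions, not the lecture's data:

# Estimate bigram probabilities by relative frequency: P(w | prev) = C(prev w) / C(prev).
from collections import Counter

corpus = ["i want english food", "i want chinese food", "tell me about chez panisse"]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words[:-1])                  # counts of bigram contexts
    bigram_counts.update(zip(words[:-1], words[1:]))   # counts of (prev, w) pairs

def p_bigram(w, prev):
    """Relative-frequency estimate of P(w | prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(p_bigram("want", "i"))        # 2/2 = 1.0 in this toy corpus
print(p_bigram("english", "want"))  # 1/2 = 0.5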

Example

- Here are some text-normalized sample user queries (a sample of 9332 sentences is on the website):
  - Berkeley Restaurant Project sentences:
    - can you tell me about any good cantonese restaurants close by
    - mid priced thai food is what i'm looking for
    - tell me about chez panisse
    - can you give me a listing of the kinds of food that are available
    - i'm looking for a good place to eat breakfast
    - when is caffe venezia open during the day

Example

- (Table of bigram counts and bigram probability estimates from the Berkeley Restaurant Project corpus, used in the calculations on the next slide.)
Example

- Bigram estimates of sentence probabilities:

  - P(<s> i want English food </s>)
    = P(i | <s>) P(want | i) P(English | want) P(food | English) P(</s> | food)
    = 0.25 x 0.33 x 0.0011 x 0.5 x 0.68 = 0.000031

  - P(<s> i want Chinese food </s>)
    = P(i | <s>) P(want | i) P(Chinese | want) P(food | Chinese) P(</s> | food)
    = 0.25 x 0.33 x 0.0065 x 0.52 x 0.68 = 0.00019
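The two products can be checked directly; a quick sketch, taking the bigram estimates quoted on the slide as given:

# Reproduce the two bigram sentence probabilities from the slide.
from math import prod

p_english = prod([0.25, 0.33, 0.0011, 0.5, 0.68])   # ~0.000031
p_chinese = prod([0.25, 0.33, 0.0065, 0.52, 0.68])  # ~0.00019
print(f"{p_english:.6f}  {p_chinese:.5f}")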
Example

- What kinds of knowledge do these probabilities encode?
  - P(English | want) = 0.0011
  - P(Chinese | want) = 0.0065 (more about the world)
  - P(to | want) = 0.66 (more about the grammar)
  - P(eat | to) = 0.28
  - P(food | to) = 0.0 (contingent zero)
  - P(want | spend) = 0.0 (more about the grammar)
  - P(i | <s>) = 0.25

Example

- Practical issues: we do everything in log space,
  - to avoid underflow (arithmetic underflow), and
  - to make computation easier (adding is faster than multiplying).

  log(P1 x P2 x P3 x P4) = log P1 + log P2 + log P3 + log P4
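A small Python sketch showing that summing log probabilities matches multiplying the probabilities while avoiding underflow for long products; the probability values are the ones from the earlier example:

# Log-space computation of a sentence probability.
import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]           # bigram probabilities from the example
log_prob = sum(math.log(p) for p in probs)         # add logs instead of multiplying probs
print(log_prob)                                    # ~ -10.4
print(math.exp(log_prob))                          # ~ 0.000031, same as the direct product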

Simple N-Gram Models …

- Train and test corpora:
  - A language model must be trained on a large corpus of text to estimate good parameter values.
  - Ideally, the training (and test) corpus should be representative of the actual application data.
  - We may need to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.

Simple N-Gram Models …

- Train and test corpora (continued):
  - Unknown words:
    - How do we handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
    - Train a model that includes an explicit symbol for an unknown word (<UNK>), either:
      - choose a vocabulary in advance and replace all other words in the training corpus with <UNK>, or
      - replace the first occurrence of each word in the training data with <UNK> (see the sketch below).
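A minimal Python sketch of the second strategy; the helper name mark_first_occurrences is hypothetical, not from the lecture:

# Replace the first occurrence of every word type with <UNK> so the model
# learns a probability for unknown words.
def mark_first_occurrences(tokens):
    seen = set()
    out = []
    for w in tokens:
        if w in seen:
            out.append(w)
        else:
            seen.add(w)
            out.append("<UNK>")
    return out

tokens = "i want thai food i want chinese food".split()
print(mark_first_occurrences(tokens))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'i', 'want', '<UNK>', 'food']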

Estimating Parameters and Smoothing

- Estimating parameters:
  - Parameter estimation is fundamental to many statistical approaches to NLP.
  - Because of the high-dimensional nature of natural language, it is often easy to generate an extremely large number of features.
  - The challenge of parameter estimation is to find a combination of the typically noisy, redundant features that accurately predicts the target output variable and avoids overfitting.
  - List of potential parameter estimators:
    - Maximum Entropy (ME) estimation with L2 regularization, the Averaged Perceptron (AP), Boosting, ME estimation with L1 regularization using a novel optimization algorithm, and BLasso, which is a version of Boosting with Lasso (L1) regularization, etc.
Estimating Parameters and Smoothing…

- Estimating parameters (continued):
  - Intuitively, this can be achieved either:
    - by selecting a small number of highly effective features and ignoring the others, or
    - by averaging over a large number of weakly informative features.
  - The first intuition motivates feature selection methods such as Boosting and BLasso, which usually work best when many features are completely irrelevant.
  - L1 (Lasso) regularization of linear models embeds feature selection into regularization, so that both the assessment of a feature's reliability and the decision about whether to remove it are done in the same framework; it has recently generated a large amount of interest in the NLP community.
Estimating Parameters and Smoothing…

- Estimating parameters (continued):
  - If, on the other hand, most features are noisy but at least weakly correlated with the target, it may be reasonable to attempt to reduce noise by averaging over all of the features.
  - ME estimators with L2 regularization, which have been widely used in NLP tasks, tend to produce models that have this property.
  - In addition, the perceptron algorithm and its variants, e.g., the voted or averaged perceptron, are becoming increasingly popular due to their competitive performance, simplicity of implementation, and low computational cost in training.
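To make the L1-versus-L2 contrast concrete, a minimal sketch using scikit-learn's LogisticRegression as an ME-style estimator; the toolkit choice and the synthetic data are assumptions, since the lecture does not name an implementation:

# Contrast L1- and L2-regularized logistic regression on noisy features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                     # 200 examples, 50 mostly noisy features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)                       # averages over many weak features
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)   # drives irrelevant weights to zero

print("non-zero weights (L2):", np.count_nonzero(l2_model.coef_))
print("non-zero weights (L1):", np.count_nonzero(l1_model.coef_))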
Evaluating Language Models

- Ideally, evaluate the use of the model in an end application (extrinsic evaluation):
  - Realistic approach
  - Expensive (time-consuming)

- Evaluate the ability of the model on a test corpus using metrics (intrinsic evaluation: independent of any application):
  - Less realistic
  - Cheaper

- Verify at least once that the intrinsic evaluation correlates with an extrinsic one.
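The slide does not name a specific intrinsic metric; perplexity is the standard one for language models, so the sketch below (an assumption, with made-up per-word probabilities) shows how it would be computed on a held-out test corpus:

# Perplexity from the probabilities a trained model assigns to each test word.
import math

test_word_probs = [0.25, 0.33, 0.0065, 0.52, 0.68]   # hypothetical P(w_i | context) values
n = len(test_word_probs)
perplexity = math.exp(-sum(math.log(p) for p in test_word_probs) / n)
print(perplexity)   # lower is better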
Summary of Language Models

- Limitations of the n-gram LMs so far:
  - P(word | full history) is too expensive.
  - P(word | previous few words) is feasible.
  - This approach gives us only the local context; it lacks the global context.

- Other approaches:
  - Neural networks
  - Recurrent Neural Network (RNN – most recent words)
  - Long Short-Term Memory (LSTM – limited to a few hundred words due to the inherently sequential path from the previous unit to the current unit)
  - Transformer (new model, introduced in a 2017 Google paper)
Question & Answer

Thank You !!!
