
Multimedia Application
By
Minhaz Uddin Ahmed, PhD
Department of Computer Engineering
Inha University Tashkent
Email: [email protected]
Content
 Language Models
 N-Grams
 3.2 Evaluating Language Models: Training and Test Sets
 3.3 Evaluating Language Models: Perplexity
 3.4 Sampling sentences from a language model
 3.5 Generalization and Zeros
 3.6 Smoothing
 3.8 Advanced: Kneser-Ney Smoothing
Language Modeling

 Language modeling involves predicting the probability distribution of words or tokens in a sequence of text. The goal of language modeling is to capture the underlying structure and patterns of natural language, allowing computers to generate coherent and grammatically correct text.

 There are several approaches to language modeling, including:

i) N-gram Models
ii) Neural Network Models
iii) Transformer Models
Language Modeling

 Tashkent is the capital of ---------------?

i) India
ii) China
iii) Uzbekistan
Language model applications

 Spell checking
 Grammar checking
 Machine translation
 Summarization
 Question answering
 Speech recognition
Probabilistic Language Models

 Assign a probability to a sentence


Application:
 Machine Translation:
P(high winds tonite) > P(large winds tonite)
 Spell Correction
 The office is about fifteen minuets from my house
 P(about fifteen minutes from) > P(about fifteen minuets from)
 Speech Recognition
 P(I saw a van) >> P(eyes awe of an)


+ Summarization, question answering, etc.
Probability of sentence

 Grammar correction
 I go to school
 I going to school

 Probability score: P(I go to school) > P(I going to school)

 Correct: “go to school”; Wrong: “going to school”


Probability of sentence or words

 Compute the probability of a sentence or sequence of words:

=> P(W) = P(w1, w2, w3, w4, w5, …, wn)

 Probability of an upcoming word:

=> P(w5 | w1, w2, w3, w4)
 P(Uzbekistan | Tashkent, is, the, capital, of)

 A model that computes either of these,
P(W) or P(wn | w1, w2, …, wn-1), is called a language model.
How to compute P(W)
 How to compute this joint probability:
 P(its, water, is, so, transparent, that)
 Intuition: let’s rely on the Chain Rule of Probability

P(A,B) = P(A|B) P(B)

We can extend this to three variables:
P(A,B,C) = P(A|B,C) P(B,C) = P(A|B,C) P(B|C) P(C)
and in general to n variables:
P(A1, A2, ..., An) = P(A1|A2, ..., An) P(A2|A3, ..., An) ... P(An-1|An) P(An)
In general we refer to this as the chain rule:

the joint probability of all the random variables can be calculated by multiplying the probability of each variable conditioned on all the previous variables.
Chain Rule of Probability

 Conditional probabilities
=> P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)

More variables: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

The chain rule in general:

=> P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1, …, xn-1)
Chain Rule of Probability

 Chain rule: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

 Example
= P(Tashkent is the capital of Uzbekistan)
= P(Tashkent) x P(is | Tashkent) x P(the | Tashkent, is) x P(capital | Tashkent, is, the) x P(of | Tashkent, is, the, capital) x P(Uzbekistan | Tashkent, is, the, capital, of)
Chain Rule of Probability

 Example
= P(Tashkent is the capital of Uzbekistan)
= P(Tashkent) x P(is | Tashkent) x P(the | Tashkent, is) x P(capital | Tashkent, is, the) x P(of | Tashkent, is, the, capital) x P(Uzbekistan | Tashkent, is, the, capital, of)

Calculation
= P(Uzbekistan | Tashkent, is, the, capital, of)
= count(Tashkent is the capital of Uzbekistan) / count(Tashkent is the capital of)
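As a minimal sketch of the count-based estimate above (the three-sentence corpus and the count_phrase helper are illustrative assumptions, not from the slides), the following Python snippet estimates P(Uzbekistan | Tashkent is the capital of) by literal phrase counting:

# Illustrative corpus; in practice the counts come from a much larger corpus.
corpus = [
    "Tashkent is the capital of Uzbekistan",
    "Tashkent is the capital of the republic",
    "Tashkent is the capital of Uzbekistan and its largest city",
]

def count_phrase(phrase, texts):
    """Count how many times the phrase occurs across all texts."""
    return sum(text.count(phrase) for text in texts)

numerator = count_phrase("Tashkent is the capital of Uzbekistan", corpus)
denominator = count_phrase("Tashkent is the capital of", corpus)
print(numerator / denominator)  # 2 / 3, roughly 0.67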
The Chain Rule applied to compute
joint probability of words in
sentence

P(“its water is so transparent”) =


P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities

 Could we just count and divide?

 No! Too many possible sentences!


 We’ll never see enough data for estimating these
Markov Assumption

 Simplifying assumption (attributed to Andrei Markov):
P(Uzbekistan | Tashkent, is, the, capital, of)
≈ P(Uzbekistan | of)             (bigram)
≈ P(Uzbekistan | capital, of)    (trigram)

The assumption that the probability of a word depends only on the previous word is called the Markov assumption.
Simplest case: Unigram model

Some automatically generated sentences from a unigram model:

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the


Bigram model

 Condition on the previous word

 Please bring me a glass of water.

History: "Please bring me a glass of"   Word prediction: "water"


Estimating bigram probabilities

 The Maximum Likelihood Estimate:

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Bigram model

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
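
A minimal Python sketch, assuming whitespace tokenization and keeping <s> and </s> as ordinary tokens, that computes maximum likelihood bigram estimates from the three-sentence corpus above (the p_bigram helper name is illustrative):

from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """Maximum likelihood estimate: P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))     # 2/3
print(p_bigram("am", "I"))      # 2/3
print(p_bigram("Sam", "am"))    # 1/2
print(p_bigram("</s>", "Sam"))  # 1/2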
Estimated bigram probabilities

 P(<s> I want English food </s>) = P(I|<s>) x P(want|I) x P(English|want) x P(food|English) x P(</s>|food) = 0.000031

 Given that
P(I|<s>) = 0.25
P(want|I) = 0.33
P(English|want) = 0.0011
P(food|English) = 0.5
P(</s>|food) = 0.68
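
A quick arithmetic check of the product above, using the probabilities listed:

# Multiply the five conditional probabilities given above.
p = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68
print(f"{p:.6f}")  # 0.000031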
N-gram models

 We can extend to trigrams, 4-grams, 5-grams
 In general this is an insufficient model of language
 because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

 But we can often get away with N-gram models

N-gram models

 An n-gram is a sequence of n successive items in a text document, which may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation.

 In deep learning, language models are trained on large datasets with much longer contexts than the low-order n-grams used here.
N-gram models

Google Ngram Viewer displays user-selected words or phrases (ngrams) in a graph that shows how often those phrases have occurred in a corpus. Google Ngram Viewer's corpus is made up of the scanned books available in Google Books.

Once the language model is built, it can then be used with machine learning algorithms to build predictive models for text analytics applications.
Google N-Gram Release, August
2006


Evaluating Language Models:
Training and Test Sets
 “Extrinsic evaluation”: a method of assessing the quality of a system by evaluating its performance on downstream tasks.
To compare models A and B:
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
Intrinsic evaluation

 Extrinsic evaluation not always possible


• Expensive, time-consuming
• Doesn't always generalize to other applications
 Intrinsic evaluation: perplexity
• Directly measures language model performance at predicting words.
• Doesn't necessarily correspond with real application performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
Training sets and test sets

We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
 A test set is an unseen dataset, different from the training set.
 Intuition: we want to measure generalization to unseen data
 An evaluation metric (like perplexity) tells us how well our model does on the test set.
Perplexity

 Perplexity is the standard metric for measuring the quality of a language model.
 It is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 … wN)^(-1/N)

Chain rule:
PP(W) = ( Π i=1..N  1 / P(wi | w1 … wi-1) )^(1/N)

Bigrams:
PP(W) = ( Π i=1..N  1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability


Perplexity

 Calculate the perplexity of a sentence

Task: recognizing digits (zero, one, …, nine) spoken in English
=> A sentence consists of random digits
=> Each digit has probability p = 1/10
=> Perplexity PP = ((1/10)^N)^(-1/N) = 10

Minimizing perplexity is the same as maximizing probability
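
A minimal sketch (not from the slides) of perplexity computed from a list of per-word conditional probabilities, PP = (prod 1/p_i)^(1/N); the digit example above serves as a sanity check:

def perplexity(word_probs):
    """Perplexity of a word sequence given each word's conditional probability."""
    product = 1.0
    for p in word_probs:
        product *= 1.0 / p
    return product ** (1.0 / len(word_probs))

# Every digit has probability 1/10, so perplexity is 10 regardless of length.
print(round(perplexity([0.1] * 7), 6))  # 10.0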


Perplexity

Lower perplexity = better model

Training on 38 million words, testing on 1.5 million words, from the Wall Street Journal:

N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109
Choosing training and test sets

• If we're building an LM for a specific task
• The test set should reflect the task language we want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training data
• We don't want the training set or the test set to be just from one domain or author or language.
Training on the test set

We can’t allow test sentences into the training set
• Or else the LM will assign that sentence an artificially high probability when we see it in the test set
• And hence assign the whole test set a falsely high probability.
• Making the LM look better than it really is
This is called “training on the test set”.
Dev sets

• If we test on the test set many times we might implicitly tune to its
characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times
• That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Sampling and Generalization

 The Shannon (1948) visualization method: sample words from an LM

 Unigram:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

 Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
How Shannon sampled those words in
1948

"Open a book at random and select a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution
Visualizing Bigrams the Shannon
Way
Choose a random bigram (<s>, w) according to its probability P(w|<s>)
Now choose a random bigram (w, x) according to its probability P(x|w)
And so on until we choose </s>
Then string the words together

<s> I
   I want
     want to
       to eat
         eat Chinese
           Chinese food
             food </s>

I want to eat Chinese food
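
A minimal Python sketch of this Shannon-style generation loop; the bigram table below is a made-up toy distribution, not probabilities estimated from any corpus:

import random

bigram_probs = {
    "<s>":     {"I": 1.0},
    "I":       {"want": 0.7, "am": 0.3},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 0.6, "food": 0.4},
    "Chinese": {"food": 1.0},
    "am":      {"</s>": 1.0},
    "food":    {"</s>": 1.0},
}

def generate(max_len=20):
    """Sample one word at a time, conditioning only on the previous word."""
    word, sentence = "<s>", []
    for _ in range(max_len):
        candidates = list(bigram_probs[word])
        weights = list(bigram_probs[word].values())
        nxt = random.choices(candidates, weights=weights)[0]
        if nxt == "</s>":
            break
        sentence.append(nxt)
        word = nxt
    return " ".join(sentence)

print(generate())  # e.g. "I want to eat Chinese food"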
There are other sampling methods

Used for neural language models


Many of them avoid generating words from the very
unlikely tail of the distribution
We'll discuss when we get to neural LM decoding:
 Temperature sampling
 Top-k sampling
 Top-p sampling
Approximating Shakespeare
Corpus

 A corpus refers to a large and structured set of machine-readable texts

 Corpus texts are collected from:

Books
Articles
Websites
Conversations
Social media
Audio
Shakespeare as corpus

N = 884,647 tokens, V = 29,066 word types

Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
 So 99.96% of the possible bigrams were never seen (have zero entries in the table)
 That sparsity is even worse for 4-grams, which is why our sampling generated actual Shakespeare
Shakespeare as Corpus

 Total works: 43
 Words: 884,421
 Unique word forms: 28,829
 Words occurring only once: 12,493
The Wall Street Journal is not
Shakespeare
Choosing training data

If task-specific, use a training corpus that has a similar genre to your task.
• If legal or medical, need lots of special-purpose documents
Make sure to cover different kinds of dialects and speakers/authors.
• Example: African-American Vernacular English (AAVE)
• One of many varieties that can be used by African Americans and others
• Can include the auxiliary verb finna that marks immediate future tense:
• "My phone finna die"

finna ≈ "going to"
Why do we need a corpus in NLP

 Training machine learning models
 Sentiment analysis, speech recognition, machine translation
Language understanding
 Grammar, vocabulary
Rule-based systems
 Part-of-speech (POS) tagging, named entity recognition
Statistical analysis
 Examine word frequency distributions, statistical features
Domain-specific knowledge
 Legal documents, medical documents, chatbots
Training Corpus

 To build a question answering system, we need a training corpus of question-answer pairs
 To build a system that translates legal documents, we need a training corpus of legal documents

 N-grams only work well for word prediction if the test corpus looks like the training corpus.
The perils of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus
• But even when we try to pick a good training corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros

 Training set:                 Test set:
 … denied the allegations      … denied the offer
 … denied the reports          … denied the loan
 … denied the claims
 … denied the request

 P(“offer” | denied the) = 0

Zero probability bigrams
 Bigrams with zero probability
 Will hurt our performance for texts where those words appear!
 And mean that we will assign 0 probability to the test set!
 And hence we cannot compute perplexity (can’t divide by 0)!
Smoothing: Add-one (Laplace)
smoothing
 When we have sparse statistics:

P(w | denied the)
  3 allegations
  2 reports
  1 claims
  1 request
  7 total
(unseen words such as “outcome”, “attack”, “man” have count 0)

 Steal probability mass to generalize better:

P(w | denied the), smoothed
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2   other
  7 total
Add-one estimation

 Also called Laplace smoothing
 Pretend we saw each word one more time than we did
 Just add one to all the counts!

 MLE estimate:
P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

 Add-1 estimate:
P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size
(a toy numeric example follows below)
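
A minimal sketch of the add-1 estimate, assuming whitespace tokenization and a tiny two-sentence toy corpus (the corpus and helper names are illustrative):

from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size (here 5: <s>, I, am, Sam, </s>)

def p_laplace(word, prev):
    """Add-1 estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("am", "I"))   # seen bigram:   (2 + 1) / (2 + 5)
print(p_laplace("Sam", "I"))  # unseen bigram: (0 + 1) / (2 + 5), no longer zero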
Maximum Likelihood Estimates

 The maximum likelihood estimate


 of some parameter of a model M from a training set T
 maximizes the likelihood of the training set T given the model M
 Suppose the word “bagel” occurs 400 times in a corpus of a million words
 What is the probability that a random word from some other text will be
“bagel”?
 MLE estimate is 400/1,000,000 = .0004
 This may be a bad estimate for some other corpus
 But it is the estimate that makes it most likely that “bagel” will occur 400 times in
a million word corpus.
Example corpus: Berkeley restaurant project
 Can you tell me about a good Cantonese restaurant close by.
 Mid priced thai food is what I am looking for
 Tell me about chez panisse
 Can you give me a listing of the kinds of food that are available
 I am looking for a good place to eat breakfast
 When is caffe Venezia open during the day.
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Out of 9222 sentences
Backoff and Interpolation

 Sometimes it helps to use less context
 Condition on less context for contexts you haven’t learned much about
 Backoff:
 use trigram if you have good evidence,
 otherwise bigram, otherwise unigram
 Interpolation:
 mix unigram, bigram, and trigram

 Interpolation works better (a small sketch follows below)
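
A minimal sketch of simple linear interpolation; the lambda weights and the component probabilities in the example call are illustrative placeholders:

def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | u, v) = l3*P(w | u, v) + l2*P(w | v) + l1*P(w); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# Example with made-up component probabilities:
print(interpolate(p_trigram=0.004, p_bigram=0.01, p_unigram=0.0001))
# 0.6*0.004 + 0.3*0.01 + 0.1*0.0001 = 0.00541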


How to set the lambdas

 Use a held-out corpus

[ Training Data | Held-Out Data | Test Data ]

 Choose λs to maximize the probability of the held-out data:
 Fix the N-gram probabilities (on the training data)
 Then search for λs that give the largest probability to the held-out set (a grid-search sketch follows below)

Held-out data is a portion of the labeled data, kept separate from the training and test sets, used for tuning parameters such as the λs.
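
A minimal sketch of choosing the λs by grid search over a held-out set; the held-out component probabilities below are made-up numbers rather than estimates from real data:

import math

# Each held-out word is represented by its (trigram, bigram, unigram)
# probabilities under the fixed N-gram model trained on the training data.
heldout = [
    (0.0040, 0.010, 0.0001),
    (0.0005, 0.020, 0.0010),
    (0.0100, 0.005, 0.0002),
]

def heldout_logprob(l3, l2, l1):
    """Log-probability of the held-out words under the interpolated model."""
    return sum(math.log(l3 * p3 + l2 * p2 + l1 * p1) for p3, p2, p1 in heldout)

best_score, best_lambdas = float("-inf"), None
for i3 in range(11):
    for i2 in range(11 - i3):
        l3, l2 = i3 / 10, i2 / 10
        l1 = (10 - i3 - i2) / 10  # the three weights always sum to 1
        score = heldout_logprob(l3, l2, l1)
        if score > best_score:
            best_score, best_lambdas = score, (l3, l2, l1)

print(best_lambdas)  # (lambda_3, lambda_2, lambda_1) maximizing held-out probability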


Unknown Words
 If we know all the words in advance
-- Vocabulary V is fixed
-- Closed vocabulary task

 If we don’t know all the words
-- Out of vocabulary (OOV) words
-- Open vocabulary task
 What we do in this situation (a short code sketch follows this list):
Create an unknown word token <UNK>
Create a fixed lexicon L of size V
In the normalization phase, change any word not in L to <UNK>
Train its probabilities like a normal word
At decoding time, use the <UNK> probabilities for any word not in the training set
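
A minimal sketch of the <UNK> normalization step described above; the lexicon and the example sentence are illustrative:

lexicon = {"<s>", "</s>", "I", "want", "to", "eat", "Chinese", "food"}

def normalize(tokens, lexicon):
    """Replace any token outside the fixed lexicon L with the <UNK> token."""
    return [tok if tok in lexicon else "<UNK>" for tok in tokens]

print(normalize("<s> I want to eat Uzbek food </s>".split(), lexicon))
# ['<s>', 'I', 'want', 'to', 'eat', '<UNK>', 'food', '</s>']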
Huge web-scale n-grams

 How to deal with, e.g., Google N-gram corpus


 Pruning
 Only store N-grams with count > threshold.
 Remove singletons of higher-order n-grams
 Entropy-based pruning
 Efficiency
 Efficient data structures like tries
 Bloom filters: approximate language models
 Store words as indexes, not strings
 Use Huffman coding to fit large numbers of words into two bytes
 Quantize probabilities (4-8 bits instead of 8-byte float)

“Stupid backoff” (Brants et al. 2007)


N-gram Smoothing Summary

 Add-1 smoothing:
 OK for text categorization, not for language modeling
 The most commonly used method:
 Extended Interpolated Kneser-Ney
 For very large N-grams like the Web:
 Stupid backoff
Advanced Language Modeling

 Discriminative models:
 choose n-gram weights to improve a task, not to fit the
training set
 Parsing-based models
 Caching Models
 Recently used words are more likely to appear
Reference

Chapter 3
Question
Thank you
