Multimedia Application L6
Language Modeling
By:
i) N-gram Models
ii) Neural Network Models
iii) Transformer Models
i) India
ii) China
iii) Uzbekistan
Language model applications
Spell checking
Grammar checking
Machine translation
Summarization
Question answering
Speech recognition
Probabilistic Language Models
+ Summarization, question answering, …
Probability of a sentence
Grammar correction:
P(I go to school) > P(I going to school)
Conditional probabilities
=> P(B|A) = P(A,B) / P(A)
Rewriting : P(A,B) = P(A)P(B|A)
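As a quick numeric check of this identity (the probabilities below are made up for illustration only):

```python
# Conditional probability: P(B|A) = P(A, B) / P(A)
# Toy numbers, assumed purely for illustration.
p_a = 0.5        # P(A)
p_a_and_b = 0.2  # P(A, B)

p_b_given_a = p_a_and_b / p_a  # P(B|A) = 0.4

# Rewriting: P(A, B) = P(A) * P(B|A)
assert abs(p_a * p_b_given_a - p_a_and_b) < 1e-12
print(p_b_given_a)  # 0.4
```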
Example
P(Tashkent is the capital of Uzbekistan)
= P(Tashkent) × P(is | Tashkent) × P(the | Tashkent, is) × P(capital | Tashkent, is, the) × P(of | Tashkent, is, the, capital) × P(Uzbekistan | Tashkent, is, the, capital, of)
Chain Rule of Probability
Calculation
P(Uzbekistan | Tashkent, is, the, capital, of)
= count(Tashkent is the capital of Uzbekistan) / count(Tashkent is the capital of)
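A minimal sketch of this count-based estimate; the counts below are invented for illustration (real counts would come from a corpus):

```python
# MLE estimate of P(Uzbekistan | Tashkent is the capital of)
# from n-gram counts. Toy counts, assumed for illustration.
counts = {
    "Tashkent is the capital of Uzbekistan": 8,
    "Tashkent is the capital of": 10,
}

p = (counts["Tashkent is the capital of Uzbekistan"]
     / counts["Tashkent is the capital of"])
print(p)  # 0.8
```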
The Chain Rule applied to compute the joint probability of words in a sentence:
P(w1 w2 … wn) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × … × P(wn | w1, …, wn−1)
Simplifying assumption (Andrei Markov):
Bigram approximation: P(Uzbekistan | Tashkent, is, the, capital, of) ≈ P(Uzbekistan | of)
Trigram approximation: P(Uzbekistan | Tashkent, is, the, capital, of) ≈ P(Uzbekistan | capital of)
Given that
P(I | <s>) = 0.25
P(want | I) = 0.33
P(English | want) = 0.0011
P(food | English) = 0.5
P(</s> | food) = 0.68
Then P(<s> I want English food </s>) = 0.25 × 0.33 × 0.0011 × 0.5 × 0.68 ≈ 0.000031
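A minimal sketch of this bigram calculation, using the conditional probabilities listed above (case normalized):

```python
# Bigram probability of "<s> I want English food </s>":
# multiply P(word | previous word) across the sentence.
probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "English"): 0.0011,
    ("English", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "I", "want", "English", "food", "</s>"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= probs[(prev, word)]

print(f"{p:.6f}")  # 0.000031
```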
N-gram models
Google Ngram Viewer displays user-selected words or phrases (n-grams) in a graph that shows how often those phrases have occurred in a corpus. Google Ngram Viewer's corpus is made up of the scanned books available in Google Books.
Once the language model is built, it can then be used with machine learning algorithms to build predictive models for text analytics applications.
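A minimal sketch of building such a bigram language model with maximum-likelihood estimates; the tiny corpus is invented for illustration:

```python
from collections import Counter

# Train a bigram language model with MLE:
# P(w | prev) = c(prev, w) / c(prev).
corpus = [
    "I want to eat Chinese food",
    "I want to eat lunch",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def p(word, prev):
    """MLE estimate P(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p("want", "I"))      # 1.0 - "I" is always followed by "want"
print(p("Chinese", "eat")) # 0.5 - "eat" is followed by "Chinese" or "lunch"
```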
Google N-Gram Release, August 2006
…
Evaluating Language Models: Training and Test Sets
“Extrinsic evaluation”: a method of assessing the quality of a system by evaluating its performance on downstream tasks.
To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
Intrinsic evaluation: perplexity
PP(W) = P(w1 w2 … wN)^(−1/N)
Chain rule: PP(W) = (∏i 1/P(wi | w1 … wi−1))^(1/N)
Bigrams: PP(W) = (∏i 1/P(wi | wi−1))^(1/N)
Training 38 million words, test 1.5 million words, from Wall Street Journal
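A minimal sketch of computing perplexity, the standard intrinsic metric, for a bigram model; the probabilities below are assumed for illustration, and the product is taken in log space to avoid underflow on long texts:

```python
import math

# Perplexity of a test sentence under a bigram model:
# PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N)
# Toy bigram probabilities, invented for illustration.
bigram_p = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "food"): 0.1,
    ("food", "</s>"): 0.68,
}

tokens = ["<s>", "I", "want", "food", "</s>"]
log_prob = sum(math.log(bigram_p[(a, b)])
               for a, b in zip(tokens, tokens[1:]))
n = len(tokens) - 1  # number of predicted words
perplexity = math.exp(-log_prob / n)
print(round(perplexity, 2))  # 3.65
```

Lower perplexity means the model is less "surprised" by the test text.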
• If we test on the test set many times we might implicitly tune to its
characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times
• That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Sampling and Generalization
Unigram:
Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE
CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS
THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
How Shannon sampled those words in 1948
"Open a book at random and select a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution
Visualizing Bigrams the Shannon Way
Choose a random bigram (<s>, w) according to its probability P(w | <s>)
Now choose a random bigram (w, x) according to its probability P(x | w)
And so on until we choose </s>
Then string the words together

<s> I
    I want
      want to
        to eat
          eat Chinese
            Chinese food
              food </s>
Result: I want to eat Chinese food
There are other sampling methods
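A minimal sketch of this Shannon-style generation; the bigram distributions below are invented for illustration:

```python
import random

# Generate a sentence by repeatedly sampling the next word from
# P(w | previous word) until </s> is drawn.
bigrams = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 0.6, "lunch": 0.4},
    "Chinese": {"food": 1.0},
    "lunch": {"</s>": 1.0},
    "food": {"</s>": 1.0},
}

def generate(rng=random):
    word, out = "<s>", []
    while True:
        words, weights = zip(*bigrams[word].items())
        word = rng.choices(words, weights=weights)[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())  # e.g. "I want to eat Chinese food"
```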
Total works: 43
Words: 884,421
Unique word forms: 28,829
Words occurring only once: 12,493
The Wall Street Journal is not Shakespeare
Choosing training data
Finna going to
Why do we need a corpus in NLP?
N-gram models only work well for word prediction if the test corpus looks like the training corpus.
The perils of overfitting
Sparse statistics: the observed counts of the word following a fixed context are 3 allegations, 2 reports, 1 claims, 1 request (7 total); words such as outcome, attack, and man were never seen, so MLE gives them zero probability.
Smoothing steals probability mass from the seen events to cover the unseen ones, e.g. 0.5 claims, 0.5 request, and 2 spread over the other words (still 7 total).
Add-one estimation
MLE estimate: P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
Add-1 estimate: P_Add-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V), where V is the vocabulary size
Maximum Likelihood Estimates
Add-1 smoothing: OK for text categorization, not for language modeling.
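A minimal sketch contrasting the MLE and Add-1 estimates; the counts and vocabulary size are assumed for illustration:

```python
from collections import Counter

# MLE:   P(w | prev) = c(prev, w) / c(prev)          -> 0 for unseen bigrams
# Add-1: P(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
# Toy counts, invented for illustration.
bigram_counts = Counter({("denied", "the"): 5, ("denied", "a"): 2})
context_count = 7  # c("denied")
V = 1000           # vocabulary size (assumed)

def p_mle(bigram):
    return bigram_counts[bigram] / context_count

def p_add1(bigram):
    return (bigram_counts[bigram] + 1) / (context_count + V)

print(p_mle(("denied", "the")))  # 0.714...
print(p_mle(("denied", "it")))   # 0.0 - unseen bigram gets zero
print(p_add1(("denied", "it")))  # small but nonzero
```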
The most commonly used method: Extended Interpolated Kneser-Ney.
For very large N-grams like the Web: stupid backoff.
Advanced Language Modeling
Discriminative models: choose n-gram weights to improve a task, not to fit the training set.
Parsing-based models
Caching models: recently used words are more likely to appear again.
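A minimal sketch of the caching idea: interpolate a base model with a unigram distribution over the recent history. The base probabilities, interpolation weight λ, and cache size are all assumed for illustration:

```python
from collections import Counter, deque

# Cache LM: P(w | history) = (1 - lam) * P_base(w) + lam * c(w, cache) / |cache|
# Base unigram probabilities and lambda are invented for illustration.
p_base = {"the": 0.05, "model": 0.001, "cat": 0.002}
lam = 0.1
cache = deque(["the", "model", "predicts", "the", "model"], maxlen=100)

def p_cached(word):
    cache_counts = Counter(cache)
    p_cache = cache_counts[word] / len(cache)
    return (1 - lam) * p_base.get(word, 0.0) + lam * p_cache

# "model" was used recently, so its probability is boosted
# above its base value of 0.001.
print(p_cached("model"))  # 0.0409...
print(p_cached("cat"))    # 0.0018 - no cache boost
```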
Reference
Chapter 3
Questions?
Thank you