
Mathematical Foundations of Computational Linguistics

Manfred Klenner & Jannis Vamvas

Department of Computational Linguistics

March 12, 2024

Content

(traditional) language models


example of chain rule
example of MLE
example of independence assumption (version: Markov assumption)
random experiments and the urn model
parameters and estimators
expected value, (corrected) variance, standard deviation
sample mean, sample variance

Chain Rule

chain rule (simple case):

P(A ∩ B) = P(A)P(B|A) or P(B)P(A|B) (1)

i.e. the joint probability for A and B is the product of the prior P of A and the conditional P of B given A

chain rule (general case):

P(A1 ∩ · · · ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) . . . P(An | ∩_{i=1}^{n−1} Ai)

e.g. P(2d ∩ 3d) = P(2d) · P(3d|2d) = 1/2 · 1/3 = 1/6

Language Models

capture word sequences of a particular language


different model types:
a (probabilistic) grammar, PCFG
probabilistic n-gram models: n-gram = sequence of n words,
e.g. bigrams, trigrams
prominent task: predict the next word
applications: speech recognition, spelling correction, machine
translation
most prominent: large (neural) language models like GPT4

see: Jurafsky and Martin, chapter 3, https://web.stanford.edu/~jurafsky/slp3/

N-grams

An n-gram = a sequence of n (e.g. word) tokens


- n=1: unigram: ’I’, ’am’
- n=2: bigram: ’I am’, ’Sam I’
- n=3: trigram: ’I am Sam’, ’Sam I am’
- n=4: 4-gram: ’I am Sam and’
- ...
n-grams have probabilities
how probable is: ’I am Sam’?
what is the most probable next word given: ’I am’?
is ’der das die’ a plausible sequence in German?
PoS tag n-grams: ’ART NN’, ’ART ADJA NN’, ...
task: estimation of n-gram probabilities on the basis of MLE

N-gram models

a (traditional) n-gram model is a probability distribution over word sequences


it is called a language model
sequence notation: w1 ... wn but also w1:n
current language models: neural language models
recent versions of language models use word embeddings

Maximum Likelihood Estimation (MLE)
MLE for the estimation of the prior probability of a single
word (unigram): C = frequency count
P(wi) = frequency of the word (token) / frequency of all words (tokens) = C(wi) / Σ_j C(wj)

a word wi with frequency of 250 in a corpus of 30000 words:


P(wi) = 250 / 30000

probability of n-gram: P(w1 , . . . , wn )


for large n: can we estimate the probability reliably?
e.g. P(’all of a sudden I saw the guy standing on the sidewalk’)
problem: sparseness, even in a large corpus (as with Naive Bayes)
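A minimal Python sketch of this unigram MLE (the toy corpus and function name are illustrative, not from the slides):

```python
from collections import Counter

# toy corpus; in practice this would be a large tokenized text
tokens = "I am Sam Sam I am I do not like green eggs and ham".split()

counts = Counter(tokens)        # C(w) for every word type
total = sum(counts.values())    # Σ_j C(wj), i.e. the corpus size

def unigram_mle(word):
    """Relative frequency estimate P(w) = C(w) / Σ_j C(wj)."""
    return counts[word] / total

print(unigram_mle("I"))    # 3/14 ≈ 0.21
print(unigram_mle("Sam"))  # 2/14 ≈ 0.14
```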
Chain rule and Markov assumption

P(w1:n) = P(w1) P(w2|w1) P(w3|w1 w2) . . . P(wn|w1:n−1)    (2)

        = ∏_{k=1}^{n} P(wk|w1:k−1)    (3)

Markov assumption: an item (word) depends only on its immediate predecessor

∏_{k=1}^{n} P(wk|w1:k−1) ≈ ∏_{k=1}^{n} P(wk|wk−1)    (4)

i.e.
P(the|Walden Pond’s water is so transparent that) ≈ P(the|that)
Markov applied and MLE estimation

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

we use artificial start and end symbols, ⟨s⟩ and ⟨/s⟩
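A small Python sketch (illustrative, not part of the slides) that estimates bigram probabilities from exactly this three-sentence corpus:

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE bigram probability P(word|prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))     # 2/3
print(p("Sam", "am"))    # 1/2
print(p("</s>", "Sam"))  # 1/2
```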

MLE n-gram estimation: formulas

P(wn|wn−1) = C(wn−1 wn) / Σ_w C(wn−1 w)

the number of bigrams starting with a word is identical to the frequency of the (first) word, so:

P(wn|wn−1) = C(wn−1 wn) / C(wn−1)

general formula:

P(wn|wn−N+1:n−1) = C(wn−N+1:n−1 wn) / C(wn−N+1:n−1)

where N is the n-gram size and n is the (changing) position of the target word wn

e.g. N = 3, n = 5 : P(w5|w5−3+1:5−1) = P(w5|w3:4)
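A generic version of this formula as a Python sketch (function and variable names are assumptions for illustration; no smoothing, no handling of unseen histories):

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """MLE estimate P(word | context) = C(context word) / C(context),
    where context is a tuple of the N-1 preceding tokens."""
    n = len(context) + 1
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    histories = Counter(zip(*[tokens[i:] for i in range(n - 1)]))
    return ngrams[tuple(context) + (word,)] / histories[tuple(context)]

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
print(ngram_prob(tokens, ("I",), "am"))        # bigram estimate: 2/2 = 1.0
print(ngram_prob(tokens, ("<s>", "I"), "am"))  # trigram estimate: 1/1 = 1.0
```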


Example: I want English food

frequencies of bigrams from the Berkeley Restaurant Project (e.g. ’I want’: 827)

we turn this into probabilities

Example, cont.

MLE example: P(want|i) = 827 / 2533 = 0.326 ≈ 0.33

Probability of the example sentence: P(⟨s⟩ i want english food ⟨/s⟩) = P(i|⟨s⟩) · P(want|i) · P(english|want) · P(food|english) · P(⟨/s⟩|food)

Log space to avoid numerical underflow
to avoid numerical underflow, we compute the score in log space
log (P(⟨s⟩ i want english food ⟨/s⟩))

= log (P(i|⟨s⟩)) + log (P(want|i)) + . . . + log (P(⟨/s⟩|food))

approximately:

= −1.386294 + −1.108662 + −6.81244 + −0.693147 + −0.385662 = −10.3862

and, if needed, turn it back into a probability:

= exp(−1.386294+−1.108662+−6.81244+−0.693147+−0.385662) = 3.0855e−05
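A quick Python check of this computation (the individual log probabilities are the ones given above):

```python
import math

# bigram log probabilities of the example sentence (natural logs)
log_probs = [-1.386294, -1.108662, -6.81244, -0.693147, -0.385662]

log_p = sum(log_probs)   # add in log space instead of multiplying probabilities
print(log_p)             # ≈ -10.386205
print(math.exp(log_p))   # ≈ 3.0855e-05, the probability of the sentence
```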

Examples: 2-gram

the nice film


the nice movie
a nice movie
some nice love film
the fantastic action movie

p(nice|the) = C(the nice)/C(the ?) = 2/3

p(nice|the) = C(the nice)/C(the) = 2/3

Examples: 3-gram

the nice film


the nice movie
a nice movie
really , a nice film

w=movie, i.e. w3,w5=movie

p(movie|a nice) = C(a nice movie)/c(a nice ?) = 2/3

note: 3-gram counts of a word are not equal to the unigram counts (as they are with bigrams)

⟨s⟩ movie title is: ’Transformer’

Random Experiments and the Urn Model

Every random experiment can be regarded as an urn problem


random experiments:
→ single: randomly draw a single ball
→ multiple (multi-step): draw a ball 10 times
multiple:
→ dependent (the ball is not returned to the urn)
→ independent (the ball is returned to the urn); see the sketch below
trial:
→ ordered: ordering is of interest
→ unordered: ordering is not of interest
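A small Python sketch (not from the slides) contrasting dependent and independent multi-step draws:

```python
import random

urn = ["red"] * 3 + ["blue"] * 2   # a toy urn with 5 balls

# dependent draws: the ball is NOT returned (sampling without replacement)
without_replacement = random.sample(urn, k=3)

# independent draws: the ball IS returned (sampling with replacement)
with_replacement = random.choices(urn, k=3)

print(without_replacement)  # e.g. ['red', 'blue', 'red']
print(with_replacement)     # e.g. ['blue', 'blue', 'blue'] is possible here
```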

Multi-step experiments

toss a (fair) coin twice


probability of:
→ two heads (11)?
→ P(head on a single toss) = 1/2
multiplication: 1/2 * 1/2 = 1/4
alternatively using Ω
→ Ω = {00, 01, 10, 11}
→ equally likely outcomes (fair coin)
→ thus P(11) = 1/4

Multi-step experiments

addition as well?
→ A = toss twice, mixed result
→ A = {01, 10}, P(A) = 1/4 + 1/4 = 1/2
→ add up paths of the tree

Distribution (Verteilung)

Random variable X : ’toss a coin 3 times’; the gain for each outcome ω ∈ Ω is its digit sum

events ω        000  001  010  100  011  101  110  111
outcome X(ω)     0    1    1    1    2    2    2    3
P               1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8

The random variable gives us for each event its value, e.g. X(110) = 2. The values returned by X are {0, 1, 2, 3}. The probabilities of these values are called their distribution:

gain x = X(ω)     0    1    2    3
p(X = x) = p(x)  1/8  3/8  3/8  1/8

note: the values of X are not necessarily equally probable (unlike the elementary events ω)
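A short Python sketch (illustrative, not from the slides) that derives exactly this distribution by enumerating the 8 equally likely outcomes:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product("01", repeat=3))   # the 8 outcomes 000 ... 111 as tuples of '0'/'1'
digit_sums = Counter(sum(int(b) for b in w) for w in outcomes)

# distribution of X = digit sum: P(X = x) = #outcomes with that sum / 8
dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(digit_sums.items())}
print(dist)  # {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
```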

Expected Value (Erwartungswert)

The weighted mean of a random variable is called its Expected Value
xi the outcomes of X , pi the probability of xi , n the number of outcomes of X

E(X) = µ = Σ_{i=1}^{n} xi pi = Σ_i xi p(X = xi)    (5)

Expected Value: example

digit-sum-coin-twice: Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}

outcomes (digit sum)= {0, 1, 2}

x1 = 0, x2 = 1, x3 = 2

p1 = 1/4, p2 = 1/2, p3 = 1/4

Expected Value (Erwartungswert)

X : Ω → digit sum

xi = X(ω)    0    1    2
pi          1/4  1/2  1/4

E(X) = Σ_i xi pi = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1
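The same computation in Python (an illustrative snippet, values taken from the table above):

```python
from fractions import Fraction

values = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

expected = sum(x * p for x, p in zip(values, probs))
print(expected)  # 1
```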

Expected Value (Erwartungswert), cont.

if we take the values of a random variable as gain/loss, the expected value then tells us the mean gain (over a number of trials)
E(X) of a fair die = 3.5
given n trials, we can expect a total outcome of about n · E(X)
e.g. if we throw the die 100 times, we will get approximately 350 pips in total

Expected value, mean, average, median, mode

random variable has an expected value = (weighted) mean


a fair die: 3.5
samples have a mean which is the average (arithmetic mean)
given X = {2, 4, 4, 6} : Σ_i xi / |X| = 16/4 = 4
samples have a median: the center of the values (’middle’)
median of X: 4
median of even-sized sets: average of the two center values
median of odd-sized sets: {2, 3, 4, 5, 6} : 4
samples have a mode: value that occurs most often
mode of X: 4
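These sample statistics for X = {2, 4, 4, 6} can be checked with Python's statistics module (an illustrative sketch, not part of the slides):

```python
import statistics

X = [2, 4, 4, 6]

print(statistics.mean(X))    # 4    (arithmetic mean)
print(statistics.median(X))  # 4.0  (average of the two center values, 4 and 4)
print(statistics.mode(X))    # 4    (most frequent value)
```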

E (X ) of a binary random variable

the expected gain is p, where p is the probability of success (i.e. toss head, i.e. ’1’)

fair coin: E (X ) = 1/2 ∗ 1 + 1/2 ∗ 0 = 1/2.

example: 100 times (fair coin) = 100 · 1/2 = 50.

that is: if we toss it 100 times, we expect to get about 50 heads

Deviation from the expected value

the expected value can also be a measure of the probability of


certain events: the closer a sample is to the expected value,
the more likely it is (intuitively)

question: how big a deviation from the expected value is


allowed before we doubt fairness?

Variance and standard deviation

the expected values of various random variables may be identical
to differentiate further, the deviation from the expected value, the variance, is defined
two random variables X , Y and their distribution

x      −1   0    1
p(x)   1/4  1/2  1/4        (6)

y      −4   −2   0     2    4
p(y)   1/4  1/5  1/10  1/5  1/4        (7)

→ both variables have mean 0.

Variance and standard deviation, cont

the expected value tells us where the center of a distribution


lies, the variance tells us how strongly the distribution is
concentrated around its center.

one uses the expected value of (X − µ)², with µ = E(X), as a measure of the variance:

V(X) = σ²    (8)

the square root of V(X) is called the standard deviation:

σ = √V(X)    (9)

Variance and standard deviation, cont

notation:
xi the outcomes of X , pi the probability of xi , n the number
of outcomes of X
σ² = V(X) = E((X − µ)²) = Σ_{i=1}^{n} (xi − µ)² pi    (10)

another (mathematically equivalent) formula for the variance:

σ² = V(X) = E(X²) − µ²    (11)

Variance and standard deviation, cont

for the random variables X (see eq. 6) and Y (see eq. 7) with x ∈ ΩX , y ∈ ΩY :

x    −1   0    1
p    1/4  1/2  1/4

E(X²) = (−1)² · 1/4 + 0² · 1/2 + 1² · 1/4 = 0.5

(the µ² term is dropped since µ = 0, so V(X) = E(X²) = 0.5)

Variance and standard deviation, cont

y    −4   −2   0     2    4
p    1/4  1/5  1/10  1/5  1/4

E(Y²) = (−4)² · 1/4 + (−2)² · 1/5 + 2² · 1/5 + 4² · 1/4 = 9.6

for dichotomic (binary) random variables σ² = pq and thus σ = √(pq), where q = 1 − p.

X = toss a coin with p = 0.6 : V(X) = 0.6 · 0.4 = 0.24
compared to the standard formula: p = 0.6, µ = 0.6 :
V(X) = 0.6 · (0.6 − 1)² + 0.4 · (0.6 − 0)² = 0.24
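A Python sketch (illustrative) that checks both variance formulas for the distributions X and Y above; exact fractions are used to avoid rounding noise:

```python
from fractions import Fraction as F

def variance(values, probs):
    """V(X) = E((X - mu)^2) = sum_i (x_i - mu)^2 * p_i."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def variance_alt(values, probs):
    """Equivalent form V(X) = E(X^2) - mu^2."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2

x_vals, x_probs = [-1, 0, 1], [F(1, 4), F(1, 2), F(1, 4)]
y_vals, y_probs = [-4, -2, 0, 2, 4], [F(1, 4), F(1, 5), F(1, 10), F(1, 5), F(1, 4)]

print(variance(x_vals, x_probs), variance_alt(x_vals, x_probs))  # 1/2 1/2
print(variance(y_vals, y_probs), variance_alt(y_vals, y_probs))  # 48/5 48/5 (= 9.6)
```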

Terminology

sampling: drawing samples from a population (Grundgesamtheit)

objective: drawing representative samples


i.e. distribution of the individual classes (events) in the sample
should be equal or very similar to the distribution in the
population
the estimation of the parameters of the population can then be
done using the sample

Statistics and probability theory

probability theory is pure mathematics


calculation with given probabilities
statistics: inferential and descriptive statistics
inferential statistics: infer the parameters of a distribution from a sample
descriptive statistics: representation and description of samples by means of tables, graphics, etc.

Notation conventions

expected value, variance, i.e. characteristic values of random


variables are specified with Greek letters:
µ: Expected value
σ²: Variance

while the characteristic values of samples are written as


follows:
x̄: Average (sample mean) and
s²: (Sample) Variance.

values which are estimates for characteristic values of a random variable carry a caret:
µ̂ and σ̂²

Sample mean as estimator of population mean
how many smokers are there in Zurich?
random variable ’Züri-Smoker’, i.e. our population: people
living in Zurich
take a representative (i.e. random) sample
big enough (not just 10)
not biased (e.g. not only students)
this is not so easy
1000 people, 200 are smokers
claim: the mean (arithmetic mean, average) of the sample is the mean of the population: 0.2 are smokers

we estimate a parameter of the population (here µ) on the basis of a sample:

µ ≈ x̄ or µ̂ = x̄    (12)
Corrected sample variance as estimator of population variance
the mean is a so-called unbiased (erwartungstreu) estimator
the mean of all means of all n-sized samples is identical to the
mean of the population
we don’t discuss this further, just take it
the sample variance is a biased estimator
s² = (1/n) Σ_i (x̄ − xi)²    (13)

however, the corrected sample variance is unbiased


s² = (1/(n−1)) Σ_i (x̄ − xi)²    (14)

σ² ≈ s² or σ̂² = s²    (15)
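A Python sketch (illustrative) contrasting the two estimators; in the standard library, statistics.pvariance divides by n and statistics.variance divides by n − 1:

```python
import statistics

sample = [2, 4, 4, 6]
mean = statistics.mean(sample)   # sample mean = 4

biased = sum((mean - x) ** 2 for x in sample) / len(sample)           # 1/n
corrected = sum((mean - x) ** 2 for x in sample) / (len(sample) - 1)  # 1/(n-1)

print(biased, corrected)  # 2.0 2.6666666666666665
print(statistics.pvariance(sample), statistics.variance(sample))  # 2 2.6666666666666665
```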
Parameter estimation

random variable view                        sample view

µ = E(X) = Σ_i xi pi                  ≈     x̄ = (Σ_i xi) / n                   (16)

σ² = V(X) = Σ_i (µ − xi)² pi          ≈     s² = (1/(n−1)) Σ_i (x̄ − xi)²       (17)

