Mathematical Foundations of Computational Linguistics: Manfred Klenner & Jannis Vamvas
Content
Chain Rule
P(A_1 \cap \dots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid \bigcap_{i=1}^{n-1} A_i)
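For instance (spelled out here, not part of the original slide), the case n = 3 reads:

P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2)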
Language Models
N-grams
N-gram models
Maximum Likelihood Estimation (MLE)
MLE for the estimation of the prior probability of a single word (unigram), with C = frequency count:

P(w_i) = \frac{\text{frequency of the word (token)}}{\text{frequency of all words (tokens)}} = \frac{C(w_i)}{\sum_j C(w_j)}

P(w_{1:n}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1}) \qquad (3)

i.e., approximating the full history by the previous word:
P(the|Walden Pond’s water is so transparent that) ≈ P(the|that)
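
A minimal sketch (not from the slides; the toy corpus is assumed purely for illustration) of the unigram MLE as relative frequency:

    from collections import Counter

    # toy corpus, assumed for illustration only
    tokens = "i want english food i want food".split()

    counts = Counter(tokens)              # C(w_i) for every word type
    total = sum(counts.values())          # sum_j C(w_j)

    p_unigram = {w: c / total for w, c in counts.items()}
    print(p_unigram["want"])              # 2/7 ≈ 0.29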
Markov applied and MLE estimation
MLE n-gram estimation: formulas
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_w C(w_{n-1} w)}

the number of bigrams starting with w_{n-1} is identical to the frequency of the (first) word w_{n-1}, so this simplifies to:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}

general formula:

P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}
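
A short sketch of the bigram MLE formula (the corpus and the helper p_bigram are assumptions for illustration, not from the slides):

    from collections import Counter

    # toy corpus with sentence markers, assumed for illustration only
    tokens = "<s> i want english food </s> <s> i want food </s>".split()

    unigram = Counter(tokens)                    # C(w_{n-1})
    bigram = Counter(zip(tokens, tokens[1:]))    # C(w_{n-1} w_n), incl. the </s> <s> boundary pair

    def p_bigram(w_prev, w):
        # MLE: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
        return bigram[(w_prev, w)] / unigram[w_prev]

    print(p_bigram("i", "want"))                 # 2/2 = 1.0 in this toy corpus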
Example, cont.
MLE example: P(want|i) = 827/2533 = 0.326 ≈ 0.33
Log space to avoid numerical underflow
to avoid numerical underflow, we compute the score in log space:

\log P(\langle s\rangle\ i\ want\ english\ food\ \langle/s\rangle) \approx -1.386294 - 1.108662 - 6.81244 - 0.693147 - 0.385662

P = \exp(-1.386294 - 1.108662 - 6.81244 - 0.693147 - 0.385662) = 3.0855 \cdot 10^{-5}
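
A sketch of the log-space computation; the five bigram probabilities are assumptions chosen so that their logs match the values above:

    import math

    # assumed bigram probabilities: P(i|<s>), P(want|i), P(english|want), P(food|english), P(</s>|food)
    probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

    log_score = sum(math.log(p) for p in probs)   # sum of log probabilities ≈ -10.386
    print(log_score, math.exp(log_score))         # -10.386..., ≈ 3.09e-05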
Examples: 2-gram
Examples: 3-gram
note: the 3-gram counts of a word are not equal to its unigram counts (unlike with bigrams)
Random Experiments and the Urn Model
Multi-step experiments
addition as well?
→ A = toss twice, mixed result
→ A = {01, 10}, P(A) = 1/4 + 1/4 = 1/2
→ add up the probabilities of the paths of the tree
Distribution (Verteilung)
The random variable gives us the value for each outcome, e.g. X(110) = 2. The values returned by X are {0, 1, 2, 3}. The probabilities of these values are called their distribution:

gain x = X(ω):       0    1    2    3
p(X = x) = p(x):   1/8  3/8  3/8  1/8
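
A small sketch (not on the slide) that recovers this distribution by enumerating all outcomes of three fair coin tosses:

    from collections import Counter
    from itertools import product

    # all outcomes of three fair tosses, e.g. (1, 1, 0) for "110"
    outcomes = list(product([0, 1], repeat=3))

    # X = number of 1s (the "gain") in an outcome
    dist = Counter(sum(omega) for omega in outcomes)

    for x in sorted(dist):
        print(x, dist[x] / len(outcomes))   # 0 0.125, 1 0.375, 2 0.375, 3 0.125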
Expected Value (Erwartungswert)
Expected Value: example
x_1 = 0, x_2 = 1, x_3 = 2
Expected Value (Erwartungswert)
X : Ω → digit sum

x_i = X(ω):   0    1    2
p_i:        1/4  1/2  1/4

E(X) = \sum_i x_i p_i = 0 \cdot 1/4 + 1 \cdot 1/2 + 2 \cdot 1/4 = 1
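
As a quick check (an illustrative sketch, not from the slides), the same expectation computed in code:

    # values x_i and probabilities p_i from the table above
    values = [0, 1, 2]
    probs = [0.25, 0.5, 0.25]

    expected = sum(x * p for x, p in zip(values, probs))   # E(X) = sum_i x_i * p_i
    print(expected)                                        # 1.0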
Expected Value (Erwartungswert), cont.
Expected value, mean, average, median, mode
E(X) of a binary random variable
Deviation from the expected value
Variance and standard deviation
x:      −1    0    1
p(x):  1/4  1/2  1/4        (6)

y:      −4   −2     0    2    4
p(y):  1/4  1/5  1/10  1/5  1/4        (7)
Variance and standard deviation, cont
V(X) = \sigma^2 \qquad (8)
Variance and standard deviation, cont
notation: x_i the outcomes of X, p_i the probability of x_i, n the number of outcomes of X

\sigma^2 = V(X) = E((X - \mu)^2) = \sum_{i=1}^{n} (x_i - \mu)^2 p_i \qquad (10)

\sigma^2 = V(X) = E(X^2) - \mu^2 \qquad (11)
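
A brief sketch (values taken from table (6) above, otherwise an assumption) showing that eqs. (10) and (11) agree:

    # distribution of X from table (6)
    values = [-1, 0, 1]
    probs = [0.25, 0.5, 0.25]

    mu = sum(x * p for x, p in zip(values, probs))                      # E(X) = 0

    var_def = sum((x - mu) ** 2 * p for x, p in zip(values, probs))     # eq. (10)
    var_alt = sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2  # eq. (11)

    print(var_def, var_alt)   # 0.5 0.5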
Variance and standard deviation, cont
for the random variables X (see eq. 4) and Y (see eq. 5) with x ∈ Ω_X, y ∈ Ω_Y:

x:     −1    0    1
p:    1/4  1/2  1/4

(the term \mu^2 = 0 is dropped)
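
The numerical result is not in the extracted text; completing the computation with eq. (11) and the table above gives:

V(X) = E(X^2) - \mu^2 = (-1)^2 \cdot \tfrac{1}{4} + 0^2 \cdot \tfrac{1}{2} + 1^2 \cdot \tfrac{1}{4} = \tfrac{1}{2}, \qquad \sigma = \sqrt{1/2} \approx 0.71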
Variance and standard deviation, cont
y:     −4   −2     0    2    4
p:    1/4  1/5  1/10  1/5  1/4
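
Again the numbers are reconstructed (not in the extracted text), using eq. (11) with \mu = E(Y) = 0:

V(Y) = E(Y^2) - \mu^2 = 16 \cdot \tfrac{1}{4} + 4 \cdot \tfrac{1}{5} + 0 \cdot \tfrac{1}{10} + 4 \cdot \tfrac{1}{5} + 16 \cdot \tfrac{1}{4} = 9.6, \qquad \sigma = \sqrt{9.6} \approx 3.1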
Terminology
Statistics and probability theory
Notation conventions
Sample mean as estimator of population mean
how many smokers are there in Zurich?
random variable 'Züri-Smoker', i.e. our population: people living in Zurich
take a representative (i.e. random) sample
  big enough (not just 10)
  not biased (e.g. not only students)
  this is not so easy
1000 people, 200 are smokers
claim: the mean (arithmetic mean, average) of the sample is the mean of the population: 0.2 are smokers

\sigma^2 \approx s^2 \quad \text{or} \quad \hat{\sigma}^2 = s^2 \qquad (15)
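
A purely illustrative sketch (the simulated population is an assumption) of estimating the population mean by the sample mean:

    import random

    random.seed(0)

    # simulated population: 1 = smoker, 0 = non-smoker (20% smokers assumed)
    population = [1] * 80_000 + [0] * 320_000

    sample = random.sample(population, 1000)     # representative random sample
    sample_mean = sum(sample) / len(sample)      # estimator of the population mean
    print(sample_mean)                           # close to 0.2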
Parameter estimation
\sigma^2 = V(X) = \sum_{i=1}^{n} (\mu - x_i)^2 p_i \approx s^2 = \frac{1}{n-1} \sum_i (\bar{x} - x_i)^2 \qquad (17)
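
A minimal sketch of the n-1 estimator in eq. (17); the sample values are an assumption for illustration:

    # illustrative sample
    sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

    n = len(sample)
    x_bar = sum(sample) / n                                      # sample mean

    # unbiased estimate of the population variance: divide by n - 1
    s_squared = sum((x_bar - x) ** 2 for x in sample) / (n - 1)
    print(x_bar, s_squared)                                      # 5.0 4.571...

The standard library's statistics.variance(sample) computes the same n-1 estimator.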