
Mathematical Foundations of Computational Linguistics

Manfred Klenner & Jannis Vamvas

Department of Computational Linguistics

March 12, 2024

Content

(traditional) language models


example of chain rule
example of MLE
example of independence assumption (version: Markov assumption)
random experiments and the urn model
parameters and estimators
expected value, (corrected) variance, standard deviation
sample mean, sample variance

Chain Rule

chain rule (simple case):

P(A ∩ B) = P(A)P(B|A) or P(B)P(A|B) (1)

i.e. the joint probability for A and B is the product of the prior P of A and the conditional P of B given A

chain rule (general case):

P(A1 ∩ · · · ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) . . . P(An | ∩_{i=1}^{n−1} Ai)

e.g. P(2d ∩ 3d) = P(2d) · P(3d|2d) = 1/2 · 1/3 = 1/6

Language Models

capture word sequences of a particular language


different model types:
a (probabilistic) grammar, PCFG
probabilistic n-gram models: n-gram = sequence of n words,
e.g. bigrams, trigrams
prominent task: predict the next word
applications: speech recognition, spelling correction, machine
translation
most prominent: large (neural) language models like GPT4

see: Jurafsky and Martin, chapter 3, https://web.stanford.edu/~jurafsky/slp3/

N-grams

An n-gram = a sequence of n (e.g. word) tokens


- n=1: unigram: ’I’, ’am’
- n=2: bigram: ’I am’, ’Sam I’
- n=3: trigram: ’I am Sam’, ’Sam I am’
- n=4: 4-gram: ’I am Sam and’
- ...
n-grams have probabilities
how probable is: ’I am Sam’?
what is the most probable next word given: ’I am’?
is ’der das die’ a plausible sequence in German?
PoS tag n-grams: ’ART NN’, ’ART ADJA NN’, ...
task: estimation of n-gram probabilities on the basis of MLE

N-gram models

a (traditional) n-gram model is a probability distribution over word sequences


it is called a language model
sequence notation: w1 ... wn but also w1:n
current language models: neural language models
recent versions of language models use word embeddings

Maximum Likelihood Estimation (MLE)
MLE for the estimation of the prior probability of a single
word (unigram): C = frequency count
P(wi) = frequency of the word (token) / frequency of all words (tokens) = C(wi) / Σ_j C(wj)

a word wi with frequency of 250 in a corpus of 30000 words:


P(wi) = 250 / 30000

probability of n-gram: P(w1 , . . . , wn )


for large n: can we estimate the probability reliably?
e.g. P(’all of a sudden I saw the guy standing on the sidewalk’)
problem: sparseness, even in a large corpus (as with Naive Bayes)
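A minimal Python sketch of this unigram MLE (the toy corpus and function name are illustrative, not from the slides):

```python
from collections import Counter

# toy corpus; in practice this would be a large tokenized text
tokens = "I am Sam Sam I am I do not like green eggs and ham".split()

counts = Counter(tokens)        # C(w) for every word type
total = sum(counts.values())    # Σ_j C(wj), i.e. the corpus size

def unigram_mle(word):
    """Relative frequency estimate P(w) = C(w) / Σ_j C(wj)."""
    return counts[word] / total

print(unigram_mle("I"))    # 3/14 ≈ 0.21
print(unigram_mle("Sam"))  # 2/14 ≈ 0.14
```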
Chain rule and Markov assumption

P(w1:n) = P(w1) P(w2|w1) P(w3|w1 w2) . . . P(wn|w1:n−1)    (2)

        = ∏_{k=1}^{n} P(wk|w1:k−1)    (3)

Markov assumption: an item (word) depends only on its immediate predecessor

∏_{k=1}^{n} P(wk|w1:k−1) ≈ ∏_{k=1}^{n} P(wk|wk−1)    (4)

i.e.
P(the|Walden Pond’s water is so transparent that) ≈ P(the|that)
Markov applied and MLE estimation

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

we use artificial start and end symbols, ⟨s⟩ and ⟨/s⟩
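A small Python sketch (illustrative, not part of the slides) that estimates bigram probabilities from exactly this three-sentence corpus:

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE bigram probability P(word|prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))     # 2/3
print(p("Sam", "am"))    # 1/2
print(p("</s>", "Sam"))  # 1/2
```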

MLE n-gram estimation: formulas

P(wn|wn−1) = C(wn−1 wn) / Σ_w C(wn−1 w)

the number of bigrams starting with a word is identical to the frequency of the (first) word, so:

P(wn|wn−1) = C(wn−1 wn) / C(wn−1)

general formula:

P(wn|wn−N+1:n−1) = C(wn−N+1:n−1 wn) / C(wn−N+1:n−1)

where N is the n-gram size and n is the (changing) position of the target word wn

e.g. N = 3, n = 5 : P(w5|w5−3+1:5−1) = P(w5|w3:4)
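A generic version of this formula as a Python sketch (function and variable names are assumptions for illustration; no smoothing, no handling of unseen histories):

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """MLE estimate P(word | context) = C(context word) / C(context),
    where context is a tuple of the N-1 preceding tokens."""
    n = len(context) + 1
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    histories = Counter(zip(*[tokens[i:] for i in range(n - 1)]))
    return ngrams[tuple(context) + (word,)] / histories[tuple(context)]

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
print(ngram_prob(tokens, ("I",), "am"))        # bigram estimate: 2/2 = 1.0
print(ngram_prob(tokens, ("<s>", "I"), "am"))  # trigram estimate: 1/1 = 1.0
```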


Example: I want English food

frequencies of bigrams from the Berkeley Restaurant Project (e.g. ’I want’: 827)

we turn this into probabilities

Example, cont.

MLE example: P(want|i) = 827 / 2533 = 0.326 ≈ 0.33

Probability of the example sentence: P(⟨s⟩ i want english food ⟨/s⟩) = P(i|⟨s⟩) · P(want|i) · P(english|want) · P(food|english) · P(⟨/s⟩|food)

Log space to avoid numerical underflow
to avoid numerical underflow, we compute the score in log space
log (P(⟨s⟩ i want english food ⟨/s⟩))

= log (P(i|⟨s⟩)) + log (P(want|i)) + . . . + log (P(⟨/s⟩|food))

approximately:

= −1.386294 + −1.108662 + −6.81244 + −0.693147 + −0.385662 = −10.3862

and, if needed, turn it back into a probability:

= exp(−1.386294+−1.108662+−6.81244+−0.693147+−0.385662) = 3.0855e−05
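A quick Python check of this computation (the individual log probabilities are the ones given above):

```python
import math

# bigram log probabilities of the example sentence (natural logs)
log_probs = [-1.386294, -1.108662, -6.81244, -0.693147, -0.385662]

log_p = sum(log_probs)   # add in log space instead of multiplying probabilities
print(log_p)             # ≈ -10.386205
print(math.exp(log_p))   # ≈ 3.0855e-05, the probability of the sentence
```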

Examples: 2-gram

the nice film


the nice movie
a nice movie
some nice love film
the fantastic action movie

p(nice|the) = C(the nice)/C(the ?) = 2/3

p(nice|the) = C(the nice)/C(the) = 2/3

Examples: 3-gram

the nice film


the nice movie
a nice movie
really , a nice film

w=movie, i.e. w3,w5=movie

p(movie|a nice) = C(a nice movie)/c(a nice ?) = 2/3

note: 3-gram counts of a word are not equal to the unigram counts (as they are with bigrams)

⟨s⟩ movie title is: ’Transformer’

Random Experiments and the Urn Model

Every random experiment can be regarded as an urn problem


random experiments:
→ single: randomly draw a single ball
→ multiple (multi-step): draw a ball 10 times
multiple:
→ dependent (the ball is not returned to the urn)
→ independent (the ball is returned to the urn); see the sketch below
trial:
→ ordered: ordering is of interest
→ unordered: ordering is not of interest
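A small Python sketch (not from the slides) contrasting dependent and independent multi-step draws:

```python
import random

urn = ["red"] * 3 + ["blue"] * 2   # a toy urn with 5 balls

# dependent draws: the ball is NOT returned (sampling without replacement)
without_replacement = random.sample(urn, k=3)

# independent draws: the ball IS returned (sampling with replacement)
with_replacement = random.choices(urn, k=3)

print(without_replacement)  # e.g. ['red', 'blue', 'red']
print(with_replacement)     # e.g. ['blue', 'blue', 'blue'] is possible here
```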

Multi-step experiments

toss a (fair) coin twice


probability of:
→ two heads (11)?
→ P(head on a single toss) = 1/2
multiplication: 1/2 * 1/2 = 1/4
alternatively using Ω
→ Ω = {00, 01, 10, 11}
→ equally likely outcomes (fair coin)
→ thus P(11) = 1/4

Multi-step experiments

addition as well?
→ A = toss twice, mixed result
→ A = {01, 10}, P(A) = 1/4 + 1/4 = 1/2
→ add up paths of the tree

Distribution (Verteilung)

Random variable X : ’toss a coin 3 times’; the gain for each outcome ω ∈ Ω is its digit sum

events ω        000  001  010  100  011  101  110  111
outcome X(ω)     0    1    1    1    2    2    2    3
P               1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8

The random variable gives us for each event its value, e.g. X(110) = 2. The values returned by X are {0, 1, 2, 3}. The probabilities of these values are called their distribution:

gain x = X(ω)     0    1    2    3
p(X = x) = p(x)  1/8  3/8  3/8  1/8

note: the values of X are not necessarily equally probable (unlike the elementary events ω)
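A short Python sketch (illustrative, not from the slides) that derives exactly this distribution by enumerating the 8 equally likely outcomes:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product("01", repeat=3))   # the 8 outcomes 000 ... 111 as tuples of '0'/'1'
digit_sums = Counter(sum(int(b) for b in w) for w in outcomes)

# distribution of X = digit sum: P(X = x) = #outcomes with that sum / 8
dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(digit_sums.items())}
print(dist)  # {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
```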

Expected Value (Erwartungswert)

The weighted mean of a random variable is called its Expected Value
xi the outcomes of X , pi the probability of xi , n the number of outcomes of X

E(X) = µ = Σ_{i=1}^{n} xi pi = Σ_i xi p(X = xi)    (5)

Expected Value: example

digit-sum-coin-twice: Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}

outcomes (digit sum)= {0, 1, 2}

x1 = 0, x2 = 1, x3 = 2

p1 = 1/4, p2 = 1/2, p3 = 1/4

Expected Value (Erwartungswert)

X : Ω → digit sum

xi = X(ω)    0    1    2
pi          1/4  1/2  1/4

E(X) = Σ_i xi pi = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1
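The same computation in Python (an illustrative snippet, values taken from the table above):

```python
from fractions import Fraction

values = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

expected = sum(x * p for x, p in zip(values, probs))
print(expected)  # 1
```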

Expected Value (Erwartungswert), cont.

if we take the values of a random variable as gain/loss, the expected value then tells us the mean gain (over a number of trials)
E(X) of a fair die = 3.5
given n trials, we can expect a total outcome of about n · E(X)
e.g. if we throw the die 100 times, we will get approximately 350 pips in total

Expected value, mean, average, median, mode

random variable has an expected value = (weighted) mean


a fair die: 3.5
samples have a mean which is the average (arithmetic mean)
given X = {2, 4, 4, 6} : Σ_i xi / |X| = 16/4 = 4
samples have a median: the center of the values (’middle’)
median of X: 4
median of even-sized sets: average of the two center values
median of odd-sized sets: {2, 3, 4, 5, 6} : 4
samples have a mode: value that occurs most often
mode of X: 4
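These sample statistics for X = {2, 4, 4, 6} can be checked with Python's statistics module (an illustrative sketch, not part of the slides):

```python
import statistics

X = [2, 4, 4, 6]

print(statistics.mean(X))    # 4    (arithmetic mean)
print(statistics.median(X))  # 4.0  (average of the two center values, 4 and 4)
print(statistics.mode(X))    # 4    (most frequent value)
```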

E (X ) of a binary random variable

the expected gain is p, where p is the probability of success (i.e. toss head, i.e. ’1’)

fair coin: E (X ) = 1/2 ∗ 1 + 1/2 ∗ 0 = 1/2.

example: 100 times (fair coin) = 100 · 1/2 = 50.

that is: if we toss it 100 times, we expect to get about 50 heads

Deviation from the expected value

the expected value can also be a measure of the probability of


certain events: the closer a sample is to the expected value,
the more likely it is (intuitively)

question: how big a deviation from the expected value is


allowed before we doubt fairness?

Variance and standard deviation

the expected values of various random variables may be identical
to differentiate further, the deviation from the expected value, the variance, is defined
two random variables X , Y and their distribution

x      −1   0    1
p(x)   1/4  1/2  1/4        (6)

y      −4   −2   0     2    4
p(y)   1/4  1/5  1/10  1/5  1/4        (7)

→ both variables have mean 0.

Variance and standard deviation, cont

the expected value tells us where the center of a distribution


lies, the variance tells us how strongly the distribution is
concentrated around its center.

one uses the expected value of (X − µ)², with µ = E(X), as a measure of the variance:

V(X) = σ²    (8)

the square root of V(X) is called the standard deviation:

σ = √V(X)    (9)

Variance and standard deviation, cont

notation:
xi the outcomes of X , pi the probability of xi , n the number
of outcomes of X
σ² = V(X) = E((X − µ)²) = Σ_{i=1}^{n} (xi − µ)² pi    (10)

another (mathematically equivalent) formula for the variance:

σ² = V(X) = E(X²) − µ²    (11)

Variance and standard deviation, cont

for the random variables X (see eq. 6) and Y (see eq. 7) with x ∈ ΩX , y ∈ ΩY :

x    −1   0    1
p    1/4  1/2  1/4

E(X²) = (−1)² · 1/4 + 0² · 1/2 + 1² · 1/4 = 0.5

(the µ² term is dropped since µ = 0, so V(X) = E(X²) = 0.5)

Variance and standard deviation, cont

y    −4   −2   0     2    4
p    1/4  1/5  1/10  1/5  1/4

E(Y²) = (−4)² · 1/4 + (−2)² · 1/5 + 2² · 1/5 + 4² · 1/4 = 9.6

for dichotomic (binary) random variables σ² = pq and thus σ = √(pq), where q = 1 − p.

X = toss a coin with p = 0.6 : V(X) = 0.6 · 0.4 = 0.24
compared to the standard formula: p = 0.6, µ = 0.6 :
V(X) = 0.6 · (0.6 − 1)² + 0.4 · (0.6 − 0)² = 0.24
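A Python sketch (illustrative) that checks both variance formulas for the distributions X and Y above; exact fractions are used to avoid rounding noise:

```python
from fractions import Fraction as F

def variance(values, probs):
    """V(X) = E((X - mu)^2) = sum_i (x_i - mu)^2 * p_i."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def variance_alt(values, probs):
    """Equivalent form V(X) = E(X^2) - mu^2."""
    mu = sum(x * p for x, p in zip(values, probs))
    return sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2

x_vals, x_probs = [-1, 0, 1], [F(1, 4), F(1, 2), F(1, 4)]
y_vals, y_probs = [-4, -2, 0, 2, 4], [F(1, 4), F(1, 5), F(1, 10), F(1, 5), F(1, 4)]

print(variance(x_vals, x_probs), variance_alt(x_vals, x_probs))  # 1/2 1/2
print(variance(y_vals, y_probs), variance_alt(y_vals, y_probs))  # 48/5 48/5 (= 9.6)
```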

Terminology

sampling: drawing samples from a population (Grundgesamtheit)

objective: drawing representative samples


i.e. distribution of the individual classes (events) in the sample
should be equal or very similar to the distribution in the
population
the estimation of the parameters of the population can then be
done using the sample

Statistics and probability theory

probability theory is pure mathematics


calculation with given probabilities
statistics: inferential and descriptive statistics
inferential statistics: infer the parameters of a distribution from a sample
descriptive statistics: representation and description of samples by means of tables, graphics, etc.

Notation conventions

expected value, variance, i.e. characteristic values of random


variables are specified with Greek letters:
µ: Expected value
σ²: Variance

while the characteristic values of samples are written as


follows:
x̄: Average (sample mean) and
s²: (Sample) Variance.

values which are estimates for characteristic values of a random variable carry a caret:
µ̂ and σ̂²

Sample mean as estimator of population mean
how many smokers are there in Zurich?
random variable ’Züri-Smoker’, i.e. our population: people
living in Zurich
take a representative (i.e. random) sample
big enough (not just 10)
not biased (e.g. not only students)
this is not so easy
1000 people, 200 are smokers
claim: the mean (arithmetic mean, average) of the sample is the mean of the population: 0.2 are smokers

we estimate a parameter of the population (here µ) on the basis of a sample:

µ ≈ x̄ or µ̂ = x̄    (12)
Corrected sample variance as estimator of population variance
the mean is a so-called unbiased (erwartungstreu) estimator
the mean of all means of all n-sized samples is identical to the
mean of the population
we don’t discuss this further, just take it
the sample variance is a biased estimator
s² = (1/n) Σ_i (x̄ − xi)²    (13)

however, the corrected sample variance is unbiased


s² = (1/(n−1)) Σ_i (x̄ − xi)²    (14)

σ² ≈ s² or σ̂² = s²    (15)
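A Python sketch (illustrative) contrasting the two estimators; in the standard library, statistics.pvariance divides by n and statistics.variance divides by n − 1:

```python
import statistics

sample = [2, 4, 4, 6]
mean = statistics.mean(sample)   # sample mean = 4

biased = sum((mean - x) ** 2 for x in sample) / len(sample)           # 1/n
corrected = sum((mean - x) ** 2 for x in sample) / (len(sample) - 1)  # 1/(n-1)

print(biased, corrected)  # 2.0 2.6666666666666665
print(statistics.pvariance(sample), statistics.variance(sample))  # 2 2.6666666666666665
```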
Parameter estimation

random variable view                        sample view

µ = E(X) = Σ_i xi pi                  ≈     x̄ = (Σ_i xi) / n                   (16)

σ² = V(X) = Σ_i (µ − xi)² pi          ≈     s² = (1/(n−1)) Σ_i (x̄ − xi)²       (17)

