
2.6. Probability and Statistics

• Probability: reasoning under uncertainty (given a probabilistic model of a process, we can reason about the likelihood of different events).

• Statistics: the study of data: collecting, analyzing, interpreting, and drawing conclusions from datasets. It often involves inferring unknown properties of a population from a sample.
2.6.1. Example: Tossing coins
• Suppose the coin is fair (P(heads) = 0.5); we can simulate multiple draws with a multinomial sampling function (see the sketch below).
• Each time you run this sampling process, you get a different result.
• As the number of samples grows, the sample estimates converge to the true underlying probabilities (law of large numbers); the central limit theorem describes the rate of this convergence.
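A minimal sketch of this simulation, assuming NumPy's multinomial sampler (the slide does not name a specific library):

```python
import numpy as np

# Draw n fair coin tosses at once and compare the empirical frequencies
# with the true probabilities [0.5, 0.5].
fair_probs = [0.5, 0.5]                 # P(heads), P(tails)

for n in [10, 100, 10_000]:
    counts = np.random.multinomial(n, fair_probs)   # e.g. array([6, 4]) for n = 10
    print(n, counts / n)                # the estimates approach [0.5, 0.5] as n grows
```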
2.6.2. Formal notations
• Set of possible outcomes (sample space): S = {heads, tails} if the task is tossing a coin.
• If we're tossing 2 coins: S = {(heads, heads), (heads, tails), (tails, heads), (tails, tails)}
+ Example: rolling a die: S = {1, 2, 3, 4, 5, 6}
• Given a random variable X, P(X = v) denotes the probability of X taking value v.
• Similarly, P(1 ≤ X ≤ 3) denotes the probability of the event {1 ≤ X ≤ 3}.
• A probability function 𝑃 maps events onto real values:
𝑃: 𝐴 ⊆ 𝑆 → [0,1]
• The probability, denoted 𝑃(𝐴), of an event 𝐴 in sample space 𝑆 has the
following properties:
1. The probability of any event A is a real non-negative number: P(A) ≥ 0
2. The probability of the entire sample space is 1: P(S) = 1
3. For any sequence of mutually exclusive events A_1, A_2, … (A_i ∩ A_j = ∅ for all i ≠ j), the probability that any of them happens equals the sum of their individual probabilities:

P(∪_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i)
2.6.3. Random variables

• Two types: discrete and continuous


• Example:
+ X is the number rolled on a die (discrete)
+ Y is the height of a person sampled at random from a population (continuous)
Let X be the exact amount of rain tomorrow: P(X = 2) = ?
For a continuous variable, the probability of any single exact value is essentially zero, so we describe X by a probability density function p(x), with P(X ≤ a) = ∫_{−∞}^{a} p(x) dx.

Example: P(X ≤ 2) = ∫_{0}^{2} p(x) dx (the density is zero for negative rainfall)
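A minimal sketch of this computation, using an exponential density p(x) = e^(−x) for x ≥ 0 chosen only for illustration (the slide does not specify a density), and NumPy for the numerical integral:

```python
import numpy as np

# Hypothetical density p(x) = exp(-x) for x >= 0; compute P(X <= 2) as the integral of p over [0, 2].
x = np.linspace(0, 2, 10_001)
p = np.exp(-x)

prob = np.trapz(p, x)                   # numerical integral of p(x) from 0 to 2
print(prob, 1 - np.exp(-2))             # both ≈ 0.8647
```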
2.6.4. Multiple random variables
• Joint probability P(A = a, B = b) denotes the probability of the events A = a and B = b happening at the same time:
P(A = a, B = b) ≤ P(A = a)
P(A = a, B = b) ≤ P(B = b)
+ To get the marginal P(A = a), sum P(A = a, B = v) over all values v that the random variable B can take (see the sketch below):
P(A = a) = ∑_v P(A = a, B = v)
• Conditional probability P(A = a | B = b) denotes the probability of the event A = a given that B = b has occurred:
P(A = a | B = b) = P(A = a, B = b) / P(B = b)
+ For 2 disjoint events 𝐵 and 𝐵’ : 𝑃(𝐵 ∪ 𝐵′|𝐴 = 𝑎) = 𝑃(𝐵|𝐴 = 𝑎) + 𝑃(𝐵’|𝐴 = 𝑎)
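A minimal sketch of marginalization and conditioning on a small joint table (the joint values are made up for illustration), assuming NumPy:

```python
import numpy as np

# Made-up joint table P(A = a, B = b): rows index a, columns index b; entries sum to 1.
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_A = P_AB.sum(axis=1)                  # marginal: P(A = a) = sum_v P(A = a, B = v)
P_B = P_AB.sum(axis=0)                  # marginal over A

P_A_given_B0 = P_AB[:, 0] / P_B[0]      # conditional: P(A = a | B = 0)
print(P_A, P_B, P_A_given_B0)
```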
Bayes theorem
• With the conditional probability equation, we have:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)

→ P(A | B) = P(B | A) P(A) / P(B)

P(A|B): posterior
P(B|A): likelihood
P(A): prior
P(B): evidence
• Example: if we know the prevalence of symptoms for a disease, we can determine how likely it is that someone has the disease based on their symptoms.
• In case we don’t have access to 𝑃(𝐵), a simpler version of Bayes
theorem can be used:
𝑃(𝐴|𝐵) ∝ 𝑃 𝐵 𝐴 𝑃(𝐴)
• Since P(A|B) must be normalized to 1, meaning ∑_a P(A = a | B) = 1, we also have:

P(A | B) = P(B | A) P(A) / ∑_a P(B | A = a) P(A = a)

where ∑_a P(B | A = a) P(A = a) = ∑_a P(B, A = a) = P(B)
Independence
• Random variables A and B are independent if changing the value of A does not change the probability distribution of B, and vice versa.
A, B are independent (A ⊥ B)
→ P(A | B) = P(A) → P(A, B) = P(A | B) P(B) = P(A) P(B)

• Conditional independence: random variables A and B are conditionally independent given a third variable C iff P(A, B | C) = P(A | C) P(B | C)
• Example: broken bones and cancer are independent if we consider the whole population.
However, if we condition on being in a hospital, broken bones are negatively correlated
with having cancer.
Example: a doctor administers an HIV test to a patient. D1 = 1 means the test is positive and D1 = 0 means it is negative; H is the HIV status of the patient. Assume P(H = 1) = 0.0015, and that the test has P(D1 = 1 | H = 1) = 1 and P(D1 = 1 | H = 0) = 0.01.

P(H = 1 | D1 = 1) = ?

P(D1 = 1) = P(D1 = 1, H = 0) + P(D1 = 1, H = 1)
= P(D1 = 1 | H = 0) P(H = 0) + P(D1 = 1 | H = 1) P(H = 1)
= 0.01 × 0.9985 + 1 × 0.0015
= 0.011485

Using Bayes' rule:

→ P(H = 1 | D1 = 1) = P(D1 = 1 | H = 1) P(H = 1) / P(D1 = 1) = 0.0015 / 0.011485 ≈ 0.1306

→ There is only about a 13% chance that the patient has HIV given a positive result, even though the test is very accurate.
This is counter-intuitive.
• A second test is not as accurate as the first one: P(D2 = 1 | H = 1) = 0.98 and P(D2 = 1 | H = 0) = 0.03.

P(D2 = 1) = 0.98 × 0.0015 + 0.03 × 0.9985 ≈ 0.0314

P(H = 1 | D2 = 1) = (0.98 × 0.0015) / 0.0314 ≈ 0.0468

The second test also came out positive, but on its own it gives only a 4.68% probability of having HIV.
Assuming conditional independence of test 1 and test 2 given H, we have:

P(D1 = 1, D2 = 1 | H = 0) = P(D1 = 1 | H = 0) P(D2 = 1 | H = 0) = 0.0003

P(D1 = 1, D2 = 1 | H = 1) = P(D1 = 1 | H = 1) P(D2 = 1 | H = 1) = 0.98

P(D1 = 1, D2 = 1)
= P(D1 = 1, D2 = 1, H = 0) + P(D1 = 1, D2 = 1, H = 1)
= P(D1 = 1, D2 = 1 | H = 0) P(H = 0) + P(D1 = 1, D2 = 1 | H = 1) P(H = 1)
≈ 0.00177

P(H = 1 | D1 = 1, D2 = 1) = P(D1 = 1, D2 = 1 | H = 1) P(H = 1) / P(D1 = 1, D2 = 1) ≈ 0.8307
The second test significantly improved the estimate when combined with the first one
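A minimal sketch in plain Python reproducing the numbers above from the stated prior and test characteristics:

```python
# Posterior probability of HIV after one and two positive tests (Bayes' rule).
p_h1 = 0.0015                           # prior P(H = 1)
p_h0 = 1 - p_h1                         # P(H = 0) = 0.9985

p_d1_h1, p_d1_h0 = 1.00, 0.01           # P(D1 = 1 | H = 1), P(D1 = 1 | H = 0)
p_d2_h1, p_d2_h0 = 0.98, 0.03           # P(D2 = 1 | H = 1), P(D2 = 1 | H = 0)

# One positive test
p_d1 = p_d1_h1 * p_h1 + p_d1_h0 * p_h0
print(p_d1_h1 * p_h1 / p_d1)            # ≈ 0.1306

# Two positive tests, assuming conditional independence given H
p_d12_h1 = p_d1_h1 * p_d2_h1            # = 0.98
p_d12_h0 = p_d1_h0 * p_d2_h0            # = 0.0003
p_d12 = p_d12_h1 * p_h1 + p_d12_h0 * p_h0
print(p_d12_h1 * p_h1 / p_d12)          # ≈ 0.8307
```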
2.6.6 Expectations
• Expectation of a random variable X is defined as:

E[X] = E_{x~P}[x] = ∑_x x P(X = x)

• For densities, we have E[X] = ∫ x p(x) dx
• Expected value of a function f(x):

E_{x~P}[f(x)] = ∑_x f(x) P(x) = ∫ f(x) p(x) dx
Variance

Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

• The variance of a function of a random variable:

Var_{x~P}[f(x)] = E_{x~P}[f²(x)] − E_{x~P}[f(x)]²


• Standard deviation:
σ = √Var(X)
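A minimal sketch (assuming NumPy) that estimates E[X], Var(X) and σ for a fair die from samples; the exact values are E[X] = 3.5 and Var(X) = 35/12 ≈ 2.92:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)    # 100,000 rolls of a fair die

print(x.mean())                         # ≈ 3.5   (estimate of E[X])
print(x.var())                          # ≈ 2.92  (estimate of E[X^2] - E[X]^2)
print(x.std())                          # ≈ 1.71  (estimate of the standard deviation)
```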
Expectation and variance of a vector
Apply the formulas elementwise:

μ ≝ E_{x~P}[x]

μ has coordinates μ_i = E_{x~P}[x_i]

Covariance matrix:
Σ ≝ Cov_{x~P}[x] = E_{x~P}[(x − μ)(x − μ)ᵀ]

Let v be a vector of the same size as x. Then

vᵀ Σ v = E_{x~P}[vᵀ (x − μ)(x − μ)ᵀ v] = Var_{x~P}[vᵀ x]

Σ allows us to compute the variance of any linear function of x with a matrix multiplication. The off-diagonal entries are the covariances between coordinates: a value near 0 means the two coordinates are (close to) uncorrelated, while a large positive value means they are strongly positively correlated (see the sketch below).
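A minimal sketch (assuming NumPy, with a made-up 2-D Gaussian for illustration) checking that vᵀΣv matches Var(vᵀx) on samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]],
                            size=100_000)       # shape (n, 2)

Sigma = np.cov(x, rowvar=False)         # empirical covariance matrix (2 x 2)
v = np.array([1.0, -2.0])

print(v @ Sigma @ v)                    # v^T Σ v
print(np.var(x @ v, ddof=1))            # Var(v^T x); agrees up to sampling noise
```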
Maximum likelihood
• Suppose we have a model with parameters θ and data samples X; we want to find the most likely value of the parameters: argmax_θ P(θ | X)

Using Bayes' rule:

P(θ | X) = P(X | θ) P(θ) / P(X)

• P(X) does not depend on θ, and P(θ) is assumed constant (uninformative prior), so maximizing P(θ | X) amounts to maximizing P(X | θ).

• The probability of the data given the parameters, P(X | θ), is called the likelihood.
Numerical Optimization and Negative log-likelihood
• Instead of finding argmax_θ P(X | θ) we can find argmax_θ log P(X | θ), since log(x) is a monotonically increasing function:

argmax_θ log P(X | θ) = argmin_θ (− log P(X | θ))

• In information theory, entropy is the amount of randomness in a random variable.

• If we take the negative log-likelihood and divide by the number of samples n, we get the cross-entropy (a common way to measure classification performance).
• Due to independence assumptions, most probabilities we see in ML are products of individual probabilities:

P(X | θ) = ∏_{i=1}^{n} p(x_i | θ)
• Using the product rule to compute the derivative:

∂/∂θ ∏_{i=1}^{n} p(x_i | θ) = ∑_{i=1}^{n} ( ∂p(x_i | θ)/∂θ · ∏_{j≠i} p(x_j | θ) )

• This needs on the order of n(n − 1) multiplications, so it is roughly quadratic in the number of inputs (inefficient).
• Instead we can use the negative log-likelihood:

− log P(X | θ) = − ∑_{i=1}^{n} log p(x_i | θ)

• Computing its derivative:

∂/∂θ (− log P(X | θ)) = − ∑_{i=1}^{n} (1 / p(x_i | θ)) · ∂p(x_i | θ)/∂θ

• This needs n divisions and n sums → linear time (see the sketch below).
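A minimal sketch of this linear-time computation, using a Bernoulli (coin-flip) model with parameter θ chosen only for illustration; the data below are made up:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical observed coin flips
theta = 0.6                              # candidate value of P(heads)

# The NLL and its derivative are each a single sum over the n samples.
nll = -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
dnll_dtheta = -np.sum(x / theta - (1 - x) / (1 - theta))
print(nll, dnll_dtheta)
```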


Example
• Given X = {x_i}_{i=1}^{n}, a random sample from an exponential distribution with parameter λ > 0, which has the p.d.f.:

p(x) = λ e^{−λx}

The likelihood is:

L(X | λ) = ∏_{i=1}^{n} p(x_i | λ) = ∏_{i=1}^{n} λ e^{−λ x_i} = λ^n e^{−λ ∑_{i=1}^{n} x_i}
We want to find the maximum likelihood estimate. Take the log-likelihood, differentiate with respect to λ, and set the derivative to 0:

log L(X | λ) = n log λ − λ ∑_{i=1}^{n} x_i

d/dλ log L(X | λ) = n/λ − ∑_{i=1}^{n} x_i = 0

The maximum likelihood estimate is λ̂ = n / ∑_{i=1}^{n} x_i.
• Given X = [2.7, 4.9, 0.2, 4.9, 4.4, 18.7, 1.5, 0.9, 10.5, 1.3] following an exponential distribution (n = 10)

→ The MLE is λ̂ = n / ∑ x_i = 10 / 50.0 = 0.2
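A minimal sketch (assuming NumPy) computing this estimate on the data above:

```python
import numpy as np

# lambda_hat = n / sum(x_i) = 1 / mean(x_i)
X = np.array([2.7, 4.9, 0.2, 4.9, 4.4, 18.7, 1.5, 0.9, 10.5, 1.3])
lam_hat = len(X) / X.sum()
print(lam_hat)                          # 10 / 50.0 = 0.2
```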


Maximum likelihood for continuous variables
• For continuous variables, the probability of observing exactly the values x_i is zero, so we compute the probability of each sample falling within a small range ε:

P(X | θ) ≈ ∏_{i=1}^{n} p(x_i | θ) · ε

• Take the negative log of this:

− log P(X | θ) ≈ − ∑_{i=1}^{n} log p(x_i | θ) − n log(ε)

• Again, −n log(ε) does not depend on θ

• We only need to optimize − ∑_{i=1}^{n} log p(x_i | θ), the same objective as before.