Probability and Statistics
• Probability density function $p(x)$, with $\int_{-\infty}^{\infty} p(x)\,dx = 1$
• Example: $P(X \le 2) = \int_{0}^{2} p(x)\,dx$ (for an $X$ taking values $x \ge 0$)
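As a rough sketch of how such an integral can be evaluated numerically, assuming a hypothetical exponential density p(x) = exp(−x) on x ≥ 0 (this particular density is an assumption, not from the slides):

```python
import numpy as np

# Hypothetical density: p(x) = exp(-x) for x >= 0 (exponential with rate 1).
x = np.linspace(0.0, 2.0, 100_001)   # grid over the integration range [0, 2]
p = np.exp(-x)                       # density evaluated on the grid

dx = x[1] - x[0]
approx = np.sum(p) * dx              # Riemann-sum approximation of the integral over [0, 2]
print(approx, 1 - np.exp(-2.0))      # both ≈ 0.8647 (closed form shown for comparison)
```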
2.6.4. Multiple random variables
• Joint probability $P(A = a, B = b)$ denotes the probability of the events $A = a$ and $B = b$ happening at the same time:
$P(A = a, B = b) \le P(A = a)$
$P(A = a, B = b) \le P(B = b)$
+ To get $P(A = a)$ (marginalization), sum $P(A = a, B = v)$ over all values $v$ that the random variable $B$ can take:
$P(A = a) = \sum_{v} P(A = a, B = v)$
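A minimal sketch of marginalization on a hypothetical joint table (the numbers and variable names are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical joint table P(A = a, B = b): rows index values of A, columns values of B.
P_AB = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

# Marginalize out B: P(A = a) = Σ_v P(A = a, B = v)
P_A = P_AB.sum(axis=1)
print(P_A)         # [0.4 0.6]
print(P_A.sum())   # ≈ 1.0, a valid distribution
```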
• Conditional probability $P(A = a \mid B = b)$ denotes the probability of the event $A = a$ given that the condition $B = b$ is met:
$P(A = a \mid B = b) = \dfrac{P(A = a, B = b)}{P(B = b)}$
+ For two disjoint events $B$ and $B'$: $P(B \cup B' \mid A = a) = P(B \mid A = a) + P(B' \mid A = a)$
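Using the same kind of hypothetical joint table as above, the conditional distribution can be read off by dividing by the marginal of B (a sketch, not the author's code):

```python
import numpy as np

P_AB = np.array([[0.10, 0.20, 0.10],    # hypothetical joint P(A = a, B = b)
                 [0.25, 0.05, 0.30]])

P_B = P_AB.sum(axis=0)                  # marginal P(B = b)
P_A_given_B = P_AB / P_B                # P(A = a | B = b) = P(A = a, B = b) / P(B = b)

print(P_A_given_B)
print(P_A_given_B.sum(axis=0))          # every column sums to 1: a distribution over A for each b
```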
Bayes' theorem
• With the conditional probability equation, we have:
$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
$\Rightarrow P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
P(A|B): posterior
P(B|A): likelihood
P(A): prior
P(B): evidence
• Example: if we know the prevalence of symptoms given a disease, we can determine how likely it is that someone has the disease based on their symptoms.
• In case we don't have access to $P(B)$, a simpler version of Bayes' theorem can be used:
$P(A \mid B) \propto P(B \mid A)\,P(A)$
• Since $P(A \mid B)$ must be normalized to 1, meaning $\sum_{a} P(A = a \mid B) = 1$, we also have:
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{\sum_{a} P(B \mid A = a)\,P(A = a)}$
$\sum_{a} P(B \mid A = a)\,P(A = a) = \sum_{a} P(B, A = a) = P(B)$
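A small numeric sketch of this normalized form of Bayes' theorem, with a hypothetical two-valued prior and likelihood (the numbers are made up for illustration):

```python
import numpy as np

prior = np.array([0.7, 0.3])          # hypothetical P(A = 0), P(A = 1)
likelihood = np.array([0.1, 0.8])     # hypothetical P(B = 1 | A = 0), P(B = 1 | A = 1)

unnormalized = likelihood * prior     # P(B = 1 | A = a) P(A = a), proportional to the posterior
evidence = unnormalized.sum()         # P(B = 1) = Σ_a P(B = 1 | A = a) P(A = a)
posterior = unnormalized / evidence   # normalized posterior P(A = a | B = 1)

print(evidence)                       # 0.31
print(posterior, posterior.sum())     # [0.2258... 0.7742...], sums to 1
```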
Independence
• Random variables $A$ and $B$ are independent if changing the value of $A$ does not change the probability distribution of $B$, and vice versa.
$A, B$ are independent ($A \perp B$)
$\Rightarrow P(A \mid B) = P(A) \Rightarrow P(A, B) = P(A \mid B)\,P(B) = P(A)\,P(B)$
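A quick sketch of what independence means for a joint table: if A ⊥ B, the joint is the outer product of the marginals (hypothetical numbers):

```python
import numpy as np

P_A = np.array([0.4, 0.6])                 # hypothetical marginal of A
P_B = np.array([0.3, 0.7])                 # hypothetical marginal of B
P_AB = np.outer(P_A, P_B)                  # joint built so that P(A, B) = P(A) P(B)

# Independence check: the joint equals the outer product of its own marginals.
print(np.allclose(P_AB, np.outer(P_AB.sum(axis=1), P_AB.sum(axis=0))))   # True
```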
• Example (HIV testing): let $H$ indicate whether a patient has HIV and $D_1$ the outcome of a first diagnostic test.
$P(H = 1 \mid D_1 = 1) = ?$
→ There is only about a 13% chance that the patient has HIV when the first test comes back positive, even though the test is very accurate according to the table.
This is counter-intuitive.
• The second test $D_2$ is not as accurate as the first one.
The second test also came out positive; on its own, a positive second test implies only a 4.68% chance of having HIV.
Assuming tests 1 and 2 are conditionally independent given $H$, we have:
$P(D_1 = 1, D_2 = 1)$
$= P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1)$
$= P(D_1 = 1, D_2 = 1 \mid H = 0)\,P(H = 0) + P(D_1 = 1, D_2 = 1 \mid H = 1)\,P(H = 1)$
$= 0.00177$
$P(H = 1 \mid D_1 = 1, D_2 = 1) = \dfrac{P(D_1 = 1, D_2 = 1 \mid H = 1)\,P(H = 1)}{P(D_1 = 1, D_2 = 1)} = 0.8307$
The second test significantly improved the estimate when combined with the first one
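The calculation can be reproduced in a few lines. The test characteristics below are assumed values consistent with the numbers quoted above; they are not shown in these notes, so treat them as an assumption: P(D1=1|H=1) = 1.00, P(D1=1|H=0) = 0.01, P(H=1) = 0.0015, P(D2=1|H=1) = 0.98, P(D2=1|H=0) = 0.03.

```python
# Assumed test characteristics (see note above); H = 1 means the patient has HIV.
p_h1 = 0.0015
p_d1_h1, p_d1_h0 = 1.00, 0.01
p_d2_h1, p_d2_h0 = 0.98, 0.03

# One positive test: P(H = 1 | D1 = 1) via Bayes' theorem.
p_d1 = p_d1_h1 * p_h1 + p_d1_h0 * (1 - p_h1)
print(p_d1_h1 * p_h1 / p_d1)                 # ≈ 0.1306, the counter-intuitive 13%

# Two positive tests, with D1 and D2 conditionally independent given H.
p_d1d2 = p_d1_h1 * p_d2_h1 * p_h1 + p_d1_h0 * p_d2_h0 * (1 - p_h1)
print(p_d1d2)                                # ≈ 0.00177
print(p_d1_h1 * p_d2_h1 * p_h1 / p_d1d2)     # ≈ 0.8307
```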
2.6.6 Expectations
• The expectation of a random variable $X$ is defined as:
$E[X] = E_{x \sim P}[x] = \sum_{x} x\,P(X = x)$
• For densities, we have $E[X] = \int x\,p(x)\,dx$
• Expected value of some function $f(x)$:
$E_{x \sim P}[f(x)] = \sum_{x} f(x)\,P(x) = \int f(x)\,p(x)\,dx$
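A minimal sketch of both expectations for a hypothetical three-valued discrete distribution:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # hypothetical values of X
P = np.array([0.25, 0.50, 0.25])   # hypothetical probabilities P(X = x)

E_X = np.sum(x * P)                # E[X] = Σ_x x P(X = x)
E_fX = np.sum(x**2 * P)            # E[f(X)] for f(x) = x²
print(E_X, E_fX)                   # 2.0 4.5
```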
Variance
Covariance matrix:
$\boldsymbol{\Sigma} \stackrel{\text{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\!\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{T}\right]$
$\boldsymbol{\Sigma}$ allows us to compute the variance of any linear function of $\mathbf{x}$ with a matrix multiplication: $\mathrm{Var}(\mathbf{a}^{T}\mathbf{x}) = \mathbf{a}^{T}\boldsymbol{\Sigma}\,\mathbf{a}$. The off-diagonal elements show how correlated the coordinates are:
values near 0 mean weak correlation, large positive values mean the coordinates are strongly correlated.
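A sketch with synthetic 2-D data whose second coordinate is constructed to be strongly correlated with the first (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=10_000)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=10_000)   # strongly correlated with x1
X = np.stack([x1, x2])                          # shape (2, n): one row per coordinate

Sigma = np.cov(X)                               # empirical covariance matrix Σ
print(Sigma)                                    # large positive off-diagonal entries

# Variance of a linear function aᵀx via matrix multiplication: Var(aᵀx) = aᵀ Σ a
a = np.array([1.0, -1.0])
print(a @ Sigma @ a)
```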
Maximum likelihood
• Suppose we have a model with parameters $\boldsymbol{\theta}$ and data samples $X$; we want to find the most likely values for the parameters:
$\hat{\boldsymbol{\theta}} = \mathrm{argmax}_{\boldsymbol{\theta}}\, P(X \mid \boldsymbol{\theta})$
The probability of the data given the parameters, $P(X \mid \boldsymbol{\theta})$, is called the likelihood.
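As an illustration (a hypothetical Bernoulli coin-flip model, not from the slides), the likelihood P(X | θ) can simply be evaluated over a grid of parameter values and maximized:

```python
import numpy as np

X = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])   # hypothetical coin flips (1 = heads)
k, n = X.sum(), len(X)

theta = np.linspace(0.01, 0.99, 99)            # candidate parameter values
likelihood = theta**k * (1 - theta)**(n - k)   # P(X | θ) for independent flips

print(theta[np.argmax(likelihood)])            # ≈ 0.7, the sample frequency of heads
```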
Numerical optimization and negative log-likelihood
• Instead of finding $\mathrm{argmax}_{\boldsymbol{\theta}}\, P(X \mid \boldsymbol{\theta})$ we can find $\mathrm{argmax}_{\boldsymbol{\theta}}\, \log P(X \mid \boldsymbol{\theta})$, since $\log(x)$ is a monotonically increasing function:
$\mathrm{argmax}_{\boldsymbol{\theta}}\, \log P(X \mid \boldsymbol{\theta}) = \mathrm{argmin}_{\boldsymbol{\theta}}\, \big(-\log P(X \mid \boldsymbol{\theta})\big)$
• To minimize the negative log-likelihood, compute its derivative with respect to $\boldsymbol{\theta}$ and set it to zero, or follow it with a gradient-based optimizer.
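Continuing the hypothetical coin-flip model from above, a sketch of minimizing the negative log-likelihood using its derivative (plain gradient descent; the closed-form answer θ* = k/n is shown for comparison):

```python
import numpy as np

X = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])   # hypothetical coin flips (1 = heads)
k, n = X.sum(), len(X)

def neg_log_likelihood(theta):
    # -log P(X | θ) = -(k log θ + (n - k) log(1 - θ))
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

def grad(theta):
    # d/dθ of the negative log-likelihood: -k/θ + (n - k)/(1 - θ)
    return -k / theta + (n - k) / (1 - theta)

theta = 0.5                                    # initial guess
for _ in range(500):
    theta -= 0.01 * grad(theta)                # gradient descent on the NLL

print(theta, k / n)                            # both ≈ 0.7
print(neg_log_likelihood(theta))               # NLL at the (numerical) optimum
```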