Introduction
E-mail: [email protected]
Website: https://www.csccm.in/
Important dates
So, if you don’t want to attend extra classes, prepare for the exam. Syllabus – second
chapter of Bishop.
• Machine learning is divided into two main types. In the supervised learning approach, the goal is to learn a mapping from inputs $\boldsymbol{x}$ to outputs $y$, given a labelled set of data $\mathcal{D} = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$.
• In the unsupervised learning approach, we work with unlabelled data; in other words, we are not told what kind of pattern to look for.
• This is a more realistic scenario (from an AI point of view) and is more challenging.
Kaelbling, L., M. Littman, and A. Moore (1996). Reinforcement learning: A survey. Journal of AI Research 4, 237–285.
Sutton, R. and A. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Russell, S. and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
Szepesvari, C. (2010). Algorithms for Reinforcement Learning. Morgan Claypool.
Wiering, M. and M. van Otterlo (Eds.) (2012). Reinforcement Learning: State of the Art.
• This corresponds to the most probable class label and is the mode of the distribution $p(y^* \mid \boldsymbol{x}^*, \mathcal{D})$. This is known as the maximum a posteriori (MAP) estimate.
• Point estimates are often not the best solution: what if $p(y^* = 1 \mid \boldsymbol{x}^*, \mathcal{D})$ is far from 1?
• IBM Watson beat the top human Jeopardy champion thanks to a module that estimates how confident it is in its answer.
• Google's SmartASS (ad selection system) predicts the probability (click-through rate, CTR) that you will click on an ad, based on your search history and other user- and ad-specific features. The CTR can be used to maximize expected profit.
• Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty (2010). Building Watson: An overview of the DeepQA project. AI Magazine, 59–79.
• Metz, C. (2010). Google behavioral ad targeter is a Smart Ass. The Register.
• Cheeseman, P., J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman (1988). AutoClass: A Bayesian classification system. In Proc. of the Fifth Intl. Workshop on Machine Learning.
• Lo, C. H. (2009). Statistical methods for high throughput genomics. Ph.D. thesis, UBC.
• Berkhin, P. (2006). A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, and M. Teboulle (Eds.), Grouping Multidimensional Data: Recent Advances in Clustering, pp. 25–71. Springer.
• Our first goal is to estimate the distribution over the number of clusters, $p(K \mid \mathcal{D})$; this tells us whether there are subpopulations within the data.
• The second objective is to assign each data point to the corresponding cluster (hidden or latent variables),
$$z_i^* = \operatorname{argmax}_k \, p(z_i = k \mid \boldsymbol{x}_i, \mathcal{D})$$
• Picking a model of the right complexity (here, the number of clusters) is called model selection; a minimal sketch is given below.
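The slides reference MatLab/PMTK demos that are not reproduced here. As a rough illustration (not the course code), the following Python sketch uses scikit-learn's GaussianMixture on synthetic data, scoring candidate cluster counts with BIC as a simple stand-in for reasoning about $p(K \mid \mathcal{D})$, and then assigning each point to its most probable cluster.

```python
# Hypothetical sketch (not from the slides): cluster-count selection and
# cluster assignment with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data with two subpopulations.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

# Score candidate numbers of clusters K with BIC (lower is better);
# this is a point estimate of model complexity, not the full p(K | D).
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)}
bic = {k: m.bic(X) for k, m in models.items()}
best_k = min(bic, key=bic.get)

# z_i* = argmax_k p(z_i = k | x_i, D): most probable cluster for each point.
z_star = models[best_k].predict(X)
print("BIC per K:", bic)
print("chosen K:", best_k, "first assignments:", z_star[:5])
```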
Dimensionality Reduction
• Reduce the dimensionality by projecting the data to a lower-dimensional
subspace which captures the essence of the data.
• Latent factors: although the data may appear high-dimensional, there may only
be a small number of degrees of variability.
• Principal Components Analysis (PCA): a common approach to dimensionality reduction. Useful for visualization, nearest-neighbour searches, etc.; a minimal sketch follows below.
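A minimal PCA sketch via the singular value decomposition (a Python/numpy illustration on synthetic data, not the course's MatLab code):

```python
# Project data onto its first k principal components via the SVD.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))          # 200 samples, 5 features
Xc = X - X.mean(axis=0)                # centre the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                      # low-dimensional projection (scores)
explained = S[:k] ** 2 / np.sum(S ** 2)
print("shape of projection:", Z.shape, "variance explained:", explained)
```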
• Let $E$ be a space of elementary events. Consider the power set $2^E$, and let $\mathcal{F} \subseteq 2^E$. Elements of $\mathcal{F}$ are called random events. If $\mathcal{F}$ satisfies the following properties, it is called a $\sigma$-algebra:
• $E \in \mathcal{F}$
• $A, B \in \mathcal{F} \Rightarrow A \setminus B \in \mathcal{F}$
• $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F} \ \wedge \ \bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$
• If $\mathcal{F}$ is a $\sigma$-algebra, then its elements are called measurable sets and $(E, \mathcal{F})$ is called a measurable space or Borel space.
• Second axiom: this is the assumption of unit measure: the probability that at least one of the elementary events in the entire sample space occurs is 1,
$$\mathbb{P}(\Omega) = 1$$
• Third axiom (countable additivity): for any countable collection of pairwise disjoint events $E_1, E_2, \ldots$,
$$\mathbb{P}\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(E_i)$$
• $E \cup E^c = \Omega$
• $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)$
• $\emptyset$ is called the empty set.
• A discrete random variable $X$ can take any value from a finite or countably infinite set $\mathcal{X}$.
• For example, on $\mathcal{X} = \{1, 2, 3, 4\}$, the uniform distribution has $f(x = k) = \frac{1}{4}$, whereas the degenerate distribution $f(x = k) = \mathbb{I}(x = 1)$ puts all its mass on $x = 1$.
• The probability that $X$ takes the value $x_i$ and $Y$ takes the value $y_j$ is written $\mathbb{P}(X = x_i, Y = y_j)$; this is the joint probability of $X = x_i$ and $Y = y_j$,
$$\mathbb{P}(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$$
where $n_{ij}$ is the number of trials (out of $N$) in which $X = x_i$ and $Y = y_j$, with column and row totals $c_i = \sum_j n_{ij}$ and $r_j = \sum_i n_{ij}$.
• Sum rule:
$$\mathbb{P}(X = x_i) = \sum_j \mathbb{P}(X = x_i, Y = y_j) = \frac{\sum_j n_{ij}}{N} = \frac{c_i}{N}$$
For continuous variables, $\mathbb{P}(x) = \int \mathbb{P}(x, y)\, dy$.
• Product rule:
$$\mathbb{P}(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = \mathbb{P}(Y = y_j \mid X = x_i)\, \mathbb{P}(X = x_i)$$
• Chain rule: repeated application of the product rule gives the chain rule,
$$\mathbb{P}(X_1, \ldots, X_N) = \mathbb{P}(X_1)\, \mathbb{P}(X_2 \mid X_1) \cdots \mathbb{P}(X_N \mid X_1, \ldots, X_{N-1})$$
• The most complex calculations in probability are nothing but repeated applications of the sum and product rules (see the sketch below).
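As a quick numerical illustration (a hypothetical joint table, not from the slides), the sum and product rules can be checked directly on a small discrete joint distribution:

```python
# Verify the sum and product rules on a small joint probability table.
import numpy as np

# P[i, j] = P(X = x_i, Y = y_j); rows index X, columns index Y.
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.05, 0.30]])
assert np.isclose(P.sum(), 1.0)

P_x = P.sum(axis=1)                        # sum rule: P(X = x_i) = sum_j P(x_i, y_j)
P_y_given_x = P / P_x[:, None]             # conditional: P(Y = y_j | X = x_i)
reconstructed = P_y_given_x * P_x[:, None]  # product rule: P(x_i, y_j) = P(y_j | x_i) P(x_i)

print("P(X):", P_x)
print("product rule recovers joint:", np.allclose(reconstructed, P))
```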
Conditional probability and Bayes’ rule
$$\mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{\mathbb{P}(X = x)\,\mathbb{P}(Y = y \mid X = x)}{\sum_{x'} \mathbb{P}(X = x', Y = y)}$$
$$\mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x)\,\mathbb{P}(Y = y \mid X = x)}{\sum_{x'} \mathbb{P}(X = x')\,\mathbb{P}(Y = y \mid X = x')}$$
• An example: a fruit ($F$, apple or orange) is drawn from one of two boxes ($B$, red or blue), with
$$\mathbb{P}(B = r) = 0.4, \qquad \mathbb{P}(B = b) = 0.6$$
$$\mathbb{P}(B = r \mid F = o) = \frac{\mathbb{P}(B = r)\,\mathbb{P}(F = o \mid B = r)}{\mathbb{P}(F = o)}$$
$$\mathbb{P}(F = o) = \mathbb{P}(B = r)\,\mathbb{P}(F = o \mid B = r) + \mathbb{P}(B = b)\,\mathbb{P}(F = o \mid B = b) = 0.4 \times \frac{6}{8} + 0.6 \times \frac{1}{4} = 0.45$$
$$\mathbb{P}(B = r \mid F = o) = \frac{0.4 \times 6/8}{0.45} = \frac{2}{3}$$
• Once it is observed that the selected fruit is an orange, the probability that it came from the red box increases from 0.4 to $2/3 \approx 0.67$.
• A test is available but not perfect: if a tested patient has the disease, 80% of the time the test will be positive, $\mathbb{P}(\text{Positive} \mid \text{TB}) = 0.80$. On the contrary, if a tested patient does not have the disease, 90% of the time the result is negative, $\mathbb{P}(\text{Negative} \mid \text{TB}^c) = 0.9$, i.e., $\mathbb{P}(\text{Positive} \mid \text{TB}^c) = 0.1$.
• Base rate fallacy: people assume that they have an 80% chance of having the disease, but they ignore the PRIOR knowledge (here, a prevalence of $\mathbb{P}(\text{TB}) = 0.004$).
$$\mathbb{P}(\text{TB} \mid \text{Positive}) = \frac{\mathbb{P}(\text{TB})\,\mathbb{P}(\text{Positive} \mid \text{TB})}{\mathbb{P}(\text{Positive})}$$
$$\mathbb{P}(\text{Positive}) = \mathbb{P}(\text{TB})\,\mathbb{P}(\text{Positive} \mid \text{TB}) + \mathbb{P}(\text{TB}^c)\,\mathbb{P}(\text{Positive} \mid \text{TB}^c) = 0.004 \times 0.8 + 0.996 \times 0.1 = 0.1028$$
$$\mathbb{P}(\text{TB} \mid \text{Positive}) = \frac{0.004 \times 0.8}{0.1028} \approx 0.031 = 3.1\%$$
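The same base-rate calculation, written out as a short Python check (the numbers are those from the slide):

```python
# Posterior probability of disease given a positive test (base rate fallacy).
p_tb = 0.004                 # prior prevalence
p_pos_given_tb = 0.80        # test sensitivity
p_pos_given_no_tb = 0.10     # false positive rate

p_pos = p_tb * p_pos_given_tb + (1 - p_tb) * p_pos_given_no_tb
p_tb_given_pos = p_tb * p_pos_given_tb / p_pos
print(f"P(Positive) = {p_pos:.4f}, P(TB | Positive) = {p_tb_given_pos:.3f}")
# Prints roughly P(Positive) = 0.1028, P(TB | Positive) = 0.031
```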
• Consider a box containing four balls numbered 1, 2, 3, 4, from which one ball is drawn at random. Now consider the following events:
• Event 1: ball 1 or 2 is drawn
• Event 2: ball 2 or 3 is drawn
• Event 3: ball 1 or 3 is drawn.
• Note that each event has probability 1/2, and
$$\mathbb{P}(E_1, E_2) = \frac{1}{4} = \mathbb{P}(E_1)\,\mathbb{P}(E_2) = \frac{1}{2} \cdot \frac{1}{2}, \qquad \mathbb{P}(E_2, E_3) = \frac{1}{4} = \mathbb{P}(E_2)\,\mathbb{P}(E_3) = \frac{1}{2} \cdot \frac{1}{2}$$
$$\mathbb{P}(E_1, E_3) = \frac{1}{4} = \mathbb{P}(E_1)\,\mathbb{P}(E_3) = \frac{1}{2} \cdot \frac{1}{2}$$
• However,
$$\mathbb{P}(E_1, E_2, E_3) = 0 \neq \mathbb{P}(E_1)\,\mathbb{P}(E_2)\,\mathbb{P}(E_3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$$
• Pairwise independence does not ensure mutual independence (a numerical check is sketched below).
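The four-ball example can be verified by enumeration (a small Python sketch):

```python
# Pairwise independence without mutual independence: enumerate the four equally
# likely outcomes and compute event probabilities directly.
from itertools import combinations

outcomes = {1, 2, 3, 4}                  # each ball drawn with probability 1/4
E = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}}    # the three events

def prob(event):
    return len(event) / len(outcomes)

for i, j in combinations(E, 2):
    joint = prob(E[i] & E[j])
    print(f"P(E{i}, E{j}) = {joint:.2f}  vs  P(E{i})P(E{j}) = {prob(E[i]) * prob(E[j]):.2f}")

triple = prob(E[1] & E[2] & E[3])
print(f"P(E1, E2, E3) = {triple:.2f}  vs  product = {prob(E[1]) * prob(E[2]) * prob(E[3]):.3f}")
```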
[Figure: modelling the joint distribution $\mathbb{P}(x, y)$ directly requires $30 - 1 = 29$ parameters, whereas the factored form $\mathbb{P}(x)\,\mathbb{P}(y)$ under independence requires only 9.]
• Independence is key to EFFICIENT probabilistic modelling (naïve Bayes, Markov models, probabilistic graphical models, etc.).
• We call $x = X(\omega)$, $\omega \in \Omega$, a realization of $X$.
• Probability density:
$$\mathbb{P}_X(B) = \int_B p_X(x)\, dx$$
• Often we write $p_X(x) = p(x)$.
• The CDF of a random variable $X$ is the function $F(z)$ that returns the probability that $X$ is less than or equal to $z$, $F(z) = \mathbb{P}(X \le z)$.
• $\mathbb{P}(x \in (a, b]) = \int_a^b p_X(x)\, dx = F(b) - F(a)$
• Discrete:
$$E[f(X)] = \sum_x f(x)\, p_X(x)$$
• Continuous:
$$E[f(X)] = \int_{-\infty}^{\infty} f(x)\, p_X(x)\, dx$$
• Conditional expectation:
$$E[f(X) \mid Y = y] = \sum_x f(x)\, p(x \mid y)$$
• The expectation of a random variable is not necessarily the value that we should
expect a realization to have.
• The expectation of $X$ (a fair die roll) is
$$E[X] = \sum_x x\, p(x) = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5$$
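A quick sanity check in Python: the sample mean of many simulated die rolls approaches 3.5, even though no single roll can equal 3.5.

```python
# Monte Carlo check that the expected value of a fair die roll is 3.5.
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)   # uniform on {1, ..., 6}
print("exact E[X] =", np.mean([1, 2, 3, 4, 5, 6]), " simulated mean =", rolls.mean())
```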
$$\mathbb{P}_U(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases}$$
• Mean of $\mathcal{U}(x \mid a, b)$:
$$E[x] = \frac{a + b}{2}$$
• Note that it is possible to have $p(x) > 1$, although the density must integrate to 1. For example,
$$\mathcal{U}(x \mid 0, 1/2) = 2, \quad \forall x \in \left[0, \tfrac{1}{2}\right]$$
• We often work with the precision of a Gaussian, 𝜆 = 1/𝜎 2 . The higher the 𝜆, the
narrower the distribution is.
• Let $I \equiv \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) dx$.
• Then, $I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \exp\left(-\frac{1}{2\sigma^2}(y - \mu)^2\right) dx\, dy$.
• Set $r^2 = (x - \mu)^2 + (y - \mu)^2$ and transform to polar coordinates:
$$I^2 = \int_0^{\infty}\int_0^{2\pi} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\, d\theta\, dr = 2\pi \int_0^{\infty} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\, dr = 2\pi\sigma^2$$
so $I = \sqrt{2\pi\sigma^2}$, which is the normalizing constant of the Gaussian.
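The result $I = \sqrt{2\pi\sigma^2}$ can also be verified numerically (a sketch using scipy's quadrature; the choice of $\mu$ and $\sigma$ is arbitrary):

```python
# Numerically verify the Gaussian normalizing constant I = sqrt(2*pi*sigma^2).
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 2.0
I, _ = quad(lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)), -np.inf, np.inf)
print(I, np.sqrt(2 * np.pi * sigma ** 2))   # both are approximately 5.01326
```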
• Sample $X_1, \ldots, X_N$ from $\mathcal{N}(\mu, \sigma^2)$.
• normaldata: relative changes in reported larcenies between 1991 and 1995 (relative to 1991) for the 90 most populous US counties (FBI data).
MatLab implementation
From Bayesian Core, J.-M. Marin and C. P. Robert, Chapter 2 (available online)
• The central limit theorem shows that the sum of i.i.d. random variables has approximately a Gaussian distribution, making it an appropriate choice for modelling noise (the limit of many small additive effects); a small numerical demonstration follows below.
• The Gaussian distribution makes the fewest assumptions (maximum entropy) of all possible distributions with a given mean and variance.
• Closed-form solutions and interesting properties that we will encounter later.
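A small numerical illustration of the central limit theorem (a generic sketch, not the course demo): standardized sums of i.i.d. uniform variables have tail probabilities close to those of a standard normal.

```python
# Central limit theorem demo: sums of i.i.d. uniforms look Gaussian.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 50, 100_000
u = rng.uniform(size=(reps, n))                       # Uniform(0, 1) variables
s = (u.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)     # standardized sums

for z in (1.0, 2.0):
    print(f"P(S > {z}): empirical {np.mean(s > z):.4f}  vs normal {1 - norm.cdf(z):.4f}")
```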
• Consider a coin-flipping experiment with heads = 1 and tails = 0, and $\mu \in [0, 1]$:
$$\mathbb{P}(x = 1 \mid \mu) = \mu, \qquad \mathbb{P}(x = 0 \mid \mu) = 1 - \mu$$
• The likelihood is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at the observed values:
$$\mathbb{P}(\mathcal{D} \mid \mu) = \prod_{i=1}^{N} \mathbb{P}(x_i \mid \mu) = \prod_{i=1}^{N} \mu^{x_i} (1 - \mu)^{1 - x_i} = \mu^m (1 - \mu)^{N - m}$$
where $m = \sum_{i=1}^{N} x_i$ is the number of heads.
• The Binomial distribution for $N = 10$ and $\mu = 0.25, 0.9$ is shown below using the MatLab function binomDistPlot from Kevin Murphy's PMTK.
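The PMTK MatLab demo is not reproduced here; a rough Python equivalent of the plot (using scipy and matplotlib) is sketched below.

```python
# Plot Binomial(N = 10, mu) PMFs for mu = 0.25 and mu = 0.9.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

N = 10
ks = np.arange(N + 1)
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, mu in zip(axes, (0.25, 0.9)):
    ax.bar(ks, binom.pmf(ks, N, mu))
    ax.set_title(f"$\\mu$ = {mu}")
    ax.set_xlabel("number of heads m")
plt.tight_layout()
plt.show()
```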
• We now look at discrete variables that can take on one of $K$ possible mutually exclusive states.
• $m_k = \sum_{i=1}^{N} x_{ik}$ is known as the sufficient statistic of the distribution and is the number of observations with $x_k = 1$.
• MLE estimate of $\boldsymbol{\mu}$:
$$\boldsymbol{\mu}^* = \operatorname{argmax}_{\boldsymbol{\mu}} \log \mathbb{P}(\mathcal{D} \mid \boldsymbol{\mu}) \quad \text{subject to} \quad \sum_{k=1}^{K} \mu_k = 1$$
• This yields $\mu_k = \frac{m_k}{N}$ (a derivation sketch follows below).
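A sketch of the constrained maximization behind this result (a standard Lagrange-multiplier argument, not reproduced from the slides): introduce a multiplier $\lambda$ for the constraint,
$$\mathcal{L}(\boldsymbol{\mu}, \lambda) = \sum_{k=1}^{K} m_k \log \mu_k + \lambda \left(\sum_{k=1}^{K} \mu_k - 1\right), \qquad \frac{\partial \mathcal{L}}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \mu_k = -\frac{m_k}{\lambda}$$
Enforcing $\sum_k \mu_k = 1$ gives $\lambda = -\sum_k m_k = -N$, and hence $\mu_k = m_k / N$.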
• The parameter $\lambda$ is known as the precision of the t-distribution, even though it is not, in general, equal to the inverse of the variance.
• Proof:
$$p(x \mid \mu, \lambda, \nu) \propto \left[1 + \frac{\lambda (x - \mu)^2}{\nu}\right]^{-\frac{\nu}{2} - \frac{1}{2}} = \exp\left(-\frac{\nu + 1}{2} \log\left[1 + \frac{\lambda (x - \mu)^2}{\nu}\right]\right)$$
• Using the Taylor series $\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \approx x$ for small $x$, as $\nu \to \infty$ this tends to $\exp\left(-\frac{\lambda (x - \mu)^2}{2}\right)$, i.e., a Gaussian with mean $\mu$ and precision $\lambda$.
MatLab code
Mean: $\mu$, for $\nu > 1$
Mode: $\mu$
Variance: $\begin{cases} \dfrac{\nu}{\lambda(\nu - 2)}, & \nu > 2 \\ \infty, & 1 < \nu \le 2 \\ \text{undefined}, & \text{otherwise} \end{cases}$
• Consider the substitution $z = \left[b + \frac{1}{2}(x - \mu)^2\right]\tau = A\tau$, with $A \equiv b + \frac{1}{2}(x - \mu)^2$, in the marginalization of a Gaussian over a $\text{Gamma}(a, b)$ prior on its precision $\tau$:
$$p(x \mid \mu, a, b) = \frac{b^a}{\Gamma(a)} \left(\frac{1}{2\pi}\right)^{\frac{1}{2}} \int_0^{\infty} \tau^{0.5} \exp(-A\tau)\, \tau^{a-1}\, d\tau = \frac{b^a}{\Gamma(a)} \left(\frac{1}{2\pi}\right)^{\frac{1}{2}} \frac{1}{A^{\frac{1}{2} + a}} \int_0^{\infty} z^{0.5} \exp(-z)\, z^{a-1}\, dz$$
• By definition, $\Gamma(a) = \int_0^{\infty} \exp(-z)\, z^{a-1}\, dz$, so the remaining integral equals $\Gamma\left(a + \frac{1}{2}\right)$. Therefore,
$$p(x \mid \mu, a, b) = \frac{b^a}{\Gamma(a)} \left(\frac{1}{2\pi}\right)^{\frac{1}{2}} \left[b + \frac{1}{2}(x - \mu)^2\right]^{-\frac{1}{2} - a} \Gamma\left(a + \frac{1}{2}\right)$$
• Finally, redefining $\nu = 2a$ and $\lambda = \frac{a}{b}$, we have
$$p(x \mid \mu, a, b) = \frac{\Gamma\left(\frac{\nu}{2} + \frac{1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{\frac{1}{2}} \left[1 + \frac{\lambda (x - \mu)^2}{\nu}\right]^{-\frac{\nu}{2} - \frac{1}{2}}$$
• The effect of a small number of outliers (figure on the right) is less significant for the t-distribution than for the Gaussian; a rough comparison is sketched below.
Matlab Code
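The MatLab demo is not included; as a rough Python stand-in (on hypothetical data), one can fit both distributions to data containing a few outliers and compare the estimated locations.

```python
# Robustness of the Student-t vs the Gaussian: fit both to data with outliers.
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), [8.0, 9.0, 10.0]])  # 3 outliers

mu_g, sigma_g = norm.fit(data)            # Gaussian MLE (mean is pulled by outliers)
nu, mu_t, scale_t = t.fit(data)           # Student-t MLE (heavier tails absorb outliers)
print(f"Gaussian location: {mu_g:.3f}   t location: {mu_t:.3f} (nu = {nu:.1f})")
```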
• The operations are the same as for any other probability density function.
• Covariance:
$$\text{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\, E[Y]$$
• It expresses the extent to which 𝑋 and 𝑌 vary (linearly) together.
• If $X$ and $Y$ are independent, $P(X, Y) = P(X)\,P(Y)$ and $\text{cov}(X, Y) = 0$; however, the converse need not hold.
• 𝑋 and 𝑌 are said to be orthogonal if, 𝐸 𝑋𝑌 = 0.
• The correlation reflects the noisiness and direction of a linear relationship (top
row), but not the slope of that relationship (middle), nor nonlinear relationships
(bottom).
• The diagonal of the covariance matrix gives the variances of the individual
components.
• A normalized version of this is the correlation matrix where all elements are
between −1,1 (diagonal elements = 1).
$$\mathbf{R} = \begin{bmatrix} 1 & \text{corr}(\boldsymbol{X}_1, \boldsymbol{X}_2) & \cdots & \text{corr}(\boldsymbol{X}_1, \boldsymbol{X}_N) \\ \text{corr}(\boldsymbol{X}_2, \boldsymbol{X}_1) & 1 & \cdots & \text{corr}(\boldsymbol{X}_2, \boldsymbol{X}_N) \\ \vdots & \vdots & \ddots & \vdots \\ \text{corr}(\boldsymbol{X}_N, \boldsymbol{X}_1) & \text{corr}(\boldsymbol{X}_N, \boldsymbol{X}_2) & \cdots & 1 \end{bmatrix}$$
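In Python, the sample covariance and correlation matrices can be computed with numpy (a small sketch on synthetic data):

```python
# Sample covariance and correlation matrices.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 2.0 * x1 + rng.normal(size=1000)     # correlated with x1
x3 = rng.normal(size=1000)                # independent of the others

X = np.vstack([x1, x2, x3])               # rows are variables, columns are observations
print("covariance matrix:\n", np.cov(X))
print("correlation matrix (diagonal = 1, entries in [-1, 1]):\n", np.corrcoef(X))
```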
$$p(\boldsymbol{x}) = \frac{1}{(2\pi)^{N/2}\, \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu} \in \mathbb{R}^N$ is the mean vector and $\Sigma \in \mathbb{R}^{N \times N}$ is the covariance matrix.
MATLAB code
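The MATLAB code is not reproduced; a brief Python sketch (scipy) that evaluates and samples the multivariate normal density above:

```python
# Evaluate and sample a 2-D multivariate normal.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

mvn = multivariate_normal(mean=mu, cov=Sigma)
print("density at the mean:", mvn.pdf(mu))     # 1 / (2*pi*sqrt(det(Sigma))) for N = 2
samples = mvn.rvs(size=5, random_state=0)      # draw a few samples
print("sample shape:", samples.shape)
```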
• Let $x = g(y)$; then
$$p_Y(y) = p_X(g(y)) \left|\frac{dx}{dy}\right| = p_X(g(y))\, \left|g'(y)\right|$$
• An example:
$$\text{Gamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} \exp(-xb)$$
Define $Y = 1/X$. Then
$$p_Y(y) = \frac{b^a}{\Gamma(a)}\, y^{-(a-1)} \exp(-b/y) \left|-\frac{1}{y^2}\right| = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} \exp(-b/y)$$
→ This is the inverse Gamma distribution (a numerical check is sketched below).
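A quick numerical check of this change of variables (a sketch; note that scipy's gamma uses a scale parameter, so rate $b$ corresponds to scale $1/b$):

```python
# If X ~ Gamma(a, rate=b), then Y = 1/X follows an inverse-Gamma(a, b) distribution.
import numpy as np
from scipy.stats import gamma, invgamma

a, b = 3.0, 2.0
x = gamma(a, scale=1.0 / b).rvs(size=200_000, random_state=0)
y = 1.0 / x

# Compare the empirical mean of Y with the inverse-Gamma mean b / (a - 1).
print("empirical mean of 1/X:", y.mean(), " theory:", b / (a - 1))
print("invgamma pdf at y = 1:", invgamma(a, scale=b).pdf(1.0))
```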
• We define $\bar{X}_N$ as
$$\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i$$
• Taking the expectation of both sides, we see
$$E[\bar{X}_N] = \frac{1}{N} \sum_{i=1}^{N} E[X_i] = \mu, \qquad \text{Var}[\bar{X}_N] = \frac{1}{N^2} \sum_{i=1}^{N} \text{var}(X_i) = \frac{\sigma^2}{N}$$
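A brief simulation (a sketch with arbitrary $\mu$ and $\sigma$) illustrating that the variance of the sample mean shrinks like $\sigma^2/N$:

```python
# Variance of the sample mean: Var[X_bar] ≈ sigma^2 / N.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, reps = 2.0, 3.0, 100, 20_000
means = rng.normal(mu, sigma, size=(reps, N)).mean(axis=1)
print("mean of X_bar:", means.mean(), " (theory:", mu, ")")
print("variance of X_bar:", means.var(), " (theory:", sigma**2 / N, ")")
```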
• The problem is to infer the underlying probability distribution that gives rise to the
data 𝒮.
• Parametric model: assume a model and then try to infer its parameters (e.g., fitting a normal distribution).
Matlab code
$$I = \int_{-r}^{r}\int_{-r}^{r} \mathbb{I}(x^2 + y^2 \le r^2)\, dx\, dy = \pi r^2$$
$$\pi = \frac{1}{r^2}\, 4r^2 \int_{-r}^{r}\int_{-r}^{r} \mathbb{I}(x^2 + y^2 \le r^2)\, p_X(x)\, p_Y(y)\, dx\, dy \approx \frac{4}{N} \sum_{i=1}^{N} \mathbb{I}(x_i^2 + y_i^2 \le r^2)$$
where $p_X$ and $p_Y$ are uniform densities on $[-r, r]$ and $(x_i, y_i)$ are samples drawn from them.
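The corresponding Monte Carlo estimator in Python (a minimal sketch; the MatLab version referenced in the slides is not shown):

```python
# Monte Carlo estimate of pi: draw (x, y) uniformly on [-r, r]^2 and count hits.
import numpy as np

rng = np.random.default_rng(0)
r, N = 1.0, 1_000_000
x = rng.uniform(-r, r, N)
y = rng.uniform(-r, r, N)
pi_hat = 4.0 * np.mean(x**2 + y**2 <= r**2)
print("estimate:", pi_hat, " error:", abs(pi_hat - np.pi))
```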
• Consider some unknown distribution 𝑝 𝑥 and suppose we have modelled this using
an approximate distribution 𝑞 𝑥 .
• Jensen's inequality (for a convex function $f$):
$$f\left(\sum_{i=1}^{M} \lambda_i x_i\right) \le \sum_{i=1}^{M} \lambda_i f(x_i), \quad \lambda_i \ge 0 \ \text{and} \ \sum_i \lambda_i = 1$$
which can be used to show that $KL(p \| q) \ge 0$.
• Expanding the KL divergence, $KL(p \| q) = -\int p(\boldsymbol{x}) \log q(\boldsymbol{x} \mid \theta)\, d\boldsymbol{x} + \int p(\boldsymbol{x}) \log p(\boldsymbol{x})\, d\boldsymbol{x}$; only the first term involves $q$. Therefore, approximating $p$ by the empirical distribution of the data, minimizing $KL(p \| q)$ is equivalent to maximizing the data likelihood under $q(\boldsymbol{x} \mid \theta)$.
• Using $p(x, y) = p(x)\, p(y \mid x)$, we have
$$\mathbb{H}[y \mid x] = -\iint p(y, x) \log \frac{p(x, y)}{p(x)}\, dx\, dy = -\iint p(y, x) \log p(y \mid x)\, dx\, dy$$
• Sum rule: $\mathbb{P}(X = x_i) = \sum_j \mathbb{P}(X = x_i, Y = y_j)$, and for continuous variables $\mathbb{P}(x) = \int \mathbb{P}(x, y)\, dy$
• Product rule: $\mathbb{P}(X = x_i, Y = y_j) = \mathbb{P}(Y = y_j \mid X = x_i)\, \mathbb{P}(X = x_i)$
• Bayes' rule: $\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\theta)}_{\text{prior}}\, \underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}}$
• KL divergence:
$$KL(p \| q) = -\int p(\boldsymbol{x}) \log \frac{q(\boldsymbol{x} \mid \theta)}{p(\boldsymbol{x})}\, d\boldsymbol{x}$$
• Jensen's inequality (for a convex function $f$):
$$f\left(\sum_{i=1}^{M} \lambda_i x_i\right) \le \sum_{i=1}^{M} \lambda_i f(x_i), \quad \lambda_i \ge 0 \ \text{and} \ \sum_i \lambda_i = 1$$