
Probabilistic Machine Learning for Mechanics (APL 744)

Dr. Souvik Chakraborty


Department of Applied Mechanics
Indian Institute of Technology Delhi
Hauz Khas – 110016, Delhi, India.

E-mail: [email protected]
Website: https://www.csccm.in/

APl 744 CSCCM@IITD 1


Course organization
https://www.csccm.in/courses/probabilistic-machine-learning-for-mechanics
✓ Although the course is offered offline (in person), video lectures will also be uploaded.

✓ For any doubts, you can contact me at [email protected] or meet me in person.

✓ Homework, practicals, and the term project should be submitted via Moodle.

✓ All information regarding the course will be posted on the webpage.

✓ This includes homework, slides, references, etc.

✓ Class Timing: B-slot


✓ Lectures: MTh: 9:30 – 11:00 am
✓ Practical: Fr: 3:00 – 5:00 pm

✓ TAs: Shailesh Garg, Navaneeth N., and Tapas Tripura

APl 744 CSCCM@IITD 2


Course evaluation
✓ 8 homework assignments will be provided (approximately 10 days per homework) – 30% weightage.
• Each will have a programming component.
• Copy-pasted programs will be awarded ZERO marks.
✓ 8 practical sheets will be provided – 15% weightage.
• Each practical will have 1-2 questions.
• In the practical session, the procedure (theory + algorithm) will be explained to you.
However, you will have to complete the practical at the hostel.
✓ Term project – 30% weightage.
• You will be judged on your understanding of the project, clarity of thought, and the
sophistication of the software developed.
• A team can have a maximum of 2 members. We will provide you with broad topics and
you will have to select from them.
✓ Minor – 10% weightage.
✓ Major – 15% weightage.
✓ In case any of you decide to audit the course, the passing grade will be 40% (absolute).
Additionally, you will have to complete the project and appear for both the minor and the major.
✓ In all cases, we will mention which libraries can and cannot be used.
APl 744 CSCCM@IITD 3
Homework

What kind of questions will be asked?

• Theoretical questions involving derivations.


• Practical questions involving programming.

Important dates

• HW1: Announcement (August 2); Submission (August 12).


• HW2: Announcement (August 15); Submission (August 25).
• HW3: Announcement (August 28); Submission (September 08).
• HW4: Announcement (September 10); Submission (September 20).
• HW5: Announcement (September 22); Submission (October 02).
• HW6: Announcement (October 05); Submission (October 15).
• HW7: Announcement (October 20); Submission (October 30).
• HW8: Announcement (October 31); Submission (November 10).

APl 744 CSCCM@IITD 4


Practical

What kind of questions will be asked?

• Practical questions involving programming

Important dates

• P1: Announcement (July 27); Submission (August 11).


• P2: Announcement (August 10); Submission (August 25).
• P3: Announcement (August 24); Submission (September 08).
• P4: Announcement (September 07); Submission (September 22).
• P5: Announcement (September 21); Submission (October 06).
• P6: Announcement (October 05); Submission (October 20).
• P7: Announcement (October 19); Submission (November 03).
• P8: Announcement (November 02); Submission (November 10).

APl 744 CSCCM@IITD 5


Term project
What to do?

• We will provide you a pool of research areas


• Select a research area and find a paper in that area (or propose something new in
that area).
• Read the paper, understand it, and reproduce the results.
• Prepare a report and an MS PowerPoint presentation.

Important dates

• Announcement of broad topics – July 31st


• Topic selection will be first come, first served – last date Aug 7th.
• Submission of paper/title and abstract – last date Aug 14th.
• First-quarter report on the project: September 05 (review of relevant literature).
• Mid-term report on the project (should have some implementation): September 25.
• Third-quarter report on the project (should have some implementation): October 15.
• Final report on the project (6-page report in NIPS format): November 05.
• Should be accompanied by the written codes and readme files.
• Source files (Word or TeX) should also be submitted.
• Final presentation
APl 744 CSCCM@IITD 6
Books

✓ Bishop, C.M. Pattern recognition and Machine learning, Springer, 2007.

✓ Murphy, K.P. Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

✓ C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning,
MIT Press, 2006 (a free ebook is also available from the Gaussian Processes website).

Additional references may be provided (with HTML links) in each lecture.

APl 744 CSCCM@IITD 7


Topics to Cover
✓ Introduction to Statistical Computing and Probability and Statistics
✓ Sum and Product Rules, Conditional Probability, Independence, PDF and CDF,
Bernoulli, Categorical and Multinomial Distributions, Poisson, Student’s T,
Laplace, Gamma, Beta and Pareto distribution.
✓ Generative Models; Bayesian concept learning, Likelihood, Prior, Posterior,
Posterior predictive distribution, Plug-in Approximation
✓ Bayesian Model Selection (continued) and Prior Models, Hierarchical Bayes,
Empirical Bayes
✓ Bayesian linear regression
✓ Introduction to Monte Carlo Methods, Sampling from Discrete and Continuum
Distributions, Reverse Sampling, Transformation Methods, Composition Methods,
Accept-Reject Methods, Stratified/Systematic Sampling
✓ Importance sampling, Gibbs sampling, MCMC, Metropolis Hasting algorithm
✓ Sequential importance sampling, Sequential Monte Carlo
✓ Latent variable model, probabilistic PCA, Expectation maximization
✓ Gaussian process and variational inference
✓ Some advanced topics in probabilistic ML: Bayesian neural network, GAN, VAE,
Flow-based model, Diffusion model

APl 744 CSCCM@IITD 8


Announcement

✓ Ungraded Quiz on 27th July 2023.


✓ Syllabus: Univariate and multivariate statistics and probability
✓ Purpose: To check your fundamentals on probability and statistics
✓ Outcome:
• If performance is okay – then the course will proceed as mentioned previously
• If performance is medium – I will upload additional lecture videos on the
probability and statistics part
• If performance is bad – I will take additional classes on weekends (or convert
a few practical classes into lectures).

So, if you don’t want to attend extra classes, prepare for the exam. Syllabus – second
chapter of Bishop.

APl 744 CSCCM@IITD 9


Machine learning
• We are in the era of big data (size of the web, YouTube, etc.) [link].

• How to utilize the data?


• Detect patterns in data.
• Predict future data.
• Decision making under uncertainty.
• Analysis and design (with and without uncertainty).

• Machine learning is the answer to all these questions.

• Machine learning algorithms can be either frequentist or probabilistic (Bayesian) in nature.

• In this course, we will be focusing on probabilistic machine learning.

• The probabilistic approach to machine learning is closely related to the field of
computational statistics.

• Rajaraman, A. and J. Ullman (2010). Mining of Massive Datasets.
• Bekkerman, R., M. Bilenko, and J. Langford (Eds.) (2011). Scaling Up Machine Learning. Cambridge (online presentation).

APl 744 CSCCM@IITD 10


Supervised learning

• Machine learning is broadly divided into two main types. In the supervised learning approach,
the goal is to learn a mapping from inputs 𝒙 to output 𝑦, given a labelled set of
data, 𝒟 = {(𝒙𝑖, 𝑦𝑖)}, 𝑖 = 1, … , 𝑁.

• 𝒟 is called the training set and 𝑁 is the number of training samples.

• 𝒙𝑖 is a 𝐷-dimensional vector of inputs (e.g., features, attributes or covariates).

• 𝑦𝑖 is the output and, in principle, can be anything.

• If 𝑦𝑖 is categorical (takes discrete values; e.g., cat vs elephant), then the problem is a
classification problem.

• If 𝑦𝑖 is real-valued (takes continuous values; e.g., the price of a commodity), then the
problem is a regression problem.

APl 744 CSCCM@IITD 18


Unsupervised learning

• In the unsupervised learning approach, we have only input data, 𝒟 = {𝒙𝑖}, 𝑖 = 1, … , 𝑁,
and the objective is to find patterns in the data (knowledge discovery).

• We work with unlabelled data; in other words, we are not told what kind of patterns
to look for.

• This is a more realistic scenario (from an AI point of view) and is more challenging.

APl 744 CSCCM@IITD 19


Reinforcement learning

• There is a third type of machine learning, known as reinforcement learning.

• This is also very important from AI point-of-view.

• It is concerned with learning how to act based on occasional reward or punishment signals.

• When it comes to applications of reinforcement learning in mechanics, this has been
less explored.

Kaelbling, L., M. Littman, and A. Moore (1996). Reinforcement learning: A survey. J. of AI Research 4, 237–285.
Sutton, R. and A. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Russell, S. and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
Szepesvari, C. (2010). Algorithms for Reinforcement Learning. Morgan Claypool.
Wiering, M. and M. van Otterlo (Eds.) (2012). Reinforcement Learning: State of the Art.

APl 744 CSCCM@IITD 20


Supervised learning - classification
• We learn a mapping from inputs 𝒙 to output 𝑦, where
  𝑦𝑖 ∈ {1, 2, … , 𝐶}
• If 𝐶 = 2, it is binary classification.
• If 𝐶 ≥ 3, it is multi-class classification.
• If class labels are not mutually exclusive (e.g., tall and strong), we call it multi-label
classification.
• The objective is generalization; i.e., given a new 𝒙∗, find 𝑦∗.
• Left: training examples of colored shapes, along with 3 unlabeled test cases.
• Right: training data as an 𝑁 × 𝐷 design matrix. The 𝑖-th row represents the feature
vector 𝒙𝑖. The last column is the label 𝑦𝑖.

APl 744 CSCCM@IITD 21


Probabilistic predictions
• In our classification example, we work with the posterior probabilities p(y∗ | 𝒙∗, 𝒟).

• For example, in binary classification, we compute the following:
  • p(y∗ = 0 | 𝒙∗, 𝒟)
  • p(y∗ = 1 | 𝒙∗, 𝒟)

• Given a probabilistic output, we can always compute our “best guess” as
  ŷ = argmax_{c ∈ {1,…,C}} p(y∗ = c | 𝒙∗, 𝒟)

• This corresponds to the most probable class label and is the mode of the distribution
p(y∗ | 𝒙∗, 𝒟). This is known as the maximum a posteriori (MAP) estimate.

• Point estimates are often not the best solution – what if p(y∗ = 1 | 𝒙∗, 𝒟) is far away
from 1?

• Therefore, it is extremely important to work with the full distribution p(y∗ | 𝒙∗, 𝒟) and not only with
point estimates.
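A minimal sketch of reading off the MAP estimate from a class posterior (the probability values below are hypothetical, not from any dataset):

```python
import numpy as np

# Hypothetical posterior p(y* = c | x*, D) over C = 3 classes for one test input
posterior = np.array([0.15, 0.70, 0.15])

y_map = int(np.argmax(posterior))   # MAP estimate: the most probable class label
confidence = posterior[y_map]       # how much mass the posterior puts on that label

print(y_map, confidence)            # keeping the full posterior retains this information
```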

APl 744 CSCCM@IITD 22


Point estimations
• Point estimates can be misleading and need to be avoided.

• Specifically, in domains such as medicine, finance, and design and analysis, where
failure can have significant consequences.

• IBM Watson beat the top human Jeopardy! champions partly because it contains a module that
estimates how confident it is of its answer.

• Google’s SmartASS (ad selection system) predicts the probability (click-through
rate, CTR) that you will click on an ad based on your search history and other user- and
ad-specific features. The CTR can be used to maximize expected profit.

• Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N.
Schlaefter, and C. Welty (2010). Building Watson: An Overview of the DeepQA Project. AI Magazine, 59–79.
• Metz, C. (2010). Google behavioral ad targeter is a Smart Ass. The Register.

APl 744 CSCCM@IITD 23


Supervised Learning: Regression
• Consider a real-valued input 𝑥𝑖 ∈ ℝ, and a single real-valued response 𝑦𝑖 ∈ ℝ.

• We fit a polynomial of order 1, 2 and 20.

• Many applications have high-dimensional input data.
• Issues of model selection are essential (overfitting, etc.); see the sketch below.

linRegPolyVsDegree from PMTK
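A small Python/NumPy sketch of the same experiment (the sine-plus-noise data below are an assumption for illustration; the PMTK demo uses its own dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)   # noisy 1-D regression data

for degree in (1, 2, 20):
    coeffs = np.polyfit(x, y, deg=degree)            # least-squares polynomial fit
    rmse = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(f"degree {degree:2d}: training RMSE = {rmse:.3f}")
# The degree-20 polynomial drives the training error toward zero but overfits the noise.
```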

APl 744 CSCCM@IITD 24


Unsupervised Vs. Supervised Learning

• There are two differences from the supervised case.


• Supervised learning is conditional density estimation, p(𝑦𝑖 | 𝒙; 𝜽).
• 𝑦𝑖 is usually a single variable (class label) we are trying to predict.
Thus, for most supervised learning problems, we can use univariate
probability models.

• Unsupervised learning is unconditional density estimation, p(𝒙𝑖 | 𝜽).

  • 𝒙𝑖 are vectors of features, and hence we need multivariate
probability models.

• Cheeseman, P., J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman (1988). Autoclass: A Bayesian classification system.
In Proc. of the Fifth Intl. Workshop on Machine Learning.
• Lo, C. H. (2009). Statistical methods for high throughput genomics. Ph.D. thesis, UBC.
• Berkhin, P. (2006). A survey of clustering datamining techniques. In J. Kogan, C. Nicholas, and M. Teboulle(Eds.),
Grouping Multidimensional Data: Recent Advances in Clustering, pp. 25–71. Springer.

APl 744 CSCCM@IITD 25


Unsupervised Learning: Hidden Variables
• Consider clustering data into groups: e.g., the height and weight of a group of 210
people. It is not clear how many clusters there are.

• Our first goal is to estimate the distribution over the number of clusters, p(K | 𝒟);
this tells us if there are subpopulations within the data.

• For simplicity, we often approximate the distribution p(K | 𝒟) by its mode,
  K∗ = argmax_K p(K | 𝒟)
  (kmeansHeightWeight from PMTK)

• The second objective is to assign each data point to the corresponding cluster
(hidden or latent variables),
  z_i∗ = argmax_k p(z_i = k | 𝒙𝑖, 𝒟)
• Picking a model of the right complexity (here the number of clusters) is called
model selection.
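A minimal sketch of this workflow using K-means (scikit-learn). K-means returns hard cluster assignments rather than the posterior p(z_i = k | 𝒙𝑖, 𝒟), and the two-subpopulation data below are a synthetic stand-in for the height/weight dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the height/weight data: two subpopulations
data = np.vstack([rng.normal([165, 60], [6, 7], size=(100, 2)),
                  rng.normal([180, 85], [6, 8], size=(110, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
z_hat = km.labels_                    # hard assignment of each point to a cluster
print(km.cluster_centers_)            # one centre per discovered subpopulation
```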
APl 744 CSCCM@IITD 26
Dimensionality Reduction
• Reduce the dimensionality by projecting the data to a lower-dimensional
subspace which captures the essence of the data.
• Latent factors: although the data may appear high-dimensional, there may only
be a small number of degrees of variability.
• Principal Components Analysis (PCA): a common approach to dimensionality
reduction. Useful for visualization, nearest-neighbour searches, etc.

pcaDemo3d from PMTK
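A short scikit-learn sketch of the same idea on synthetic data (assumed here: 3-D observations generated from 2 latent factors plus a little noise):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 2))                 # a small number of latent factors
data = latent @ rng.standard_normal((2, 3)) + 0.05 * rng.standard_normal((200, 3))

pca = PCA(n_components=2).fit(data)
projected = pca.transform(data)                        # lower-dimensional representation
print(pca.explained_variance_ratio_)                   # almost all variance in 2 components
```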

APl 744 CSCCM@IITD 27


Discovering Graph Structure
• We measure a set of variables, and we like to discover which ones are most
correlated with which others. This is represented by a graph, 𝒢, in which nodes
represent variables, and edges represent dependence between variables. We look
to compute
  Ĝ = argmax_𝒢 p(𝒢 | 𝒟)

• A sparse undirected Gaussian graphical model is shown, learned using the graphical
lasso applied to some flow cytometry data which measure the phosphorylation status
of 11 proteins.

• Sachs, K., O. Perez, D. Pe’er, D. Lauffenburger, and G. Nolan (2005). Causal


protein-signaling networks derived from multiparameter single-cell data. Science
308.
• Smith, V., J. Yu, T. Smulders, A. Hartemink, and E. Jarvis (2006). Computational
Inference of Neural Information Flow Networks. PLOS Computational Biology 2, 1436–1439.
(ggmLassoDemo from PMTK)
• Horvitz, E., J. Apacible, R. Sarin, and L. Liao (2005). Prediction, Expectation,
and Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting
Service. In UAI.
• Carvalho, C. M. and M. West (2007). Bayesian Analysis 2(1), 69–98.

APl 744 CSCCM@IITD 28


Bayesian statistics
• Frequentist approach: Long run frequencies of ‘events’.

• Bayesian approach: Quantifying our uncertainty about something.

• The rules of probability are the same for both.

• Sample space: Set of all possible outcomes of an experiment.

• Event: A subset of sample space.

For more details: S. Ross, Introduction to Probability Models


Bayesian statistics: click here
Frequentist statistics: click here

APl 744 CSCCM@IITD 29


Inference
• An inference problem requires making statements about an unobserved (latent) variable 𝑥
based on observations 𝑦, which are related to 𝑥 but may not be sufficient to fully
determine 𝑥.

• Naturally this requires a notion of uncertainty.

• In real-life, most problems are of this nature.

• However, we tend to simplify it by ignoring the uncertain part.

• Example: predicting the weather.

APl 744 CSCCM@IITD 30


Example – the roulette wheel

APl 744 CSCCM@IITD 31


Definition

• Let 𝐸 be a space of elementary events. Consider the power set 2^𝐸, and let
ℱ ⊂ 2^𝐸 be a collection of subsets of 𝐸. Elements of ℱ are called random events. If ℱ satisfies
the following properties, it is called a σ-algebra:
  • 𝐸 ∈ ℱ
  • 𝐴, 𝐵 ∈ ℱ ⇒ 𝐴 − 𝐵 ∈ ℱ
  • 𝐴₁, 𝐴₂, … ∈ ℱ ⇒ ∪_{i=1}^{∞} 𝐴ᵢ ∈ ℱ and ∩_{i=1}^{∞} 𝐴ᵢ ∈ ℱ

• If ℱ is a σ-algebra, then its elements are called measurable sets and (𝐸, ℱ) is
called a measurable space or Borel space.

APl 744 CSCCM@IITD 32


Probability axiom
• Probability space: We define (Ω_E, ℱ, 𝒫) to be the probability space such that Ω_E
is the sample space, ℱ is the event space and 𝒫 is the probability measure, so
that 𝒫(E) is the probability of an event E and 𝒫(Ω_E) = 1.

• First axiom: The probability of an event is a non-negative real number:
  𝒫(E) ∈ ℝ, 𝒫(E) ≥ 0 ∀E ∈ ℱ
  It follows that 𝒫(E) is always finite.

• Second axiom: This is the assumption of unit measure: the probability that at
least one of the elementary events in the entire sample space will occur is 1,
  𝒫(Ω_E) = 1

• Third axiom: This is the assumption of σ-additivity: for any countable sequence of
disjoint sets E₁, E₂, … (synonymous with mutually exclusive events),
  𝒫(∪_{i=1}^{∞} E_i) = ∑_{i=1}^{∞} 𝒫(E_i)

APl 744 CSCCM@IITD 33


Laws of probability

• ℙ(E): The probability of event E. It is a number satisfying the following two
conditions:
  • 0 ≤ ℙ(E) ≤ 1.
  • ℙ(Ω) = 1, ℙ(∅) = 0.

• For any sequence of events E₁, E₂, … that are mutually exclusive,
  ℙ(∪_{i=1}^{∞} E_i) = ∑_{i=1}^{∞} ℙ(E_i)

• E^c: This indicates the complement of E, i.e., all outcomes that are not in E.

• E ∪ E^c = Ω

• ℙ(E ∪ F) = ℙ(E) + ℙ(F) − ℙ(E ∩ F)

APl 744 CSCCM@IITD 34


Union and Intersection
• ∪ operator: Consider two events E and F of the sample space Ω. Then the event
E ∪ F contains all outcomes that are either in E or in F or in both.

• ∩ operator: Consider two events E and F of the sample space Ω. Then the event
E ∩ F contains all outcomes that are in both E and F.

• Consider E = {1, 2, 5} and F = {2, 4}; then
  • E ∪ F = {1, 2, 4, 5}
  • E ∩ F = {2}

• If E ∩ F = ∅, then we say E and F are mutually exclusive.

• ∅ is called the empty set.

• Generalizing, ∪_{i=1}^{∞} E_i indicates the union of E₁, E₂, …, and contains the outcomes that
are in at least one E_i.

• Generalizing, ∩_{i=1}^{∞} E_i indicates the intersection of E₁, E₂, …, and contains the outcomes
that are in all events E_i.

APl 744 CSCCM@IITD 35


Discrete random variables

• Discrete random variable 𝑋 can take any value from a finite or countably infinite
set 𝒳.

• A discrete random variable is defined by a probability mass function (PMF), f(x),
where
  0 ≤ f(x) ≤ 1 and ∑_{x ∈ 𝒳} f(x) = 1

• Consider 𝒳 = {1, 2, 3, 4}. Also consider two PMFs as follows:
  (a) f(x = k) = 1/4, k = 1, 2, 3, 4
  (b) f(x = k) = 𝕀(x = 1), i.e., 1 if k = 1 and 0 elsewhere

• The two PMFs are shown on the next slide.

APl 744 CSCCM@IITD 36


Discrete random variables

(Figure: bar plots of the two PMFs, f(x = k) = 1/4 and f(x = k) = 𝕀(x = 1).)

Run the Matlab function discreteProbDistFig from Kevin Murphy’s PMTK.

APl 744 CSCCM@IITD 37


Joint Probability

• Probability theory provides a consistent framework for the quantification and
manipulation of uncertainty.

• This is a central foundation of pattern recognition and machine learning.

• The probability that X takes the value x_i and Y takes the value y_j is represented as
ℙ(X = x_i, Y = y_j). This is the joint probability of X = x_i and Y = y_j.

  ℙ(X = x_i, Y = y_j) = n_ij / N

• n_ij denotes the number of times X = x_i and Y = y_j is observed (a cell of the
contingency table). c_i denotes the number of times X = x_i is observed (column total
for x_i), and r_j denotes the number of times Y = y_j is observed (row total for y_j).

APl 744 CSCCM@IITD 38


Sum and product rule
• The most important rules in Bayesian statistics are the sum and product rules.

• Sum rule:
  ℙ(X = x_i) = ∑_j ℙ(X = x_i, Y = y_j) = (∑_j n_ij) / N = c_i / N

  For continuous variables, ℙ(x) = ∫ ℙ(x, y) dy

• Product rule:
  ℙ(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = ℙ(Y = y_j | X = x_i) ℙ(X = x_i)

• Chain rule: The product rule leads to the chain rule
  ℙ(X_1:D) = ℙ(X_1) ℙ(X_2 | X_1) ℙ(X_3 | X_1, X_2) ⋯ ℙ(X_D | X_1, X_2, … , X_{D−1})

• The most complex calculations in probability are nothing but simple applications
of the sum and product rules.
APl 744 CSCCM@IITD 39
Conditional probability and Bayes’ rule
  ℙ(X = x | Y = y) = ℙ(X = x, Y = y) / ℙ(Y = y) = ℙ(X = x) ℙ(Y = y | X = x) / ∑_{x′} ℙ(X = x′, Y = y)

  ℙ(X = x | Y = y) = ℙ(X = x) ℙ(Y = y | X = x) / ∑_{x′} ℙ(X = x′) ℙ(Y = y | X = x′)

• Bayes’ theorem plays an important role in machine learning and pattern
recognition.

• An example (a red box and a blue box of fruit):
  ℙ(B = r) = 0.4, ℙ(B = b) = 0.6

  We select an orange. What is the probability
  that the orange came from the red box?
  ℙ(B = r | F = o) = ℙ(B = r) ℙ(F = o | B = r) / ∑_{x′} ℙ(B = x′, F = o)

APl 744 CSCCM@IITD 40


Conditional probability and Bayes’ rule

  ℙ(B = r | F = o) = ℙ(B = r) ℙ(F = o | B = r) / ℙ(F = o)

  ℙ(F = o) = ℙ(B = r) ℙ(F = o | B = r) + ℙ(B = b) ℙ(F = o | B = b)

  ℙ(F = o) = 0.4 × 6/8 + 0.6 × 1/4 = 0.45

  ℙ(B = r | F = o) = (0.4 × 6/8) / 0.45 = 2/3

• Once it is observed that the fruit selected is an orange, the probability that it came from the red
box increases from 0.4 to 2/3 ≈ 0.67.

APl 744 CSCCM@IITD 41


Medical diagnosis – base rate fallacy
• Suppose you are sick and your doctor thinks that you have Tuberculosis (TB).

• It is known that 0.4% of the population has tuberculosis, ℙ TB = 0.004.

• A test is available but not perfect; if a tested patient has disease, 80% of the time
the test will be positive, ℙ Positive|TB = 0.80. On the contrary, if a tested
patient does not have the disease, 90% of the time the result is negative,
ℙ Negative|TB c = 0.9, ℙ Positive|TB c = 0.1.

• Your test is positive, should you be worried?

• Base rate fallacy: People tend to assume that they have an 80% chance of having the
disease, but they ignore the PRIOR knowledge (the base rate).

  ℙ(TB | Positive) = ℙ(TB) ℙ(Positive | TB) / ℙ(Positive)

  ℙ(Positive) = ℙ(TB) ℙ(Positive | TB) + ℙ(TB^c) ℙ(Positive | TB^c)
              = 0.004 × 0.8 + 0.996 × 0.1 = 0.1028

  ℙ(TB | Positive) = (0.004 × 0.8) / 0.1028 ≈ 0.031 = 3.1%
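The same calculation in a few lines of Python, using only the numbers given above:

```python
p_tb = 0.004                    # prior P(TB)
p_pos_given_tb = 0.80           # P(Positive | TB)
p_pos_given_no_tb = 0.10        # P(Positive | TB^c)

p_pos = p_tb * p_pos_given_tb + (1 - p_tb) * p_pos_given_no_tb   # sum rule
p_tb_given_pos = p_tb * p_pos_given_tb / p_pos                   # Bayes' rule

print(p_pos, p_tb_given_pos)    # 0.1028 and about 0.031
```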

APl 744 CSCCM@IITD 42


Independence and conditional probabilities

• Two events A and B are independent (A ⊥ B) if
  ℙ(A ∩ B) = ℙ(A) ℙ(B)

• Conditional probability: the probability that A happens given that B has already
happened,
  ℙ(A | B) = ℙ(A ∩ B) / ℙ(B)

• It is trivial to prove that
  ℙ(A ∩ B) ≤ ℙ(A | B)

• For independent events,
  ℙ(A | B) = ℙ(A)

APl 744 CSCCM@IITD 43


Conditional independence

• Two events A and B are conditionally independent given Z if
  ℙ(A, B | Z) = ℙ(A | Z) ℙ(B | Z)

• Define the following events:
  • a → it will rain tomorrow.
  • b → the ground is wet today.
  • c → it is raining today.

• c causes both a and b and hence, given c, a and b are
independent:
  ℙ(a, b | c) = ℙ(a | c) ℙ(b | c)
  ℙ(a | b, c) = ℙ(a | c)

• Observing a root node separates the children.

APl 744 CSCCM@IITD 44


Pairwise vs. Mutual independence

• Consider four balls, {1, 2, 3, 4}, in a box, from which one ball is drawn at random. Now consider the following
events:
  • Event E₁: ball 1 or 2 is drawn.
  • Event E₂: ball 2 or 3 is drawn.
  • Event E₃: ball 1 or 3 is drawn.

• Note that ℙ(E₁) = ℙ(E₂) = ℙ(E₃) = 1/2, and
  ℙ(E₁, E₂) = 1/4 = ℙ(E₁) ℙ(E₂),  ℙ(E₂, E₃) = 1/4 = ℙ(E₂) ℙ(E₃),  ℙ(E₁, E₃) = 1/4 = ℙ(E₁) ℙ(E₃)

• However,
  ℙ(E₁, E₂, E₃) = 0 ≠ ℙ(E₁) ℙ(E₂) ℙ(E₃) = 1/8
• Pairwise independence doesn’t ensure mutual independence

APl 744 CSCCM@IITD 45


Importance of independence

• Consider two random variables X and Y, where X takes 6 values and Y takes 5 values.

• Modelling the joint ℙ(x, y) directly requires 6 × 5 − 1 = 29 parameters.

• If X and Y are independent, we only need ℙ(x) and ℙ(y), i.e., 5 + 4 = 9 parameters.

• Independence is key to EFFICIENT probabilistic modelling (Naïve Bayes,
Markov models, probabilistic graphical models, etc.).

APl 744 CSCCM@IITD 46


Random variables

• Define Ω to be a probability space equipped with a probability measure ℙ that
measures the probability of events. Ω contains all possible events in the form of
its own subsets.

• A real-valued random variable X is a mapping, X: Ω → ℝ.

• We call x = X(ω), ω ∈ Ω, a realization of X.

• Probability distribution of X: for a measurable set B ⊆ ℝ,
  μ_X(B) = ℙ(X(ω) ∈ B)

• Probability density:
  ℙ_X(B) = ∫_B p_X(x) dx

• Often we write p_X(x) = p(x).

APl 744 CSCCM@IITD 47


Cumulative distribution function
• A continuous random variable is defined by a probability density function (PDF),
p_X(x), where
  p_X(x) ≥ 0 and ∫_{−∞}^{∞} p_X(x) dx = 1

• The CDF of a random variable X is the function F(z) that returns the probability
that X is less than or equal to z.

• Cumulative distribution function:
  F(z) = ∫_{−∞}^{z} p_X(x) dx

• ℙ(x ∈ [a, b]) = ∫_{a}^{b} p_X(x) dx = F(b) − F(a)

APl 744 CSCCM@IITD 48


Mean and variance
• The expected (mean) value of X is
  E[X] = ∫_{−∞}^{∞} x p_X(x) dx

• The variance of X is
  var(X) = E[(X − E[X])²] = E[X²] − E[X]²

• The standard deviation is
  std(X) = √var(X)

• The k-th central moment is defined as
  E[(X − E[X])^k] = ∫_{−∞}^{∞} (x − E[X])^k p_X(x) dx

• If T(X) is a function of X, then
  E[T(X)] = ∫_{−∞}^{∞} T(x) p_X(x) dx
  var(T(X)) = ∫_{−∞}^{∞} (T(x) − E[T(X)])² p_X(x) dx

APl 744 CSCCM@IITD 49


Expectation

• Discrete:
  E[f(X)] = ∑_x f(x) p_X(x)

• Continuous:
  E[f(X)] = ∫_{−∞}^{∞} f(x) p_X(x) dx

• Conditional expectation:
  E[f(X) | Y = y] = ∑_x f(x) p(x | y)

• The expectation of a random variable is not necessarily the value that we should
expect a realization to have.

APl 744 CSCCM@IITD 50


Expectation

• Consider the example of throwing a die, Ω = {1, 2, 3, 4, 5, 6}.

• Define a random variable X that indicates the outcome of a die throw.

• We know that p(x) = 1/6 for each outcome.

• The expectation of X is
  E[X] = ∑_x x p(x) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5

• Clearly, 3.5 is not an outcome (realization) of a dice throw.

APl 744 CSCCM@IITD 51


Expectation

• Let Ω = [−1, 1] with a uniform probability density,
  p(ω) = 1/2

• Consider the two random variables
  X₁: [−1, 1] → ℝ, X₁(ω) = 1, ∀ω
  X₂: [−1, 1] → ℝ, X₂(ω) = 2 if ω ≥ 0, and 0 if ω < 0

• The expectations of the two variables are given as
  E[X₁] = ∫_{−1}^{1} X₁(ω) (1/2) dω = ∫_{−1}^{1} 1 · (1/2) dω = 1
  E[X₂] = ∫_{−1}^{1} X₂(ω) (1/2) dω = ∫_{−1}^{0} 0 · (1/2) dω + ∫_{0}^{1} 2 · (1/2) dω = 1

• Clearly, 𝑋2 can never take the value 1.

APl 744 CSCCM@IITD 52


Uniform random variable

• Probability density function:
  𝒰(x | a, b) = (1/(b − a)) 𝕀(a ≤ x ≤ b)

• What is the CDF of the uniform random variable 𝒰(0, 1)?
  F_U(x) = 0 for x < 0,  x for 0 ≤ x ≤ 1,  1 for x > 1

• Mean of 𝒰(x | a, b):
  E[x] = (a + b)/2

• Note that it is possible to have p(x) > 1, although the density must integrate to 1. For
example,
  𝒰(x | 0, 1/2) = 2, ∀x ∈ [0, 1/2]

APl 744 CSCCM@IITD 53


Gaussian random variable

• A random variable X ∈ ℝ is Gaussian or normally distributed, X ∼ 𝒩(μ, σ²), if
  ℙ(X ≤ t) = (1/(σ√(2π))) ∫_{−∞}^{t} exp(−(x − μ)²/(2σ²)) dx

• The PDF of the Gaussian distribution is
  𝒩(x | μ, σ²) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))

• We often work with the precision of a Gaussian, 𝜆 = 1/𝜎 2 . The higher the 𝜆, the
narrower the distribution is.

• 𝜇 and 𝜎 are the mean and standard deviation of Gaussian distribution.

APl 744 CSCCM@IITD 54


Gaussian random variable

Proof: the PDF of the Gaussian distribution is normalized.

• Let I ≡ ∫_{−∞}^{∞} exp(−(x − μ)²/(2σ²)) dx.

• Then I² = ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x − μ)²/(2σ²)) exp(−(y − μ)²/(2σ²)) dx dy.

• Set r² = (x − μ)² + (y − μ)² and transform to polar coordinates:
  I² = ∫_{0}^{2π} ∫_{0}^{∞} exp(−r²/(2σ²)) r dr dθ = 2π ∫_{0}^{∞} exp(−r²/(2σ²)) r dr = 2πσ²
  so I = σ√(2π), which is exactly the normalizing constant of the Gaussian PDF.

Some useful integrals for derivations involving the Gaussian distribution (used in the
derivation of the mean, standard deviation, etc.):

  ∫_{−∞}^{∞} exp(−u²) du = √π
  ∫_{−∞}^{∞} u exp(−u²) du = 0
  ∫_{−∞}^{∞} u² exp(−u²) du = √π / 2
APl 744 CSCCM@IITD 55
Gaussian random variable
Plot of the standard Normal 𝒩(0, 1) and its CDF.

Run the MatLab function gaussPlotDemo from Kevin Murphy’s PMTK.

APl 744 CSCCM@IITD 56


Gaussian random variable
• The Gaussian distribution is one of the most studied and most used distributions.

• Sample X₁, … , X_N from 𝒩(μ, σ²).

• In a typical inference problem, we are interested in


• Estimation of 𝜇 and 𝜎
• Confidence intervals of 𝜇 and 𝜎.

• Normaldata: relative changes in reported larcenies between 1991 and 1995
(relative to 1991) for the 90 most populous US counties (FBI data).

MatLab implementation

From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2 (available online)

APl 744 CSCCM@IITD 57


Datasets: CMBData

CMBdata: spectral representation of the cosmological microwave background
(CMB), i.e., electromagnetic radiation from photons dating back to 300,000 years after the
Big Bang, expressed as the difference in apparent temperature from the mean
temperature.

CMBdata Normal estimation

MatLab implementation

From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2

APl 744 CSCCM@IITD 58


Univariate Gaussian
• Representation of symmetric phenomena without
long tails.

• Inappropriate for skewness, fat tails, multi-modality,
etc.

• However, the Gaussian distribution is the most popular
distribution for the following reasons:

  • It is completely defined in terms of its mean and standard deviation.

  • The central limit theorem shows that the sum of i.i.d. random variables has
approximately a Gaussian distribution, making it an appropriate choice for modelling
noise (the limit of many small additive effects).

  • The Gaussian distribution makes the fewest assumptions (maximum entropy) among all
possible distributions with a given mean and variance.

  • It admits closed-form solutions and has interesting properties that we will encounter later.

APl 744 CSCCM@IITD 59


Binary variable

• Consider a coin-flipping experiment with heads = 1 and tails = 0, with μ ∈ [0, 1]:
  ℙ(x = 1 | μ) = μ
  ℙ(x = 0 | μ) = 1 − μ

• This defines the Bernoulli distribution as
  Bern(x | μ) = μ^x (1 − μ)^{1−x} = μ^{𝕀(x=1)} (1 − μ)^{𝕀(x=0)}

• For the Bernoulli distribution,
  E[x] = μ,  var(x) = μ(1 − μ)

• Likelihood: consider 𝒟 = {x₁, … , x_N} in which we have m heads (x = 1) and N −
m tails (x = 0). The task is to estimate the parameter μ.

• To solve such a problem, we first need to define the data likelihood, ℙ(𝒟 | μ).

APl 744 CSCCM@IITD 60


Likelihood function
• The likelihood function (often simply called the likelihood) measures the goodness
of fit of a statistical model to a sample of data for given values of the
unknown parameters.

• It is formed from the joint probability distribution of the sample, but viewed and
used as a function of the parameters only, thus treating the random variables as fixed
at the observed values

• For the coin-tossing example,

  ℙ(𝒟 | μ) = ∏_{i=1}^{N} ℙ(x_i | μ) = ∏_{i=1}^{N} μ^{x_i} (1 − μ)^{1−x_i} = μ^m (1 − μ)^{N−m}

• Once the likelihood is formed, there are three ways to compute μ:

  • Maximize ℙ(𝒟 | μ) (maximum likelihood estimation, MLE).
  • Compute the posterior, ℙ(μ | 𝒟) = ℙ(μ) ℙ(𝒟 | μ) / ℙ(𝒟) ∝ ℙ(μ) ℙ(𝒟 | μ) (Bayesian way).
  • μ∗ = argmax_μ ℙ(μ | 𝒟) (MAP estimate).
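A small numerical sketch of the three estimators for the coin example. The data (N = 10, m = 7) and the Beta(2, 2) prior are assumptions made only for this illustration; the Beta prior is used because it is conjugate to the Bernoulli likelihood:

```python
from scipy import stats

N, m = 10, 7                                     # assumed data: 7 heads in 10 tosses
mu_mle = m / N                                   # maximizes mu^m (1 - mu)^(N - m)

a, b = 2, 2                                      # assumed Beta(a, b) prior on mu
posterior = stats.beta(a + m, b + N - m)         # Bayesian way: full posterior P(mu | D)
mu_map = (a + m - 1) / (a + b + N - 2)           # mode of the Beta posterior (MAP)

print(mu_mle, mu_map, posterior.mean())          # 0.70, 0.667, 0.643 (approximately)
```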

APl 744 CSCCM@IITD 61




Binomial distribution
• Consider a discrete random variable X ∈ {0, 1, … , N}.

• The Binomial distribution is given as
  Bin(x | N, μ) = (N choose x) μ^x (1 − μ)^{N−x}
  (Figure: Bin(N = 10, μ = 0.25); Matlab code.)

• In the coin-flipping experiment, it gives the
probability that in N flips we get x heads when the
probability of getting a head is μ.

• For this distribution,
  E[X] = Nμ,  var(X) = Nμ(1 − μ)

• It can be shown that in the limiting condition N → ∞ with Nμ → λ, the Binomial distribution
converges to the Poisson distribution.
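A quick check of these properties with SciPy (the large N in the Poisson comparison is an arbitrary choice to illustrate the limit):

```python
from scipy import stats

N, mu = 10, 0.25
coin = stats.binom(n=N, p=mu)             # Bin(x | N, mu)
print(coin.pmf(3))                        # probability of exactly 3 heads in 10 flips
print(coin.mean(), coin.var())            # N*mu = 2.5 and N*mu*(1 - mu) = 1.875

lam = 2.5                                 # Poisson limit: N -> infinity with N*mu = lam
print(stats.binom(n=100_000, p=lam / 100_000).pmf(3), stats.poisson(lam).pmf(3))
```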

For more details: S. Ross, Introduction to Probability Models

APl 744 CSCCM@IITD 63


Binomial distribution

• The Binomial distribution for N = 10 and μ = 0.25, 0.9 is shown below, using the
MatLab function binomDistPlot from Kevin Murphy’s PMTK.

APl 744 CSCCM@IITD 64


Generalization of Bernoulli’s distribution

• We are now looking at discrete variables that can take on one of K possible
mutually exclusive states.

• The variable is represented by a K-dimensional vector x in which one of the
elements x_k = 1 and all remaining elements are zero, x = (0, 0, … , 1, 0, … , 0)ᵀ.

• Let the probability of x_k = 1 be denoted by μ_k; then
  p(x | μ) = ∏_{k=1}^{K} μ_k^{x_k} = ∏_{k=1}^{K} μ_k^{𝕀(x_k = 1)},  with ∑_{k=1}^{K} μ_k = 1 and μ_k ≥ 0, ∀k

• The mean of the distribution is E[x | μ] = μ.

• This is known as the Multinoulli distribution or the categorical distribution:
  Cat(x | μ) = Multinoulli(x | μ) = Mu(x | 1, μ)

• Mu(⋅) indicates the multinomial distribution; the 1 indicates that the die is rolled only
once.

APl 744 CSCCM@IITD 65


Likelihood: Multinoulli Distribution

• Let us consider a dataset 𝒟 = {x₁, … , x_N}. The objective is to compute the
parameters μ:

  ℙ(𝒟 | μ) = ∏_{i=1}^{N} ℙ(x_i | μ) = ∏_{i=1}^{N} ∏_{k=1}^{K} μ_k^{x_ik} = ∏_{k=1}^{K} μ_k^{∑_{i=1}^{N} x_ik} = ∏_{k=1}^{K} μ_k^{m_k}

• m_k = ∑_{i=1}^{N} x_ik is known as the sufficient statistic of the distribution and is the
number of observations with x_k = 1.

• MLE estimate of μ:
  μ∗ = argmax_μ log ℙ(𝒟 | μ)  subject to  ∑_{k=1}^{K} μ_k = 1

• This yields μ_k = m_k / N.
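A NumPy sketch of this count-based MLE on simulated one-hot data (the true μ below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 1000
mu_true = np.array([0.1, 0.2, 0.3, 0.4])

draws = rng.choice(K, size=N, p=mu_true)   # categorical draws
X = np.eye(K)[draws]                       # one-hot encoded Multinoulli samples x_i

m_k = X.sum(axis=0)                        # sufficient statistics: counts of x_k = 1
mu_mle = m_k / N                           # constrained MLE: mu_k = m_k / N
print(mu_mle)                              # close to mu_true for large N
```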

APl 744 CSCCM@IITD 66


Multinomial distribution

• Generalizing the Multinoulli distribution to N trials, we have

  ℙ(m₁, … , m_K; N, μ₁, … , μ_K) = (N! / (m₁! ⋯ m_K!)) μ₁^{m₁} ⋯ μ_K^{m_K}
  where ∑_i m_i = N

• A summary of different distributions is provided below (taken from K. Murphy’s
book).

APl 744 CSCCM@IITD 67


Student’s t distribution
  p(x | μ, λ, ν) = (Γ(ν/2 + 1/2) / Γ(ν/2)) (λ/(πν))^{1/2} [1 + λ(x − μ)²/ν]^{−ν/2 − 1/2}

• The parameter λ is known as the precision of the t-distribution, even though it is not
in general equal to the inverse of the variance.

• The parameter ν is called the degrees of freedom.

• For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution.

• In the limit ν → ∞, the t-distribution 𝒯(x | μ, λ, ν) reduces to a Normal distribution with
mean μ and precision λ, 𝒩(x | μ, λ⁻¹).

• Proof:
  p(x | μ, λ, ν) ∝ [1 + λ(x − μ)²/ν]^{−ν/2 − 1/2} = exp(−((ν + 1)/2) log[1 + λ(x − μ)²/ν])

APl 744 CSCCM@IITD 68


Student’s t distribution
  p(x | μ, λ, ν) ∝ [1 + λ(x − μ)²/ν]^{−ν/2 − 1/2} = exp(−((ν + 1)/2) log[1 + λ(x − μ)²/ν])

• Using the Taylor series, we know log(1 + x) = x − x²/2 + x³/3 − ⋯ ≈ x (for small x).

• Substituting this into the PDF,

  p(x | μ, λ, ν) ∝ exp(−((ν + 1)/2) [λ(x − μ)²/ν + 𝒪(ν⁻²)]) = exp(−λ(x − μ)²/2 + 𝒪(ν⁻¹))

• This proves that in the limiting condition ν → ∞, 𝒯(x | μ, λ, ν) reduces to
𝒩(x | μ, λ⁻¹).

APl 744 CSCCM@IITD 69


Student’s t distribution

MatLab code

Mean: μ, for ν > 1
Mode: μ
Variance: ν / (λ(ν − 2)) for ν > 2;  ∞ for 1 < ν ≤ 2;  undefined otherwise

APl 744 CSCCM@IITD 70


Student’s t distribution is mixture of Gaussians

• If we have a univariate Gaussian 𝒩(x | μ, τ⁻¹) together with a prior
Gamma(τ | a, b) on the precision, we have

  p(x | μ, a, b) = ∫ p(x | μ, τ) p(τ | a, b) dτ = ∫_{0}^{∞} 𝒩(x | μ, τ⁻¹) Gamma(τ | a, b) dτ

• Writing out the expressions of the two distributions, we have

  p(x | μ, a, b) = ∫_{0}^{∞} (τ/(2π))^{1/2} exp(−τ(x − μ)²/2) · (bᵃ/Γ(a)) τ^{a−1} exp(−bτ) dτ

• Consider the substitution z = Aτ, where A = b + (x − μ)²/2. Then

  p(x | μ, a, b) = (bᵃ/Γ(a)) (1/(2π))^{1/2} ∫_{0}^{∞} τ^{a − 1/2} exp(−Aτ) dτ
               = (bᵃ/Γ(a)) (1/(2π))^{1/2} A^{−(a + 1/2)} ∫_{0}^{∞} z^{a − 1/2} exp(−z) dz

APl 744 CSCCM@IITD 71


Student’s t distribution is mixture of Gaussians
  p(x | μ, a, b) = (bᵃ/Γ(a)) (1/(2π))^{1/2} A^{−(a + 1/2)} ∫_{0}^{∞} z^{a − 1/2} exp(−z) dz

• On simplification (with A = b + (x − μ)²/2),

  p(x | μ, a, b) = (bᵃ/Γ(a)) (1/(2π))^{1/2} [b + (x − μ)²/2]^{−(a + 1/2)} ∫_{0}^{∞} exp(−z) z^{(a + 1/2) − 1} dz

• By definition, Γ(a) = ∫_{0}^{∞} exp(−z) z^{a−1} dz. Therefore,

  p(x | μ, a, b) = (bᵃ/Γ(a)) (1/(2π))^{1/2} [b + (x − μ)²/2]^{−(a + 1/2)} Γ(a + 1/2)

• Finally, redefining ν = 2a and λ = a/b, we have

  p(x | μ, λ, ν) = (Γ(ν/2 + 1/2) / Γ(ν/2)) (λ/(πν))^{1/2} [1 + λ(x − μ)²/ν]^{−ν/2 − 1/2}

• Student’s t-distribution is therefore an infinite mixture of Gaussians.

APl 744 CSCCM@IITD 72


Robustness of student’s t distribution
• The robustness of the t-distribution is illustrated by comparing the maximum
likelihood solutions for a Gaussian and a t-distribution (30 samples from a
Gaussian are used).

• The effect of a small number of outliers (Fig. on right) is less significant for the t-
distribution than for the Gaussian.

Matlab Code
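A Python sketch of the same comparison using SciPy's maximum-likelihood fitters (the synthetic data and the outlier values are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=30)         # 30 samples from a Gaussian
data = np.concatenate([clean, [8.0, 9.0, 10.0]])        # add a few outliers

mu_gauss, sigma_gauss = stats.norm.fit(data)            # ML fit of a Gaussian
nu_t, mu_t, scale_t = stats.t.fit(data)                 # ML fit of a Student's t

print(mu_gauss)   # dragged noticeably toward the outliers
print(mu_t)       # stays close to the bulk of the data
```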

APl 744 CSCCM@IITD 73


Other distributions

• There are several other distributions:


• Laplace distribution
• Beta distribution
• Gamma distribution
• Exponential distribution
• Chi-squared distribution
• Inverse Gamma distribution
• The Pareto distribution

• The operations are the same as any other probability density functions.

• You can expect questions on some of these distributions in the HW.

APl 744 CSCCM@IITD 74


Covariance

• Covariance:
  cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• It expresses the extent to which X and Y vary (linearly) together.
• If X and Y are independent, P(X, Y) = P(X) P(Y) and cov(X, Y) = 0; however, the
converse need not be true.
• X and Y are said to be orthogonal if E[XY] = 0.
• The correlation reflects the noisiness and direction of a linear relationship (top
row), but not the slope of that relationship (middle), nor nonlinear relationships
(bottom).
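A NumPy illustration of these points on simulated data (the linear and quadratic relationships below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 2.0 * x + 0.5 * rng.standard_normal(100_000)   # linearly related to x, plus noise

print(np.cov(x, y)[0, 1])          # sample covariance, close to 2.0
print(np.corrcoef(x, y)[0, 1])     # correlation close to 1, regardless of the slope's size

y2 = x ** 2                        # strongly dependent on x, but not linearly
print(np.corrcoef(x, y2)[0, 1])    # close to 0: zero correlation does not imply independence
```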

APl 744 CSCCM@IITD 75


Multi-variate random variable

• Consider X = (X₁, X₂, … , X_N)ᵀ ∈ ℝᴺ, where each component X_i is an ℝ-valued function.

• X is defined by the joint probability density of its components,
  p_X: ℝᴺ → ℝ₊

• The cumulative distribution function is defined as
  F(x₁, x₂, … , x_N) = ℙ(X₁ ≤ x₁, X₂ ≤ x₂, … , X_N ≤ x_N) ∈ [0, 1]

• The probability density function of X is defined as
  p_X(x₁, x₂, … , x_N) = ∂ᴺ F(x) / (∂x₁ ∂x₂ ⋯ ∂x_N)
  and ∫ p_X(x) dx = 1

• The expectation is defined as
  E[X] = ∫ x p_X(x) dx ∈ ℝᴺ

APl 744 CSCCM@IITD 76


Multi-variate random variable

• The covariance matrix is given as
  cov(X) = ∫ (x − E[X])(x − E[X])ᵀ p_X(x) dx ∈ ℝᴺˣᴺ

• The covariance matrix is symmetric and positive semi-definite.

• The diagonal of the covariance matrix gives the variances of the individual
components:

  cov(X) = [ var(X₁)        cov(X₁, X₂)   ⋯  cov(X₁, X_N)
             cov(X₂, X₁)    var(X₂)       ⋯  cov(X₂, X_N)
             ⋮              ⋮             ⋱  ⋮
             cov(X_N, X₁)   cov(X_N, X₂)  ⋯  var(X_N) ]

• A normalized version of this is the correlation matrix where all elements are
between −1,1 (diagonal elements = 1).

APl 744 CSCCM@IITD 77


Multi-variate random variable

• A normalized version of this is the correlation matrix where all elements are
between −1,1 (diagonal elements = 1).

  R = [ 1               corr(X₁, X₂)   ⋯  corr(X₁, X_N)
        corr(X₂, X₁)    1              ⋯  corr(X₂, X_N)
        ⋮               ⋮              ⋱  ⋮
        corr(X_N, X₁)   corr(X_N, X₂)  ⋯  1 ]

APl 744 CSCCM@IITD 78


Multivariate Gaussian

• A multivariate random variable X ∈ ℝᴺ is Gaussian if its probability density is

  p(x) = (1 / ((2π)^{N/2} (det Σ)^{1/2})) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

  where μ ∈ ℝᴺ is the mean vector and Σ ∈ ℝᴺˣᴺ is the covariance matrix.

• The Gaussian family is closed under linear transformations. Suppose two
independent variables X₁ and X₂ are defined as follows:
  X₁ ∼ 𝒩(μ₁, Σ₁),  X₂ ∼ 𝒩(μ₂, Σ₂)
  Then A X₁ + B X₂ + c ∼ 𝒩(A μ₁ + B μ₂ + c, A Σ₁ Aᵀ + B Σ₂ Bᵀ)
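A sampling check of this property with NumPy (the particular means, covariances, A, B and c below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, cov1 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu2, cov2 = np.array([2.0, -1.0]), np.array([[0.5, 0.0], [0.0, 0.5]])
A, B, c = np.array([[1.0, 2.0], [0.0, 1.0]]), np.eye(2), np.array([1.0, 1.0])

x1 = rng.multivariate_normal(mu1, cov1, size=200_000)
x2 = rng.multivariate_normal(mu2, cov2, size=200_000)
y = x1 @ A.T + x2 @ B.T + c                            # samples of A X1 + B X2 + c

print(y.mean(axis=0), A @ mu1 + B @ mu2 + c)           # empirical vs theoretical mean
print(np.cov(y.T), A @ cov1 @ A.T + B @ cov2 @ B.T)    # empirical vs theoretical covariance
```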

APl 744 CSCCM@IITD 79


Multivariate Gaussian

MATLAB code

APl 744 CSCCM@IITD 80


Transformations of probability distributions

• A probability density transforms differently from functions.

• Let x = g(y); then
  p_Y(y) = p_X(g(y)) |dx/dy| = p_X(g(y)) |g′(y)|

• This is derived from the observation that
  p_Y(y) dy = p_X(x) dx

• An example:
  Gamma(x | a, b) = (bᵃ/Γ(a)) x^{a−1} exp(−bx)
  Define Y = 1/X, so x = 1/y and |dx/dy| = 1/y². Then
  p_Y(y) = (bᵃ/Γ(a)) y^{−(a−1)} exp(−b/y) · (1/y²)
  p_Y(y) = (bᵃ/Γ(a)) y^{−(a+1)} exp(−b/y)
Γ 𝑎
→ This is the inverse Gamma distribution.

APl 744 CSCCM@IITD 81


Multivariate change of variable

• For a multivariate distribution, the mapping is in terms of the Jacobian:

  p_Y(y) = p_X(x) |det(∂x/∂y)|
  where
  ∂y/∂x = [ ∂y₁/∂x₁   ⋯  ∂y₁/∂x_N
            ⋮         ⋱  ⋮
            ∂y_N/∂x₁  ⋯  ∂y_N/∂x_N ]

• Multivariate Student’s t-distribution:
  𝒯(x | μ, Λ, ν) = (Γ(D/2 + ν/2) / Γ(ν/2)) ((det Λ)^{1/2} / (πν)^{D/2}) [1 + Δ²/ν]^{−D/2 − ν/2}
  Δ² = (x − μ)ᵀ Λ (x − μ)  (Mahalanobis distance)

• Dirichlet distribution:
  p(μ | α) ∝ ∏_k μ_k^{α_k − 1},  ∑_k μ_k = 1,  0 ≤ μ_k ≤ 1

APl 744 CSCCM@IITD 82


The law of large numbers
• Let X_i, i = 1, 2, … , N be independent and identically distributed variables with finite
mean E[X_i] = μ and variance var(X_i) = σ².

• We define X̄_N as
  X̄_N = ∑_{i=1}^{N} X_i / N

• Taking the expectation and variance of both sides, we see
  E[X̄_N] = ∑_{i=1}^{N} E[X_i] / N = μ
  Var[X̄_N] = ∑_{i=1}^{N} var(X_i) / N² = σ² / N

• Weak LLN: lim_{N→∞} ℙ(|X̄_N − μ| ≥ ε) = 0, ∀ε > 0

• Strong LLN: lim_{N→∞} X̄_N = μ, almost surely
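A quick numerical illustration of the law of large numbers using fair-die rolls (μ = 3.5):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.integers(1, 7, size=100_000)        # i.i.d. die rolls, E[X_i] = 3.5

running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)
for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])                 # the sample mean approaches 3.5 as N grows
```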

For more details: click here

APl 744 CSCCM@IITD 83


Statistical Inference: Parametric & Non-Parametric Approach

• Assume we have a set of observations,
  𝒮 = {x₁, x₂, … , x_N}, x_j ∈ ℝᴺ

• The problem is to infer the underlying probability distribution that gives rise to the
data 𝒮.

• Parametric model: assume a model and then try to infer its parameters (e.g., fitting
a normal distribution).

• Non-parametric model: no analytical expression for the probability density is
available. The description consists of defining the dependence or independence structure of the
data. This leads to numerical exploration.

APl 744 CSCCM@IITD 84


The Central Limit Theorem

• Let X_i, i = 1, 2, … , N be independent and identically distributed variables with finite
mean E[X_i] = μ and variance var(X_i) = σ².

• We define a variable z_N such that

  z_N = (1/(σ√N)) (X₁ + ⋯ + X_N − Nμ) = (X̄_N − μ) / (σ/√N)

• As N → ∞, the distribution of z_N converges to the distribution of a standard normal
variable, and X̄_N ∼ 𝒩(μ, σ²/N) as N → ∞.

• This justifies the reason for assuming the noise to be Gaussian.
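A short simulation of the CLT with uniform variables (N = 50 and the number of trials are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, trials = 50, 20_000

xbar = rng.random((trials, N)).mean(axis=1)       # means of N i.i.d. Uniform(0, 1) variables
z = (xbar - 0.5) / np.sqrt((1 / 12) / N)          # standardize using mu = 0.5, sigma^2 = 1/12

print(np.quantile(z, [0.05, 0.5, 0.95]))          # close to the standard normal quantiles
print(stats.norm.ppf([0.05, 0.5, 0.95]))
```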

APl 744 CSCCM@IITD 85


The Central Limit Theorem

Matlab code

APl 744 CSCCM@IITD 86


Computing 𝜋 using MC

  I = ∫_{−r}^{r} ∫_{−r}^{r} 𝕀(x² + y² ≤ r²) dx dy = πr²

  π = (1/r²) · 4r² ∫_{−r}^{r} ∫_{−r}^{r} 𝕀(x² + y² ≤ r²) p_X(x) p_Y(y) dx dy
    ≈ 4 · (1/N) ∑_{i=1}^{N} 𝕀(x_i² + y_i² ≤ r²)

  where p_X and p_Y are uniform densities on [−r, r] and (x_i, y_i) are samples drawn from them.

Run mcEstimatePi from PMTK
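A Python equivalent of the PMTK demo, a minimal sketch assuming r = 1 and N = 10^6 samples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 1_000_000, 1.0

x = rng.uniform(-r, r, N)                 # samples from p_X
y = rng.uniform(-r, r, N)                 # samples from p_Y
inside = (x**2 + y**2) <= r**2            # indicator of landing inside the circle

print(4.0 * inside.mean())                # approximately pi, with O(1/sqrt(N)) error
```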

APl 744 CSCCM@IITD 87


Kullback-Leibler Divergence and cross-entropy

• Consider some unknown distribution p(x) and suppose we have modelled this using
an approximate distribution q(x).

• The KL divergence between p and q is given as

  KL(p‖q) = −∫ p(x) log(q(x)/p(x)) dx
          = −∫ p(x) log q(x) dx + ∫ p(x) log p(x) dx

• The cross-entropy between p(x) and q(x) is given as

  ℍ(p, q) = −∫ p(x) log q(x) dx

• Two important properties of the KL divergence are as follows:

  • KL(p‖q) ≥ 0
  • KL(p‖q) ≠ KL(q‖p) in general
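A small discrete example of these quantities (the two distributions p and q are arbitrary choices for illustration):

```python
import numpy as np

p = np.array([0.4, 0.4, 0.2])                 # "true" distribution p(x)
q = np.array([0.6, 0.2, 0.2])                 # approximating distribution q(x)

kl_pq = np.sum(p * np.log(p / q))             # KL(p || q)
kl_qp = np.sum(q * np.log(q / p))             # KL(q || p)
cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
entropy_p = -np.sum(p * np.log(p))            # H(p)

print(kl_pq, kl_qp)                           # both non-negative, and not equal
print(cross_entropy, entropy_p + kl_pq)       # H(p, q) = H(p) + KL(p || q)
```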

APl 744 CSCCM@IITD 88


Jensen’s inequality

• For a convex function f, we have

  f(∑_{i=1}^{M} λ_i x_i) ≤ ∑_{i=1}^{M} λ_i f(x_i),  λ_i ≥ 0 and ∑_i λ_i = 1

• In the context of probability, we thus have

  f(∫ x p(x) dx) ≤ ∫ f(x) p(x) dx

• Let us consider the KL divergence from before (applying Jensen’s inequality to the convex function −log):

  KL(p‖q) = −∫ p(x) log(q(x)/p(x)) dx ≥ −log ∫ p(x) (q(x)/p(x)) dx = −log ∫ q(x) dx = 0

• This proves that 𝐾𝐿 ≥ 0.

APl 744 CSCCM@IITD 89


KL divergence vs. MLE estimate

• Suppose we have data x_n ∼ p(x), n = 1, … , N, from an unknown distribution p(x)
that we try to approximate with a parametric model q(x | θ). Then

  KL(p‖q) = −∫ p(x) log(q(x | θ)/p(x)) dx ≈ (1/N) ∑_{n=1}^{N} [−log q(x_n | θ) + log p(x_n)]

• Note that only the first term depends on q. Therefore, minimizing KL(p‖q), with the
expectation approximated using the empirical distribution of the data, is equivalent to maximizing
the data likelihood under the model q(x | θ).

APl 744 CSCCM@IITD 90


Conditional entropy

• For a joint distribution, the conditional entropy is

  ℍ[y | x] = −∫∫ p(y, x) log p(y | x) dx dy

• Using p(x, y) = p(x) p(y | x), we have
  ℍ[y | x] = −∫∫ p(y, x) log (p(x, y)/p(x)) dx dy

• This can be rewritten as

  ℍ[y | x] = −∫∫ p(y, x) log p(x, y) dx dy + ∫∫ p(y, x) log p(x) dx dy

  ℍ[y | x] = −∫∫ p(y, x) log p(x, y) dx dy + ∫ log p(x) (∫ p(y, x) dy) dx

  The first term is ℍ[x, y] and, since ∫ p(y, x) dy = p(x), the second term is −ℍ[x].

• Therefore,
  ℍ[x, y] = ℍ[x] + ℍ[y | x]

APl 744 CSCCM@IITD 91


Summary

• Product rule and sum rule:

  ℙ(X = x_i) = ∑_j ℙ(X = x_i, Y = y_j),   ℙ(x) = ∫ ℙ(x, y) dy

  ℙ(X = x_i, Y = y_j) = ℙ(Y = y_j | X = x_i) ℙ(X = x_i)

• Prior, posterior and likelihood:

  p(θ | 𝒟) ∝ p(θ) p(𝒟 | θ)   (posterior ∝ prior × likelihood)

• Law of large numbers, Monte Carlo integration:

  ∫_{−a}^{a} f(x) p_X(x) dx ≈ (1/N) ∑_{i=1}^{N} f(x_i),  x_i ∼ p_X

• Central limit theorem

APl 744 CSCCM@IITD 92


Summary

• Different PDFs and PMFs, independence and its importance

• KL divergence:
  KL(p‖q) = −∫ p(x) log(q(x | θ)/p(x)) dx

• MLE, MAP and the Bayesian approach:

  • Maximize ℙ(𝒟 | μ) (MLE).
  • Compute the posterior, ℙ(μ | 𝒟) = ℙ(μ) ℙ(𝒟 | μ) / ℙ(𝒟) ∝ ℙ(μ) ℙ(𝒟 | μ) (Bayesian way).
  • μ∗ = argmax_μ ℙ(μ | 𝒟) (MAP estimate).

• Jensen’s inequality:

  f(∑_{i=1}^{M} λ_i x_i) ≤ ∑_{i=1}^{M} λ_i f(x_i),  λ_i ≥ 0 and ∑_i λ_i = 1

APl 744 CSCCM@IITD 93
