Introduction To Probabilistic Learning
Maneesh Sahani
[email protected]
▶ Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions:

  x1 : a1 → r1 , x2 : a2 → r2 , x3 : a3 → r3 , . . .

  Find a policy for action choice that maximises payoff.

Unsupervised learning has many uses:
▶ structure discovery, science
▶ data compression
▶ outlier detection
▶ input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
▶ a theory of biological learning and perception
Supervised learning

Two main examples:

▶ Classification: discrete (class label) outputs.
▶ Regression: continuous-valued outputs.

[Figure: left panel — classification, two classes of points marked x and o in the input plane; right panel — regression, y plotted against x with a fitted curve.]

A probabilistic approach

Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:

P(data | parameters) : P(x|θ) or P(y|x, θ)

▶ The probabilistic model can be used to
  ▶ make inferences about missing inputs
  ▶ generate predictions/fantasies/imagery
  ▶ make predictions or decisions which minimise expected loss
Bayesian learning: A coin toss example

The probability of heads is an unknown parameter q: p(H|q) = q.

[Figure: two candidate prior densities P(q) over q ∈ [0, 1]. A: α1 = α2 = 1.0; B: α1 = α2 = 4.0.]

Both prior beliefs can be described by the Beta distribution:

p(q|α1, α2) = q^(α1−1) (1 − q)^(α2−1) / B(α1, α2) = Beta(q|α1, α2)

Bayesian learning: The coin toss (cont)

Using Bayes' rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:

p(q|H) = p(q) p(H|q) / p(H)
       ∝ q Beta(q|α1, α2)
       ∝ q^(α1) (1 − q)^(α2−1) = Beta(q|α1 + 1, α2)

[Figure: the corresponding posteriors P(q) after observing a single head, for priors A and B.]
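This conjugate update is easy to check numerically. A minimal sketch using scipy.stats (scipy is an assumption, not part of the notes; the prior parameters are those of example B):

    # Posterior after a single head: Beta(a1, a2) -> Beta(a1 + 1, a2).
    from scipy.stats import beta

    a1, a2 = 4.0, 4.0               # prior B: Beta(q | 4, 4)
    post1, post2 = a1 + 1.0, a2     # observing H multiplies by q, incrementing a1

    print(beta.mean(a1, a2))        # prior mean E[q] = 0.5
    print(beta.mean(post1, post2))  # posterior mean E[q|H] = 5/9 ≈ 0.556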
Bayesian learning: The coin toss (cont)

What about multiple tosses? Suppose we observe D = { H H T H T T }:

p({ H H T H T T }|q) = q q (1 − q) q (1 − q)(1 − q) = q^3 (1 − q)^3

This is still straightforward:

p(q|D) = p(q) p(D|q) / p(D) ∝ q^3 (1 − q)^3 Beta(q|α1, α2) ∝ Beta(q|α1 + 3, α2 + 3)

[Figure: priors A (α1 = α2 = 1.0) and B (α1 = α2 = 4.0) in the top row, with the corresponding posteriors after observing D in the bottom row.]
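Applied toss by toss, the same update gives this posterior in a single pass over the data. A minimal sketch in plain Python (the function name is illustrative):

    # Each H increments alpha1, each T increments alpha2; the toss order is irrelevant.
    def beta_posterior(a1, a2, tosses):
        for t in tosses:
            if t == 'H':
                a1 += 1.0
            else:
                a2 += 1.0
        return a1, a2

    print(beta_posterior(1.0, 1.0, 'HHTHTT'))   # prior A -> Beta(4, 4)
    print(beta_posterior(4.0, 4.0, 'HHTHTT'))   # prior B -> Beta(7, 7)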
Conjugate priors

Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential family likelihood. Exponential family distributions take the form:

P(x|θ) = f(x) g(θ) exp[ φ(θ)ᵀ T(x) ]

with T(x) the sufficient statistic; the corresponding conjugate prior is

P(θ) = F(τ, ν) g(θ)^ν exp[ φ(θ)ᵀ τ ]

The posterior given an exponential family likelihood and conjugate prior is:

P(θ|{xi}) = F( τ + ∑i T(xi), ν + n ) g(θ)^(ν+n) exp[ φ(θ)ᵀ ( τ + ∑i T(xi) ) ]

For the coin, p(x|q) = q^x (1 − q)^(1−x) = (1 − q) e^(x log(q/(1−q))), so g(q) = 1 − q, φ(q) = log(q/(1−q)) and T(x) = x. The conjugate prior is

P(q) = F(τ, ν) (1 − q)^ν e^(τ log(q/(1−q)))
     = F(τ, ν) (1 − q)^ν e^(τ log q − τ log(1−q))
     = F(τ, ν) (1 − q)^(ν−τ) q^τ

which has the form of the Beta distribution ⇒ F(τ, ν) = 1/B(τ + 1, ν − τ + 1).

In general, then, the posterior will be P(q|{xi}) = Beta(α1, α2), with

α1 = 1 + τ + ∑i xi        α2 = 1 + (ν + n) − (τ + ∑i xi)

If we observe a head, we add 1 to the sufficient statistic ∑i xi, and also 1 to the count n; this increments α1. If we observe a tail we add 1 to n, but not to ∑i xi, incrementing α2.
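In code, conjugate updating is just accumulation of the sufficient statistics. A minimal sketch in the (τ, ν) parameterisation above, run on the six tosses D = { H H T H T T } from earlier (function names are illustrative):

    # tau accumulates the sufficient statistic sum(x_i); nu counts the observations n.
    def update(tau, nu, xs):
        for x in xs:                # x = 1 for heads, 0 for tails
            tau += x
            nu += 1
        return tau, nu

    tau, nu = update(0.0, 0, [1, 1, 0, 1, 0, 0])   # tau = nu = 0 is the uniform prior A
    print(1 + tau, 1 + nu - tau)                   # Beta(4, 4), matching Beta(a1+3, a2+3)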
Bayesian learning: The coin toss (cont)

[Figure: two prior densities over the parameter q for competing models of the coin — p(q|fair) and p(q|bent).]

We make 10 tosses, and get: D = (T H T H T T T T T T). Thus, the posterior for the models, by Bayes' rule:

P(model|D) ∝ P(D|model) P(model)

Maximum a posteriori (MAP) learning instead reports the single most probable parameter value:

θMAP = argmaxθ P(θ|D) = argmaxθ P(θ) P(D|θ)

▶ Equivalent to minimising the 0/1 loss.
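As a concrete instance (a worked step, not from the slides): the mode of Beta(q|α1, α2) for α1, α2 > 1 is (α1 − 1)/(α1 + α2 − 2), so with a uniform prior the ten tosses above (2 heads, 8 tails) give a posterior Beta(3, 9) and

qMAP = (3 − 1)/(3 + 9 − 2) = 0.2

which here coincides with the maximum likelihood estimate 2/10, as it must under a flat prior.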
A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data. We can use a multivariate Gaussian model:

p(x|µ, Σ) = N(µ, Σ) = |2πΣ|^(−1/2) exp[ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ]

[Figure: scatter of two-dimensional data points xi = [xi1 . . . xiD], plotted as xi2 against xi1.]

▶ Assume data are i.i.d. (independent and identically distributed). Then the log likelihood is

ℓ = log ∏n p(xn|µ, Σ) = ∑n log p(xn|µ, Σ) = −(N/2) log |2πΣ| − ½ ∑n (xn − µ)ᵀ Σ⁻¹ (xn − µ)

Note: equivalently, minimise −ℓ, which is quadratic in µ.

Procedure: take derivatives and set to zero:

∂ℓ/∂µ = 0 ⇒ µ̂ = (1/N) ∑n xn (sample mean)
∂ℓ/∂Σ = 0 ⇒ Σ̂ = (1/N) ∑n (xn − µ̂)(xn − µ̂)ᵀ (sample covariance)
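A minimal numpy sketch of these estimates on synthetic data (numpy is an assumption; note np.cov divides by N − 1 unless bias=True):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))           # N = 1000 i.i.d. points, D = 2

    mu_hat = X.mean(axis=0)                  # sample mean
    Xc = X - mu_hat
    Sigma_hat = Xc.T @ Xc / len(X)           # sample covariance, 1/N normalisation

    assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))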
The derivatives can be worked out explicitly. For µ:

½ ∑n ∂/∂µ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = ½ ∑n ∂/∂µ [ xnᵀ Σ⁻¹ xn + µᵀ Σ⁻¹ µ − 2 µᵀ Σ⁻¹ xn ]
  = ½ ∑n [ ∂/∂µ (µᵀ Σ⁻¹ µ) − 2 ∂/∂µ (µᵀ Σ⁻¹ xn) ]
  = ½ ∑n [ 2 Σ⁻¹ µ − 2 Σ⁻¹ xn ]
  = N Σ⁻¹ µ − Σ⁻¹ ∑n xn = 0  ⇒  µ̂ = (1/N) ∑n xn

For Σ it is easier to differentiate with respect to Σ⁻¹, which requires a few matrix derivatives:

∂/∂Aij Tr[AᵀB] = ∂/∂Aij ∑n [AᵀB]nn = ∂/∂Aij ∑n ∑m Amn Bmn = Bij
  ⇒ ∂/∂A Tr[AᵀB] = B

∂/∂A Tr[AᵀBAC]: write Tr[AᵀBAC] = Tr[F1(A)ᵀ B F2(A) C] with F1 and F2 both identity maps. Then, by the chain rule and the identity above (using Tr[F1ᵀ B F2 C] = Tr[F2ᵀ Bᵀ F1 Cᵀ]),

∂/∂A Tr[F1ᵀ B F2 C] = ∂/∂F1 Tr[F1ᵀ (B F2 C)] · ∂F1/∂A + ∂/∂F2 Tr[F2ᵀ (Bᵀ F1 Cᵀ)] · ∂F2/∂A
  = B F2 C + Bᵀ F1 Cᵀ = BAC + Bᵀ A Cᵀ

∂/∂Aij log |A| = (1/|A|) ∂|A|/∂Aij = (1/|A|) ∂/∂Aij ∑k (−1)^(i+k) Aik |[A]ik| = (1/|A|) (−1)^(i+j) |[A]ij| = [A⁻¹]ji
  ⇒ ∂/∂A log |A| = (A⁻¹)ᵀ

where |[A]ik| is the minor of A obtained by deleting row i and column k.
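Each identity can be sanity-checked by finite differences; a minimal numpy sketch for the first one:

    import numpy as np

    rng = np.random.default_rng(1)
    A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

    # Numerical gradient of f(A) = Tr(A^T B) with respect to each entry A_ij.
    eps = 1e-6
    G = np.zeros_like(A)
    for i in range(3):
        for j in range(3):
            Ap = A.copy()
            Ap[i, j] += eps
            G[i, j] = (np.trace(Ap.T @ B) - np.trace(A.T @ B)) / eps

    assert np.allclose(G, B, atol=1e-4)      # matches d/dA Tr[A^T B] = B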
Gaussian Derivatives

∂(−ℓ)/∂Σ⁻¹ = ∂/∂Σ⁻¹ [ (N/2) log |2πΣ| + ½ ∑n (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = ∂/∂Σ⁻¹ [ (N/2) log |2πI| − (N/2) log |Σ⁻¹| ] + ½ ∑n ∂/∂Σ⁻¹ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = −(N/2) Σᵀ + ½ ∑n (xn − µ)(xn − µ)ᵀ

(using ∂/∂A log |A| = (A⁻¹)ᵀ and ∂/∂A (xᵀAx) = xxᵀ). Setting this to zero, with Σ symmetric and µ = µ̂, gives Σ̂ = (1/N) ∑n (xn − µ̂)(xn − µ̂)ᵀ.

Linear regression

A simple form of supervised (predictive) learning: model y as a linear function of x, with Gaussian noise:

p(y|x, W, Σy) = |2πΣy|^(−1/2) exp[ −½ (y − Wx)ᵀ Σy⁻¹ (y − Wx) ]

[Figure: scatter of (xi1, yi) pairs with the fitted linear relationship.]

▶ yi is conditionally independent of all else, given xi.

Differentiating −ℓ with respect to W:

∂(−ℓ)/∂W = ½ ∑i ∂/∂W [ (yi − Wxi)ᵀ Σy⁻¹ (yi − Wxi) ]
  = ½ ∑i ∂/∂W [ yiᵀ Σy⁻¹ yi + xiᵀ Wᵀ Σy⁻¹ W xi − 2 xiᵀ Wᵀ Σy⁻¹ yi ]
  = ½ ∑i [ ∂/∂W Tr[Wᵀ Σy⁻¹ W xi xiᵀ] − 2 ∂/∂W Tr[Wᵀ Σy⁻¹ yi xiᵀ] ]
  = ½ ∑i [ 2 Σy⁻¹ W xi xiᵀ − 2 Σy⁻¹ yi xiᵀ ]
  = 0  ⇒  Ŵ = ( ∑i yi xiᵀ ) ( ∑i xi xiᵀ )⁻¹
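A minimal numpy sketch of this ML solution on synthetic data (the true weights and noise level are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    N, D_in, D_out = 500, 3, 2
    W_true = rng.normal(size=(D_out, D_in))               # illustrative ground truth
    X = rng.normal(size=(N, D_in))
    Y = X @ W_true.T + 0.1 * rng.normal(size=(N, D_out))  # y_i = W x_i + Gaussian noise

    # W_hat = (sum_i y_i x_i^T)(sum_i x_i x_i^T)^{-1}; solve() avoids an explicit inverse.
    W_hat = np.linalg.solve(X.T @ X, X.T @ Y).T

    print(np.max(np.abs(W_hat - W_true)))                 # small for modest noise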
Multivariate Linear Regression – Posterior

Let yi be scalar (so that W is a row vector) and write w for the column vector of weights. A conjugate prior for w is

P(w|A) = N(0, A⁻¹)

Then the log posterior on w is

log P(w|D, A, σy) = log P(D|w, A, σy) + log P(w|A, σy) − log P(D|A, σy)
  = −½ wᵀ A w − ½ ∑i (yi − wᵀxi)² σy^(−2) + const
  = −½ wᵀ ( A + σy^(−2) ∑i xi xiᵀ ) w + wᵀ ∑i (yi xi) σy^(−2) + const      [write Σw⁻¹ = A + σy^(−2) ∑i xi xiᵀ]
  = −½ wᵀ Σw⁻¹ w + wᵀ Σw⁻¹ Σw ∑i (yi xi) σy^(−2) + const                   [write µw = Σw ∑i (yi xi) σy^(−2)]
  = log N( Σw ∑i (yi xi) σy^(−2), Σw )

MAP and ML for linear regression

As the posterior is Gaussian, the MAP and posterior mean weights are the same:

wMAP = ( A + ∑i xi xiᵀ / σy² )⁻¹ ∑i yi xi / σy² = ( A σy² + ∑i xi xiᵀ )⁻¹ ∑i yi xi

Compare this to the (transposed) ML weight vector for scalar outputs:

wML = Ŵᵀ = ( ∑i xi xiᵀ )⁻¹ ∑i yi xi

▶ The prior acts to “inflate” the apparent covariance of the inputs.
▶ As A is positive (semi)definite, it shrinks the weights towards the prior mean (here 0).
▶ If A = αI this is known as the ridge regression estimator.
▶ The MAP/shrinkage/ridge weight estimate often has lower squared error (despite bias) and makes more accurate predictions on test inputs than the ML estimate.
▶ An example of prior-based regularisation of estimates.
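A minimal numpy sketch comparing the two estimators, with A = αI (the data sizes, α, and σy are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    N, D = 20, 10                                  # few points relative to the dimension
    w_true = rng.normal(size=D)
    X = rng.normal(size=(N, D))
    y = X @ w_true + 0.5 * rng.normal(size=N)      # sigma_y = 0.5

    alpha, sig2 = 1.0, 0.25                        # A = alpha * I, sigma_y^2
    S = X.T @ X                                    # sum_i x_i x_i^T
    w_ml = np.linalg.solve(S, X.T @ y)             # ML weights
    w_map = np.linalg.solve(alpha * sig2 * np.eye(D) + S, X.T @ y)   # ridge/MAP weights

    print(np.sum((w_ml - w_true) ** 2))            # squared error of ML
    print(np.sum((w_map - w_true) ** 2))           # typically smaller for the MAP estimate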
▶ It is very important that you understand all the material in the following cribsheet:
  http://www.gatsby.ucl.ac.uk/teaching/courses/ml1/cribsheet.pdf
▶ The following notes by (the late) Sam Roweis are quite useful:
  ▶ Matrix identities and matrix derivatives:
    http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
  ▶ Gaussian identities:
    http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf