
Statistical Machine Learning Notes

Linear Regression
Simple linear regression
Assume a model of the form:

$y = w_0 + w_1 x_1 + \dots + w_d x_d + \varepsilon = \mathbf{w}^T\mathbf{x} + \varepsilon$

Assume the error follows a normal distribution:

$\varepsilon \sim \mathcal{N}(0, \sigma^2)$

Hence we have:

$p(y\,|\,\mathbf{x}) = \mathcal{N}(\mathbf{w}^T\mathbf{x},\, \sigma^2)$

Logistic regression
Logistic regression adapts the linear modelling approach to predicting categorical (binary) variables. The model takes the form:

$P(y = 1\,|\,\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})$

where the logistic 'squashing function' is:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Assume the response follows a Bernoulli distribution:

$y\,|\,\mathbf{x} \sim \text{Bernoulli}(\sigma(\mathbf{w}^T\mathbf{x}))$

Hence we have the likelihood:

$p(y\,|\,\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})^{y}\,\left(1 - \sigma(\mathbf{w}^T\mathbf{x})\right)^{1-y}$

Basis expansion
A form of data transformation used when we expect non-linear relationships in our data.

Polynomial basis:

$\varphi_j(x) = x^j$

Radial basis function:

$\varphi_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}_j\|^2}{2\sigma^2}\right)$

The type and number of basis transformation functions need to be chosen beforehand, which places a
limit on the flexibility of this approach. Neural networks learn the transformation automatically, and so
are more flexible.

Parameter selection
There are a number of choices as to how to select the parameters $\mathbf{w}$. Typically we attempt to minimise a
loss function, the choice of which depends on the model.

Simple linear regression

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 = (X^TX)^{-1}X^T\mathbf{y}$

Ridge regression

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$

Lasso regression

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$

In lasso regression and for other more complicated models, there is no closed-form solution for the
optimal model parameters. Instead we have to solve numerically using gradient descent algorithms. The
basic procedure is to start with some guess $\mathbf{w}^{(0)}$ and then iteratively update until convergence is
reached. The update rules vary from one algorithm to another.

Coordinate descent: consider one weight at a time; find the value of component $w_i$ that minimises the loss while holding
all other components fixed.

Gradient descent: consider all weights at once, updating according to $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \nabla L(\mathbf{w}^{(t)})$, where the step-size
parameter $\eta_t$ is dynamically updated at each step. This assumes the loss function is differentiable.
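To make this concrete, here is a minimal Python sketch (an illustration, not part of the original notes) of gradient descent for ridge regression, checked against the closed-form solution; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, eta=0.1, n_iters=2000):
    """Minimise (1/n)||y - Xw||^2 + lam*||w||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)                                   # initial guess w^(0)
    for _ in range(n_iters):
        grad = (-2.0 / n) * X.T @ (y - X @ w) + 2 * lam * w
        w = w - eta * grad                            # step against the gradient
    return w

# Sanity check against the closed-form ridge solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w_gd = ridge_gradient_descent(X, y)
w_closed = np.linalg.solve(X.T @ X + len(X) * 0.1 * np.eye(3), X.T @ y)
print(w_gd, w_closed)  # the two estimates should agree closely
```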

Regularisation
Without regularisation, model parameters are found based entirely on the information contained in the
training set. Regularisation essentially means introducing additional information, and so is equivalent to
incorporating a prior over our weights.

Regularisation also helps to avoid problems intrinsic to ill-conditioned data, such as when one feature
is a linear combination of the others (collinearity), or when there are more parameters than observations. In both
cases, optimisation of a regression model becomes an ill-posed problem, and generally cannot be
solved. The use of a regularisation term is one way around these problems.

Bias variance trade-off


In supervised learning the model is trained on the training data by minimising the training error, and the
generalisation capacity is then judged by computing the test error on a new dataset. Under these
conditions, the expected test error decomposes into three components:

$\mathbb{E}\left[(y - \hat{f}(\mathbf{x}_0))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - f(\mathbf{x}_0)\right)^2}_{\text{bias}^2} + \underbrace{\text{Var}\left[\hat{f}(\mathbf{x}_0)\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$

Note that here we treat the fitted model $\hat{f}$ as a random variable that depends on the particular training sample we
have drawn; thus the variance is computed across samples. The irreducible error is due to the noise term
$\varepsilon$ in the model. The expectations are taken with respect to the draw of the training set and the noise.

In general, simpler models are more likely to incompletely capture all regularities in the data, and thus
more likely to exhibit bias. More complex models are likely to overfit the data in the training set, and
hence often exhibit higher variance between samples. There is thus often a trade-off between bias and
variance that should be optimised by validation against test samples.

Neural Networks

The perceptron
The perceptron is a linear binary classifier, similar in operation to logistic regression. It consists of a
single-layer feedforward neural network.

Although the activation function is nonlinear in its argument $s = \mathbf{w}^T\mathbf{x}$, the perceptron is a linear
classifier because its decision boundary represents a hyperplane in the space of datapoints.

Decision rule for the sign activation function: predict class A if $s \geq 0$, predict class B if $s < 0$.

If the data is linearly separable, the perceptron training algorithm will always converge to one of
infinitely many possible solutions (separating boundaries). However, if the data is not linearly separable,
the training will fail completely rather than return an approximate solution.

A simple loss function for training the perceptron gives no penalty for correctly classified examples, and
a loss equal to the magnitude of the activation, $-sy$, for each misclassified example:

$L(s, y) = \max(0, -sy)$

The perceptron is similar to logistic regression in that both are linear models usually trained using
gradient descent. However, the gradient is taken of different functions: for a single training example,
logistic regression aims to minimise the negative log-likelihood, while the perceptron aims to minimise
the perceptron loss above. Also, logistic regression is not necessarily trained using gradient descent, but
can be trained using algorithms that use second derivatives. In contrast, the perceptron training
algorithm is specifically stochastic gradient descent.
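A minimal sketch of the perceptron training algorithm (stochastic gradient descent on the perceptron loss) might look as follows; the learning rate and epoch cap are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=100):
    """Perceptron training for labels y in {-1, +1}.

    Each misclassified example (s*y <= 0) triggers an SGD step on the
    perceptron loss; on linearly separable data this converges to one
    of the infinitely many separating boundaries."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        n_errors = 0
        for xi, yi in zip(X, y):
            s = w @ xi + b
            if s * yi <= 0:          # misclassified: update towards yi
                w += eta * yi * xi
                b += eta * yi
                n_errors += 1
        if n_errors == 0:            # converged (separable data only)
            break
    return w, b
```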

Multilayer perceptron
Multilayer perceptrons have one or more hidden layers in between the input and output layers. Nodes
in each layer have their own activation functions and unique set of weights connecting them to each
adjacent layer.

Universal approximation theorem: an ANN with a single hidden layer containing a finite number of units, under mild
assumptions on the activation function, can approximate continuous functions on compact subsets of
$\mathbb{R}^n$ arbitrarily well.

When counting the parameters of a network, recall that bias nodes do not take any input. For example, a
network with $d$ inputs, one hidden layer of $m$ units, a single output, and bias nodes feeding the hidden
and output layers has $(d + 1)m + (m + 1)$ parameters (weights) in total.

Backpropagation
A common loss function for neural networks is the squared error between the target output $y$ and the
network prediction $\hat{y}$:

$L = \frac{1}{2}(y - \hat{y})^2$

There is no general analytic solution for the minimising weights, so we use a gradient descent method
with updates $w_j \leftarrow w_j - \eta\,\partial L/\partial w_j$.

Note the sign: if the partial derivative $\partial L/\partial w_j < 0$, then the loss shrinks as $w_j$ increases, so we should
increase that parameter, as in fact occurs because of the double minus signs. If $\partial L/\partial w_j > 0$ then the loss
increases as $w_j$ increases, so we should decrease that parameter, as also occurs thanks to the single
minus sign.

In order to use this training method it is necessary to calculate the partial derivative of the loss function
with respect to each model parameter. This is complicated in multilayer networks by the fact that the
loss function depends only indirectly on the weights at earlier stages in the network – this is called the
credit assignment problem. The solution to this problem is to use backpropagation, which is essentially
an application of the chain rule.

Thus we have the partial derivatives:

Hence:

We also compute:

Yielding the backpropagation equations in the identity activation function case:

The idea then is that errors are propagated backwards through the network, and are then used to
compute the partial derivatives, which in turn are used to update the weights.
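As an illustration of these ideas (a sketch under assumed notation, not the omitted derivation itself), one backpropagation update for a one-hidden-layer network with sigmoid hidden units, identity output activation, and squared-error loss could be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.1):
    """One gradient step: forward pass, backward pass, weight update."""
    # Forward pass
    h = sigmoid(W1 @ x)                           # hidden activations
    yhat = W2 @ h                                 # identity output activation
    # Backward pass: propagate errors back via the chain rule
    delta_out = yhat - y                          # dL/dyhat for L = 0.5*||yhat - y||^2
    grad_W2 = np.outer(delta_out, h)              # dL/dW2
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # error at hidden layer
    grad_W1 = np.outer(delta_hid, x)              # dL/dW1
    # Gradient descent update (note the single minus sign)
    W1 -= eta * grad_W1
    W2 -= eta * grad_W2
    return W1, W2
```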

The flexibility of neural network models means they are liable to overfit. This tendency can be avoided
by implicit regularisation in the form of ‘early stopping’ of training, or through explicitly including a
regularisation term in the loss function to drive the weights to zero.

Deep learning
While any Boolean function over $n$ variables can be implemented using a single hidden layer with up to
$2^n$ elements, it is often more efficient to stack several hidden layers to form a deep network.

Each hidden layer can be thought of as a transformation of the underlying feature space.

One problem with deep networks is the vanishing gradient problem: partial derivatives become very small
as they are propagated back through many layers.
Convolutional neural networks
Convolutional neural networks are hierarchically structured such that each unit in one layer receives input
only from a limited region of the previous layer (called its receptive field), which is passed through a
kernel or filter function.

Careful selection of filter functions allows one to detect the presence of particular features, such as
edges in images.

Usually the representations are then passed through further, fully-connected hidden layers so as to
merge representations together before passing them through the output.
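A minimal sketch of the underlying operation (assuming a single channel and 'valid' padding; like most CNN libraries, this actually computes cross-correlation) is:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each
    position; each output unit sees only its local receptive field."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A classic Sobel-style filter that responds to vertical edges.
edge_kernel = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]])
```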

Autoencoders
An autoencoder is a special type of network that is designed to predict its own input as its output. This
would be pointless were it not for the 'bottleneck' that the input is passed through, which aims
to ensure compression of the input with minimal information loss.

Autoencoders can be used for compression and dimensionality reduction via a non-linear
transformation.

Support Vector Machines

Maximum margin classifiers


A support vector machine is a linear binary classifier that works by finding a hyperplane to separate two
classes in a dataset.

In fact hard margin support vector machines work exactly like a perceptron, except that they optimise a
different objective function in choosing the parameters. For the perceptron, all linear boundaries that
separate the two classes are equally good, because the perceptron loss is zero for each of them. In the
example below, however, it seems clear that line A is a better choice than line B.

The support vector machine formalises this notion by finding the separating boundary that maximises
the margin between classes.

The points on the margin boundaries (the dotted lines in the figure above) are called the support vectors
(each is a vector of observed feature values). They play an important role in defining the margin width.

Deriving the objective


The key goal of hard margin SVMs is to find the hyperplane with the maximum distance to the support
vectors. To define the needed concepts, consider an arbitrary data vector $\mathbf{x}$ (in either class, not
necessarily a support vector), and let $\mathbf{x}_p$ be the projection of this vector onto the
separating boundary. Let $\mathbf{r} = \mathbf{x} - \mathbf{x}_p$, whose length is the margin width we are looking to
calculate. Finally, the weights $\mathbf{w}$ in the classification expression $s = \mathbf{w}^T\mathbf{x} + b$ themselves represent a vector
that is always perpendicular to the separating boundary, and hence parallel (or anti-parallel) to $\mathbf{r}$. To see this, note
that the decision boundary is the set of $\mathbf{x}$ such that $\mathbf{w}^T\mathbf{x} + b = 0$; for any two boundary points $\mathbf{u}$ and $\mathbf{v}$ we have $\mathbf{w}^T(\mathbf{u} - \mathbf{v}) = 0$, meaning $\mathbf{w}$ is perpendicular to every direction lying in the boundary.

We thus have the relation between the parallel or anti-parallel vectors:

$\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$

Substituting in the value of $\mathbf{x}$ (premultiplying by $\mathbf{w}^T$ and adding $b$):

$\mathbf{w}^T\mathbf{x} + b = \mathbf{w}^T\mathbf{x}_p + b + r\,\frac{\mathbf{w}^T\mathbf{w}}{\|\mathbf{w}\|}$

Since $\mathbf{x}_p$ lies on the boundary, obviously $\mathbf{w}^T\mathbf{x}_p + b = 0$, and so:

$r = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$

It turns out we have to add a sign ambiguity, since $\mathbf{r}$ and $\mathbf{w}$ could be anti-parallel if $\mathbf{x}$ was on the other side
of the decision boundary:

$r = \pm\frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$

Alternatively we can use the class labels $y \in \{-1, +1\}$ to do this:

$r = \frac{y\,(\mathbf{w}^T\mathbf{x} + b)}{\|\mathbf{w}\|}$

So effectively hard-margin SVMs have the following objective:

$\max_{\mathbf{w},\, b}\; \min_i\; \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$

Unfortunately, however, there is an ambiguity in this objective: there are infinitely many solutions,
because $(\alpha\mathbf{w}, \alpha b)$ will also be a solution for any $\alpha > 0$. To get around this we introduce an arbitrary
scaling convention, requiring that for the points closest to the boundary (the support vectors):

$y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$

So now our objective becomes:

$\min_{\mathbf{w},\, b} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \text{ for all } i$

This means we pick the $\mathbf{w}$ whose length is smallest, subject to the condition that every point is on or
outside of the margins.

These constraints can be interpreted as the following loss function: zero loss if $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$, and
infinite loss otherwise.

Soft margin classifiers
When the data is not linearly separable, the hard margin SVM approach will not work.

The two main methods of dealing with this problem are to use soft margin SVMs, or to transform the
data. Here we consider the first approach.

In the soft margin SVM formulation we relax the constraints to allow points to be inside the margin or
even on the wrong side of the boundary. However, we penalise boundaries by an amount that reflects
the extent of the "violation".

We can rewrite the soft margin objective in terms of slack variables $\xi_i \geq 0$, which represent the extent to
which a point is on the wrong side of its margin:

$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0$

Solving the optimisation
Training an SVM simply means solving the corresponding optimisation problem to find $\mathbf{w}^*$ and $b^*$. Here we focus
on training a hard margin SVM. Since gradient descent methods cannot be used to solve a constrained
optimisation problem directly, we instead use the method of Lagrange multipliers and the KKT conditions.

It turns out that if $\mathbf{w}^*$ and $b^*$ is a solution of the primal hard margin SVM problem, then there exists
$\boldsymbol{\lambda}^*$ such that all three together satisfy the KKT conditions.

We thus have the Lagrangian:

$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \lambda_i\left(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right)$

Taking partial derivatives and setting them to zero:

$\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i \qquad \frac{\partial\mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_i \lambda_i y_i = 0$

Substituting these conditions into the Lagrangian we obtain the dual problem:

$\max_{\boldsymbol{\lambda} \geq 0}\; \sum_i \lambda_i - \frac{1}{2}\sum_i\sum_j \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \sum_i \lambda_i y_i = 0$

Choosing the non-negative set of $\lambda_i$ which maximises this Lagrangian dual problem will therefore give us the
solution to the original problem. In summary, the prediction for a new point $\mathbf{x}$ is:

$s = b^* + \sum_i \lambda_i^* y_i\, \mathbf{x}_i^T\mathbf{x}$

Note that the terms in the above sum are non-zero only for support vectors. Thus the prediction is made
essentially by taking the dot product of each new point with the set of support vectors.

The SVM Lagrangian dual problem is a quadratic optimisation problem. Using standard algorithms this
problem can be solved in $O(n^3)$ time. One algorithm called chunking exploits the fact that many of the $\lambda_i$s will
be zero, since this is true for all points outside the margins.

Kernel methods
Kernel methods are a way of performing a feature space transformation on the original data. If we apply
a SVM to the transformed data, our model is still linear in the transformation space, but it will be non-
linear in the original space, allowing us to capture more complex relations.

One problem with simply applying a transformation $\varphi$ to each data point is that computing $\varphi(\mathbf{x})$
for each point is impractical when the transformed data is very high dimensional. We can get around this, however, by
noting that the data only appears in both the training and prediction equations in the
form of dot products. This means that we never actually need to compute $\varphi(\mathbf{u})$ and $\varphi(\mathbf{v})$ individually, but only
their dot product. This naturally gives rise to the definition of a kernel:

$K(\mathbf{u}, \mathbf{v}) = \varphi(\mathbf{u})^T\varphi(\mathbf{v})$
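A hedged sketch of what this looks like in code (the kernel forms match those listed under "Examples of kernels" below; the parameter defaults are arbitrary):

```python
import numpy as np

def polynomial_kernel(u, v, c=1.0, p=3):
    """Polynomial kernel K(u, v) = (u.v + c)^p, with p the polynomial order."""
    return (u @ v + c) ** p

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def gram_matrix(X, kernel):
    """Matrix of all pairwise kernel values; by Mercer's theorem this
    must be positive semidefinite for a valid kernel."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```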

Advantages of using kernels include:

- Saved computational time in not having to compute $\varphi(\mathbf{x})$ for all observations
- Ability to extend methods to infinite-dimensional feature spaces
- Kernels can be applied to objects that are not vectors, such as graphs, sequences, even movies; a kernel serves as a general similarity measure

If $K_1$ and $K_2$ are kernels then, among others, the following are also valid kernels: $cK_1$ for $c > 0$; $K_1 + K_2$; $K_1 K_2$; and $f(\mathbf{u})\,K_1(\mathbf{u}, \mathbf{v})\,f(\mathbf{v})$ for any function $f$.

The Representer Theorem states that a large class of linear methods can be formulated (represented)
such that both training and making predictions require the data only in the form of dot products. This
includes:

- Hard margin support vector machine
- Ridge regression
- Logistic regression
- Perceptron
- Principal components analysis

One of the advantages of the representer theorem is that it highlights the fact that all information about the
feature mapping is contained within the kernel. The algorithm thus decouples into choosing a learning
method (e.g. SVM vs logistic regression) and then choosing a feature space mapping (i.e. a kernel).

Examples of kernels

- Polynomial kernel: $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + c)^p$, where $p$ is the integer order of the polynomial
- Radial basis function kernel: $K(\mathbf{u}, \mathbf{v}) = \exp\left(-\gamma\|\mathbf{u} - \mathbf{v}\|^2\right)$, where $\gamma$ is a spread parameter

Mercer's theorem states that a function $K$ is a valid kernel if, for any finite sequence of vectors, the $n \times n$ matrix of pairwise
values $K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite, and this holds for all possible
sequences.

Bagging
Bagging is a method of constructing 'novel' datasets by resampling with replacement from our actual
dataset. The idea is to generate new datasets, each of the same size as our original set, then build a
classifier on each set separately and combine predictions via voting. The purpose is that averaging over
many approximately independent classifiers should reduce prediction variance.

At each round of selection, a particular datum has a probability of $1 - \frac{1}{n}$ of NOT being selected. Thus
the probability of that observation being left out of the new bootstrapped sample altogether
is $\left(1 - \frac{1}{n}\right)^n$. In the limit of large $n$, this approaches $e^{-1} \approx 0.37$, so over a third of our data won't be used
in each training dataset we produce. We can use this excluded (out-of-bag) data for cross-validation.

Bagging is typically an effective method to reduce variance; its performance is generally
significantly better than the base classifiers and almost never substantially worse.
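A minimal sketch of the bagging procedure (the base classifiers are assumed to be arbitrary callables returning labels in {-1, +1}):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Resample n observations with replacement; also return the
    out-of-bag indices, which can be used for validation."""
    n = len(X)
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)   # roughly a third of the data
    return X[idx], y[idx], oob

def bagging_predict(classifiers, X):
    """Combine base classifier predictions (labels in {-1, +1}) by voting."""
    votes = np.array([clf(X) for clf in classifiers])
    return np.sign(votes.sum(axis=0))
```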

Clustering and EM Algorithm

Unsupervised learning
Unsupervised learning differs from supervised learning in that there are no specified data labels that we
try to predict. Instead, the goal is simply to find patterns and structure in the data.

Common applications of unsupervised learning:

 Clustering
 Dimensionality reduction
 Probabilistic graphical models
 Outlier detection
K-means clustering
Clustering is the automatic grouping of objects such that the objects within each group (cluster) are more
similar to each other than to objects from different groups. In order to do this we need a measure of
similarity, or, as is often used instead, a measure of dissimilarity. One common choice is Euclidean
distance:

$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_j (u_j - v_j)^2}$

K-means is a very popular iterative clustering algorithm, which requires specifying the number of
clusters $k$ in advance. It proceeds as follows:

1. Initialise the $k$ cluster centroids (e.g. as randomly chosen data points).
2. Assign each point to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.

Higher values of $k$ will always result in a better fit to the data, since they permit a more flexible model.
The 'kink method' is a simple way of deciding the value of $k$: plot the intra-cluster
variation against the number of clusters and locate the 'kink' where improvement slows. More abstract information-theoretic
methods can also be used to determine $k$.
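A minimal sketch of the k-means iteration described above (random initialisation and the iteration cap are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate assignment and centroid-update steps until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```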
Gaussian mixture model
GMM clustering is a generalisation of k-means which takes a probabilistic approach. The key insight is
to regard the cluster centres as the means of different probability distributions, one of which generated each observation.

While still requiring the number of clusters/distributions to be set in advance, GMM does not require
each point to be assigned to exactly one cluster. Instead, we assign a particular point to each cluster
with some probability.

We typically take each of the component probability distributions to be a multivariate Gaussian:

$\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) = \frac{1}{\sqrt{(2\pi)^d\,|\boldsymbol{\Sigma}_j|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1}(\mathbf{x} - \boldsymbol{\mu}_j)\right)$

For $d$-dimensional data and $k$ clusters we will have $kd$ mean parameters, $k\,\frac{d(d+1)}{2}$ covariance
parameters, and $k - 1$ independent weight parameters. Using this method, the probability that an observation
occurs at point $\mathbf{x}$ is equal to the weighted sum over the Gaussians:

$p(\mathbf{x}) = \sum_{j=1}^{k} w_j\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$

To solve this problem using the standard maximum likelihood approach, therefore, we aim to find the
set of parameters $\theta = \{w_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}$ that maximises the joint log likelihood:

$\log p(\mathbf{x}_1, \dots, \mathbf{x}_n) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j\, \mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$

Notice that unlike the usual log likelihood expression, we have an inner summation term (over the
multiple Gaussian centers) which we cannot take outside the logarithm. This makes finding an analytical
solution impossible. Instead we usually solve a GMM using a method called the Expectation
Maximisation algorithm.

Expectation Maximisation algorithm


Expectation Maximisation (EM) is a generic algorithm for optimising the parameters of the log
likelihood, even when no analytic solution is available. It is often used for the Gaussian mixed model
clustering method, but can be used in many other machine learning methods too. The key insight is to
introduce a series of latent (unobserved) variables which can make calculations easier.

This gives rise to the following iterative optimisation scheme (which finds a local maximum only):

1. E-step: using the current parameter estimates, compute the expected values of the latent variables.
2. M-step: maximise the expected complete-data log likelihood with respect to the parameters, holding those expectations fixed.
3. Repeat until convergence.

Note that most of the work is done in the maximisation step. The initial expectation step computes all
the quantities (specifically the responsibilities $r_{ij}$) that are needed to fully evaluate the derivatives used in the M-step, but
does not involve any actual maximisation itself.

Applying the EM algorithm to GMM


The latent variables in GMM are taken to be the true cluster from which each observation was
generated: $z_i \in \{1, \dots, k\}$ for $k$ clusters. If we knew all the $z_i$s, we would not need to sum over all
possible clusters, since obviously we would already know which cluster each observation came from.
Our modified (complete-data) log likelihood would then take the form:

$\log p(\mathbf{x}_1, \dots, \mathbf{x}_n, z_1, \dots, z_n) = \sum_{i=1}^{n} \log\left(w_{z_i}\, \mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i})\right)$

Unlike the original log likelihood, this expression can be maximised analytically, so long as we know the
values of the $z_i$s. Since we don't actually know the true cluster assignments, we instead take the
expectation with respect to $P(z_i\,|\,\mathbf{x}_i, \theta)$, where $\theta$ denotes all the GMM model parameters and the sum is over all possible allocations of the
$z_i$s to clusters.

Since we need a definition of $P(z_i = j\,|\,\mathbf{x}_i, \theta)$, we use what is called the responsibility that cluster $j$
takes for data point $i$:

$r_{ij} = \frac{w_j\, \mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} w_l\, \mathcal{N}(\mathbf{x}_i\,|\,\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}$

And so for the joint distribution (assuming all data points are independent):

Now we are ready to evaluate the log likelihood using the equality derived previously:

We then take partial derivatives to maximise this expression with respect to $w_j$, $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$. The new values
of the parameters are then used in the next step of the iteration to calculate new values of $r_{ij}$, which
are then used for another maximisation step, and so on.

It turns out that K-means is a special case of GMM in which all components have the fixed weight
$w_j = 1/k$, and in which each Gaussian has the fixed covariance matrix $\boldsymbol{\Sigma}_j = \sigma^2 I$.
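Putting the E-step and M-step together, a minimal sketch of EM for a GMM might look as follows (the small diagonal term added to each covariance is a numerical-stability assumption, not part of the derivation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture with full covariance matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]          # initial means
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    w = np.full(k, 1.0 / k)                               # mixture weights
    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(cluster j | x_i)
        r = np.stack([w[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates of the parameters
        Nj = r.sum(axis=0)
        w = Nj / n
        mu = (r.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return w, mu, sigma, r
```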

Dimensionality Reduction
Principal components analysis
The purpose of dimensionality reduction is to reduce the number of variables/dimensions in the data
while still preserving the structure of interest. This can aid in visualisation, computational efficiency, and
data storage compression.

Principal components analysis is a popular method for achieving dimensionality reduction. Given a
dataset, PCA aims to find a new coordinate system such that most of the variance is concentrated in the
first coordinate, and most of the remainder in the second coordinate, and so forth. Most of the
dimensions are then discarded, leaving a reduced number of dimensions that still preserve most of the
variance.

Consider our new coordinate system as a set of vectors $\mathbf{p}_1, \dots, \mathbf{p}_d$, each with unit length. To transform a
single original data point $\mathbf{x}$ into our new coordinate system, we use the transformation $z_j = \mathbf{p}_j^T\mathbf{x}$. If we
put all the original data points into the columns of a matrix $X$, then the $j$th coordinate of all
the transformed points is given by:

$\mathbf{z}_j = \mathbf{p}_j^T X$

We need a method for determining all the $\mathbf{p}_j$ vectors. To do this we can use the covariance matrix
of the centred data, which is simply:

$\Sigma_X = \frac{1}{n} X X^T$

We want to choose $\mathbf{p}_1$ such that it maximises the variance of the first transformed coordinate. This variance is:

$\text{Var}(\mathbf{z}_1) = \mathbf{p}_1^T \Sigma_X\, \mathbf{p}_1$

We want to find the $\mathbf{p}_1$ that maximises this variance subject to $\mathbf{p}_1$ being of unit length. We can solve
this constrained optimisation using a Lagrangian:

$\mathcal{L} = \mathbf{p}_1^T \Sigma_X\, \mathbf{p}_1 - \lambda(\mathbf{p}_1^T\mathbf{p}_1 - 1) \quad\Rightarrow\quad \Sigma_X\, \mathbf{p}_1 = \lambda\, \mathbf{p}_1$

Since $\mathbf{p}_1^T \Sigma_X\, \mathbf{p}_1 = \lambda$, and we know that the first coordinate has the largest variance of all the coordinates (by
design), it follows that $\mathbf{p}_1$ is simply the eigenvector of the centred covariance matrix with the largest
eigenvalue. We find all the other vectors by the same process, adding the requirement that each be
orthogonal to all previous solutions. This is always possible, since a symmetric $d \times d$ matrix always has $d$
real eigenvalues (and an orthogonal set of eigenvectors).

One other benefit of PCA is that it should result in coordinates which are uncorrelated with each other.

An extension of regular PCA is kernel PCA, in which a more complex kernel function is used in
place of the dot product $\mathbf{x}^T\mathbf{x}'$ to capture nonlinear variations.
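A minimal sketch of PCA via eigendecomposition of the sample covariance (here rows of X are observations, a transposed convention relative to the derivation above):

```python
import numpy as np

def pca(X, n_components):
    """Project centred data onto the top principal directions."""
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = Xc.T @ Xc / (len(X) - 1)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix: real eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    P = eigvecs[:, order[:n_components]]     # principal directions
    return Xc @ P, P
```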

Multidimensional scaling
Multidimensional scaling is a non-linear dimensionality reduction method which seeks to map data to a
lower-dimensional space while preserving pairwise differences (dissimilarities) as much as possible. How
exactly dissimilarity is defined, and also how 'preservation' is defined, distinguishes distinct
instantiations of MDS from each other.

The degree of preservation of dissimilarity is measured using a stress function. One possible definition,
in terms of the differences between the original distances $d_{ij}$ and the mapped distances, is:

$\text{Stress}(\mathbf{z}_1, \dots, \mathbf{z}_n) = \sum_{i < j} \left(d_{ij} - \|\mathbf{z}_i - \mathbf{z}_j\|\right)^2$

The idea is to find the $\mathbf{z}_i$s such that the stress is minimised. The logic behind MDS is that if there are
genuine clusters in high-dimensional data, then points within these clusters are close to each other,
while points from different clusters are far away. MDS attempts to preserve this distance structure, so
that clusters are preserved in the low-dimensional map.

One additional application of MDS is to use it for producing a set of coordinates given just the pairwise
dissimilarities between various instances. This could be used, for example, to provide a mapping of the
characteristics of different movies based only on pairwise similarity ratings by viewers.

Manifold learning
The k-means algorithm can find spherical clusters, and GMM extends this to be able to find elliptical
clusters; however, both algorithms will be unable to correctly identify highly irregular clusters:

If we could transform this data in some way, that may make it easier for simple clustering algorithms to
find the desired results. The key assumption of manifold learning is that high dimensional data actually
consists of a low-dimensional manifold that is locally Euclidean, but is ‘rolled up’ in a higher dimensional
space. The manifold is that subset of points in the high-dimensional space that locally looks low-
dimensional (e.g. treat the arc of a large circle as a line).

We need a way of ‘unfolding’ these manifolds, on which we can then apply regular clustering
algorithms. In doing this we are interested in preserving only local or geodesic distances along the
manifold, not global distances.

The unfolding itself is just the process of constructing a new similarity-preserving coordinate system. All
we need to do this is define all the pairwise geodesic distances, and then simply input these into the
MDS algorithm. We then need a way to determine the geodesic distances. It turns out we can do this
using weighted, undirected graphs using the following steps:

1. Define some local radius $\varepsilon$, and connect vertices $i$ and $j$ (one vertex for each observation) with an edge if $d(\mathbf{x}_i, \mathbf{x}_j) \leq \varepsilon$ in
the original space.
2. Set the weight of the edge between connected vertices $i$ and $j$ to $d(\mathbf{x}_i, \mathbf{x}_j)$.
3. Compute the shortest path between each pair of non-adjacent nodes in the graph.
4. Construct the geodesic distance matrix, whose entries are the lengths of the shortest paths between each pair of
nodes in the graph.
5. Perform MDS on the resulting geodesic distance matrix.

This method is known as the isomap algorithm.
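A sketch of steps 1 through 4 (estimating the geodesic distances; the local radius eps is a user-supplied assumption), leaving step 5 to any MDS implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, eps):
    """Epsilon-neighbourhood graph weighted by Euclidean distance,
    then all-pairs shortest paths as geodesic distance estimates."""
    D = cdist(X, X)                      # pairwise Euclidean distances
    W = np.where(D <= eps, D, np.inf)    # connect only local neighbours
    np.fill_diagonal(W, 0.0)
    return shortest_path(W, method='D')  # Dijkstra over the weighted graph

# The resulting matrix would then be fed into MDS (step 5).
```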

Spectral clustering
Spectral clustering is an alternative non-linear dimensionality reduction method to the isomap
algorithm. It proceeds as follows:

1. Construct a similarity graph by starting with a fully connected graph, and setting the weight between vertices $i$ and $j$
to $w_{ij} = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / \sigma^2\right)$, which is called a Gaussian kernel.
2. Define the graph Laplacian matrix $L = D - W$, where $D$ is the degree matrix, with the degree of
each vertex on the diagonal.
3. Perform dimensionality reduction using Laplacian eigenmaps.
3. Perform dimensionality reduction using Laplacian eigenmaps.

In this method the new set of coordinates produced by the Laplacian eigenmaps method is chosen such
that the following objective is minimised:

$\mathbf{z}^T L\, \mathbf{z} = \frac{1}{2}\sum_{i,j} w_{ij}\,(z_i - z_j)^2$

This minimisation is taken subject to the restriction that $\|\mathbf{z}\| = 1$, as otherwise we could always reduce
the objective simply by shrinking the length of $\mathbf{z}$. We thus have the problem:

$\min_{\mathbf{z}}\; \mathbf{z}^T L\, \mathbf{z} \quad \text{subject to} \quad \mathbf{z}^T\mathbf{z} = 1$

Each solution vector $\mathbf{z}_j$ represents the set of $j$th components of the original data points in the new
coordinate system. As with PCA, we order the solutions by the size of the eigenvalue, though here we want
the smallest eigenvalues of $L$ rather than the largest. The smallest eigenvalue (zero) always corresponds to the
constant vector, which is not helpful, so we simply ignore this
solution. Putting the remaining $\mathbf{z}_j$ down in columns then gives a matrix with the mapped points as rows.
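A minimal sketch of the whole pipeline up to the embedding (the spread parameter sigma is an illustrative assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplacian_eigenmap(X, n_components, sigma=1.0):
    """Gaussian-kernel similarity graph, graph Laplacian L = D - W,
    then the eigenvectors with the smallest non-zero eigenvalues
    as the new coordinates."""
    W = np.exp(-cdist(X, X) ** 2 / sigma ** 2)   # fully connected graph
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))                   # degree matrix
    L = D - W                                    # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)         # ascending eigenvalues
    return eigvecs[:, 1:1 + n_components]        # skip the constant eigenvector
```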

Bayesian Methods
The Bayesian approach to inference
The frequentist approach to inference is to first specify a model, then use maximum likelihood to find
the optimised parameters for that model, and finally to use the resulting optimised model to make
predictions. The Bayesian approach differs in that it rejects the attempt to reduce the model to a single
optimal parameter setting, instead considering the full space of possible parameter values. Instead of making a
point prediction, therefore, Bayesians average predictions over the posterior distribution of the parameters:

$p(y_*\,|\,\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int p(y_*\,|\,\mathbf{x}_*, \mathbf{w})\, p(\mathbf{w}\,|\,\mathbf{X}, \mathbf{y})\, d\mathbf{w}$

The key insight is that providing a single number as a prediction gives us no information about the
spread or uncertainty in this value.

The Bayesian approach does not just take the single most likely value of the parameters $\mathbf{w}$, but uses all
possible values weighted by their probability of being consistent with the observed data. Aside from
giving more information about the spread, this approach is also less sensitive to overfitting.

Bayesian regression
The Bayesian approach to linear regression does not compute the maximum likelihood estimate of $\mathbf{w}$,
but instead considers the full posterior distribution:

$p(\mathbf{w}\,|\,\mathbf{X}, \mathbf{y}) \propto p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w})\, p(\mathbf{w})$

If both our likelihood and our prior are normally distributed, their product is also normally
distributed. This is an example of a conjugate prior, where the posterior and prior share the same
distributional form (in general they need not).

Bayesian regression can be conducted in a sequential manner by updating on a single observation at a
time, in each instance using the previous posterior distribution as the new prior.
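For the conjugate Gaussian case the posterior is available in closed form; a minimal sketch (the prior precision alpha and the noise variance are illustrative assumptions) is:

```python
import numpy as np

def bayes_linreg_posterior(X, y, alpha=1.0, noise_var=0.1):
    """Posterior over weights for Gaussian prior w ~ N(0, I/alpha)
    and Gaussian observation noise (a conjugate pair, so the
    posterior is also Gaussian)."""
    d = X.shape[1]
    precision = alpha * np.eye(d) + X.T @ X / noise_var  # posterior precision
    cov = np.linalg.inv(precision)                       # posterior covariance
    mean = cov @ X.T @ y / noise_var                     # posterior mean
    return mean, cov

def predictive_variance(x_new, cov, noise_var=0.1):
    """Prediction variance grows with distance from the training data."""
    return noise_var + x_new @ cov @ x_new
```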

Expected predictions in Bayesian regression are given by:

$\mathbb{E}[y_*] = \mathbf{x}_*^T\, \mathbb{E}[\mathbf{w}\,|\,\mathbf{X}, \mathbf{y}]$

Note how this differs from the frequentist approach to prediction, which uses a single point estimate:

$\hat{y}_* = \mathbf{x}_*^T\, \hat{\mathbf{w}}$

Also compare the workflow for maximum likelihood, approximate Bayes, and exact Bayes methods:

The Bayesian approach results in a prediction variance that depends on the query point $\mathbf{x}_*$, and is thus
higher at greater distances from the data points, which makes sense.

Bayesian classification
Generative models produce a joint distribution $p(\mathbf{x}, y)$, allowing us to model the input as well.
Traditional discriminative models only give a conditional distribution $p(y\,|\,\mathbf{x})$. Bayesian methods for
discrete inference in classification constitute generative models, which use conjugate priors for discrete
distributions, such as the beta-binomial pair.

For the binomial likelihood $p(k\,|\,n, q) = \binom{n}{k}\, q^k (1 - q)^{n - k}$ and the prior $q \sim \text{Beta}(\alpha, \beta)$:

The posterior is then:

$q\,|\,k \sim \text{Beta}(\alpha + k,\; \beta + n - k)$

Applying these ideas now to the case of Bayesian logistic regression, we have the Bernoulli likelihood:

$p(y\,|\,\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})^{y}\,\left(1 - \sigma(\mathbf{w}^T\mathbf{x})\right)^{1 - y}$

We now also need a prior over the weights $\mathbf{w}$. Unfortunately there is no known
conjugate prior in this instance, so we typically just use a Gaussian prior over $\mathbf{w}$.

We observe once again how the variance of our prediction increases as we move away from the data in
the Bayesian case (right), unlike the frequentist case (left).

Bayesian model selection
Model selection involves selecting the likelihood or kernel function, setting any relevant
hyperparameters, and optimising algorithm settings (such as the number of clusters in
k-means). The frequentist approach to model selection uses holdout validation: selecting
the model with the lowest error on a heldout validation set not used in training.

Problems with heldout validation:

- Data inefficient: not all available data is used for training
- Computationally inefficient: repeated rounds of training and evaluation are required
- Ineffective when optimising many parameters at once, as one is liable to overfit the heldout set

Bayesian model selection computes the Bayes factor for two model classes $M_1$ and $M_2$ based on the data $D$:

$\frac{P(M_1\,|\,D)}{P(M_2\,|\,D)} = \frac{P(M_1)\, p(D\,|\,M_1)}{P(M_2)\, p(D\,|\,M_2)}$

If we have a uniform prior over all models then this simplifies to:

$\frac{p(D\,|\,M_1)}{p(D\,|\,M_2)}$

To calculate the ratio on the right-hand side, we integrate over all possible parameter values for the
model in question:

$p(D\,|\,M) = \int p(D\,|\,\mathbf{w}, M)\, p(\mathbf{w}\,|\,M)\, d\mathbf{w}$

Thus the Bayes factor essentially tells us the relative fit of the two models to the data, averaged over all
possible parameter settings. Complex models might better fit the data over some parameter choices,
but they also have a much larger space of possible parameter values, and so on average may have
poorer fit.

Probabilistic Graphical Models

Introduction
Probabilistic graphical models are an efficient way of representing a joint probability distribution, which
allows us to explicitly represent independence relationships between variables. Traditionally, joint
distributions of discrete random variables can be represented by tables; however these grow
exponentially in size with the number of random variables. A full joint distribution will also generally
have far too many parameters to estimate reliably, and will likely result in overfitting.

Graphical models allow representation of independence relationships that greatly simplify our model.
For binary variables, each node with $k$ parents requires $2^k$ parameters, and the total number of
parameters is $2^n - 1$ for a fully connected graph with $n$ nodes.

These graphs are particularly useful in representing causal relationships between variables.

Marginal Independence
The notation $A \perp B$ means that A is marginally independent of B. If this is true then we can write:

$P(A, B) = P(A)\,P(B)$
Two unconnected parent nodes in a directed graph are always marginally independent. However a
parent in such a graph is NOT always independent of its children.

In the following two graphs, X and Y are not marginally independent of each other.

Marginal independence is thus loosely related to the concept of causality: can X cause Y, or can they
both share a common cause? If potentially yes, then they are not marginally independent.

Conditional Independence
Conditional independence is different to marginal independence in that it involves setting the value of
one or more variables (shown by greying out that node on the graph).

In the first two examples, X and Y are conditionally independent of one another given Z, meaning that
they can only influence each other via their effects on Z. In the third example, however, X and Y are NOT
conditionally independent given Z, even though they are marginally independent. This phenomenon is known as explaining
away. To see how explaining away works, consider the case when X and Y are binary coin flips, and Z records
whether they land the same side up. Given Z, X and Y become completely dependent.

Explaining away also occurs for observed children of the head-head node:

D-Separation
D-separation is a concept used in determining independence relations for larger graphs. Basically we
consider all possible paths between two nodes, ignoring the direction of any arrows in the graph
(directions matter for determining pairwise independence relations, just not for finding paths). For each
path we then consider ‘is this path blocked by an independence relation between any nodes along the
path?’ If all possible paths are blocked, then we say the two nodes are D-separated.

This concept is useful because we need to understand all the independence assumptions encoded in the
graph, not just the obvious ones. The Markov blanket of a node is the smallest set of other nodes
(variables) on which that node can be conditioned to make it independent of the rest of the graph.

Undirected Graphical Models


Undirected graphs lack directions on the edges that connect nodes together. Their edges also lack a
direct conditional-probability interpretation, so instead of conditional probabilities, each clique of nodes has
a factor that is always weakly greater than zero. The joint distribution over a set of nodes is
proportional to the product of all maximal clique factors, up to a normalisation constant.

A clique is a set of fully connected nodes in a graph. A maximal clique is a clique that cannot be enlarged by adding any further node.

The joint probability distribution is then defined using a normalisation constant $Z$:

$P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{\text{maximal cliques } C} \psi_C(\mathbf{x}_C)$

Undirected graphs actually have simpler dependence semantics than directed graphs, as now any two
nodes are independent so long as we condition upon any intermediate connecting nodes:

To convert a directed probabilistic graph to an undirected one, we follow the method of ‘moralising’ the
parent nodes by connecting them together.

Inference with Graphs


Consider the following example graph:

Now imagine we want to calculate a particular conditional probability query. We do this as follows:

Focus on the numerator:

Each step of the elimination algorithm removes one node from the graph and connects the node's remaining
neighbours, forming a clique. The time complexity of this algorithm is exponential in the size of the largest clique formed.
Because of this complexity it is often easier to make inferences numerically, by sampling from the
desired distribution and constructing a histogram to estimate the relevant probability distribution.

Another example:

Learning Graph Parameters
Suppose we have our probabilistic graphical model and a set of observations, but we as yet lack values
for the model parameters.

If all our variables are observed, it is easy enough to simply count all the relevant occurrences and use
them to calculate the probabilities:

More common, however, is the case where we are unable to observe some variables. These are called
latent variables. Basically the way to deal with this is to fill in the parameters we can observe, then begin
with a series of guesses for all the rest.

We then follow an iterative process of using our observations to update the most likely values of the
unknown parameters, given our data and also the guesses for the other unknown parameters. This is
effectively the expectation maximisation algorithm again.

Hidden Markov Models


A hidden Markov model involves a sequence of observed discrete outputs generated from a sequence of hidden states:

These have many applications, for example tagging parts of speech of words in a sentence, identifying
biological sequences, computer vision, etc. They are formulated as a PGM:
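As an illustration of inference in this formulation, here is a minimal sketch of the forward algorithm, which computes the probability of an observed output sequence by marginalising over the hidden state sequence:

```python
import numpy as np

def hmm_forward(obs, pi, A, B):
    """P(observation sequence) for an HMM.

    pi: (k,) initial state distribution; A: (k, k) transitions with
    A[i, j] = P(next state j | state i); B: (k, m) emissions with
    B[i, o] = P(output o | state i); obs: sequence of output indices."""
    alpha = pi * B[:, obs[0]]              # joint P(state, first obs)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate, then emit
    return alpha.sum()                     # marginalise over final state
```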

Key Formulae
Expected value:

$\mathbb{E}[X] = \sum_x x\, P(X = x)$

Variance:

$\text{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$

Covariance:

$\text{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]$

Simple linear regression:

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 = (X^TX)^{-1}X^T\mathbf{y}$

Ridge regression:

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$

Gradient descent:

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \nabla L(\mathbf{w}^{(t)})$

Stochastic gradient descent: the same update, applied using the gradient for one observation at a time

Perceptron loss:

$L(s, y) = \max(0, -sy)$

SVM objective:

$\min_{\mathbf{w},\, b} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \text{ for all } i$

Kernel definition:

$K(\mathbf{u}, \mathbf{v}) = \varphi(\mathbf{u})^T \varphi(\mathbf{v})$

Polynomial kernel: $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + c)^p$, where $p$ is the integer order of the polynomial

Radial basis function kernel: $K(\mathbf{u}, \mathbf{v}) = \exp\left(-\gamma\|\mathbf{u} - \mathbf{v}\|^2\right)$, where $\gamma$ is a spread parameter

Bayesian regression:

$p(\mathbf{w}\,|\,\mathbf{X}, \mathbf{y}) \propto p(\mathbf{y}\,|\,\mathbf{X}, \mathbf{w})\, p(\mathbf{w})$

GMM parameters: for $d$-dimensional data and $k$ clusters we will have $kd$ mean parameters,
$k\,\frac{d(d+1)}{2}$ covariance parameters, and $k - 1$ weight parameters

Bayesian model selection:

$p(D\,|\,M) = \int p(D\,|\,\mathbf{w}, M)\, p(\mathbf{w}\,|\,M)\, d\mathbf{w}$

Lagrangian method:

$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_i \lambda_i\, g_i(\mathbf{x})$

Logistic regression: $P(y = 1\,|\,\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})$ with $\sigma(z) = \frac{1}{1 + e^{-z}}$

Quadratic loss function:

$L = \sum_i (y_i - \hat{y}_i)^2$
