
COMP4702 Notes 2019

Week 2 – Supervised Learning

built.Thus,ensemblescanb
eshowntohavemoreflexi
bilityinthefunctionsthey
canrepresent.This
flexibilitycan,intheory,ena
blethemtoover‐
fitthetrainingdatamoreth
anasinglemodelwould,
butinpractice,someensem
bletechniques(especiallybag
ging)tendtoreduceproble
msrelatedto
over‐
fittingofthetrainingdata.
Empirically,ensemblestendt
oyieldbetterresultswhen
thereisasignificantdiversit
yamongthe
models.Manyensemblemeth
ods,therefore,seektoprom
otediversityamongthemod
elsthey
combine
Bagging
Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.
Examquestions:
2013Q2.
d)Baggingisatechnique
forbuildinganensembles
upervisedlearningmodelby
trainingmultiple
baselearnersonbootstrapd
atasamplesfromagiventr
ainingset.Suggestapractic
altechniquefor
decidingthenumberofbase
classifierstouseinbaggin
g.Youcandescribeitin
wordsand/oruse
pseudocode.Ifyouneedto
makeanyassumptions,stat
ethem.
Theideawithhavingmultip
lebaselearnersistopick
learnersthatimproveaccurac
yinareaswhere
theotherlearnersaren’tacc
urate.Becausebaggingalso
createsmultiplemodelsoft
hesamelearner
usingbootstrappedexamples
itsbesttopickasmallnu
mberofunstableclassifiers,
asaveraging
stableclassifiersacrossmulti
plesampleshassmallerben
efit.
Onepossibletechniqueusing
cross‐validation
1:Createaseparatevalidation
set
2:Startroundsofbagging
3:Monitorvalidationseterror

4:Stopwhenitstartstorise
/bottomsout.
Boosting
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far, the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.
Supervised Learning: Algorithms that rely on labelled training data. Training generally relies
on minimisation of error over the training data.

Two main supervised learning problems: Classification and Regression

Generalisation: The capability of a model to maintain low error when evaluating unseen
data (ultimate goal is to have good generalisation). Can be measured by evaluating a trained
model on an independent test set.

Regression vs Classification: The output of regression is continuous, while the output of classification is a discrete class label.
Linear Regression: Write a numeric output as a function of input variables.
- Assume there is a deterministic function that is distorted by noise.
Bias/Variance trade-off: Less complex models produce more biased results with less variance. Low-degree polynomials tend to start out with a much lower variance of error than higher-order ones, but adding more data points does not reduce their error further (the bias remains).

Linear Discriminant Analysis (LDA)


- Focuses on maximising separability among classes.
- Find a new axis and project the data onto it in a way that maximises separation
o Maximise the distance between the means of the classes
o Minimise the variation within each class

k-Nearest Neighbour Classification


- Make a prediction by taking the most common class among the k nearest training points
o Hyperparameters: distance metric and k (number of neighbours)
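A short usage sketch, assuming scikit-learn and the iris dataset (not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Predict the majority class of the k nearest training points under a Euclidean metric.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_tr, y_tr)
print(knn.score(X_te, y_te))   # test-set accuracy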

Logistic Regression
- Classification model that passes a linear combination of the inputs through a sigmoid, giving an output between 0 and 1 that can be read as a class probability.

Model Selection
There is a fundamental trade-off between complexity, training set size and generalisation. If
the model is too simple, there may be poor predictive power on the test set. If the model is
too complex (overfit), then the model may be too specific to the training set and therefore
may have high error on the test set. An indicator of overfitting is a test error much greater than the training error.

Overfitting can be detected by using a third data set called the validation set. When the validation error begins to increase, the model is beginning to overfit. The optimum is the point of minimum validation error.

Avoiding overfitting via cross-validation, early stopping, pruning – further training may
hinder generalisation.

Occam’s Razor: Given a set of hypotheses, choose the simplest one that describes the phenomenon. In ML this means that the simplest model that adequately explains the data should be used.
Week 3 – Optimisation & Statistical Learning
Maximum Likelihood Estimation (MLE): Method of determining which model parameters
best fit the data points. The likelihood represents the probability of getting the data points
from a distribution.

Bayes’ Classifier: A classifier can be built using Bayes’ Rule:


P(C|x) = \frac{P(x|C) \, P(C)}{p(x)}

\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

P(C|x) = \frac{P(x|C) \, P(C)}{\sum_{C_i} p(x|C=C_i) \, P(C=C_i)}

P(x|C) and P( C) can be estimated from the frequency of data
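A toy sketch of estimating these from frequencies for a single discrete feature; the weather/play data below is made up purely for illustration:

from collections import Counter

# Estimate P(C) and P(x|C) from counts, then apply Bayes' rule to get P(C|x).
data = [("sunny", "play"), ("sunny", "play"), ("rain", "stay"),
        ("rain", "play"), ("sunny", "stay"), ("rain", "stay")]
class_counts = Counter(c for _, c in data)   # counts for P(C)
joint_counts = Counter(data)                 # counts for P(x, C)
n = len(data)

def posterior(c, x):
    prior = class_counts[c] / n                              # P(C)
    likelihood = joint_counts[(x, c)] / class_counts[c]      # P(x|C)
    evidence = sum(joint_counts[(x, ci)] / n for ci in class_counts)  # p(x)
    return likelihood * prior / evidence

print(posterior("play", "sunny"))   # P(play | sunny) = 2/3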

Regression using MLE: Goal to minimise the sum of squared errors


Number of Parameters in regression = order * |features| + 1

Multivariate Statistical Learning


Similar to univariate, however must use a multivariate Gaussian distribution with a mean vector (μ) and a covariance matrix (Σ).

Simple case: no covariance between features (diagonal covariance matrix). Gives a linear discriminant that predicts the class whose mean is closest to the data point. Also limits the distribution contours to ellipses aligned with the axis directions (find a picture)

General case: full covariance matrix. Gives a quadratic discriminant; prediction is based on which class pdf is highest (most likely) at the point.

Curse of dimensionality?

Confusion Matrix: A table that allows visualisation of the performance of a classification algorithm: actual class vs predicted class (shows false positives, false negatives, etc.).

Likelihood-based vs discriminant-based classification: Likelihood-based methods make assumptions about the distribution of the underlying data (to then use Bayes' rule). Discriminant-based methods do not make assumptions about the data and simply try to get the best separation possible between classes; they only estimate the boundaries.
Week 4 – Density Estimation
Estimating the distribution of data (more detailed than just using a Gaussian distribution)
Mixture Densities: Assume the data is a linear combination of different pdfs.
p(x) = \sum_{i=1}^{k} p(x|G_i) \, P(G_i)

k -> the number of mixture components

p(x) -> the mixture model density function
p(x|G_i) -> the component density (the distribution of x given that it came from the ith mixture component)
P(G_i) -> the mixture proportion (prior) – this is a weight

The parameters of the model are: k, covariance and mean vectors for each mixture and the
mixture weights/proportions.

Gaussian Mixture Model: Mixture model with pdfs as Gaussian distributions.


p(x|G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)

Probabilistic models are trained by maximising the likelihood; however, for mixture models there is no closed-form solution, so an alternative (iterative) method must be used.

Expectation Maximisation (EM): Difficulty with GMM is determining which Gaussian each
data point should belong to.

EM locally optimises log-likelihood to determine the parameters for the k-Gaussians


1. E Step: Estimate the latent variables (probability of each data point coming from a
specific cluster)
- Initialise by randomly placing k Gaussians
- For each point, determine the probability (responsibility) that it came from each Gaussian
2. M Step: re-estimate the parameters (mean, covariance, mixing proportion) of each Gaussian using these responsibilities
3. Iterate until convergence
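A minimal 1-D EM sketch for a Gaussian mixture (a simplified illustration, not the lecture code; full covariance matrices are omitted):

import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)   # initial means: random data points
    var = np.full(k, np.var(x))                 # initial variances
    pi = np.full(k, 1.0 / k)                    # initial mixing proportions
    for _ in range(n_iter):
        # E step: responsibility of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
print(em_gmm_1d(x, k=2))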

Issues with EM:


1. Starting conditions – must be chosen to initialise the search. The algorithm may get
different results based on where it begins (spurious/local optima) – variance driven
towards 0.
2. Selection of number of models (k) – hyperparameter, sometimes given otherwise
use cross-validation.

Non-Parametric Density Estimation: Non-parametric models let the data speak for itself; no parametric form is assumed for the underlying distribution. These estimates can be used as the likelihoods in Bayes' rule.

Histogram: hyperparameter is bin width, keep a count of the number of data points within
each bin and plot the distribution.

Kernel density estimator


1. Define some kernel function (most commonly a Gaussian) – with a uniform (box) kernel the estimate behaves like a histogram
2. Centre a kernel function on each data point
3. Sum the contribution of each kernel over the dataset and normalise (divide by N) so that the estimate integrates to 1 (it is a probability density estimate)

Choice of bandwidth (h) is an important hyperparameter. If h is large, each kernel is spread widely and the estimate is smoother.
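A minimal Gaussian-kernel KDE sketch (assumed illustration): one kernel per data point, summed and divided by N·h so the estimate integrates to 1:

import numpy as np

def kde(x_query, data, h):
    u = (x_query[:, None] - data[None, :]) / h     # pairwise scaled distances
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi) # Gaussian kernel values
    return k.sum(axis=1) / (len(data) * h)         # sum over data points, normalise

data = np.random.normal(0, 1, 500)
grid = np.linspace(-4, 4, 200)
density = kde(grid, data, h=0.3)   # smaller h -> spikier estimate, larger h -> smoother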

k-nearest neighbour
- Fix k and grow a volume around the query point until it contains k data points; the density estimate is then k/(N·V), so the amount of smoothing adapts to the local density of the data.

Autoencoder
The idea is to construct a neural network with hidden layers where the output is trained to match the input. Doing so creates an encoder phase and a decoder phase. At the narrowest layer of the network the data is compressed but can still represent the input well, so the dimensionality can be reduced by using these values: it is a denser representation of the input data.

Question 3
a)
i)
([784+1]*50) + ([50+1]*50) + ([50+1]*2) + ([2+1]*50) + ([50+1]*50) + ([50+1]*784) = 84586
ii)
Because the number of outputs of an autoencoder matches the number of inputs, the goal is to reproduce the input directly on the output. Any hidden layer narrower than the input imposes a dimensionality constraint at that layer. Therefore the output of such a hidden layer can be taken as a reduced-dimensionality form of the data, provided the network can still produce output ≈ input.
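A quick check of that count in Python, assuming the 784-50-50-2-50-50-784 dense architecture from the question (weights plus one bias per unit in each layer):

# Parameter count for a fully connected autoencoder: (n_in + 1) * n_out per layer.
layers = [784, 50, 50, 2, 50, 50, 784]
params = sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)  # 84586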
Week 5 – Clustering
Unsupervised learning that tries to look for grouping structure in a dataset. Related to
density estimation as trying to locate areas of high density.

k-means clustering
Performs local optimisation of the distance between points and their cluster centres. Very similar to the EM algorithm for density estimation, but with a hard binary assignment of each point to a cluster (rather than a Gaussian responsibility).
1. Initialise k clusters (randomly) – have to choose k
2. Calculate closest centre for each data point
3. Recalculate the centre of the cluster with the new points
4. Repeat until convergence
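A minimal NumPy sketch of the four steps above (assumed illustration; the empty-cluster guard is an extra detail not in the notes):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]       # 1. random initialisation
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                 # 2. closest centre per point
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])  # 3. recompute centres
        if np.allclose(new_centres, centres):                     # 4. stop at convergence
            break
        centres = new_centres
    return centres, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(kmeans(X, k=2)[0])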

Issues with k-means:


Poor initialisation may lead to a poor convergent solution. Easily overcome by running with
multiple initialisations and selecting the ‘best’ result.

Hierarchical clustering
Build a tree (dendrogram) that indicates the distance at which points/clusters merge. A cut through the dendrogram at a chosen distance then identifies a particular number of clusters.

Mean shift clustering

Place a window (kernel) on each point and repeatedly shift it towards the mean of the points inside it, i.e. move uphill on a kernel density estimate. Points whose windows converge to the same mode form a cluster, so the number of clusters does not have to be specified in advance.
Week 6 – Dimensionality Reduction
Reducing the dimensionality is used as a pre-processing step to reduce time and space used
in analysis. Leads to a simplified model (robust on smaller datasets) that is more
interpretable (simple explanations) and easier to visualise.
- Alleviate curse of dimensionality
- Enhanced generalisation
- Faster learning

Feature selection vs Feature extraction: Feature selection is a method of using a subset of features in the model (i.e. discarding features that don't contribute to the explanation). Feature extraction is the process of remapping the useful features into a new space, transforming them into new features. Examples include LDA, PCA, t-SNE, MDS.

Forward search: A greedy method of feature selection. Greedily add the feature that gives the minimum error; when a feature is added, check whether the total error has decreased – if not, terminate the algorithm at the previous step. Does not always find the optimal subset because of the greedy decisions.

Feature Extraction algorithms


Principal Component Analysis (PCA)
Performs a rotation of the data onto a new set of axes (keeping only a lower-dimensional subset of them). These new axes are chosen to lie in the directions of greatest variance.

The eigenvalues of the covariance matrix of the data indicate the 'explanatory power' of each new axis towards the total variance, and each eigenvector gives the direction associated with its eigenvalue. Hence selecting the eigenvectors that correspond to the largest k eigenvalues gives the data reduced to the k dimensions that explain the most variance. Highly correlated data points produce clusters.

Scree-graph: Used to determine the number of principal components to use in PCA feature extraction. A plot of eigenvalue vs component number, i.e. the eigenvalues sorted in descending order. Best to select the number of components at the elbow of the graph.
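A minimal PCA sketch via eigen-decomposition of the covariance matrix (assumed illustration); the sorted eigenvalues it returns are exactly what a scree graph plots:

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: real symmetric matrix
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    components = eigvecs[:, order[:k]]           # directions of greatest variance
    return Xc @ components, eigvals[order]       # projected data + scree values

X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # correlated synthetic features
Z, eigvals = pca(X, k=2)
print(Z.shape, eigvals)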

Multidimensional Scaling (MDS)


Goal of transforming data into a lower dimension while preserving the pairwise distances between points.
MDS with Euclidean distances on standardised data is equivalent to PCA on a correlation matrix (as opposed to a covariance matrix).
Sammon mapping: Form of MDS that aims to preserve smaller distances rather than larger
ones.

Fishers LDA
Supervised technique which aims to find a direction such that projecting the data onto it makes the classes well separated. It does so by maximising the distance between the class means while minimising the variance within each class.
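A short usage sketch of LDA as a dimensionality-reduction step, assuming scikit-learn and the iris dataset (not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Project onto directions that separate the classes, using the labels (unlike PCA).
X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(Z.shape)   # (150, 2)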

LDA vs PCA
Both reduce dimensionality by trying to find a linear combination of the features. LDA is
supervised and attempts to model differences between classes. PCA is unsupervised and
instead tries to preserve the trend in data (maintaining the variance).

t-distributed Stochastic Neighbour Embedding (t-SNE)


Essentially aims to preserve as much structure as possible from the higher dimensional
space in a lower dimension (local and global structure).
1. Convert pairwise distances between data points into conditional probabilities (representing the similarity between a point and all other points)
2. Project the points into the reduced dimension (initially at random)
3. Minimise the KL divergence (a measure of dissimilarity between the two sets of similarities) using gradient descent, with a t-distribution used for similarities in the low-dimensional space
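A short usage sketch, assuming scikit-learn's TSNE and the digits dataset (not part of the original notes):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Embed the 64-dimensional digits data in 2-D for visualisation.
X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)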
Week 7-9 – Neural Networks
A NN is a set of input/output neurons connected by weighted links.

A neuron with a linear activation function reflects linear regression.

A neuron with a sigmoid activation function is equivalent to logistic regression (output between 0 and 1).

The original neural model was called a McCulloch-Pitts neuron and used a step activation
function for some specified threshold value.
- Cannot train with backprop as the function is not differentiable

Importance of bias/dummy input: Allows for the model to incorporate offset. It is no longer
required to output 0 given an input of 0.

Perceptron: A single layer of a neural network (comprised of a set of independent neurons that share the same inputs but have their own weights).

(Single) Perceptron learning rule: The original method of training a perceptron; it requires that the data is linearly separable for convergence (hence it fails on the XOR problem).

Gradient Descent Learning (Backpropagation)


Gradient descent is a numerical method for solving optimisation problems. The algorithm is used to update the weights in the network in order to minimise an error function. It has one hyperparameter – the learning rate.

Learning rate: Controls the step size taken along the error surface during optimisation. It is difficult to select – if it is too high, the weight changes no longer follow the local gradient and the algorithm can overstep the solution and oscillate; if it is too small, the network takes too long to learn.

Momentum: A term that can be added to the backpropagation update (a second hyperparameter). It aims to reduce oscillations in gradient descent, effectively giving a variable learning rate based on the trajectory of the algorithm.
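A minimal sketch of gradient descent with momentum on a least-squares error surface (an assumed generic example, not the lecture's MLP): the velocity term accumulates past gradients, which damps oscillations.

import numpy as np

def train(X, y, lr=0.01, momentum=0.9, epochs=500):
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        v = momentum * v - lr * grad            # velocity: smoothed step direction
        w = w + v
    return w

X = np.column_stack([np.ones(100), np.random.randn(100)])  # bias column + one input
y = 3 + 2 * X[:, 1] + 0.1 * np.random.randn(100)
print(train(X, y))   # approximately [3, 2]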

Why use sigmoid functions: Sigmoid are non-linear, continuous and differentiable versions
of a hard threshold function. This means they can be used to solve linearly non-separable
problems. Differentiability is also important as the network can then be learned through
gradient descent/backpropagation. When the output saturates (inputs of large magnitude), the sigmoid derivative is very small, resulting in small weight updates that can make learning slow.

Weight initialisation: Weights should be initialised close to 0 and of similar magnitude, otherwise large random differences are propagated through the network and different nodes learn at different rates. The inputs should also be standardised, and it is beneficial for them to be uncorrelated so the error can be minimised independently along each weight. Large initial weights can lead to slow training because the sigmoid outputs saturate and have low partial derivatives.
Determining the number of hidden layers/units: Difficult to do (similar to choosing the order in polynomial regression) because it controls the complexity of the model. There is a trade-off between training error and generalisation, so it is best to use cross-validation to see what works best.

Stochastic (on-line) vs batch learning


Batch learning updates the weights using the total error function over the training set.
Stochastic learning updates the weights after each training point using only that point's gradient. Batch learning is simpler to analyse than stochastic methods. Stochastic learning is normally faster as you do not have to compute averages over the whole set. It can also allow for moves that increase the error,
which can result in a better final solution.

Controlling MLP Complexity


Early stopping: Stop the training early if overfitting is detected, i.e. when the validation error begins to increase (at its minimum point). This limits the effective complexity of the model.

Regularisation: Methods to avoid overfitting to training data (maintain generalisation)


1. Weight decay: Add a penalty for large weights to the error function (see the worked form after this list). This suppresses irrelevant components by driving their weights towards smaller values, reduces the effect of noise in the data, and keeps the sigmoids operating in regions of moderate gradient. The result is a much smoother/simpler discriminant, which is likely to generalise better.
2. Early stopping
3. Cross-validation??
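As a worked form of weight decay (standard L2 regularisation, not taken from the notes themselves), the penalised error and the resulting gradient-descent update are

E_{reg}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i w_i^2, \qquad w_i \leftarrow w_i - \eta \left( \frac{\partial E}{\partial w_i} + 2 \lambda w_i \right)

so each step shrinks every weight slightly towards zero in addition to following the error gradient.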
Week 9 – Deep Learning

Neural Network with a large number of hidden layers, trained again with backpropagation.

They normally use ReLU activation functions instead of sigmoids because they train much faster.
Because the networks are much larger than previously discussed, training time becomes a
serious consideration.

Convolutional Networks/Layers
A specific structure designed to process data that comes in arrays (images, etc.). Each layer performs a convolution: a small filter of learned weights is slid across the input and applied to each local subset (patch) of it.

Pooling Layer
These layers are placed after convolutional layers and pool the results from neighbouring units (e.g. taking the maximum or average over a small region). This creates invariance to small shifts and distortions of the input.
Week 10 – Kernel Machines
Discriminant-based: Don’t have to estimate densities. Discriminant is defined by support
vectors.

Margin: Distance from a discriminant to the closest data points (in both directions).

Support vector: Goal of a support vector machine is to maximise the margin. The support
vectors are the data points that limit the margin and hence define the discriminant (optimal
separating hyperplane).

Soft Margin Hyperplane: When the points are not linearly separable, a soft error function
allows for there to be classification errors.

Basis function: A function that transforms data into a higher-dimensional space. This can be used when maximising the margin to create a non-linear discriminant, but computing the transform explicitly is computationally intensive.

Kernel function: A similarity function – the larger its value, the more similar the two points are. It is the dot product in the higher-dimensional space.

Kernel Trick: Reduces computation by not requiring the use of the basis function. As such,
the higher dimensional transform can be avoided. Only need to know the kernel. This makes
kernel methods very adaptive.
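A small numerical check of the kernel trick (assumed illustration) using the polynomial kernel K(x, z) = (x·z + 1)^2: evaluating the kernel directly gives the same value as a dot product after an explicit quadratic basis function, which never has to be computed.

import numpy as np

def phi(x):  # explicit quadratic feature map for a 2-D input
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1*x1, x2*x2, np.sqrt(2)*x1*x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print((x @ z + 1) ** 2)   # kernel evaluated directly: 0.25
print(phi(x) @ phi(z))    # same value via the explicit basis function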
Week 11 – Combined Learners/Ensembles
Comparing ML Algorithms

Cross-Validation (used here for comparing algorithms, as distinct from its use in model selection)


Divide data into training, validation and test sets.
1. Partition data into k equal subsets
2. Train the model on all but one subset
3. Test on the held-out subset
4. Repeat for each subset
5. Generalisation error is the average of the hold out test errors
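A short usage sketch of k-fold cross-validation, assuming scikit-learn; the decision-tree learner and synthetic data are placeholders:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Estimate generalisation error as the average held-out error over 5 folds.
X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(1 - scores.mean())   # average hold-out error across the folds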

Bootstrapping
Sample with replacement to generate new datasets. Replacement means the same data point can appear multiple times and across datasets. Doing this many times can be used to infer characteristics (e.g. the mean and variance) of an estimator.
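A minimal bootstrap sketch (assumed illustration): resample with replacement many times to estimate the variability of the sample mean.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10, 2, size=200)
# Each bootstrap dataset is the same size as the original, drawn with replacement.
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print(np.std(boot_means))   # bootstrap estimate of the standard error of the mean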

Confusion matrices
Given labelled data and a classifier, a confusion matrix gives the breakdown of the predictive accuracy of the model. The diagonal indicates correct classifications, and the off-diagonal entries indicate misclassifications.

Ensemble Learners
Use multiple learning algorithms to obtain better performance. Multiple supervised learners are built and combined by voting (for classification) or weighting/averaging (for regression). Ensembles tend to perform better when the base learners are uncorrelated and diverse.

Bagging
An ensemble learning technique that uses bootstrapping to generate L training sets to train
L base-learners.

Selecting the number of base-learners for bagging


The best outcome is to select learners that improve accuracy in areas where the others lack it. It is best to use a small number of unstable classifiers, because averaging them across bootstrap samples gives the largest benefit (averaging stable classifiers changes little). Practically, the number of base learners could be chosen using cross-validation, terminating when the validation error is minimised. Could also use forward selection??

Boosting
Incrementally build an ensemble by training each new model on the previous models' misclassifications. It may achieve better accuracy than bagging but is more prone to over-fitting the training data. An example of boosting is the AdaBoost algorithm:
1. Initialise all data points to have an equal probability of selection
2. Randomly select some number of data points for the sample based on their
probability
3. Train the learner on the selected subset
4. Test the learner on each data point in the full training set
5. Calculate error and the learner weight (Beta – this is based on how well it classified
the data).
6. Decrease the probability of each data point if correctly classified
7. Combine the learners, scaling by the log inverse of their error rate (higher
classification accuracy means heavier weight in the output)
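A short usage sketch, assuming scikit-learn's AdaBoostClassifier (whose default base learner is a depth-1 decision tree, i.e. a decision stump); the synthetic data is a placeholder:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Boost 50 weak learners, each trained with emphasis on previously misclassified points.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))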

Adapting AdaBoost to Bagging


By removing the early-stopping condition and the probability adjustment from the AdaBoost algorithm, it becomes a simple bagging technique. When testing, the bagging output is also produced by a majority vote instead of the weighted combination.

Improving Ensembles
Try to get independence between the classifiers, either by subset selection (training on different feature or data subsets) or by using meta-classifiers / decorrelating transforms (e.g. PCA).
Week 12 – Graphical Models/Bayesian Networks
Decision Trees
Supervised learning technique that provides a structure for classification based on specific
feature criteria. Benefit of transparency and interpretability (you can see exactly how the
classification decision was derived).

Impurity: A measure of how mixed the class labels of the training samples reaching a node are; a pure node contains samples from only one class.

Learning of a decision tree is achieved by splitting nodes to minimise impurity. Measures of impurity include:
1. Entropy
2. Gini index
3. Misclassification
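Minimal implementations of the three impurity measures (assumed illustration), each taking a vector of class proportions p at a node:

import numpy as np

def entropy(p):            # highest when the classes are evenly mixed
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):               # 1 minus the sum of squared proportions
    return 1 - np.sum(p ** 2)

def misclassification(p):  # error of predicting the majority class
    return 1 - np.max(p)

p = np.array([0.5, 0.5])   # a maximally impure binary node
print(entropy(p), gini(p), misclassification(p))   # 1.0 0.5 0.5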

Random forests
An ensemble of decision trees built with bagging, where each split additionally considers only a random subset of the features. This further decorrelates the trees, so averaging their votes reduces variance.

Bayesian Networks

Number of rows in the joint distribution table:

|P(x, y, z)| = |x| × |y| × |z|, where |x| is the number of values x can take

Number of rows in a marginal table = the product of the number of rows in each conditional (commonly the conditional independences encoded in the Bayesian network reduce the table sizes needed to specify the joint distribution).

There are three main node-connection structures that describe a Bayesian network:

1. Head-to-tail (chain, X → Y → Z):

P(X, Y, Z) = P(X) \, P(Y|X) \, P(Z|Y)
P(Z|X) = P(Z|Y) \, P(Y|X) + P(Z|\neg Y) \, P(\neg Y|X)

2. Tail-to-tail (common parent, Y ← X → Z):

P(X, Y, Z) = P(X) \, P(Y|X) \, P(Z|X)

3. Head-to-head (common child, X → Z ← Y):

P(X, Y, Z) = P(X) \, P(Y) \, P(Z|X, Y)
P(Z|X) = P(Z|X, Y) \, P(Y) + P(Z|X, \neg Y) \, P(\neg Y)
The above cases can be chained together to form more complex distributions. Bayes rule
can be used to find the reversed conditional probabilities.
Week 13 – Bayesian Inference

The main idea is that when creating a model, there is uncertainty surrounding the parameters used in the model. Bayesian inference works by updating the model as observations are made. It starts with a prior distribution over the parameters; this may be determined by guessing or through expert knowledge. When observations are made, the model is updated via Bayes' Rule to obtain the posterior, giving a new adjusted distribution. The updated posterior is the new model, based on our original belief (the prior) and the observed data.

The posterior is proportional to the likelihood times the prior: p(θ|D) ∝ p(D|θ) p(θ)
