
COMP4702 Notes 2019

Week 2 – Supervised Learning

built.Thus,ensemblescanb
eshowntohavemoreflexi
bilityinthefunctionsthey
canrepresent.This
flexibilitycan,intheory,ena
blethemtoover‐
fitthetrainingdatamoreth
anasinglemodelwould,
butinpractice,someensem
bletechniques(especiallybag
ging)tendtoreduceproble
msrelatedto
over‐
fittingofthetrainingdata.
Empirically,ensemblestendt
oyieldbetterresultswhen
thereisasignificantdiversit
yamongthe
models.Manyensemblemeth
ods,therefore,seektoprom
otediversityamongthemod
elsthey
combine
Bagging
Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.
Examquestions:
2013Q2.
d)Baggingisatechnique
forbuildinganensembles
upervisedlearningmodelby
trainingmultiple
baselearnersonbootstrapd
atasamplesfromagiventr
ainingset.Suggestapractic
altechniquefor
decidingthenumberofbase
classifierstouseinbaggin
g.Youcandescribeitin
wordsand/oruse
pseudocode.Ifyouneedto
makeanyassumptions,stat
ethem.
Theideawithhavingmultip
lebaselearnersistopick
learnersthatimproveaccurac
yinareaswhere
theotherlearnersaren’tacc
urate.Becausebaggingalso
createsmultiplemodelsoft
hesamelearner
usingbootstrappedexamples
itsbesttopickasmallnu
mberofunstableclassifiers,
asaveraging
stableclassifiersacrossmulti
plesampleshassmallerben
efit.
Onepossibletechniqueusing
cross‐validation
1:Createaseparatevalidation
set
2:Startroundsofbagging
3:Monitorvalidationseterror

4:Stopwhenitstartstorise
/bottomsout.
Boosting
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far, the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.
Supervised Learning: Algorithms that rely on labelled training data. Training generally relies
on minimisation of error over the training data.

Two main supervised learning problems: Classification and Regression

Generalisation: The capability of a model to maintain low error when evaluating unseen
data (ultimate goal is to have good generalisation). Can be measured by evaluating a trained
model on an independent test set.

Regression vs Classification: The output of regression is continuous, while the output of classification is a discrete class label.
Linear Regression: Write a numeric output as a function of input variables.
- Assume there is a deterministic function that is distorted by noise.
Bias/Variance trade-off: Less complex models produce more biased results with less variance. Low-degree polynomials tend to start out with a much lower variance of error than higher-order ones, but adding more data points does not reduce their error further (the bias remains).

Linear Discriminant Analysis (LDA)


- Focuses on maximising separability among classes.
- Find a new axis and project the data onto it in a way that maximises separation
o Maximise the distance between the means of the classes
o Minimise the variation within each class

k-Nearest Neighbour Classification


- Make a prediction by taking the most common class among the k nearest training points
o Hyperparameters: distance metric and k (number of neighbours)
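A short usage sketch, assuming scikit-learn and the iris dataset (not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Predict the majority class of the k nearest training points under a Euclidean metric.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_tr, y_tr)
print(knn.score(X_te, y_te))   # test-set accuracy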

Logistic Regression
- Classification model that passes a linear combination of the inputs through a sigmoid, giving an output between 0 and 1 that can be read as a class probability.

Model Selection
There is a fundamental trade-off between complexity, training set size and generalisation. If
the model is too simple, there may be poor predictive power on the test set. If the model is
too complex (overfit), then the model may be too specific to the training set and therefore
may have high error on the test set. An indicator of overfitting is a test error much greater than the training error.

Overfitting can be detected by using a third data set called the validation set. When the validation error begins to increase, the model is beginning to overfit. The optimum is the point of minimum validation error.

Avoiding overfitting via cross-validation, early stopping, pruning – further training may
hinder generalisation.

Occam’s Razor: Given a set of hypotheses, choose the simplest one that describes the phenomenon. In ML this means that the simplest model that adequately explains the data should be used.
Week 3 – Optimisation & Statistical Learning
Maximum Likelihood Estimation (MLE): Method of determining which model parameters
best fit the data points. The likelihood represents the probability of getting the data points
from a distribution.

Bayes’ Classifier: A classifier can be built using Bayes’ Rule:


P(C|x) = \frac{P(x|C) \, P(C)}{p(x)}

\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

P(C|x) = \frac{P(x|C) \, P(C)}{\sum_{C_i} p(x|C=C_i) \, P(C=C_i)}

P(x|C) and P( C) can be estimated from the frequency of data
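A toy sketch of estimating these from frequencies for a single discrete feature; the weather/play data below is made up purely for illustration:

from collections import Counter

# Estimate P(C) and P(x|C) from counts, then apply Bayes' rule to get P(C|x).
data = [("sunny", "play"), ("sunny", "play"), ("rain", "stay"),
        ("rain", "play"), ("sunny", "stay"), ("rain", "stay")]
class_counts = Counter(c for _, c in data)   # counts for P(C)
joint_counts = Counter(data)                 # counts for P(x, C)
n = len(data)

def posterior(c, x):
    prior = class_counts[c] / n                              # P(C)
    likelihood = joint_counts[(x, c)] / class_counts[c]      # P(x|C)
    evidence = sum(joint_counts[(x, ci)] / n for ci in class_counts)  # p(x)
    return likelihood * prior / evidence

print(posterior("play", "sunny"))   # P(play | sunny) = 2/3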

Regression using MLE: Goal to minimise the sum of squared errors


Number of Parameters in regression = order * |features| + 1

Multivariate Statistical Learning


Similar to univariate, however must use a multivariate Gaussian distribution with a mean vector (μ) and a covariance matrix (Σ).

Simple case: no covariance between features (diagonal covariance matrix). Gives a linear discriminant that predicts the class whose mean is closest to the data point. Also limits the distribution contours to ellipses aligned with the axis directions (find a picture)

General case: full covariance matrix. Gives a quadratic discriminant; prediction is based on which class pdf is highest (most likely) at the point.

Curse of dimensionality?

Confusion Matrix: A table that allows visualisation of the performance of a classification algorithm: actual class vs predicted class (shows false positives, false negatives, etc.).

Likelihood-based vs discriminant-based classification: Likelihood-based methods make assumptions about the distribution of the underlying data (to then use Bayes' rule). Discriminant-based methods do not make assumptions about the data and simply try to get the best separation possible between classes; they only estimate the boundaries.
Week 4 – Density Estimation
Estimating the distribution of data (more detailed than just using a Gaussian distribution)
Mixture Densities: Assume the data is a linear combination of different pdfs.
p(x) = \sum_{i=1}^{k} p(x|G_i) \, P(G_i)

k -> the number of mixture components

p(x) -> the mixture model density function
p(x|G_i) -> the component density (the distribution of x given that it came from the ith mixture component)
P(G_i) -> the mixture proportion (prior) – this is a weight

The parameters of the model are: k, covariance and mean vectors for each mixture and the
mixture weights/proportions.

Gaussian Mixture Model: Mixture model with pdfs as Gaussian distributions.


p(x|G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)

Probabilistic models are trained by maximising the likelihood; however, for mixture models there is no closed-form solution, so an alternative (iterative) method must be used.

Expectation Maximisation (EM): Difficulty with GMM is determining which Gaussian each
data point should belong to.

EM locally optimises log-likelihood to determine the parameters for the k-Gaussians


1. E Step: Estimate the latent variables (probability of each data point coming from a
specific cluster)
- Initialise by randomly placing k Gaussians
- For each point, determine the probability (responsibility) that it came from each Gaussian
2. M Step: re-estimate the parameters (mean, covariance, mixing proportion) of each Gaussian using these responsibilities
3. Iterate until convergence
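A minimal 1-D EM sketch for a Gaussian mixture (a simplified illustration, not the lecture code; full covariance matrices are omitted):

import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)   # initial means: random data points
    var = np.full(k, np.var(x))                 # initial variances
    pi = np.full(k, 1.0 / k)                    # initial mixing proportions
    for _ in range(n_iter):
        # E step: responsibility of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
print(em_gmm_1d(x, k=2))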

Issues with EM:


1. Starting conditions – must be chosen to initialise the search. The algorithm may get
different results based on where it begins (spurious/local optima) – variance driven
towards 0.
2. Selection of number of models (k) – hyperparameter, sometimes given otherwise
use cross-validation.

Non-Parametric Density Estimation: Non-parametric models let the data speak for itself; no parametric form is assumed for the underlying distribution. These estimates can be used as the likelihoods in Bayes' rule.

Histogram: hyperparameter is bin width, keep a count of the number of data points within
each bin and plot the distribution.

Kernel density estimator


1. Define some kernel function (most commonly a Gaussian) – with a uniform (box) kernel the estimate behaves like a histogram
2. Centre a kernel function on each data point
3. Sum the contribution of each kernel over the dataset and normalise (divide by N) so that the estimate integrates to 1 (it is a probability density estimate)

Choice of bandwidth (h) is an important hyperparameter. If h is large, each kernel is spread widely and the estimate is smoother.
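A minimal Gaussian-kernel KDE sketch (assumed illustration): one kernel per data point, summed and divided by N·h so the estimate integrates to 1:

import numpy as np

def kde(x_query, data, h):
    u = (x_query[:, None] - data[None, :]) / h     # pairwise scaled distances
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi) # Gaussian kernel values
    return k.sum(axis=1) / (len(data) * h)         # sum over data points, normalise

data = np.random.normal(0, 1, 500)
grid = np.linspace(-4, 4, 200)
density = kde(grid, data, h=0.3)   # smaller h -> spikier estimate, larger h -> smoother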

k-nearest neighbour
- Fix k and grow a volume around the query point until it contains k data points; the density estimate is then k/(N·V), so the amount of smoothing adapts to the local density of the data.

Autoencoder
The idea is to construct a neural network with hidden layers where the output is trained to match the input. Doing so creates an encoder phase and a decoder phase. At the narrowest layer of the network the data is compressed but can still represent the input well, so the dimensionality can be reduced by using these values: it is a denser representation of the input data.

Question 3
a)
i)
([784+1]*50) + ([50+1]*50) + ([50+1]*2) + ([2+1]*50) + ([50+1]*50) + ([50+1]*784) = 84586
ii)
Because the number of outputs of an autoencoder matches the number of inputs, the goal is to reproduce the input directly on the output. Any hidden layer narrower than the input imposes a dimensionality constraint at that layer. Therefore the output of such a hidden layer can be taken as a reduced-dimensionality form of the data, provided the network can still produce output ≈ input.
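A quick check of that count in Python, assuming the 784-50-50-2-50-50-784 dense architecture from the question (weights plus one bias per unit in each layer):

# Parameter count for a fully connected autoencoder: (n_in + 1) * n_out per layer.
layers = [784, 50, 50, 2, 50, 50, 784]
params = sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)  # 84586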
Week 5 – Clustering
Unsupervised learning that tries to look for grouping structure in a dataset. Related to
density estimation as trying to locate areas of high density.

k-means clustering
Performs local optimisation of the distance between points and their cluster centres. Very similar to the EM algorithm for density estimation, but with a hard binary assignment of each point to a cluster (rather than a Gaussian responsibility).
1. Initialise k clusters (randomly) – have to choose k
2. Calculate closest centre for each data point
3. Recalculate the centre of the cluster with the new points
4. Repeat until convergence
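A minimal NumPy sketch of the four steps above (assumed illustration; the empty-cluster guard is an extra detail not in the notes):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]       # 1. random initialisation
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                 # 2. closest centre per point
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])  # 3. recompute centres
        if np.allclose(new_centres, centres):                     # 4. stop at convergence
            break
        centres = new_centres
    return centres, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(kmeans(X, k=2)[0])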

Issues with k-means:


Poor initialisation may lead to a poor convergent solution. Easily overcome by running with
multiple initialisations and selecting the ‘best’ result.

Hierarchical clustering
Build a tree (dendrogram) that indicates the distance at which points/clusters merge. A cut through the dendrogram at a chosen distance then identifies a particular number of clusters.

Mean shift clustering

Place a window (kernel) on each point and repeatedly shift it towards the mean of the points inside it, i.e. move uphill on a kernel density estimate. Points whose windows converge to the same mode form a cluster, so the number of clusters does not have to be specified in advance.
Week 6 – Dimensionality Reduction
Reducing the dimensionality is used as a pre-processing step to reduce time and space used
in analysis. Leads to a simplified model (robust on smaller datasets) that is more
interpretable (simple explanations) and easier to visualise.
- Alleviate curse of dimensionality
- Enhanced generalisation
- Faster learning

Feature selection vs Feature extraction: Feature selection is a method of using a subset of features in the model (i.e. discarding features that don't contribute to the explanation). Feature extraction is the process of remapping the useful features into a new space, transforming them into new features. Examples include LDA, PCA, t-SNE, MDS.

Forward search: A greedy method of feature selection. Greedily add the feature that gives the minimum error; when a feature is added, check whether the total error has decreased – if not, terminate the algorithm at the previous step. Does not always find the optimal subset because of the greedy decisions.

Feature Extraction algorithms


Principal Component Analysis (PCA)
Performs a rotation of the data onto a new set of axes (keeping only a lower-dimensional subset of them). These new axes are chosen to lie in the directions of greatest variance.

The eigenvalues of the covariance matrix of the data indicate the 'explanatory power' of each new axis towards the total variance, and each eigenvector gives the direction associated with its eigenvalue. Hence selecting the eigenvectors that correspond to the largest k eigenvalues gives the data reduced to the k dimensions that explain the most variance. Highly correlated data points produce clusters.

Scree-graph: Used to determine the number of principal components to use in PCA feature extraction. A plot of eigenvalue vs component number, i.e. the eigenvalues sorted in descending order. Best to select the number of components at the elbow of the graph.
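A minimal PCA sketch via eigen-decomposition of the covariance matrix (assumed illustration); the sorted eigenvalues it returns are exactly what a scree graph plots:

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                      # centre the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: real symmetric matrix
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    components = eigvecs[:, order[:k]]           # directions of greatest variance
    return Xc @ components, eigvals[order]       # projected data + scree values

X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # correlated synthetic features
Z, eigvals = pca(X, k=2)
print(Z.shape, eigvals)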

Multidimensional Scaling (MDS)


Goal of transforming data into a lower dimension while preserving the pairwise distances between points.
MDS with Euclidean distances on standardised data is equivalent to PCA on a correlation matrix (as opposed to a covariance matrix).
Sammon mapping: Form of MDS that aims to preserve smaller distances rather than larger
ones.

Fishers LDA
Supervised technique which aims to find a direction such that projecting the data onto it makes the classes well separated. It does so by maximising the distance between the class means while minimising the variance within each class.
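A short usage sketch of LDA as a dimensionality-reduction step, assuming scikit-learn and the iris dataset (not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Project onto directions that separate the classes, using the labels (unlike PCA).
X, y = load_iris(return_X_y=True)
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(Z.shape)   # (150, 2)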

LDA vs PCA
Both reduce dimensionality by trying to find a linear combination of the features. LDA is
supervised and attempts to model differences between classes. PCA is unsupervised and
instead tries to preserve the trend in data (maintaining the variance).

t-distributed Stochastic Neighbour Embedding (t-SNE)


Essentially aims to preserve as much structure as possible from the higher dimensional
space in a lower dimension (local and global structure).
1. Convert pairwise distances between data points into conditional probabilities (representing the similarity between a point and all other points)
2. Project the points into the reduced dimension (initially at random)
3. Minimise the KL divergence (a measure of dissimilarity between the two sets of similarities) using gradient descent, with a t-distribution used for similarities in the low-dimensional space
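A short usage sketch, assuming scikit-learn's TSNE and the digits dataset (not part of the original notes):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Embed the 64-dimensional digits data in 2-D for visualisation.
X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)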
Week 7-9 – Neural Networks
A NN is a set of input/output neurons connected by weighted links.

A neuron with a linear activation function reflects linear regression.

A neuron with a sigmoid activation function is equivalent to logistic regression (output between 0 and 1).

The original neural model was called a McCulloch-Pitts neuron and used a step activation
function for some specified threshold value.
- Cannot train with backprop as the function is not differentiable

Importance of bias/dummy input: Allows for the model to incorporate offset. It is no longer
required to output 0 given an input of 0.

Perceptron: A single layer of a neural network (comprised of a set of independent neurons that share the same inputs but have their own weights).

(Single) Perceptron learning rule: The original method of training a perceptron; it requires that the data is linearly separable for convergence (hence it fails on the XOR problem).

Gradient Descent Learning (Backpropagation)


Gradient descent is a numerical method for solving optimisation problems. The algorithm is used to update the weights in the network in order to minimise an error function. It has one hyperparameter – the learning rate.

Learning rate: Controls the step size taken along the error surface during optimisation. It is difficult to select – if it is too high, the weight changes no longer follow the local gradient and the algorithm can overstep the solution and oscillate; if it is too small, the network takes too long to learn.

Momentum: A term that can be added to the backpropagation update (a second hyperparameter). It aims to reduce oscillations in gradient descent, effectively giving a variable learning rate based on the trajectory of the algorithm.
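A minimal sketch of gradient descent with momentum on a least-squares error surface (an assumed generic example, not the lecture's MLP): the velocity term accumulates past gradients, which damps oscillations.

import numpy as np

def train(X, y, lr=0.01, momentum=0.9, epochs=500):
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        v = momentum * v - lr * grad            # velocity: smoothed step direction
        w = w + v
    return w

X = np.column_stack([np.ones(100), np.random.randn(100)])  # bias column + one input
y = 3 + 2 * X[:, 1] + 0.1 * np.random.randn(100)
print(train(X, y))   # approximately [3, 2]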

Why use sigmoid functions: Sigmoid are non-linear, continuous and differentiable versions
of a hard threshold function. This means they can be used to solve linearly non-separable
problems. Differentiability is also important as the network can then be learned through
gradient descent/backpropagation. When the output saturates (inputs of large magnitude), the sigmoid derivative is very small, resulting in small weight updates that can make learning slow.

Weight initialisation: Weights should be initialised close to 0 and of similar magnitude, otherwise large random differences are propagated through the network and different nodes learn at different rates. The inputs should also be standardised, and it is beneficial for them to be uncorrelated so the error can be minimised independently along each weight. Large initial weights can lead to slow training because the sigmoid outputs saturate and have low partial derivatives.
Determining the number of hidden layers/units: Difficult to do (similar to choosing the order in polynomial regression) because it controls the complexity of the model. There is a trade-off between training error and generalisation, so it is best to use cross-validation to see what works best.

Stochastic (on-line) vs batch learning


Batch learning updates the weights using the total error function over the training set.
Stochastic learning updates the weights after each training point using only that point's gradient. Batch learning is simpler to analyse than stochastic methods. Stochastic learning is normally faster as you do not have to compute averages over the whole set. It can also allow for moves that increase the error,
which can result in a better final solution.

Controlling MLP Complexity


Early stopping: Stop the training early if overfitting is detected, i.e. when the validation error begins to increase (at its minimum point). This limits the effective complexity of the model.

Regularisation: Methods to avoid overfitting to training data (maintain generalisation)


1. Weight decay: Add a penalty for large weights to the error function (see the worked form after this list). This suppresses irrelevant components by driving their weights towards smaller values, reduces the effect of noise in the data, and keeps the sigmoids operating in regions of moderate gradient. The result is a much smoother/simpler discriminant, which is likely to generalise better.
2. Early stopping
3. Cross-validation??
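As a worked form of weight decay (standard L2 regularisation, not taken from the notes themselves), the penalised error and the resulting gradient-descent update are

E_{reg}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i w_i^2, \qquad w_i \leftarrow w_i - \eta \left( \frac{\partial E}{\partial w_i} + 2 \lambda w_i \right)

so each step shrinks every weight slightly towards zero in addition to following the error gradient.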
Week 9 – Deep Learning

Neural Network with a large number of hidden layers, trained again with backpropagation.

They normally use ReLU activation functions instead of sigmoids because they train much faster.
Because the networks are much larger than previously discussed, training time becomes a
serious consideration.

Convolutional Networks/Layers
A specific structure designed to process data that comes in arrays (images, etc.). Each layer performs a convolution: a small filter of learned weights is slid across the input and applied to each local subset (patch) of it.

Pooling Layer
These layers are placed after convolutional layers and pool the results from neighbouring units (e.g. taking the maximum or average over a small region). This creates invariance to small shifts and distortions of the input.
Week 10 – Kernel Machines
Discriminant-based: Don’t have to estimate densities. Discriminant is defined by support
vectors.

Margin: Distance from a discriminant to the closest data points (in both directions).

Support vector: Goal of a support vector machine is to maximise the margin. The support
vectors are the data points that limit the margin and hence define the discriminant (optimal
separating hyperplane).

Soft Margin Hyperplane: When the points are not linearly separable, a soft error function
allows for there to be classification errors.

Basis function: A function that transforms data into a higher-dimensional space. This can be used when maximising the margin to create a non-linear discriminant, but computing the transform explicitly is computationally intensive.

Kernel function: A similarity function – the larger its value, the more similar the two points are. It is the dot product in the higher-dimensional space.

Kernel Trick: Reduces computation by not requiring the use of the basis function. As such,
the higher dimensional transform can be avoided. Only need to know the kernel. This makes
kernel methods very adaptive.
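A small numerical check of the kernel trick (assumed illustration) using the polynomial kernel K(x, z) = (x·z + 1)^2: evaluating the kernel directly gives the same value as a dot product after an explicit quadratic basis function, which never has to be computed.

import numpy as np

def phi(x):  # explicit quadratic feature map for a 2-D input
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1*x1, x2*x2, np.sqrt(2)*x1*x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print((x @ z + 1) ** 2)   # kernel evaluated directly: 0.25
print(phi(x) @ phi(z))    # same value via the explicit basis function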
Week 11 – Combined Learners/Ensembles
Comparing ML Algorithms

Cross-Validation (used here for comparing algorithms, as distinct from its use in model selection)


Divide data into training, validation and test sets.
1. Partition data into k equal subsets
2. Train the model on all but one subset
3. Test on the held-out subset
4. Repeat for each subset
5. Generalisation error is the average of the hold out test errors
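A short usage sketch of k-fold cross-validation, assuming scikit-learn; the decision-tree learner and synthetic data are placeholders:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Estimate generalisation error as the average held-out error over 5 folds.
X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(1 - scores.mean())   # average hold-out error across the folds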

Bootstrapping
Sample with replacement to generate new datasets. Replacement means the same data point can appear multiple times and across datasets. Doing this many times can be used to infer characteristics (e.g. the mean and variance) of an estimator.
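A minimal bootstrap sketch (assumed illustration): resample with replacement many times to estimate the variability of the sample mean.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(10, 2, size=200)
# Each bootstrap dataset is the same size as the original, drawn with replacement.
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print(np.std(boot_means))   # bootstrap estimate of the standard error of the mean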

Confusion matrices
Given labelled data and a classifier, a confusion matrix gives the breakdown of the predictive accuracy of the model. The diagonal indicates correct classifications, and the off-diagonal entries indicate misclassifications.

Ensemble Learners
Use multiple learning algorithms to obtain better performance. Multiple supervised learners are built and combined by voting (for classification) or weighting/averaging (for regression). Ensembles tend to perform better when the base learners are uncorrelated and diverse.

Bagging
An ensemble learning technique that uses bootstrapping to generate L training sets to train
L base-learners.

Selecting the number of base-learners for bagging


The best outcome is to select learners that improve accuracy in areas where the others lack it. It is best to use a small number of unstable classifiers, because averaging them across bootstrap samples gives the largest benefit (averaging stable classifiers changes little). Practically, the number of base learners could be chosen using cross-validation, terminating when the validation error is minimised. Could also use forward selection??

Boosting
Incrementally build an ensemble by training each new model on the previous models' misclassifications. It may achieve better accuracy than bagging but is more prone to over-fitting the training data. An example of boosting is the AdaBoost algorithm:
1. Initialise all data points to have an equal probability of selection
2. Randomly select some number of data points for the sample based on their
probability
3. Train the learner on the selected subset
4. Test the learner on each data point in the full training set
5. Calculate error and the learner weight (Beta – this is based on how well it classified
the data).
6. Decrease the probability of each data point if correctly classified
7. Combine the learners, scaling by the log inverse of their error rate (higher
classification accuracy means heavier weight in the output)
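A short usage sketch, assuming scikit-learn's AdaBoostClassifier (whose default base learner is a depth-1 decision tree, i.e. a decision stump); the synthetic data is a placeholder:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Boost 50 weak learners, each trained with emphasis on previously misclassified points.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))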

Adapting AdaBoost to Bagging


By removing the early-stopping condition and the probability adjustment from the AdaBoost algorithm, it becomes a simple bagging technique. When testing, the bagging output is also produced by a majority vote instead of the weighted combination.

Improving Ensembles
Try to get independence between the classifiers, either by subset selection (training on different feature or data subsets) or by using meta-classifiers / decorrelating transforms (e.g. PCA).
Week 12 – Graphical Models/Bayesian Networks
Decision Trees
Supervised learning technique that provides a structure for classification based on specific
feature criteria. Benefit of transparency and interpretability (you can see exactly how the
classification decision was derived).

Impurity: A measure of how mixed the class labels of the training samples reaching a node are; a pure node contains samples from only one class.

Learning of a decision tree is achieved by splitting nodes to minimise impurity. Measures of impurity include:
1. Entropy
2. Gini index
3. Misclassification
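Minimal implementations of the three impurity measures (assumed illustration), each taking a vector of class proportions p at a node:

import numpy as np

def entropy(p):            # highest when the classes are evenly mixed
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):               # 1 minus the sum of squared proportions
    return 1 - np.sum(p ** 2)

def misclassification(p):  # error of predicting the majority class
    return 1 - np.max(p)

p = np.array([0.5, 0.5])   # a maximally impure binary node
print(entropy(p), gini(p), misclassification(p))   # 1.0 0.5 0.5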

Random forests
An ensemble of decision trees built with bagging, where each split additionally considers only a random subset of the features. This further decorrelates the trees, so averaging their votes reduces variance.

Bayesian Networks

Number of rows in the joint distribution table:

|P(x, y, z)| = |x| × |y| × |z|, where |x| is the number of values x can take

Number of rows in a marginal table = the product of the number of rows in each conditional (commonly the conditional independences encoded in the Bayesian network reduce the table sizes needed to specify the joint distribution).

There are three main node-connection structures that describe a Bayesian network:

1. Head-to-tail (chain, X → Y → Z):

P(X, Y, Z) = P(X) \, P(Y|X) \, P(Z|Y)
P(Z|X) = P(Z|Y) \, P(Y|X) + P(Z|\neg Y) \, P(\neg Y|X)

2. Tail-to-tail (common parent, Y ← X → Z):

P(X, Y, Z) = P(X) \, P(Y|X) \, P(Z|X)

3. Head-to-head (common child, X → Z ← Y):

P(X, Y, Z) = P(X) \, P(Y) \, P(Z|X, Y)
P(Z|X) = P(Z|X, Y) \, P(Y) + P(Z|X, \neg Y) \, P(\neg Y)
The above cases can be chained together to form more complex distributions. Bayes rule
can be used to find the reversed conditional probabilities.
Week 13 – Bayesian Inference

The main idea is that when creating a model, there is uncertainty surrounding the parameters used in the model. Bayesian inference works by updating the model as observations are made. It starts with a prior distribution over the parameters; this may be determined by guessing or through expert knowledge. When observations are made, the model is updated via Bayes' Rule to obtain the posterior, giving a new adjusted distribution. The updated posterior is the new model, based on our original belief (the prior) and the observed data.

The posterior is proportional to the likelihood times the prior: p(θ|D) ∝ p(D|θ) p(θ)
