Machine Learning and Deep Learning Supervised Learning 1682688720

The document provides an introduction to machine learning, focusing on supervised and unsupervised learning methods. It covers essential concepts, types of data, and specific algorithms like k-means clustering, linear regression, and logistic regression, along with their applications and evaluation metrics. The second part discusses performance evaluation techniques, including cross-validation and metrics for regression and classification, as well as strategies for improving model performance.

Uploaded by

Mohammad Shafiee
Introduction to Machine Learning

Part 1: Supervised and Unsupervised Learning

John Pinney
November 2020
Intended learning outcomes
After attending the three sessions of this workshop, you will be better able to:

• Explain the difference between supervised and unsupervised learning.
• Select a suitable machine learning method for a given application.
• Prepare your own training and testing data sets.
• Evaluate the performance of a machine learning experiment.
Overview
• What is machine learning?
• Types of data
• Unsupervised learning
  • Clustering: k-means
• Supervised learning
  • Regression: linear models
  • Classification: logistic regression, decision trees
What is machine learning?
Statistical learning theory
• Theory was introduced in the late 1960s.
• Became an applied science in the 1990s.

• Allows us to
• detect or learn structures and relationships in data.
• assign observations to different classes.
• make predictions based on current knowledge.
Some essential vocabulary…
• vector
A quantity within a multidimensional space.
• function
A mapping from one vector space (input) to another (output).
• optimisation
A procedure that attempts to find the minimum (or maximum) of a function.
A ‘machine’ has inputs and outputs.

[Figure: an input feature vector, x, enters the machine, which acts like a function and returns an output vector, y.]

The machine has parameters that we might need to fit (optimise) using training data.

[Figure: training data, made up of input vectors, x, and output vectors, y, are used to fit the machine.]
Types of data
Categorical data
(no numerical relationship between values)

• Nominal data: no obvious ordering of categories.
e.g. favourite colour:
green / blue / orange / yellow
When there are only 2 possible categories, data is called
dichotomous or binary.

• Ordinal data: there is a natural order for the categories.
e.g. Likert scale:
strongly disagree / disagree / neutral / agree / strongly agree
Quantitative data
(numerical data from counts or measurements)

• Discrete data: can only take specified values.
e.g. number of children in a family (integer)

• Continuous data: can take any value in an interval.
e.g. blood pressure
Example dataset
Take a look at the iris dataset.

What are the features and what are their data types?
Unsupervised learning
• In unsupervised learning, we are looking for structure in
the inputs without any knowledge of associated
outputs: the data are considered to be unlabelled.

• We are seeking to “discover new knowledge”

• Examples include:
• Dimensionality reduction, e.g. principal component analysis
• Self-organising map
• Clustering
Clustering
To look for structure within a dataset, we often make use of clustering techniques.

A set of objects is grouped in such a way that objects in the same cluster are more similar (in some sense) to each other than to those in other clusters.

It is a central task in exploratory data mining.
Clustering
[Figure: the machine takes the input dataset, x, and returns output cluster labels, y (here 0, 1 or 2 for each observation).]
Clustering
• Feature-based clustering
takes as input the set of input feature vectors.

• Distance-based clustering
takes as input a matrix of distances that are
calculated between each pair of input feature vectors.
e.g. Euclidean distance.

Clustering methods may be flat (just reporting cluster labels) or hierarchical (reporting a dendrogram of nested clusters).
k-means clustering
A feature-based technique for flat clustering.

Requires a prior decision of the number of clusters (k). In practice, a good value of k for a given data set may be found by post-hoc analysis (e.g. silhouette score).

k-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
k-means clustering algorithm
1. Initialise positions for k cluster centroids (at random).
2. Assignment step: Assign each observation to the
cluster whose centroid is “nearest” according to the
chosen distance metric.
3. Update step: Calculate the new centroid positions
according to the observations assigned to each
cluster.
4. Check for convergence (cluster assignments did not
change). If not converged, go to 2.
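The four steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the toy data and the choice of Euclidean distance are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: initialise the centroids at k randomly chosen observations.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2 (assignment): each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 4 (convergence): stop once no assignment has changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3 (update): move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels, centroids

# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Which group is called cluster 0 and which cluster 1 depends on the random initialisation, so only the grouping is meaningful, not the label values.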
k-means clustering

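k-means is often fast in practice, but it is a heuristic method, so it is not guaranteed to find the global optimum. Re-running several times with different starting points is therefore advisable.

Note that this is closely related to the expectation-maximisation approach: the assignment step plays the role of the expectation step (with hard rather than probabilistic assignments) and the update step plays the role of the maximisation step.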
k-means example
Using only the numerical features,
cluster the iris dataset.
k-means exercise
Look at the abalone dataset.

Considering only the numerical features, perform k-means clustering. Use the silhouette score to determine how many clusters the data appear to fall into.

What do the two clusters appear to correspond to?


Supervised learning
• Here, labelled data are used to “train” a machine learning algorithm, which is then used to classify or predict the response of new input data.

• We want to learn the function f : x → y
Supervised learning
[Figure: labelled training data (input vectors, x, and output vectors, y) are used to train the machine, which then makes predictions for new inputs whose outputs are unknown.]
Two types of supervised learning

y is a continuous value
=> Regression
(estimate the response to a given input)

y is a discrete-valued class label
=> Classification
(identify the class of a given input)
Supervised learning: Regression
[Figure: labelled training data are used to train the machine, which then makes predictions for new inputs.]
Linear regression

Predict y from the features of x by fitting a linear function.

Fitting is an optimisation procedure: e.g. minimise the sum of squared errors.
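For a single feature, the line that minimises the sum of squared errors has a closed-form solution. The sketch below shows it in plain Python; the function names and the data values are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """The quantity that the fit minimises."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up observations that roughly follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, sum_squared_errors(xs, ys, slope, intercept))
```

Any other choice of slope or intercept gives a larger sum of squared errors on these points, which is what "fitting is an optimisation procedure" means here.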
Linear regression example
With the iris dataset:
Considering only iris virginica:
1. Split the data into training and testing sets.
2. Use linear regression to predict sepal length
from petal length.
Linear regression exercise
With the abalone dataset:
Considering only adults:
1. Split the data into training and testing sets.
2. Use linear regression to predict rings from
the numerical features.
Linear regression with many features

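We often want to apply some kind of regularisation to our model, so that coefficients are shrunk towards zero, e.g. ridge regression, lasso or elastic net. Lasso and elastic net can set the coefficients of uninformative features exactly to zero, whereas ridge regression only makes them smaller.

This makes models simpler and easier to interpret, and potentially shows which features are informative for predicting y.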
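The effect of a penalty is easiest to see with one feature and no intercept, where ridge and lasso both have closed-form solutions. This is a simplified sketch under those assumptions; the function names, the penalty value and the data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def one_feature_coefficients(xs, ys, penalty):
    """OLS, ridge and lasso coefficients for y ~ b * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    ols = sxy / sxx
    # Ridge minimises SSE + penalty * b^2.
    ridge = sxy / (sxx + penalty)
    # Lasso minimises SSE / 2 + penalty * |b|.
    lasso = soft_threshold(sxy, penalty) / sxx
    return ols, ridge, lasso

xs = [-1.0, 0.0, 1.0]
strong = one_feature_coefficients(xs, [-2.0, 0.0, 2.0], penalty=1.0)
weak = one_feature_coefficients(xs, [-0.2, 0.1, 0.3], penalty=1.0)
print("informative feature  :", strong)
print("uninformative feature:", weak)
```

For the weakly related feature, ridge returns a smaller but non-zero coefficient, while lasso returns exactly zero.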
Supervised learning: Classification
[Figure: labelled training data are used to train the machine, which then makes predictions for new inputs.]
Logistic regression
Confusingly, logistic regression is an algorithm for
classification.

Consider a binary classification, with classes labelled 0 and 1.

For our training data, we can plot the probability that a particular value of x is labelled as class 1.
Logistic regression
[Figure: passed exam (1 = yes) plotted against hours of study.]
Logistic regression
[Figure: probability of passing exam plotted against hours of study, fitting a logistic (sigmoid) function to the training data.]
Logistic regression
[Figure: probability of passing exam plotted against hours of study, split into a “Predict fail” region and a “Predict pass” region.]
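The exam example can be sketched as a sigmoid curve plus a decision threshold. The intercept and slope below are hypothetical values picked for illustration, not parameters fitted to any real data, and the function names are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def probability_of_passing(hours, intercept, slope):
    """Modelled probability of class 1 (pass) for a given study time."""
    return sigmoid(intercept + slope * hours)

def predict_pass(hours, intercept, slope, threshold=0.5):
    """Predict 'pass' when the modelled probability reaches the threshold."""
    return probability_of_passing(hours, intercept, slope) >= threshold

# Hypothetical parameters, chosen only to make the example concrete.
intercept, slope = -4.0, 1.5
for hours in (1, 2, 3, 4):
    p = probability_of_passing(hours, intercept, slope)
    outcome = "pass" if predict_pass(hours, intercept, slope) else "fail"
    print(hours, round(p, 3), outcome)
```

With a threshold of 0.5, the boundary between "predict fail" and "predict pass" sits where intercept + slope * hours equals zero.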
Logistic regression example
With the iris dataset:
Considering iris versicolor and virginica:
1. Split the data into training and testing sets.
2. Predict iris (the species) from petal length.
3. Use a confusion matrix to examine the
results.
4. Do the results improve if the other numerical
features are included?
Do the same for a three-class logistic regression.
Logistic regression exercise
With the kickstarter dataset:
1. Split the data into training and testing sets.
2. Predict funded from the numerical features.
3. Use a contingency table to examine the
results.
4. How could we make use of the type feature,
which is a nominal data type?
‘One-hot’ encoding
Useful for converting a categorical variable into
multiple binary features, which can be used in
algorithms that require numerical inputs.
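A minimal sketch of one-hot encoding in plain Python follows. The function name and the colour values are assumptions for illustration; the column order is simply the sorted list of categories.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded

colours = ["green", "blue", "orange", "blue"]
categories, encoded = one_hot_encode(colours)
print(categories)
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each row contains exactly one 1, in the column of the category that the original value belonged to.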
What about non-linear classification?
[Figure: regions labelled “Visited GP”, “Did not visit GP” and “Visited GP”.]
Decision tree

Finding an optimal tree is very difficult. In practice, we use a greedy algorithm, which builds the tree step by step, optimising the result at each stage.
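One greedy step can be sketched as a search for the single threshold that gives the purest pair of child nodes. Gini impurity is assumed here as the splitting criterion (the slides do not name one), and the function names and measurements are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Best threshold for the test 'value <= threshold', with its impurity."""
    n = len(values)
    best = (None, gini(labels))  # start from the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds lie midway between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Made-up petal lengths and species labels.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]
threshold, impurity = best_split(petal_length, species)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature, then recurse into the two child nodes. It never revisits an earlier split, which is why the result is not guaranteed to be the optimal tree.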
Decision tree example
With the full iris dataset:
1. Split the data into training and testing sets.
2. Predict iris (the species) from the other
features.
3. Use a tree viewer to examine the resulting
decision tree.
4. Use a confusion matrix to examine the
results.
Decision tree exercise
With the titanic dataset:
1. Split the data into training and testing sets.
2. Predict survived from the other features.
3. Use a tree viewer to examine the resulting
decision tree.
4. Use a confusion matrix to examine the
results.
Summary of Part 1
Machine learning is a subfield of artificial intelligence,
concerning data-driven predictions.
Clustering (e.g. k-means) is an unsupervised approach. It
can be used to discover structure in unlabelled data.
Regression (e.g. linear regression) is a supervised
approach. It predicts a numerical output from the input
features.
Classification (e.g. logistic regression, decision tree) is
also a supervised approach. It predicts a categorical
output from the input features.
Next time…

How can we evaluate and compare performance in supervised learning?

How can we improve performance beyond the basic algorithms?
Introduction to Machine Learning

Part 2: Evaluating and improving performance

John Pinney
November 2020
Intended learning outcomes
After attending this workshop, you will be
better able to:

• Explain the difference between supervised and unsupervised learning.
• Select a suitable machine learning method for a
given application.
• Prepare your own training and testing data
sets.
• Evaluate the performance of a machine
learning experiment.
Overview
• Evaluating performance
  • Train/Validate/Test
  • Cross-validation
  • Regression metrics
  • Classification metrics
  • ROC curve
• Improving performance
  • Bias vs variance
  • Feature selection
  • Tree pruning
  • Ensemble methods: Bagging, Boosting
Evaluating performance
How can we compare different ML
methods for the same task?

• We need performance metrics that measure the similarity between the model’s predictions and the ground truth.

• You are probably already familiar with the metric R² for assessing the fit between a linear model and the data.
A good model would have R² close to 1.
Coefficient of determination, R²
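As a reminder, the usual definition of the coefficient of determination is given below. The notation is assumed for this note rather than taken from the slides: y_i are the observed values, \hat{y}_i the model's predictions, \bar{y} the mean of the observed values, and n the number of observations.

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
    = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

A model that always predicts \bar{y} scores R² = 0, a perfect model scores R² = 1, and a model that does worse than predicting the mean can score below zero on unseen data.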
R² example
With the abalone dataset,
Use a tree to predict rings from the other features.
What is R² for this model?
Overfitting
• We saw in the example that the evaluation might look very different depending on whether we use the training dataset or an unseen dataset.
• A complex model might perform very well on the training data, but if it is overfitted then it will generalise poorly to unseen data.

• We must always be careful to make sure that our evaluation metrics are calculated on a separate testing dataset.
Avoiding contamination
• In some circumstances, we need to take extra care to
ensure that the test data is truly independent of the
training data.

• Part of the skill in designing a good machine learning experiment is in recognising how to filter data to avoid this kind of contamination.
e.g. if predicting protein function from sequence, ensure that no protein in the test data is too closely related to a protein in the training data.
Train / validate / … and test!
• It is crucial that the test data remains unseen during the
development of the ML model.
• However, we may need to “tune” various
hyperparameters or model architectures to arrive at an
effective model.
• After the testing data is removed, we can therefore split
the remaining data into “training” and “validation” sets
so that we can get an idea of the performance of the
current iteration of the model.
Cross-validation
• A very commonly used technique is to apply k-fold
cross-validation, where
• the data is split into k equally sized subsets.
• we train on (k-1) subsets and evaluate on the remaining
subset.
• we repeat so that each subset is used in validation.
• we report the mean performance metric over the k folds.
Cross-validation (k=5)
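The k-fold procedure can be sketched in a few lines of plain Python. The helper names, the toy "predict the training mean" model and the data are assumptions for illustration; in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Mean validation score over k folds."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            score(model, [xs[i] for i in validation], [ys[i] for i in validation])
        )
    return sum(scores) / k

# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_absolute_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(cross_validate(xs, ys, k=3, fit=fit_mean, score=mean_absolute_error))
```

Every observation appears in exactly one validation fold, so each one contributes to the reported score exactly once.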
Other metrics for regression
• There are several other metrics we can use to evaluate regression, e.g.

• Mean absolute error (MAE)
(smaller is better)
• Root mean squared error (RMSE)
(more sensitive to outliers than MAE)
• Adjusted R²
(useful when comparing models with different numbers of variables)
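These metrics are short enough to write out directly. The sketch below uses their standard definitions; the function names and the example values are assumptions, and p denotes the number of predictor variables in the model.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalise R^2 for using p predictors on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Made-up observations and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mae, rmse, r2, adjusted_r_squared(r2, n=len(y_true), p=1))
```

The single large error (9 predicted as 12) raises RMSE more than MAE, which is the sense in which RMSE is more sensitive to outliers.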
R² exercise
With the abalone dataset,
Use a tree to predict rings from the other features.
What is R² for this model
using 5-fold cross-validation?
Does a linear regression perform better?
What is R² when the model is just a constant?
Evaluate on the test data and report a final R² for
your preferred model.
Metrics for classification
Consider this confusion matrix for a binary classification

[Figure: confusion matrix comparing predictions with ground truth.]

We can define a number of metrics from these numbers, which capture different aspects of performance.

Here N+ and N- are the numbers of actual positives and negatives in the ground truth, N̂+ is the number of predicted positives, N is the total number of observations, and TP and TN are the numbers of true positives and true negatives.

sensitivity (recall) = proportion of N+ that are detected = TP / N+
specificity = proportion of N- that are correctly rejected = TN / N-
precision = proportion of N̂+ that are correct = TP / N̂+
accuracy = proportion of N that are correct = (TP + TN) / N
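The four metrics can be computed directly from the confusion-matrix counts. In this sketch class 1 is treated as the positive class; the function names and the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with class 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),   # TP / actual positives
        "specificity": tn / (tn + fp),   # TN / actual negatives
        "precision": tp / (tp + fp),     # TP / predicted positives
        "accuracy": (tp + tn) / len(y_true),
    }

# Made-up ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The sketch assumes that each denominator is non-zero, i.e. that there is at least one actual positive, one actual negative and one predicted positive.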
Receiver operating characteristic
There is usually a trade-off between sensitivity and specificity. If a method reports some kind of probability score for its predictions then we could adjust a threshold to tune between maximum sensitivity and maximum specificity.

The receiver operating characteristic is a graphical way to examine performance in terms of this trade-off.

The area under the ROC curve (AUC) is often given as an overall performance metric that is easy to compare between different methods (1 = perfect classifier, 0.5 = random guessing).
Receiver operating characteristic
[Figure: ROC curve, plotting sensitivity against 1 - specificity.]
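A ROC curve can be traced by sweeping the threshold over the reported scores and recording sensitivity against 1 - specificity each time. The sketch below does this in plain Python and estimates the AUC with the trapezium rule; the function names, labels and scores are invented for illustration.

```python
def roc_points(y_true, scores):
    """(1 - specificity, sensitivity) at every distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezium-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Made-up labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is one way to interpret it as an overall ranking metric.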
Classification metrics exercise
With the breast cancer dataset,
Compare the performance of a tree and a logistic
regression for the task of predicting recurrence.
Visualise the performance of the two methods
(over 5-fold cross-validation) on a ROC curve.
Improving performance
Bias versus variance
These terms have a special meaning when talking about
ML performance.
Consider a set of regression predictions made from your
testing dataset. There are two different ways in which
these can be wrong:
high bias => the model is too simple to describe the
training data well (under-fitting), so there is a systematic
error in the predictions.
high variance => the model tries to replicate the training
data in too much detail (over-fitting), leading to a lot of
noise in the predictions.
What can we do to improve
performance?
A larger training dataset (if available) will reduce variance.

Feature selection tries to eliminate uninformative features. This makes models simpler and less prone to over-fitting.

Tree pruning is used to reduce the variance of a decision tree, by limiting the complexity of the model.

Ensemble methods combine multiple weak learners (also called base models) to make better overall predictions.
Feature selection
In situations with large p + small n (i.e. many features but
comparatively few data), over-fitting is difficult to avoid.

However, if we can constrain the model to a simpler form, we might be able to reduce its variance. There are many methods for feature selection or dimensionality reduction that can help. For some learning methods, regularisation can be applied to achieve the same thing.

Intuitively, a good set of informative features would all be strongly correlated with the target, but uncorrelated with each other.
Feature selection example
With the HDI dataset,
Using linear regression to predict HDI, find the R² under
5-fold cross-validation.
Now using the correlations widget, choose three
features that appear to be informative of disease state.
Using select columns, make a new model using only
these three features and compare its performance to your
original model.
Compare the bias and variance of the two models
visually with scatter plots.
As an alternative to feature selection, investigate how
lasso regression could be used to force the coefficients for
uninformative features to zero.
Tree pruning
Left unconstrained, a decision tree will be able to correctly classify every instance in the training data.

This results in a very complex model, with a large depth (number of splits before reaching a leaf node).
Tree pruning
A deep model is very likely to be overfitted, so will have high variance.

By controlling the maximum tree depth, or pruning the last few splits, we can simplify the model to reduce its variance.
Tree pruning exercise
With the breast cancer dataset,
Evaluate the performance of a tree for the task of
predicting recurrence.
Can you improve performance by limiting the
depth of the tree?
Ensemble methods
In an ensemble method, we combine multiple weak learners to produce an overall strong learner. Usually the component models are all of the same type.

We will need to choose
• the base model, and
• the way in which models are combined: here we will consider bagging and boosting.
Bagging
“Bagging” refers to bootstrap aggregating, i.e. a way to train many different versions of a model in parallel, then combine them to (hopefully) generate an ensemble model that has less variance than each component model.

Bootstrap sampling just means to sample data with replacement. This is a way to generate multiple datasets with similar properties to the original training data.

Once trained, the overall prediction of the ensemble on the testing data is obtained by majority voting (for classification) or by averaging the component predictions (for regression).
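Bootstrap sampling and majority voting can be sketched as follows. The weak learner used here (a threshold midway between the two class means) is a hypothetical stand-in for a real base model such as a tree, and the data and function names are made up.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold_model(sample):
    """Hypothetical weak learner: threshold midway between the class means.

    Assumes class 1 tends to have larger x than class 0.
    """
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:  # degenerate sample containing one class only
        only = 1 if ones else 0
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def majority_vote(predictions):
    """Most common prediction among the component models."""
    return Counter(predictions).most_common(1)[0][0]

# Made-up (feature, label) training pairs.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1)]
rng = random.Random(0)
models = [fit_threshold_model(bootstrap_sample(data, rng)) for _ in range(5)]

def ensemble_predict(x):
    return majority_vote([model(x) for model in models])

print(ensemble_predict(1.2), ensemble_predict(5.2))
```

An odd number of component models is used so that a two-class vote cannot be tied.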
Random forest
A very popular bagging method is called random forest.

This is an ensemble of trees, each built from a different bootstrap sample from the training data.

At each split in the tree-building process, the algorithm chooses a random subspace of features to consider. This ensures that each tree is sufficiently different from the others.
Trees vs forest example
With the breast cancer dataset,
Set aside test and validation datasets.
Use data samplers to create three bootstrap
datasets from the remaining training data.
Train three corresponding trees to predict
recurrence on the validation data and compare results.
Would an ensemble model with majority vote improve
performance?
Use the random forest to explore what happens
when we introduce the random subspace method and
supply more trees to the ensemble.
Boosting
In contrast to the parallel training performed in bagging,
“boosting” methods train multiple models sequentially.
The aim is to produce an ensemble model that is less
biased than the base model.

The basic idea is that each model in the sequence gives more importance to the training examples that were poorly predicted by the model before.

AdaBoost is an example of a boosting algorithm, which is often used with decision stumps as the base model.
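The re-weighting idea can be illustrated with one round of the standard discrete AdaBoost update for labels in {-1, +1}. This sketch covers only the weight update, not the training of the stumps, and the function name and example predictions are assumptions.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights).

    Assumes labels are -1 or +1 and that the weighted error lies strictly
    between 0 and 0.5, i.e. the weak learner beats random guessing.
    """
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)  # this model's vote weight
    # Correct predictions (t * p = +1) are down-weighted, errors up-weighted.
    updated = [
        w * math.exp(-alpha * t * p)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five equally weighted examples; the weak learner gets the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified example carries half of the total weight, so the next model in the sequence is pushed to get it right.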
Summary of Part 2
Performance in supervised learning is evaluated using a
variety of metrics, using a test dataset.
Cross-validation can be used to assess models during
development.
Performance is affected by both bias and variance.
Feature selection is one way to reduce variance when
there are many uninformative features.
Bagging and boosting are examples of ensemble
methods, which combine multiple weak learners to
improve performance.
Next time…

What are neural networks and how do they work?

How does deep learning differ from the supervised methods we have seen so far?
Introduction to Machine Learning

Part 3: Neural networks and deep learning

John Pinney
November 2020
Intended learning outcomes
After attending this workshop, you will be
better able to:

• Explain the difference between supervised and unsupervised learning.
• Select a suitable machine learning method for a
given application.
• Prepare your own training and testing data
sets.
• Evaluate the performance of a machine
learning experiment.
Overview
• Neural networks
  • Perceptron
  • Multiple layers
  • Gradient descent
  • Backpropagation
  • Activation functions
• Deep learning
  • Deep networks
  • Unstructured data
  • Convolution
  • Embedding
  • Special architectures
Neural networks
Biological neuron
[Figure: a biological neuron receives multiple inputs, performs signal processing, and produces a single “all or nothing” output.]
Perceptron (1958)
[Figure: multiple inputs arrive along arcs and are combined as a weighted sum of inputs; a bias provides a threshold; an activation function gives a single output: 0 or 1.]
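A perceptron with a step activation fits in a few lines. The weights and bias below are hand-picked (not learned) so that the unit behaves like a logical AND; they and the function name are assumptions for illustration.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a step activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0

# Hand-picked parameters: the bias of -1.5 means both inputs must be 1
# before the weighted sum exceeds the threshold.
weights = [1.0, 1.0]
bias = -1.5
for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(inputs, perceptron_output(inputs, weights, bias))
```

Changing the bias to -0.5 would turn the same unit into a logical OR, which shows how the bias sets the threshold.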
Single-layer network
Multi-layer network
A feedforward network contains no directed loops.
Multi-layer networks
• Use multiple hidden layers to create more complex
models => can solve more complex problems.
• The model will “discover” its own representation of
the data in a way that best fits the learning task.
• Neurons in hidden layers can therefore represent
complex features without any need to engineer these
a priori.
• This can be very powerful if enough training data is
available.
Fitting a neural network model
• To train our model, we need to find values for the
weights, w. (We can incorporate the bias b as an
additional weight w0).

• The best choice of w would minimise an appropriate loss function for our predictions compared to the training data labels (e.g. mean squared error for a regression task).

• For a network with m arcs and n neurons, w has (m+n) components: we will be optimising in a high-dimensional parameter space.
Gradient descent
• This is a simple but effective optimisation method in
high dimensions, which uses the gradient (i.e. slope)
of the loss function and takes progressive small steps
downhill until it finds a minimum.

• Stochastic gradient descent estimates the gradient from a random subset of the training data at each step. The resulting random movement helps us escape local minima.
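As a concrete sketch, plain (full-batch) gradient descent can fit a one-feature logistic regression by repeatedly stepping downhill on the mean log loss. The data, learning rate, number of steps and function names are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w0, w1, xs, ys):
    """Mean negative log-likelihood of the labels under the model."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w0 + w1 * x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

def gradient(w0, w1, xs, ys):
    """Partial derivatives of the mean log loss with respect to w0 and w1."""
    n = len(xs)
    g0 = sum(sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)) / n
    g1 = sum((sigmoid(w0 + w1 * x) - y) * x for x, y in zip(xs, ys)) / n
    return g0, g1

def gradient_descent(xs, ys, learning_rate=0.1, steps=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = gradient(w0, w1, xs, ys)
        # Take a small step in the downhill direction.
        w0 -= learning_rate * g0
        w1 -= learning_rate * g1
    return w0, w1

# Made-up data: hours of study and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
w0, w1 = gradient_descent(hours, passed)
print(w0, w1, log_loss(w0, w1, hours, passed))
```

A stochastic variant would compute the gradient from a random subset of the (hours, passed) pairs at each step instead of from all of them.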
Gradient descent example
With the iris dataset,
Use gradient descent to fit a binary logistic
regression model that can predict iris versicolor from
only petal length and petal width.
Backpropagation

To calculate the gradient of the loss function, we start from the output layer and work back layer by layer to build up its derivative with respect to each of the weights.

This technique is known as backpropagation.
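The backward pass can be written out by hand for a very small network. The architecture below (one input, one sigmoid hidden neuron, one linear output, squared-error loss) and all parameter values are assumptions for illustration; a finite-difference estimate is included to check the result.

```python
import math

def forward(params, x):
    """Return the hidden activation and the prediction."""
    w1, b1, w2, b2 = params
    h = 1.0 / (1.0 + math.exp(-(w1 * x + b1)))  # sigmoid hidden neuron
    y_hat = w2 * h + b2                         # linear output neuron
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return (y_hat - y) ** 2

def backprop(params, x, y):
    """Gradient of the loss with respect to [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    # Output layer first.
    d_y_hat = 2.0 * (y_hat - y)
    d_w2 = d_y_hat * h
    d_b2 = d_y_hat
    # Then work back through the hidden layer using the chain rule.
    d_h = d_y_hat * w2
    d_z = d_h * h * (1.0 - h)  # derivative of the sigmoid
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_gradient(params, x, y, step=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += step
        down = list(params)
        down[i] -= step
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * step))
    return grads

params = [0.5, -0.2, 1.5, 0.1]
analytic = backprop(params, x=2.0, y=1.0)
numeric = numerical_gradient(params, x=2.0, y=1.0)
print(analytic)
print(numeric)
```

The two gradients should agree to several decimal places; comparing them in this way is a common sanity check on a hand-written backward pass.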
Problems with the activation function

• Binary output makes the network less expressive.
• Zero gradient everywhere (except z=0) means backpropagation won’t work!
Solution: continuous activation functions

These continuous activation functions (and many others) are used in multi-layer neural networks.

Their different shapes result in different learning characteristics, suitable for different tasks.
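Which functions the slide plotted is not recoverable from the text, so the sketch below simply assumes three widely used examples: the sigmoid, tanh and the rectified linear unit (ReLU).

```python
import math

def sigmoid(z):
    """Smooth S-shaped curve with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Smooth S-shaped curve with outputs in (-1, 1), centred on zero."""
    return math.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```

Unlike the step function, each of these has a non-zero gradient over at least part of its range, which is what backpropagation needs.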
Multi-layer network exercise
With the wine quality - red dataset,
Use a neural network to predict quality from the
other features.
Explore how performance is affected by adding
additional hidden layers to the network.
Deep learning
Deep networks
“Deep learning” concerns neural network models with
many hidden layers, used in both unsupervised and
supervised learning contexts.

It is difficult to train deep models effectively, so special techniques have been developed for this class of models.

Specialised software packages (e.g. TensorFlow, Keras) and computer hardware are available.
Unstructured data
Deep networks are especially well suited to working
with unstructured input data, which doesn’t have
easily definable informative features.

Images
Text
Speech
Music

Image classification task
[Figure: labelled training data (images) are used to train the machine, which then makes predictions for new images.]
From pixels to features?
Convolution
• Convolution is a way to transform unstructured data
into network inputs in a way that preserves the
relevant relationships in time and/or space.
• A network architecture that uses this technique is
known as a Convolutional Neural Network (CNN).
• In practice, convolutional layers may form part of a
larger architecture.
Convolution
The kernel is a matrix of weights that passes over the image.
[Figure: the kernel passes over the input layer to produce the next layer.]
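The sliding-kernel operation can be sketched in plain Python. The version below uses no padding and a stride of 1, and, like most deep learning libraries, it does not flip the kernel (strictly speaking this is a cross-correlation). The image, the kernel values and the function name are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image, summing element-wise products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x4 "image" that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A kernel that responds to a dark-to-bright step from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The output is non-zero only where the kernel straddles the edge, which illustrates how a convolutional layer preserves spatial relationships. In a CNN the kernel weights are learned rather than chosen by hand.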


Google Inception (2015)
An image classification model, trained on the
ImageNet dataset.
Google Inception module

• Architecture is carefully designed to capture salient features and ensure efficient training.
• Uses kernels of different widths to detect both local and global features.
Google Inception model
The overall model for image classification is built from 9 of these modules. It is 27 layers deep in total.
The authors suggest it would take about a week to train on a few high-end GPUs.
Embedding

Once the weights of the Inception model have been trained, we can use the model to classify any image into the ImageNet categories.

The hidden layers of the model must be capturing some highly informative features. We can make use of these features in our own models.

When an image is used as input, the activations of the penultimate layer are extracted as features. This is called embedding.
Image embedding example

Using image embedding, can we train a neural network to distinguish cats from dogs?
Some special architectures
• Autoencoder
learns a compact representation of the data in an
unsupervised way (e.g. for dimensionality
reduction).
• Recurrent Neural Network (RNN)
represents processes evolving over time.
• Generative Adversarial Network (GAN)
two networks (a generator and a discriminator)
compete against each other. This is often used to
generate realistic synthetic data, such as images.
Summary of Part 3
Artificial neural networks are a diverse family of
biologically inspired learning methods.
Neurons in a network receive multiple inputs and
combine these as a weighted sum. The neuron’s output is
determined by its activation function.
To fit the model, weights need to be optimised
e.g. by using gradient descent.
Deep learning implies any architecture with multiple
hidden layers that can discover relevant features from the
data itself.
Convolution layers may be used to work with
unstructured inputs such as images or speech.
Where next…?
Müller AC & Guido S, Introduction to Machine Learning with Python
http://ebookcentral.proquest.com/lib/imperial/detail.action?docID=4698164

Burger SV, Introduction to Machine Learning With R
https://www.oreilly.com/library/view/introduction-to-machine/9781491976432/

Nielsen M, Neural Networks and Deep Learning
http://neuralnetworksanddeeplearning.com/index.html

Tutorials and examples at
https://scikit-learn.org/
https://www.kaggle.com
