L02 - 03 Crash Course On NN

This document provides a summary of key concepts from an introductory lecture on neural networks and convolutional neural networks: 1. The lecture introduced deep neural networks and derived equations for deep learning, covering two main types of networks: multilayer fully connected networks and convolutional neural networks. 2. A fundamental element of neural networks is linear computing elements called artificial neurons organized into networks. These networks are used as tools to adaptively learn the parameters of decision functions via successive training examples. 3. The lecture began with an introduction to the foundations of neural networks, starting with the basic perceptron model and algorithm. Perceptrons learn linear decision boundaries to classify patterns via iterative weight updates.


• Lecturer: Paulo Santos
• Email: [email protected]
• Office: 4.24
Intro to NN and CNN
Neural Networks and Deep Learning
• Introduction to deep neural networks and derivation of the equations for deep learning
• Two types of networks:
• Multilayer, fully connected neural networks, whose inputs are pattern vectors
• Convolutional neural networks, which accept images as inputs
• Fundamental element:
• Linear computing elements (called artificial neurons) organised as networks
• Use these networks as tools for adaptively learning the parameters of
decision functions via successive presentations of training examples/patterns.
Introduction to neural networks
• Foundations of NN
• We start with a fundamental idea: Perceptron
• Although these computing elements are not used per se in current NNs, the operations they perform are almost identical to those of artificial neurons

https://www.cybercontrols.org/neuralnetworks
Perceptron
• A single perceptron unit learns a linear boundary between two
linearly separable pattern classes.
• E.g.:
Perceptron
• A linear boundary in 2D is a straight line with equation y = wx + b
• w is the coefficient (the slope of the line)
• b is the y-intercept term…
• Also known as bias, bias coefficient, or bias weight
• (yeah, bias, an overloaded term. It is not the bias in statistics!)
• For higher dimensions we need a more general notation:
• x1, x2, …, xn → coordinates of a point
• w1, w2, …, wn → coefficients
• b → bias
Perceptron
• The boundary separating classes in n dimensions would then be a plane, or rather, a hyperplane:
w1x1 + w2x2 + w3x3 + … + wnxn + b = 0
• Also expressed as: the sum over i of wi xi, plus b, equal to 0
• Or in vector form: wᵀx + b = 0
• Where w is known as the weight vector and b is the bias
• We say that an arbitrary point (x1, x2, … xn) is on the positive side of a line (hyperplane) if:
• wᵀx + b > 0 (therefore belonging to class c1)
• Conversely for a point on the negative side: wᵀx + b < 0 (class c2)
Perceptron
• The perceptron is an algorithm that finds this hyperplane separating two classes by
• Iteratively stepping through the patterns of each of the two classes
• Starting with an arbitrary weight vector and bias
• The task is, given a pattern vector x from a vector population, to find a set of weights and a bias such that:
• wᵀx + b > 0 if x belongs to class c1
• wᵀx + b < 0 if x belongs to class c2
Schematic diagram of a perceptron
• All this machine does is compute the sum of products of an input pattern with the weights and bias found during training.
• The output is a scalar value that is then
passed through an activation function
to produce a binary decision
• +1 : belongs to c1
• -1: belongs to c2
• For the perceptron the activation
function is a thresholding function.
https://gfycat.com/obviouscarelessblackfootedferret
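As a concrete illustration of the sum-of-products-plus-threshold just described, here is a minimal NumPy sketch (the weight, bias, and input values are made up for illustration, not taken from the lecture):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Sum of products of the input pattern with the weights, plus the bias,
    passed through a hard threshold: +1 -> class c1, -1 -> class c2."""
    z = np.dot(w, x) + b          # sum of products + bias
    return 1 if z > 0 else -1     # thresholding activation function

# Illustrative values only
w = np.array([0.5, -0.3])         # weight vector found during training
b = 0.1                           # bias
print(perceptron_predict(np.array([1.0, 2.0]), w, b))   # prints +1 or -1
```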
Signal flow graph of the Perceptron
Perceptron Learning
• Test Problem
• Find a line that separates the classes
Perceptron Learning
• Train a two-input/single-output network without a bias

Learning: “A process by which the free parameters of a neural network are adapted”

Learning: the free parameters in this model are w1,1 and w1,2
Perceptron Learning
• Randomly assign values to w
• The initial 1w incorrectly classifies p1 as class 0
• Learn: update the free parameters
• Adding p1 to 1w would make 1w point more in the direction of p1
• The updated 1w now correctly classifies p1 as class 1, but incorrectly classifies p2 as class 1
Perceptron Learning
• Keep going!
Perceptron Learning
• Three rules for updating:
• If the target is 1 and the output is 0: add the input p to the weight vector w
• If the target is 0 and the output is 1: subtract the input p from the weight vector w
• Last rule is "No Change": when the output matches the target, leave w unchanged
Perceptron Learning
• Unified Update Rule: w_new = w_old + e·p, where the error e = (target - actual output)
Perceptron Learning
• Adjustment to weight vector w at time step n:
Δw(n) = η [d(n) - y(n)] x(n)
• η: learning rate, d(n): desired output, y(n): actual output, x(n): input vector
• Delta rule (or Widrow-Hoff rule)
• Equivalent to:
• If output is high, reduce weights on active inputs
• If output is low, increase weights on active inputs
Effect of the learning rate η

Plots of E as a function of w: (a) A value of η that is too small can slow down convergence. (b) If η is too large, there may be large oscillations or divergence. (c) Shape of the error function E in 2D
Perceptron Convergence Algorithm
• Iteratively update weights until convergence
• Each iteration is called an epoch

1. Initialize weights w to random values
2. For each training example, compute the perceptron response y
3. Update the weights (+ bias) using the learning rule (see previous slide)
4. Go back to step 2
Repeat until the perceptron response matches the desired response
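A minimal NumPy sketch of this convergence algorithm, using the delta-rule update Δw = η(d - y)x from the previous slide; the toy data, seed, and learning rate are illustrative, not the lecture's test problem:

```python
import numpy as np

# Toy, linearly separable data (illustrative): two classes labelled +1 / -1
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])                  # desired responses

rng = np.random.default_rng(0)
w = rng.normal(size=2)                        # 1. initialize weights randomly
b = 0.0
eta = 0.1                                     # learning rate

for epoch in range(100):                      # each pass over the data is an epoch
    errors = 0
    for x, target in zip(X, d):
        y = 1 if (w @ x + b) > 0 else -1      # 2. perceptron response
        if y != target:                       # 3. update only on mistakes
            w += eta * (target - y) * x       #    delta rule
            b += eta * (target - y)
            errors += 1
    if errors == 0:                           # responses match the desired ones -> stop
        break

print("weights:", w, "bias:", b, "epochs:", epoch + 1)
```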
Colab Perceptron Examples
• Step-by-Step
https://colab.research.google.com/drive/1bzbmIPt-mUslGgjFR5nBOtPzgKtZ32PF?usp=sharing
• Iterate
https://colab.research.google.com/drive/1vz21O1cAdkYCEB9I19-5KtYiIyTHtFni?usp=sharing
• Iterate with Bias
https://colab.research.google.com/drive/17ZEu9Yf3y0Ozu32D5fqHcahcBY70-lIR?usp=sharing
Perceptron Limitations
• Decision surface is a hyperplane
• Classes must be linearly separable!
• Perceptron cannot learn XOR, or parity function in general

(Figure: XOR, with Class A and Class B points that cannot be separated by a single straight line)
Solving XOR
Adding a hidden layer... Decision boundaries

Neuron 1

Neuron 2
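To make the hidden-layer point concrete, here is a small Keras sketch (in the spirit of the Colab examples, but not the lecture's code; layer sizes and hyperparameters are illustrative) that fits XOR with two hidden neurons:

```python
import numpy as np
import tensorflow as tf

# XOR: not linearly separable, so a single perceptron cannot learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(2, activation="tanh"),     # hidden layer: 2 neurons
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output neuron
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X, verbose=0).round())   # should approximate [[0],[1],[1],[0]]
```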
Multilayer Feedforward Neural Networks
• Neural networks
• interconnected perceptron-like computing elements called artificial neurons
• Formed from layers of computing units
• The output of one unit affects the behaviour of all units following it
• In a perceptron the activation function is a hard threshold
• Small variations cause large swings, which is terrible in a network!
• A neuron has a smooth activation function:
Perceptron vs neuron
• Apart from more complicated notation, and the use of a smooth activation function, a neuron performs the same operations as a perceptron
• The neuron first computes the sum of products of its inputs with the weights (plus the bias)
• The output (denoted by a) is obtained by passing this sum through the activation function
• Its output is the activation value of the unit
• The inputs to a neuron are activation values from neurons in the previous layer
Error Backpropagation
• The problem is that this isn't what happens when our network
contains perceptrons.
• A small change in the weights or bias of any single perceptron
in the network can sometimes cause the output of that
perceptron to completely flip, say from 0 to 1.
• That flip may then cause the behaviour
of the rest of the network to completely
change in some very complicated way.

http://neuralnetworksanddeeplearning.com/chap1.html
Error Backpropagation
• We can overcome this problem by introducing a new type of
artificial neuron called a sigmoid neuron.
• Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output.
• That's the crucial fact which will allow a network of sigmoid neurons to learn incrementally
http://neuralnetworksanddeeplearning.com/chap1.html
Error Backpropagation
• With a hard threshold, a small change in z can flip the output from 0 to 1 (or 1 to 0): small change in z, big effect!
• With a sigmoid, the same small change in z only moves the output slightly (e.g., from 0.51 to 0.49): small change in z, small effect!
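A tiny numerical illustration of the contrast above; the values of z are chosen for illustration only:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)      # perceptron-style hard threshold

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # smooth sigmoid neuron

z = np.array([0.01, -0.01])                # a small change in z around zero
print(step(z))                              # [1. 0.]  -> the output flips completely
print(sigmoid(z))                           # ~[0.5025 0.4975] -> the output barely moves
```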


MLP: Activation Functions
• A suitable activation function is a nonconstant, bounded, and monotonically increasing continuous function
Activation functions
Vanishing and Exploding Gradients
• Vanishing Gradient
• Error travels from the output layer towards the input layer.
• The gradients often get smaller and smaller and approach zero.
• Eventually leaves the weights of the initial or lower layers nearly
unchanged.
• As a result, the gradient descent never converges to the optimum
• Gradient Explosion
• Error gradients can accumulate during an update and result in very
large gradients
• result in large updates to the network weights
• in turn, an unstable network
https://www.analyticsvidhya.com/blog/2021/06/the-challenge-of-vanishing-exploding-gradients-in-deep-neural-networks/

Vanishing and Exploding Gradients
• Vanishing Gradient
• The sigmoid saturates at 0 or 1, where its derivative is (almost) zero
• All the fun happens very close to zero
• When saturated, backpropagation has no gradients to propagate!
• The fix (next slide): a non-saturating activation, e.g. activation=tf.nn.relu

Vanishing and Exploding Gradients


• Better, non-saturating activation functions
• ReLU and leaky ReLU

Rectified Linear Unit
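Minimal NumPy definitions of the two non-saturating activations named above; the leak coefficient 0.01 is a common choice used here for illustration:

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: max(0, z); its gradient is 1 for z > 0, so it does not saturate there
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha for z < 0 so gradients never vanish completely
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))         # [0.    0.    0.    0.5   2.  ]
print(leaky_relu(z))   # [-0.02 -0.005 0.    0.5   2.  ]
```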


Interconnecting neurons to form a fully
connected neural net
• A layer in a network is the set of nodes (neurons) in a column of the network
• All the nodes are artificial neurons
• Except the input layer, whose nodes are components of the input pattern vector x
• Each layer in the network can have a different number of nodes,
• But each node has a single output
• The output of every node is connected to the input of all nodes in the next layer to form a fully
connected network
• Values of the first layer: inputs
• Values of the last layer: outputs
• All the others: Hidden layers -- hidden neurons
• NN with a single hidden layer: shallow neural network
• NN with multiple hidden layers: deep neural network
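A minimal Keras sketch of such a fully connected (deep) network; the input dimension, layer sizes, and activations are illustrative only:

```python
import tensorflow as tf

# Input layer: components of the pattern vector x (here assumed 16-dimensional)
# Hidden layers: artificial neurons with smooth activations
# Output layer: one neuron per class
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 2 -> "deep" network
    tf.keras.layers.Dense(4, activation="softmax"),  # 4 output neurons = 4 classes
])
model.summary()   # every node's output feeds all nodes of the next layer
```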
Task of a neural network
• Determine the class membership of unknown input patterns
• One way to do this: assign a class label to each output neuron
• Thus, an NN with one output neuron per class can classify an unknown pattern into one of those classes
• The NN assigns an unknown pattern vector to the class whose output neuron has the largest activation value
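In code, "the output neuron with the largest activation value" is simply an argmax over the network outputs; the activation values below are made up:

```python
import numpy as np

activations = np.array([0.05, 0.80, 0.10, 0.05])   # output activations for 4 classes
predicted_class = int(np.argmax(activations))      # index of the largest activation
print(predicted_class)                              # -> 1
```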
Backpropagation
• A NN is defined completely by its weights, biases and activation
function.
• Training a NN:
• Use a dataset to estimate these parameters
• During training, we know the desired output
• But there is no way of knowing the values of the outputs of the hidden layers
• Backpropagation:
• Tool of choice for finding the value of weights and biases in a multilayer network
Training by backpropagation
1. Input the pattern vectors
2. Forward pass through the NN to
• Classify the patterns in the training set
• Determine the classification/output error
• Output error is calculated by a cost function (quadratic cost, cross-entropy, etc)
3. Backward (backpropagation) pass
• “Distributes” the output error back through the network to compute the
changes required to update the parameters
4. Updating the weights and biases in the network
• Using gradient descent
https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/
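In a framework such as Keras (as in the Colab examples), steps 2 to 4 are handled by compile() and fit(); a minimal sketch with random placeholder data, not the lecture's notebook:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Cost function (cross-entropy here) and the gradient-descent optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholder data standing in for the pattern vectors and class labels
x_train = tf.random.normal((100, 16))
y_train = tf.random.uniform((100,), maxval=4, dtype=tf.int32)

# fit() repeats: forward pass -> output error -> backpropagation -> weight/bias updates
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)
```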

Error Backpropagation
• Forward pass or propagation of input to output
https://github.com/rasbt/python-machine-learning-book/blob/master/faq/visual-backpropagation.md

Error Backpropagation
• Backward pass or propagation of error (loss function)
• Gradient Descent
Error Backpropagation
• Weights Adjusted, New sample, feedforward
Error: Loss or Cost Function
• The loss function is basically a performance metric on how well the MLP manages to reach its goal of generating outputs as close as possible to the desired values.
• loss = (desired output - actual output)
• loss = absolute value of (desired - actual)
https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23

MLP: Loss Function


• Mean Square Error (MSE)/Quadratic Loss/L2 Loss
• predictions which are far away from actual values are penalized heavily

• Cross-Entropy Loss (most common in classification)


• increases as the predicted probability diverges from the actual label
• penalizes heavily the predictions that are confident but wrong
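Minimal NumPy versions of these two losses; the label and prediction vectors are illustrative:

```python
import numpy as np

def mse(desired, actual):
    # Mean Square Error / quadratic / L2 loss: large errors are penalized heavily (squared)
    return np.mean((desired - actual) ** 2)

def cross_entropy(true_label_probs, predicted_probs, eps=1e-12):
    # Cross-entropy: grows quickly when a confident prediction is wrong
    p = np.clip(predicted_probs, eps, 1.0)
    return -np.sum(true_label_probs * np.log(p))

y_true = np.array([0.0, 1.0, 0.0])                          # one-hot label: class 1
print(mse(y_true, np.array([0.1, 0.8, 0.1])))               # small error
print(cross_entropy(y_true, np.array([0.1, 0.8, 0.1])))     # ~0.22
print(cross_entropy(y_true, np.array([0.9, 0.05, 0.05])))   # confident but wrong -> ~3.0
```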
Backprop Algorithm
1. Initialize weights w to random values
2. For each training example, perform a forward pass:
1. Compute the induced local fields (weighted sums of the actual outputs of the neurons in layer l-1)
2. Compute the neuron outputs by applying the transfer/activation function
3. Compute the error signal (desired network output - actual network output)
3. Perform a backward pass:
1. Compute the local gradients (using the derivative of the transfer/activation function)
2. Adjust the weight vectors of all neurons (learning-rate constant, optionally with momentum learning)
4. Go back to step 2
Repeat until stopping criterion is met
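A compact NumPy sketch of one hidden layer trained this way, with sigmoid activations and a quadratic cost; shapes, data, and the learning rate are illustrative, and the local gradients (delta) follow the slide's recipe only loosely:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                            # 8 training patterns, 3 inputs
D = rng.integers(0, 2, size=(8, 1)).astype(float)      # desired outputs

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)          # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)          # hidden -> output
eta = 0.5                                              # learning rate

for epoch in range(1000):
    # Forward pass: induced local fields and neuron outputs
    z1 = X @ W1 + b1;  a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)

    # Error signal (desired - actual) and local gradients, using h'(z) = a(1-a) for the sigmoid
    e = D - a2
    delta2 = e * a2 * (1 - a2)                         # output-layer local gradient
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)           # backpropagated to the hidden layer

    # Gradient-descent weight/bias adjustments
    W2 += eta * a1.T @ delta2;  b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1;   b1 += eta * delta1.sum(axis=0)

print("mean squared error:",
      np.mean((D - sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)) ** 2))
```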
Learning is Searching
• Learning = searching a hypothesis space
• E.g., weight set of an ANN
Known mathematical relationship between weight values and
error signal helps us in the search
• Gradient descent/hill-climbing:
Find the optimal parameter changes following the steepest path
up/down on the performance hypersurface
• e.g., delta rule or backpropagation algorithm for perceptron weights
Gradient Learning
• For a cost/eval function ε(•) and each parameter/weight w, iteratively update:
w ← w - η ∂ε/∂w
• …for all parameters until a given stopping criterion is met
• We usually start from some random initial weights
https://github.com/rasbt/python-machine-learning-book/blob/master/faq/visual-backpropagation.md
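A tiny sketch of this update rule on a one-dimensional cost ε(w) = (w - 3)²; the function, starting point, and η are made up for illustration:

```python
# Gradient learning on eps(w) = (w - 3)^2, whose gradient is 2*(w - 3)
eta = 0.1        # learning rate
w = -5.0         # arbitrary starting point

for step in range(50):
    grad = 2.0 * (w - 3.0)    # d(eps)/dw
    w = w - eta * grad        # w <- w - eta * gradient

print(w)   # converges towards the minimum at w = 3
```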

Error Backpropagation
• Backward pass or propagation of error (loss function)
• Gradient Descent
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/more_images/LossAlps.png

Local and global Minima


https://medium.com/datathings/neural-networks-and-backpropagation-explained-in-a-simple-way-f540a3611f5e

Gradient Descent: effect of optimizers


Limitations of Local Search
For backpropagation, transfer functions must be differentiable
(for cost function to be differentiable)
• Local optima occur with nonlinear hypothesis spaces

(Figure: network performance vs weight configuration for a multi-layer perceptron (MLP), showing a global optimum, a local optimum, and the starting weights. Note: hill climbing. Must we randomize the weights until the global optimum is reachable?)
Initialization
• Initial weights decide which local optimum is reached
Backpropagation networks should be reset / trained multiple
times (keep the best)

(Figure: the same performance surface with several possible starting weights, a local optimum, and the global optimum. Note: hill climbing. Is the global optimum reached?)
Momentum
• Combining the current gradient and the previous weight change:
Δw(n) = α Δw(n-1) - η ∂ε/∂w
• where α is a momentum constant
• Smooths weight changes and suppresses oscillations
• Accelerates learning in the same direction
• Enables escape from small local optima
(Figure: momentum update on the performance surface, from the starting weights towards the optimum. Note: hill climbing.)
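A sketch of a standard momentum update on the same kind of one-dimensional cost; α and η are illustrative:

```python
# Momentum: combine the previous weight change with the current gradient step
eta, alpha = 0.05, 0.9        # learning rate and momentum constant
w, delta_w = -5.0, 0.0

for step in range(200):
    grad = 2.0 * (w - 3.0)                    # gradient of eps(w) = (w - 3)^2
    delta_w = alpha * delta_w - eta * grad    # smooths changes, accelerates in a constant direction
    w += delta_w

print(w)   # approaches the minimum at w = 3 faster than plain gradient descent with the same eta
```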
Learning Rate
• Learning rate η too small: very slow progress
• Learning rate η too large: oscillations or reductions in performance
• Adaptive learning rates
• Increase the rate in the absence of oscillations; decrease it otherwise
(Figure: performance vs weight, showing oscillations around the optimum when η is too large. Note: hill climbing.)
Stopping Criteria
• Stop when a maximum number of epochs has been exceeded
• Stop when the mean squared error (MSE) on the training set is
small enough
• Stop when the gradient is below a desired threshold
• Stop when overfitting is observed
Problem Solved?
• Found the global optimum? (lucky!)
• i.e., the network performs optimally
on the data that it was trained on
! Does not guarantee any performance on unseen data

Possible scenario...
• I trained a network to recognize people from passport photos…
• …but it fails whenever a person smiles
→ The training data did not include any pictures of people smiling!
Curse of Finite Sample Size
• A problem…
• Unlimited possibilities in nature
• …or at least a lot more than we can collect for training
So a classifier, in actual use, may encounter something new
• How can we guarantee that the classifier gives the best possible
response to this?
• Another problem…
• Collected sample may be noisy
• Inaccuracies in data collection
• May be different every time!
Generalization
• Input-output mapping of the network should be correct for data
never used in creating or training the network
• Generalization – the ability to produce satisfactory responses
to patterns that were not included in the training set
• Extra-sample error – the average prediction error for data that the
neural network has never seen
• In-sample error – is the average prediction error for data that the neural
network has been trained on
! In-sample (training) error is a poor predictor for extra-sample (testing)
error
Example
• Google Colaboratory Example
https://colab.research.google.com/drive/1IsUmqqs-y0EAzmxaqjVJWSVsRnrOXyjl?usp=sharing
Example
• Google Colaboratory Example
• Deep Learning Example Part01
https://colab.research.google.com/drive/1jcpFC8ZtSlRm-d1qdisEPXqvxDHXyxMf?usp=sharing
• Deep Learning Example Part02
https://colab.research.google.com/drive/1KWvhT8mUnI4PMtox6cd98eqQiw5xxEow?usp=sharing
• Deep Learning Example Part03
https://colab.research.google.com/drive/1_qUCLPB7MzDBMws5I2LvJHWt5rJsiUtw?usp=sharing
Deep Convolutional Neural
Networks (CNN)
Deep Convolutional Neural Networks (CNN)
• Up to this point: patterns were organised in terms of feature vectors
• The form of these features is specified by a human designer
• Extracted from the images prior to being input to the NN
• Convolutional Neural Networks:
• Accept images as inputs
• Learn the features as well as the classification
Basic CNN architecture
Basics of a CNN operation
• The type of neighbourhood processing in a CNN is spatial convolution
• Computes a sum of products between pixels and a set of kernel weights
• At every spatial location in the input image
• The result at each (x, y) is a scalar value
• This scalar value is the output of a neuron
• Adding a bias and passing the result through an activation function → we have our good old NN!
• Neighbourhoods → Receptive Fields (RF)
• The receptive field moves over the image executing the convolution
• The set of weights, arranged in the shape of a receptive field, is a kernel
• The spatial increment by which the RF moves is the stride
• To each convolution value we add a bias
• Then pass the result through an activation function to generate a single value
• This value is fed to the corresponding location in the input of the next layer
• This is repeated for all locations in the input image, resulting in a 2D set of values stored in the next layer as a 2D array called a feature map
• → the role of convolution here is to extract features, such as edges, points, and blobs (see the sketch below)
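A small NumPy sketch of this sum of products over the whole image; the kernel values and stride are illustrative:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel (a receptive field of weights) over the image, computing
    a sum of products at every location -> a 2D feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            fmap[i, j] = np.sum(patch * kernel)     # sum of products at this location
    return fmap

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])        # a tiny edge-like kernel
print(conv2d(image, kernel))                         # 4x4 feature map
```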
• Convolutional layer:
• Three feature maps, obtained from three distinct kernels!
• After convolution and activation:
• Subsampling (or pooling):
• Produces pooled feature maps: the Pooling Layer
• Reduction in spatial resolution:
• responsible for translational invariance
• Reduces the volume of data
• Done by subdividing the feature maps into a set of small (typically 2x2) regions:
• Pooling neighbourhoods
• Replacing all the values of that neighbourhood by a single value
• Common pooling methods (see the sketch after this list):
• Average pooling: substitute by the average
• Max-pooling: substitute by the max value
• L2 pooling: substitute by the square root of the sum of the squared values
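The pooling sketch referred to above: a NumPy version of 2x2 max-pooling (the input values are illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 pooling neighbourhood by its maximum value."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]       # drop odd borders if any
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)
print(max_pool_2x2(fm))
# [[4. 2.]
#  [2. 8.]]
```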
• Convolution:
• Filtered images
• Pooling:
• Filtered images of lower resolution
• The pooled feature maps in the first layer become the inputs to the next layer
• But we now have multiple pooled feature maps
• As convolution is a linear operation (remember assignment 1??)
• The values can be combined into a single one by superposition
• The ultimate goal is classification:
• The final pooled feature maps are fed into a Fully Connected Neural Net
• As we've seen before → the input should be vectorized.
Example

• Think of each element of a 2D array in the top row as a


neuron
• The outputs of these neurons are pixel values, creating
feature maps
• The neurons in the feature map of the 1st layer have output values generated by convolving a kernel with the input image
• The kernel's size and shape are the same as the receptive field
• And its coefficients are learned during training
• To each convolution value we add a bias and pass the
result through an activation function to generate the
output value of the corresponding neuron in the feature
map
• The output values of neurons in the pooled feature maps
are generated by pooling the output values of neurons in
the feature maps
• The kernel weights (shown as intensity values) are
learned from sample images using backpropagation
• Therefore, the nature of the learned features is determined
by the learned kernel coefficients
Graphical illustration of the functions
performed by the components of a CNN
(Figure: input image → feature maps → pooled feature maps → feature maps → pooled feature maps → vector → fully connected neural net → output classes 0 to 9)
https://cs231n.github.io/convolutional-networks/

CNN: Layer Architecture


The architecture shown here is a tiny VGG Net http://www.robots.ox.ac.uk/~vgg/research/very_deep/

pre-trained weights: https://keras.io/api/applications/


Teaching a CNN to recognise simple images
(Figure: training image set and test image set)

CNN to recognise handwritten numerals
(MNIST dataset)
• 60,000 training images
• 10,000 test images
• Grayscale images of size 28x28 pixels
Architecture of the CNN trained to recognise
ten digits in the MNIST dataset
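A minimal Keras sketch of a CNN of this kind for 28x28 grayscale digits; the number of kernels and layer sizes are illustrative and not necessarily the exact architecture shown in the figure:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                       # 28x28 grayscale image
    tf.keras.layers.Conv2D(6, (5, 5), activation="relu"),    # convolution -> feature maps
    tf.keras.layers.MaxPooling2D((2, 2)),                    # pooling -> pooled feature maps
    tf.keras.layers.Conv2D(12, (5, 5), activation="relu"),   # second convolutional layer
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                               # vectorize the pooled feature maps
    tf.keras.layers.Dense(10, activation="softmax"),         # fully connected net: 10 digits
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# To actually train on MNIST (downloads the dataset):
# (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# model.fit(x_train[..., None] / 255.0, y_train, epochs=3, validation_split=0.1)
```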
Kernels
Results of a forward pass
Let’s play with this
• http://www.cs.cmu.edu/~aharley/vis/
CIFAR-10 Dataset
Same architecture as before
Kernels of the first convolution layer
Kernels of the 2nd convolution layer
Graphical illustration of a forward pass
Example
• Google Colaboratory Example
• Deep Learning Example Part04: CNN
https://colab.research.google.com/drive/1FTl7TWxd05gj13hPC2A-IrKnNx7W9cHS?usp=sharing
Training and evaluating a classifier
• The dataset should be divided into:
• Training set
• Usually 50% of the dataset
• Used to create a model:
• given the class label for each pattern
• the goal is to adjust the parameters in order to assign the right class to the appropriate example (data item)
• Validation set
• 25%
• Check if the performance objective is met against unseen data, if not train the classifier again (tweaking
the classifier design)
• Test set
• 25%
• Check the behaviour of the classifier with unseen data
• Almost like the validation set, but without trying to make it better this time!

• If training/validation results are acceptable but test results are not:
• The training overfit the system parameters to the available data
• Wrong architecture, small dataset, noisy/biased dataset
HEADS UP: THE TRAINING/VALIDATION/TEST SPLIT ABOVE IS THE MOST IMPORTANT PART OF THE WHOLE COURSE!
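A sketch of the 50/25/25 split using scikit-learn; the arrays and random seed are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 16)             # placeholder pattern vectors
y = np.random.randint(0, 4, size=1000)   # placeholder class labels

# 50% training, 50% temporarily held out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=0)
# Split the held-out half into 25% validation and 25% test
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 500 250 250
```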
Classification metrics
• Key methods:
• Accuracy
• Recall
• Precision
• F1-Score
• Model evaluation:
• Either its answer is correct
• Or incorrect
• After training the model, we want to evaluate it
• Input a validation example (with known label): e.g. Image of a Dog
• Get the output of the model classification: e.g. is it a Dog or a Cat?
• Let’s assume we want to classify Dog
• Outputs of the model:
• Input Dog => output Dog → True positive
• Input Cat => output Dog → False positive
• Input Dog => output Cat → False negative
• Input Cat => output Cat → True negative
• Repeat this for all the validation/test data
• Count the total number of TP, TN, FP, FN
• Calculate the metrics from these
Confusion Matrix

https://en.wikipedia.org/wiki/Precision_and_recall
Metrics: Accuracy
• Accuracy:
• Number of correct predictions divided by the total number of predictions
• (TP + TN)/(TP +TN + FP + FN)

• Useful when the target classes are well-balanced
• Not useful if the dataset is unbalanced:
• 99% images of Dogs and 1% of Cats
• A model that only predicts Dogs would have great accuracy, but it would be a terrible classification model!
• We'll have to look into Recall and Precision as well!
• We’ll have to look into Recall and Precision as well!
Metrics: Recall
• Recall
• Measures the extent to which the model finds all the relevant information
• Number of true positives divided by the number of true positives plus the number of false negatives
• TP/(TP+FN)
Metrics: Precision
• Precision:
• Measures the extent to which the model finds only the relevant information
• Number of true positives divided by the number of true positives plus the number of false positives
• TP/(TP+FP)
Precision vs Recall
• There is a trade-off here:
• Precision measures the proportion of the data points the model labels as relevant that actually are relevant
• Recall measures the ability to find all relevant instances
• Precision vs Recall curves are very useful as an analysis tool!
F1 Score
• F1 Score
• Measures the trade-off between precision and recall
• It is the Harmonic Mean of precision and recall:
F1 = 2 · (precision · recall) / (precision + recall)
• Why not the average?
• Because the average does not punish extreme values
• E.g. Precision = 1, but Recall = 0
• Average = 0.5
• F1 score = 0
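Putting the four metrics together in a short sketch; the TP/TN/FP/FN counts are made up:

```python
TP, TN, FP, FN = 100, 50, 10, 5        # illustrative counts from a validation set

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```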
Other metrics…

https://en.wikipedia.org/wiki/Confusion_matrix
Example

• Accuracy ((TP+TN)/total) = 0.91
• Precision (TP/(TP+FP)) = 0.91
• Recall (TP/(TP+FN)) = 0.95
• https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Further important reading:
• Bias in Machine Learning
https://towardsdatascience.com/june-edition-bias-in-the-machine-994eadbccec2
