MACHINE LEARNING FUNDAMENTALS
COEN 435
(Module_3: Neural Networks)
INSTRUCTOR: Dr. S. M. YUSUF
Department of Computer Engineering, A. B. U, Zaria.
OBJECTIVE
❑Software
✓ Python (assumption: you are already a Python programmer)
✓ NumPy (Pandas, Torchvision)
✓ TensorFlow (Keras, PyTorch)
✓ Scikit-learn (sklearn)
✓ Matplotlib
❑Hardware
✓ 64-bit system architecture (recent TensorFlow releases require 64-bit Python)
✓ 2+ GHz CPU
✓ 4 GB RAM
✓ At least 10 GB of hard disk space
Neural Network (NN)
❑Introduction
✓ A subset of machine learning techniques that learns features directly from data using many neurons organized in layers.
✓ This is a class of ML algorithms that use a cascade of layers of processing units to extract features from data and make predictive guesses about new data.
✓ Also known as an Artificial Neural Network (ANN).
Source: Machine Learning Guide, 9. Deep Learning
Neural Network (Cont.)
✓ Inspired by the structure of the neurons located in the human brain and how
the brain works.
✓ Layers of neurons are interconnected in a hierarchical manner.
✓ Learning is through progressive abstraction.
✓ The success of a subset of ANNs (deep learning) is driven by the availability of more training data (e.g., ImageNet) and,
✓ Relatively low-cost GPUs for efficient numerical computation.
✓ Deep learning (DL) is used to analyze the massive data of large industrial companies and is an integral part of modern software production.
Artificial Neural Networks
✓ The network learns from experience: examples / training data.
✓ The strength of connection between the neurons is stored as a weight value for the specific connection.
✓ Learning the solution to a problem = changing the connection weights.
Figures: a physical neuron and an artificial neuron
ANN Cont.
❑NN has 3 primary layers
✓ Input layer
✓ Contains the input variables; the number of nodes depends on the number of input variables; connected to a hidden layer.
✓ Hidden layer
✓ Produces intermediate output; number depends on the nonlinearity (complexity) in the
data.
✓ Output layer
✓ Produces the network's final predictions; the number of nodes depends on the task (e.g., the number of classes).
ANNs (a) Shallow Network (b) DNN (Khalil et al. 2019)
Types of Artificial Neural Networks
❑ Includes
✓ Perceptron
✓ Multi-Output Perceptron
✓ Single Layer Neural Network
✓ Multi-layered Perceptron (MLP)
✓ Deep Learning
✓ Deep Neural Networks (DNN)
✓ Convolutional Neural Networks (CNN)
✓ Recurrent Neural Networks (RNN)
✓ Long Short Term Memory (LSTM)
✓ Bidirectional Long Short Term Memory (BLSTM)
✓ Gated Recurrent Units (GRU)
✓ Generative Adversarial Networks (GAN)
Forward Propagation in a Perceptron
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
❑ The bias term shifts the activation function left or right, regardless of the inputs.
❑ In vector form, the output of a single perceptron is the activation function applied to the dot product of X and W (plus the bias).
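As an illustration of forward propagation in vector form, here is a minimal NumPy sketch (the input values, weights, bias, and the choice of sigmoid activation are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Single-perceptron output: activation applied to the dot product of x and w, plus bias."""
    z = b + np.dot(x, w)   # summed weighted inputs, shifted by the bias
    return sigmoid(z)      # activation function g(z)

x = np.array([1.0, 2.0, -1.0])   # illustrative inputs
w = np.array([0.5, -0.3, 0.8])   # illustrative weights
print(perceptron_forward(x, w, b=0.1))
```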
Activation Function
❑ Also called a threshold or transfer function.
❑ A function that transforms the summed weighted inputs of a neuron, Z, into an output, g(Z).
❑ Common activation functions include the sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU).
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
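A minimal NumPy sketch of these three activation functions (illustrative, not tied to any particular figure on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes z into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative z, z otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```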
Activation Function Cont.
❑ The purpose of an activation function is to introduce non-linearities into the neural network.
❑ E.g., how do you distinguish between the red and green colored points?
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
Example 1
(Worked example presented graphically on the slides.)
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
Multi-Output Perceptron
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
❑ Since all inputs are fully connected to the layer of output neurons, the
output layer is called a dense layer.
❑ A dense layer is a layer of neurons whose inputs are fully (densely)
connected to outputs.
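A small sketch of a dense layer using Keras (one of the listed course libraries); the layer sizes and sigmoid activation are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

# Dense layer: 2 output neurons, each fully connected to all 3 inputs.
dense = tf.keras.layers.Dense(units=2, activation="sigmoid")

x = np.array([[1.0, 2.0, -1.0]], dtype="float32")   # one sample with 3 input features
y = dense(x)                                         # computes g(xW + b) for both outputs
print(y.numpy().shape)                               # (1, 2): two outputs
```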
Single Layer Neural Network
❑ This is an NN that is made up of a single hidden layer of neurons that feeds into the output layer of neurons.
❑ States of the hidden layer are learned.
Source: Alexander Amini and Ava Soleimany, MIT 6.S191: Introduction to Deep Learning
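A minimal Keras sketch of a single-layer neural network (the number of inputs, hidden units, and activations are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),                # 3 input variables (illustrative)
    tf.keras.layers.Dense(4, activation="relu"),      # single hidden layer (its states are learned)
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer
])
model.summary()
```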
Deep Neural Network
❑ This is a network that is made up of a stack of hidden layers.
Generic Deep Neural Network (DNN) Architecture (Khalil et al., 2019)
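A minimal Keras sketch of a deep neural network as a stack of hidden layers (layer widths and activations are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),               # 10 input features (illustrative)
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer 3
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer
])
```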
Training Neural Networks
❑ Why?
✓ In a given network, wrong weights result in wrong predictions.
✓ The network needs to know if its predictions are acceptable or not
(based on historical data).
❑ Training a neural network is to teach it how to perform a task.
✓ Fitting a model.
❑ Empirical loss function
✓ Measures the cost incurred from incorrect predictions.
✓ The type of loss function depends on the task at hand; classification
or regression.
Neural Network Algorithm
❑ Allows us to find the weights using the backpropagation algorithm.
✓ 5 steps in the algorithm
▪ Random Initialization
✓ Initializes the weights by randomly selecting their values, drawn from a standard distribution or based on any available prior information.
▪ Activation and feed forward (Forward propagation)
✓ Involves multiplying the weights by the input variables, summing, and applying the activation function.
✓ The predicted value is obtained at the end of this step.
▪ Error calculation & backward propagation
✓ Finding the error between the actual and predicted value.
✓ Propagating the error backward to find the error contribution at the hidden layers is known as backpropagation.
▪ This is because the error at the output layer is not entirely due to the wrong weights connected to that layer.
Neural Network Algorithm Cont.
▪ Weight Updating
✓ Adjusting the weight of a node so that the error at hidden nodes decreases, which in turn
reduces the total error.
▪ Stopping Criteria
✓ A full cycle of passing the whole dataset through a feedforward step, followed by error calculation, backpropagation, and weight updating, is 1 epoch.
✓ Repeat these epochs until the error reaches (approximately) zero or the weights stop updating; a minimal end-to-end sketch of the five steps follows.
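A minimal NumPy sketch of the five steps above for a single sigmoid neuron trained with a squared-error loss; the toy data, learning rate, and stopping threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # toy training inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy binary targets

W = rng.normal(scale=0.1, size=3)   # Step 1: random initialization
b = 0.0
lr = 0.1

for epoch in range(200):            # one full pass over the data = 1 epoch
    z = X @ W + b                   # Step 2: forward propagation
    y_hat = 1.0 / (1.0 + np.exp(-z))
    error = y_hat - y               # Step 3: error calculation ...
    delta = error * y_hat * (1 - y_hat)   # ... and backward propagation through the sigmoid
    grad_W = X.T @ delta / len(X)
    grad_b = delta.mean()
    W -= lr * grad_W                # Step 4: weight updating
    b -= lr * grad_b
    loss = np.mean(error ** 2)
    if loss < 1e-3:                 # Step 5: stopping criterion (near-zero error)
        break
```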
Training Neural Networks Cont.
❑ EMPIRICAL LOSS FUNCTION
Training Neural Networks Cont.
❑ BINARY CROSSENTROPY LOSS FUNCTION
Training Neural Networks Cont.
❑ MEAN SQUARED ERROR LOSS FUNCTION
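A minimal NumPy sketch of the binary cross-entropy and mean squared error losses, each averaged over the training examples (the empirical loss); the example targets and predictions are illustrative:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Classification loss: heavily penalizes confident but wrong probability estimates.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # Regression loss: average squared difference between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_crossentropy(y_true, y_pred), mean_squared_error(y_true, y_pred))
```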
Training Neural Networks Cont.
❑ LOSS OPTIMIZATION
✓ Finding the network weights that achieve the lowest loss.
❑ What values of W give us the minimum loss?
✓ First, understand how the gradient (slope) of the loss changes.
✓ The gradient tells us how much we want to change the weights in order to reduce the loss incurred on a particular training example.
✓ The weights are then found using the gradient descent algorithm.
Gradient Descent Optimization Algorithm
❑ Initialize weights randomly: W ~ N(0, σ²)
❑ Loop until convergence:
✓ Compute the gradient, ∂J(W)/∂W, using backpropagation
✓ Update the weights: W ← W − η ∂J(W)/∂W
❑ Return the weights
❑ Optimizers for Training Deep Neural Networks
❑ Stochastic Gradient Descent (SGD)
❑ Mini-batch SGD
❑ SGD with Momentum (SGDM)
❑ Adadelta
❑ Adam
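A short sketch of how these optimizers are typically instantiated in Keras; the learning rates and momentum value are illustrative defaults, and mini-batching is controlled separately via the batch_size argument of model.fit:

```python
import tensorflow as tf

sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)                  # stochastic / mini-batch SGD
sgdm     = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)    # SGD with momentum (SGDM)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=1.0)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)
```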
❑ 4 Key Ingredients of Developing a DNN
✓ Dataset
✓ How many data points?
✓ For regression/classification?
✓ How many classes?
✓ Multi-label/ multi-class/binary-class?
✓ Loss function
✓ Binary Cross-entropy/categorical cross-entropy?
✓ Model/ Architecture
✓ Type of DL algorithm?
✓ No. of layers and nodes?
✓ Make an informed decision by experimenting with different architectures
✓ Optimization method
✓ SGD?
✓ Batches / Adaptive step-size?
✓ Also set the learning rate, regularization strength, number of epochs, etc. (a minimal sketch combining the four ingredients follows)
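A minimal Keras sketch tying the four ingredients together; the toy dataset, layer sizes, loss, optimizer, and hyperparameters are all illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

# 1) Dataset: toy binary-classification data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

# 2) Model / architecture: number of layers and nodes chosen by experimentation.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# 3) Loss function and 4) optimization method, plus learning rate, batch size, and epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```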
❑ SELECTING LOSS FUNCTIONS AND ACTIVATION FUNCTIONS
✓ The loss function must fit the activation function in the last layer.
✓ Squared loss and hinge loss fit with a linear activation function.
✓ Negative log likelihood (NLL) loss fits with the sigmoid function.
✓ Negative log likelihood multiclass (NLLM) loss fits with the softmax function.
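A sketch of these pairings in Keras terms (Keras's cross-entropy losses play the role of the NLL/NLLM losses for sigmoid and softmax outputs; the tiny helper model and layer sizes are illustrative):

```python
import tensorflow as tf

def head(units, activation):
    # Tiny one-layer model so each loss/activation pairing can be compiled (illustrative).
    return tf.keras.Sequential([tf.keras.layers.Input(shape=(4,)),
                                tf.keras.layers.Dense(units, activation=activation)])

head(1, "linear").compile(optimizer="sgd", loss="mse")                        # linear output + squared loss
head(1, "linear").compile(optimizer="sgd", loss="hinge")                      # linear output + hinge loss
head(1, "sigmoid").compile(optimizer="sgd", loss="binary_crossentropy")       # sigmoid output + NLL
head(3, "softmax").compile(optimizer="sgd", loss="categorical_crossentropy")  # softmax output + NLLM
```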
OVER FITTING
❑ A scenario that occurs during training where the deep learning model learns the details of the training data along with its noise and random fluctuations.
✓ It tries to fit every data point on the curve.
✓ Given a new data point, the model curve may not correspond to the patterns in the new data.
✓ Thus, the model cannot predict very well on the test/validation set.
https://www.simplilearn.com/tutorials/
OVER FITTING Cont.
❑ Reasons for Overfitting
✓ The data used for training contains noise.
✓ Model has a high variance.
✓ Limited/ scarce training data.
✓ Model is too complex.
❑ NB:
✓ There is a risk of overfitting when optimizing the loss on the training data.
✓ Overfitting is not a huge problem for large networks trained on large amounts of data.
UNDERFITTING
❑ A scenario that occurs where a Deep Learning model can neither learn the
relationship between variables in the data nor predict (or classify) a new
data point.
❑ An underfitted model performs poorly on the training data and is not able to
model the relationship between the input data and output class labels.
❑ Reasons for Underfitting
✓ Data utilized for training is not clean
(contains garbage values).
✓ Model has high bias.
✓ Size of training data is not enough.
✓ Model is too simple.
Overfit (Blue), Underfit (Orange), and Generalizing (Green) Models
REGULARIZATION
❑ These are strategies utilized to calibrate a deep learning network in order
to minimize the loss as well as reduce overfitting or underfitting.
✓ Reduce the test error, possibly at the expense of increased training error.
❑ Regularization strategies help choose a set of parameters that ensure the deep learning model generalizes well.
✓ Regularization helps control the DL model's capacity, ensuring that the model is better at making (correct) classifications on data points that it was not trained on.
❑ Without Regularization, classifiers can easily become too complex and
overfit to our training data.
❑ Second to the learning rate, regularization is the most important parameter of a DL model that you can tune.
REGULARIZATION Cont.
❑ A regularization function can be added to the loss function, or regularization can be explicitly or implicitly built into the network architecture.
❑ Methods/ Strategies?
❑ Regularization penalties.
✓ Utilized to update the loss function by adding an additional
parameter to constrain the capacity of the model.
✓ E.g. Weight Decay (L1 , L2 & Elastic Net)
✓ Weight decay penalizes the norm of all the weights.
❑ Dropout (Explicit)
❑ Batch normalization (Explicit)
❑ Data Augmentation (implicit)
❑ Early stopping (implicit)
❑ Too much regularization can result in underfitting.
REGULARIZATION Cont.
❑ Weight Decay Regularization Penalty
✓ Let the empirical loss (the loss over the entire training set) be L.
✓ Recall how L was originally defined (as the average loss over all training examples).
✓ The regularization penalty is commonly written as R(W).
✓ Penalizing the loss, the updated loss becomes J(W) = L + λR(W), where λ is the regularization term (strength) and η is the learning rate.
✓ Consequently, the weight update will be of the form: W ← W − η ∂J(W)/∂W − ηλW
REGULARIZATION Cont.
✓ Simplifying the weight update: W ← W(1 − ηλ) − η ∂J(W)/∂W
✓ The weight, W, decays by a factor of (1 − ηλ) before taking a gradient step.
❑ For:
✓ L1 (Lasso) Regularization
✓ Takes the sum of the absolute values of the weights, i.e. R(W) = Σ_i Σ_j |W_ij|
✓ L2 (Ridge) Regularization
✓ R(W) = ‖W‖² = Σ_i Σ_j (W_ij)²
✓ Discourages large weights in the matrix W, preferring smaller ones.
✓ Elastic Net Regularization seeks to combine both L1 & L2 regularization:
✓ R(W) = Σ_i Σ_j ((W_ij)² + |W_ij|)
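In Keras, these weight-decay penalties are added per layer through kernel_regularizer; the 0.01 regularization strengths and the layer width are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import regularizers

l1_layer = tf.keras.layers.Dense(16, activation="relu",
                                 kernel_regularizer=regularizers.l1(0.01))                  # L1 (Lasso)
l2_layer = tf.keras.layers.Dense(16, activation="relu",
                                 kernel_regularizer=regularizers.l2(0.01))                  # L2 (Ridge)
en_layer = tf.keras.layers.Dense(16, activation="relu",
                                 kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))   # Elastic Net
```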
REGULARIZATION Cont.
❑ Early Stopping
✓ Easiest to implement and is in fairly common use.
✓ The idea is to train on your training set, but at every epoch, evaluate the
loss of the current W on a validation set.
✓ Generally, the loss on the training set reduces consistently with each iteration.
✓ The loss on the validation set will initially decrease, but then begin to increase again.
✓ Stop training when the validation loss starts to increase and return the weights that had the lowest validation error (see the Keras callback sketch below).
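A minimal sketch of early stopping as a Keras callback; the patience value and the commented fit call (with its hypothetical X_train, y_train arrays) are illustrative assumptions:

```python
import tensorflow as tf

# Monitor the validation loss, wait a few epochs for improvement, then restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=5,
                                              restore_best_weights=True)

# Usage (hypothetical data and model):
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```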
REGULARIZATION Cont.
❑ Dropout
✓ Perturbing the neural network by randomly selecting, on each training step, a set of units (neurons) in each layer and prohibiting them from participating.
✓ All of the units take a kind of “collective” responsibility for getting the
answer right, and will not be able to rely on any small subset of the
weights to do all the necessary computation.
✓ This tends also to make the network more robust to data perturbations.
✓ Aims to help prevent overfitting by increasing testing accuracy, perhaps at the expense of training accuracy.
✓ It is common to set the dropout probability p to 0.5.
✓ Experiment with different values of p to get good results on your problem and data.
REGULARIZATION Cont.
❑ Dropout can be applied with smaller probabilities (e.g., p = 0.10–0.25) immediately after pooling (subsampling) layers.
❑ It is common to place dropout layers with p = 0.5 between the fully connected (FC) layers of an architecture, as sketched below.
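A minimal Keras sketch of this placement; the convolutional architecture, input shape, and layer sizes are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),            # illustrative image input
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),                        # small p immediately after pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                         # p = 0.5 between FC layers
    tf.keras.layers.Dense(10, activation="softmax"),
])
```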
REGULARIZATION Cont.
❑ Batch Normalization
✓ Tends to achieve better performance than Dropout regularization.
✓ The idea is to standardize the input values for each mini-batch.
✓ How?
✓ Subtracting off the mean and dividing by the standard deviation of
each input dimension.
✓ The scale of the inputs to each layer remains the same, no matter how
the weights in previous layers change.
✓ NB: the batchwise mean and standard deviation are required to compute batch normalization.
✓ Add the BN transform immediately after the nonlinearity.
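A minimal Keras sketch with batch normalization placed after the nonlinearity, as described above; the layer sizes are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # standardizes the mini-batch: subtract mean, divide by std
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```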
REGULARIZATION Cont.
❑ Data Augmentation
✓ Deep learning models need a large amount of training data in order to
learn millions of parameters.
✓ Insufficient data inputs for training can contribute to overfitting.
✓ Data augmentation is a regularization approach that generates
additional samples of the original training data.
✓ Augmentation can extract more information from the original training
data and improve the performance of the deep learning network.
✓ Noise can be added to the original training inputs to generate new
training samples.
✓ Assignment: Try to explore several data augmentation schemes (a minimal noise-based starting point is sketched below).
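As a starting point for the assignment, a minimal NumPy sketch of the noise-based augmentation mentioned above (the noise level and toy data are illustrative; image-specific schemes such as flips, crops, and rotations are left for you to explore):

```python
import numpy as np

def augment_with_noise(X, y, copies=1, noise_std=0.05, seed=0):
    """Generate extra training samples by adding Gaussian noise to the original inputs."""
    rng = np.random.default_rng(seed)
    X_new = [X] + [X + rng.normal(scale=noise_std, size=X.shape) for _ in range(copies)]
    y_new = [y] * (copies + 1)              # labels are unchanged by the noise
    return np.concatenate(X_new), np.concatenate(y_new)

X = np.random.rand(100, 8)                  # toy inputs
y = np.random.randint(0, 2, size=100)       # toy labels
X_aug, y_aug = augment_with_noise(X, y, copies=2)
print(X_aug.shape)                          # (300, 8): original plus two noisy copies
```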
Assignment 1
❑ Install all the listed Python packages using the Anaconda distribution.
❑ Download the following datasets using the links below:
✓ https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
✓ http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
❑ Mention and explain
✓ 4 types of activation functions.
✓ 3 types of backpropagation algorithms
❑ Explain the following DL concepts:
✓ Shallow classifier
✓ Deep neural networks
✓ Forward pass
✓ Backpropagation
Assignment 2
❑ Explain the difference between batches and adaptive step-size as training strategies for DNNs.
❑ Mention and explain 4 adaptive step-size optimizers.
❑ For what value of k is mini-batch gradient descent equivalent to
✓ stochastic gradient descent?
✓ Batch gradient descent?
❑ Discuss the following loss functions.
✓ Squared loss
✓ Hinge loss
✓ NLL
✓ NLLM
Assignment 3
❑ Practice.
❑ Mention and explain briefly
✓ 5 types of DL optimization algorithms.
❑ Explain the following DNN concepts:
✓ Weight initialization
✓ 5 weight initialization methods
✓ Regularization
✓ Dropout
Thank you