
CS 404/504, Fall 2021

Introduction to Deep Learning


1
CS 404/504, Fall 2021

Lecture Outline

• Machine learning basics


 Supervised and unsupervised learning
 Linear and non-linear classification methods

• Introduction to deep learning


• Elements of neural networks (NNs)
 Activation functions

2
CS 404/504, Fall 2021

Machine Learning Basics


Machine Learning Basics

• Artificial Intelligence is a scientific field concerned with the development of


algorithms that allow computers to learn without being explicitly programmed
• Machine Learning is a branch of Artificial Intelligence, which focuses on
methods that learn from data and make predictions on unseen data

[Figure: in the training phase, labeled data and a machine learning algorithm produce a learned model; in the prediction phase, the learned model is applied to new data to make predictions.]
Picture from: Ismini Lourentzou – Introduction to Deep Learning 7


CS 404/504, Fall 2021

Machine Learning Types


Machine Learning Basics

• Supervised: learning with labeled data


 Example: email classification, image classification
 Example: regression for predicting real-valued outputs
• Unsupervised: discover patterns in unlabeled data
 Example: cluster similar data points
• Reinforcement learning: learn to act based on feedback/reward
 Example: learn to play Go

[Figure: examples of classification (class A vs. class B), regression, and clustering.]
Slide credit: Ismini Lourentzou – Introduction to Deep Learning 8


CS 404/504, Fall 2021

Supervised Learning
Machine Learning Basics

• Supervised learning categories and techniques


 Numerical classifier functions
o Linear classifier, perceptron, logistic regression, support vector machines (SVM), neural
networks
 Parametric (probabilistic) functions
o Naïve Bayes, Gaussian discriminant analysis (GDA), hidden Markov models (HMM),
probabilistic graphical models
 Non-parametric (instance-based) functions
o k-nearest neighbors, kernel regression, kernel density estimation, local regression
 Symbolic functions
o Decision trees, classification and regression trees (CART)
 Aggregation (ensemble) learning
o Bagging, boosting (Adaboost), random forest

Slide credit: Y-Fan Chang – An Overview of Machine Learning 9


CS 404/504, Fall 2021

Unsupervised Learning
Machine Learning Basics

• Unsupervised learning categories and techniques


 Clustering
o k-means clustering
o Mean-shift clustering
o Spectral clustering
 Density estimation
o Gaussian mixture model (GMM)
o Graphical models
 Dimensionality reduction
o Principal component analysis (PCA)
o Factor analysis

Slide credit: Y-Fan Chang – An Overview of Machine Learning 10


CS 404/504, Fall 2021

Linear vs Non-linear Techniques


Linear vs Non-linear Techniques

• Linear classification techniques


 Linear classifier
 Perceptron
 Logistic regression
 Linear SVM
 Naïve Bayes
• Non-linear classification techniques
 k-nearest neighbors
 Non-linear SVM
 Neural networks
 Decision trees
 Random forest

11
CS 404/504, Fall 2021

Linear vs Non-linear Techniques


Linear vs Non-linear Techniques

• For some tasks, the input data are linearly separable, and linear classifiers can be suitably applied

• For other tasks, linear classifiers may have difficulty producing adequate decision boundaries

Picture from: Y-Fan Chang – An Overview of Machine Learning 12


CS 404/504, Fall 2021

Non-linear Techniques
Linear vs Non-linear Techniques

• Non-linear classification
 Features are obtained as non-linear functions of the inputs
 It results in non-linear decision boundaries
 Can deal with non-linearly separable data

[Figure: features are computed as non-linear functions of the inputs, and the outputs are computed from those features.]

Picture from: Y-Fan Chang – An Overview of Machine Learning 13


CS 404/504, Fall 2021

Binary vs Multi-class Classification


Binary vs Multi-class Classification

• A classification problem with only 2 classes is referred to as binary classification


 The output labels are 0 or 1

• A problem with 3 or more classes is referred to as multi-class classification

14
CS 404/504, Fall 2021

Binary vs Multi-class Classification


Binary vs Multi-class Classification

• Both the binary and multi-class classification problems can be linearly or non-linearly separated
 Figure: linearly and non-linearly separated data for a binary classification problem

15
CS 404/504, Fall 2021

Computer Vision Tasks


Machine Learning Basics

• Computer vision has been the primary area of interest for ML


• The tasks include: classification, localization, object detection, instance
segmentation

Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs 16
CS 404/504, Fall 2021

No-Free-Lunch Theorem
Machine Learning Basics

• Wolpert (2002) - The Supervised Learning No-Free-Lunch Theorems


• The derived classification models for supervised learning are simplifications of
the reality
 The simplifications are based on certain assumptions
 The assumptions fail in some situations
o E.g., due to inability to perfectly estimate ML model parameters from limited data
• In summary, the No-Free-Lunch Theorem states:
 No single classifier works best for all possible problems
 This is because we need to make assumptions in order to generalize

17
CS 404/504, Fall 2021

ML vs. Deep Learning


Introduction to Deep Learning

• Conventional machine learning methods rely on human-designed feature


representations
 ML becomes just optimizing weights to best make a final prediction

Picture from: Ismini Lourentzou – Introduction to Deep Learning 18


CS 404/504, Fall 2021

ML vs. Deep Learning


Introduction to Deep Learning

• Deep learning (DL) is a machine learning subfield that uses multiple layers for
learning data representations
 DL is exceptionally effective at learning patterns

Picture from: https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png 19


CS 404/504, Fall 2021

ML vs. Deep Learning


Introduction to Deep Learning

• DL applies a multi-layer process for learning rich hierarchical features (i.e., data
representations)
 Input image pixels → Edges → Textures → Parts → Objects

Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier → Output
Slide credit: Param Vir Singh – Deep Learning 20


CS 404/504, Fall 2021

Why is DL Useful?
Introduction to Deep Learning

• DL provides a flexible, learnable framework for representing visual, textual, and linguistic information
 Can learn in a supervised or unsupervised manner
• DL represents an effective end-to-end learning system
• Requires large amounts of training data
• Since about 2010, DL has outperformed other ML techniques
 First in vision and speech, then NLP, and other applications
• General tasks that deep learning is good at:
 Tasks like face or handwritten text recognition belong to computer vision, because you are feeding graphics into the computer for analysis.
 Other tasks like language translation or speech recognition belong to natural language processing (NLP).
• DL is a sub-branch of ML in that it also has a set of learning algorithms that can train on and learn from data; more specifically, DL is powered by neural networks.

22
CS 404/504, Fall 2021

Deep learning can work with complex images, videos, and unstructured data in ways that traditional machine learning algorithms are not equipped to handle.

However, one must keep in mind that deep learning algorithms can be computationally expensive, as they often require high-end GPUs to process large volumes of data over many hours.

So, for simpler problems that require far less computational effort, we can use machine learning. ML algorithms are also a preferred choice for making predictions based on comparatively smaller datasets.
23
CS 404/504, Fall 2021

What are the types of Deep Learning?

There are two major types of deep learning, supervised and unsupervised.

Supervised Deep Learning: algorithms that are trained on input and output examples. Just as in supervised machine learning, we create an input matrix (X) containing the predictors and an output matrix (y) containing the target variable, and pass this data to the algorithm to learn. The most notable supervised deep learning algorithms are listed below.

• Artificial Neural Networks (ANN)

• Convolutional Neural Networks (CNN)

• Recurrent Neural Networks (RNN)

• Long Short-Term Memory Networks (LSTM)

Picture from: Ismini Lourentzou – Introduction to Deep Learning 24


CS 404/504, Fall 2021

• Unsupervised Deep Learning: algorithms to which we pass the whole data as input. There is no target variable; hence, we derive patterns from the data directly, without any supervision or explicit training, just as in unsupervised machine learning. The most notable unsupervised deep learning algorithms are listed below.

• Self-Organising Maps (SOM)

• Restricted Boltzmann Machines (RBM)

• Autoencoders

• Deep Belief Networks (DBN)

25
CS 404/504, Fall 2021

Introduction to Neural Networks


• Deep Learning is inspired by the way the human brain functions.
• To put it in the simplest form, just as humans learn and improve from experience, a deep learning algorithm learns from examples.
• Neural Networks, also referred to as Artificial Neural Networks (ANN), are at the heart of Deep Learning algorithms.

How Does Deep Learning Mimic the Human Brain?

Biological neuron cell 26


CS 404/504, Fall 2021

• Dendrites are extensions of the nerve cell
 They receive signals and transmit them to the cell body, which processes the stimulus and decides whether to trigger signals to other neuron cells
• If the cell decides to trigger signals, the extension of the cell body called the axon triggers chemical transmission to other cells at the end of the axon.

Perceptron:

A Single Perceptron
27
CS 404/504, Fall 2021

• Perceptrons and other neural networks are inspired by real neurons in our brain.
The procedure by which a perceptron processes data is as follows (a minimal code sketch follows the list):

1. Inputs are fed into the perceptron

2. Each input is multiplied by a weight

3. The weighted inputs are summed, and a bias is added

4. An activation function is applied

5. Note that a step function is used here, but there are other, more sophisticated activation functions like sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), and more.
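Below is a minimal Python sketch of these steps; the input, weight, and bias values are the ones used in the worked example a few slides later (x1 = 1, x2 = 0, x3 = 0; w1 = 6, w2 = 2, w3 = 2; bias = -5), and the helper name perceptron is my own.

import numpy as np

def perceptron(x, w, bias):
    z = np.dot(w, x) + bias          # steps 2-3: multiply inputs by weights, sum, add bias
    return 1 if z > 0 else 0         # step 4: step activation function

x = np.array([1.0, 0.0, 0.0])        # step 1: inputs
w = np.array([6.0, 2.0, 2.0])        # weights
print(perceptron(x, w, bias=-5.0))   # 6*1 + 2*0 + 2*0 - 5 = 1 > 0, so the output is 1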

28
CS 404/504, Fall 2021

Elements of Neural Networks


• NNs consist of hidden layers with neurons (i.e., computational units)
• A single neuron maps a set of inputs into an output number

• The bias is an external parameter of the neuron; it can be modeled by adding an extra input

[Figure: a single neuron with inputs x1, x2, …, xk, weights, a bias, and an activation function σ; the output is ŷ = σ(z), where z is the weighted sum of the inputs plus the bias.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial 29


CS 404/504, Fall 2021
The perceptron consists of 5 parts:

• Input layer
• Weights
• Summation & bias
• Activation function
• Output layer

• The activation function decides whether a signal should be fired (output 1) or not (output 0).

• The final output is triggered as either 1 or 0, as in the example above. Thus, a perceptron is a binary classifier.

• Why is a bias value added?

This is a constant value added to the input of the activation function. It works similarly to an intercept term; the extra bias input typically has the value +1.
The bias value acts as the threshold added to the weighted sum, so that the neuron does not simply produce no (0) output.

Why use activation functions?

An activation function is a form of non-linear function that is used to map the input to a desired range of output values, such as (0, 1) or (-1, 1).
30
CS 404/504, Fall 2021

Suppose x1 = 1, x2 = 0, and x3 = 0.
Let w1 = 6, w2 = 2, w3 = 2.
The larger the weight, the more influential the corresponding input is.
Threshold = 1 and bias = -5.

31
CS 404/504, Fall 2021

• Varying the weights and the threshold/bias results in different possible decision-making models.
• So if, for instance, we lower the threshold by changing the bias from -5 to -3, then there are more possible scenarios in which the output is 1.

32
CS 404/504, Fall 2021

• In a real DL model, we are given input data, which we can't change. The bias term is initialized before you train your neural network model.
• Assume the bias is 7 and the following input data.

• Let's further assume that our weights are initialized as follows:

• So, with the input data, bias, and the output label (desired output):

33
CS 404/504, Fall 2021

• The actual output from your neural network differs from the desired output.
• So what should the neural network do to help itself learn and improve, given this difference between the actual and desired output?

• We can't change the input data, and we have already initialized our bias.
• So the only thing we can do is tell the perceptron to adjust the weights! If we tell the perceptron to increase w1 to 7, without changing w2 and w3, then:

34
CS 404/504, Fall 2021

• Adjusting the weights is the key to the learning process of our perceptron.

• A single-layer perceptron with a step function can use this learning process to tune the weights after processing each set of input data

35
• AND Perceptron:
– inputs are 0 or 1
– output is 1 only when both x1 and x2 are 1
– weights: .5 on x1, .5 on x2, and .75 on a constant bias input of -1
– 2-D input space, 4 possible data points:
  (0, 0): .5*0 + .5*0 + .75*(-1) = -.75, output = 0
  (0, 1): .5*0 + .5*1 + .75*(-1) = -.25, output = 0
  (1, 0): .5*1 + .5*0 + .75*(-1) = -.25, output = 0
  (1, 1): .5*1 + .5*1 + .75*(-1) = .25, output = 1
• OR Perceptron:
– inputs are 0 or 1
– output is 1 when either x1 and/or x2 is 1
– weights: .5 on x1, .5 on x2, and .25 on a constant bias input of -1
– 2-D input space, 4 possible data points:
  (0, 0): .5*0 + .5*0 + .25*(-1) = -.25, output = 0
  (0, 1): .5*0 + .5*1 + .25*(-1) = .25, output = 1
  (1, 0): .5*1 + .5*0 + .25*(-1) = .25, output = 1
  (1, 1): .5*1 + .5*1 + .25*(-1) = .75, output = 1
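A quick Python check (my own sketch, not part of the original slides) that reproduces the AND and OR perceptron outputs above:

def step_perceptron(x1, x2, bias_weight):
    # weights .5 and .5 on the inputs, plus a constant bias input of -1 with the given weight
    a = 0.5 * x1 + 0.5 * x2 + bias_weight * (-1)
    return 1 if a > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "AND:", step_perceptron(x1, x2, bias_weight=0.75),
          "OR:", step_perceptron(x1, x2, bias_weight=0.25))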
How might perceptrons learn?
• Programmer specifies:
– numbers of units in each layer
– connectivity between units
So the only unknown is the weights

• Perceptrons learn by changing their weights

– supervised learning is used
– the correct output is given for each training example
• an example is a list of values for the input units
• the correct output is a list of desired values for the output units

1. Initialize the weights in the network (usually with random values)
2. Repeat until all examples are correctly classified, or some other stopping criterion is met:
   for each example e in the training set do
     a. y^ = neural_net_output(network, e)
     b. y = desired output for e (i.e., the target)
     c. update_weights(e, y, y^)

• Unlike other learning techniques, Perceptrons need to “see” all of the


training examples multiple times.

• Each pass through all of the training examples is called an epoch.


How should the weights be updated?
• Determining how to update the weights
is a case of the credit assignment problem.

• Perceptron Learning Rule:

– wi = wi + Δwi
– where Δwi = α * xi * (y - y^)
• where xi is the value associated with the ith input unit
• α is a constant between 0.0 and 1.0, called the learning rate
• The Perceptron Convergence Theorem says that if a set of examples is learnable, then the PLR will find the necessary weights
– in a finite number of steps
– independent of the initial weights

• This theorem says that if a solution exists,


PLR's gradient descent is guaranteed to find an
optimal solution (i.e., 100% correct classification) for
any 1-layer neural network
What are the limitations of perceptron learning?
• A single perceptron's output is determined
by the separating hyperplane defined by
(w1 * x1) + (w2 * x2) + ... + (wn * xn) = t

• So, Perceptrons can only learn functions


that are linearly separable (in input space).
• XOR Perceptron:
– inputs are 0 or 1
– output is 1 when x1 is 1 and x2 is 0, or when x1 is 0 and x2 is 1
– weights: .5 on x1, .5 on x2, and ??? on a constant bias input of -1

• 2-D input space with 4 possible data points

• How do you separate the positives from the negatives using a straight line?
In general, the goal of learning in a perceptron is to adjust the separating hyperplane, which divides an n-dimensional input space (where n is the number of input units), by modifying the weights (and biases) until all of the examples with target value 1 are on one side of the hyperplane, and all of the examples with target value 0 are on the other side of the hyperplane.
 Perceptrons as a computing model are too weak, because they can only learn linearly-separable functions.

 To enhance the computational ability, general neural networks have multiple layers of units.

 The challenge is to find a learning rule that works for multi-layered networks.
CS 404/504, Fall 2021

A Perceptron learning algorithm:

46
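The algorithm itself appeared as a figure on the original slide and is not reproduced here; the following is a hedged NumPy sketch of the perceptron learning rule described above (wi ← wi + α · xi · (y − ŷ)), trained on the AND function. The dataset, learning rate, and stopping criterion are my own illustrative choices.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)            # AND targets

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=2)                     # initialize the weights randomly
b = rng.uniform(-1, 1)                             # bias, treated as an extra weight
alpha = 0.1                                        # learning rate

for epoch in range(100):                           # each pass over the training set is an epoch
    errors = 0
    for xi, target in zip(X, y):
        y_hat = 1.0 if np.dot(w, xi) + b > 0 else 0.0   # network output for this example
        w += alpha * (target - y_hat) * xi              # PLR weight update
        b += alpha * (target - y_hat)                   # bias update
        errors += int(target != y_hat)
    if errors == 0:                                # stop when all examples are classified correctly
        break

print(w, b, [1.0 if np.dot(w, xi) + b > 0 else 0.0 for xi in X])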
CS 404/504, Fall 2021

Training a feedforward network requires making many of the same design


decisions as are necessary for a linear model:
• choosing the optimizer,
• the cost function,
• and the form of the output units.

50
CS 404/504, Fall 2021

 A feed-forward multi-layered network computes a function of


the inputs and the weights.

• Input units (on left or bottom):


– activation is determined by the environment
• Output units (on right or top):
– activation is the result
• Hidden units (between input and output units):
– cannot observe directly

• Perceptrons have input units followed


by one layer of output units, i.e. no hidden units

51
CS 404/504, Fall 2021
 NNs with one hidden layer with a sufficient number of units can compute functions associated with convex classification regions in input space.

 NNs with two hidden layers are universal computing devices, although the complexity of the function is limited by the number of units.
– If too few, the network will be unable to represent the function.
– If too many, the network will memorize examples and is subject to overfitting.

• Basic Idea – go over all existing data patterns, whose


labeling is known, and check their classification with a
current weight vector
• If correct, continue
• If not, add to the weights a quantity that is proportional to
the product of the input pattern with the desired output.
52
CS 404/504, Fall 2021

 Multilayer perceptron/feedforward neural network


• Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together.
• It is a neural network where the mapping between inputs and output is non-
linear.

• A NN with one hidden layer and one output layer:

  hidden layer: h = σ(W1 x + b1)

  output layer: y = σ(W2 h + b2)

  where W1, W2 are weight matrices, b1, b2 are bias vectors, and σ is the activation function

• Example with 3 inputs x, 4 hidden neurons h, and 2 outputs y:
  4 + 2 = 6 neurons (not counting inputs)
  [3 × 4] + [4 × 2] = 20 weights
  4 + 2 = 6 biases
  26 learnable parameters
Slide credit: Ismini Lourentzou – Introduction to Deep Learning 53
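A minimal NumPy sketch (my own illustration, with made-up weights) of the 3-4-2 network above, confirming the count of 26 learnable parameters:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 4 neurons, 3 inputs
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer: 2 neurons, 4 hidden inputs

x = np.array([0.5, -1.0, 2.0])                  # an arbitrary 3-dimensional input
h = sigmoid(W1 @ x + b1)                        # hidden layer: h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)                        # output layer: y = sigma(W2 h + b2)

n_params = W1.size + b1.size + W2.size + b2.size
print(y, n_params)                              # 26 learnable parameters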
CS 404/504, Fall 2021

Elements of Neural Networks


Introduction to Neural Networks

• Deep NNs have many hidden layers


 Fully-connected (dense) layers (a.k.a. Multi-Layer Perceptron or MLP)
 Each neuron is connected to all neurons in the succeeding layer

[Figure: a fully-connected deep NN with an input layer, hidden layers 1 through L, and an output layer producing y1 … yM.]
Slide credit: Hung-yi Lee – Deep Learning Tutorial 54
CS 404/504, Fall 2021

Elements of Neural Networks


Introduction to Neural Networks

• A simple network, toy example (two inputs, two neurons, sigmoid activation)

  inputs: 1 and -1
  first neuron (weights 1 and -2, bias 1): (1 ∙ 1) + (-1) ∙ (-2) + 1 = 4, σ(4) ≈ 0.98
  second neuron (weights -1 and 1, bias 0): (1 ∙ (-1)) + (-1) ∙ 1 + 0 = -2, σ(-2) ≈ 0.12
Slide credit: Hung-yi Lee – Deep Learning Tutorial 55
CS 404/504, Fall 2021

Elements of Neural Networks


Introduction to Neural Networks

• A simple network, toy example (cont'd)

 For the input vector [1, -1]ᵀ, the layer outputs are [0.98, 0.12]ᵀ, then [0.86, 0.11]ᵀ, and finally

  f([1, -1]ᵀ) = [0.62, 0.83]ᵀ
Slide credit: Hung-yi Lee – Deep Learning Tutorial 56
CS 404/504, Fall 2021

Matrix Operation
Introduction to Neural Networks

• Matrix operations are helpful when working with multidimensional inputs and outputs

 The first layer of the toy example in matrix form, a = σ(W x + b):

  σ( [[1, -2], [-1, 1]] [1, -1]ᵀ + [1, 0]ᵀ ) = σ([4, -2]ᵀ) = [0.98, 0.12]ᵀ
Slide credit: Hung-yi Lee – Deep Learning Tutorial 57
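A quick NumPy check (my own sketch) of this first-layer computation a = σ(Wx + b):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])       # the two neurons' weights as rows
x = np.array([1.0, -1.0])         # input vector
b = np.array([1.0, 0.0])          # biases

a = sigmoid(W @ x + b)
print(np.round(a, 2))             # [0.98 0.12], matching the toy example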


CS 404/504, Fall 2021

Matrix Operation
Introduction to Neural Networks

• Multilayer NN, matrix calculations for the first layer

 Input vector x, weight matrix W1, bias vector b1, output vector a1:

  a1 = σ(W1 x + b1)

Slide credit: Hung-yi Lee – Deep Learning Tutorial 58


CS 404/504, Fall 2021

Matrix Operation
Introduction to Neural Networks

• Multilayer NN, matrix calculations for all layers:

  a1 = σ(W1 x + b1)
  a2 = σ(W2 a1 + b2)
  ⋯
  y = σ(WL aL−1 + bL)

Slide credit: Hung-yi Lee – Deep Learning Tutorial 59


CS 404/504, Fall 2021

Matrix Operation
Introduction to Neural Networks

• Multilayer NN, the function f maps inputs x to outputs y, i.e., y = f(x):

  y = f(x) = σ(WL ⋯ σ(W2 σ(W1 x + b1) + b2) ⋯ + bL)

Slide credit: Hung-yi Lee – Deep Learning Tutorial 60


CS 404/504, Fall 2021

Example: Learning XOR


• The XOR function (“exclusive or”) is an operation on two binary values, x1
and x2.

• The XOR function provides the target function y = f∗(x) that we want to
learn. Our model provides a function y = f(x; θ) and our learning algorithm
will adapt the parameters θ to make f as similar as possible to f *.

61
CS 404/504, Fall 2021

After solving the normal equations, we obtain w = 0 and b =1/2

The linear model simply outputs 0.5 everywhere and is not able to
represent the XOR function.

62
CS 404/504, Fall 2021

So here, we will introduce a very simple feedforward network with one hidden layer containing two hidden units.

This feedforward network has a vector of hidden units h that is computed by a function f⁽¹⁾(x; W, c).

The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
The output layer is still just a linear regression model, but now it is applied to h rather than to x.
The network now contains two functions chained together:
h = f⁽¹⁾(x; W, c) and y = f⁽²⁾(h; w, b),
with the complete model being f(x; W, c, w, b) = f⁽²⁾(f⁽¹⁾(x)).

63
CS 404/504, Fall 2021

• Now, we define h = g(Wᵀx + c), where W provides the weights of a linear transformation and c the biases. The activation function g is typically chosen to be a function that is applied element-wise.

• In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU, defined by the activation function g(z) = max{0, z}.

64
CS 404/504, Fall 2021

• Use the ReLU activation function.

• Perform a forward pass.

67
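The slide's numeric solution is shown only as a figure; below is a NumPy forward pass using the solution commonly given for this example in the Deep Learning textbook (Goodfellow et al.): W = [[1, 1], [1, 1]], c = [0, -1], w = [1, -2], b = 0. I am assuming these are the values the slide intended.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs

W = np.array([[1, 1], [1, 1]])                   # first-layer weights
c = np.array([0, -1])                            # first-layer biases
w = np.array([1, -2])                            # output-layer weights
b = 0                                            # output-layer bias

h = np.maximum(0, X @ W + c)                     # hidden units with ReLU: h = g(XW + c)
y = h @ w + b                                    # linear output layer
print(y)                                         # [0 1 1 0] -- zero error on every example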
CS 404/504, Fall 2021

• The neural network has obtained the correct answer for every example
in the batch.

• In this example, we simply specified the solution, then showed that it


obtained zero error.

• In a real situation, there might be billions of model parameters and


billions of training examples, so one cannot simply guess the solution as
we did here.

• Instead, a gradient-based optimization algorithm can find parameters


that produce very little error.

• The solution we described to the XOR problem is at a global minimum


of the loss function, so gradient descent could converge to this point.
• There are other equivalent solutions to the XOR problem that gradient
descent could also find.

• The convergence point of gradient descent depends on the initial values


of the parameters.
69
CS 404/504, Fall 2021

Implementation of X-OR Gate:

70
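The original slide's implementation is not included in this transcript; as a stand-in, here is a hedged sketch of training a small network on XOR, assuming Keras/TensorFlow (my choice of library, not necessarily the slide's):

import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(2,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05), loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)           # may need more epochs or a re-run, depending on the random init
print(model.predict(X).round().ravel())          # ideally [0. 1. 1. 0.]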
CS 404/504, Fall 2021

Activation Functions
• Activation functions are an extremely important feature of artificial neural networks. They decide whether a neuron should be activated or not, and they limit the output signal to a finite value.

• An activation function applies a non-linear transformation to the input, making the network capable of learning more complex relations between input and output, i.e., more complex patterns.

• Without an activation function, the neural network is just a linear regression model, as it only computes a sum of products of inputs and weights.
• E.g., image 2 below requires a complex (curved) relation, unlike the simple linear relation in image 1.
72
CS 404/504, Fall 2021

Activation Functions
Introduction to Neural Networks

• Non-linear activations are needed to learn complex (non-linear) data representations
 Otherwise, NNs would be just a linear function (such as W2 W1 x, which is itself linear)
 NNs with a large number of layers (and neurons) can approximate more complex functions
o Figure: more neurons improve the representation (but may overfit)

Picture from: http://cs231n.github.io/assets/nn1/layer_sizes.jpeg 73


CS 404/504, Fall 2021

The Activation Functions can be basically divided into 3 types-


1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions

Binary Step Function:

A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends the signal to the next layer; otherwise it is deactivated.

We decide on some threshold value to determine whether the neuron should be activated or deactivated. It is very simple and useful for binary classification problems.
E.g., f(x) = 1 if x > 0, else 0 if x <= 0

74
CS 404/504, Fall 2021

Linear or Identity Activation Function

As you can see, the function is a line, i.e., linear. Therefore, the output of the function is not confined to any range.

Equation: f(x) = x
Range: (-infinity to infinity)
It doesn't help with the complexity or the various parameters of the usual data that is fed to neural networks.

75
CS 404/504, Fall 2021

Non-linear Activation Function

The non-linear activation functions are the most widely used activation functions. Non-linearity lets the network produce curved (non-linear) mappings.

The main terminologies needed to understand non-linear functions are:

Derivative or differential: the change along the y-axis with respect to a change along the x-axis, also known as the slope.
Monotonic function: a function which is either entirely non-increasing or non-decreasing.
The non-linear activation functions are mainly divided on the basis of their range or curves.
Advantages of non-linear functions over the linear function:
Differentiation is possible for the non-linear functions.
Stacking of layers is possible, which helps us create deep neural nets.
It makes it easier for the model to generalize.
76
CS 404/504, Fall 2021

Activation: Sigmoid
Introduction to Neural Networks

• Sigmoid function σ: takes a real-valued number and “squashes” it into the range between 0 and 1
 The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; fully firing = 1
 When the neuron's activations are near 0 or 1, sigmoid neurons saturate
o Gradients in these regions are almost zero (almost no signal will flow)
 Sigmoid activations are less common in modern NNs

  σ: ℝ → [0, 1] (applied element-wise),  σ(x) = 1 / (1 + e⁻ˣ)
Slide credit: Ismini Lourentzou – Introduction to Deep Learning 77
CS 404/504, Fall 2021
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the
output of each neuron.

Disadvantages:
 Vanishing gradient—for very high or very low values of X,
there is almost no change to the prediction, causing a
vanishing gradient problem. This can result in the network
refusing to learn further, or being too slow to reach an
accurate prediction.
 Outputs not zero centered.
 Computationally expensive

78
CS 404/504, Fall 2021

Activation: Tanh
Introduction to Neural Networks

• Tanh function: takes a real-valued number and “squashes” it into the range between -1 and 1
 Like sigmoid, tanh neurons saturate
 Unlike sigmoid, the output is zero-centered
o It is therefore preferred over sigmoid
 Tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1

  tanh: ℝ → [-1, 1] (applied element-wise),  tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
Slide credit: Ismini Lourentzou – Introduction to Deep Learning 79
CS 404/504, Fall 2021

Like the sigmoid, tanh gives a smooth curve, so gradient descent converges towards the minima at a speed that depends on the learning rate.

Advantages
 Zero centered—making it easier to model inputs that have strongly
negative, neutral, and strongly positive values.
Disadvantages
 Like the Sigmoid function is also suffers from vanishing gradient
problem
 hard to train on small datasets

80
CS 404/504, Fall 2021

Activation: ReLU
Introduction to Neural Networks

• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero

  ReLU: ℝ → ℝ₊,  f(x) = max(0, x)

 Most modern deep NNs use ReLU activations
 ReLU is fast to compute
o Compared to sigmoid and tanh
o Simply threshold a matrix at zero
 Accelerates the convergence of gradient descent
o Due to its linear, non-saturating form
 Helps prevent the vanishing gradient problem

81
CS 404/504, Fall 2021

Advantages
 Avoids vanishing gradient problem.
 Computationally efficient—allows the network to converge very quickly
 Non-linear—although it looks like a linear function, ReLU has a derivative
function and allows for backpropagation

Disadvantages
 Can only be used within the hidden layers
 Hard to train on small datasets; needs a lot of data to learn non-linear behavior
 The dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn
 The function and its derivative are both monotonic
 All negative values are converted to zero, and this conversion is so aggressive that the unit can neither map nor fit the data properly, which creates a problem

82
CS 404/504, Fall 2021

Activation: Leaky ReLU


Introduction to Neural Networks

• The problem of ReLU activations: they can “die”


 ReLU could cause weights to update in a way that the gradients can become zero and
the neuron will not activate again on any data
 E.g., when a large learning rate is used

• The Leaky ReLU activation function is a variant of ReLU

 Instead of the function being 0 when x < 0, a leaky ReLU has a small negative slope (e.g., α = 0.01, or similar):

  f(x) = x for x ≥ 0,  f(x) = αx for x < 0

 This resolves the dying ReLU problem
 Most current works still use ReLU
o With a proper setting of the learning rate, the problem of dying ReLU can be avoided

83
CS 404/504, Fall 2021

This activation function also has drawbacks: during training, if the learning rate is set very high, the updates will overshoot, killing the neuron. This happens when the learning rate is not set at an optimum level, as in the graph below:

A high learning rate leads to overshoot during gradient descent.

A low or optimal learning rate leads to a gradual descent towards the minima.

84
CS 404/504, Fall 2021

Advantages
 Prevents dying ReLU problem—this variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation, even
for negative input values
 Otherwise like ReLU

Disadvantages
 Results not consistent—leaky ReLU does not provide consistent
predictions for negative input values.
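A compact NumPy sketch (my own, not from the slides) of the activation functions discussed above:

import numpy as np

def sigmoid(x):                  # squashes to (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # zero-centered, squashes to (-1, 1)
    return np.tanh(x)

def relu(x):                     # thresholds at zero
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small negative slope alpha avoids the dying ReLU problem
    return np.where(x >= 0, x, alpha * x)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(z), 3))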

85
CS 404/504, Fall 2021

Activation: Softmax
 Sigmoid is not able to handle more than two cases (class labels).
 Softmax can handle multiple cases: the softmax function squeezes the output for each class between 0 and 1, with the outputs summing to 1.
 It is ideally used in the final output layer of a classifier, where we are actually trying to attain class probabilities.
 Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes, instead of only binary solutions.

  σ(z)ᵢ = e^(zᵢ) / Σⱼ₌₁..K e^(zⱼ), where
  σ = softmax
  z = input (logit) vector
  e^(zᵢ) = standard exponential function applied to element i of the input vector
  K = number of classes in the multi-class classifier
86
CS 404/504, Fall 2021

Softmax Layer
Introduction to Neural Networks

• In multi-class classification tasks, the output layer is typically a softmax layer


 I.e., it employs a softmax activation function
 If a layer with a sigmoid activation function is used as the output layer instead, the
predictions by the NN may not be easy to interpret
o Note that an output layer with sigmoid activations can still be used for binary classification

A layer with sigmoid activations: inputs 3, 1, -3 → outputs σ(3) ≈ 0.95, σ(1) ≈ 0.73, σ(-3) ≈ 0.05 (note that the outputs do not sum to 1)

Slide credit: Hung-yi Lee – Deep Learning Tutorial 87


CS 404/504, Fall 2021

Softmax Layer
Introduction to Neural Networks

• The softmax layer applies softmax activations to output a probability value in the range [0, 1]
 The values z inputted to the softmax layer are referred to as logits

A softmax layer: logits 3, 1, -3 → exponentials e³ ≈ 20, e¹ ≈ 2.7, e⁻³ ≈ 0.05 → normalized probabilities 0.88, 0.12, ≈ 0 (which sum to 1)
Slide credit: Hung-yi Lee – Deep Learning Tutorial 88
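A quick NumPy check (my own sketch) reproducing the numbers in the softmax-layer example above:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, -3.0])
print(np.round(np.exp(logits), 2))   # [20.09  2.72  0.05]
print(np.round(softmax(logits), 2))  # [0.88 0.12 0.  ] -- the probabilities sum to 1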


CS 404/504, Fall 2021

Understanding Loss Function in Deep Learning


• What is the Loss function?

• In simple terms, the loss function is a method of evaluating how well your algorithm models your dataset. It is a mathematical function of the parameters of the machine learning algorithm.

• In simple linear regression, the prediction is calculated using the slope (m) and intercept (b). The loss for a single point is (Yᵢ − Ŷᵢ)², i.e., the loss function is a function of the slope and intercept.

• If the value of the loss function is low, it's a good model; otherwise, we have to change the parameters of the model to minimize the loss.

• Loss function vs. cost function

• Most people confuse the loss function and the cost function. The two terms are often used interchangeably, but strictly speaking they are different, as explained below.

89
CS 404/504, Fall 2021

Loss Function:
A loss function/error function is for a single training example/input.

Cost Function:
A cost function, on the other hand, is the average loss over the entire training
dataset.
Loss function in Deep Learning
1. Regression
 MSE(Mean Squared Error)
 MAE(Mean Absolute Error)
 Hubber loss

2. Classification
 Binary cross-entropy
 Categorical cross-entropy

3. AutoEncoder
 KL Divergence

90
CS 404/504, Fall 2021

4. GAN(Generative adversarial networks )


 Discriminator loss
 Minmax GAN loss

5. Object detection
 Focal loss

6. Word embeddings
 Triplet loss

Here we will focus on regression losses and classification losses.

91
CS 404/504, Fall 2021

Regression Loss:

1. Mean Squared Error/Squared loss/ L2 loss:


 
The Mean Squared Error (MSE) is the simplest and most common loss function. To calculate the MSE, you take the difference between the actual value and the model prediction, square it, and average it across the whole dataset.

Advantages
• 1. Easy to interpret.
• 2. Always differentiable, because of the square.
• 3. Only one local minimum.
Disadvantages
• 1. The error unit is the square of the target's unit, which is harder to interpret.
• 2. Not robust to outliers.
Note – In regression, use a linear activation function at the last neuron.
92
CS 404/504, Fall 2021

2. Mean Absolute Error/ L1 loss:

The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset.

Advantages
• 1. Intuitive and easy.
• 2. The error unit is the same as the output column's.
• 3. Robust to outliers.

Disadvantages
• The graph is not differentiable at zero, so we cannot use gradient descent directly; instead we can use subgradient methods.
Note – In regression, use a linear activation function at the last neuron.

93
CS 404/504, Fall 2021

3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss.

• n – the number of data points.
• y – the actual value of the data point, also known as the true value.
• ŷ – the predicted value of the data point; this value is returned by the model.
• δ – defines the point where the Huber loss function transitions from quadratic to linear.
Advantages
• 1. Robust to outliers.
• 2. It lies between MAE and MSE.
Disadvantages
• 1. Its main disadvantage is the associated complexity: in order to maximize model accuracy, the hyperparameter δ also needs to be optimized, which increases the training requirements.
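A NumPy sketch (my own) of the three regression losses above, using made-up values:

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = y - y_hat
    quad = 0.5 * err ** 2                       # quadratic region, |err| <= delta
    lin = delta * (np.abs(err) - 0.5 * delta)   # linear region, |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.5])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))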
94
CS 404/504, Fall 2021

Classification Loss:

1. Binary Cross-Entropy / log loss:

It is used in binary classification problems with two classes, e.g., whether a person has COVID or not, or whether a student passes an exam or not.
Binary cross-entropy compares each predicted probability to the actual class output, which can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance from the expected value, that is, how close or far they are from the actual value:

  BCE = -(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ) ]

• yᵢ – actual values
• ŷᵢ – neural network prediction
Advantages
• 1. The cost function is differentiable.
Disadvantages
• Multiple local minima.
Note – In binary classification, use a sigmoid activation function at the last neuron.

95
CS 404/504, Fall 2021

2. Categorical Cross-Entropy:

• Categorical cross-entropy is used for multi-class classification.

• Categorical cross-entropy is also used in softmax regression.

  CCE = -Σₖ yₖ log(ŷₖ)

where
• k indexes the classes,
• y = actual value (one-hot),
• ŷ = neural network prediction.
Note – In multi-class classification, use the softmax activation function at the last neuron.
96
CS 404/504, Fall 2021

If the problem statement has 3 classes:

softmax activation – f(z₁) = e^(z₁) / (e^(z₁) + e^(z₂) + e^(z₃))

When to use categorical cross-entropy vs. sparse categorical cross-entropy?

If the target column is one-hot encoded, like 0 0 1, 0 1 0, 1 0 0, then use categorical cross-entropy. If the target column has numerical encoding of the classes, like 1, 2, 3, 4, …, n, then use sparse categorical cross-entropy.

Which is faster?

Sparse categorical cross-entropy is faster than categorical cross-entropy.
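A NumPy sketch (my own) of binary and categorical cross-entropy, with made-up predictions:

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                     # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y_bin = np.array([1, 0, 1, 1])
p_bin = np.array([0.9, 0.2, 0.7, 0.6])               # sigmoid outputs
print(binary_cross_entropy(y_bin, p_bin))

y_cat = np.array([[0, 0, 1], [0, 1, 0]])             # one-hot targets, 3 classes
p_cat = np.array([[0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]) # softmax outputs
print(categorical_cross_entropy(y_cat, p_cat))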

97
CS 404/504, Fall 2021

Training NNs
Training Neural Networks

• The network parameters include the weight matrices and bias vectors from all
layers
  θ = {W1, b1, W2, b2, ⋯, WL, bL}
 Often, the model parameters are referred to as weights
• Training a model to learn a set of parameters that are optimal (according to a
criterion) is one of the greatest challenges in ML

[Figure: a NN for handwritten digit classification; the input is a 16 x 16 = 256-pixel image, and a softmax output layer produces probabilities y1 … y10 for the ten digits (e.g., 0.1 for "1", 0.7 for "2", 0.2 for "0").]
Slide credit: Hung-yi Lee – Deep Learning Tutorial 98
CS 404/504, Fall 2021

Gradient Descent Algorithm


• Gradient Descent is an optimization algorithm for finding a local minimum of a
differentiable function. Gradient descent in machine learning is simply used to find the
values of a function's parameters (coefficients) that minimize a Loss (Total Cost) function
as far as possible.
• Optimizing the loss function
 Almost all DL models these days are trained with a variant of the gradient descent
(GD) algorithm
 GD applies iterative refinement of the network parameters
 GD uses the opposite direction of the gradient of the loss with respect to the NN parameters (i.e., -∇ℒ(θ)) for updating the parameters
o The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed

[Figure: the loss ℒ(θ) plotted against a parameter θᵢ, with the gradient ∂ℒ/∂θᵢ at a point.]
99
CS 404/504, Fall 2021

Gradient Descent Algorithm


• Steps in the gradient descent algorithm:
1. Randomly initialize the model parameters θ
2. Compute the gradient of the loss function at the current parameters: ∇ℒ(θ)
3. Update the parameters as: θ ← θ - α∇ℒ(θ)
o Where α is the learning rate
4. Go to step 2 and repeat (until a terminating criterion is reached)

[Figure: gradient descent on a loss curve; starting from the initial parameters, each parameter update θ ← θ - α∇ℒ(θ) moves opposite the gradient, down towards the global loss minimum.]
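A minimal Python sketch (my own toy example, not from the slides) of these steps, applied to the simple loss ℒ(θ) = (θ − 3)², whose gradient is 2(θ − 3):

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal()                 # 1. randomly initialize the parameter
alpha = 0.1                          # learning rate

for step in range(1000):
    grad = 2.0 * (theta - 3.0)       # 2. gradient of the loss at the current parameter
    theta = theta - alpha * grad     # 3. update opposite to the gradient
    if abs(grad) < 1e-6:             # 4. repeat until a terminating criterion is reached
        break

print(theta)                         # converges to the minimum at theta = 3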
100
CS 404/504, Fall 2021

Gradient Descent Algorithm


• Example: a NN with only 2 parameters, w1 and w2, i.e., θ = {w1, w2}
 The different colors represent the values of the loss (the minimum loss is ≈ 1.3)

1. Randomly pick a starting point θ⁰
2. Compute the gradient at θ⁰:  ∇ℒ(θ⁰) = [∂ℒ(θ⁰)/∂w1, ∂ℒ(θ⁰)/∂w2]ᵀ
3. Multiply by the learning rate α and update:  θ¹ = θ⁰ - α∇ℒ(θ⁰)
4. Go to step 2, repeat
Slide credit: Hung-yi Lee – Deep Learning Tutorial 101
CS 404/504, Fall 2021

Gradient Descent Algorithm


Training Neural Networks

• Example (cont'd)

Eventually, we would reach a minimum θ*:

2. Compute the gradient at θ¹:  ∇ℒ(θ¹)
3. Multiply by the learning rate α and update:  θ² = θ¹ - α∇ℒ(θ¹)
4. Go to step 2, repeat

Slide credit: Hung-yi Lee – Deep Learning Tutorial 102


CS 404/504, Fall 2021

Gradient Descent Algorithm


Training Neural Networks

• Gradient descent algorithm stops when a local minimum of the loss surface is
reached
 GD does not guarantee reaching a global minimum
 However, empirical evidence suggests that GD works well for NNs

Picture from: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/ 103


CS 404/504, Fall 2021

Gradient Descent Algorithm


Training Neural Networks

• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
 Gradient descent may reach a different minimum at every run
 Therefore, the NN will produce different predicted outputs
• In addition, we currently don't have algorithms that guarantee reaching a global minimum for an arbitrary loss function

[Figure: a non-convex loss surface ℒ over parameters w1 and w2, with multiple local minima.]

Slide credit: Hung-yi Lee – Deep Learning Tutorial 104


CS 404/504, Fall 2021

Importance of the Learning Rate


• How big the steps are that gradient descent takes in the direction of the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.

• The learning rate should be neither too low nor too high.

• This is important because, if the steps it takes are too big, gradient descent may not reach the local minimum, since it bounces back and forth across the convex loss function.

• If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a while.

105
CS 404/504, Fall 2021

• A good way to make sure the gradient descent algorithm runs


properly is by plotting the cost function as the optimization runs.

• Put the number of iterations on the x-axis and the value of the cost
function on the y-axis. This helps you see the value of your cost
function after each iteration of gradient descent, and provides a
way to easily spot how appropriate your learning rate is.

• If the gradient descent algorithm is working properly, the cost


function should decrease after every iteration.

106
CS 404/504, Fall 2021

Learning Rate

• Learning rate
 The gradient tells us the direction in which the loss has the steepest rate of increase,
but it does not tell us how far along the opposite direction we should step
 Choosing the learning rate (also called the step size) is one of the most important
hyper-parameter settings for NN training

[Figure: loss curves when the learning rate is too small (very slow descent) vs. too large (overshooting or divergence).]

107
CS 404/504, Fall 2021

Learning Rate

• Training loss for different learning rates


 High learning rate: the loss increases or plateaus too quickly
 Low learning rate: the loss decreases too slowly (takes many epochs to reach a
solution)

Picture from: https://cs231n.github.io/neural-networks-3/

108
CS 404/504, Fall 2021

Backpropagation

• Modern NNs employ the backpropagation method for calculating the gradients
of the loss function
 Backpropagation is short for “backward propagation”
• For training NNs, forward propagation (forward pass) refers to passing the
inputs through the hidden layers to obtain the model outputs (predictions)
 The loss function is then calculated
 Backpropagation traverses the network in reverse order, from the outputs backward
toward the inputs to calculate the gradients of the loss
 The chain rule is used for calculating the partial derivatives of the loss function with
respect to the parameters in the different layers in the network
• Each update of the model parameters during training takes one forward and
one backward pass (e.g., of a batch of inputs)
• Automatic calculation of the gradients (automatic differentiation) is available in
all current deep learning libraries
 It significantly simplifies the implementation of deep learning algorithms, since it
obviates deriving the partial derivatives of the loss function by hand

109
CS 404/504, Fall 2021

Chain Rule of Calculus


• Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

  dz/dx = (dz/dy) · (dy/dx)
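A small numeric check of the chain rule (my own illustrative example): with y = g(x) = 3x and z = f(y) = y², the chain rule gives dz/dx = (dz/dy)(dy/dx) = 2y · 3 = 18x.

def g(x):
    return 3.0 * x

def f(y):
    return y ** 2

x = 2.0
analytic = 18.0 * x                                      # dz/dx from the chain rule
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)    # central finite difference
print(analytic, round(numeric, 4))                       # both are approximately 36.0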

110
CS 404/504, Fall 2021

Steps in Back propagation Algorithm


• STEP ONE: initialize the weights and biases.

• The weights in the network are initialized to random


numbers from the interval [-1,1].

• Each unit has a BIAS associated with it

• The biases are similarly initialized to random numbers


from the interval [-1,1].

• STEP TWO: feed the training sample.

111
CS 404/504, Fall 2021

Steps in Back propagation Algorithm

• STEP THREE: Propagate the inputs forward; we compute


the net input and output of each unit in the hidden and
output layers.

• STEP FOUR: back propagate the error.

• STEP FIVE: update weights and biases to reflect the


propagated errors.

• STEP SIX: terminating conditions.
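A compact NumPy sketch (my own example, not the slides' BPNN example) of steps three to five: propagate forward, backpropagate the error with the chain rule, and update the weights and biases. It trains a 2-4-1 sigmoid network on XOR with a squared-error loss; the architecture and hyperparameters are my own choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.uniform(-1, 1, (2, 4)), np.zeros(4)     # STEP ONE: initialize in [-1, 1]
W2, b2 = rng.uniform(-1, 1, (4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(5000):                            # STEP TWO: feed the training samples
    a1 = sigmoid(X @ W1 + b1)                        # STEP THREE: forward pass
    a2 = sigmoid(a1 @ W2 + b2)
    d2 = (a2 - y) * a2 * (1 - a2)                    # STEP FOUR: backpropagate the error
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    W2 -= lr * a1.T @ d2;  b2 -= lr * d2.sum(axis=0) # STEP FIVE: update weights and biases
    W1 -= lr * X.T @ d1;   b1 -= lr * d1.sum(axis=0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)       # STEP SIX here is simply a fixed epoch count
print(np.round(pred.ravel(), 2))                     # should end up close to [0, 1, 1, 0], depending on the init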

112
CS 404/504, Fall 2021

Example of BPNN

113
CS 404/504, Fall 2021

Image Fundamentals
Pixels: The Building Blocks of Images
A pixel is the "color" or the "intensity" of the light that appears at a given location in an image.

Most pixels are represented in one of two ways:

• Grayscale:
In a grayscale image, each pixel is a scalar value between 0 and 255, where 0 corresponds to "black" and 255 to "white". Values between 0 and 255 are varying shades of gray, where values closer to 0 are darker and values closer to 255 are lighter.

• Color:
Each of the three colors is represented by an integer in the range 0 to 255, which indicates how
“much” of the color there is

• Given that the pixel value only needs to be in the range [0, 255], we normally use an 8-bit
unsigned integer to represent each color intensity.

• We then combine these values into an RGB tuple in the form (red, green, blue). This tuple
represents our color.

117
CS 404/504, Fall 2021

Here are some common colors represented as RGB tuples:
 Black: (0, 0, 0)
 White: (255, 255, 255)
 Red: (255, 0, 0)
 Green: (0, 255, 0)
 Blue: (0, 0, 255)
 Aqua: (0, 255, 255)
 Maroon: (128, 0, 0)
 Navy: (0, 0, 128)
 Olive: (128, 128, 0)
 Purple: (128, 0, 128)
 Teal: (0, 128, 128)
 Yellow: (255, 255, 0)

In the example image, our matrix has 1,000 columns (the width) and 750 rows (the height).

Overall, there are 1,000 * 750 = 750,000 total pixels in our image.

118
CS 404/504, Fall 2021

Understanding the co-ordinate System :

Imagine our grid as a piece of graph paper.


Using this graph paper, the point (0, 0) corresponds to the upper left corner
of the image.
As we move down and to the right, both the x and y values increase.

119
CS 404/504, Fall 2021

• The (x, y) coordinate system uses a (width x height) convention. However, OpenCV represents images as NumPy arrays, and NumPy stores the image in (height, width, channels) format.
• To access an individual pixel value from our image, we use simple NumPy array indexing:

(b, g, r) = image[0, 0]      # pixel located at (0, 0) – the top-left corner of the image

(b, g, r) = image[20, 100]   # accesses the pixel at x=100, y=20

(b, g, r) = image[75, 25]    # accesses the pixel at x=25, y=75

RGB and BGR Ordering :

• It’s important to note that OpenCV stores RGB channels in reverse order. While we
normally think in terms of Red, Green, and Blue, OpenCV actually stores the pixel values in
Blue, Green, Red order.

# manipulate the top-left pixel in the image


image[0, 0] = (0, 0, 255)

120
CS 404/504, Fall 2021

Load, Display, Save

OpenCV automatically converts between image formats for us (e.g., from PNG to JPG)! No further effort is needed on our part to convert between image formats.
 cv2.imread(<imagePath>) – load an image
 cv2.imshow(<windowName>, <numPy array>) – display an image
 cv2.imwrite(<imagePath>, <numPy array>) – save an image to disk

Images as NumPy Arrays :

Image processing libraries such as OpenCV and scikit-image represent RGB


images as multidimensional NumPy arrays with shape (height, width, depth).

Example code to load an image from disk:

import cv2
image = cv2.imread("example.png")
print(image.shape)            # Output: (248, 300, 3)
cv2.imshow("Image", image)
cv2.waitKey(0)                # keep the window open until a key is pressed

121
CS 404/504, Fall 2021
Scaling and Aspect Ratios :
• Scaling, or simply resizing, is the process of increasing or decreasing the
size of an image in terms of width and height.

• When resizing an image, it’s important to keep in mind the aspect ratio,


which is the ratio of the width to the height of the image. (aspect ratio =
width / height )

• Ignoring the aspect ratio can lead to images that look compressed and
distorted. To prevent this behavior, we simply scale the width and height
of an image by equal amounts when resizing an image.

Most neural networks and Convolutional Neural Networks applied to the


task of image classification assume a fixed size input, meaning that the
dimensions of all images you pass through the network must be the
same.

• Common choices for width and height image sizes inputted to


Convolutional Neural Networks include 32 x 32, 64 x 64, 224 x 224, 227 x
227, 256 x 256, and 299 x 299.
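A minimal OpenCV sketch (my own example) of resizing to a new width while preserving the aspect ratio; the file name and target width are arbitrary:

import cv2

image = cv2.imread("example.png")           # assumes example.png exists on disk
(h, w) = image.shape[:2]

new_width = 300                             # arbitrary target width
ratio = new_width / float(w)                # scale factor derived from the width
new_height = int(h * ratio)                 # scale the height by the same factor

resized = cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_AREA)
print(resized.shape)                        # (new_height, new_width, 3)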

122
