Introduction To Deep Learning
Lecture Outline
Machine Learning
Supervised Learning
Machine Learning Basics
Figure: labeled data is fed to a learning algorithm during training; the learned model is then used to make predictions on new data (e.g., assigning inputs to class A or class B). Typical supervised tasks: classification and regression.
Unsupervised Learning
Machine Learning Basics
Figure: the algorithm groups unlabeled data on its own; the typical unsupervised task is clustering.
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
Inputs: $\mathbf{x}$
Features: $\phi(\mathbf{x})$ — non-linear functions of the inputs
Outputs: $y = f\!\left(\mathbf{w}^{\top}\phi(\mathbf{x})\right)$
• Both the binary and multi-class classification problems can be linearly or non-linearly separated
Figure: linearly and non-linearly separated data for a binary classification problem
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
• Deep learning (DL) is a machine learning subfield that uses multiple layers for
learning data representations
DL is exceptionally effective at learning patterns
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data
representations)
Input image pixels → Edges → Textures → Parts → Objects
Why is DL Useful?
Introduction to Deep Learning
Deep learning can work with complex images, videos, and unstructured data in ways that machine learning algorithms are not equipped to handle.
However, one must keep in mind that deep learning algorithms can be computationally expensive, as they often require high-end GPUs to process large volumes of data over many hours.
So, for simpler problems that require far less computational effort, we can use machine learning. ML algorithms are also a preferred choice for making predictions based on comparatively smaller datasets.
There are two major types of deep learning: supervised and unsupervised.
• Autoencoders (an example of the unsupervised type)
Perceptron:
A Single Perceptron
• Perceptrons and other neural networks are inspired by real neurons in our brain.
The procedure of a perceptron processing data is as follows:
1. Receive the inputs x1, x2, …, xk.
2. Multiply each input by its corresponding weight.
3. Sum the weighted inputs and add the bias.
4. Apply the activation function to the result to produce the output.
5. Note that here we use a step function, but there are other more sophisticated activation functions like sigmoid, hyperbolic tangent (tanh), rectifier (ReLU), and more.
Figure: a single perceptron — inputs x1, x2, …, xk are multiplied by weights, summed together with a bias term, and passed through an activation function to produce the output ŷ = σ(z).
• Input layer
• Weights
• Summation & Bias
• Activation Function
• Output layer
Suppose x1 = 1, x2 = 0, and x3 = 0.
Let w1 = 6, w2 = 2, w3 = 2.
The larger the weight, the more influential the corresponding input is.
Let the threshold = 1 and the bias = -5.
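A minimal sketch of this computation in Python, assuming a step activation that fires when the biased weighted sum reaches the threshold (the function name is just for illustration):

```python
def perceptron_output(inputs, weights, bias, threshold=1):
    """Step-activation perceptron: outputs 1 when the biased
    weighted sum reaches the threshold, 0 otherwise."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= threshold else 0

# Values from the example above: x = (1, 0, 0), w = (6, 2, 2), bias = -5
print(perceptron_output([1, 0, 0], [6, 2, 2], bias=-5))  # z = 6 - 5 = 1 -> output 1
```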
• Varying the weights and the threshold/bias results in different possible decision-making models.
• So if, for instance, we change the bias from -5 to -3, then there are more possible scenarios in which the output is 1.
• In a real DL model, we are given input data, which we can't change. The bias term is initialized before you train your neural network model.
• Assume the bias is 7 and the following input data.
• Let's further assume that our weights are initialized at the following values:
• So, with the input data, bias, and the output label (desired output):
• The actual output from your neural network differs from the desired output.
• So what should the neural network do to help itself learn and improve, given this difference between the actual and desired output?
• We can't change the input data, and we have already initialized the bias.
• So the only thing we can do is tell the perceptron to adjust the weights! If we tell the perceptron to increase w1 to 7, without changing w2 and w3, then:
• Adjusting the weights is the key to the learning process of our perceptron.
• A single-layer perceptron with a step activation function can use this learning rule to tune the weights after processing each training example, as sketched below.
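A small sketch of how such weight tuning could look, using the classic perceptron learning rule w_i += lr * (target - prediction) * x_i; the learning rate and epoch count are illustrative assumptions, and the bias is kept fixed as in the example above:

```python
def train_perceptron(data, targets, weights, bias, lr=1.0, epochs=10):
    """Tune the weights after each training example using the
    perceptron rule: w_i += lr * (target - prediction) * x_i."""
    for _ in range(epochs):
        for x, target in zip(data, targets):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if z >= 0 else 0      # step activation
            error = target - prediction           # 0 if correct, +/-1 otherwise
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights
```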
• AND Perceptron:
– inputs are 0 or 1
– output is 1 only when both x1 and x2 are 1
– weights: w1 = 0.5, w2 = 0.5, plus a bias weight of 0.75 on a constant -1 input
– 2-D input space with 4 possible data points:
(0, 0): .5*0 + .5*0 + .75*(-1) = -.75 → output 0
(0, 1): .5*0 + .5*1 + .75*(-1) = -.25 → output 0
(1, 0): .5*1 + .5*0 + .75*(-1) = -.25 → output 0
(1, 1): .5*1 + .5*1 + .75*(-1) = .25 → output 1
• OR Perceptron:
– inputs are 0 or 1
– output is 1 when either x1 and/or x2 is 1
– weights: w1 = 0.5, w2 = 0.5, plus a bias weight of 0.25 on a constant -1 input
– 2-D input space with 4 possible data points:
(0, 0): .5*0 + .5*0 + .25*(-1) = -.25 → output 0
(0, 1): .5*0 + .5*1 + .25*(-1) = .25 → output 1
(1, 0): .5*1 + .5*0 + .25*(-1) = .25 → output 1
(1, 1): .5*1 + .5*1 + .25*(-1) = .75 → output 1
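A quick sketch that checks both truth tables with the weights above, assuming a step function that fires when the weighted sum (including the bias weight on a constant -1 input) is non-negative:

```python
def gate(x1, x2, w1, w2, bias_weight):
    """Perceptron with a bias weight attached to a constant -1 input."""
    z = w1 * x1 + w2 * x2 + bias_weight * (-1)
    return 1 if z >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "AND:", gate(x1, x2, 0.5, 0.5, 0.75),
              "OR:",  gate(x1, x2, 0.5, 0.5, 0.25))
```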
How might perceptrons learn?
• Programmer specifies:
– numbers of units in each layer
– connectivity between units
So the only unknown is the weights
NNs with one hidden layer containing a sufficient number of units can compute functions associated with convex classification regions in input space.
Hidden layer: $h = \sigma(W_1 x + b_1)$, where $\sigma$ is the activation function.
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Figure: a fully-connected feedforward network mapping inputs through hidden layers to outputs $y_1, \ldots, y_M$. For the first hidden neuron, with inputs $(1, -1)$, weights $(1, -2)$, and bias $1$, the weighted sum is $(1 \cdot 1) + (-1)(-2) + 1 = 4$.
Slide credit: Hung-yi Lee – Deep Learning Tutorial
$f\!\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right) = \begin{bmatrix} 0.62 \\ 0.83 \end{bmatrix}$
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Matrix Operation
Introduction to Neural Networks
• Matrix operations are helpful when working with multidimensional inputs and
outputs
The layer output is $a = \sigma(Wx + b)$:
$\sigma\!\left(\begin{bmatrix} 1 & -2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \sigma\!\left(\begin{bmatrix} 4 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} 0.98 \\ 0.12 \end{bmatrix}$
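A minimal NumPy sketch of this single-layer computation (the explicit sigmoid below is an assumption for illustration, not code from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
b = np.array([1.0, 0.0])

a = sigmoid(W @ x + b)   # W @ x + b = [4, -2]
print(a)                 # approximately [0.98, 0.12]
```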
Matrix Operation
Introduction to Neural Networks
Figure: the first hidden layer of the network, with parameters $W^1$ and $b^1$. The first-layer activations are $a^1 = \sigma(W^1 x + b^1)$.
Matrix Operation
Introduction to Neural Networks
Figure: a deep network with parameters $W^1, b^1, W^2, b^2, \ldots, W^L, b^L$, mapping the input $x$ through activations $a^1, a^2, \ldots$ to the outputs $y_1, \ldots, y_M$:
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
$\;\;\vdots$
$y = \sigma(W^L a^{L-1} + b^L)$
Matrix Operation
Introduction to Neural Networks
Composing all layers, the whole network computes
$y = f(x) = \sigma\!\left(W^L \cdots\, \sigma\!\left(W^2\, \sigma\!\left(W^1 x + b^1\right) + b^2\right) \cdots + b^L\right)$
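A short sketch of this chained computation for an arbitrary number of layers; the layer sizes, parameter values, and the use of sigmoid at every layer are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Tiny 2 -> 2 -> 1 example with arbitrary illustrative parameters
weights = [np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([[0.5, -0.5]])]
biases  = [np.array([1.0, 0.0]),                 np.array([0.1])]
print(forward(np.array([1.0, -1.0]), weights, biases))
```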
• The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.
The linear model simply outputs 0.5 everywhere and is not able to
represent the XOR function.
This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c).
The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
The output layer is still just a linear regression model, but now it is applied to h rather than to x.
The network now contains two functions chained together:
h = f^(1)(x; W, c) and y = f^(2)(h; w, b),
with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).
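A hedged sketch of one known solution with this two-layer architecture, using a ReLU hidden layer; the particular weight values follow the standard textbook XOR solution (Goodfellow et al.) rather than anything shown explicitly above:

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # hidden-layer weights
c = np.array([0.0, -1.0])       # hidden-layer biases
w = np.array([1.0, -2.0])       # output-layer weights
b = 0.0                         # output-layer bias

def xor_net(x):
    h = np.maximum(0, W @ x + c)   # f1: ReLU hidden layer
    return w @ h + b               # f2: linear output layer

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))   # 0, 1, 1, 0
```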
• The neural network has obtained the correct answer for every example
in the batch.
Activation Functions
• Activation functions are an extremely important feature of artificial neural networks. They decide whether a neuron should be activated or not, and they limit the output signal to a finite value.
Activation Functions
Introduction to Neural Networks
Linear activation. Equation: f(x) = x. Range: (-infinity, infinity).
It doesn't help with the complexity or various parameters of the usual data that is fed to neural networks.
Activation: Sigmoid
Introduction to Neural Networks
• Sigmoid function σ: takes a real-valued number and “squashes” it into the range
between 0 and 1
The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; Fully firing = 1
When the neuron's activations are at 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
Sigmoid activations are less common in modern NNs
$\sigma: \mathbb{R}^n \to [0, 1]^n$, applied element-wise as $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Figure: plot of the sigmoid function.
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the
output of each neuron.
Disadvantages:
Vanishing gradient—for very high or very low values of X,
there is almost no change to the prediction, causing a
vanishing gradient problem. This can result in the network
refusing to learn further, or being too slow to reach an
accurate prediction.
Outputs not zero centered.
Computationally expensive
Activation: Tanh
Introduction to Neural Networks
• Tanh function: takes a real-valued number and “squashes” it into range between
-1 and 1
Like sigmoid, tanh neurons saturate
Unlike sigmoid, the output is zero-centered
o It is therefore preferred over the sigmoid
Tanh is a scaled sigmoid: $\tanh(x) = 2\,\sigma(2x) - 1$
$\tanh: \mathbb{R}^n \to [-1, 1]^n$
Figure: plot of the tanh function.
Slide credit: Ismini Lourentzou – Introduction to Deep Learning
Advantages
Zero centered—making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
Disadvantages
Like the sigmoid function, it also suffers from the vanishing gradient problem.
Hard to train on small datasets.
Activation: ReLU
Introduction to Neural Networks
Advantages
Avoids vanishing gradient problem.
Computationally efficient—allows the network to converge very quickly
Non-linear—although it looks like a linear function, ReLU has a derivative
function and allows for backpropagation
Disadvantages
Can only be used within the hidden layers.
Hard to train on small datasets; it needs a lot of data to learn non-linear behavior.
The dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation for those units and cannot learn.
The function and its derivative are both monotonic.
All negative values are converted to zero immediately, so the negative part of the data can be neither mapped nor fit properly, which creates a problem.
This activation function also has drawbacks: during forward propagation, if the learning rate is set very high, the update will overshoot and kill the neuron. This happens when the learning rate is not set at an optimum level, as illustrated in the accompanying graph.
Advantages
Prevents dying ReLU problem—this variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation, even
for negative input values
Otherwise like ReLU
Disadvantages
Results not consistent—leaky ReLU does not provide consistent
predictions for negative input values.
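For reference, a compact sketch of the activation functions discussed above; the 0.01 slope used for leaky ReLU is a common default, assumed here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                          # zero-centered, squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                  # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)       # small positive slope for negatives

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(z))
```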
Activation: Softmax
The sigmoid is not able to handle more than two cases (class labels); softmax can handle multiple cases. The softmax function squeezes the output for each class between 0 and 1, with the sum of the outputs equal to 1.
It is ideally used in the final output layer of a classifier, where we are actually trying to obtain the class probabilities.
Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes, rather than only binary-class solutions.
$\sigma(z)_i = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
where:
• $\sigma$ = softmax
• $z_i$ = the $i$-th element of the input vector, and $e^{z_i}$ = the standard exponential function applied to it
• $K$ = number of classes in the multi-class classifier
• $e^{z_j}$ = the standard exponential function applied to each element in the normalizing sum
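A minimal sketch of this formula; subtracting the maximum before exponentiating is a standard numerical-stability trick added as an assumption:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate each score and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))   # shift by max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, -3.0, 3.0])))   # probabilities summing to 1
```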
Softmax Layer
Introduction to Neural Networks
Figure: comparison of independent sigmoid outputs and a softmax layer. Applying the sigmoid to the scores 1 and -3 independently gives 0.73 and 0.05; with a softmax layer, $e^{1} = 2.7$ and $e^{-3} = 0.05$ are normalized by the sum of all exponentials, giving outputs of 0.12 and ≈ 0, with the remaining probability mass assigned to the largest score.
• In simple terms, the loss function is a method of evaluating how well your algorithm is modeling your dataset. It is a mathematical function of the parameters of the machine learning algorithm.
• In simple linear regression, the prediction is calculated using the slope (m) and intercept (b). The loss for a single example is $(Y_i - \hat{Y}_i)^2$, i.e., the loss function is a function of the slope and intercept.
• If the value of the loss function is low, then it's a good model; otherwise, we have to change the parameters of the model and minimize the loss.
• Most people confuse the loss function and the cost function. Let's understand what the loss function and the cost function are: they are often treated as synonymous and used interchangeably, but they are different.
Loss Function:
A loss function/error function is for a single training example/input.
Cost Function:
A cost function, on the other hand, is the average loss over the entire training
dataset.
Loss function in Deep Learning
1. Regression
MSE(Mean Squared Error)
MAE(Mean Absolute Error)
Huber loss
2. Classification
Binary cross-entropy
Categorical cross-entropy
3. AutoEncoder
KL Divergence
5. Object detection
Focal loss
6. Word embeddings
Triplet loss
Regression Loss: MSE (Mean Squared Error) — the average of the squared differences between the actual values and the model predictions.
Advantages
• 1. Easy to interpret.
• 2. Always differentiable because of the square.
• 3. Only one local minimum.
Disadvantages
• 1. The error is in squared units, so it is not as easily understood as the original units.
• 2. Not robust to outliers.
Note – In regression, use a linear activation function at the last neuron.
The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset.
Advantages
• 1. Intuitive and easy to interpret.
• 2. The error unit is the same as the output column's.
• 3. Robust to outliers.
Disadvantage
• The function is not differentiable everywhere, so we cannot use gradient descent directly; instead we can use subgradient calculations.
Note – In regression, use a linear activation function at the last neuron.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss.
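A compact sketch of the three regression losses above; the delta = 1.0 threshold for the Huber loss and the example numbers are assumptions for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones (robust to outliers)."""
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.0])     # last point has an outlier-sized error
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```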
Classification Loss:
Categorical cross-entropy: $\mathcal{L} = -\sum_{i=1}^{k} y_i \log \hat{y}_i$
where
• k = number of classes,
• y = actual value,
• ŷ = neural network prediction.
Note – In multi-class classification, use the softmax activation function at the last neuron.
If the target column is one-hot encoded (e.g., 0 0 1, 0 1 0, 1 0 0), use categorical cross-entropy; if the target column is numerically encoded (e.g., 1, 2, 3, 4, …, n), use sparse categorical cross-entropy.
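A minimal sketch contrasting the two on a tiny three-class example; the predicted probabilities and labels are hypothetical, and clipping is a standard safeguard against log(0):

```python
import numpy as np

def categorical_crossentropy(y_onehot, y_pred):
    """Targets given as one-hot vectors, e.g. [0, 0, 1]."""
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y_onehot * np.log(y_pred), axis=-1).mean()

def sparse_categorical_crossentropy(y_int, y_pred):
    """Targets given as integer class indices, e.g. 2."""
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.log(y_pred[np.arange(len(y_int)), y_int]).mean()

y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.8, 0.1, 0.1]])
print(categorical_crossentropy(np.array([[0, 0, 1], [1, 0, 0]]), y_pred))
print(sparse_categorical_crossentropy(np.array([2, 0]), y_pred))   # same value
```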
Which is Faster?
Training NNs
Training Neural Networks
• The network parameters include the weight matrices and bias vectors from all
layers
$\theta = \{W^1, b^1, W^2, b^2, \cdots, W^L, b^L\}$
Often, the model parameters are referred to as weights
• Training a model to learn a set of parameters that are optimal (according to a
criterion) is one of the greatest challenges in ML
Figure: digit-recognition example — a network with 16 × 16 = 256 input pixels and a softmax output layer over ten classes $y_1, \ldots, y_{10}$ (e.g., $y_1 = 0.1$ for "is 1", $y_2 = 0.7$ for "is 2", $y_{10} = 0.2$ for "is 0").
Slide credit: Hung-yi Lee – Deep Learning Tutorial
The loss $\mathcal{L}(\theta)$ is minimized using the partial derivatives $\partial \mathcal{L} / \partial \theta_i$ with respect to each parameter $\theta_i$.
Figure: the gradient descent loop — starting from the initial parameters, compute the loss and its gradient, then apply the parameter update to obtain new parameters, and repeat.
Example of gradient descent over two weights $w_1, w_2$:
1. Randomly pick a starting point $\theta^0$.
2. Compute the gradient at $\theta^0$:
$\nabla \mathcal{L}(\theta^0) = \begin{bmatrix} \partial \mathcal{L}(\theta^0)/\partial w_1 \\ \partial \mathcal{L}(\theta^0)/\partial w_2 \end{bmatrix}$
3. Multiply by the learning rate $\eta$ and update: $\theta^1 = \theta^0 - \eta\, \nabla \mathcal{L}(\theta^0)$.
4. Go to step 2 and repeat until reaching the (local) optimum $\theta^*$.
Figure: contour plot of the loss over $(w_1, w_2)$ showing the update steps.
Slide credit: Hung-yi Lee – Deep Learning Tutorial
• Example (contd.): 4. Go to step 2 and repeat, producing successive parameter updates $\theta^0, \theta^1, \theta^2, \ldots$
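A minimal sketch of these steps on a toy quadratic loss; the loss function, starting point, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def loss(theta):
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2      # toy convex loss

def grad(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)])

theta = np.array([0.0, 0.0])     # step 1: pick a starting point
eta = 0.1                        # learning rate
for step in range(50):           # steps 2-4: compute gradient, update, repeat
    theta = theta - eta * grad(theta)
print(theta, loss(theta))        # approaches the minimum at (3, -1)
```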
• Gradient descent algorithm stops when a local minimum of the loss surface is
reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained.
Gradient descent may reach different minima at every run.
Therefore, the NN will produce different predicted outputs.
• In addition, we currently don't have algorithms that guarantee reaching a global minimum for an arbitrary loss function.
Figure: a non-convex loss surface $\mathcal{L}$ over the parameters $w_1, w_2$.
• The learning rate should be neither too low nor too high.
• If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a while.
• Put the number of iterations on the x-axis and the value of the cost
function on the y-axis. This helps you see the value of your cost
function after each iteration of gradient descent, and provides a
way to easily spot how appropriate your learning rate is.
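A small sketch of producing such a plot, reusing a toy loss; matplotlib and the recorded history list are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def loss(theta):
    return float(np.sum((theta - np.array([3.0, -1.0])) ** 2))

def grad(theta):
    return 2.0 * (theta - np.array([3.0, -1.0]))

theta, eta, history = np.array([0.0, 0.0]), 0.1, []
for _ in range(50):
    history.append(loss(theta))        # record the cost after each iteration
    theta = theta - eta * grad(theta)

plt.plot(history)                      # iterations on x-axis, cost on y-axis
plt.xlabel("iteration")
plt.ylabel("cost")
plt.show()
```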
Learning Rate
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate of increase,
but it does not tell us how far along the opposite direction we should step
Choosing the learning rate (also called the step size) is one of the most important
hyper-parameter settings for NN training
Figure: loss curves when the learning rate (LR) is too small versus too large.
Backpropagation
• Modern NNs employ the backpropagation method for calculating the gradients
of the loss function
Backpropagation is short for "backward propagation of errors"
• For training NNs, forward propagation (forward pass) refers to passing the
inputs through the hidden layers to obtain the model outputs (predictions)
The loss function is then calculated
Backpropagation traverses the network in reverse order, from the outputs backward
toward the inputs to calculate the gradients of the loss
The chain rule is used for calculating the partial derivatives of the loss function with
respect to the parameters in the different layers in the network
• Each update of the model parameters during training takes one forward and
one backward pass (e.g., of a batch of inputs)
• Automatic calculation of the gradients (automatic differentiation) is available in
all current deep learning libraries
It significantly simplifies the implementation of deep learning algorithms, since it
obviates deriving the partial derivatives of the loss function by hand
Example of BPNN
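A minimal numeric sketch of one forward and one backward pass for a tiny 2-2-1 sigmoid network trained with MSE on a single example; all parameter values and layer sizes are illustrative assumptions rather than the numbers from the original figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-2-1 network with illustrative parameters
W1, b1 = np.array([[0.15, 0.20], [0.25, 0.30]]), np.array([0.35, 0.35])
W2, b2 = np.array([[0.40, 0.45]]), np.array([0.60])
x, y_true = np.array([0.05, 0.10]), np.array([0.01])
eta = 0.5

# Forward pass
a1 = sigmoid(W1 @ x + b1)            # hidden activations
y  = sigmoid(W2 @ a1 + b2)           # network output
loss = 0.5 * np.sum((y - y_true) ** 2)

# Backward pass (chain rule, layer by layer)
delta2 = (y - y_true) * y * (1 - y)          # dL/dz at the output layer
dW2, db2 = np.outer(delta2, a1), delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # dL/dz at the hidden layer
dW1, db1 = np.outer(delta1, x), delta1

# Gradient descent update
W2, b2 = W2 - eta * dW2, b2 - eta * db2
W1, b1 = W1 - eta * dW1, b1 - eta * db1
print(loss)
```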
Image Fundamentals
Pixels: The Building Blocks of Images
A pixel is the "color" or the "intensity" of light that appears at a given location in an image.
• Grayscale:
In a grayscale image, each pixel is a scalar value between 0 and 255, where 0 corresponds to "black" and 255 to "white". Values between 0 and 255 are varying shades of gray, where values closer to 0 are darker and values closer to 255 are lighter.
• Color:
Each of the three colors is represented by an integer in the range 0 to 255, which indicates how
“much” of the color there is
• Given that the pixel value only needs to be in the range [0, 255], we normally use an 8-bit
unsigned integer to represent each color intensity.
• We then combine these values into an RGB tuple in the form (red, green, blue). This tuple
represents our color.
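For instance, a quick sketch of such 8-bit RGB tuples (the specific colors chosen are just examples):

```python
import numpy as np

# 8-bit unsigned integers: each channel lives in [0, 255]
red   = np.array([255, 0, 0], dtype=np.uint8)
white = np.array([255, 255, 255], dtype=np.uint8)
black = np.array([0, 0, 0], dtype=np.uint8)
print(red, white, black)
```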
In this case, our matrix has 1,000 columns (the width) and 750 rows (the height). For example, yellow is represented by the RGB tuple (255, 255, 0).
• The co-ordinate system (X, Y) uses (width x height ) format. However, OpenCV represents
images as NumPy arrays. NumPy stores the image in (height, width, channels) format.
• To access an individual pixel value from our image we use simple NumPy array indexing :
• (b, g, r) = image[0, 0]
• It’s important to note that OpenCV stores RGB channels in reverse order. While we
normally think in terms of Red, Green, and Blue, OpenCV actually stores the pixel values in
Blue, Green, Red order.
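A short sketch of reading an image and accessing a pixel with OpenCV; the file name is a placeholder, and note the channels come back in Blue, Green, Red order:

```python
import cv2

image = cv2.imread("example.jpg")        # NumPy array in (height, width, channels) shape
(h, w, c) = image.shape
print(f"height={h}, width={w}, channels={c}")

(b, g, r) = image[0, 0]                  # OpenCV stores channels as Blue, Green, Red
print(f"pixel at (0, 0): R={r}, G={g}, B={b}")
```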
Scaling and Aspect Ratios :
• Scaling, or simply resizing, is the process of increasing or decreasing the
size of an image in terms of width and height.
• Ignoring the aspect ratio can lead to images that look compressed and
distorted. To prevent this behavior, we simply scale the width and height
of an image by equal amounts when resizing an image.
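A minimal sketch of resizing while preserving the aspect ratio with OpenCV; the target width of 300 pixels and the file name are illustrative:

```python
import cv2

image = cv2.imread("example.jpg")
(h, w) = image.shape[:2]

new_width = 300
ratio = new_width / float(w)             # scale factor that preserves the aspect ratio
new_height = int(h * ratio)

resized = cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_AREA)
print(resized.shape)
```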