
19CSE456 Neural Networks and

Deep Learning

Unit 3

Course Instructor: Dr. M. Anbazhagan


Unit 3
Introduction to deep learning - Deep neural networks -
convolutional nets – case studies using Keras/Tensorflow
- neural nets for sequences - Recurrent Nets – Long-
Short-Term-memory; Introduction to Deep unsupervised
learning – PCA to autoencoders.

2
Deep Learning
A neural network with three or more layers

3
What is Deep Learning?
▪ Deep learning is a subset of machine learning, which is essentially
a neural network with three or more layers
▪ These neural networks attempt to simulate the behavior of the human
brain - albeit far from matching its ability - allowing it to “learn” from
large amounts of data
▪ While a neural network with a single layer can still make approximate
predictions, additional hidden layers can help to optimize and refine for
accuracy
▪ These networks can learn complex representations of data by
discovering hierarchical patterns and features in the data
▪ Deep Learning has achieved significant success in various fields, including
image recognition, natural language processing, speech recognition, and
recommendation systems
▪ Training deep neural networks typically requires a large amount of data
and computational resources
4
What is Deep Learning?

5
Deep Learning Vs. Machine Learning
▪ Deep learning distinguishes itself from classical machine learning
by the type of data that it works with and the methods in which it
learns
▪ Machine learning algorithms leverage structured, labeled data to
make predictions - meaning that specific features are defined from
the input data for the model and organized into tables
▪ Deep learning eliminates some of the data pre-processing that is
typically involved with machine learning - these algorithms can
ingest and process unstructured data, like text and images, and
automate feature extraction
▪ Machine learning and deep learning models are capable of different
types of learning as well, which are usually categorized as supervised
learning, unsupervised learning, and reinforcement learning
6
Deep Neural Nets
▪ A class of artificial neural networks that are designed to mimic the
structure and function of the human brain
▪ They are called "deep" because they have multiple layers of
interconnected nodes (neurons) between the input and output layers
▪ Each layer is composed of numerous artificial neurons or nodes, and each
node is connected to the nodes in the adjacent layers
▪ These connections, known as weights, hold the parameters that are
learned during the training process
▪ Deep neural networks are trained using a technique called
backpropagation
▪ The depth of the network allows it to learn complex representations and
hierarchies of features from the input data
▪ Deep neural networks have gained significant attention and popularity in
recent years due to their impressive performance on a wide range of tasks
7
Hierarchical Features

8
Deep Neural Nets

9
Convolutional Nets
To view the world as humans do, perceive it in a similar
manner, and even use the knowledge for a multitude of
tasks.
10
Convolutional Neural Network
▪ It is a type of artificial neural network that is particularly effective
in analyzing visual data
▪ CNNs are widely used in computer vision tasks such as image
recognition, object detection, and image classification
▪ The architecture of a CNN is inspired by the organization of the
visual cortex in animals, which has different layers of neurons that
respond to different visual stimuli
▪ Similarly, a CNN consists of multiple layers of interconnected
neurons, including convolutional layers, pooling layers, and fully
connected layers

11
Visual Cortex and Receptive Fields

12
CNN Vs Visual Cortex
▪ A Convolutional Neural Network is a Deep Learning algorithm that can
take in an input image, assign importance to various aspects/objects in
the image, and be able to differentiate one from the other

13
CNN Architecture
▪ The architecture of a ConvNet is analogous to that of the connectivity
pattern of Neurons in the Human Brain and was inspired by the
organization of the Visual Cortex

14
CNN Architecture
▪ The CNN architecture has three main types of layers, which are:
▪ Convolutional layer
▪ The convolutional layer is the core building block of a CNN, and it is
where the majority of computation occurs
▪ It requires a few components, which are input data, a filter, and a feature
map.
▪ Pooling layer
▪ Pooling layers, also known as downsampling, conduct dimensionality
reduction, reducing the number of parameters in the input
▪ There are two major types of pooling: max pooling and average pooling
▪ Fully-connected (FC) layer
▪ This layer performs the task of classification based on the features
extracted through the previous layers and their different filters
15
Why ConvNets over Feed-Forward Neural Nets?
▪ An image is nothing but a matrix of pixel values, right?
▪ So why not just flatten the image (e.g., 3x3 image matrix into a 9x1
vector) and feed it to a Multi-Layer Perceptron for classification
purposes?
▪ In cases of extremely basic binary images, the method might show
an average precision score while performing prediction of classes but
would have little to no accuracy when it comes to complex images
having pixel dependencies throughout
▪ A ConvNet is able to successfully capture the Spatial and Temporal
dependencies in an image through the application of relevant filters

16
Spatial Vs. Temporal Dependencies
▪ Spatial dependency refers to the relationships between different parts of
an image or data in space
▫ For example, in an image, nearby pixels often have some relationship
to each other (like edges or textures)
▪ Temporal dependency is more applicable to sequences of data, such as
video frames or time-series data
▫ Temporal dependencies capture the relationships between data points
over time

17
Input Image
▪ In the figure, we have an RGB image that has been separated by
its three color planes - Red, Green, and Blue
▪ There are a number of such color spaces in which images exist -
Grayscale, RGB, HSV, CMYK, etc.

18
Convolution Layer
▪ The convolution layer is the core building block of the CNN
▪ It carries the main portion of the network’s computational load
▫ This layer performs a dot product between two matrices, where one
matrix is the set of learnable parameters otherwise known as a
kernel, and the other matrix is the restricted portion of the receptive
field
▫ The kernel is spatially smaller than an image but is more in-depth
▫ This means that, if the image is composed of three (RGB) channels,
the kernel height and width will be spatially small, but the depth
extends up to all three channels

19
Convolution Layer

• During the forward pass, the kernel slides across the height and width of
the image, producing the image representation of that receptive region
• This produces a two-dimensional representation of the image known as an
activation map that gives the response of the kernel at each spatial
position of the image
• The sliding size of the kernel is called a stride

20
Convolution Layer
• In the demonstration, the green section resembles our 5x5x1 input
image, I
• The element involved in the convolution operation in the first part of a
Convolutional Layer is called the Kernel/Filter, K, represented in the
color yellow. We have selected K as a 3x3x1 matrix
• The Kernel shifts 9 times because of Stride Length = 1 (Non-Strided),
every time performing an elementwise multiplication operation
(Hadamard Product) between K and the portion P of the image over
which the kernel is hovering
21
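The sliding-and-summing just described can be sketched in a few lines of NumPy (illustrative only; the input and kernel values below are made up, and the loop is written for clarity rather than speed):

import numpy as np

# 5x5 single-channel input I and 3x3 kernel/filter K, as in the demonstration
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])

stride = 1
out = (I.shape[0] - K.shape[0]) // stride + 1            # (5 - 3)/1 + 1 = 3
feature_map = np.zeros((out, out))
for i in range(out):
    for j in range(out):
        patch = I[i*stride:i*stride+3, j*stride:j*stride+3]   # portion P under the kernel
        feature_map[i, j] = np.sum(patch * K)                 # Hadamard product, then sum

# feature_map has shape (3, 3): the kernel fits in 9 positions, matching the slide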
Convolution operation on a MxNx3 image matrix with a 3x3x3
Kernel

22
Convolution Layer

23
Convolution Operation with Stride Length = 2
• The objective of the Convolution
Operation is to extract the high-level
features such as edges, from the
input image
• Conventionally, the first ConvLayer is
responsible for capturing the Low-
Level features such as edges, color,
gradient orientation, etc.
• With added layers, the architecture adapts to the High-Level features as
well, giving us a network that has a wholesome understanding of images
in the dataset, similar to how we would perceive them
24
Padding
▪ There are two types of results to the operation
▪ One in which the convolved feature is reduced in dimensionality as
compared to the input – Valid padding
▪ If we perform the same operation without padding, we are presented
with a matrix that has dimensions of the Kernel (3x3x1) itself - Valid
Padding
▪ The other in which the dimensionality is either increased or remains the
same – Same padding
▪ When we augment the 5x5x1 image into a 6x6x1 image and then apply
the 3x3x1 kernel over it, we find that the convolved matrix turns out to
be of dimensions 5x5x1. Hence the name - Same Padding

25
Padding

26
Why Padding?
▪ Padding refers to the practice of adding extra pixels (usually zeros)
around the borders of an input image or feature map before
applying a convolutional filter
▫ Why use Padding?
▫ Preserve Spatial Dimensions: Without padding, applying a
convolutional filter reduces the spatial dimensions (height and
width) of the input
▫ Handle Edge Information: Convolutional filters might ignore or
inadequately process the pixels near the edges of the input -
Padding ensures that edge pixels are treated with the same
importance as central pixels
▫ Enable Deeper Networks: Maintaining spatial dimensions across
multiple layers can be essential for constructing deeper
networks without excessively reducing the feature map size
27
28
Padding Types
Valid Padding
• No padding is added to the input. The convolution is applied only to the
valid (completely overlapping) regions
• Output size = (Input size − Filter size) / Stride + 1

Same Padding
• Padding is added so that the output feature map has the same spatial
dimensions as the input
• The amount of padding depends on the filter size and stride
• Output size = Input size / Stride

Full Padding
• Padding is added so that every possible position of the filter overlaps with
the input, including partial overlaps
• This typically results in an output larger than the input
• Output size = Input size + Filter size − 1
29
30
Reflective Vs. Replicative Padding
▪ Reflective Padding: The padding is a mirror reflection of the input
boundary pixels
▫ For example, if the edge pixels are [A, B, C], reflective padding would
add [B, A] before and [C, B] after
▪ Replicate Padding: The edge boundary pixels are replicated
▫ Using the same example, replicate padding would add [A, A] before
and [C, C] after
▪ For a row of pixels A B C D, replicate padding (two pixels per side) gives
A A A B C D D D, while reflective padding gives B A A B C D D C
31
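For illustration (assuming NumPy; the slide itself does not name a library), np.pad offers modes that correspond to these schemes. Note that mode='symmetric' reproduces the slide's [B, A] … [C, B] example, which mirrors including the edge pixel, while mode='reflect' mirrors without repeating the edge:

import numpy as np

row = np.array([1, 2, 3, 4])        # stands in for the pixels A, B, C, D
np.pad(row, 2, mode='symmetric')    # [2 1 1 2 3 4 4 3] -> reflective padding as in the slide (B A ... D C)
np.pad(row, 2, mode='reflect')      # [3 2 1 2 3 4 3 2] -> mirror that excludes the edge pixel
np.pad(row, 2, mode='edge')         # [1 1 1 2 3 4 4 4] -> replicate padding (A A ... D D)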
Calculating Padding Size
▪ The padding size depends on several factors:
▫ Filter Size (F): The size of the convolutional filter (e.g., 3x3, 5x5)
▫ Stride (S): The number of pixels the filter moves across the input
▫ Dilation (D): The spacing between filter elements, used in dilated
convolutions
▪ For same padding, the padding 𝑃 can be calculated as:

𝑃 = (𝑆 × (𝑂 − 1) + 𝐹 − 𝐼) / 2

where 𝑂 is the output size and 𝐼 is the input size (for same padding, 𝑂 = 𝐼)

32
Formula for Convolution Layer
▪ If we have an input of size W x W x D and Dout number of kernels with a
spatial size of F with stride S and amount of padding P, then the size of
output volume can be determined by the following formula:

W_out = (W − F + 2P) / S + 1, so the output volume is W_out x W_out x Dout
33
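A small helper (our own sketch; the function names are not from the slides) that evaluates this output-size formula together with the same-padding amount from the previous slide:

def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

def same_padding(I, F, S=1):
    """Padding per side so that the output size equals the input size (assumes an integer result)."""
    O = I                       # for 'same' padding the target output size is the input size
    return (S * (O - 1) + F - I) // 2

conv_output_size(5, 3)                          # 3 -> valid padding on the 5x5 / 3x3 example
conv_output_size(5, 3, P=same_padding(5, 3))    # 5 -> same padding keeps the 5x5 size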
Pooling Layer
▪ Similar to the Convolutional Layer, the Pooling layer is responsible
for reducing the spatial size of the Convolved Feature
▪ This is to decrease the computational power required to process the
data through dimensionality reduction
▪ Furthermore, it is useful for extracting dominant features which are
rotationally and positionally invariant, thus helping the model train
effectively
▪ There are two types of Pooling: Max Pooling and Average Pooling
▪ Max Pooling returns the maximum value from the portion of the image
covered by the Kernel
▪ On the other hand, Average Pooling returns the average of all the values
from the portion of the image covered by the Kernel

34
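A rough NumPy sketch of both pooling types over a small feature map (illustrative only; the 2x2 window, stride 2, and the values are arbitrary choices):

import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 8., 3.],
               [4., 9., 0., 1.]])
pool2d(fm, mode="max")   # [[6. 4.] [9. 8.]]   -> maximum of each window
pool2d(fm, mode="avg")   # [[3.75 2.25] [5.5 3.]] -> average of each window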
3x3 pooling over 5x5 convolved feature

35
Max and Average Pooling
▪ Max Pooling also performs as a Noise Suppressant. It discards the noisy
activations altogether and also performs de-noising along with
dimensionality reduction
▪ On the other hand, Average Pooling simply performs dimensionality
reduction as a noise-suppressing mechanism
▪ Hence, we can say that Max Pooling performs a lot better than Average
Pooling.

36
Convolution and Pooling Layers
▪ The Convolutional Layer and the Pooling Layer, together form the
ith layer of a Convolutional Neural Network
▪ Depending on the complexities in the images, the number of such
layers may be increased for capturing low-level details even further,
but at the cost of more computational power

37
Fully Connected Layer
▪ After going through the above process, we have successfully
enabled the model to understand the features
▪ Moving on, we are going to flatten the final output and feed it to a
regular Neural Network for classification purposes

38
Fully Connected Layer
▪ Adding a Fully-Connected layer is a (usually) cheap way of learning
non-linear combinations of the high-level features as represented
by the output of the convolutional layer
▪ Now that we have converted our input image into a suitable form for
our Multi-Layer Perceptron, we shall flatten the image into a column
vector
▪ The flattened output is fed to a feed-forward neural network and
backpropagation is applied to every iteration of training
▪ Over a series of epochs, the model is able to distinguish between
dominating and certain low-level features in images and classify
them using the SoftMax Classification technique

39
A complete CNN Architecture

40
Building of CNN
• Load the Dataset
• Pre-process the Data
• Add Convolutional Layer (filters, kernel_size, strides, padding, activation)
• Add Pooling Layer (pool_size)
• Add Dropout Layer (rate)
• Compile the Model (optimizer, loss, metrics)
• Fit the Model (X_train, y_train, batch_size, epochs, validation_data)
• Evaluate the Model
41
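A minimal Keras/TensorFlow sketch of this pipeline (an illustration, not the course's reference code: the slides do not name a dataset, so MNIST is used as a stand-in, and the filter counts, dropout rate, and epochs are arbitrary choices):

import tensorflow as tf
from tensorflow.keras import layers, models

# Load and pre-process the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model = models.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(filters=32, kernel_size=(3, 3), strides=1,
                  padding="same", activation="relu"),    # convolutional layer
    layers.MaxPooling2D(pool_size=(2, 2)),               # pooling layer
    layers.Dropout(rate=0.25),                           # dropout layer
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),              # fully-connected classifier
])

# Compile, fit, and evaluate the model
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=5,
          validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)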
Can you recognize the object given below using an MLP
or a CNN?

42
Can you predict where the object is going to travel to
next using an MLP or a CNN?

43
What if the object’s past positions are available?

44
Sequential Data Examples

45
What is Sequential Modeling?
Sequential modeling is a technique used in machine
learning and deep learning where models are designed
to process sequences of data

46
Recurrent Nets
The first algorithm with an internal memory that
remembers its input, making it perfect for problems
involving sequential data in machine learning
47
Perceptron

48
Multilayer Perceptron

49
What is an RNN?
▪ It is a type of neural network architecture that is designed to
handle sequential data or data with temporal dependencies
▪ RNNs are particularly well-suited for tasks such as natural language
processing, speech recognition, machine translation, time series analysis,
and more
▪ The distinguishing feature of RNNs is their ability to capture and utilize
information from previous time steps in the sequence
▪ Unlike traditional feedforward neural networks, which process each input
independently, RNNs have a hidden state that is updated at each time
step and carries information about the previous steps
▪ This hidden state allows RNNs to maintain an internal memory and
capture the context and dependencies within the sequential data
▪ The basic building block of an RNN is a recurrent neuron, which takes an
input and the previous hidden state as inputs, and produces an output
and a new hidden state
50
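A bare-bones NumPy sketch of one such recurrent neuron (our own illustration; the weight shapes and toy sequence are arbitrary). The same weights are reused at every time step, and the hidden state h carries context forward:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # new hidden state from the current input and the previous hidden state
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # output produced at this time step
    y_t = W_hy @ h_t + b_y
    return y_t, h_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W_xh = rng.normal(size=(n_hid, n_in))
W_hh = rng.normal(size=(n_hid, n_hid))
W_hy = rng.normal(size=(n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)                        # initial hidden state
for x_t in rng.normal(size=(5, n_in)):     # a toy sequence of 5 time steps
    y_t, h = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)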
Neuron with Recurrence

[Figure: a recurrent neuron unrolled in time — inputs 𝑥0 … 𝑥𝑇, hidden states ℎ0, ℎ1, ℎ2, …, and outputs 𝑦̂0 … 𝑦̂𝑇]

51
What is an RNN?
▪ RNNs are a type of neural network that can be used to model
sequence data
▪ RNNs, which are formed from feedforward networks, are similar to
human brains in their behavior
▪ Simply said, recurrent neural networks can anticipate sequential data
in a way that other algorithms can’t

52
What is an RNN?

53
What is an RNN?

• All of the inputs and outputs in standard neural networks are independent of
one another, however in some circumstances, such as when predicting the next
word of a phrase, the prior words are necessary, and so the previous words
must be remembered
• As a result, RNN was created, which used a Hidden Layer to overcome the
problem
• RNNs have a Memory that stores all information about the calculations
54
The Architecture of a Traditional RNN
▪ RNNs are a type of neural network that has hidden states and
allows past outputs to be used as inputs
▪ They usually go like this:

55
The Architecture of a Traditional RNN

56
RNN Types
▪ RNN architecture can vary depending on the problem you’re trying
to solve
▪ From those with a single input and output to those with many

57
How do Recurrent Neural Networks work?
▪ The information in recurrent neural networks cycles through a loop
to the middle layer

▪ The input layer x receives and processes the neural network’s input
before passing it on to the middle layer
▪ Multiple hidden layers can be found in the middle layer h, each with
its own activation functions, weights, and biases
58
How do Recurrent Neural Networks work?
▪ The different activation functions, weights, and biases will be
standardized by the Recurrent Neural Network, ensuring that each
hidden layer has the same characteristics
▪ Rather than constructing numerous hidden layers, it will create
only one and loop over it as many times as necessary
▪ Parameter Efficiency
▪ Memory Across Time Steps
▪ Sequential Processing

59
Common Activation Functions
▪ A neuron’s activation function dictates whether it should be turned
on or off
▪ Nonlinear functions usually transform a neuron’s output to a
number between 0 and 1 or -1 and 1

60
Recurrent Neural Network Vs Feedforward Neural Network
▪ A feed-forward neural network has only one route of information flow:
from the input layer to the output layer, passing through the hidden layers
▪ The data flows across the network in a straight route, never going
through the same node twice
▪ Feed-forward neural networks are poor at predicting what will happen next
because they have no memory of the information they receive
▪ In an RNN, the information cycles through a loop

61
Recurrent Neural Network: Design Criteria
▪ To model sequences, we need to:
▪ Handle variable-length sequences
▪ Track long-term dependencies
▪ Maintain information about order
▪ Share parameters across the sequence

62
Input to an RNN
▪ Can an RNN understand text written in natural language?

[Figure: the words "neural" and "network" fed to an RNN as 3-component embedding vectors, e.g. [0.1, 0.6, 0.5] and [0.3, 0.7, 0.4]]

• Word embeddings (embedding layers) are techniques used in NLP to
represent words in a continuous vector space
• These vectors capture semantic meanings and relationships between
words, allowing neural networks to process and understand text more
effectively
63
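For illustration (assuming Keras' Embedding layer; the vocabulary size and token ids below are made up), an embedding layer maps integer token ids to dense vectors like the 3-component ones in the figure:

import tensorflow as tf

embed = tf.keras.layers.Embedding(input_dim=10000, output_dim=3)  # 10,000-word vocabulary -> 3-d vectors
token_ids = tf.constant([[12, 57, 243]])   # hypothetical ids for three tokens of a sentence
vectors = embed(token_ids)                 # shape (1, 3, 3): one learned vector per token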
Backpropagation Through Time (BPTT)
▪ When we apply a Backpropagation algorithm to a Recurrent Neural
Network with time series data as its input, we call it
backpropagation through time
▪ In BPTT, the recurrent neural network is "unrolled" in time, meaning
that the network is unfolded into a series of connected copies of
itself, each representing a specific time step
▪ This creates a directed acyclic graph (DAG) where each copy of the
network represents a different point in time
▪ During training, BPTT involves feeding the network with a sequence
of input data, propagating it forward through time, and calculating
the loss at each time step
▪ The loss is then used to compute the gradients of the network's
parameters with respect to the loss using the backpropagation
algorithm
64
How do we compute Loss?
[Figure: the RNN unrolled across time steps 0 … 𝑇 with shared weights 𝑊𝑥ℎ, 𝑊ℎℎ, and 𝑊ℎ𝑦; each output 𝑦̂𝑡 gives a per-step loss 𝐿𝑡, and the total loss is the sum ∑ 𝐿𝑖]
65
Backpropagation Through Time (BPTT)
▪ The backpropagation algorithm starts from the last time step and
iteratively computes the gradients of the loss with respect to the
network's parameters by propagating the gradients backward
through time
▪ Once the gradients are computed, they are used to update the
network's parameters using an optimization algorithm

66
BPTT
[Figure: the same unrolled RNN as above, with the loss at each time step back-propagated through 𝑊ℎ𝑦, 𝑊ℎℎ, and 𝑊𝑥ℎ]
67
Backpropagation Through Time (BPTT)
Forward Pass
• At each time step 𝑡, the input 𝑥𝑡 is passed to the network along with the
hidden state ℎ𝑡−1 from the previous time step
• The output 𝑦𝑡 is generated at each step

Unrolling the RNN Across Time
• To apply BPTT, the RNN is unrolled through time for the entire sequence
• Unrolling means expanding the recurrent connections of the network into a
deep feed-forward-like structure

Calculate Loss
• After the forward pass through the sequence, the loss 𝐿 is calculated using
a loss function by comparing the predicted outputs 𝑦𝑡 at each time step
with the actual target values

Backpropagation
• After computing the loss, backpropagation begins from the last time
step 𝑇, and gradients are propagated backward through time
• The error at each time step is back-propagated not only through the
weights of that time step but also through the previous hidden states
68
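A compact TensorFlow sketch of these steps (illustrative; the toy data, layer sizes, and learning rate are arbitrary). The forward pass runs the RNN over the whole sequence, and tape.gradient performs BPTT by propagating the loss backward through every time step:

import tensorflow as tf

x = tf.random.normal((8, 5, 3))    # 8 toy sequences, 5 time steps, 3 features
y = tf.random.normal((8, 1))       # one target per sequence

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16),   # unrolled over the 5 time steps during the forward pass
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.Adam(learning_rate=1e-2)

with tf.GradientTape() as tape:
    pred = model(x, training=True)   # forward pass through the unrolled network
    loss = loss_fn(y, pred)          # calculate the loss
grads = tape.gradient(loss, model.trainable_variables)     # gradients flow back through all time steps
opt.apply_gradients(zip(grads, model.trainable_variables))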
BP Vs. BPTT

▪ The key difference from standard backpropagation is that in BPTT,
gradients flow not just through layers, but through time steps as well
▫ The gradients at time step 𝑡 depend on the error at time step 𝑡, as well as
the errors at all later time steps 𝑡+1, 𝑡+2, …, 𝑇, because the hidden state
at each step affects all future time steps

69
Vanishing and Exploding Gradient
▪ Vanishing Gradient Problem:
▪ The vanishing gradient problem occurs when the gradients computed
during backpropagation diminish as they propagate backward
through the layers of the network
▪ In other words, the gradients become increasingly small as they
move from the output layer towards the initial layers
▪ Consequently, the parameters in the early layers of the network
receive very small updates, and their learning becomes slow or even
stagnates

70
Vanishing and Exploding Gradient
▪ Exploding Gradient Problem:
▪ On the other hand, the exploding gradient problem occurs when the
gradients become extremely large as they propagate backward
through the network
▪ In this case, the gradients can grow exponentially, causing instability
in the training process
▪ When the gradients become too large, the updates to the network's
parameters can be so substantial that the optimization algorithm
overshoots the optimal values, leading to unstable or divergent
behavior

71
LSTM
Capable of handling vanishing and exploding gradient
problems

72
Long Short-Term Memory
Memory Cell
• LSTMs have a memory cell that can
retain information for long periods
• The memory cell allows information
to flow unchanged over many time
steps
Gates
• The flow of information into and out of the memory cell is controlled by
several gates
• Forget Gate: Decides what information to throw away from the memory
• Input Gate: Decides which new information to store in the memory
• Output Gate: Decides what part of the memory to output

73
Why LSTM?
▪ The shortcoming of RNNs is that they cannot remember long-term
dependencies due to the vanishing gradient problem
▪ LSTMs are explicitly designed to avoid long-term dependency
problems
▪ The LSTM network architecture consists of three parts, as shown in
the image below, and each part performs an individual function

▪ The first part chooses whether the information coming from the previous
timestamp is to be remembered or is irrelevant and can be forgotten
▪ In the second part, the cell tries to learn new information from the input
to this cell
▪ In the third part, the cell passes the updated information from the current
timestamp to the next timestamp
74
LSTM
▪ These parts of an LSTM unit are known as gates
▪ They control the flow of information in and out of the memory cell or
LSTM cell
▪ The first gate is called Forget gate, the second gate is known as the
Input gate, and the last one is the Output gate

75
LSTM
▪ Just like a simple RNN, an LSTM also has a hidden state where Ht-1
represents the hidden state of the previous timestamp and Ht is the
hidden state of the current timestamp
▪ In addition to that, LSTM also has a cell state represented by Ct-1 and Ct
for the previous and current timestamps, respectively
▪ Here the hidden state is known as Short term memory, and the cell state
is known as Long term memory. Refer to the following image

76
LSTM Architecture

77
Dissecting the LSTM Architecture

78
Forget Gate

79
Forget Gate
▪ The forget gate determines how much of the previous memory should be
kept
▫ The decision is based on the previous hidden state ℎ𝑡−1 and the
current input 𝑥𝑡

𝑓𝑡 = 𝜎(𝑊𝑓 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑓 )
▫ Where 𝑓𝑡 is the forget gate output, 𝑊𝑓 is the weight matrix, and 𝑏𝑓 is
the bias
▫ The sigmoid function 𝜎 squashes the values to a range between 0 and
1, where 0 means "forget everything" and 1 means "keep everything"

80
Input Gate

81
Input Gate
▪ The input gate decides which new information should be added to the
memory
▪ It has two components:
▫ A sigmoid layer to control which values are updated
▫ A tanh layer to create a new candidate for memory, which represents
the potential new information to store
𝑖𝑡 = 𝜎(𝑊𝑖 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑖 )
𝐶̃𝑡 = tanh(𝑊𝐶 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝐶 )
▫ Where 𝑖𝑡 is the input gate output, and 𝐶̃𝑡 is the candidate memory
content

82
Memory Update
▪ The next step is to update the memory cell 𝐶𝑡. The old memory 𝐶𝑡−1 is
multiplied by the forget gate 𝑓𝑡, and the candidate memory 𝐶̃𝑡 is
multiplied by the input gate 𝑖𝑡, which determines how much new
information to add

𝐶𝑡 = 𝑓𝑡 ∗ 𝐶𝑡−1 + 𝑖𝑡 ∗ 𝐶̃𝑡
▪ This step ensures that only relevant information is retained or updated in
the memory cell

83
Output Gate

84
Output Gate
▪ The output gate controls what part of the memory should be passed to
the next time step's hidden state and as output
▪ It applies a sigmoid function to determine what parts of the cell state are
allowed to affect the output
𝑜𝑡 = 𝜎(𝑊𝑜 ⋅ [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑜 )
▪ The hidden state ℎ𝑡 is then updated using the current memory cell 𝐶𝑡 and
the output gate:
ℎ𝑡 = 𝑜𝑡 ∗ tanh(𝐶𝑡 )
▪ This final hidden state ℎ𝑡 is passed to the next time step, while 𝐶𝑡 is
carried forward as the cell state

85
LSTM in action

86
Summary of LSTM Workflow
▪ The forget gate determines how much of the previous cell state to
retain.
▪ The input gate decides how much new information should be added to
the cell state.
▪ The cell state is updated by combining the old state (after applying the
forget gate) with the new candidate memory (after applying the input
gate).
▪ The output gate determines what part of the updated cell state should
influence the next hidden state (and the final output of the cell).

87
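Putting the gate equations together, a bare-bones NumPy sketch of a single LSTM step (our own illustration; the random weights and sizes are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: how much old memory to keep
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much new information to add
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde    # memory (cell state) update
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)              # new hidden state (short-term memory)
    return h_t, C_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W_f, W_i, W_C, W_o = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(n_hid)

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):    # a toy 6-step sequence
    h, C = lstm_step(x_t, h, C, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)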
Deep unsupervised learning
Exploring Hidden Patterns: Mastering Deep Unsupervised
Learning for Autonomous Knowledge Discovery

88
Consider the following sequences of numbers and observe which one
of them is easier to remember
Sequence 01: 6, 7, 5, 4, 9, 9, 2, 8…
Sequence 02: 2, 4, 6, 8, 10, 12, 14, 16…

89
Your camera captures a waterfall’s movement, the surrounding trees
and rocks, and the changing light and shadows over time. All of this
information is three-dimensional — there are height, width, and
depth components to the scenery you’re capturing.

Now, when you play back the video, you’re watching a two-
dimensional representation of that three-dimensional scene.

90
Deep Supervised and Unsupervised Learning
▪ Unsupervised deep learning refers to a subset of deep learning
techniques where the model learns from data that has no explicit labels or
predefined output
▪ Unlike supervised learning, where models are trained with input-output
pairs, unsupervised deep learning aims to discover patterns, structures,
or representations in the data on its own

91
Key concepts of Deep Unsupervised Learning
Autoencoders: These are neural networks used to learn efficient data representations
(encoding) and to reconstruct the input data (decoding). Autoencoders are often used for
tasks like dimensionality reduction, denoising, and anomaly detection.

Generative Models: These models, such as Generative Adversarial Networks (GANs) and
Variational Autoencoders (VAEs), learn to model the underlying data distribution, allowing
them to generate new samples that resemble the training data.

Clustering: In some deep learning approaches, models attempt to group similar data points
together. For instance, neural networks can be combined with clustering algorithms (e.g., k-
means) to perform tasks like image clustering.

Self-supervised Learning: A related field where models are trained with tasks that are
indirectly supervised. For example, a network might predict missing parts of an input (like the
next word in a sentence or a masked region in an image), without requiring labeled data.

92
Key concepts of Deep Unsupervised Learning
Representation Learning: The goal of unsupervised deep learning is often to learn
meaningful representations of the data, which can then be used for downstream tasks (like
classification) or to understand the data's structure.

93
PCA to Autoencoders
▪ PCA is a linear method for reducing the dimensionality of data by
transforming it into a new coordinate system
▫ The principal components are the directions in which the data varies the most
▫ These components are orthogonal (uncorrelated) and ranked according to the
amount of variance they explain in the dataset

94
Steps in PCA
Standardize the Data:
The data is standardized (zero mean, unit
variance) so that each feature contributes
equally to the analysis

Covariance Matrix Calculation:


The covariance matrix is computed to identify
relationships between the features

Eigenvalue Decomposition:
• The covariance matrix is decomposed into its eigenvalues and eigenvectors
• The eigenvectors define the directions (principal components) of the new
feature space, while the eigenvalues give the magnitude of variance in these
directions.

Projection:
The data is projected onto the top 'k' eigenvectors
(principal components) with the largest eigenvalues,
reducing the dimensionality
95
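A short NumPy sketch of these four steps on toy data (illustrative only; the data and the choice k = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features

# 1. Standardize the data (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix
cov = np.cov(X_std, rowvar=False)
# 3. Eigenvalue decomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]             # rank components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the top k principal components
k = 2
X_reduced = X_std @ eigvecs[:, :k]            # shape (100, 2)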
Autoencoders
▪ Autoencoders are a type of neural network that attempts to reproduce its
input as closely as possible at its output
▫ It aims to take an input, transform it into a reduced representation called code
or embedding
▫ Then, this code or embedding is transformed back into the original input
▫ The code is also called the latent-space representation

Formally, we can say, an autoencoder describes a nonlinear relationship of an
input to an output through an intermediate representation called code or
embedding
96
Autoencoders – Key points
• Non-Linear: Autoencoders can model complex, non-linear relationships
between features using non-linear activation functions.
• Data-specific compression: Autoencoders compress data that is similar to
what they were trained on. For example, an autoencoder trained on dog photos
cannot easily compress photos of human faces.
• Unsupervised: Training an autoencoder is easy as we don’t need labelled data.
It is easily trained on any kind of input data.
• Lossy in nature: There is always going to be some difference between the input
and output of the autoencoder. The output will always have some missing
information in it.
• Reconstruction: The aim is to reconstruct the input from a reduced
representation, and the model can be trained using backpropagation.

97
Architecture of Autoencoders

98
Architecture of Autoencoders
▪ The encoder compresses the given input into a fixed-dimension code or
embedding, and the decoder transforms that code or embedding back into
the original input
▪ The decoder architecture is the mirror image of an encoder
▪ While building an autoencoder, it should be constrained to prioritize which
information should be kept and which should be discarded
▪ This constraint is introduced in the following ways:
▫ Reducing the number of units or nodes in the layers
▫ Adding some noise to the input images
▫ Adding some regularization

▪ The hyperparameters that will affect the performance of your autoencoder
are: i) Number of Layers, ii) Number of Nodes in the Code Layer, and
iii) Loss
99
Breakdown of AE steps
▪ Step 1: Input Layer
▫ Let the input to the Autoencoder be denoted as:
𝑋 = (𝑥1 , 𝑥2 , … , 𝑥𝑛​) ∈ ℝ𝑛

▪ Step 2: Encoder
▫ The encoder compresses the input data into a lower-dimensional latent space
(also called the bottleneck or code)
▫ This can be represented as:
z = 𝑓𝜃𝑒 (x) = σ(We ​𝑋 + be )
▫ 𝑧 ∈ ℝ𝑚 is the latent code (compressed representation), with 𝑚≪𝑛
▫ 𝑓𝜃𝑒 is the encoding function with parameters 𝜃𝑒 = 𝑊𝑒 , 𝑏𝑒 , where 𝑊𝑒 is the
weight matrix and 𝑏𝑒 is the bias vector

100
Breakdown of AE steps
▪ Step 3: Latent Space
▫ The latent space 𝑧 is the lower-dimensional representation of the input
▫ The network aims to encode essential information in 𝑧 while discarding noise
or redundant information

▪ Step 4: Decoder
▫ The decoder reconstructs the original input from the latent representation and
is given by:
𝑋̂ = 𝑔𝜃𝑑 (𝑧) = 𝜎(𝑊𝑑 𝑧 + 𝑏𝑑 )
▫ 𝑋̂ ∈ ℝ𝑛 is the reconstruction of the original input 𝑋
▫ 𝑔𝜃𝑑 is the decoding function with parameters 𝜃𝑑 = 𝑊𝑑 , 𝑏𝑑 , where 𝑊𝑑 is the
weight matrix and 𝑏𝑑 is the bias vector

101
Breakdown of AE steps
▪ Step 5: Loss Function
▫ The loss function measures how well the Autoencoder reconstructs the
original input
▫ A common choice for the loss function is Mean Squared Error (MSE) between
the input 𝑋 and the reconstructed output 𝑋̂:

𝐿(𝑋, 𝑋̂) = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̂𝑖)²
▫ This loss function quantifies the reconstruction error, i.e., how much
information is lost during the encoding and decoding process

102
Training
▪ Forward Pass:
▫ The input X is passed through the encoder to compute the latent code 𝑧
▫ The latent code 𝑧 is passed through the decoder to generate the
reconstruction 𝑋̂
▫ Compute the loss (reconstruction error) using the loss function
▪ Backpropagation:
▫ Gradients of the loss function with respect to the weights and biases of both
the encoder and decoder are computed using backpropagation
▫ The weights and biases are updated to minimize the reconstruction error
▫ Encoder gradient update: 𝑊𝑒 ← 𝑊𝑒 − 𝜂 ∂𝐿/∂𝑊𝑒
▫ Decoder gradient update: 𝑊𝑑 ← 𝑊𝑑 − 𝜂 ∂𝐿/∂𝑊𝑑

103
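A minimal Keras sketch of such an autoencoder (illustrative only; MNIST is used as a stand-in dataset, and the sizes n = 784, m = 32 and the 128-unit hidden layer are arbitrary choices):

import tensorflow as tf
from tensorflow.keras import layers, models

n, m = 784, 32      # input dimension n and latent (code) dimension m, with m << n

encoder = models.Sequential([
    tf.keras.Input(shape=(n,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(m, activation="relu"),          # z = f(x): the code / latent representation
])
decoder = models.Sequential([
    tf.keras.Input(shape=(m,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(n, activation="sigmoid"),       # x_hat = g(z): the reconstruction
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")   # MSE reconstruction loss

# Unsupervised training: the input is also the target, so no labels are needed
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, n).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, batch_size=256, epochs=5)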
Summary of Steps

Input Data 𝑋 → Encoder → Latent Space 𝑍 → Decoder → Reconstructed Input 𝑋̂

104
Thank you!

105
