UNIT 2 Notes
The chain rule that underlies the back-propagation algorithm was invented in the
seventeenth century (Leibniz, 1676; L'Hôpital, 1696)
Beginning in the 1940s, function approximation techniques were used to
motivate machine learning models such as the perceptron
The earliest models were based on linear models. Critics including Marvin
Minsky pointed out several flaws of the linear model family, such as its
inability to learn the XOR function, which led to a backlash against the entire
neural network approach
Efficient applications of the chain rule based on dynamic programming began to
appear in the 1960s and 1970s
Werbos (1981) proposed applying chain rule techniques for training artificial
neural networks. The idea was finally developed in practice after being
independently rediscovered in different ways (LeCun, 1985; Parker, 1985;
Rumelhart et al., 1986a)
Following the success of back-propagation, neural network research gained
popularity and reached a peak in the early 1990s. Afterwards, other machine
learning techniques became more popular until the modern deep learning
renaissance that began in 2006
The core ideas behind modern feedforward networks have not changed
substantially since the 1980s. The same back-propagation algorithm and the
same approaches to gradient descent are still in use
A Probabilistic Theory of Deep Learning
Single-layer networks can only solve linearly separable problems; they cannot solve
linearly inseparable problems
Single-layer networks cannot solve complex problems
Single-layer networks cannot be used when a large input-output data set is available
Single-layer networks cannot capture the complex information available in the
training pairs
Hence, to overcome the above limitations, we use multi-layer networks.
Multi-Layer Networks
Any neural network which has at least one layer in between the input and output layers is
called a multi-layer network
Layers present in between the input and output layers are called hidden layers
Input layer neural units just collect the inputs and forward them to the next higher
layer
Hidden layer and output layer neural units process the information fed to them and
produce an appropriate output
Multi-layer networks provide optimal solutions for arbitrary classification problems
Multi-layer networks implement linear discriminants, but in a space where the inputs have
been mapped non-linearly
Back Propagation Networks (BPN)
Introduced by Rumelhart, Hinton, & Williams in 1986.
BPN Algorithm
The BPN algorithm is classified into four major steps as follows:
1. Initialization of Bias, Weights
2. Feedforward process
3. Back Propagation of Errors
4. Updating of weights & biases
Algorithm:
I. Initialization of weights:
Step 1: Initialize the weights to small random values near zero
Step 2: While the stopping condition is false, do steps 3 to 10
Step 3: For each training pair do steps 4 to 9
II. Feedforward of inputs
Step 4: Each input xi is received and forwarded to the higher (next hidden) layer
Step 5: Each hidden unit sums its weighted inputs: z_inj = w0j + Σi xi wij
Applying the activation function: zj = f(z_inj)
This value is passed to the output layer
Step 6: Each output unit sums its weighted inputs: y_ink = v0k + Σj zj vjk
Applying the activation function:
yk = f(y_ink)
w0j(new) = w0j(old) + Δw0j    v0k(new) = v0k(old) + Δv0k
Step 10: Test for Stop Condition
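As a concrete illustration of the four phases (initialization, feedforward, back-propagation of errors, and updating of weights and biases), the following is a minimal NumPy sketch for a single hidden layer with sigmoid activation, trained on the XOR problem; the learning rate, layer sizes, and variable names are illustrative assumptions rather than values prescribed by these notes.

import numpy as np

# Minimal sketch of a single-hidden-layer back-propagation network (BPN).
# Notation follows the notes: w = input-to-hidden weights, v = hidden-to-output weights.
rng = np.random.default_rng(0)

def f(x):                        # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(fx):                 # derivative written in terms of the activation value
    return fx * (1.0 - fx)

# Illustrative XOR training pairs (inputs X, targets T).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
n_in, n_hidden, n_out, lr = 2, 4, 1, 0.5

# Step 1: initialize weights and biases to small random values near zero.
w, w0 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
v, v0 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

for epoch in range(10000):                  # Steps 2-3: repeat until the stop condition
    # Steps 4-6: feedforward
    z = f(w0 + X @ w)                       # hidden activations zj = f(z_inj)
    y = f(v0 + z @ v)                       # output activations yk = f(y_ink)
    # Back-propagation of errors
    delta_k = (T - y) * f_prime(y)          # output error term
    delta_j = (delta_k @ v.T) * f_prime(z)  # hidden error term
    # Updating of weights and biases
    v += lr * z.T @ delta_k
    v0 += lr * delta_k.sum(axis=0)
    w += lr * X.T @ delta_j
    w0 += lr * delta_j.sum(axis=0)

print(np.round(y, 2))   # outputs should move toward the XOR targets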
Merits
• Has a smooth effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than the perceptron model
L1 Regularization
In the context of deep learning, most regularization strategies are based on regularizing
estimators.
Many regularization approaches limit the capacity of models, such as neural networks,
linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the
objective function J. We denote the regularized objective function by J̃:
J˜(θ; X, y) = J(θ; X, y) + αΩ(θ)
where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty
term, Ω, relative to the standard objective function J. Setting α to 0 results in no regularization.
Larger values of α correspond to more regularization.
The parameter norm penalty Ω typically penalizes only the weights of the affine transformation at
each layer and leaves the biases unregularized
L2 Regularization
One of the simplest and most common kinds of parameter norm penalty is the L2 parameter
norm penalty, commonly called weight decay. This regularization strategy drives the weights
closer to the origin by adding a regularization term Ω(θ) = (1/2)||w||₂² to the objective function. L2
regularization is also known as ridge regression or Tikhonov regularization. To
simplify, we assume no bias parameter, so θ is just w. Such a model has the following
total objective function:
J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw
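A minimal NumPy sketch of this weight-decay objective, assuming an illustrative linear model with squared-error loss for J; the function name, data, and α values are assumptions for the demo, not part of the notes.

import numpy as np

# Sketch: L2-regularized (weight-decay) objective  J~(w) = J(w) + (alpha/2) * w.T @ w
def objective(w, X, y, alpha):
    J = 0.5 * np.mean((X @ w - y) ** 2)   # unregularized objective J(w; X, y)
    omega = 0.5 * w @ w                   # Omega(theta) = 1/2 ||w||_2^2 (biases excluded)
    return J + alpha * omega              # regularized objective J~(w; X, y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(100, 5)), rng.normal(size=100), rng.normal(size=5)
print(objective(w, X, y, alpha=0.0))   # alpha = 0: no regularization
print(objective(w, X, y, alpha=0.1))   # larger alpha: more regularization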
Difference between L1 & L2 Parameter Regularization
Batch Normalization:
It is a method of adaptive reparameterization, motivated by the difficulty of training
very deep models. In deep networks, the weights are updated layer by layer, so the
output of a layer is no longer on the same scale as its input (even though the input is
normalized). Normalization is a data pre-processing tool used to bring numerical
data to a common scale without distorting its shape. When we feed data to a machine
learning or deep learning algorithm, we rescale the values to a balanced scale so that
the model can generalize appropriately. (Normalization is used to bring the input into a
balanced scale/range.)
Procedure to do Batch Normalization:
(1) Consider the batch input from layer h, for this layer we need to calculate the mean
of this hidden activation.
(2) After calculating the mean the next step is to calculate the standard deviation of the
hidden activations.
(3) Now we normalize the hidden activations using these mean and standard deviation
values. To do this, we subtract the mean from each input and divide the result by the
standard deviation plus the smoothing term (ε).
(4) As the final stage, re-scaling and offsetting of the input is performed. Here two
components of the BN algorithm are used, γ (gamma) and β (beta). These parameters are
used for re-scaling (γ) and shifting (β) the vector containing the values from the previous
operations.
These two parameters are learnable. Hence, during the training of the neural
network, the optimal values of γ and β are obtained and used, and we get accurate
normalization of each batch
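A minimal NumPy sketch of the four steps above, following the formulation in these notes (dividing by the standard deviation plus ε); the batch shape and the γ and β values are illustrative assumptions.

import numpy as np

# Sketch of batch normalization for one layer's activations.
# h has shape (batch_size, num_features).
def batch_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                 # (1) mean of the hidden activations
    sigma = h.std(axis=0)               # (2) standard deviation of the activations
    h_hat = (h - mu) / (sigma + eps)    # (3) normalize with the smoothing term eps
    return gamma * h_hat + beta         # (4) re-scale (gamma) and shift (beta)

h = np.random.randn(32, 4) * 10 + 3     # illustrative batch of hidden activations
out = batch_norm(h, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature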
Shallow Networks
Shallow neural networks, which consist of only 1 or 2 hidden layers, give us the basic idea
behind deep neural networks. Understanding a shallow neural network gives us insight
into what exactly is going on inside a deep neural network. A neural
network is built using various hidden layers. Now that we know the computations that
occur in a particular layer, let us understand how the whole neural network computes
the output for a given input X. These can also be called the forward-propagation
equations.
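As an illustration, here is a NumPy sketch of the forward-propagation equations for a one-hidden-layer network; the tanh and sigmoid activations and the layer sizes are illustrative assumptions.

import numpy as np

# Sketch of forward propagation through a shallow (one-hidden-layer) network.
def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1             # linear step of the hidden layer
    A1 = np.tanh(Z1)             # hidden activations
    Z2 = A1 @ W2 + b2            # linear step of the output layer
    A2 = 1 / (1 + np.exp(-Z2))   # sigmoid output, e.g. for binary classification
    return A2

X = np.random.randn(8, 3)                        # batch of 8 inputs with 3 features
W1, b1 = np.random.randn(3, 5), np.zeros(5)      # input -> hidden
W2, b2 = np.random.randn(5, 1), np.zeros(1)      # hidden -> output
print(forward(X, W1, b1, W2, b2).shape)          # (8, 1)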
Difference Between a Shallow Net & Deep Learning Net:
Convolution Networks:
Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some
other object. The first thing you do is feed the pixels of the image in the form of arrays to the
input layer of the neural network (multi-layer networks used to classify things). The hidden
layers carry out feature extraction by performing different calculations and manipulations.
There are multiple hidden layers like the convolution layer, the ReLU layer, and pooling layer,
that perform feature extraction from the image. Finally, there’s a fully connected layer that
identifies the object in the image.
What is Convolutional Neural Network?
A convolutional neural network is a feed-forward neural network that is generally used to analyze visual
images by processing data with grid-like topology. It’s also known as a ConvNet. A convolutional
neural network is used to detect and classify objects in an image.
Consider, for example, a neural network that identifies two types of flowers: orchid and rose.
A convolution neural network has multiple hidden layers that help in extracting information
from an image. The important layers in a CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Flattening
5. Output layer
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A convolution
layer has several filters that perform the convolution operation. Every image is considered as a
matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also a filter
matrix with a dimension of 3x3. Slide the filter matrix over the image and compute the dot
product to get the convolved feature matrix.
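A minimal NumPy sketch of this sliding dot product ("valid" convolution, stride 1); the pixel and filter values below are illustrative assumptions.

import numpy as np

# Sketch: slide a 3x3 filter over a 5x5 binary image and take the dot product at each position.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
convolved = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        convolved[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
print(convolved)    # 3x3 convolved feature matrix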
ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is
to move them to a ReLU layer. ReLU performs an element-wise operation and sets all the
negative pixels to 0. It introduces non-linearity to the network, and the generated output is
a rectified feature map. The ReLU function is f(x) = max(0, x).
The original image is scanned with multiple convolutions and ReLU layers for locating the
features.
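A minimal NumPy sketch of this element-wise operation on an illustrative feature map:

import numpy as np

# Sketch: ReLU sets all negative values of a convolved feature map to 0.
feature_map = np.array([[ 2.0, -1.5,  0.3],
                        [-0.7,  4.0, -2.2],
                        [ 1.1, -0.4,  0.0]])
rectified = np.maximum(0, feature_map)   # rectified feature map
print(rectified)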
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.
The pooling layer uses various filters to identify different parts of the image like edges, corners,
body, feathers, eyes, and beak.
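A minimal NumPy sketch of 2x2 max pooling (stride 2) on an illustrative rectified feature map:

import numpy as np

# Sketch: 2x2 max pooling with stride 2 down-samples a rectified feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))   # max over each 2x2 block
print(pooled)   # [[6. 2.]
                #  [2. 7.]]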
The next step in the process is called flattening. Flattening is used to convert all the resultant
2-Dimensional arrays from pooled feature maps into a single long continuous linear vector.
The flattened matrix is fed as input to the fully connected layer to classify the image.
The pixels from the image are fed to the convolutional layer that performs the convolution
operation
The convolved map is applied to a ReLU function to generate a rectified feature map
The image is processed with multiple convolutions and ReLU layers for locating the
features
Different pooling layers with various filters are used to identify specific parts of the image
The pooled feature map is flattened and fed to a fully connected layer to get the final output
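Putting the steps above together, the following is a minimal PyTorch sketch of such a pipeline (convolution → ReLU → pooling → flattening → fully connected output); the 1-channel 28x28 input size, channel counts, and two-class output (e.g. orchid vs. rose) are illustrative assumptions.

import torch
import torch.nn as nn

# Minimal sketch of the CNN pipeline described above.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3),  # convolution layer
    nn.ReLU(),                                                 # ReLU layer
    nn.MaxPool2d(kernel_size=2),                               # pooling layer
    nn.Flatten(),                                              # flattening
    nn.Linear(8 * 13 * 13, 2),                                 # fully connected output layer
)

x = torch.randn(4, 1, 28, 28)           # batch of 4 illustrative images
logits = model(x)                        # raw class scores
probs = torch.softmax(logits, dim=1)     # probabilities over the two classes
print(probs.shape)                       # torch.Size([4, 2])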
Activation Layer
The activation layer introduces nonlinearity into the network by applying an activation function
to the output of the previous layer. This is crucial for the network to learn complex patterns.
Common activation functions, such as ReLU, Tanh, and Leaky ReLU, transform the input
while keeping the output size unchanged.
Flattening
After the convolution and pooling operations, the feature maps still exist in a multi-dimensional
format. Flattening converts these feature maps into a one-dimensional vector. This process is
essential because it prepares the data to be passed into fully connected layers for classification
or regression tasks.
Output Layer
In the output layer, the final result from the fully connected layers is processed through a
logistic function, such as sigmoid or softmax. These functions convert the raw scores into
probability distributions, enabling the model to predict the most likely class label.
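As a small illustration, here is a NumPy sketch of softmax converting raw scores from the fully connected layer into a probability distribution; the score values are illustrative assumptions.

import numpy as np

# Sketch: softmax turns raw class scores into probabilities that sum to 1.
def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.argmax())   # index 0 is the predicted class label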
New data that resembles real data can be generated with the help of training data, and this is
possible with the technology named GAN or Generative Adversarial Networks.
Generative adversarial networks (GANs) are among the most popular and recent unsupervised
machine learning innovations developed by Ian J. Goodfellow in 2014.
GAN is a class of machine learning framework having two neural networks that
can analyze, capture, and copy the variations within a dataset.
Both neural networks work against one another in GAN machine learning, hence called
adversarial networks.
It is most often used in various ML applications, such as image generation, video generation,
and speech generation.
A Generative Adversarial Network or GAN is defined as the technique of generative modeling used
to generate new data sets based on training data sets. The newly generated data set appears similar
to the training data sets.
o Generative: It is used to learn a generative model that visually explains how data is
generated.
o Adversarial: As both neural networks compete with each other or are adversarial to
one another, hence training of the model is done in an adversarial manner.
o Networks: It uses deep neural networks to train models, hence called networks.
17
The discriminator is also a neural network with hidden layers, activation functions, and a loss function.
Further, the generator primarily focuses on generating fake data based on the feedback
given by the discriminator and tries to fool the discriminator so that it cannot tell
the difference between real output and the output generated by the generator.
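To make the adversarial setup concrete, here is a minimal PyTorch sketch of one GAN training loop on a toy one-dimensional "real" data distribution; the network sizes, learning rates, and data are illustrative assumptions, not a specific published GAN.

import torch
import torch.nn as nn

# Sketch: the discriminator D learns to separate real from generated samples,
# while the generator G learns to fool D.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # sample -> P(real)

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 1) * 0.5 + 3.0     # illustrative "real" data
    fake = G(torch.randn(32, 8))              # generated data

    # Train the discriminator: label real as 1, fake as 0.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make D label fake samples as real.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()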
Types of GANs
DCGAN: DCGAN or Deep Convolutional GAN is one of the most famous implementations
of GAN. It makes use of ConvNets instead of multi-layer perceptrons. The ConvNets use
convolutional strides and are built without max pooling. Further, layers in the ConvNets are not
fully connected.
Conditional and Unconditional GAN: It is defined as a deep learning neural network having
extra parameters. In a conditional GAN, labels are supplied in such a way that the input to the
discriminator can easily be classified.
Least Square GAN: It is a particular type of generative adversarial network that uses the least-
square loss function for the discriminator. Further, whenever the objective function of least
square GAN is minimized, Pearson divergence also gets minimized automatically.
Auxiliary Classifier GAN: ACGAN or Auxiliary Classifier GAN is a similar but improved
version of CGAN. Its discriminator not only classifies an image as real or fake but also gives
information about the source of the input image.
Dual Video Discriminator GAN: It is the most helpful type of GAN for video generation built
upon the BigGAN architecture. Further, it uses a spatial and temporal discriminator for
generating videos.
SRGAN: Super Resolution or SRGAN is also known as domain transformation, primarily used
to transform low-resolution images to high resolution.
Cycle GAN: It is used to perform image translation. For example, a model trained on a horse image
dataset can translate the images into zebra images.
Info GAN: It is an advanced version of generative adversarial networks used for
unsupervised machine learning.