UNIT 2 Notes

Deep learning

Uploaded by

Anami
UNIT 2 Notes

History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and
regularization, batch normalization - VC Dimension and Neural Nets - Deep Vs Shallow
Networks - Convolutional Networks - Generative Adversarial Networks (GAN), Semi-supervised
Learning

History of Deep Learning [DL]:

• The chain rule that underlies the back-propagation algorithm was invented in the
seventeenth century (Leibniz, 1676; L'Hôpital, 1696)
• Beginning in the 1940s, function approximation techniques were used to
motivate machine learning models such as the perceptron
• The earliest models were based on linear models. Critics including Marvin
Minsky pointed out several of the flaws of the linear model family, such as its
inability to learn the XOR function, which led to a backlash against the entire
neural network approach
• Efficient applications of the chain rule based on dynamic programming began to
appear in the 1960s and 1970s
• Werbos (1981) proposed applying chain rule techniques for training artificial
neural networks. The idea was finally developed in practice after being
independently rediscovered in different ways (LeCun, 1985; Parker, 1985;
Rumelhart et al., 1986a)
• Following the success of back-propagation, neural network research gained
popularity and reached a peak in the early 1990s. Afterwards, other machine
learning techniques became more popular until the modern deep learning
renaissance that began in 2006
• The core ideas behind modern feedforward networks have not changed
substantially since the 1980s. The same back-propagation algorithm and the
same approaches to gradient descent are still in use

A Probabilistic Theory of Deep Learning

• Probability is the science of quantifying uncertain things.
• Most machine learning and deep learning systems use a lot of data to learn
about patterns in the data.
• Whenever a system relies on data rather than on pure logic, uncertainty arises,
and whenever uncertainty arises, probability becomes relevant.
• By introducing probability to a deep learning system, we introduce common sense
to the system.
• In deep learning, several models such as Bayesian models, probabilistic graphical
models, and Hidden Markov models are used.
• These models depend entirely on probability concepts.
Real-world data is chaotic. Since deep learning systems use real-world data, they require a
tool to handle this chaos.
Back Propagation Networks (BPN)
Need for Multilayer Networks

• Single-layer networks can solve only linearly separable problems; they cannot
solve linearly inseparable ones
• Single-layer networks cannot solve complex problems
• Single-layer networks are not suitable when a large input-output data set is available
• Single-layer networks cannot capture the complex information available in the
training pairs
Hence, to overcome the limitations above, we use multi-layer networks.
Multi-Layer Networks

• Any neural network that has at least one layer between the input and output layers is
called a multi-layer network
• Layers between the input and output layers are called hidden layers
• Input layer neural units just collect the inputs and forward them to the next higher
layer
• Hidden layer and output layer neural units process the information fed to them and
produce an appropriate output
• Multi-layer networks can, in principle, provide the optimal solution to an arbitrary
classification problem
• Multi-layer networks use linear discriminants in a space where the inputs have been
mapped nonlinearly by the hidden layers
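
To illustrate why a hidden layer helps, the sketch below wires a tiny two-layer network by hand so that it computes XOR, a function that no single-layer network can represent. The step activation, the function names, and the particular weights and thresholds are assumptions chosen for this example; they do not come from the notes.

```python
def step(s):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if s > 0 else 0


def xor_net(x1, x2):
    # Hidden layer (hand-picked weights): one unit acts like OR, the other like AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output unit: "OR but not AND", a linear discriminant in the hidden space.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

The output unit is still a linear discriminant; it works only because the hidden units have remapped the four input points into a space where they are linearly separable.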

Back Propagation Networks (BPN)
Introduced by Rumelhart, Hinton, & Williams in 1986.

• BPN is a multilayer feedforward network in which the error is propagated backwards,
hence the name Back Propagation Network (BPN).
• It uses a supervised training process with a systematic procedure for training
the network, and it is used in error detection and correction. The Generalized Delta
Law (also called the Continuous Perceptron Law or Gradient Descent Law) is used in
this network.
• The generalized delta rule minimizes the mean squared error between the output
calculated by the network and the target output.
• The delta law converges faster than the Perceptron Law, of which it is an extended
version. Its limitation is the local minima problem.
• Because of local minima the convergence speed reduces, but it is still better than the
perceptron's. Figure 1 represents a BPN network architecture.
• Even though multilayer perceptrons can be used, they are less flexible and efficient
than BPN.
• In Figure 1, the weights between the input layer and the hidden layer are denoted Wij,
and the weights between the first hidden layer and the next layer are denoted Vjk.
• This network is valid only for differentiable output (activation) functions. The training
process used in backpropagation involves three stages, listed below:
1. Feedforward of the input training pair
2. Calculation and backpropagation of the associated error
3. Adjustment of the weights

BPN Algorithm

The algorithm for BPN is classified into four major steps as follows:
1. Initialization of Bias, Weights
2. Feedforward process
3. Back Propagation of Errors
4. Updating of weights & biases
Algorithm:
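I. Initialization of weights
Step 1: Initialize the weights and biases to small random values near zero
Step 2: While the stop condition is false, do steps 3 to 10
Step 3: For each training pair, do steps 4 to 9
II. Feedforward of inputs
Step 4: Each input $x_i$ is received and forwarded to the next higher (hidden) layer
Step 5: Each hidden unit sums its weighted inputs: $Z_{in,j} = W_{0j} + \sum_i x_i W_{ij}$
Applying the activation function: $Z_j = f(Z_{in,j})$
This value is passed to the output layer
Step 6: Each output unit sums its weighted inputs: $y_{in,k} = V_{0k} + \sum_j Z_j V_{jk}$
Applying the activation function: $Y_k = f(y_{in,k})$

III. Backpropagation of errors
Step 7: Each output unit compares its output with the target $t_k$: $\delta_k = (t_k - Y_k) f'(y_{in,k})$
The weight correction is $\Delta V_{jk} = \alpha \delta_k Z_j$ and the bias correction is $\Delta V_{0k} = \alpha \delta_k$
Step 8: Each hidden unit sums the error terms from the output layer: $\delta_{in,j} = \sum_k \delta_k V_{jk}$
Its error term is $\delta_j = \delta_{in,j} f'(Z_{in,j})$
The weight correction is $\Delta W_{ij} = \alpha \delta_j x_i$ and the bias correction is $\Delta W_{0j} = \alpha \delta_j$

IV. Updating of weights and biases
Step 9: The new weights are
$W_{ij}(new) = W_{ij}(old) + \Delta W_{ij}$, $V_{jk}(new) = V_{jk}(old) + \Delta V_{jk}$
The new biases are
$W_{0j}(new) = W_{0j}(old) + \Delta W_{0j}$, $V_{0k}(new) = V_{0k}(old) + \Delta V_{0k}$
Step 10: Test for the stop condition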
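
The ten steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the class name `BPN`, the sigmoid activation (for which f'(s) = f(s)(1 - f(s))), the initial weight range, the learning rate α = 0.5, and the fixed number of epochs used as the stop condition are all assumptions made for the example.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class BPN:
    """One-hidden-layer network trained with the generalized delta rule."""

    def __init__(self, n_in, n_hidden, n_out, alpha=0.5, seed=0):
        rng = random.Random(seed)
        self.n_in, self.n_hidden, self.n_out = n_in, n_hidden, n_out
        self.alpha = alpha
        # Step 1: small random weights; row 0 of each matrix holds the biases.
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                  for _ in range(n_in + 1)]
        self.V = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
                  for _ in range(n_hidden + 1)]

    def forward(self, x):
        # Steps 4-6: feed the inputs forward through both layers.
        z = []
        for j in range(self.n_hidden):
            z_in = self.W[0][j] + sum(x[i] * self.W[i + 1][j] for i in range(self.n_in))
            z.append(sigmoid(z_in))
        y = []
        for k in range(self.n_out):
            y_in = self.V[0][k] + sum(z[j] * self.V[j + 1][k] for j in range(self.n_hidden))
            y.append(sigmoid(y_in))
        return z, y

    def train_step(self, x, t):
        z, y = self.forward(x)
        # Step 7: output error terms, delta_k = (t_k - Y_k) f'(y_in,k).
        delta_k = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(self.n_out)]
        # Step 8: hidden error terms, computed with the old V weights.
        delta_j = []
        for j in range(self.n_hidden):
            delta_in = sum(delta_k[k] * self.V[j + 1][k] for k in range(self.n_out))
            delta_j.append(delta_in * z[j] * (1 - z[j]))
        # Step 9: apply the weight and bias corrections.
        for k in range(self.n_out):
            self.V[0][k] += self.alpha * delta_k[k]
            for j in range(self.n_hidden):
                self.V[j + 1][k] += self.alpha * delta_k[k] * z[j]
        for j in range(self.n_hidden):
            self.W[0][j] += self.alpha * delta_j[j]
            for i in range(self.n_in):
                self.W[i + 1][j] += self.alpha * delta_j[j] * x[i]

    def error(self, x, t):
        _, y = self.forward(x)
        return 0.5 * sum((t[k] - y[k]) ** 2 for k in range(self.n_out))


xor_pairs = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
net = BPN(n_in=2, n_hidden=2, n_out=1)
for epoch in range(2000):            # Steps 2 and 10: stop after a fixed number of epochs
    for x, t in xor_pairs:           # Step 3: loop over the training pairs
        net.train_step(x, t)
print("total squared error:", sum(net.error(x, t) for x, t in xor_pairs))
```

Whether and how quickly the error falls on XOR depends on the initial weights and on α, which reflects the local minima and training time demerits of BPN.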

Merits
• Has a smoothing effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than the perceptron model
• Has a systematic weight updating procedure


Demerits
• Learning phase requires intensive calculations
• Selection of number of Hidden layer neurons is an issue
• Selection of number of Hidden layers is also an issue
• Network gets trapped in Local Minima
• Temporal Instability
• Network Paralysis
• Training time is more for Complex problems
Regularization
A fundamental problem in machine learning is how to make an algorithm that will perform
well not just on the training data, but also on new inputs. Many strategies used in machine
learning are explicitly designed to reduce the test error, possibly at the expense of increased
training error. These strategies are known collectively as regularization.
Definition: - “any modification we make to a learning algorithm that is intended to reduce its
generalization error but not its training error.”

In the context of deep learning, most regularization strategies are based on regularizing
estimators. Regularization of an estimator works by trading increased bias for reduced variance.

An effective regularizer is one that makes a profitable trade, reducing variance significantly
while not overly increasing the bias.

• Many regularization approaches are based on limiting the capacity of models, such as
neural networks, linear regression, or logistic regression, by adding a parameter norm
penalty Ω(θ) to the objective function J. We denote the regularized objective function
by $\tilde{J}$:
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty
term Ω relative to the standard objective function J. Setting α to 0 results in no regularization.
Larger values of α correspond to more regularization.

For neural networks, we typically use a parameter norm penalty Ω that penalizes only the
weights of the affine transformation at each layer and leaves the biases unregularized.
L2 Regularization

One of the simplest and most common kinds of parameter norm penalty is the L2 parameter
norm penalty, commonly known as weight decay. This regularization strategy drives the
weights closer to the origin by adding a regularization term
$\Omega(\theta) = \frac{1}{2}\lVert w \rVert_2^2$ to the objective function. L2
regularization is also known as ridge regression or Tikhonov regularization. To
simplify, we assume no bias parameter, so θ is just w. Such a model has the following
total objective function:
$\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$
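
As a small illustration of why this penalty is called weight decay, the sketch below takes one gradient step on the regularized objective. The gradient of the penalty term (α/2)·wᵀw is α·w, so every step shrinks the weights toward the origin in addition to following the data gradient. The function names, the learning rate `lr`, and the example numbers are assumptions introduced for this example.

```python
def l2_penalty(w):
    """Omega(w) = 0.5 * ||w||^2, the L2 parameter norm penalty."""
    return 0.5 * sum(wi * wi for wi in w)


def l2_regularized_step(w, grad_J, alpha, lr):
    """One gradient step on J~(w) = J(w) + alpha * Omega(w)."""
    # The penalty contributes alpha * w_i to each gradient component.
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad_J)]


w = [2.0, -4.0]
print("penalty:", l2_penalty(w))
# With a zero data gradient, the step only decays the weights.
print("after one step:", l2_regularized_step(w, grad_J=[0.0, 0.0], alpha=0.1, lr=0.5))
```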
Difference between L1 & L2 Parameter Regularization

• L1 regularization attempts to estimate the median of the data, whereas L2 regularization
estimates the mean of the data, in order to avoid overfitting.
• L1 regularization adds the sum of the absolute values of the weights as the penalty term
in the cost function, whereas L2 regularization adds the sum of the squared values of the
weights.
• L1 regularization can help with feature selection by eliminating unimportant
features (it drives their weights to exactly zero), whereas L2 regularization is not
recommended for feature selection.
• L1 does not have a closed-form solution, because the absolute value is not
differentiable at zero, while L2 has a closed-form solution because its penalty is the
square of the weights.
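
The feature selection point can be seen in a small sketch. Soft-thresholding is the standard proximal update for an L1 penalty: it moves every weight toward zero by a fixed amount and sets small weights to exactly zero. An L2 penalty instead scales every weight by the same factor, so small weights shrink but do not vanish. The function names and the example values below are illustrative assumptions.

```python
def l1_shrink(w, threshold):
    """Soft-thresholding: pull each weight toward zero by `threshold`, stopping at zero."""
    out = []
    for wi in w:
        if wi > threshold:
            out.append(wi - threshold)
        elif wi < -threshold:
            out.append(wi + threshold)
        else:
            out.append(0.0)   # small weights are removed entirely
    return out


def l2_shrink(w, factor):
    """Weight decay: scale every weight by the same factor in (0, 1)."""
    return [factor * wi for wi in w]


w = [3.0, -0.2, 0.05, -1.5]
print("L1:", l1_shrink(w, 0.25))
print("L2:", l2_shrink(w, 0.5))
```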

Batch Normalization:
It is a method of adaptive reparameterization, motivated by the difficulty of training
very deep models. In deep networks, the weights of every layer are updated during
training, so the output of a layer will no longer be on the same scale as its input (even
though the input is normalized).
Normalization is a data pre-processing tool used to bring numerical data to a common
scale without distorting its shape. When we input data to a machine learning or deep
learning algorithm, we tend to change the values to a balanced scale (range) so that
the model can generalize appropriately.
Procedure to do Batch Normalization:
(1) Consider the batch input from layer h. For this layer, we first calculate the mean
of the hidden activations.
(2) After calculating the mean, the next step is to calculate the standard deviation of the
hidden activations.
(3) Now we normalize the hidden activations using these mean and standard deviation
values. To do this, we subtract the mean from each input and divide the result
by the sum of the standard deviation and the smoothing term (ε), which guards against
division by zero.
(4) As the final stage, the re-scaling and offsetting of the normalized values is performed.
Here two components of the BN algorithm are used, γ (gamma) and β (beta). These
parameters re-scale (γ) and shift (β) the vector containing the values from the previous
operations.
These two parameters are learnable. During the training of the neural network, the
optimal values of γ and β are obtained and used, which gives an appropriate
normalization of each batch.
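
The four steps can be sketched for the activations of a single hidden unit across one batch. The function name, the default values, and the example batch are assumptions. The sketch also places ε inside the square root, added to the variance, which is a common convention and differs slightly from adding ε to the standard deviation.

```python
import math


def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations of one hidden unit, then scale and shift."""
    m = len(h)
    mean = sum(h) / m                                  # (1) batch mean
    var = sum((v - mean) ** 2 for v in h) / m          # (2) variance = std ** 2
    h_hat = [(v - mean) / math.sqrt(var + eps) for v in h]   # (3) normalize
    return [gamma * v + beta for v in h_hat]           # (4) re-scale and shift


batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))
print(batch_norm(batch, gamma=2.0, beta=0.5))
```

In a real network γ and β would be updated by gradient descent along with the weights; here they are fixed arguments to keep the example short.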
Shallow Networks
A shallow neural network consists of only one or two hidden layers. Understanding a
shallow neural network gives us an understanding of what exactly is going on inside a
deep neural network. A neural network is built using various hidden layers. Once we
know the computations that occur in a particular layer, we can understand how the
whole neural network computes the output for a given input X. These computations can
also be called the forward-propagation equations.
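
For a network with one hidden layer, the forward-propagation equations can be written as below. The notation is an assumption introduced here: a superscript [l] marks the layer, W and b are that layer's weights and biases, and f is the activation function.

```latex
\begin{aligned}
z^{[1]} &= W^{[1]} x + b^{[1]}, & a^{[1]} &= f\left(z^{[1]}\right) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]}, & \hat{y} = a^{[2]} &= f\left(z^{[2]}\right)
\end{aligned}
```

A deeper network repeats the same pair of equations once per additional hidden layer.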

Difference Between a Shallow Net & Deep Learning Net:

Convolution Networks:

Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some
other object. The first thing you do is feed the pixels of the image in the form of arrays to the
input layer of the neural network (multi-layer networks used to classify things). The hidden
layers carry out feature extraction by performing different calculations and manipulations.
There are multiple hidden layers like the convolution layer, the ReLU layer, and pooling layer,
that perform feature extraction from the image. Finally, there’s a fully connected layer that
identifies the object in the image.

What is Convolutional Neural Network?
A convolutional neural network is a feed-forward neural network that is generally used to analyze visual
images by processing data with grid-like topology. It’s also known as a ConvNet. A convolutional
neural network is used to detect and classify objects in an image.

Below is a neural network that identifies two types of flowers: Orchid and Rose.

Layers in a Convolutional Neural Network

A convolutional neural network has multiple hidden layers that help in extracting information
from an image. The important layers in a CNN are:

1. Convolution layer

2. ReLU layer / Activation layer

3. Pooling layer

4. Flattening

5. Fully connected layer

6. Output layer

Convolution Layer

This is the first step in the process of extracting valuable features from an image. A convolution
layer has several filters that perform the convolution operation. Every image is considered as a
matrix of pixel values.

Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also a filter
matrix with a dimension of 3x3. Slide the filter matrix over the image and compute the dot
product to get the convolved feature matrix.
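
The sliding operation can be sketched as follows, assuming a stride of 1 and no padding. The pixel and filter values are made up for illustration and are not taken from the figure. Like most deep learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Illustrative 5x5 binary image and 3x3 filter (assumed values).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

A 5x5 image and a 3x3 filter give a 3x3 convolved feature map, because the filter fits in three positions along each axis.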

ReLU layer

ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is
to move them to a ReLU layer. ReLU performs an element-wise operation and sets all the
negative pixels to 0. It introduces non-linearity to the network, and the generated output is
a rectified feature map. Below is the graph of a ReLU function:

The original image is scanned with multiple convolutions and ReLU layers for locating the
features.
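
The element-wise operation is short enough to show directly. The function name and the sample feature map are assumptions for illustration.

```python
def relu(feature_map):
    """Replace every negative value in a 2-D feature map with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


print(relu([[3, -1], [-2, 5]]))
```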

Pooling Layer

Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.

The pooling layer uses various filters to identify different parts of the image like edges, corners,
body, feathers, eyes, and beak.
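
The notes do not say which pooling operation is used, so the sketch below assumes max pooling with a 2x2 window and a stride of 2, a common choice. The function name and the sample values are illustrative.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window of the feature map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


rectified = [
    [1, 4, 2, 7],
    [2, 6, 8, 5],
    [3, 4, 0, 7],
    [1, 2, 3, 1],
]
pooled = max_pool2d(rectified)
for row in pooled:
    print(row)
```

Each 2x2 window collapses to one number, so the 4x4 map becomes 2x2: this is the down-sampling that reduces the dimensionality of the feature map.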

Here’s how the structure of the convolution neural network looks so far:

The next step in the process is called flattening. Flattening is used to convert all the resultant
2-Dimensional arrays from pooled feature maps into a single long continuous linear vector.

The flattened matrix is fed as input to the fully connected layer to classify the image.

Here’s how exactly CNN recognizes a bird:

• The pixels from the image are fed to the convolutional layer, which performs the convolution
operation

• This results in a convolved map

• A ReLU function is applied to the convolved map to generate a rectified feature map

• The image is processed with multiple convolution and ReLU layers to locate the
features

• Different pooling layers with various filters are used to identify specific parts of the image

• The pooled feature map is flattened and fed to a fully connected layer to get the final output

• Activation Layer

The activation layer introduces nonlinearity into the network by applying an activation function
to the output of the previous layer. This is crucial for the network to learn complex patterns.
Common activation functions, such as ReLU, Tanh, and Leaky ReLU, transform the input
while keeping the output size unchanged.

• Flattening

After the convolution and pooling operations, the feature maps still exist in a multi-dimensional
format. Flattening converts these feature maps into a one-dimensional vector. This process is
essential because it prepares the data to be passed into fully connected layers for classification
or regression tasks.

• Output Layer

In the output layer, the final result from the fully connected layers is processed through a
logistic function, such as sigmoid or softmax. These functions convert the raw scores into
probability distributions, enabling the model to predict the most likely class label.
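
The last two stages can be sketched together: flattening pooled feature maps into one vector, and turning raw class scores into probabilities with softmax. The fully connected layer in between is omitted, and the function names, the feature map values, and the raw scores are assumptions for illustration.

```python
import math


def flatten(feature_maps):
    """Concatenate a list of 2-D feature maps into a single 1-D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vector = flatten([[[6, 8], [4, 7]], [[1, 0], [2, 3]]])
print(vector)

probs = softmax([2.0, 1.0, 0.1])             # hypothetical scores for three classes
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```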

Generative Adversarial networks


Deep learning and neural networks, a part of machine learning, are powerful enough to
generate new human faces from scratch: faces that never existed before but appear natural.
This is made possible, with the help of training data, by a technology named GAN, or
Generative Adversarial Networks.

Generative adversarial networks (GANs) are among the most popular unsupervised
machine learning innovations, introduced by Ian J. Goodfellow and colleagues in 2014.

• GAN is a class of machine learning framework in which two neural networks are
connected and can analyze, capture, and copy the variations within a dataset.
• The two neural networks work against one another, hence the name
adversarial networks.
• It is most often used in ML applications such as image generation, video generation,
and speech generation.

A Generative Adversarial Network or GAN is defined as the technique of generative modeling used
to generate new data sets based on training data sets. The newly generated data set appears similar
to the training data sets.

o Generative: It learns a generative model, which describes how the data is
generated.
o Adversarial: The two neural networks compete with each other, so the model is
trained in an adversarial manner.
o Networks: It uses deep neural networks to train models, hence the name networks.

o Discriminator: It follows a supervised machine learning approach in which a simple
classifier is trained to discriminate between real and fake data. It is trained
on the actual training data sets and gives feedback to the generator.
o Generator: Unlike the discriminator, the generator is an unsupervised machine
learning method used to generate fake samples based on the actual training data sets. It is
also a neural network with hidden layers, activation functions, and a loss function.
The generator focuses on generating fake data based on the feedback
given by the discriminator, aiming to fool the discriminator so that it cannot tell
the difference between actual data and the generator's output.
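
The competition between the two networks can be made concrete through their loss functions. The sketch below assumes the discriminator outputs a probability that a sample is real, and it uses binary cross-entropy for the discriminator with the commonly used non-saturating loss for the generator. The function names and the probability values are hypothetical; no networks are actually trained here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fake samples near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss: low when the discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs on generated samples, D(G(z)).
early = generator_loss([0.1, 0.2])    # the discriminator easily spots the fakes
late = generator_loss([0.6, 0.7])     # the generator fools it more often
print(early, late)
print(discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2]))
```

The generator's loss falls exactly when the discriminator's scores on fake data rise, which is the feedback loop described above.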

Types of Generative Adversarial Networks (GANs)
DCGAN: DCGAN or Deep Convolutional GAN is one of the most famous implementations
of GAN. It makes use of ConvNets instead of multi-layer perceptrons. The ConvNets use
convolutional stride and are built without max pooling. Further, the layers in the ConvNets
are not fully connected.
Conditional and Unconditional GAN: A conditional GAN is a deep neural network that takes
some extra parameters. Labels are added to the input of the discriminator so that
it can classify the input more easily. An unconditional GAN does not use such labels.
Least Square GAN: It is a particular type of generative adversarial network that uses the
least-squares loss function for the discriminator. Minimizing the objective function of the
least square GAN also minimizes the Pearson χ² divergence.
Auxiliary Classifier GAN: ACGAN or Auxiliary Classifier GAN is a similar but improved
version of CGAN. Its discriminator not only classifies an image as real or fake but also
predicts the source or class label of the input image.
Dual Video Discriminator GAN: It is a type of GAN for video generation built
upon the BigGAN architecture. It uses a spatial and a temporal discriminator for
generating videos.
SRGAN: SRGAN, or Super-Resolution GAN, performs a kind of domain transformation. It is
primarily used to transform low-resolution images into high-resolution ones.
Cycle GAN: It is used to perform image translation. For example, after training on
datasets of horse and zebra images, it can translate horse images into zebra images.
Info GAN: It is an advanced version of generative adversarial networks used for
unsupervised machine learning.
