Unit-3 NNDL
UNIT-III
Deep Learning is a subset of Machine Learning that uses mathematical functions to map the
input to the output. These functions can extract non-redundant information or patterns from
the data, which enables them to form a relationship between the input and the output. This is
known as learning, and the process of learning is called training.
In traditional computer programming, input and a set of rules are combined to produce the
desired output. In machine learning and deep learning, the input and the output are used to
learn the rules. These rules, when applied to new input, yield the desired results.
Modern deep learning models use artificial neural networks or simply neural networks to
extract information.
These neural networks are made up of simple mathematical functions that can be stacked on
top of each other and arranged in layers, giving them a sense of depth, hence the
term Deep Learning.
Neural networks and Deep learning CSE(AI&ML)
Deep learning algorithms learn about an input, such as an image, progressively by passing it
through each layer of the neural network. The early layers detect low-level features of the
image such as edges, and the later layers combine this information to form more holistic
representations. For example, a middle layer might learn to detect particular parts of an
object in the photograph, while deeper layers learn to detect whole objects such as dogs,
trees, or utensils.
However, for simple tasks that involve less complexity and limited data, deep learning
algorithms often fail to generalize well. This is one of the main reasons deep learning is
not considered as effective as linear or boosted-tree models for such tasks. Simple models
are typically used to process custom data, track fraudulent transactions, and deal with less
complex datasets that have fewer features. There are also cases, such as multiclass
classification on smaller but more structured datasets, where deep learning can be effective
but is usually not the preferred choice.
1943: Walter Pitts and Warren McCulloch present a mathematical model of the biological
neuron. This McCulloch-Pitts neuron could compute logical functions using simple
thresholding functions.
1969: Marvin Minsky and Seymour Papert publish the book "Perceptrons", which shows
that a single-layer perceptron cannot solve simple problems that are not linearly separable,
such as the XOR problem. This leads to a decrease in interest in neural network research and
contributes to the first AI winter.
1982: John Hopfield creates the Hopfield network, a recurrent neural network with binary
threshold nodes that can serve as an associative memory system. It becomes a foundation
stone for recurrent networks and deep learning.
1997: Sepp Hochreiter and Jürgen Schmidhuber invent LSTM (Long Short-Term Memory), a
refinement of the RNN architecture. It becomes a common architecture for deep learning in
the years that follow.
2012 AlexNet wins ImageNet: AlexNet, a GPU-implemented CNN model designed by Alex
Krizhevsky, wins ImageNet's image classification contest with a top-5 accuracy of about 84%.
It is a huge jump over the roughly 75% accuracy that earlier models had achieved. This win
triggers a new deep learning boom globally.
2016 AlphaGo beats human: AlphaGo, DeepMind's deep reinforcement learning model, defeats
one of the best human players at the game of Go. The game is much more complex than chess,
so this feat captures the imagination of everyone and takes the promise of deep learning to a
whole new level. It had a huge impact on the AI community and brought deep learning to the
forefront of AI research globally.
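We want our network to perform correctly on the four points $X = \{[0, 0], [0, 1], [1, 0], [1, 1]\}$ of the XOR function.
We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error (MSE) loss
function. In practical applications, MSE is usually not an appropriate cost function for
modeling binary data. Evaluated on the whole training set, the MSE loss function is
$$J(\theta) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x; \theta) \right)^2,$$
where $f^*$ is the target XOR function.
Suppose that we choose a linear model, with $\theta$ consisting of $w$ and $b$. Our model is defined to
be
$$f(x; w, b) = x^\top w + b.$$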
We can minimize $J(\theta)$ in closed form with respect to $w$ and $b$ using the normal equations.
After solving the normal equations, we obtain $w = 0$ and $b = 1/2$. The linear model simply
outputs 0.5 everywhere. Why does this happen? A linear model is not able to represent the
XOR function. One way to solve this problem is to use a model that learns a different feature
space in which a linear model is able to represent the solution.
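The closed-form result above can be checked numerically. The following is a minimal NumPy sketch, assuming the bias is handled by appending a column of ones to the inputs (the variable names are illustrative, not from the notes):

```python
import numpy as np

# The four XOR inputs (one example per row) and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so that the bias b is learned as an extra weight.
A = np.hstack([X, np.ones((4, 1))])

# Solve the normal equations (A^T A) theta = A^T y for theta = [w1, w2, b].
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta[:2], theta[2]

# Predictions of the fitted linear model on the four training points.
predictions = A @ theta
print("w =", w, "b =", b, "predictions =", predictions)
```

The fitted weights are zero and the bias is one half, so the model predicts the same value for all four inputs, which is the failure described in the text.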
Specifically, we will introduce a very simple feedforward network with one hidden layer
containing two hidden units.
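This feedforward network has a vector of hidden units $h$ that are computed by a
function $f^{(1)}(x; W, c)$. The values of these hidden units are then used as the input for a
second layer. The second layer is the output layer of the network. The output layer is still just
a linear regression model, but now it is applied to $h$ rather than to $x$. The network now
contains two functions chained together:
$$h = f^{(1)}(x; W, c) \quad \text{and} \quad y = f^{(2)}(h; w, b).$$
Using the rectified linear activation $g(z) = \max\{0, z\}$ for the hidden units, the complete network is
$$f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b.$$
We can then specify a solution to the XOR problem. Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},$$
and $b = 0$.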
We can now walk through the way that the model processes a batch of inputs. Let $X$ be the
design matrix containing all four points in the binary input space, with one example per row:
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}.$$
The first step in the neural network is to multiply the input matrix by the first layer's weight
matrix:
$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}.$$
Next, we add the bias vector $c$ to obtain
$$XW + c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}.$$
In this space, all of the examples lie along a line with slope 1. As we move along this line, the
output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot
implement such a function. To finish computing the value of $h$ for each example, we apply
the rectified linear transformation:
$$\max\{0, XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}.$$
This transformation has changed the relationship between the examples. They no longer lie
on a single line. They now lie in a space where a linear model can solve the problem. We
finish by multiplying by the weight vector $w$:
$$\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}.$$
The neural network has obtained the correct answer for every example in the batch.
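The batch computation can be reproduced in a few lines of NumPy. This is a sketch only: the parameter values W, c, w and b are assumed to be those of the standard textbook solution to this XOR example (all-ones W, c = [0, -1], w = [1, -2], b = 0), and the function name is illustrative.

```python
import numpy as np

# Design matrix: the four binary inputs, one example per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Assumed hand-specified parameters of the two-layer network.
W = np.array([[1, 1], [1, 1]], dtype=float)  # first-layer weights
c = np.array([0, -1], dtype=float)           # first-layer biases
w = np.array([1, -2], dtype=float)           # output weights
b = 0.0                                      # output bias

def xor_network(X):
    """Forward pass: affine map, ReLU, then a linear output layer."""
    pre_activation = X @ W + c            # one row per example
    h = np.maximum(0.0, pre_activation)   # rectified linear transformation
    return h @ w + b

outputs = xor_network(X)
print(outputs)
```

With these parameters the network outputs 0, 1, 1, 0 for the four rows of X, matching the XOR targets.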
In this example, we simply specified the solution, then showed that it obtained zero error. In a
real situation, there might be billions of model parameters and billions of training examples,
so one cannot simply guess the solution as we did here. Instead, a gradient-based
optimization algorithm can find parameters that produce very little error.
Gradient-Based Learning
As with other machine learning models, to apply gradient-based learning we must choose a
cost function, and we must choose how to represent the output of the model. The largest
difference between simple linear ML models and neural networks is that the nonlinearity of a
neural network causes most interesting loss functions to become non-convex. This means that
neural networks are usually trained by using iterative, gradient-based optimizers that merely
drive the cost function to a very low value, rather than the exact linear equation solvers used
to train linear regression models or the convex optimization algorithms used for logistic
regression.
A cost function is a measure of how well a machine learning model performs on a given
dataset. It quantifies the difference between the expected (target) values and the predicted
values and represents it as a single real number.
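To make the idea of an iterative, gradient-based optimizer concrete, here is a minimal sketch that drives a mean squared error cost toward zero for a one-dimensional linear model. The toy data, learning rate and step count are assumptions chosen for illustration, and a linear model is used only to keep the example short.

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Cost: mean squared difference between predictions and targets."""
    return np.mean((w * x + b - y) ** 2)

def gradient_descent(x, y, lr=0.1, steps=2000):
    """Repeatedly step against the gradient of the MSE cost."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        grad_w = 2.0 * np.mean(err * x)  # dJ/dw
        grad_b = 2.0 * np.mean(err)      # dJ/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = gradient_descent(x, y)
final_cost = mse_cost(w, b, x, y)
print(w, b, final_cost)
```

For a neural network the update rule has the same shape, but the cost is non-convex, so the optimizer is only expected to reach a low value of the cost rather than a guaranteed global minimum.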
The specific form of the cost function changes from model to model, depending on the
specific form of $\log p_{\text{model}}$.
An advantage of this approach of deriving the cost function from maximum likelihood is that
it removes the burden of designing cost functions for each model. Specifying a model
$p(y \mid x)$ automatically determines a cost function $\log p(y \mid x)$.
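As a worked instance of this idea, consider the maximum likelihood cost written as a negative log-likelihood, and assume (purely for illustration) a Gaussian model whose mean is the network output $f(x; \theta)$ and whose covariance is the identity. Under that assumption the cost reduces to mean squared error up to an additive constant:

```latex
\begin{align*}
J(\theta) &= -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x), \\
p_{\text{model}}(y \mid x) &= \mathcal{N}\!\left(y;\, f(x; \theta),\, I\right)
\quad \text{(assumed model)}, \\
J(\theta) &= \frac{1}{2}\, \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}}
\left\lVert y - f(x; \theta) \right\rVert^2 + \text{const}.
\end{align*}
```

The constant does not depend on $\theta$, so it has no effect on the gradient or on the minimizer.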
Learning Conditional Statistics:
Here, the cost function can be viewed as a functional (a mapping from functions to real
numbers), and learning becomes a process of choosing a function, not just a set of
parameters. We can design the cost functional so that its minimum corresponds to the
function that maps $x$ to the expected value of $y$ given $x$.
Solving such optimization problems involves calculus of variations. While understanding
this tool is not necessary here, it helps derive certain key results.
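The first result derived using calculus of variations shows that solving the optimization
problem
$$f^* = \arg\min_f \; \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert^2 \qquad (1)$$
yields
$$f^*(x) = \mathbb{E}_{y \sim p_{\text{data}}(y \mid x)}[y], \qquad (2)$$
so minimizing the mean squared error (MSE) gives a function that predicts the mean of $y$ for each value of $x$.
A second result is that solving
$$f^* = \arg\min_f \; \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert_1 \qquad (3)$$
yields a function that predicts the median value of $y$ for each $x$. This cost function is called the mean absolute error (MAE).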
However, both MSE and MAE can lead to poor results with gradient-based optimization, as
certain output units may saturate and produce very small gradients. This is why the
cross-entropy cost function is often preferred, even when estimating the full distribution
$p(y \mid x)$ is not necessary.
Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most of
the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
We suppose that the feedforward network provides a set of hidden features defined by
$h = f(x; \theta)$. The role of the output layer is then to provide some additional
transformation from the features to complete the task that the network must perform.
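For a sigmoid output unit with $z = w^\top h + b$, if we assume that the unnormalized log
probabilities are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized
probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by
a sigmoidal transformation of $z$:
$$\log \tilde{P}(y) = yz,$$
$$\tilde{P}(y) = \exp(yz),$$
$$P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)},$$
$$P(y) = \sigma\big((2y - 1)z\big).$$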
When we use other loss functions, such as mean squared error, the loss can saturate
anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very
negative and saturates to 1 when z becomes very positive. The gradient can shrink too small
to be useful for learning whenever this happens, whether the model has the correct answer or
the incorrect answer. For this reason, maximum likelihood is almost always the preferred
approach to training sigmoid output units.
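The saturation effect can be seen by comparing the gradient with respect to z under the two losses when the sigmoid is confidently wrong. This is a minimal sketch; the function names and the test point z = -10, y = 1 are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2; contains the factor s * (1 - s)."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def grad_nll(z, y):
    """d/dz of the negative log-likelihood -[y log s + (1 - y) log(1 - s)]."""
    return sigmoid(z) - y

# A confidently wrong prediction: the target is 1 but z is very negative.
z, y = -10.0, 1.0
g_mse = grad_mse(z, y)
g_nll = grad_nll(z, y)
print("MSE gradient:", g_mse, "NLL gradient:", g_nll)
```

The MSE gradient is nearly zero because of the s * (1 - s) factor, while the maximum likelihood gradient stays close to -1, so learning can still correct the mistake.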
As with the logistic sigmoid, the use of the exp function works very well when training the
softmax to output a target value y using maximum log-likelihood.
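A small sketch of the softmax and its log-likelihood loss follows. It uses the standard definition softmax(z)_i = exp(z_i) / sum_j exp(z_j) and the common trick of subtracting max(z) before exponentiating for numerical stability; the helper names and example logits are illustrative assumptions.

```python
import numpy as np

def log_softmax(z):
    """log softmax(z), computed stably by shifting z by its maximum."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def softmax(z):
    return np.exp(log_softmax(z))

def nll(z, target):
    """Negative log-likelihood of the target class under softmax(z)."""
    return -log_softmax(z)[target]

z = np.array([2.0, 1.0, -1.0])
p = softmax(z)
loss_correct = nll(z, 0)  # target is the class with the largest logit
loss_wrong = nll(z, 2)    # target is the class with the smallest logit

# Extreme logits remain finite thanks to the max-shift.
p_large = softmax(np.array([1000.0, 0.0, -1000.0]))
print(p, loss_correct, loss_wrong, p_large)
```

Because the log undoes the exp, the loss contains the term -z_target directly, so it does not saturate when the model is confidently wrong.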
Hidden Units
The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles. Rectified linear units are an excellent default
choice of hidden unit. Many other types of hidden units are available, and it is usually
impossible to predict in advance which will work best. The design process consists of trial
and error, intuiting that a kind of hidden unit may work well, and evaluating its performance
on a validation set.
Some hidden units are not differentiable at all input points. For example, the rectified linear
function $g(z) = \max\{0, z\}$ is not differentiable at $z = 0$. This may seem like it invalidates $g$
for use with a gradient-based learning algorithm. In practice, gradient descent still performs
well enough for these models to be used for machine learning tasks.
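Most hidden units can be described as accepting a vector of inputs $x$, computing an affine
transformation $z = W^\top x + b$, and then applying an element-wise nonlinear function $g(z)$.
Most hidden units are distinguished from each other only by the choice of the form of the
activation function $g(z)$.
Rectified linear units are easy to optimize because they are so similar to linear units. The
only difference from linear units is that they output 0 across half their domain.
As a result, the derivative through a rectified linear unit stays large whenever the unit is
active, and its second derivative is 0 almost everywhere. Thus the gradient direction is far
more useful for learning than with activation functions that introduce second-order effects.
Rectified linear units are typically used on top of an affine transformation: $h = g(W^\top x + b)$.
When initializing the parameters, it can be good practice to set all elements of $b$ to a small
positive value such as 0.1. This makes it likely that the rectified linear units will be initially
active for most training samples and will allow derivatives to pass through.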
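A rectified linear hidden layer with this bias initialization can be sketched as follows. The layer sizes, the weight scale of 0.01, the random seed and the convention of using derivative 0 at z = 0 are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}, applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 where the unit is active, 0 elsewhere.
    At z = 0 the function is not differentiable; 0 is used by convention here."""
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

# Small random weights and biases initialized to 0.1, so that units tend
# to start in the active region where derivatives pass through.
W = rng.normal(scale=0.01, size=(n_in, n_hidden))
b = np.full(n_hidden, 0.1)

x = np.array([0.5, -1.0, 2.0])
z = W.T @ x + b   # affine transformation
h = relu(z)       # hidden activations
print(h, relu_grad(z))
```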
Sigmoid and tanh activation functions are difficult to use in networks with many layers
because of the vanishing gradient problem.
ReLU mitigates the vanishing gradient problem, allowing models to learn faster and
perform better.
One drawback of rectified linear units is that they cannot learn via gradient-based methods on
examples for which their activation is zero.
Three generalizations of rectified linear units are based on using a non-zero slope $\alpha_i$ when
$z_i < 0$:
$$h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i).$$
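The formula can be written as one function of the slope parameter. The three settings below correspond to what are commonly called absolute value rectification (slope -1), leaky ReLU (a small fixed slope such as 0.01) and parametric ReLU (a learnable slope per unit); these names and the example values are supplied here for illustration rather than taken from the notes.

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])

# Absolute value rectification: alpha = -1 gives g(z) = |z|.
abs_rect = generalized_relu(z, -1.0)

# Leaky ReLU: a small fixed slope for negative inputs.
leaky = generalized_relu(z, 0.01)

# Parametric ReLU: one slope per unit; in training these would be learned.
alpha_per_unit = np.array([0.1, 0.2, 0.3, 0.4])
prelu = generalized_relu(z, alpha_per_unit)

print(abs_rect, leaky, prelu)
```

In all three cases the unit has a non-zero gradient for negative inputs, which addresses the drawback that a plain rectified linear unit cannot learn from examples where it is inactive.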
Prior to the introduction of rectified linear units, most neural networks used the logistic
sigmoid activation function $g(z) = \sigma(z)$.
We have already seen sigmoid units as output units, used to predict the probability that a
binary variable is 1.
The hyperbolic tangent typically performs better than the logistic sigmoid. It resembles the
identity function more closely, in the sense that $\tanh(0) = 0$ while $\sigma(0) = 1/2$. Because
tanh is similar to the identity function near 0, training a deep neural network resembles
training a linear model so long as the activations of the network can be kept small.
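The claim that tanh stays close to the identity near 0, while the sigmoid does not, can be checked numerically. This sketch also verifies the standard identity tanh(z) = 2σ(2z) - 1, which relates the two functions; the interval [-0.1, 0.1] is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small inputs around zero.
z = np.linspace(-0.1, 0.1, 5)

# Largest deviation from the identity function on this interval.
tanh_gap = np.max(np.abs(np.tanh(z) - z))
sigmoid_gap = np.max(np.abs(sigmoid(z) - z))

# tanh is a rescaled and shifted sigmoid.
identity_check = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

print(tanh_gap, sigmoid_gap, identity_check)
```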
Architecture Design
The word architecture refers to the overall structure of the network: how many units it should
have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural
network architectures arrange these layers in a chain structure, with each layer being a
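function of the layer that preceded it. In this structure, the first layer is given by
$$h^{(1)} = g^{(1)}\left(W^{(1)\top} x + b^{(1)}\right),$$
the second layer is given by
$$h^{(2)} = g^{(2)}\left(W^{(2)\top} h^{(1)} + b^{(2)}\right),$$
and so on.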
In these chain-based architectures, the main architectural considerations are to choose the
depth of the network and the width of each layer.
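A chain-structured network can be sketched so that depth and width are explicit choices. In the sketch below the list of layer widths fixes both; ReLU hidden layers, a linear output layer, the initialization scale and the function names are assumptions made for illustration.

```python
import numpy as np

def init_mlp(layer_widths, seed=0):
    """Create (W, b) pairs for a chain of layers.
    len(layer_widths) - 1 is the depth; each entry is a layer's width."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        W = rng.normal(scale=0.1, size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, x):
    """Each layer is a function of the layer that preceded it."""
    h = x
    for i, (W, b) in enumerate(params):
        z = W.T @ h + b
        is_output_layer = (i == len(params) - 1)
        h = z if is_output_layer else np.maximum(0.0, z)  # linear output, ReLU hidden
    return h

# 3 inputs, two hidden layers of width 5, and 2 outputs.
params = init_mlp([3, 5, 5, 2])
y_hat = forward(params, np.array([1.0, -2.0, 0.5]))
print(y_hat.shape)
```

Changing the list passed to init_mlp, for example to [3, 50, 2] or [3, 5, 5, 5, 5, 2], trades width against depth without touching the forward pass.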
The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a
feedforward network with a linear output layer and at least one hidden layer with any
“squashing” activation function (such as the logistic sigmoid activation function) can
approximate any Borel measurable function from one finite-dimensional space to another
with any desired nonzero amount of error, provided that the network is given enough hidden
units.
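A feed-forward network with a single hidden layer containing a finite number of neurons can
approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on
the activation function.
Simple neural networks can therefore represent a wide variety of interesting functions when
given appropriate parameters.
However, the theorem does not touch upon the algorithmic learnability of those parameters.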
The universal approximation theorem means that regardless of what function we are trying to
learn, we know that a large multilayer perceptron (MLP) will be able to represent this
function. However, we are not guaranteed that the training algorithm will be able to learn that
function. Even if the MLP is able to represent the function, learning can fail for two different
reasons.
1. The optimization algorithm used for training may not be able to find the value of the
parameters that corresponds to the desired function.
2. The training algorithm might choose the wrong function as a result of overfitting.
The universal approximation theorem says that there exists a network large enough to achieve
any degree of accuracy we desire, but the theorem does not say how large this network will be.
Barron (1993) provides some bounds on the size of a single-layer network needed to
approximate a broad class of functions. Unfortunately, in the worst case, an exponential
number of hidden units may be required. This is easiest to see in the binary case: the number
of possible binary functions on vectors $v \in \{0, 1\}^n$ is $2^{2^n}$, and selecting one such
function requires $2^n$ bits, which will in general require $O(2^n)$ degrees of freedom.
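The counting argument is easy to verify for small n. A binary function on n-bit vectors is one output bit for each of the 2^n possible inputs, so there are 2^(2^n) such functions and 2^n bits are needed to pick one. The sketch below (helper names are illustrative) computes these counts and confirms the n = 2 case by brute-force enumeration of truth tables.

```python
import itertools

def count_binary_functions(n):
    """Number of functions from {0,1}^n to {0,1}: one output bit per input."""
    num_inputs = 2 ** n
    return 2 ** num_inputs

counts = {n: count_binary_functions(n) for n in range(1, 6)}
bits_needed = {n: 2 ** n for n in range(1, 6)}

# Brute-force check for n = 2: every truth table is a tuple of 4 output bits.
n = 2
truth_tables = list(itertools.product([0, 1], repeat=2 ** n))

print(counts)
print(bits_needed)
print(len(truth_tables))
```

Even at n = 5 there are already more than four billion distinct functions, which is why a shallow network may need a number of hidden units that grows exponentially with n in the worst case.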
Empirically, deeper models often perform better than wider or shallower models, even ones
with more parameters, because they can capture more complex hierarchical functions.
In one reported set of experiments, shallow models began to overfit at around 20 million
parameters, while deeper models continued to benefit from larger parameter counts (over 60
million).
Deep models learn representations by composing simpler functions, such as detecting
edges, then corners, and then objects.
Depth expresses a preference for learning complex functions as compositions of
simpler ones, which can improve generalization over shallow models.
This helps explain why adding depth is often more effective than merely increasing the
number of parameters in a neural network.
Forward Propagation: In a feedforward neural network, the input $x$ flows through the layers
to produce the output $\hat{y}$. This process is called forward propagation. During training,
forward propagation computes the cost $J(\theta)$.
Backprop Misconceptions:
It is not the entire learning algorithm but just a method for computing gradients.
It is not specific to multi-layer networks but can compute derivatives of any function.
Computational Graphs:
General Use: Backpropagation extends beyond neural networks, being useful for
computing various derivatives in machine learning tasks.
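The split between forward propagation (computing the cost) and backpropagation (computing its gradient) can be sketched on a tiny one-hidden-layer network. Everything here is an illustrative assumption: the network shape, the squared-error cost, the parameter values and the function names. A finite-difference check is included because it is a common way to confirm that a hand-written backward pass is correct.

```python
import numpy as np

def forward(params, x, y):
    """Forward propagation: input -> hidden ReLU layer -> output -> scalar cost J."""
    W, c, w, b = params
    z = x @ W + c                 # hidden pre-activation
    h = np.maximum(0.0, z)        # ReLU
    y_hat = h @ w + b             # scalar prediction
    J = 0.5 * (y_hat - y) ** 2    # squared-error cost
    return J, (z, h, y_hat)

def backward(params, x, y, cache):
    """Backpropagation: apply the chain rule from the cost back to each parameter."""
    W, c, w, b = params
    z, h, y_hat = cache
    d_yhat = y_hat - y            # dJ/dy_hat
    grad_w = d_yhat * h           # dJ/dw
    grad_b = d_yhat               # dJ/db
    d_h = d_yhat * w              # dJ/dh
    d_z = d_h * (z > 0)           # through ReLU: gradient passes where z > 0
    grad_W = np.outer(x, d_z)     # dJ/dW
    grad_c = d_z                  # dJ/dc
    return grad_W, grad_c, grad_w, grad_b

def numerical_grad_W(params, x, y, i, j, eps=1e-6):
    """Central finite-difference estimate of dJ/dW[i, j]."""
    W, c, w, b = params
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, j] += eps
    W_minus[i, j] -= eps
    J_plus, _ = forward((W_plus, c, w, b), x, y)
    J_minus, _ = forward((W_minus, c, w, b), x, y)
    return (J_plus - J_minus) / (2.0 * eps)

params = (
    np.array([[0.5, -0.3], [0.2, 0.8]]),  # W
    np.array([0.1, -0.2]),                # c
    np.array([1.0, -2.0]),                # w
    0.3,                                  # b
)
x, y = np.array([1.0, 2.0]), 1.0

J, cache = forward(params, x, y)
grad_W, grad_c, grad_w, grad_b = backward(params, x, y, cache)
numeric = numerical_grad_W(params, x, y, 0, 0)
print(J, grad_W, numeric)
```

Note that backward only returns gradients. A separate algorithm, such as gradient descent, would use them to update the parameters, which is the first misconception listed above.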
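Chain Rule with Tensors: For tensor-valued functions, the chain rule becomes:
$$\nabla_X z = \sum_j \left( \nabla_X Y_j \right) \frac{\partial z}{\partial Y_j},$$
where $Y = g(X)$, $z = f(Y)$, and the index $j$ runs over all elements of the tensor $Y$.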