Mod 2.4,2.5,2.6 Architecture Design

Architecture Design

 One key design consideration for neural networks is determining the architecture.
 The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
 Most neural networks are organized into groups of units called layers.
 Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first and second layers are given respectively as
h^(1) = g^(1)(W^(1)ᵀx + b^(1)) and h^(2) = g^(2)(W^(2)ᵀh^(1) + b^(2)).
 In these chain-based architectures, the main architectural
considerations are to choose the depth of the network and the
width of each layer.
 A network with even one hidden layer is sufficient to fit the
training set.
 Deeper networks are often able to use far fewer units per layer and far fewer parameters, and often generalize better to the test set, but are also often harder to optimize.
 The ideal network architecture for a task must be found via
experimentation guided by monitoring the validation set error.
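As a minimal sketch of this chain structure, the forward pass below wires two small layers together, each a function of the one before it. The layer sizes, the tanh activation, and the random weights are all illustrative assumptions, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3 input features, a width-4 hidden layer, 2 outputs.
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)

def forward(x):
    h1 = np.tanh(W1 @ x + b1)  # first layer: a function of the input
    h2 = W2 @ h1 + b2          # second layer: a function of the first layer
    return h2

y = forward(np.array([1.0, -2.0, 0.5]))
```

Here the depth is 2 and the widths are 4 and 2; choosing those two numbers is exactly the architectural decision the slide describes.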
Universal Approximation Properties
and Depth
 A linear model, mapping from features to outputs via matrix multiplication, can
by definition represent only linear functions. It has the advantage of being easy to
train.
 But we often want to learn nonlinear functions.
 At first glance, we might presume that learning a nonlinear function requires
designing a specialized model family for the kind of nonlinearity we want to
learn.
 Fortunately, feedforward networks with hidden layers provide a universal
approximation framework.
 The universal approximation theorem means that regardless of
what function we are trying to learn, we know that a large MLP
will be able to represent this function.
 However, we are not guaranteed that the training algorithm will
be able to learn that function.
 Even if the MLP is able to represent the function, learning can fail
for two different reasons.
1. First, the optimization algorithm used for training may not be
able to find the value of the parameters that corresponds to the
desired function.
2. Second, the training algorithm might choose the wrong function
due to overfitting.
 The universal approximation theorem says that there exists
a network large enough to achieve any degree of accuracy
we desire, but the theorem does not say how large this
network will be.
 In summary, a feedforward network with a single hidden layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.
 In many circumstances, using deeper models can reduce
the number of units required to represent the desired
function and can reduce the amount of generalization error.
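As a concrete illustration of representation (as opposed to learning), a single hidden layer of two ReLU units can represent the nonlinear XOR function exactly. The weights below are hand-chosen following the classic construction, not learned by any training algorithm:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-chosen weights: one hidden layer of two ReLU units, linear output.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W @ x + c)  # hidden layer
    return w @ h         # linear output layer

X = [np.array([0.0, 0.0]), np.array([0.0, 1.0]),
     np.array([1.0, 0.0]), np.array([1.0, 1.0])]
outputs = [float(xor_net(x)) for x in X]  # XOR of the two inputs
```

A linear model cannot represent XOR at all, which is why the single hidden layer of nonlinear units matters here.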
Other Architectural Considerations
 In practice, neural networks show considerably more diversity.
 Many neural network architectures have been developed for specific tasks.
1. Simulating Basic Machine Learning with Shallow Models
Most of the basic machine learning models like linear regression,
classification, support vector machines, logistic regression, singular value
decomposition, and matrix factorization can be simulated with shallow
neural networks containing no more than one or two layers.
2. Convolutional Neural Networks
3. Recurrent Neural Networks
4. Restricted Boltzmann Machines
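For instance, logistic regression corresponds to a single sigmoid unit: a linear combination of the inputs followed by a sigmoid nonlinearity. A minimal sketch (the weight values are arbitrary placeholders, not fitted coefficients):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression as a one-unit, one-layer "network".
w = np.array([0.5, -0.25])  # illustrative weights
b = 0.1                     # illustrative bias

def predict_proba(x):
    return sigmoid(w @ x + b)  # probability of the positive class

p = predict_proba(np.array([1.0, 2.0]))
```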
Training a Neural Network with Backpropagation

 Backpropagation is one of the most important concepts in neural networks.
 Our task is to classify the data as well as possible.
 To do this, we have to update the weight and bias parameters. In the linear regression model, we use gradient descent to optimize the parameters.
 Similarly, here we also use the gradient descent algorithm, with the gradients computed by backpropagation.
 For a single training example, the backpropagation algorithm calculates the gradient of the error function.
Training a Neural Network with Backpropagation

The backpropagation algorithm contains two main phases, referred to as the forward and
backward phases, respectively.
 
1. Forward phase: In this phase, the inputs for a training instance are fed into the neural
network. This results in a forward cascade of computations across the layers, using the current
set of weights. The final predicted output can be compared to that of the training instance and
the derivative of the loss function with respect to the output is computed. The derivative of this
loss now needs to be computed with respect to the weights in all layers in the backwards phase.
2. Backward phase: The main goal of the backward phase is to learn the gradient of the loss
function with respect to the different weights by using the chain rule of differential calculus.
These gradients are used to update the weights. Since these gradients are learned in the
backward direction, starting from the output node, this learning process is referred to as the
backward phase.
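The two phases can be sketched for a tiny two-layer network with a squared-error loss. This is a hypothetical minimal example with scalar weights; real implementations vectorize over whole layers and batches:

```python
import numpy as np

# Tiny network: scalar input -> one tanh hidden unit -> linear output.
w1, w2 = 0.5, -0.3          # current weights
x, target = 1.0, 1.0        # one training instance

# Forward phase: cascade of computations using the current weights.
a = w1 * x                  # pre-activation of the hidden unit
h = np.tanh(a)              # hidden activation
y = w2 * h                  # predicted output
loss = 0.5 * (y - target) ** 2

# Backward phase: chain rule, starting from the output node.
dL_dy = y - target             # derivative of loss w.r.t. the output
dL_dw2 = dL_dy * h             # gradient for the output weight
dL_dh = dL_dy * w2             # propagate back through the output layer
dL_da = dL_dh * (1 - h ** 2)   # through tanh: d tanh(a)/da = 1 - tanh(a)^2
dL_dw1 = dL_da * x             # gradient for the first-layer weight
```

Note how each backward line multiplies a local derivative onto the gradient flowing from the layer above: that product of local gradients along the path from output to weight is the chain rule in action.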
Training a Neural Network
with Backpropagation
 In the single-layer neural network, the training process is relatively
straightforward because the error (or loss function) can be computed as a direct
function of the weights, which allows easy gradient computation.
 In the case of multi-layer networks, the problem is that the loss is a complicated
composition function of the weights in earlier layers. The gradient of a
composition function is computed using the backpropagation algorithm.

 The backpropagation algorithm leverages the chain rule of differential calculus, which computes the error gradients in terms of summations of local-gradient products over the various paths from a node to the output.
 Backpropagation algorithms are a set of methods used to efficiently train
artificial neural networks following a gradient descent approach which
exploits the chain rule.
Illustration of chain rule in
computational graphs
Example to understand how exactly backpropagation updates the weights.
Gradient Descent

 Let’s visualize how we might minimize the squared error over all of the training examples.
 Imagine a three-dimensional space where the horizontal dimensions correspond to
the weights w1 and w2, and the vertical dimension corresponds to the value of the
error function E.
 In this space, points in the horizontal plane correspond to different settings of the
weights, and the height at those points corresponds to the incurred error.
 If we consider the errors we make over all possible weights, we get a surface in
this three-dimensional space, in particular, a quadratic bowl.
Gradient Descent

The quadratic error surface for a linear neuron


 We can also conveniently visualize this surface as a set of elliptical contours, where the minimum error is at the center of the ellipses.
 Here we are working in a two-dimensional plane where the dimensions correspond to the two weights. Contours correspond to settings of w1 and w2 that evaluate to the same value of E.
 The closer the contours are to each other, the steeper the slope.
 In fact, it turns out that the direction of steepest descent is always perpendicular to the contours.
 This direction is expressed as a vector known as the gradient.
Visualizing the error surface as a set of contours
How to find the values of the weight that minimizes
the error function?
 Suppose we randomly initialize the weights of our network so we find
ourselves somewhere on the horizontal plane.
 By evaluating the gradient at our current position, we can find the direction
of steepest descent, and we can take a step in that direction.
 Then we’ll find ourselves at a new position that’s closer to the minimum
than we were before. We can reevaluate the direction of steepest descent by
taking the gradient at this new position and taking a step in this new
direction.
 Following this strategy will eventually get us to the point of minimum
error.
 This algorithm is known as gradient descent.
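The strategy above can be sketched on a simple quadratic bowl, E(w1, w2) = w1² + w2². This surface is a stand-in assumption; in practice the error surface comes from the data, but the step rule is the same:

```python
import numpy as np

def grad_E(w):
    # Gradient of E(w1, w2) = w1^2 + w2^2; it points perpendicular to the
    # elliptical contours, in the steepest-ascent direction.
    return 2 * w

w = np.array([3.0, -4.0])    # a "random" starting point on the plane
eta = 0.1                    # learning rate

for _ in range(100):
    w = w - eta * grad_E(w)  # step in the steepest-descent direction

# w is now very close to the minimum at (0, 0).
```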
Gradient Descent(GD)
 Gradient descent is an optimization algorithm which is commonly used to train machine learning models and neural networks. Until the gradient is close to or equal to zero, the model will continue to adjust its parameters to yield the smallest possible error.
 Gradient descent is simply used to find the values of a
function's parameters (coefficients) that minimize a cost
function as far as possible.
 Gradient descent is best used when the parameters cannot be
calculated analytically (e.g. using linear algebra) and must be
searched for by an optimization algorithm.
 To start finding the right values, we initialize w and b with some random numbers.
 Gradient descent then starts at that point and takes one step after another in the direction of steepest descent until it reaches the point where the cost function is as small as possible.
 How big the steps gradient descent takes in the direction of the local minimum is determined by the learning rate (another hyperparameter), which determines how fast or slow we move towards the optimal weights.
 Picking the learning rate is a hard problem. If we pick a learning rate that’s too small, we risk taking too long during the training process. But if we pick one that’s too big, we’ll most likely start diverging away from the minimum.
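This effect can be seen on the toy cost E(w) = w², whose gradient is 2w. With a small learning rate progress is slow, while a rate above 1.0 makes each step overshoot the minimum and diverge (the specific rates below are illustrative):

```python
def steps(w, eta, n=20):
    # n gradient-descent steps on E(w) = w^2 (gradient 2w).
    for _ in range(n):
        w = w - eta * 2 * w
    return w

slow = steps(10.0, 0.01)     # too small: still far from the minimum at 0
good = steps(10.0, 0.3)      # reasonable: very close to the minimum
diverged = steps(10.0, 1.1)  # too large: each step overshoots and grows
```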
Gradient Descent Algorithm

1. Randomly initialize the weights w.

2. Compute the gradient G as the derivative of the cost function J(w) with respect to the weights.

3. Weight update equation: w_new = w_old − ηG.

(Here, η is the learning rate, which should be neither so high that we skip over the minimum nor so low that we never converge to it. If we compute the gradient of the loss function with respect to our weights and take small steps in the opposite direction of the gradient, our loss will gradually decrease until it converges to some local minimum.)

4. Repeat steps 2 to 3 until convergence, meaning until w_new is approximately equal to w_old.
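The four steps might look like this for a one-parameter least-squares cost J(w) = Σ(wx − y)² over a tiny made-up dataset (all values here are illustrative assumptions):

```python
import numpy as np

# Tiny illustrative dataset where y = 2x exactly, so the optimum is w = 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def gradient(w):
    # Derivative of J(w) = sum((w*x - y)^2) with respect to w.
    return np.sum(2 * (w * x - y) * x)

eta = 0.01                                       # learning rate
w = np.random.default_rng(0).standard_normal()   # step 1: random init

while True:
    w_new = w - eta * gradient(w)  # steps 2-3: gradient and weight update
    if abs(w_new - w) < 1e-9:      # step 4: stop when w_new ≈ w_old
        break
    w = w_new
```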
Gradient descent

It is an iterative process to find the parameters (or the weights) that converge to the optimum solution, which is where the loss function is minimized.
If gradient descent is working properly, the cost function should decrease after every iteration.
When gradient descent can’t decrease the cost function anymore and it remains more or less at the same level, it has converged.
