
Module 2 Deep Feed Forward Networks

Deep Feed Forward Networks

Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the classic deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y.

A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y.

There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications, e.g., object detection, classification, and segmentation tasks.


One way to extend linear models to represent nonlinear functions of x is to apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

6.1 Example: Learning XOR

The XOR (exclusive or) function is a logical operation on two binary inputs, x1 and x2, where the output
is 1 if one, and only one, of the inputs is 1; otherwise, the output is 0. This function is challenging for
linear models because XOR is not linearly separable. To solve the XOR problem with a neural network,
we need a non-linear model. Here's a breakdown of how a simple feed-forward neural network with one
hidden layer can be used to solve this problem.
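Below is a minimal NumPy sketch of such a network. The specific weight values are the well-known hand-crafted textbook solution (two ReLU hidden units followed by a linear output), used here only to illustrate that a one-hidden-layer network can represent XOR exactly:

import numpy as np

# Hidden layer: h = ReLU(x W + c); output: y = h w + b.
# These weights are the standard hand-crafted solution to XOR.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

h = np.maximum(0, X @ W + c)   # ReLU hidden representation
y = h @ w + b                  # linear output layer

print(y)                       # [0. 1. 1. 0.] -- exactly the XOR truth table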


Figure 6.1: Solving the XOR problem by learning a representation.


Figure 6.3: The rectified linear activation function. This activation function is the default activation
function recommended for use with most feed-forward neural networks.


6.2 Gradient-Based Learning

Gradient-based learning is the backbone of many deep learning algorithms. This approach involves iteratively adjusting model parameters to minimize the loss function, which measures the difference between the actual and predicted outputs. At its core, gradient-based learning leverages the gradient of the loss function to navigate the complex landscape of parameters.

Gradient

 The gradient is a vector that points in the direction of the steepest increase of a
function. In the context of machine learning, it shows how changing each of the model's
parameters would increase or decrease the loss.
 A positive gradient indicates that increasing the parameter will increase the loss, while a
negative gradient indicates that increasing the parameter will decrease the loss.

Gradient Descent

 Gradient Descent is an optimization algorithm used to minimize the loss function. The
algorithm updates the model’s parameters in the opposite direction of the gradient.
 The size of the steps taken in the direction of the negative gradient is determined by a
hyper-parameter called the learning rate. A smaller learning rate makes more precise
but slower updates, while a larger learning rate speeds up learning but may cause
instability.
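A minimal sketch of the update rule θ ← θ − η · ∇J(θ) on a toy one-parameter loss; the loss, learning rate, and iteration count below are illustrative choices, not taken from the notes:

import numpy as np

# Toy quadratic loss J(theta) = (theta - 3)^2, with gradient 2*(theta - 3).
learning_rate = 0.1
theta = 0.0

for step in range(100):
    grad = 2.0 * (theta - 3.0)            # gradient of the loss at theta
    theta = theta - learning_rate * grad  # step opposite to the gradient

print(theta)                              # converges close to the minimizer theta = 3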


Challenges in Gradient-Based Learning

 Non-convexity: Neural networks typically have non-convex loss functions, meaning
there are many local minima and saddle points. The optimizer might converge to a
suboptimal solution.
 Vanishing/Exploding Gradients: In deep neural networks, gradients can become very
small (vanishing gradients) or very large (exploding gradients), making it difficult for the
network to learn.
 Overfitting: If the model is too complex or trained for too long, it might fit the training
data too closely and fail to generalize to new, unseen data.

Choices for gradient-based learning

• the choice of cost function

• how to represent the output of the model

1. Cost function wrt Gradient Based Learning


The choice of a cost function directly impacts how well the network learns from data. For most
cases, neural networks use cost functions based on the principle of maximum likelihood, where
the objective is to minimize the difference between the predicted probability distribution of the
network and the actual data distribution, typically using cross-entropy as the cost function.
1.1 Learning Conditional Distributions with Maximum Likelihood
 Cost Functions and Maximum Likelihood: The most common approach to designing
cost functions in neural networks is through maximum likelihood estimation. The cost
function is then the negative log-likelihood, equivalently the cross-entropy between the
training data and the model distribution:

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)

This approach derives cost functions from the model itself, removing the need for manually
designing them for each model.

 Mean Squared Error (MSE) and Gaussian Distribution: For cases where the model
assumes a normal (Gaussian) distribution for the outputs, the cost function
reduces to the mean squared error (MSE):

J(θ) = ½ E_{x,y∼p̂_data} ||y − f(x; θ)||² + const


Here, the MSE cost function is linked to the maximum likelihood estimation for models
predicting a Gaussian distribution with the mean being f(x; θ).
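A small numerical illustration of the two cost functions just discussed; the labels, probabilities, targets, and predictions below are made-up values, assuming a binary (Bernoulli) output for the cross-entropy case and a real-valued output for the MSE case:

import numpy as np

# Negative log-likelihood (cross-entropy) for a binary output:
# y are true labels, p are the model's predicted probabilities P(y = 1 | x).
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
eps = 1e-12                                   # avoids log(0)
nll = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(nll)

# Mean squared error for a real-valued output, with targets t and
# predictions f(x; theta):
t      = np.array([1.0, 2.0, 3.0])
t_pred = np.array([1.1, 1.8, 3.3])
mse = np.mean((t - t_pred) ** 2)
print(mse)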

1.2 Learning Conditional Statistics

Instead of learning a full conditional distribution p(y | x), we often want to learn just one conditional statistic of y given x, typically by minimizing the mean squared error (MSE) between the predicted value and the actual value of y. The goal is to find the function f(x) that minimizes the expectation of the squared error:

f∗ = argmin_f E_{x,y∼p_data} ||y − f(x)||²

f∗ is the optimal function that minimizes the mean squared error; solving this optimization yields f∗(x) = E_{y∼p_data(y|x)}[y], i.e. the function that predicts the mean of y for each value of x.
2. Output Units wrt Gradient Based Learning


2.1 Linear Units for Gaussian Output Distributions

A linear output unit produces ŷ = Wᵀh + b, which is interpreted as the mean of a conditional Gaussian distribution p(y | x) = N(y; ŷ, I).

Cost Function: Minimizing the negative log-likelihood of this Gaussian results in the mean squared error (MSE) cost function.

2.2 Sigmoid Units for Bernoulli Output Distributions: A sigmoid output unit produces ŷ = σ(wᵀh + b), interpreted as the probability P(y = 1 | x) of a Bernoulli distribution over a binary target.

2.3 Softmax Units for Multinoulli Output Distributions: A softmax output unit produces a vector of probabilities over n classes, softmax(z)_i = exp(z_i) / Σ_j exp(z_j), interpreted as P(y = i | x).
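A minimal sketch of these two output non-linearities; the input pre-activations ("logits") below are arbitrary values, and the max-shift inside softmax is a common numerical-stability trick rather than something the notes prescribe:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # P(y = 1 | x) for Bernoulli outputs

def softmax(z):
    z = z - np.max(z)                  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)               # P(y = i | x) for multinoulli outputs

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 3.0])))   # sums to 1 across the classes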


3. Hidden Units
The design of hidden units is an extremely important design decision. Rectified linear
units are an excellent default choice of hidden unit.
3.1 Rectified Linear Units (ReLU):

ReLU units use the activation function g(z) = max(0, z).

 ReLUs are easy to optimize because their gradients are large and consistent when active.
 This property makes them less prone to vanishing gradients.
 ReLUs are well-suited for deep learning models as they maintain strong gradients and
are computationally efficient.
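A minimal sketch of g(z) = max(0, z) and its gradient, illustrating the "large and consistent when active" property; the sample inputs are arbitrary:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)       # gradient is 1 wherever the unit is active, else 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))                         # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))                    # [0. 0. 0. 1. 1.]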
3.2 Sigmoid and Hyperbolic Tangent (Tanh)
Historically, most neural networks used the logistic sigmoid activation function.
 Sigmoid activation function: g(z) = σ(z), where σ(z) is the logistic function.
 Tanh activation function: g(z) = tanh(z).
 Tanh typically performs better than sigmoid because it outputs values centered at zero,
reducing the effect of saturated outputs.
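A small sketch comparing sigmoid and tanh and their gradients; the sample points are arbitrary, chosen only to show that both functions saturate (and their gradients shrink toward zero) for large |z|, which is the vanishing-gradient issue mentioned above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))                      # ~[0.007 0.5   0.993]
print(sigmoid(z) * (1 - sigmoid(z)))   # sigmoid gradient, tiny at the extremes
print(np.tanh(z))                      # ~[-1.  0.  1.], zero-centred
print(1 - np.tanh(z) ** 2)             # tanh gradient, also tiny at the extremes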


Back-Propagation and Other Differentiation Algorithms:

Feed-forward Propagation:

 In a feed-forward neural network, the input x is passed through multiple layers of the
network until an output y^ is produced. This process of information moving through the
network, from input to output, is called forward propagation.
 Forward propagation continues until it produces the final output, and during training, it
continues onward to compute a scalar cost, which measures how far the network’s
output y^ is from the true output y.
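A minimal sketch of one such forward pass through a network with a single ReLU hidden layer, ending in a scalar squared-error cost; the layer sizes, random weights, and loss choice are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))              # input
y = np.array([1.0])                    # target

W1 = rng.normal(size=(4, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)

h = np.maximum(0.0, x @ W1 + b1)       # hidden layer (ReLU)
y_hat = h @ W2 + b2                    # network output
J = 0.5 * np.sum((y_hat - y) ** 2)     # scalar cost measuring how far y_hat is from y
print(J)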

Back-propagation is a method for efficiently calculating the gradient of the cost function J(θ)
with respect to the model's parameters θ.
While back-propagation itself only refers to the gradient computation, the actual learning
process involves an optimization algorithm, such as stochastic gradient descent (SGD), that uses the
gradients computed by back-propagation to update the model's parameters.

Forward propagation moves inputs through the network to produce outputs, and back-propagation
helps in computing the gradients needed for optimizing the model’s parameters

The computational graph and the chain rule are crucial in the computation of gradients.

Computational graphs provide a more formal and structured way to visualize and describe the flow of
computations.

1. Computational Graph

A computational graph represents a computation as a graph: each node is a variable (a scalar, vector, matrix, or tensor) and each operation is a simple function of one or more variables, with edges showing which variables each operation reads and produces.
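A tiny example of evaluating such a graph node by node for ŷ = σ(xᵀw + b); the values are arbitrary, and each intermediate quantity corresponds to one node of the graph:

import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1

u1 = x @ w                              # node: dot product x'w
u2 = u1 + b                             # node: add bias
y_hat = 1.0 / (1.0 + np.exp(-u2))       # node: sigmoid
print(y_hat)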


Chain Rule:
1. Let x be a real number, and let f and g both be functions mapping from a real number to a
real number. Suppose that y = g(x) and z = f(g(x)) = f(y).
Then the chain rule states that

dz/dx = (dz/dy)(dy/dx)

2. For Vectors: Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)

or, in vector notation,

∇_x z = (∂y/∂x)ᵀ ∇_y z

where ∂y/∂x is the n × m Jacobian matrix of g.

The chain rule in the scalar case can be seen as multiplying simple
derivatives. In the vector case, it becomes a matrix multiplication of Jacobians.
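A small numerical check of the vector chain rule ∇_x z = (∂y/∂x)ᵀ ∇_y z; the pair of functions g and f below is made up for illustration, with the Jacobian and gradient written out by hand:

import numpy as np

def g(x):                              # g: R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]])

def f(y):                              # f: R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.5, -0.5])
y = g(x)

J = np.array([[x[1], x[0]],            # Jacobian of g at x
              [1.0,  1.0]])
grad_y = np.array([2.0 * y[0], 3.0])   # gradient of f at y

grad_x = J.T @ grad_y                  # chain rule: gradient of z w.r.t. x
print(grad_x)                          # matches differentiating f(g(x)) directly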


Algorithm 6.1: Algorithm for computational graph construction.

Algorithm 6.2: Algorithm to compute derivatives.


Algorithm 6.3: Forward propagation through a typical deep neural network and the computation of the
cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y.

To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the
parameters (weights and biases).
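A sketch in the spirit of this forward propagation and cost computation, with ReLU hidden units, a squared-error loss, and an L2 weight-decay regularizer Ω(θ); the layer sizes, λ, and data are illustrative assumptions, and the exact algorithm in the figure may differ in details:

import numpy as np

def forward_cost(x, y, weights, biases, lam=1e-4):
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = W @ h + b
        # ReLU on hidden layers, linear output on the last layer
        h = np.maximum(0.0, a) if k < len(weights) - 1 else a
    y_hat = h
    loss = 0.5 * np.sum((y_hat - y) ** 2)           # L(y_hat, y)
    omega = sum(np.sum(W ** 2) for W in weights)    # Omega(theta): sum of squared weights
    return loss + lam * omega                       # total cost J

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
biases  = [np.zeros(5), np.zeros(1)]
print(forward_cost(rng.normal(size=3), np.array([1.0]), weights, biases))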


Algorithm 6.4: Algorithm for backward computation.

Backward computation for the deep neural network of Algorithm 6.3, which uses, in addition to the input
x, a target y.

Explanation of Algorithm 6.4:
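As a rough illustration of this backward computation, below is a minimal sketch for a network with one ReLU hidden layer and a squared-error loss (no regularizer); the variable names and shapes are illustrative assumptions rather than the exact quantities of Algorithm 6.4:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)

# Forward pass, storing the intermediate values the backward pass needs.
a1 = W1 @ x + b1
h1 = np.maximum(0.0, a1)
y_hat = W2 @ h1 + b2

# Backward pass: propagate the gradient of the loss from the output backward.
g = y_hat - y                          # dJ/dy_hat for 0.5 * ||y_hat - y||^2
dW2, db2 = np.outer(g, h1), g          # gradients for the output layer
g = W2.T @ g                           # back through the output linear map
g = g * (a1 > 0)                       # back through the ReLU non-linearity
dW1, db1 = np.outer(g, x), g           # gradients for the hidden layer

print(dW1.shape, dW2.shape)            # (5, 3) (1, 5): same shapes as W1 and W2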


6 General Back-Propagation Algorithm


Algorithm 6.5: The outermost skeleton of the back-propagation algorithm.

This algorithm uses Algorithms 6.1 to 6.4 to compute the necessary components.
