Module 2 Deep Feed Forward Networks
A feed-forward network defines a mapping y = f(x; θ) and learns the values of the
parameters θ that result in the best function approximation. These models are called
feed-forward because information flows through the function being evaluated from x,
through the intermediate computations used to define f, and finally to the output y.
There are no feedback connections in which outputs of the model are fed back into itself.
When feed-forward neural networks are extended to include feedback connections, they
are called recurrent neural networks.
The XOR (exclusive or) function is a logical operation on two binary inputs, x1 and x2, where the output
is 1 if one, and only one, of the inputs is 1; otherwise, the output is 0. This function is challenging for
linear models because XOR is not linearly separable. To solve the XOR problem with a neural network,
we need a non-linear model. Here's a breakdown of how a simple feed-forward neural network with one
hidden layer can be used to solve this problem.
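As a concrete sketch, the following minimal NumPy network solves XOR with one ReLU hidden layer. The specific weight values used here are one well-known exact solution and are an assumption not stated above; in practice they would be found by gradient-based learning.

```python
import numpy as np

# Hidden layer: h = ReLU(x W + c); output: y_hat = h w + b.
# These exact parameter values solve XOR; normally they would be learned.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def xor_net(x):
    h = np.maximum(0.0, x @ W + c)   # rectified linear hidden units
    return h @ w + b                 # linear output unit

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor_net(X))  # -> [0. 1. 1. 0.]
```

The hidden layer maps the four input points into a space where they become linearly separable, which is exactly what a linear model alone cannot achieve.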
Figure 6.3: The rectified linear activation function. This activation function is the default activation
function recommended for use with most feed-forward neural networks.
Gradient-based learning is the backbone of many deep learning algorithms. This approach involves
iteratively adjusting model parameters to minimize the loss function, which measures the difference
between the actual and predicted outputs. At its core, gradient-based learning leverages
the gradient of the loss function to navigate the complex landscape of parameters.
Gradient
The gradient is a vector that points in the direction of the steepest increase of a
function. In the context of machine learning, it shows how to adjust the model's
parameters to increase or decrease the loss.
A positive gradient indicates that increasing the parameter will increase the loss, while a
negative gradient indicates that increasing the parameter will decrease the loss.
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function. The
algorithm updates the model’s parameters in the opposite direction of the gradient.
The size of the steps taken in the direction of the negative gradient is determined by a
hyper-parameter called the learning rate. A smaller learning rate makes more precise
but slower updates, while a larger learning rate speeds up learning but may cause
instability.
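A minimal sketch of the update rule on a toy quadratic loss (the loss function and learning-rate value are illustrative assumptions):

```python
# Minimize the toy loss L(theta) = (theta - 3)^2 by gradient descent.
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)   # derivative of the loss

theta = 0.0
learning_rate = 0.1              # step-size hyper-parameter
for _ in range(100):
    theta -= learning_rate * grad(theta)  # step against the gradient

print(round(theta, 4))  # -> 3.0, the minimizer of the loss
```

With a much larger learning rate (e.g. 1.5 here) the iterates would overshoot and diverge, illustrating the instability mentioned above.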
Deriving the cost function via maximum likelihood means the cost follows from the model
itself, removing the need to design one manually for each model.
Mean Squared Error (MSE) and Gaussian Distribution: For models that predict a Gaussian
distribution over the outputs with mean f(x; θ), the maximum-likelihood cost function
reduces to the mean squared error (MSE):

J(θ) = ½ 𝔼_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const

Here, the MSE cost function is linked to maximum likelihood estimation for models
predicting a Gaussian distribution with mean f(x; θ).

f∗: the optimal function that minimizes the mean squared error; it predicts the mean of
the true conditional distribution, f∗(x) = 𝔼_{y∼p_data(y|x)}[y].
2. Output Units and Gradient-Based Learning
Linear output units are typically paired with the assumption that y follows a conditional
Gaussian distribution whose mean is the unit's output. Cost function: minimizing the
negative log-likelihood of this Gaussian results in the mean squared error (MSE) cost function.
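A quick numerical check of this equivalence, assuming a fixed unit variance (the data values are illustrative):

```python
import numpy as np

# Average negative log-likelihood of y under a Gaussian with mean
# y_pred and fixed unit variance.
def gaussian_nll(y, y_pred):
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - y_pred) ** 2)

def half_mse(y, y_pred):
    return 0.5 * np.mean((y - y_pred) ** 2)

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 2.7])

# The NLL differs from half the MSE only by the constant 0.5*log(2*pi),
# so minimizing one minimizes the other.
const = 0.5 * np.log(2 * np.pi)
print(np.isclose(gaussian_nll(y, y_pred) - half_mse(y, y_pred), const))  # True
```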
3. Hidden Units
The design of hidden units is an extremely active area of research. Rectified linear units
are an excellent default choice of hidden unit.
3.1 Rectified Linear Units (ReLU):
ReLUs are easy to optimize because their gradients are large and consistent whenever the
unit is active. This property makes them less prone to vanishing gradients.
ReLUs are well-suited for deep learning models as they maintain strong gradients and
are computationally efficient.
3.2 Sigmoid and Hyperbolic Tangent (Tanh)
Prior to the introduction of rectified linear units, most neural networks used the logistic
sigmoid activation function.
Sigmoid activation function: g(z)=σ(z) where σ(z) is the logistic function.
Tanh activation function: g(z)=tanh(z)
Tanh typically performs better than the sigmoid because its outputs are centered at zero,
which reduces the problems caused by saturated outputs.
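The three activation functions above can be sketched as follows (the sample inputs are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

z = np.array([-5.0, 0.0, 5.0])
print(relu(z))     # gradient is 1 wherever the unit is active (z > 0)
print(sigmoid(z))  # saturates toward 0 and 1 for large |z|
print(tanh(z))     # zero-centered; saturates toward -1 and 1

# Saturation in action: the sigmoid's derivative nearly vanishes at z = 5,
# while the ReLU's derivative there is exactly 1.
print(sigmoid(5.0) * (1 - sigmoid(5.0)))  # ~0.0066
```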
Feed-forward Propagation:
In a feed-forward neural network, the input x is passed through multiple layers of the
network until an output ŷ is produced. This process of information moving through the
network, from input to output, is called forward propagation.
During training, forward propagation continues onward to compute a scalar cost, which
measures how far the network's output ŷ is from the true output y.
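A minimal sketch of forward propagation through two ReLU hidden layers ending in a scalar half-squared-error cost (the layer sizes, random weights, and loss choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 3 -> 4 -> 4 -> 1 network.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    h1 = np.maximum(0.0, x @ W1 + b1)   # hidden layer 1 (ReLU)
    h2 = np.maximum(0.0, h1 @ W2 + b2)  # hidden layer 2 (ReLU)
    return h2 @ W3 + b3                 # linear output y_hat

x = rng.normal(size=(1, 3))
y = np.array([[1.0]])
y_hat = forward(x)
cost = 0.5 * np.mean((y_hat - y) ** 2)  # scalar cost computed during training
print(cost.shape)  # () -- a single scalar, as required for gradient-based learning
```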
Back-propagation is a method for efficiently calculating the gradient of the cost function J(θ)
with respect to the model's parameters θ.
While back-propagation itself refers only to the gradient computation, the actual learning
process uses an optimization algorithm, such as stochastic gradient descent (SGD), that
applies the gradients computed by back-propagation to update the model's parameters.
Forward propagation moves inputs through the network to produce outputs, and back-propagation
helps in computing the gradients needed for optimizing the model’s parameters
Computational graphs and the chain rule are crucial in the computation of gradients.
Computational graphs provide a formal and structured way to visualize and describe the
flow of computations.
1. Computational Graph
Chain Rule:
1. Let x be a real number, and let f and g both be functions mapping from a real number to a
real number. Suppose that y = g(x) and z = f(g(x)) = f(y).
Then the chain rule states that

dz/dx = (dz/dy)(dy/dx)

2. For vectors: suppose x ∈ ℝᵐ, y ∈ ℝⁿ, g maps ℝᵐ to ℝⁿ, and f maps ℝⁿ to ℝ.
If y = g(x) and z = f(y), then

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ)(∂yⱼ/∂xᵢ)

and, in vector notation,

∇ₓz = (∂y/∂x)ᵀ ∇_y z,

where ∂y/∂x is the n × m Jacobian matrix of g.
The chain rule in the scalar case can be seen as multiplying simple
derivatives. In the vector case, it becomes a matrix multiplication of Jacobians.
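The vector form of the chain rule can be verified numerically; the functions g and f below are illustrative choices made for this example:

```python
import numpy as np

# y = g(x) with g(x) = (x1*x2, x1 + x2); z = f(y) = y1^2 + 3*y2.
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([2.0, 5.0])
y = g(x)

# Jacobian dy/dx of g, and the gradient of z with respect to y.
J = np.array([[x[1], x[0]],     # row 1: d(x1*x2)/dx1, d(x1*x2)/dx2
              [1.0,  1.0]])     # row 2: d(x1+x2)/dx1, d(x1+x2)/dx2
grad_y = np.array([2.0 * y[0], 3.0])

# Chain rule in vector form: grad_x z = J^T @ grad_y z.
grad_x = J.T @ grad_y

# Central finite-difference check of the same gradient.
eps = 1e-6
num = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(grad_x, num))  # True
```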
Algorithm 6.3: Forward propagation through a typical deep neural network and the computation of the
cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y.
To obtain the total cost J, the loss may be added to a regularizer Ω(θ ), where θ contains all the
parameters (weights and biases).
The backward computation for the deep neural network of algorithm 6.3 uses, in addition to the input
x, a target y.
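The forward and backward passes can be sketched together for a one-hidden-layer network. The shapes, the half-squared-error loss, and the finite-difference check are illustrative assumptions, not the book's exact notation:

```python
import numpy as np

rng = np.random.default_rng(1)

# One hidden layer: h = ReLU(x W1 + b1), y_hat = h W2 + b2.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
x = rng.normal(size=(1, 2))
y = np.array([[1.0]])

# Forward pass: store intermediate values for reuse in the backward pass.
a1 = x @ W1 + b1
h = np.maximum(0.0, a1)
y_hat = h @ W2 + b2
L = 0.5 * np.sum((y_hat - y) ** 2)      # scalar cost

# Backward pass: propagate gradients from the cost using the chain rule.
g = y_hat - y                   # dL/dy_hat
dW2 = h.T @ g                   # gradient for W2
db2 = g.sum(axis=0)             # gradient for b2
g = g @ W2.T                    # dL/dh
g = g * (a1 > 0)                # dL/da1 (gate through the ReLU)
dW1 = x.T @ g                   # gradient for W1
db1 = g.sum(axis=0)             # gradient for b1

# Finite-difference check on one weight to confirm the gradient.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((np.maximum(0.0, x @ W1p + b1) @ W2 + b2 - y) ** 2)
print(np.isclose(dW1[0, 0], (Lp - L) / eps, atol=1e-4))  # True
```

An optimizer such as SGD would then apply `W1 -= learning_rate * dW1` (and likewise for the other parameters) to complete one learning step.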