
What is a Deep Feed-Forward Network?

Deep Feed-Forward Networks (I will use the abbreviation DFN for the rest of the
article) are neural networks in which information flows in only one direction:
the input is fed forward through a function, let's say f*, toward the output.
There is no feedback mechanism in a DFN. Networks that do feed the output back
into the model are called Recurrent Neural Networks (I am also planning to
write about those later).

What are its applications?

DFNs are useful in many areas. One of the best-known applications of AI is
object recognition, which uses convolutional neural networks, a special kind of
feed-forward network.

applications of a DFN
What is happening in the kitchen of this business anyways?
Let's get back to our function f* and dive deeper. As I previously mentioned, we
have a function y = f*(x) that we would like to approximate with a function f(x).
Our input is x, we feed it through our function f*, and we get our result y.
This is simple. But how does this help us implement such functionality in the
field? Think of y as the outputs into which we want to classify the inputs x. A
DFN defines a mapping y = f(x; θ), learns the values of the parameters θ, and
maps the input x to the categories y.

fig 1.1 — schema of a DFN

As you can observe from the picture above, a DFN consists of many layers. You
can think of those layers as composite functions:

f(x) = f1(f2(f3(x)))

The function above has a depth of 3; the name "deep learning" is derived from
this terminology. The output layer is the outermost function, f1 in this case.
These chain structures are the most commonly used structures: the more composite
the function gets, the more layers it has. Some of those layers are called
hidden layers, and there is a reason for the name. The learning algorithm itself
figures out how to use these layers to achieve the best approximation of f*, and
we do not interfere with that process; the training data does not say
individually what each layer should do. In addition, the dimensionality of the
hidden layers defines the width of the network. These layers in turn consist of
units (also called nodes or neurons).
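To make the chain structure concrete, here is a minimal Python sketch (my own illustration, not taken from the article; the particular layer functions are arbitrary) in which each layer is an ordinary function and the whole network is their composition:

```python
# A toy network of depth 3 written as a composition of layer functions.
# The specific functions below are made up purely for illustration.
import numpy as np

def f3(x):
    # innermost (first) layer
    return np.tanh(x)

def f2(h):
    # second layer
    return np.tanh(2.0 * h)

def f1(h):
    # outermost function = output layer
    return 3.0 * h + 1.0

def f(x):
    # f(x) = f1(f2(f3(x))), the chain structure described above
    return f1(f2(f3(x)))

print(f(np.array([0.5, -1.0])))
```

Adding another function inside the chain would increase the depth by one, which is exactly what "going deeper" means.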

“Each unit resembles a neuron in the sense that it receives input from many other
units and computes its own activation value.” — Goodfellow et al., Deep Learning

fig 1.2 — a schema of a net with depth 2: one hidden layer

Units of a Model
Each element of the vector is viewed as a neuron. It is easier to think of them
as units acting in parallel, each receiving inputs from many other units and
computing its own activation value, rather than as a single vector-to-vector
function.
(retrieved from https://towardsdatascience.com/an-introduction-to-deep-feedforward-neural-networks-1af281e306cd)

What we do here is essentially translate this biological picture into
mathematical expressions. Every neuron receives its own input vector (e.g.
x = [x1, x2, x3, x4, …, xn]) from other neurons, multiplies those inputs by
distinct weights (w = [w1, w2, w3, …, wn]), and adds a bias b, a constant value
used to adjust the net output. The net output z = w · x + b is then fed through
a function g(z) called the activation function, and the result g(z) is sent on
to the other neurons.
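As a rough sketch of the computation just described (my own illustration; the specific numbers and the choice of sigmoid as g are arbitrary), a single unit can be written as:

```python
# One unit: net input z = w . x + b, then an activation function g.
import numpy as np

def unit_forward(x, w, b, g):
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return g(z)            # activation value passed on to downstream units

x = np.array([0.5, -1.2, 3.0])   # inputs coming from other units
w = np.array([0.4, 0.1, -0.6])   # this unit's weights
b = 0.2                          # bias term
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

print(unit_forward(x, w, b, sigmoid))
```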

Activation Function

The main idea behind an activation function is to define a threshold: if the
incoming signal is strong enough we accept (the neuron fires), and otherwise we
reject. In other words, we want functions that move from 0 to 1 over a fairly
narrow range of inputs. Several functions and techniques do this job:

 1. Binary-Step Function
 2. Linear Function
 3. Sigmoid Function*
 4. Hyperbolic Tangent Function
 5. Rectified Linear Unit Function* (ReLU)

Sigmoid Function

The sigmoid function is a non-linear, continuous function that maps any input to
a value between 0 and 1: σ(z) = 1 / (1 + e^(−z)).

fig 2.1 — sigmoid formula and the graph

A plot of the sigmoid function is shown in fig 2.1. This function was used
heavily in the past; nowadays, however, other activation functions are used more
frequently in the field. One of them is ReLU.
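As a quick sketch (my own illustration, not from the article), here is the sigmoid in code, showing how it squashes any input into (0, 1) and saturates for large magnitudes:

```python
# sigmoid(z) = 1 / (1 + e^(-z))
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    # close to 0 for very negative z, 0.5 at z = 0, close to 1 for very positive z
    print(z, sigmoid(z))
```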

Rectified Linear Unit Function (ReLU)


This activation function is very popular in the field. It has the formula g(z) = max(0, z):

fig 2.2 — ReLU

The plot of this function is shown in fig 2.3.

fig 2.3 — plot of the ReLU

fig 2.4 — graph of the derivative of ReLU

However, the derivative of this function is not defined at 0, since the function
has a kink there. In practice we simply take the derivative to be either 0 or 1
at that point (fig 2.4).
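Here is a minimal sketch (my own illustration) of ReLU together with the derivative convention just mentioned, assigning the value 0 at the undefined point z = 0:

```python
# g(z) = max(0, z) and its derivative, with the convention g'(0) = 0.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # The true derivative is undefined at z = 0 (the kink);
    # by convention we define it as 0 there (1 for z > 0, 0 otherwise).
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.5])
print(relu(z))             # [0.  0.  3.5]
print(relu_derivative(z))  # [0. 0. 1.]
```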
Some Architectural Considerations

I shall begin this part of the article by explaining what architecture means in
a deep feed-forward network. “It refers to the overall structure of the network:
how many units it should have and how these units should be connected to each
other” (Goodfellow et al., 2016). Most neural networks are organized into a
chain of layers. The first layer is given by

h1 = g_1(W_1 * x + b_1)

and the second layer is given by

h2 = g_2(W_2 * h1 + b_2)

The layering process can continue as far as we want, but you get the idea. In
these chain-based architectures, the main architectural consideration is to pick
a suitable depth for the network and a suitable width for its layers. In
practice, the ideal depth and width are found through careful observation and
experimentation.
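As an illustration of the two-layer organization above, here is a small Python sketch (my own; the layer widths, the random weights, and the use of ReLU are arbitrary choices, not taken from the article):

```python
# Forward pass of a depth-2 network: h1 = g_1(W_1 x + b_1), h2 = g_2(W_2 h1 + b_2).
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical sizes: 4 inputs, a hidden layer of width 3, 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

def forward(x):
    h1 = relu(W1 @ x + b1)   # first (hidden) layer
    h2 = W2 @ h1 + b2        # second (output) layer, linear in this sketch
    return h2

print(forward(rng.normal(size=4)))
```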

fig 3.1 — effect of depth in test accuracy


The depth of the neural network strongly affects test accuracy and overall
performance in applications. As shown in fig 3.1, architectures with more layers
tend to achieve better performance on this task.

Another consideration in architectural design is how the layers are connected to
each other. In some networks, the units of a layer have fewer connections to the
following layer, which reduces the amount of computation required. As the amount
of computation is reduced, the resources needed to train the model decrease
significantly. The Convolutional Neural Network (CNN), for instance, is a
special case of a neural network that uses this kind of sparse connectivity.
What is Gradient Based Learning in Deep
Learning?
Gradient-based learning is the backbone of many deep learning algorithms. This approach involves iteratively
adjusting model parameters to minimize the loss function, which measures the difference between the actual and
predicted outputs. At its core, Gradient-based learning leverages the gradient of the loss function to navigate the
complex landscape of parameters. In this blog, let’s discuss the essentials of Gradient-Based Learning, and if this
blog excites you for more, you can always dive into our Deep Learning Courses with Certificates online.

Cost Functions: The Mathematical Backbone


Learning Conditional Distributions with Max Likelihood

Maximizing the likelihood means finding the parameter values that make the observed data most probable. This is
often expressed as minimizing the negative log-likelihood (cross-entropy) cost, J(θ) = −E[log p_model(y | x)],
where the expectation is taken over the training data.

Learning Conditional Statistics


This involves understanding the relationships between variables and focusing on the conditional expectation rather
than the full distribution. The goal is to minimize the difference between predicted and actual values, often using
the mean squared error (MSE), J(θ) = E[‖y − f(x; θ)‖²], as the cost function.
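A tiny sketch (my own illustration) of the MSE cost between actual and predicted values:

```python
# Mean squared error: the mean of squared differences between targets and predictions.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.3])
print(mse(y_true, y_pred))  # 0.0466...
```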

Output Units: Adapting to Data Types

Linear Units for Gaussian Output Distributions


The linear unit is used for outputs resembling a Gaussian distribution. The output is a linear combination of its
inputs: y_hat = W^T h + b, where h is the vector of features from the previous layer.
Sigmoid Units for Bernoulli Output Distributions
These units are used for binary outcomes, modeled as y_hat = σ(w^T h + b), where σ(z) = 1 / (1 + e^(−z)) is the sigmoid function.

This function maps any input to a value between 0 and 1, ideal for binary classification.

Softmax Units for Multinoulli Output Distributions


For multi-class classification we use the softmax function, which generalizes the sigmoid function to multiple
classes: softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
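A minimal sketch of the softmax (my own illustration), written in the numerically stable form that subtracts the maximum logit before exponentiating:

```python
# softmax(z)_i = exp(z_i) / sum_j exp(z_j)
import numpy as np

def softmax(z):
    z = z - np.max(z)     # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / np.sum(e)  # class probabilities that sum to 1

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))    # roughly [0.659 0.242 0.099]
```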

Other Output Types


Deep learning can handle various output types, with specific functions tailored to different data distributions.

Role of an Optimizer in Deep Learning

Optimizers are algorithms designed to minimize the cost function. They play a critical role in Gradient-based
learning by updating the weights and biases of the network based on the calculated gradients.
Intuition Behind Optimizers with an Example
Consider a hiker trying to find the lowest point in a valley. They take steps proportional to the steepness of the
slope. In deep learning, the optimizer works similarly, taking steps in the parameter space proportional to the
gradient of the loss function.
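To make the hiker analogy concrete, here is a minimal sketch (my own illustration; the one-dimensional quadratic loss and the learning rate are arbitrary choices) of gradient descent:

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose minimum is at theta = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)   # dJ/dtheta

theta = 0.0          # the hiker's starting position
learning_rate = 0.1  # step size proportional to the slope

for step in range(50):
    theta -= learning_rate * grad(theta)   # step in the direction of steepest descent

print(theta)  # approaches 3.0
```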

Instances of Gradient Descent Optimizers

Batch Gradient Descent (GD)


This optimizer calculates the gradient using the entire dataset, ensuring a smooth descent but at a high
computational cost. The update rule is θ = θ − η · ∇θ J(θ), where η is the learning rate and the gradient ∇θ J(θ) is
computed over the whole training set.

Stochastic Gradient Descent (SGD)


SGD updates the parameters for each training example, leading to faster but less stable convergence. The update
rule is θ = θ − η · ∇θ J(θ; x(i), y(i)), where the gradient is computed on a single training example (x(i), y(i)).

Mini-batch Gradient Descent (MB-GD)


MB-GD strikes a balance between GD and SGD by using mini-batches of the dataset. It combines efficiency with
a smoother convergence than SGD.
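A small sketch (my own illustration, with synthetic data and arbitrary hyperparameters) of mini-batch gradient descent on a simple least-squares problem:

```python
# Mini-batch gradient descent fitting y ~ w * x by minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.5 * x + rng.normal(scale=0.1, size=200)   # synthetic data, true w = 2.5

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(x))                    # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]            # one mini-batch of indices
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b]) # d/dw of the batch MSE
        w -= lr * grad                               # update on the mini-batch

print(w)  # close to 2.5
```

Setting batch_size to 1 gives SGD, and setting it to len(x) gives batch gradient descent, which is exactly the trade-off described above.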

Challenges with All Types of Gradient-Based Optimizers

The journey of mastering Gradient-based learning in deep learning has its challenges. Each optimizer in deep
learning faces unique hurdles that can impact the learning process:

 Learning Rate Dilemmas: One of the foremost challenges in Gradient-based learning is selecting the optimal
learning rate. A rate that is too high can cause the model to oscillate or even diverge, missing the minimum.
Conversely, a rate that is too low leads to painfully slow convergence, increasing computational costs.

 Local Minima and Saddle Points: These are areas in the cost function where the gradient is zero, but they
are not the global minimum. In high-dimensional spaces, common in deep learning, these points become more
prevalent and problematic. This issue is particularly challenging for certain types of optimizers in deep
learning, as some may get stuck in these points, hindering effective learning.

 Vanishing and Exploding Gradients: A notorious problem in deeper networks. With vanishing gradients, as
the error is back-propagated to earlier layers, the gradient can become so small that it has virtually no effect,
stopping the network from learning further. Exploding gradients occur when large error gradients accumulate,
causing large updates to the network weights, leading to an unstable network. These issues are a significant
concern for Gradient-based learning.

 Plateaus: A plateau is a flat region of the cost function. When using Gradient-based learning, the learning
process can slow down significantly on plateaus, making it difficult to reach the minimum.

 Choosing the Right Optimizer: With various types of optimizers in deep learning, such as Batch Gradient
Descent, Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent (MB-GD), selecting the right
one for a specific problem can be challenging. Each optimizer has its strengths and weaknesses, and the choice
can significantly impact the efficiency and effectiveness of the learning process.

 Hyperparameter Tuning: In Gradient-based learning, hyperparameters like learning rate, batch size, and the
number of epochs need careful tuning. This process can be time-consuming and requires both experience and
experimentation.

 Computational Constraints: Deep learning models can be computationally intensive, particularly those that
leverage complex Gradient-based learning techniques. This challenge becomes more pronounced when
dealing with large datasets or real-time data processing.

 Adapting to New Data Types and Structures: As deep learning evolves, new data types and structures
emerge, requiring adaptability and innovation in Gradient-based learning methods.

These challenges highlight the complexity and dynamic nature of Gradient-based learning in deep learning.
Overcoming them requires a deep understanding of both theoretical concepts and practical implementations, often
covered in depth in our Certified Deep Learning Course.

Conclusion and Pathways to Learning


Mastering Gradient-based learning in deep learning requires a deep understanding of these concepts. Enrolling in
Deep Learning Courses with Certificates Online from JanBask Training can provide structured, comprehensive
insights into these complex topics. These courses combine theoretical knowledge with practical applications,
equipping learners with the skills needed to innovate in deep learning.
