What is Gradient Based Learning in Deep Learning
Basically, deep feed-forward networks (I will use the abbreviation DFN for the
rest of the article) are neural networks that only feed the input forward through a
function, let’s say f*, and only forward. There is no feedback mechanism in a
DFN. There are indeed networks that do feed information back from the output;
those are called recurrent neural networks (I am also planning to write about
those later).
DFNs are really useful in many areas. One of the best-known applications of AI
is so-called object recognition. It uses convolutional neural networks, a
special kind of feed-forward network.
applications of a DFN
What is happening in the kitchen of this business anyways?
Let’s get back to our function f* and dive deeper. As I previously mentioned, we have
a function y = f*(x) that we would like to approximate with a function f(x) by doing some
calculations. Clearly our input is x, we feed it through our function f*, and we
get our result y. This is simple. But how does this help us implement such
functionality in the field? Think of y as the outputs, the categories we want to
assign to inputs x. A DFN defines a mapping y = f(x; θ), learns the values of the
parameters θ, and maps the input x to the categories of y.
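To make the mapping y = f(x; θ) concrete, here is a minimal NumPy sketch of a one-layer classifier; the weight matrix W and bias b stand in for θ, and the layer sizes and random values are made-up illustrations, not a real trained model.

import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def f(x, W, b):
    # y = f(x; theta) with theta = (W, b): a score per category, turned into probabilities.
    return softmax(x @ W + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 4 input features, 3 categories (arbitrary sizes)
b = np.zeros(3)
x = rng.normal(size=(1, 4))   # one example with 4 features

probs = f(x, W, b)
print(probs, probs.argmax())  # class probabilities and the predicted category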
As you can observe from the picture above, a DFN consists of many layers. You
can think of those layers as composite functions, for example
f(x) = f1(f2(f3(x)))
This function has a depth of 3. The name deep learning is derived
from this terminology. The output layer is the outermost function, which is f1 in
this case. These chain structures are the most commonly used structures. The more
composite the function gets, the more layers it has. Some of those layers are
called hidden layers. There is a reason they are called hidden layers:
the learning algorithm itself figures out how to use these layers in
order to achieve the best approximation of f*, and we don’t interfere with the
process. Namely, the training data doesn’t individually say what each layer should
do. In addition, the dimension of the hidden layers defines the width of the
network. These layers also consist of units, or nodes, whatever you want to call them.
“Each unit resembles a neuron in the sense that it receives input from many other
units and computes its own activation value.” — Goodfellow et al., Deep Learning
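As a small illustration of this chain structure, here is a sketch in Python in which three toy layer functions are composed into one network of depth 3, with f1 as the output layer; the operations inside each layer are made up for illustration.

import numpy as np

def f3(x):                        # innermost layer: sees the raw input first
    return np.tanh(x)

def f2(h):                        # hidden layer
    return np.maximum(0.0, h)     # a ReLU, just as an example

def f1(h):                        # output layer: the outermost function
    return h.sum()

def f(x):
    # The depth-3 chain from the text: f(x) = f1(f2(f3(x)))
    return f1(f2(f3(x)))

print(f(np.array([0.5, -1.0, 2.0])))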
Units of a Model
Each element of the vector is viewed as a neuron. It’s easier to think of them as
units operating in parallel, each receiving inputs from many other units and
computing its own activation value, rather than as a vector-to-vector function.
retrieved from https://towardsdatascience.com/an-introduction-to-deep-feedforward-neural-networks-1af281e306cd
Activation Function
The main idea behind activation functions is to define a threshold such that if
something is likely to happen we accept it, and otherwise we reject it. Namely,
we need functions that jump from 0 to 1 in a very tight manner. There are several
functions and techniques that do this job (a short code sketch of them follows the list):
1. Binary-Step Function
2. Linear Function
3. Sigmoid Function*
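Here is a minimal sketch of these three activations in NumPy; the threshold of 0 in the binary step and the slope of 1 in the linear function are just illustrative choices.

import numpy as np

def binary_step(x, threshold=0.0):
    # Jumps from 0 to 1 exactly at the threshold.
    return np.where(x >= threshold, 1.0, 0.0)

def linear(x, slope=1.0):
    # Passes the input straight through (scaled); no squashing at all.
    return slope * x

def sigmoid(x):
    # Smoothly squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x), linear(x), sigmoid(x), sep="\n")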
Sigmoid Function
A plot of the sigmoid function is shown in fig 2.1. This function was used heavily
in the past. Nowadays, however, other activation functions are used more
frequently in the field, one of which is ReLU.
The derivative of ReLU is not defined at 0, since the function has a break
point there. So we simply take the derivative to be 0 or 1 at that point (fig 2.4).
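A quick sketch of ReLU and this derivative convention (here I arbitrarily return 0 at the break point; returning 1 is just as common):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Undefined at exactly 0; by convention we return 0 there (1 is also used).
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x), relu_derivative(x))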
Some Architectural Considerations
I shall begin this part of the article by explaining what architecture means in a
deep feed-forward network. “It refers to the overall structure of the network: how many
units it should have and how these units should be connected to each other”
(Goodfellow et al., 2016). Most neural networks are organized into chains of layers, such as
h1 = g_1(weight_1 * x + bias_1)
h2 = g_2(weight_2 * h1 + bias_2)
The layering process can go on for as long as we want, but you get the idea. In chain-
based architectures, the main architectural consideration is to pick the optimal depth
and width of the layers. The ideal values for depth and width are found
through careful observation and experiment.
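To make the layer equations above concrete, here is a minimal NumPy sketch of such a two-layer network; the layer sizes, the choice of ReLU and sigmoid for g_1 and g_2, and the random weights are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)

# Width of each layer (arbitrary choices for illustration).
n_in, n_hidden, n_out = 3, 4, 1

weight_1, bias_1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
weight_2, bias_2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

g_1 = lambda z: np.maximum(0.0, z)          # ReLU for the hidden layer
g_2 = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid for the output layer

x = rng.normal(size=n_in)
h1 = g_1(weight_1 @ x + bias_1)   # first (hidden) layer
h2 = g_2(weight_2 @ h1 + bias_2)  # output layer
print(h2)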
Maximizing likelihood means finding the parameter values that make the observed data most probable. This is
often expressed as
θ* = argmax_θ Σ_i log p(y_i | x_i; θ)
For binary classification, the output unit is typically the sigmoid σ(z) = 1 / (1 + e^(−z)). This function maps any
input to a value between 0 and 1, ideal for binary classification.
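As a sketch of what that means for a binary classifier: maximizing the likelihood is the same as minimizing the negative log-likelihood (binary cross-entropy). The tiny dataset and single parameter below are made up purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, x, y):
    # Probability the model assigns to class 1 for each example.
    p = sigmoid(theta * x)
    # Maximizing likelihood == minimizing this sum (binary cross-entropy).
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: positive x tend to be class 1, negative x class 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

for theta in (0.1, 1.0, 5.0):
    print(theta, negative_log_likelihood(theta, x, y))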
Optimizers are algorithms designed to minimize the cost function. They play a critical role in Gradient-based
learning by updating the weights and biases of the network based on the calculated gradients.
Intuition Behind Optimizers with an Example
Consider a hiker trying to find the lowest point in a valley. They take steps proportional to the steepness of the
slope. In deep learning, the optimizer works similarly, taking steps in the parameter space proportional to the
gradient of the loss function.
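In code, basic gradient descent mirrors the hiker: each update moves the parameters downhill by a step proportional to the gradient. Here is a minimal sketch on a simple one-parameter quadratic loss (the loss function and learning rate are illustrative choices, not a prescription).

def loss(theta):
    return (theta - 3.0) ** 2          # a simple "valley" with its lowest point at theta = 3

def gradient(theta):
    return 2.0 * (theta - 3.0)         # slope of the valley at theta

theta = -5.0          # where the hiker starts
learning_rate = 0.1   # how big each step is relative to the steepness

for step in range(50):
    theta -= learning_rate * gradient(theta)   # step downhill, proportional to the slope

print(theta, loss(theta))   # theta ends up close to 3, the bottom of the valley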
The journey of mastering Gradient-based learning in deep learning has its challenges. Each optimizer in deep
learning faces unique hurdles that can impact the learning process:
Learning Rate Dilemmas: One of the foremost challenges in Gradient-based learning is selecting the optimal
learning rate. A rate too high can cause the model to oscillate or even diverge, missing the minimum.
Conversely, a rate too low leads to painfully slow convergence, increasing computational costs.
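As a small illustration of both failure modes, here is a sketch on the same quadratic loss used in the hiker example above; the learning rates are arbitrary but representative.

def gradient(theta):
    return 2.0 * (theta - 3.0)   # gradient of the quadratic loss (theta - 3)^2

def run(learning_rate, steps=20):
    theta = -5.0
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return theta

print(run(1.1))    # too high: each step overshoots and theta moves ever farther from 3
print(run(0.001))  # too low: after 20 steps theta has barely moved from -5 toward 3
print(run(0.1))    # a reasonable rate: theta ends up close to 3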
Local Minima and Saddle Points: These are areas in the cost function where the gradient is zero, but they
are not the global minimum. In high-dimensional spaces, common in deep learning, these points become more
prevalent and problematic. This issue is particularly challenging for certain types of optimizers in deep
learning, as some may get stuck in these points, hindering effective learning.
Vanishing and Exploding Gradients: A notorious problem in deeper networks. With vanishing gradients, as
the error is back-propagated to earlier layers, the gradient can become so small that it has virtually no effect,
stopping the network from learning further. Exploding gradients occur when large error gradients accumulate,
causing large updates to the network weights, leading to an unstable network. These issues are a significant
concern for Gradient-based learning.
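A rough sketch of the vanishing-gradient effect: back-propagating through many sigmoid layers multiplies the gradient by at most 0.25 per layer (the sigmoid's maximum derivative), so it shrinks roughly geometrically with depth. The depth of 20 below is an arbitrary illustration.

SIGMOID_DERIVATIVE_MAX = 0.25   # the sigmoid's derivative never exceeds 1/4

gradient = 1.0
for layer in range(20):          # back-propagating through 20 sigmoid layers
    gradient *= SIGMOID_DERIVATIVE_MAX

print(gradient)   # about 9e-13: the earliest layers receive essentially no learning signal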
Plateaus: A plateau is a flat region of the cost function. When using Gradient-based learning, the learning
process can slow down significantly on plateaus, making it difficult to reach the minimum.
Choosing the Right Optimizer: With various types of optimizers in deep learning, such as Batch Gradient
Descent, Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent (MB-GD), selecting the right
one for a specific problem can be challenging. Each optimizer has its strengths and weaknesses, and the choice
can significantly impact the efficiency and effectiveness of the learning process.
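The three variants named here differ mainly in how much data each update sees. The sketch below shows that difference only; compute_gradient, the dataset, and the learning rate are all hypothetical placeholders, not a real training setup.

import numpy as np

# Hypothetical placeholders: a gradient function, a dataset, and a parameter vector.
def compute_gradient(params, X_part, y_part):
    # Stand-in for the gradient of the loss over just this slice of the data.
    return 2 * (params - y_part.mean())

def update(params, X_part, y_part, lr=0.01):
    return params - lr * compute_gradient(params, X_part, y_part)

X, y = np.random.randn(100, 3), np.random.randn(100)
params = np.zeros(3)

# Batch Gradient Descent: one update per pass, using the entire dataset.
params = update(params, X, y)

# Stochastic Gradient Descent (SGD): one update per single example.
for i in range(len(X)):
    params = update(params, X[i:i + 1], y[i:i + 1])

# Mini-batch Gradient Descent (MB-GD): one update per small batch of, say, 10 examples.
for start in range(0, len(X), 10):
    params = update(params, X[start:start + 10], y[start:start + 10])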
Hyperparameter Tuning: In Gradient-based learning, hyperparameters like learning rate, batch size, and the
number of epochs need careful tuning. This process can be time-consuming and requires both experience and
experimentation.
Computational Constraints: Deep learning models can be computationally intensive, particularly those that
leverage complex Gradient-based learning techniques. This challenge becomes more pronounced when
dealing with large datasets or real-time data processing.
Adapting to New Data Types and Structures: As deep learning evolves, new data types and structures
emerge, requiring adaptability and innovation in Gradient-based learning methods.
These challenges highlight the complexity and dynamic nature of Gradient-based learning in deep learning.
Overcoming them requires a deep understanding of both theoretical concepts and practical implementations, often
covered in depth in our Certified Deep Learning Course.