Deep Learning: Architectures: Deep Neural Network
Deep Learning is a subset of Machine Learning that is based on artificial neural networks
(ANNs) with multiple layers, also known as deep neural networks (DNNs). These neural
networks are inspired by the structure and function of the human brain, and they are designed to
learn from large amounts of data in an unsupervised or semi-supervised manner.
Architectures:
1. Deep Neural Network – It is a neural network with a certain level of complexity (having multiple hidden layers between the input and output layers). Such networks are capable of modeling and processing non-linear relationships.
2. Deep Belief Network (DBN) – It is a class of deep neural network composed of multiple layers of belief networks. Steps for training a DBN: a. Learn a layer of features from the visible units using the Contrastive Divergence algorithm. b. Treat the activations of the previously trained features as visible units and then learn features of features. c. Finally, the whole DBN is trained when the learning for the final hidden layer is achieved.
3. Recurrent Neural Network – Performs the same task for every element of a sequence and allows for both parallel and sequential computation, similar to the human brain (a large feedback network of connected neurons). RNNs are able to remember important things about the input they have received, which enables them to be more precise.
Applications:
1. Automatic Text Generation – A corpus of text is learned, and from this model new text is generated word-by-word or character-by-character. The model is capable of learning how to spell, punctuate, and form sentences, and it may even capture the style of the corpus.
2. Healthcare – Helps in diagnosing and treating various diseases.
3. Automatic Machine Translation – Words, sentences, or phrases in one language are transformed into another language (deep learning is achieving top results in the areas of text and images).
4. Image Recognition – Recognizes and identifies people and objects in images, and is also used to understand content and context. This area is already being used in gaming, retail, tourism, etc.
5. Predicting Earthquakes – Teaches a computer to perform viscoelastic computations
which are used in predicting earthquakes.
6. Deep learning has a wide range of applications in various fields such as computer
vision, speech recognition, natural language processing, and many more. Some of the
most common applications include:
7. Image and video recognition: Deep learning models are used to automatically classify
images and videos, detect objects, and identify faces. Applications include image and
video search engines, self-driving cars, and surveillance systems.
8. Speech recognition: Deep learning models are used to transcribe and translate speech
in real-time, which is used in voice-controlled devices, such as virtual assistants, and
accessibility technology for people with hearing impairments.
9. Natural Language Processing: Deep learning models are used to understand, generate
and translate human languages. Applications include machine translation, text
summarization, and sentiment analysis.
10. Robotics: Deep learning models are used to control robots and drones, and to improve
their ability to perceive and interact with the environment.
11. Healthcare: Deep learning models are used in medical imaging to detect diseases, in
drug discovery to identify new treatments, and in genomics to understand the
underlying causes of diseases.
12. Finance: Deep learning models are used to detect fraud, predict stock prices, and
analyze financial data.
13. Gaming: Deep learning models are used to create more realistic characters and
environments, and to improve the gameplay experience.
14. Recommender Systems: Deep learning models are used to make personalized
recommendations to users, such as product recommendations, movie
recommendations, and news recommendations.
15. Social Media: Deep learning models are used to identify fake news, to flag harmful
content and to filter out spam.
16. Autonomous systems: Deep learning models are used in self-driving cars, drones, and
other autonomous systems to make decisions based on sensor data.
APPLICATIONS OF CNN
CNNs are multistage architectures with convolution, pooling, and fully connected layers. They
have become an integral part of computer vision which aims at imitating the functionality of the
human eye and its brain insights by a machine to understand and process images. Its applications
include object detection, action recognition, scene labelling, character or handwriting
recognition, etc. Some important applications are discussed in the following sub-sections.
Object Detection
Object detection is a technology that deals with detecting real-world objects from a given scene. Technically, it detects instances of a particular class, such as animals, birds, cars, etc., from the
given image. As CNNs are getting deeper, many complex computer vision problems can be
solved using these deep CNNs. All applications of CNN are built on top of object detection.
Face Recognition
Face recognition is a problem that predicts whether there is a match between the face that is
input and those available in the database. The common facial features include eyes, nose, mouth,
and chin but sometimes the background of the image is also taken into account. Face recognition
is affected by many problems, including the following:
1. Identifying all possible faces in the image.
2. Focusing on each face regardless of lighting and perspective.
3. Finding the unique features of each face.
4. Comparing the identified features with those available in the database.
Scene labeling
Scene labeling is the process of labeling every pixel in the given image with the category of the object it belongs to. State-of-the-art systems incorporate CNNs with a recurrent architecture.
Optical Character Recognition (OCR)
OCR is one of the domains where CNN gives the best result. Traditional systems rely on
methodologies which need a large amount of training knowledge. But a system that uses
multilayer neural networks with CNN can be used to design highly accurate text detectors and
character modelers.
A simple system can provide extraction of text from scanned documents. However, we need a
system that can recognize text in unconstrained images, where characters can be found anywhere
on the image randomly with different formats (e.g., characters may be rotated through different
degrees, or have different pixel density, or have different foreground and background color, or
no restriction on noise level). For such cases, CNNs have proven to provide higher accuracy than
most traditional approaches and other neural networks.
Error Functions
The error function, also known as the loss function, is used to represent the difference between
the actual and predicted outputs. Some of the common loss functions are the least absolute
deviations, least square error and the cross-entropy loss function. Let the actual output be O and
the predicted output be y.
Least Absolute Deviations (LAD): LAD is also termed the L1-norm loss function. It is given by the formula

$$\text{LAD} = \sum_{i=1}^{n} |O_i - y_i|$$

where n is the number of samples.

Least Square Error (LSE): LSE is also termed the L2-norm loss function. It is given by the formula

$$\text{LSE} = \sum_{i=1}^{n} (O_i - y_i)^2$$
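As a quick check of the two formulas, here is a minimal NumPy sketch with made-up values for O and y:

```python
import numpy as np

# Hypothetical actual outputs O and predicted outputs y for n = 4 samples.
O = np.array([1.0, 0.0, 1.0, 1.0])
y = np.array([0.8, 0.1, 0.6, 0.9])

lad = np.sum(np.abs(O - y))    # L1-norm loss: sum of |O_i - y_i|
lse = np.sum((O - y) ** 2)     # L2-norm loss: sum of (O_i - y_i)^2

print(f"LAD = {lad:.2f}, LSE = {lse:.2f}")  # LAD = 0.80, LSE = 0.22
```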
Epoch
The number of times the weights are updated as they move towards the global minimum need not be until the difference between the predicted and actual outputs is zero. Most of the time, the user decides on the number of updates in order to reduce the computational complexity. This number is termed the epoch.
Weight Regularization
Weight regularization is useful for solving the problem of exploding gradients and for avoiding overfitting in machine learning problems. A regularization term is added to the loss function to encourage smaller weights. There are two types of regularization, namely L1 regularization and L2 regularization.
L1 Regularization
In L1 regularization, the sum of the absolute values of the weights is added to the loss function L(w, b), that is,

$$L_{\text{reg}}(w, b) = L(w, b) + \lambda \sum_{i} |w_i|$$

where λ controls the strength of the penalty.
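A one-line helper makes the penalty concrete; the weights and λ below are made-up values:

```python
import numpy as np

def l1_penalty(weights, lam=0.01):
    """L1 regularization term: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(weights))

w = np.array([0.5, -1.2, 0.0, 2.0])
base_loss = 0.3                          # hypothetical value of L(w, b)
total_loss = base_loss + l1_penalty(w)   # loss now favors smaller weights
print(total_loss)                        # 0.3 + 0.01 * 3.7 = 0.337
```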
Feature Map
The size of the feature map is controlled by three parameters, namely, depth, stride, and zero-
padding. These parameters must be decided before the convolution step is performed:
1. Depth: Depth corresponds to the number of filters used for the convolution operation. If
convolution is performed on an original image using n distinct filters, then it produces n different
feature maps. Thus, the depth of the feature map would be n.
2. Stride: Stride is the number of pixels by which the filter matrix is slid over the input matrix. When the stride is 1, the filters move one pixel at a time; when the stride is 2, the filters jump 2 pixels at a time. A larger stride produces smaller feature maps.
3. Zero-padding: Zero-padding is the process of adjusting the input size as per the requirement by adding zeros to the input matrix. It is mostly used in designing the CNN layers when the dimensions of the input volume need to be preserved in the output volume. Sometimes the filter does not perfectly fit the input image. In that case, we need to either pad the image with zeros so that it fits or drop the part of the image where the filter does not fit; the latter is called valid padding, which keeps only the valid part of the image. When we add zero-padding, the convolution is called wide convolution; when zero-padding is not added, it is called narrow convolution.
The pooling layer reduces the number of parameters when the images are too large. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each feature map but retains the important information.
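The effect of stride and zero-padding on the feature map size can be computed with the standard formula (W − F + 2P)/S + 1; a small sketch:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size of a feature map: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 32x32 input, 5x5 filter, stride 1, no padding -> 28x28 feature map
print(conv_output_size(32, 5, stride=1, padding=0))   # 28
# Same filter with stride 2 -> smaller 14x14 map (larger stride, smaller map)
print(conv_output_size(32, 5, stride=2, padding=0))   # 14
# Zero-padding of 2 preserves the 32x32 input size ("wide" convolution)
print(conv_output_size(32, 5, stride=1, padding=2))   # 32
```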
ARCHITECTURES OF CNN
CNNs are a special type of network specifically designed to identify patterns from images. There
are several CNN architectures designed to solve a particular image processing problem such as
optical character recognition (OCR), object detection, face recognition, etc. The architectures
differ in the number of parameters, number of layers, and type of layers. The error rate decreases
as the architecture gets deeper. But some architectures, such as GoogLeNet, achieved minimal
error rates with reduced number of parameters.
1. LeNet: Introduced in 1998 by LeCun et al., LeNet was the first deep convolutional
architecture. It was used to perform OCR by several banks to recognize handwritten
numbers on cheques digitized in 32 x 32 grayscale images. Though its problem-solving ability was admirable, it was constrained by the computing resources available at the time because of its high computational costs.
2. AlexNet: AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. It was the first CNN architecture that triumphed in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2012 contest, with an error rate nearly half that of its nearest competitor (15.3% versus 26%). This victory dramatically stimulated the trend toward deep learning architectures in computer vision. It has 5 convolutional layers and 3 fully connected layers, summing up to 8 layers in total. ReLU was applied after each convolutional and fully connected layer, whereas dropout was applied before the first and second fully connected layers. It was trained for 6 consecutive days on two NVIDIA GeForce GTX 580 GPUs, so the processing was split into two pipelines.
3. ZFNet: ZFNet was the winner of the ILSVRC 2013. It achieved a top-5 error rate of 14.8%, which was already a significant improvement over AlexNet. This was achieved mostly by tweaking the hyperparameters of AlexNet while maintaining the same structure with additional deep learning elements.
4. GoogLeNet: Since 2012, CNN architectures have been coming out with flying colors in
the ILSVRC contest. GoogLeNet (also known as Inception) is the winner of the ILSVRC 2014 contest. It achieved an error rate of 6.67% with its 22 layers, which was very close to human-level performance.
5. VGGNet: The runner-up at the ILSVRC 2014 competition was dubbed VGGNet by the community. It was developed by Simonyan and Zisserman. VGGNet consists of 16 convolutional layers and is very appealing because of its very uniform architecture. It is currently one of the most preferred choices in the community for extracting features from images. The weight configuration of VGGNet is publicly available and has been used in many applications and challenges as a baseline feature extractor. However, VGGNet consists of 138 million parameters, which can be a bit challenging to handle.
Backpropagation Through Time (BPTT)
The implementation steps of BPTT are as follows:
STEP 1: Generate input and output data.
STEP 2: Normalize the data with respect to maximum and minimum values.
STEP 3: Assume the number of hidden layers and number of neurons in the hidden layer.
STEP 4: Initialize the weights between 0 and 1. Let [v] be the weights connecting input neurons
and hidden neurons, and [w] be the weights connecting hidden and output neurons in the case of
a single hidden layer.
STEP 5: Compute the input and output at every neuron present in each layer. Let the last hidden layer compute the output using the sigmoidal activation function given by

$$f(x) = \frac{1}{1 + e^{-x}}$$

STEP 6: Find the error and check it against the tolerance. If the error is above the tolerance, update the weights.
STEP 7: Repeat from Step 5 until the error is within the tolerance.
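A minimal NumPy sketch of Steps 1-7 for a single hidden layer (the toy data, layer sizes, and learning rate are made-up choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# STEP 1-2: toy input/output data, already normalized to [0, 1]
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]])
T = np.array([[1.0], [0.0], [1.0]])

# STEP 3-4: one hidden layer of 3 neurons; weights initialized in [0, 1]
rng = np.random.default_rng(0)
v = rng.random((2, 3))            # [v]: input -> hidden weights
w = rng.random((3, 1))            # [w]: hidden -> output weights
lr, tolerance = 0.5, 1e-3

for epoch in range(10000):
    # STEP 5: forward pass with sigmoidal activations
    h = sigmoid(X @ v)
    y = sigmoid(h @ w)
    # STEP 6: compute the error and check it against the tolerance
    error = 0.5 * np.sum((T - y) ** 2)
    if error < tolerance:
        break                     # within tolerance: stop (STEP 7)
    # STEP 6 (cont.): update the weights by backpropagating the error
    delta_out = (y - T) * y * (1 - y)
    delta_hid = (delta_out @ w.T) * h * (1 - h)
    w -= lr * h.T @ delta_out
    v -= lr * X.T @ delta_hid
```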
RNN Topology
RNNs do not have the limitation of performing the transformation from input to output in a
constant number of steps given by the constant number of layers in the model. Sequences in the
input, the output, or both are possible. This means that RNNs can be organized in various ways
to resolve specific problems.
RNN topologies are as follows:
1. One-to-One represents the vanilla mode of processing without recurrent nets, from fixed-sized input to fixed-sized output (e.g., image classification).
2. One-to-Many represents a sequence output (e.g., image captioning acquires an image as input
and outputs a sentence of words).
3. Many-to-One denotes sequence input (e.g., sentiment analysis where a known sentence is
classified as stating positive or negative sentiment).
4. Many-to-Many for Sequence Input and Sequence Output (e.g., language translation: an RNN
examines a sentence in English and then outputs it in French).
5. Many-to-Many for Synced Sequence Input and Output (e.g., video classification where we
want to label every frame). It should be noted that there are no specific constraints on the lengths of the sequences because the recurrent transformation is not fixed and can be applied as many times as required.
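A short PyTorch sketch of why sequence length is not fixed by the architecture (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# One recurrent layer: the same transformation is reused at every time
# step, so the architecture does not fix the sequence length.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

short_seq = torch.randn(1, 5, 8)     # 5 time steps
long_seq = torch.randn(1, 50, 8)     # 50 time steps, same weights

out_short, h_short = rnn(short_seq)  # out_short: (1, 5, 16)
out_long, h_long = rnn(long_seq)     # out_long: (1, 50, 16)

# Many-to-one (e.g., sentiment): use only the final hidden state h.
# Many-to-many (e.g., video labeling): use the full output sequence.
```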
RNN Applications
The following are the most common applications of RNNs:
1) Speech Recognition: Identifies spoken words and phrases and converts them into a machine-readable format. For example, an audio clip X is taken as input and mapped to a text transcript Y.
2) Music Generation: Here, the input may be the genre of music to be generated and the output is a music sequence.
3) Machine Translation: Conversion of a sentence from one language to another.
4) DNA Sequence Analysis: DNA is represented by the alphabets A, C, G, and T. A DNA sequence is given as input and the RNN labels the part of the sequence that codes for a protein.
5) Video Activity Recognition: Identifies an activity from a sequence of video frames.
6) Sentiment Classification: Classification of text data according to the sentiment polarity of the views contained in it.
LSTM
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is
specifically designed to handle sequential data, such as time series, speech, and text. LSTM
networks are capable of learning long-term dependencies in sequential data, which makes them
well suited for tasks such as language translation, speech recognition, and time series forecasting.
Structure Of LSTM:
LSTM has a chain structure that contains four neural networks and different memory blocks
called cells.
Information is retained by the cells and the memory manipulations are done by the gates. There
are three gates –
1. Forget Gate: The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the particular time step) and h_{t-1} (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through an activation function which gives a binary output: if the output is 0 for a particular cell state, the piece of information is forgotten, and for an output of 1, the information is retained for future use.
2. Input Gate: The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs h_{t-1} and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and contains all the possible values from h_{t-1} and x_t. Finally, the values of the vector and the regulated values are multiplied to obtain the useful information.
3. Output Gate: The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function on the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs h_{t-1} and x_t. Finally, the values of the vector and the regulated values are multiplied and sent as the output, and as the input to the next cell.
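The three gates can be written compactly in code. Below is a minimal NumPy sketch of a single LSTM time step (the dimensions and the stacked parameter layout are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of all four
    internal networks stacked: forget, input, candidate, output."""
    z = W @ x_t + U @ h_prev + b          # joint pre-activations
    n = h_prev.shape[0]
    f = sigmoid(z[0:n])                   # forget gate: what to discard
    i = sigmoid(z[n:2*n])                 # input gate: what to store
    g = np.tanh(z[2*n:3*n])               # candidate values in (-1, +1)
    o = sigmoid(z[3*n:4*n])               # output gate: what to expose
    c_t = f * c_prev + i * g              # new cell state
    h_t = o * np.tanh(c_t)                # new hidden state / output
    return h_t, c_t

# Toy dimensions: 4 input features, hidden size 3 (so 4 * 3 = 12 gate rows)
rng = np.random.default_rng(1)
W = rng.standard_normal((12, 4))
U = rng.standard_normal((12, 3))
b = np.zeros(12)
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_cell_step(rng.standard_normal(4), h, c, W, U, b)
```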
Vanishing gradient and Exploding gradient
As the backpropagation algorithm advances downwards(or backward) from the output layer
towards the input layer, the gradients often get smaller and smaller and approach zero which
eventually leaves the weights of the initial or lower layers nearly unchanged. As a result, the
gradient descent never converges to the optimum. This is known as the vanishing
gradients problem.
On the contrary, in some cases, the gradients keep on getting larger and larger as the
backpropagation algorithm progresses. This, in turn, causes very large weight updates and causes
the gradient descent to diverge. This is known as the exploding gradients problem.
Certain activation functions, like the logistic (sigmoid) function, have a large difference between the variance of their inputs and that of their outputs. In simpler words, they shrink and transform a large input space into a smaller output space that lies in the range [0, 1]. For large inputs (negative or positive), the sigmoid function saturates at 0 or 1, with a derivative very close to zero. Thus, when the backpropagation algorithm chips in, it has virtually no gradients to propagate backward through the network, and whatever little residual gradient exists keeps diluting as the algorithm progresses down from the top layers, leaving nothing for the lower layers.
Similarly, in some cases, the initial weights assigned to the network may generate a large loss. The gradients can then accumulate during an update and result in very large gradients, which eventually lead to large updates to the network weights and an unstable network. The parameters can sometimes become so large that they overflow and result in NaN values.
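A rough numerical illustration of both effects (the layer count and weight value are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative s(x)(1 - s(x)) peaks at 0.25, so by the chain
# rule each extra layer can shrink the gradient by a factor of at least 4.
s = sigmoid(0.0)
grad = 1.0
for layer in range(10):
    grad *= s * (1 - s)          # best case: derivative at x = 0
print(grad)                      # ~9.5e-07 after only 10 layers: vanishing

# Conversely, repeatedly multiplying by a large weight blows up:
grad = 1.0
for layer in range(10):
    grad *= 5.0                  # a large weight in the chain
print(grad)                      # ~9.8e+06: exploding gradient
```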
Autoencoder
An autoencoder is a feedforward network that uses the backpropagation algorithm to learn
weights. It has a simpler architecture compared to deep learning architectures. It is a two-layer architecture, as the input layer is not counted as a layer in artificial neural network (ANN) terminology. The autoencoder has an input layer, a hidden layer, and an output layer. The only difference is that the output should be the same as the input. This
differentiates autoencoders from the architectures seen so far. That is, if (1, 1, 0, 0, 0, 1) is given
as input, the autoencoder neural network tries to output (1, 1, 0, 0, 0, 1). This validates the
requirement that the number of input neurons should be the same as the number of output
neurons.
The hidden layer captures the best representative features of the input. That is, the hidden layer
stores the representation of the input. Let us assume that the number of hidden nodes is much smaller than the number of input nodes. This type of autoencoder, in which the dimension of the hidden layer is less than the dimension of the input layer, is called an undercomplete autoencoder.
The values of the hidden layer are viewed as a compressed version of the input. In this example,
let the number of input and output neurons be 6. The hidden layer has a smaller number of
neurons than the number of neurons in the input layer.
Let us have 2 neurons in the hidden layer. The autoencoder takes 6 features and encodes them using just 2 features, and these 2 features are enough to reconstruct the 6 features. That is, we have just performed dimensionality reduction: the dimension of the dataset is reduced from 6 to 2.
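A minimal PyTorch sketch of this 6-2-6 undercomplete autoencoder (the optimizer, learning rate, and step count are illustrative choices):

```python
import torch
import torch.nn as nn

# Undercomplete autoencoder matching the example: 6 inputs are encoded
# into 2 hidden features and reconstructed back to 6 outputs.
model = nn.Sequential(
    nn.Linear(6, 2),   # encoder: compress 6 features to 2
    nn.Sigmoid(),
    nn.Linear(2, 6),   # decoder: reconstruct the 6 features
    nn.Sigmoid(),
)

x = torch.tensor([[1., 1., 0., 0., 0., 1.]])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)   # the target equals the input
    loss.backward()
    optimizer.step()
```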
FEATURES OF AUTOENCODER
Autoencoders exhibit the following features:
1. Data Dependent: Autoencoders are compression techniques where the model can be used only on data similar to that on which it has been trained. For example, an autoencoder trained to compress house images cannot be used to compress human faces.
2. Lossy Compression: Reconstruction of the original data from the compressed representation results in a degraded output. This is illustrated in the following figure.
Types of Autoencoders
The types of autoencoders are as follows:
1. Vanilla autoencoder
2. Multilayer autoencoder
3. Stacked autoencoder
4. Deep autoencoder
5. Denoising autoencoder
6. Convolutional autoencoder
7. Regularized autoencoder
Vanilla Autoencoder
The following figure is an illustration of a vanilla autoencoder. This is an ordinary autoencoder
with no added features. It may be noted that the layers are fully connected.
Multilayer Autoencoder
When a vanilla autoencoder has more than one hidden layer, it is termed a multilayer autoencoder. A multilayer autoencoder is a multilayer perceptron (MLP) with symmetry between the encoder and decoder sides. The following figure shows the architecture of a simple multilayer autoencoder.
The values in any layer can be used as an intermediate representation of the features, but usually the network is symmetric and the middle layer is used for the feature representation. The loss function is like that of a vanilla autoencoder.
Stacked Autoencoder
Stacked autoencoders stack multiple hidden layers in the encoder and the decoder. The training does not happen end-to-end as in multilayer perceptrons using the backpropagation algorithm. Rather, when there are multiple hidden layers, the first hidden layer (h¹) is trained and its parameters are identified using backpropagation; that is, the stacked autoencoder looks like a simple autoencoder, as shown in the following figure.
To train the second hidden layer (h²), h¹ is given in the input and output layers and h² is used as the code layer. To find h³, we use h², and so on. This is shown in the following figure. The size of the hidden layers continuously decreases, and each hidden layer is expected to learn a more abstract feature. The final code layer can be given as input to a supervised learner to perform classification or regression.
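A minimal PyTorch sketch of this greedy layer-wise training (the 8-4-2 layer sizes and training settings are illustrative assumptions):

```python
import torch
import torch.nn as nn

def train_layer(data, in_dim, code_dim, epochs=1000):
    """Train one autoencoder layer (data -> code -> data), return the encoder."""
    enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
    dec = nn.Linear(code_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)
        loss.backward()
        opt.step()
    return enc

x = torch.rand(100, 8)
enc1 = train_layer(x, 8, 4)     # learn h1 from the raw input
h1 = enc1(x).detach()
enc2 = train_layer(h1, 4, 2)    # learn h2, treating h1 as input and output
code = enc2(h1).detach()        # final code layer for a supervised learner
```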
Denoising Autoencoder
A denoising autoencoder adds random noise to the input and forces the autoencoder to learn the original data after removing the noise. The autoencoder is trained in such a way that it identifies the noise, removes it, and learns only the required features of the original data. The general architecture of the denoising autoencoder is given in the following figure.
The loss function still checks the difference between the input data and output data. This ensures
that there is no overfitting of data and the autoencoder can remove the noise and learn the
important features of the input data after removing noise.
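A minimal PyTorch sketch of denoising training, where noise is added to the input but the loss is computed against the clean data (the noise level and sizes are made up):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(6, 2), nn.Sigmoid(), nn.Linear(2, 6), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=0.01)

clean = torch.rand(64, 6)
for step in range(1000):
    noisy = clean + 0.2 * torch.randn_like(clean)        # add random noise
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(noisy), clean)   # target is the clean input
    loss.backward()
    opt.step()
```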
Variational Autoencoder
In a vanilla autoencoder, the encoder learns the features of the input in a compressed form. In variational autoencoders, the encoder also learns the features of the input, but it outputs a vector of means and a vector of standard deviations. If ε is a sample from the unit Gaussian distribution (a unit Gaussian distribution has mean 0 and standard deviation 1), then σε + μ is a sample with mean μ and standard deviation σ. The decoder takes as input a sample drawn from the vectors of means and standard deviations. This helps the decoder generate an output in the category of the input data, but different from the actual input data. This is shown in the following example.
Let
Mean vector (μ) = [0.2, 0.9, 0.3, 0.7, ...]
Standard deviation vector (σ) = [0.3, 0.4, 1, 1.3, ...]
Then the samples generated will be of the form μ_i + σ_i ε_i, with each ε_i drawn from N(0, 1). From this, a sample of encoding vectors is stochastically generated. The architecture of the variational autoencoder is represented in the following figure.
Loss Function of Variational Autoencoder
The loss function is a combination of the reconstruction loss and the latent loss. The reconstruction loss is the usual mean squared error. The latent loss is the KL divergence, which measures how closely the latent variables match the unit Gaussian distribution N(0, 1). The loss function can be written as

$$\mathcal{L} = \|x - \hat{x}\|^2 + \mathrm{KL}\big(N(\mu, \sigma)\,\|\,N(0, 1)\big)$$
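In code, both terms take a standard form. A minimal PyTorch sketch, using the common log-variance parameterization (an assumption on my part):

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction loss (MSE) plus the KL divergence of N(mu, sigma)
    from the unit Gaussian N(0, 1), summed over latent dimensions."""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def sample_latent(mu, log_var):
    """Reparameterization: sample = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```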
RBM
Restricted Boltzmann Machine is an undirected graphical model. Restricted Boltzmann
Machines are shallow, two-layer neural nets that constitute the building blocks of deep-belief
networks. The first layer of the RBM is called the visible, or input layer, and the second is the
hidden layer. Each circle represents a neuron-like unit called a node.
Based on the distribution used and the structure of the hidden layers there are many types of
possible RBM. Some of these are as follows:
1. Bernoulli-Bernoulli RBM: The units in the RBMs considered so far are random variables taking binary values. The probability density function is conditioned on a Bernoulli distribution. This can be referred to as the Bernoulli-Bernoulli RBM, where the visible and hidden units are modeled using the Bernoulli distribution.
2. Gaussian-Bernoulli RBM: There are variants of the RBM that model visible units as
Gaussian and hidden units as Bernoulli and are referred to as Gaussian-Bernoulli RBM.
This allows the visible units to take real-valued input that are modelled using a normal
distribution.
3. Conditional RBM: In this RBM, the visible units are modelled using the Gaussian distribu-
tion and the hidden units use rectified linear unit transformation. Using binary values in the
hidden layer restricts the number of latent features that can be represented. Rectified linear
units help to represent more features.
4. Deep Belief Network (DBN): When the features can be represented as a hierarchy, deep
belief networks are useful. They stack RBMs to represent the features of the training data as
a hierarchy. The architecture of deep belief networks is given in following figure.
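To make the Bernoulli-Bernoulli case concrete, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) training step (the sizes, learning rate, and sampling details are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM."""
    # Positive phase: sample hidden units given the visible data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct visible units, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Update parameters toward the data statistics and away from the model's
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

v = rng.integers(0, 2, size=(16, 6)).astype(float)   # 16 binary samples
W = 0.01 * rng.standard_normal((6, 4))               # 6 visible, 4 hidden units
b_v, b_h = np.zeros(6), np.zeros(4)
for _ in range(100):
    cd1_step(v, W, b_v, b_h)
```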
Comparison of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs):
1. Primary Units: CNN – convolutional filters or kernels that slide over the input to detect spatial patterns; RNN – recurrent neurons with loops that maintain a hidden state capturing previous inputs over time.
2. Data Type Processed: CNN – primarily processes grid-like data, such as images or videos; RNN – primarily processes sequential data, such as time series, speech, or text.
3. Input Data Structure: CNN – typically fixed-size 2D or 3D data; RNN – variable-length sequences.
4. Memory Requirement: CNN – generally lower, as each layer's operation is mostly local to spatial regions; RNN – higher memory needs due to retention of past information and backpropagation through time.
Advantages of Reinforcement Learning
1. Adaptive Decision-Making: RL enables autonomous learning through trial and error, ideal for
dynamic, unpredictable environments. This is valuable in applications like real-time strategy
games, robotic navigation, and personalized recommendations.
2. Versatility: RL handles diverse tasks, from strategic games to real-world applications like
autonomous vehicles, finance, and healthcare, often outperforming traditional systems in
complex settings.
3. Generalization: With advancements like transfer learning, RL agents can generalize across tasks,
adapting efficiently to new environments with minimal retraining.
4. Deep Reinforcement Learning (DRL): Combining deep learning with RL enhances capabilities,
allowing agents to handle high-dimensional data like images or sensory inputs, where traditional
RL would struggle.
Limitations of Reinforcement Learning
1. Sample Inefficiency: RL agents often need millions of interactions to learn effective strategies, which is costly in real-world applications like robotics or healthcare.
2. Reward Engineering: Crafting effective reward structures is critical; poorly designed rewards can lead to unintended or suboptimal behaviors (reward hacking).
3. Credit Assignment: In tasks with delayed rewards, it is difficult to identify which actions led to success, requiring complex algorithms to manage this "credit assignment problem."
4. Interpretability: RL models, especially deep ones, often function as black boxes, making it hard to understand their decision-making, which limits their use in fields needing high interpretability, like healthcare.
5. Ethical and Safety Concerns: In high-stakes environments, RL agents can act unpredictably, posing risks. Ensuring safe exploration and ethical adherence remains a complex and developing area.
Key Components of Reinforcement Learning
1. Agent: The decision-making entity that learns to interact with the environment. It aims to maximize the long-term reward based on the feedback it receives from the environment.
2. Environment: The external system the agent interacts with, providing feedback on the agent's actions. The environment changes in response to these actions, presenting new states for the agent to observe and act upon.
3. State (s): A representation of the situation of the environment at a given time; the agent observes the current state and uses it to decide its next action.
4. Action (a): A decision taken by the agent to interact with the environment. The agent selects
actions based on a policy (its learned strategy), which evolves as the agent learns.
5. Reward (r): A scalar value the environment provides as feedback after each action, indicating
the immediate success or failure of that action. Rewards are the primary feedback mechanism
guiding the agent toward the desired behavior.
6. Policy (π): The strategy or mapping that the agent uses to select actions based on its
current state. The policy can be deterministic (specific action for each state) or stochastic
(probabilistic approach to selecting actions).
7. Value Function: A function estimating the expected cumulative reward for each state (or state-
action pair), representing the long-term benefit of a state. This helps the agent gauge the overall
utility of different actions over time, beyond immediate rewards.
8. Q-Function: A specific type of value function that represents the expected cumulative reward of
taking a particular action in a given state and following the policy thereafter.
9. Exploration vs. Exploitation: The balance between exploring new actions (to discover
potentially better rewards) and exploiting known actions (to maximize reward based on current
knowledge). Effective RL requires a careful balance of both.
10. Learning Algorithm: The process by which the agent updates its policy and value functions
based on experiences. Common algorithms include Q-learning, policy gradient methods, and
deep reinforcement learning.
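Item 10 above mentions Q-learning; the following is a minimal tabular Q-learning sketch on a made-up five-state chain environment (the environment, rewards, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain: states 0..4, actions 0 (left) and 1 (right);
# reaching state 4 ends the episode with reward 1.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3   # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != 4:
        # Exploration vs. exploitation: epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # Q-update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```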
Working of Reinforcement Learning
1. Initial Interaction: The agent begins in an initial state, observes it, and selects an action based
on its policy.
2. Feedback and Update: The environment responds to the agent’s action, providing a new state
and a reward. The agent uses this feedback to adjust its policy, gradually improving its decisions.
3. Iteration and Optimization: Through repeated interactions, the agent learns the optimal policy
to maximize cumulative rewards. This learning continues until it converges to an optimal or
near-optimal strategy.
In essence, reinforcement learning involves a cycle of interaction, feedback, and policy refinement,
allowing the agent to adapt and optimize its actions for long-term success in complex, often dynamic
environments.
Comparison of traditional neural networks (TNNs) and modular neural networks (MNNs):
1. Scalability: TNN – limited scalability; larger TNNs become difficult to train and optimize; MNN – highly scalable; additional modules can be added without extensive retraining.
2. Error Isolation: TNN – errors are distributed across the entire network, which can affect overall performance; MNN – errors are often contained within specific modules, reducing their impact on other modules.
3. Adaptability: TNN – less adaptable; a change in one part of the network affects the entire model; MNN – more adaptable; individual modules can be modified without impacting other modules.
Gradient Descent
Gradient descent is an optimization algorithm widely used in machine learning and deep learning to
minimize a function by iteratively moving toward its minimum value. It’s commonly applied to reduce
the error (or "loss") in models by adjusting parameters, such as weights in neural networks, to improve
performance. The goal of gradient descent is to find the parameters that minimize a given function
(usually a loss function in machine learning). The algorithm achieves this by following the negative
gradient of the function, which indicates the direction of steepest descent.
1. Initialize Parameters: Start with initial values for the parameters, typically chosen randomly.
2. Calculate the Gradient: Compute the gradient (partial derivatives) of the loss function with
respect to each parameter. This gradient points in the direction where the function increases
the fastest.
3. Update Parameters: Move the parameters in the opposite direction of the gradient, scaled by the learning rate η: θ ← θ − η∇L(θ).
4. Repeat: Continue calculating the gradient and updating parameters until the loss function
converges to a minimum (or until a set number of iterations is reached).
Mini-Batch Gradient Descent:
o Divides the dataset into small batches and calculates the gradient on each batch.
o Balances the efficiency of SGD with the stability of batch gradient descent.
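A minimal NumPy sketch of the four steps with mini-batches, fitting a line y = 3x + 0.5 on synthetic data (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy line to be fit by minimizing mean squared error.
X = rng.uniform(-1, 1, 200)
Y = 3.0 * X + 0.5 + 0.05 * rng.standard_normal(200)

w, b = 0.0, 0.0                 # step 1: initialize parameters
lr, batch_size = 0.1, 20        # learning rate and mini-batch size

for epoch in range(100):        # step 4: repeat until convergence
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        x, y = X[batch], Y[batch]
        err = (w * x + b) - y
        grad_w = 2 * np.mean(err * x)   # step 2: gradient of the loss
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w                # step 3: move against the gradient
        b -= lr * grad_b

print(w, b)   # close to 3.0 and 0.5
```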
A feedback neural network, also known as a recurrent or feedback-connected neural network, is a type
of neural network where connections between neurons allow for cycles or loops. Unlike feedforward
neural networks, which have a unidirectional flow of information from input to output, feedback
networks allow information to be "fed back" into the network, enabling it to maintain a form of memory
over time. This is especially useful for tasks involving sequential or time-dependent data, such as
language processing, speech recognition, and time series prediction.
• Cycles in Connections: Feedback networks have loops where outputs from neurons are sent
back as inputs to the same or previous layers, enabling the network to use previous information.
• Memory and State: These networks retain information about previous inputs, allowing them to
handle dependencies in sequential data. This is in contrast to feedforward networks, which have
no memory and treat each input independently.
• Dynamic Behavior: The feedback connections make these networks dynamic, as they can
change their output based on both current and previous inputs.
The working of a feedback neural network typically involves the following steps:
1. Input Layer:
o The network receives an input at each time step (for example, a word in a sentence or a
data point in a time series). This input is processed similarly to a feedforward neural
network.
2. Hidden Layer:
o The hidden layer has feedback (or recurrent) connections, meaning that the output of
each neuron in this layer is connected not only to the next layer but also to itself or
other neurons within the hidden layer.
o At each time step, the network maintains a hidden state, which is updated based on the
current input and the previous hidden state. This hidden state essentially stores
memory of prior inputs and influences the output.
3. Output Layer:
o The network produces an output at each time step, which can be based on the current
hidden state or a combination of the hidden state and the input.
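A minimal NumPy sketch of this hidden-state update over a toy sequence (the dimensions and weight scales are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden-state update of a simple feedback network:
# h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the W_hh term is the feedback loop.
input_dim, hidden_dim = 4, 8
W_xh = 0.1 * rng.standard_normal((hidden_dim, input_dim))
W_hh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
W_hy = 0.1 * rng.standard_normal((1, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # memory of prior inputs
sequence = rng.standard_normal((10, input_dim))
for x_t in sequence:                        # one input per time step
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # current input + previous state
    y_t = W_hy @ h                          # output at this time step
```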
Common types of feedback neural networks include:
• Simple Recurrent Networks (RNNs): These networks have simple feedback connections and are
used for basic sequence processing tasks.
• Long Short-Term Memory (LSTM): LSTMs are specialized RNNs designed to handle long-term
dependencies by using gating mechanisms to control the flow of information.
• Gated Recurrent Units (GRUs): Similar to LSTMs, GRUs are another type of gated RNN that uses
fewer parameters than LSTMs while still handling dependencies well.
Feedback networks are particularly useful for tasks where past information is essential for current
decision-making, such as:
• Natural Language Processing: For tasks like machine translation, text generation, and sentiment
analysis.
• Speech Recognition: To process sequential audio data and convert spoken language into text.
• Time Series Prediction: For forecasting future values in finance, weather, and other fields
involving sequential data.
Comparison of feedforward and feedback neural networks:
1. Computational Complexity: Feedforward – generally lower, as there are no dependencies over time; Feedback – higher, as each time step depends on previous computations.
2. Suitability for Sequential Data: Feedforward – less suited, as it lacks memory and temporal awareness; Feedback – ideal for sequential data, where past context is essential.
Modular Neural Networks (MNNs)
The concept of modularity in neural networks draws inspiration from the human brain, where
different regions specialize in different functions, such as vision, language, and motor control.
Similarly, MNNs aim to replicate this specialization to enhance learning efficiency and
performance.
Key Characteristics
1. Decentralization: Instead of relying on a single network to process all information,
MNNs distribute tasks across multiple modules.
2. Specialization: Each module is specialized to handle specific aspects or features of the
data.
3. Interconnectivity: Modules communicate and collaborate, often through a central
coordinator or through interconnected pathways.
4. Scalability: New modules can be added to handle tasks that are more complex without
overhauling the entire network.
5. Fault Tolerance: If one module fails or underperforms, others can compensate,
enhancing the overall robustness.
Architecture of MNNs
The architecture of Modular Neural Networks can vary widely based on the specific application
and design goals. However, common architectural patterns include:
1. Parallel Modules
● Structure: Multiple modules operate independently on different parts of the input data.
● Example: In image processing, separate modules might handle color, texture, and shape
features.
2. Hierarchical Modules
● Structure: Modules are organized in layers, where higher-level modules receive input
from lower-level ones.
● Example: In natural language processing, lower modules might handle syntax, while
higher modules manage semantics.
3. Ensemble Modules
● Structure: Each module produces its own output, which is then combined (e.g., averaged
or voted) to form the final prediction.
● Example: In classification tasks, different modules might specialize in different classes
or features.
4. Hybrid Modules
● Structure: Combines two or more of the above patterns (parallel, hierarchical, or ensemble) within a single network; a sketch of the ensemble pattern follows this list.
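To make the ensemble pattern above concrete, here is a minimal PyTorch sketch in which several small modules process the same input and their outputs are averaged (the module sizes and the averaging combiner are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ModularEnsemble(nn.Module):
    """Ensemble-style modular network: each module processes the same
    input independently and a simple average combines their predictions."""
    def __init__(self, in_dim, n_classes, n_modules=3):
        super().__init__()
        self.modules_list = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
            for _ in range(n_modules)
        )

    def forward(self, x):
        outputs = [m(x) for m in self.modules_list]   # modules run independently
        return torch.stack(outputs).mean(dim=0)       # combine by averaging

model = ModularEnsemble(in_dim=10, n_classes=4)
logits = model(torch.randn(5, 10))    # output shape: (5, 4)
```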
Applications of MNNs
1. Computer Vision:
o Object detection, image segmentation, and scene understanding by dividing tasks
among specialized modules.
2. Natural Language Processing:
o Handling syntax, semantics, and context through different modules for
comprehensive language understanding.
3. Robotics:
o Separating perception, planning, and control tasks to enhance robotic autonomy
and efficiency.
4. Healthcare:
o Integrating modules for diagnostic analysis, patient data interpretation, and
treatment recommendation.
5. Autonomous Vehicles:
o Managing perception, navigation, decision-making, and control through distinct
modules.
6. Financial Modeling:
o Risk assessment, market prediction, and fraud detection handled by specialized
sub-networks.