Unit 2
Deep learning architectures refer to the structural design of artificial neural networks used in
deep learning. These architectures are composed of multiple layers of interconnected nodes
(neurons) that process and transform input data to produce an output. Deep learning architectures
have gained popularity due to their ability to automatically learn representations of data through
the use of multiple layers, enabling them to effectively model complex patterns and relationships
in data.
Feedforward Neural Networks (FNNs):
Also known as multilayer perceptrons (MLPs), FNNs consist of multiple layers of nodes,
including an input layer, one or more hidden layers, and an output layer. Each node in a layer is
connected to every node in the subsequent layer, and information flows in one direction, from
the input to the output.
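As a minimal sketch of this one-directional flow, the following plain-Python example passes an input through a small fully connected network. The 2-3-1 layer sizes, the weight values, and the sigmoid activation are assumptions chosen for illustration only.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every node of the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def forward(x, layers):
    """Pass x through each (weights, biases) pair in order, input to output."""
    activation = x
    for weights, biases in layers:
        activation = sigmoid(dense(activation, weights, biases))
    return activation

# Hypothetical 2-3-1 network with hand-picked weights (one row per node).
layers = [
    ([[0.5, -0.5], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.0]),                                  # output layer
]
output = forward([1.0, 2.0], layers)
print(output)  # a single value between 0 and 1
```

Nothing in `forward` ever feeds a later activation back to an earlier layer, which is what distinguishes this design from the recurrent networks described below.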
Convolutional Neural Networks (CNNs):
CNNs are primarily used for processing grid-structured data such as images and are
characterized by their use of convolutional layers. Convolutional layers apply filters (kernels) to
input data, capturing spatial hierarchies and patterns. CNNs are widely used in image
recognition, object detection, and image generation tasks.
Recurrent Neural Networks (RNNs):
RNNs are designed to handle sequential data, such as time series or natural language. They have
loops within their architecture, allowing them to maintain and process information over time.
RNNs are used in tasks like speech recognition, machine translation, and sentiment analysis.
Long Short-Term Memory networks (LSTMs):
LSTMs are a type of RNN capable of learning long-term dependencies in sequential
data. They use a memory cell and various gates to selectively retain and update information over
time, making them well-suited for tasks requiring memory over long sequences.
Autoencoders:
Autoencoders are used for unsupervised learning and dimensionality reduction. They consist of
an encoder that maps input data to a lower-dimensional latent space and a decoder that
reconstructs the input from the latent space representation. Autoencoders are used for tasks like
data denoising, feature learning, and generative modeling.
Generative Adversarial Networks (GANs):
GANs consist of two neural networks, a generator and a discriminator, which are trained
adversarially. The generator learns to produce realistic data samples, while the discriminator
learns to differentiate between real and generated samples. GANs are used for generative
modeling tasks such as image synthesis, style transfer, and data augmentation.
These are just a few examples of deep learning architectures, and there are many other variations
and specialized architectures designed for specific tasks and domains. The choice of architecture
depends on the nature of the data, the complexity of the problem, and the specific requirements
of the application.
The number of architectures and algorithms that are used in deep learning is wide and varied.
This section explores six of the deep learning architectures spanning the past 20 years. Notably,
long short-term memory (LSTM) and convolutional neural networks (CNNs) are two of the oldest
approaches in this list but also two of the most used in various applications.
This article classifies deep learning architectures into supervised and unsupervised learning and
introduces several popular deep learning architectures: convolutional neural networks, recurrent
neural networks (RNNs), long short-term memory/gated recurrent unit (GRU), self-organizing
map (SOM), autoencoders (AE) and restricted Boltzmann machine (RBM). It also gives an
overview of deep belief networks (DBN) and deep stacking networks (DSNs).
Artificial neural network (ANN) is the underlying architecture behind deep learning. Based on
ANN, several variations of the algorithms have been invented. To learn about the fundamentals
of deep learning and artificial neural networks, read the introduction to deep learning article.
Supervised learning refers to the problem space wherein the target to be predicted is clearly
labeled within the data that is used for training. In this section, we introduce at a high level two
of the most popular supervised deep learning architectures, convolutional neural networks and
recurrent neural networks, as well as some of their variants.
Convolutional neural networks
A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex.
The architecture is particularly useful in image-processing applications. The first CNN was
created by Yann LeCun; at the time, the architecture focused on handwritten character
recognition, such as postal code interpretation. As a deep network, early layers recognize
features (such as edges), and later layers recombine these features into higher-level attributes of
the input. The LeNet CNN architecture is made up of several layers that implement feature
extraction and then classification (see the following image). The image is divided into receptive
fields that feed into a convolutional layer, which then extracts features from the input image. The
next step is pooling, which reduces the dimensionality of the extracted features (through
down-sampling) while retaining the most important information (typically, through max
pooling). Another convolution and pooling step is then performed that feeds into a fully
connected multilayer perceptron. The final output layer of this network is a set of nodes that
identify features of the image (in this case, a node per identified number). You train the network
by using back-propagation.
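The convolution and pooling steps can be sketched in a few lines of plain Python. This is an illustrative toy, not LeNet itself: the 5x5 image, the hand-written 1x2 edge kernel, and the 2x2 pooling window are all assumptions, and a real CNN would learn its kernels during training.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 sliding-window product of an image and a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]

# Toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1]]                 # responds where intensity rises left to right
features = conv2d(image, kernel)   # 5 rows x 4 columns, 1 at the edge position
pooled = max_pool(features)        # down-sampled to 2 x 2
print(pooled)  # [[1, 0], [1, 0]]
```

The pooled map is smaller than the feature map but still records that an edge was found in the left half of the image, which is the sense in which pooling keeps the most important information.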
The use of deep layers of processing, convolutions, pooling, and a fully connected classification
layer opened the door to various new applications of deep learning neural networks. In addition
to image processing, the CNN has been successfully applied to video recognition and various
tasks within natural language processing.
Recurrent neural networks
The RNN is one of the foundational network architectures from which other deep learning
architectures are built. The primary difference between a typical multilayer network and a
recurrent network is that rather than completely feed-forward connections, a recurrent network
might have connections that feed back into prior layers (or into the same layer). This feedback
allows RNNs to maintain memory of past inputs and model problems in time. RNNs consist of a
rich set of architectures (we'll look at one popular topology called LSTM next). The key
differentiator is feedback within the network, which could manifest itself from a hidden layer,
the output layer, or some combination thereof.
RNNs can be unfolded in time and trained with standard back-propagation or by using a variant
of back-propagation that is called back-propagation through time (BPTT).
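To make the idea of unfolding concrete, here is a minimal sketch of a single recurrent unit applied to a short sequence. The scalar weights `w_in` and `w_rec`, the tanh activation, and the input values are assumptions made for illustration.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias, h0=0.0):
    """Unfold a single-unit recurrent network over a sequence of scalar inputs."""
    h = h0
    states = []
    for x in inputs:
        # The previous hidden state feeds back into the same unit.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

# One non-zero input followed by silence: the feedback keeps a fading trace of it.
states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0)
print(states)
```

Each pass through the loop corresponds to one copy of the network in the unfolded view. BPTT applies ordinary back-propagation to that chain of copies, with the same weights shared by every copy.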
LSTM networks
The LSTM was created in 1997 by Hochreiter and Schmidhuber, but it has grown in popularity
in recent years as an RNN architecture for various applications. You'll find LSTMs in products
that you use every day, such as smartphones. IBM applied LSTMs in IBM Watson® for
milestone-setting conversational speech recognition. The LSTM departed from typical neuron-based neural
network architectures and instead introduced the concept of a memory cell. The memory cell can
retain its value for a short or long time as a function of its inputs, which allows the cell to
remember what's important and not just its last computed value. The LSTM memory cell
contains three gates that control how information flows into or out of the cell. The input gate
controls when new information can flow into the memory. The forget gate controls when an
existing piece of information is forgotten, allowing the cell to remember new data. Finally, the
output gate controls when the information that is contained in the cell is used in the output from
the cell. The cell also contains weights, which control each gate. The training algorithm,
commonly BPTT, optimizes these weights based on the resulting network output error.
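The three gates can be sketched for a single-unit cell as follows. This follows the commonly used LSTM formulation (sigmoid gates, tanh candidate value). The parameter layout and the extreme bias values are assumptions chosen so that the gates are almost fully open or closed.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell.

    p maps each name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information enters the cell
    f = sigmoid(pre("forget"))        # how much of the old cell value is kept
    o = sigmoid(pre("output"))        # how much of the cell is exposed as output
    g = math.tanh(pre("candidate"))   # candidate value to write into the cell
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: input gate closed, forget gate open -> the cell remembers.
remember = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
            "output": (0.0, 0.0, 20.0), "candidate": (1.0, 0.0, 0.0)}
h_keep, c_keep = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=remember)

# Input gate open, forget gate closed -> the cell is overwritten by the new input.
overwrite = dict(remember, input=(0.0, 0.0, 20.0), forget=(0.0, 0.0, -20.0))
h_new, c_new = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=overwrite)
print(c_keep, c_new)  # close to 0.7, then close to tanh(5.0)
```

In a trained network the gate weights are learned rather than fixed, so the cell decides from its inputs when to keep and when to replace its contents.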
Recent applications of CNNs and LSTMs produced image and video captioning systems in
which an image or video is captioned in natural language. The CNN implements the image or
video processing, and the LSTM is trained to convert the CNN output into natural language.
GRU networks
In 2014, a simplification of the LSTM was introduced called the gated recurrent unit. This model
has two gates, getting rid of the output gate present in the LSTM model. These gates are an
update gate and a reset gate. The update gate indicates how much of the previous cell contents to
maintain. The reset gate defines how to incorporate the new input with the previous cell contents.
A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0.
The GRU is simpler than the LSTM, can be trained more quickly, and can be more efficient in its
execution. However, the LSTM can be more expressive and with more data can lead to better
results.
Example applications: Natural language text compression, handwriting recognition,
speech recognition, gesture recognition, image captioning
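The following sketch of a single-unit GRU step checks the claim that a reset gate of 1 and an update gate of 0 reduce the GRU to a standard RNN. It assumes the convention used in the text, where the update gate is the fraction of the previous state that is kept (some references define it the other way around). The `update` and `reset` override arguments are hypothetical conveniences for the demonstration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p, update=None, reset=None):
    """One time step of a single-unit GRU.

    p maps "update", "reset" and "candidate" to (input weight, recurrent weight, bias).
    update / reset can be forced to fixed values for illustration.
    """
    def pre(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    z = sigmoid(pre("update", h_prev)) if update is None else update
    r = sigmoid(pre("reset", h_prev)) if reset is None else reset
    candidate = math.tanh(pre("candidate", r * h_prev))
    # Assumed convention: z is the share of the previous state that is maintained.
    return z * h_prev + (1.0 - z) * candidate

# Hypothetical parameter values.
params = {"update": (0.4, 0.3, 0.0), "reset": (0.2, -0.1, 0.0),
          "candidate": (0.8, 0.5, 0.1)}
x, h_prev = 1.5, 0.25

plain_rnn = math.tanh(0.8 * x + 0.5 * h_prev + 0.1)
as_rnn = gru_step(x, h_prev, params, update=0.0, reset=1.0)
frozen = gru_step(x, h_prev, params, update=1.0, reset=1.0)
print(abs(as_rnn - plain_rnn) < 1e-12, frozen == h_prev)  # True True
```

At the other extreme, an update gate of 1 simply copies the previous state forward, which is how the GRU carries information across many time steps.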
Unsupervised deep learning
Unsupervised learning refers to the problem space wherein there is no target label within the data
that is used for training. This section discusses three unsupervised deep learning architectures:
self-organizing maps, autoencoders, and restricted Boltzmann machines. We also discuss how
deep belief networks and deep stacking networks are built based on the underlying unsupervised
architecture.
Self-organizing maps
The self-organizing map (SOM) was invented by Dr. Teuvo Kohonen in 1982 and is popularly
known as the Kohonen map. An SOM is an unsupervised neural network that creates clusters of
the input data set by reducing the dimensionality of the input. SOMs vary from the traditional
artificial neural network in quite a few ways.
The first significant variation is that weights serve as a characteristic of the node. After the inputs
are normalized, a random input is first chosen. Random weights close to zero are initialized for
each feature of the input record. These weights now represent the input node. Several
combinations of these random weights represent variations of the input node. The Euclidean
distance between each of these output nodes and the input node is calculated. The node with the
least distance is declared the most accurate representation of the input and is marked as
the best matching unit or BMU. With these BMUs as center points, other units are similarly
calculated and assigned to the cluster whose BMU they are closest to. The weights of the nodes
within a radius of the BMU are updated based on their proximity to it, and the radius is shrunk as
training proceeds.
The second variation is that in an SOM, no activation function is applied, and because there are
no target labels to compare against, there is no concept of calculating error and back-propagation.
Example applications: Dimensionality reduction, clustering high-dimensional inputs to
2-dimensional output, radiant grade result, and cluster visualization
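A single SOM training step (find the BMU, then pull it and its neighbours toward the input) might look like the sketch below. It assumes a one-dimensional row of nodes and a hard neighbourhood radius to keep the example short. SOMs are usually laid out on a 2-dimensional grid, and the weight values here are hypothetical.

```python
import math

def best_matching_unit(weights, sample):
    """Index of the node whose weight vector is closest (Euclidean) to the sample."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], sample))

def som_update(weights, sample, learning_rate, radius):
    """Move the BMU and its neighbours on a 1-D row of nodes toward the sample."""
    bmu = best_matching_unit(weights, sample)
    updated = []
    for k, w in enumerate(weights):
        if abs(k - bmu) <= radius:  # node lies within the neighbourhood of the BMU
            w = [wi + learning_rate * (si - wi) for wi, si in zip(w, sample)]
        updated.append(list(w))
    return bmu, updated

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # three nodes, two features each
bmu, weights_after = som_update(weights, [0.9, 0.8], learning_rate=0.5, radius=1)
print(bmu)            # 2: the node at [1.0, 1.0] is closest to the sample
print(weights_after)  # nodes 1 and 2 moved toward the sample, node 0 unchanged
```

Repeating this step over many inputs while gradually reducing `radius` and `learning_rate` is what lets neighbouring nodes settle on similar inputs. No error signal or activation function is involved.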
Autoencoders
Though the history of when autoencoders were invented is hazy, the first known usage of
autoencoders was found to be by LeCun in 1987. This variant of an ANN is composed of three
layers: input, hidden, and output layers. First, the input layer is encoded into the hidden layer
using an appropriate encoding function. The number of nodes in the hidden layer is much smaller
than the number of nodes in the input layer. This hidden layer contains the compressed
representation of the original input. The output layer aims to reconstruct the input layer by using
a decoder function.
During the training phase, the difference between the input and the output layer is calculated
using an error function, and the weights are adjusted to minimize the error. Unlike traditional
unsupervised learning techniques, where there is no data to compare the outputs against,
autoencoders learn continuously using backward propagation. For this reason, autoencoders are
classified as self-supervised algorithms.
Example applications: Dimensionality reduction, data interpolation, and data
compression/decompression
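The training loop described above can be sketched with a deliberately tiny autoencoder: two inputs, one hidden unit, two outputs, linear units, and no biases. The data set, initial weights, learning rate, and number of epochs are all assumptions chosen so that the example stays readable and converges.

```python
def reconstruct(x, enc, dec):
    """Encode a 2-D input to one hidden value, then decode it back to 2-D."""
    h = sum(e * xi for e, xi in zip(enc, x))   # compressed representation
    return h, [d * h for d in dec]

def mean_error(data, enc, dec):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, enc, dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train(data, enc, dec, learning_rate=0.01, epochs=200):
    """Batch gradient descent on the reconstruction error."""
    enc, dec = list(enc), list(dec)
    n = len(data)
    for _ in range(epochs):
        grad_enc = [0.0] * len(enc)
        grad_dec = [0.0] * len(dec)
        for x in data:
            h, x_hat = reconstruct(x, enc, dec)
            err = [a - b for a, b in zip(x_hat, x)]
            back = sum(e * d for e, d in zip(err, dec))  # error sent back to the hidden unit
            for j in range(len(dec)):
                grad_dec[j] += 2 * err[j] * h / n
            for j in range(len(enc)):
                grad_enc[j] += 2 * back * x[j] / n
        enc = [e - learning_rate * g for e, g in zip(enc, grad_enc)]
        dec = [d - learning_rate * g for d, g in zip(dec, grad_dec)]
    return enc, dec

# Toy data lying on the line x2 = 2 * x1, so one hidden value is enough to describe it.
data = [[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0], [0.5, 1.0]]
enc0, dec0 = [0.1, 0.1], [0.1, 0.1]
before = mean_error(data, enc0, dec0)
enc, dec = train(data, enc0, dec0)
after = mean_error(data, enc, dec)
print(before, after)  # the reconstruction error drops sharply after training
```

The input itself serves as the training target, which is why no labels are needed even though the weights are adjusted by propagating an error backward.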
Restricted Boltzmann machines
Though RBMs became popular much later, they were originally invented by Paul Smolensky in
1986 and were known as a Harmonium. An RBM is a 2-layered neural network. The layers are
the visible (input) layer and the hidden layer. As shown in the following figure, in RBMs every
node in a hidden layer is connected to every node in a visible layer. In a traditional Boltzmann
Machine, nodes within the input and hidden layer are also connected. Due to computational
complexity, nodes within a layer are not connected in a Restricted Boltzmann Machine.
During the training phase, RBMs calculate the probability distribution of the training set using a
stochastic approach. When the training begins, each neuron gets activated at random. Also, the
model contains respective hidden and visible biases. While the hidden bias is used in the forward
pass to build the activation, the visible bias helps in reconstructing the input. Because in an RBM
the reconstructed input is sampled and so generally differs from the original input, RBMs are
also known as generative models. Also, because of the built-in randomness, the same input can
result in different outputs. In fact, this is the most significant difference from an autoencoder,
which is a deterministic model.
Example applications: Dimensionality reduction and collaborative filtering
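The forward pass (using the hidden bias) and the reconstruction (using the visible bias) can be sketched as one stochastic round trip. The weights and biases below are hypothetical and untrained. A real RBM would learn them, for example with contrastive divergence.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sample_layer(inputs, weights, biases, rng):
    """Sample binary units: each turns on with probability sigmoid(weighted input + bias)."""
    probs = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
             for row, b in zip(weights, biases)]
    return [1 if rng.random() < p else 0 for p in probs]

def reconstruct(visible, weights, hidden_bias, visible_bias, rng):
    """One forward pass to the hidden layer and one backward pass to the visible layer."""
    hidden = sample_layer(visible, weights, hidden_bias, rng)
    transposed = [list(col) for col in zip(*weights)]  # same weights, opposite direction
    return hidden, sample_layer(hidden, transposed, visible_bias, rng)

# Hypothetical RBM with 3 visible and 2 hidden units (one weight row per hidden unit).
weights = [[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]]
hidden_bias = [0.0, 0.0]
visible_bias = [0.0, 0.0, 0.0]

rng = random.Random(0)
for _ in range(3):
    # The same visible input can give a different reconstruction on each pass.
    print(reconstruct([1, 0, 1], weights, hidden_bias, visible_bias, rng))
```

There are no connections inside a layer, so every unit of a layer can be sampled independently once the other layer is fixed. This is what the restriction buys computationally.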
Deep belief networks
The DBN is a typical network architecture, but includes a novel training algorithm. The DBN is
a multilayer network (typically deep and including many hidden layers) in which each pair of
connected layers is an RBM. In this way, a DBN is represented as a stack of RBMs. In the DBN,
the input layer represents the raw sensory inputs, and each hidden layer learns abstract
representations of this input. The output layer, which is treated somewhat differently than the
other layers, implements the network classification. Training occurs in two steps: unsupervised
pretraining and supervised fine-tuning.
In unsupervised pretraining, each RBM is trained to reconstruct its input (for example, the first
RBM reconstructs the input layer to the first hidden layer). The next RBM is trained similarly,
but the first hidden layer is treated as the input (or visible) layer, and the RBM is trained by using
the outputs of the first hidden layer as the inputs. This process continues until each layer is
pretrained. When the pretraining is complete, fine-tuning begins. In this phase, the output nodes
are assigned labels to give them meaning (what they represent in the context of the network). Full
network training is then applied by using either gradient descent learning or back-propagation to
complete the training process.
Example applications: Image recognition, information retrieval, natural language understanding,
and failure prediction
Deep stacking networks
The final architecture is the DSN, also called a deep convex network. A DSN is different from
traditional deep learning frameworks in that although it consists of a deep network, it's actually a
deep set of individual networks, each with its own hidden layers. This architecture is a response
to one of the problems with deep learning, the complexity of training. Each layer in a deep
learning architecture exponentially increases the complexity of training, so the DSN views
training not as a single problem but as a set of individual training problems. The DSN consists of
a set of modules, each of which is a subnetwork in the overall hierarchy of the DSN. In one
instance of this architecture, three modules are created for the DSN. Each module consists of an
input layer, a single hidden layer, and an output layer. Modules are stacked one on top of
another, where the inputs of a module consist of the prior layer outputs and the original input
vector. This layering allows the overall network to learn more complex classification than would
be possible given a single module.
The DSN permits training of individual modules in isolation, making it efficient given the ability
to train in parallel. Supervised training is implemented as back-propagation for each module
rather than back-propagation over the entire network. For many problems, DSNs can perform
better than typical DBNs, making them a popular and efficient network architecture.
Example applications: Information retrieval and continuous speech recognition
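The stacking rule, in which each module receives the original input vector together with the output of the module below it, can be sketched as follows. The `run_stack` helper and the toy module (which just sums its inputs) are hypothetical stand-ins. In a real DSN each module would be a small trained network with one hidden layer.

```python
def run_stack(x, modules):
    """Feed x through stacked modules.

    The first module sees only x; every later module sees x concatenated
    with the output of the module directly below it.
    """
    module_input = list(x)
    output = []
    for module in modules:
        output = module(module_input)
        module_input = list(x) + list(output)
    return output

def toy_module(values):
    """Stand-in for a trained module: returns a single output, the sum of its inputs."""
    return [sum(values)]

result = run_stack([1, 2], [toy_module, toy_module, toy_module])
print(result)  # [9]: 1+2 = 3, then 1+2+3 = 6, then 1+2+6 = 9
```

Because each module has its own input and output layers, it presents its own supervised training problem, which is what allows the per-module training described above.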
Deep learning is represented by a spectrum of architectures that can build solutions for a range of
problem areas. These solutions can be feed-forward focused or recurrent networks that permit
consideration of previous inputs. Although building these types of deep architectures can be
complex, various open source solutions, such as Caffe, Deeplearning4j, TensorFlow, and DDL,
are available to get you up and running quickly.
Representation learning is a fundamental concept in both machine learning and deep learning. It
refers to the process of learning meaningful representations of data from raw input, which can
then be used for various tasks such as classification, clustering, and prediction. Representation
learning aims to capture the underlying structure or patterns in the data, making it easier for
machine learning models to extract relevant features and make accurate predictions.
In traditional machine learning, feature engineering is often used to manually design
representations of the input data before feeding it into the learning algorithm. This process
requires domain knowledge and can be time-consuming, especially for complex datasets.
Representation learning, on the other hand, aims to automate this process by learning useful
features or representations directly from the data.
Deep learning, with its ability to automatically learn hierarchical representations from raw data,
has revolutionized representation learning. Deep neural networks, especially architectures like
convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are capable of
learning complex and hierarchical representations of data. For example:
Convolutional Neural Networks (CNNs) are particularly effective for learning representations
from images. They automatically learn hierarchical features such as edges, textures, and object
parts, which are then combined to represent higher-level concepts like objects or scenes.
Recurrent Neural Networks (RNNs) are well-suited for sequential data such as text or time
series. They can capture temporal dependencies and learn representations that take into account
the sequential nature of the data.
In deep learning, representation learning is often achieved through the use of multiple layers of
neurons, which enable the network to learn increasingly abstract and complex representations of
the input data. Each layer in a deep network can be seen as learning a new representation of the
data based on the representations learned by the previous layers. This hierarchical learning
process allows deep learning models to automatically discover meaningful features and patterns
in the data, without the need for explicit feature engineering.
Representation learning has led to significant advancements in various fields, including computer
vision, natural language processing, speech recognition, and many others. By learning rich and
meaningful representations of data, machine learning and deep learning models can achieve
higher levels of performance and generalization across a wide range of tasks.
Need for Representation Learning
Assume you’re developing a machine-learning algorithm to predict dog breeds based on pictures.
Because image data provides all of the answers, the engineer must rely heavily on it when
developing the algorithm. Each observation or feature in the data describes the qualities of the
dogs. The machine learning system that predicts the outcome must comprehend how each
attribute relates to outcomes such as Pug, Golden Retriever, and so on.
Representation learning is a class of machine learning approaches that allow a system to discover
the representations required for feature detection or classification from raw data. The
requirement for manual feature engineering is reduced by allowing a machine to learn the
features and apply them to a given activity. In representation learning, data is sent into the
machine, and it learns the representation on its own. It is a way of determining a data
representation of the features, together with the distance function and the similarity function,
which determine how the predictive model will perform. Representation learning works by
reducing high-dimensional data to low-dimensional data, making it easier to discover patterns
and anomalies while also providing a better understanding of the data’s overall behavior.
Machine learning tasks such as classification frequently demand input that is mathematically and
computationally convenient to process, which motivates representation learning. Real-world
data, such as photos, video, and sensor data, has resisted attempts to define specific features
algorithmically. An alternative approach is to discover such features or representations by
examining the data rather than depending on explicit algorithms.
We must employ representation learning to ensure that the model provides invariant and
disentangled outcomes in order to increase its accuracy and performance. In this section, we’ll
look at how representation learning can improve the model’s performance in two different
learning frameworks: supervised learning and unsupervised learning.
Supervised Learning
Learning is referred to as supervised when the ML or DL model learns to map the input X to the
output Y using labeled examples. The computer tries to correct itself by comparing model output
to ground truth, and the learning process optimizes the mapping from input to output. This
process is repeated until the optimization function reaches a minimum, ideally the global
minimum. Even when the optimization function reaches the global minimum on the training
data, the model does not always perform well on new data, which is known as overfitting. While
supervised learning does not necessitate a significant amount of data to learn the mapping from
input to output, it does necessitate the learned features. The prediction accuracy can improve by
up to 17 percent when the learned attributes are incorporated into the supervised learning
algorithm. Using labeled input data, features are learned in supervised feature learning.
Supervised neural networks, multilayer perceptrons, and (supervised) dictionary learning are
some examples.
Unsupervised Learning
Unsupervised learning is a sort of machine learning in which the labels are ignored in favor of
the observation itself. Unsupervised learning isn’t used for classification or regression; instead,
it’s used to uncover underlying patterns, cluster data, denoise it, detect outliers, and decompose
data, among other things. When working with data x, we must be very careful about whatever
features z we use to ensure that the patterns produced are accurate. It has been observed that
having more data does not always imply having better representations. We must be careful to
develop a model that is both flexible and expressive so that the extracted features can convey
critical information. Unsupervised feature learning learns features from unlabeled input data.
Dictionary learning, independent component analysis, autoencoders, matrix factorization, and
various forms of clustering are among the examples. In the next section, we will see in more
detail how these methods work and how they learn the representation.
Supervised Methods
Supervised Dictionary Learning
Dictionary learning creates a set of representative elements (dictionary) from the input data,
allowing each data point to be represented as a weighted sum of the representative elements. The
dictionary elements and weights can be obtained by minimizing the average representation error
(across the input data) together with an L1 regularization on the weights. The L1 term enforces
sparsity, i.e., the representation of each data point has only a few nonzero weights. For
optimizing dictionary elements, supervised dictionary learning takes advantage of both the
structure underlying the input data and the labels. The supervised dictionary learning technique
uses dictionary learning to solve classification problems by jointly optimizing dictionary
elements, data point weights, and classifier parameters based on the input data. A minimization
problem is formulated, with the objective function consisting of the classification error, the
representation error, an L1 regularization on the representing weights for each data point (to
enable sparse data representation), and an L2 regularization on the parameters of the
classification algorithm.
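One way to write the objective described above is sketched below. All symbols are assumptions introduced here for illustration: x_i is a data point with label y_i, D is the dictionary, a_i is the weight vector of x_i, f_theta is the classifier with parameters theta, ell is a classification loss, and the lambdas are trade-off coefficients.

```latex
\min_{D,\,\{a_i\},\,\theta}\;
\frac{1}{n}\sum_{i=1}^{n}
\Big[
  \underbrace{\ell\big(y_i, f_\theta(a_i)\big)}_{\text{classification error}}
  + \lambda_0 \underbrace{\lVert x_i - D a_i \rVert_2^2}_{\text{representation error}}
  + \lambda_1 \underbrace{\lVert a_i \rVert_1}_{\text{sparsity}}
\Big]
+ \lambda_2 \lVert \theta \rVert_2^2
```

Dropping the classification term and the penalty on theta leaves the unsupervised dictionary learning objective: representation error plus the L1 penalty on the weights.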
Multi-Layer Perceptron
The perceptron is the most basic neural unit: it combines a set of inputs with weights to produce
an output, which is compared to the ground truth. A multi-layer perceptron, or MLP, is a
feed-forward neural network made up of layers of perceptron units. An MLP is made up of at
least three layers of nodes: an input layer, a hidden layer, and an output layer. MLP is commonly
referred to as the vanilla neural network because it is a very basic artificial neural network.
This notion serves as a foundation for hidden variables and representation learning. Our goal
is to determine the variables or required weights that can represent the underlying
distribution of the entire data so that when we plug those variables or required weights into
unknown data, we receive results that are almost identical to the original data. In short,
artificial neural networks (ANN) assist us in extracting meaningful patterns from a dataset.
Neural Networks
Neural networks are a class of learning algorithms that employ a “network” of interconnected
nodes in various layers. They are modeled on the animal nervous system, with nodes resembling
neurons and edges resembling synapses. The network establishes computational rules for passing
input data from the network’s input layer to the network’s output layer, and each edge has an
associated weight. The relationship between the input and output layers, which is parameterized
by the weights, is described by a network function associated with the neural network. With an
appropriately defined network function, various learning tasks can be achieved by minimizing a
cost function over the network function, that is, over its weights w.
Unsupervised Methods
K-Means Clustering
K-means clustering is a vector quantization approach. A set of n vectors is divided into k clusters
(i.e. subsets) via K-means clustering, with each vector belonging to the cluster with the closest
mean. The problem is computationally NP-hard, although suboptimal greedy algorithms are
commonly used in practice.
K-means clustering divides an unlabeled collection of inputs into k groups and then uses the
centroids of these groups to produce features. These features can be produced in a variety of
ways. The simplest method is to add k binary features to each sample, where feature j has the
value one if and only if the jth centroid learned by k-means is the closest to the sample under
consideration. Cluster distances can also be used as features after being processed with a radial
basis function.
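Both ways of turning centroids into features can be sketched in a few lines. The centroids below are hypothetical values standing in for the output of a k-means run, and the Gaussian form of the radial basis function with parameter `gamma` is an assumption.

```python
import math

def one_hot_features(sample, centroids):
    """k binary features: feature j is 1 only if centroid j is the closest one."""
    distances = [math.dist(sample, c) for c in centroids]
    nearest = distances.index(min(distances))
    return [1 if j == nearest else 0 for j in range(len(centroids))]

def rbf_features(sample, centroids, gamma=0.5):
    """Soft alternative: pass each cluster distance through a radial basis function."""
    return [math.exp(-gamma * math.dist(sample, c) ** 2) for c in centroids]

# Hypothetical centroids, as if already learned by k-means with k = 3.
centroids = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
sample = [4.0, 4.0]

hard = one_hot_features(sample, centroids)
soft = rbf_features(sample, centroids)
print(hard)  # [0, 1, 0]: the second centroid is closest
print(soft)  # largest value at the second position, small values elsewhere
```

The binary version records only which cluster a sample belongs to, while the radial basis version also keeps how close the sample is to every centroid.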
Unsupervised Dictionary Learning
For optimizing dictionary elements, unsupervised dictionary learning does not use data labels
and instead relies on the structure underlying the data. Sparse coding, which seeks to learn basis
functions (dictionary elements) for data representation from unlabeled input data, is an example
of unsupervised dictionary learning. When the number of dictionary elements exceeds the
dimension of the input data, sparse coding can be used to learn overcomplete dictionaries.
K-SVD is an algorithm for learning a dictionary of elements that allows for sparse representation.
Deep learning architectures for feature learning are inspired by the hierarchical architecture of
the biological brain system, which stacks numerous layers of learning nodes. The premise of
distributed representation is typically used to construct these architectures: observable data is
generated by the interactions of many diverse components at several levels.
In multilayer learning frameworks, RBMs (restricted Boltzmann machines) are widely used as
building blocks. An RBM is a bipartite undirected network having a set of binary hidden
variables, a set of visible variables, and edges connecting the hidden and visible nodes. It’s a
variant of the more general Boltzmann machine, with the added constraint of no connections
between nodes in the same layer. In an RBM, each edge has a weight assigned to it. The
connections and weights define an energy function that can be used to generate a joint
distribution of visible and hidden nodes. For unsupervised representation learning, an RBM can
be thought of as a single-layer design. The visible variables, in particular, relate to the input data,
whereas the hidden variables correspond to the feature detectors. Hinton’s contrastive divergence
(CD) approach can be used to train the weights by maximizing the probability of visible
variables.
Autoencoders
Deep network representations have been found to be insensitive to complex noise or data
conflicts. This can be linked to the architecture to some extent. The employment of convolutional
layers and max-pooling, for example, can be shown to produce transformation insensitivity.
Autoencoders are neural networks that can be trained to do representation learning.
Autoencoders seek to copy their input to their output using an encoder and a decoder.
Autoencoders are usually trained with back-propagation; they can also be trained via
recirculation, a learning process that compares the activations of the network on the original
input to the activations on the reconstructed input.
Final Words
Unlike typical learning tasks such as classification, which have the end goal of reducing
misclassifications, representation learning is an intermediate goal of machine learning, which
makes it difficult to articulate a direct and obvious training target. In this post, we looked at how
to overcome such difficulties from scratch. Starting from the need for representation learning, we
covered different methodologies in supervised, unsupervised, and some deep learning
frameworks.
The terms "width" and "depth" in the context of neural networks refer to two important
architectural aspects that can significantly impact the behavior and performance of the network.
Width:
The width of a neural network refers to the number of neurons in each layer. In other words, it
determines the "breadth" of the network. Increasing the width of a network means adding more
neurons to each layer. This can increase the model's capacity to learn complex patterns in the
data. A wider network has more parameters and therefore more capacity to represent complex
functions. However, it also requires more computational resources and data to train effectively.
Depth:
The depth of a neural network refers to the number of layers in the network. It determines the
"depth" of the network's hierarchy of features. Increasing the depth of a network means adding
more layers. Deeper networks can learn hierarchical representations of the input data, where
lower layers capture simple features and higher layers capture more abstract and complex
features. Deep networks have been shown to be effective in learning complex patterns in data,
especially in tasks such as image recognition, natural language processing, and speech
recognition.
Both width and depth are important considerations when designing neural network architectures:
Trade-offs: Increasing the width or depth of a network can improve its representational capacity,
but it can also lead to overfitting if the model becomes too complex for the available training
data.
Computational Cost: Deeper and wider networks typically require more computational
resources (memory, processing power) for training and inference.
Training Difficulty: Deeper networks can be more challenging to train due to issues like
vanishing/exploding gradients, and wider networks can require more careful hyperparameter
tuning to prevent overfitting.
The choice of width and depth in neural network architecture
depends on various factors such as the complexity of the task, the size of the training dataset, the
computational resources available, and the trade-off between model complexity and
generalization performance.
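To make the cost of width and depth concrete, the following minimal sketch counts the weights and biases of a fully connected network. The function name count_parameters and the layer sizes are hypothetical, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Hypothetical layer sizes, chosen only for illustration.
baseline = count_parameters([100, 64, 10])      # one hidden layer of 64 units
wider = count_parameters([100, 128, 10])        # same depth, doubled width
deeper = count_parameters([100, 64, 64, 10])    # same width, one extra hidden layer

print(baseline, wider, deeper)  # 7114 14218 11274
```

In this toy setting, doubling the width roughly doubles the parameter count, while adding one hidden layer of the same width adds one extra weight matrix and bias vector.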
Neural Networks
Neural networks are algorithms inspired by biological neural networks. Their basic building
blocks are neurons, which are interconnected in a way that depends on the type of network.
Initially, the idea was to create an artificial system that would function just like the
human brain.
There are many types of neural networks, but they roughly fall into three main classes.
For the most part, the difference between them is the type of neurons that form them and how the
information flows through the network. Here, we’ll briefly explain only convolutional
neural networks.
Convolutional Neural Networks
Convolutional neural networks (CNNs) are a type of artificial neural network, a machine learning
technique. They have been around for a while but have recently gained more exposure because
of their success in image recognition. A convolutional neural network is a powerful tool for
processing any data to which the convolution operation can be applied. CNNs owe their success
to their ability to process large amounts of data such as images, videos, and text. Primarily,
we can use them to classify images, localize objects, and extract features from an image, such as
edges or corners. They are typically composed of one or more hidden layers, each of which
contains a set of learnable filters (kernels).
Neural networks consist of layers, where each layer has multiple neurons. The number of layers
in a neural network defines its depth. A neural network must have at least two layers:
Input layer – it brings the input data into the system and represents the beginning of the neural
network architecture.
Output layer – this is the last layer in the neural network, and it produces the result of the model.
All layers other than the input and output layers are hidden layers. CNNs commonly have around
five to ten layers, but some modern architectures have up to one hundred layers.
A convolution layer is a layer where we apply filters to the input images or tensors.
[Figure: an input matrix convolved with a filter]
In the figure, a filter is applied to an input matrix. The filter passes across the matrix, and at
each position an element-by-element multiplication is applied between the filter and the
corresponding elements of the matrix. Then we sum the results of this multiplication into a
single number.
Usually, the inputs to the CNN are color images. They consist of three channels that represent
the intensity of the red, green, and blue colors. Every pixel in the image combines these three
colors, where the intensity of each color is described with an integer from 0 to 255.
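The multiply-and-sum step described above can be sketched in NumPy as follows. This is a minimal illustration on a single-channel input with a step of one and no padding; the function name conv2d_valid and the toy values are assumptions, and, as in most deep learning libraries, the kernel is not flipped.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with a step of one and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # element-by-element multiplication, then sum into one number
            out[i, j] = np.sum(patch * kernel)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                    # toy 2x2 filter
result = conv2d_valid(image, kernel)
print(result.shape)  # (3, 3)
print(result)        # every entry is -5 for this input and filter
```

Each output entry here is the top-left value of the patch minus its bottom-right value, which is why the toy input gives a constant result.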
Hence, input images have a width, a height, and a depth. The depth of the input image
defines the depth of the input layer. Consequently, the depth of the second layer depends
on the number of kernels we use in the input layer.
For instance, let the input image I have dimension H × W × C and the filter or kernel K have
dimension k × k × C. Notice that the kernel depth must be the same as the depth of the input
image. Let the convolution between I and K be the matrix O. With a kernel step of one, the
matrix O has dimension (H − k + 1) × (W − k + 1). Similarly, if we apply two filters K1 and K2,
we’ll get two result matrices O1 and O2. After that, we stack O1 and O2 together into one tensor
with dimension (H − k + 1) × (W − k + 1) × 2. Analogously, if we apply N filters, the output
tensor will have dimension (H − k + 1) × (W − k + 1) × N, where N defines the depth.
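The relationship between the number of filters and the output depth can be checked with a short NumPy sketch. The 6x6x3 input and the two 3x3x3 filters are hypothetical sizes picked for illustration, and conv_layer is an illustrative helper, not a library function.

```python
import numpy as np


def conv_layer(image, kernels):
    """Apply each kernel to a multi-channel image (step one, no padding).

    image:   array of shape (H, W, C)
    kernels: array of shape (N, k, k, C); kernel depth C matches the image depth
    returns: array of shape (H - k + 1, W - k + 1, N)
    """
    h, w, c = image.shape
    n, k, _, kc = kernels.shape
    assert kc == c, "kernel depth must equal input depth"
    out = np.zeros((h - k + 1, w - k + 1, n))
    for f in range(n):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                patch = image[i:i + k, j:j + k, :]
                out[i, j, f] = np.sum(patch * kernels[f])
    return out


rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # hypothetical 6x6 RGB input
kernels = rng.random((2, 3, 3, 3))   # two hypothetical 3x3x3 filters
output = conv_layer(image, kernels)
print(output.shape)  # (4, 4, 2)
```

Each filter contributes one slice of the output, so the output depth equals the number of filters regardless of the input depth.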
Conclusion
In this short section, we presented the relationship between the term “depth” and CNNs.
Dimensions such as width, height, and depth often sound confusing for beginners, and because
of that, we provided a simple example with illustrations.
Activation functions are a fundamental
component in neural networks. They introduce non-linearity into the network, enabling it to learn
and represent more complex patterns in the data. Without activation functions, a neural network,
regardless of how many layers it has, would behave like a single-layer linear perceptron, which
can only learn linear mappings.
Input Layer: This layer accepts input features. It provides information from the outside world
to the network. No computation is performed at this layer; its nodes just pass on the
information (features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to the
output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to
introduce non-linearity into the output of a neuron.
Explanation: A neural network has neurons that work with weights, biases, and their
respective activation functions. In a neural network, we update the weights and biases of
the neurons on the basis of the error at the output. This process is known as
back-propagation. Activation functions make back-propagation possible since the gradients
are supplied along with the error to update the weights and biases.
A neural network without an activation function is essentially just a linear regression model.
The activation function applies a non-linear transformation to the input, making the network
capable of learning and performing more complex tasks.
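Mathematical proof
Consider a network with one hidden layer (layer 1) and an output layer (layer 2), with no
non-linear activation function.
Hidden layer (layer 1):
z(1) = W(1) * X + b(1)
a(1) = z(1)
Here,
z(1) is the vectorized output of layer 1
W(1) is the vectorized weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
X is the vectorized input features
b(1) is the vectorized bias assigned to the neurons in the hidden layer, i.e. b1 and b2
Output layer (layer 2), whose input is the output of layer 1:
z(2) = W(2) * a(1) + b(2)
a(2) = z(2)
Substituting a(1):
z(2) = W(2) * [W(1) * X + b(1)] + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output: z(2) = W * X + b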
This observation results again in a linear function even after applying a hidden layer. Hence we
can conclude that, no matter how many hidden layers we attach to a neural net, all layers will
behave the same way, because the composition of two linear functions is itself a linear function.
A neuron cannot learn non-linear patterns with just a linear function attached to it. A non-linear
activation function lets it learn from the difference with respect to the error. Hence we need an
activation function.
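The collapse of stacked linear layers can be verified numerically. The sketch below uses randomly drawn weights and hypothetical sizes (2 inputs, 2 hidden units, 1 output); the variable names mirror the symbols in the proof, and the use of tanh in the last step is only an illustrative choice of non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 inputs, 2 hidden units, 1 output.
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))
X = rng.normal(size=(2, 5))  # five example inputs stored as columns

# Two layers with no activation function.
z1 = W1 @ X + b1
z2 = W2 @ z1 + b2

# The equivalent single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
collapsed = W @ X + b

print(np.allclose(z2, collapsed))  # True

# With a non-linear activation (tanh here) in the hidden layer,
# the output is in general no longer matched by the single linear layer.
z2_nonlinear = W2 @ np.tanh(z1) + b2
print(np.allclose(z2_nonlinear, collapsed))
```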
Linear Function
Equation: A linear function has an equation similar to that of a straight line, i.e. y = x. No
matter how many layers we have, if all are linear in nature, the final activation of the last
layer is nothing but a linear function of the input of the first layer.
Range: -inf to +inf
Uses: The linear activation function is used in just one place, i.e. the output layer.
Issues: If we differentiate a linear function, the result no longer depends on the input “x”:
the derivative is a constant, so it won’t introduce any ground-breaking behavior to our
algorithm.
For example: Calculating the price of a house is a regression problem. The house price may
have any big or small value, so we can apply a linear activation at the output layer. Even in
this case, the neural net must have a non-linear function in the hidden layers.
Sigmoid Function
It is a function that plots as an ‘S’-shaped graph.
Equation: A = 1/(1 + e^(-x))
Nature: Non-linear. Notice that for x values between -2 and 2, the curve is very steep. This
means that small changes in x bring about large changes in the value of y.
Value Range: 0 to 1
Uses: Usually used in the output layer of a binary classifier, where the result is either 0 or 1.
Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be
predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function
An activation that often works better than the sigmoid function in hidden layers is the tanh
function, also known as the hyperbolic tangent function. It is actually a mathematically scaled
and shifted version of the sigmoid function. Both are similar and can be derived from each other.
Equation: tanh(x) = 2/(1 + e^(-2x)) - 1 = 2 * sigmoid(2x) - 1
Uses: Usually used in the hidden layers of a neural network. Its values lie between -1 and 1,
so the mean of the hidden layer's outputs comes out to be 0 or very close to it, which helps
in centering the data by bringing the mean close to 0. This makes learning for the next layer
much easier.
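The statement that tanh and sigmoid can be derived from each other can be checked numerically. The following sketch assumes the standard identity tanh(x) = 2 * sigmoid(2x) - 1 and compares the mean output of both functions on inputs that are symmetric around zero.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-3.0, 3.0, 13)

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

# On inputs symmetric around zero, tanh outputs average to (about) zero,
# while sigmoid outputs average to (about) 0.5.
print(np.tanh(x).mean(), sigmoid(x).mean())
```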
ReLU Function
It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of a neural network.
Equation: A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
Value Range: [0, inf)
Nature: Non-linear, which means we can easily back-propagate the errors and have multiple
layers of neurons being activated by the ReLU function.
Uses: ReLU is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At any time only a few neurons are activated, which makes
the network sparse and therefore efficient and easy to compute. In simple words, ReLU
learns much faster than the sigmoid and tanh functions.
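A minimal NumPy sketch of ReLU and of the sparsity it produces is shown below. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np


def relu(x):
    """A(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary pre-activation values
out = relu(x)
print(out)

# Fraction of units that output exactly zero (inactive units).
inactive = np.mean(out == 0.0)
print(inactive)  # 0.6
```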
Softmax Function
Nature: Non-linear
Uses: Usually used when handling multiple classes. The softmax function is commonly found in
the output layer of image classification problems. It squeezes the output for each class
between 0 and 1 by exponentiating each output and dividing by the sum of the exponentiated
outputs, so that the results sum to 1.
Output: The softmax function is ideally used in the output layer of a classifier, where we are
trying to obtain the probabilities that define the class of each input.
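As a minimal sketch, softmax can be written in a few lines of NumPy. The scores are hypothetical raw outputs for three classes, and subtracting the maximum is a common numerical-stability step that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Map a vector of raw class scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)


scores = np.array([2.0, 1.0, 0.1])      # hypothetical raw outputs for three classes
probs = softmax(scores)
predicted = int(np.argmax(probs))       # index of the most probable class
print(probs, probs.sum())
print(predicted)  # 0
```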
The basic rule of thumb is: if you really don’t know which activation function to use,
simply use ReLU, as it is a general-purpose activation function for hidden layers and is used
in most cases these days.
If your output is for binary classification, then the sigmoid function is a very natural choice
for the output layer.
If your output is for multi-class classification, then softmax is very useful for predicting
the probability of each class.
Unsupervised training of neural networks involves training the network on input data without
explicit supervision or labeled target outputs. Instead of using labeled data (as in supervised
learning), the network learns to extract features, patterns, or representations directly from the
input data without specific guidance on what to learn.
Here are some common methods for unsupervised training of neural networks:
Autoencoders:
Autoencoders are a type of neural network designed for unsupervised learning and
dimensionality reduction. They consist of an encoder network that maps input data to a
lower-dimensional latent space representation, and a decoder network that reconstructs the input
data from the latent space. During training, the network learns to minimize the reconstruction
error, effectively learning a compressed representation of the input data.
Generative Adversarial Networks (GANs):
GANs are a type of generative model that consists of two neural networks: a generator and a
discriminator. The generator generates fake samples that resemble the training data, while the
discriminator tries to distinguish between real and fake samples. Through adversarial training,
the generator learns to produce more realistic samples, while the discriminator learns to become
better at distinguishing real from fake samples.
Self-Supervised Learning:
Self-supervised learning is a technique where a model is trained to solve a pretext task using the
input data itself, without requiring explicit labels. The learned representations from this pretext
task can then be used as features for downstream supervised tasks. Examples of pretext tasks
include predicting missing parts of an input (e.g., predicting masked pixels in an image) or
predicting the future in a sequence (e.g., predicting the next word in a sentence).
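The next-word pretext task can be illustrated by showing how training pairs are built from unlabeled text. The helper next_word_pairs and the context size of two words are illustrative assumptions, not part of any particular library.

```python
def next_word_pairs(sentence, context_size=2):
    """Turn unlabeled text into (context, target) training pairs."""
    words = sentence.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))   # the target comes from the data itself
    return pairs


pairs = next_word_pairs("the network learns from the data itself")
for context, target in pairs:
    print(context, "->", target)
```

No human labeling is involved: every target is simply the word that follows its context in the original sentence.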
Clustering:
Neural networks can be trained to perform clustering, where the network learns to group similar
data points together. This can be done using techniques such as competitive learning or using
specific network architectures designed for clustering tasks.
Sparse Coding:
Sparse coding is a method that aims to represent input data as a combination of a small number
of basis functions.
Variational Autoencoders (VAEs):
Variational autoencoders (VAEs) are a type of autoencoder that learns to generate new data
samples by modeling the probability distribution of the input data in the latent space.
Unsupervised learning is particularly useful for tasks where labeled data is scarce or
expensive to obtain. It can also be used for tasks such as data denoising, feature learning, and
generative modeling, where the goal is to learn useful representations of the input data without
explicit supervision.
Supervised learning:
In supervised learning, the artificial neural network is trained under the supervision of a teacher
(say, a system designer) who uses his or her knowledge of the system to prepare labeled data
sets. The network learns by receiving inputs together with their target outputs for a set of
observations from the labeled data. Training compares the network's output with the target and
computes the error between them. This error signal is propagated backward through the network
to adjust the weights that interconnect the neurons, with the aim of reducing the error and
optimizing performance. Fine-tuning of the network proceeds until it finds the set of weights
that minimizes the discrepancy between the output and the target output. The supervised
learning process is used to solve classification and regression problems. The output of a
supervised learning algorithm can either be a classifier or a predictor. The application of this
process is restricted to cases where the supervisor's knowledge of the system is sufficient to
supply the network's input and target output pairs for training.
Unsupervised learning:
Unsupervised learning is used when it is impractical to augment the training data sets with class
identities (labels). This difficulty arises in situations where there is no knowledge of the
system, or the cost of obtaining such knowledge is too high. In unsupervised learning, as its
name suggests, the ANN is not under the guidance of a "teacher." Instead, it is provided with
unlabeled data sets (containing only the input data) and left to discover the patterns in the data
and build a new model from them. In this situation, the ANN figures out how to arrange the data
by exploiting the separation between clusters within it.
Reinforcement learning:
Reinforcement learning is another type of learning that does not rely on labeled target outputs.
It involves interacting with a system, observing the state of that system, choosing an action to
change this state, sending the action to the system, and receiving a numerical reward or penalty
as feedback, which can be positive or negative, with the goal of learning a policy. Actions that
increase the reward are found by trial and error. The figure illustrates the block diagram to
describe the concept of reinforcement learning. Reinforcement learning and unsupervised
learning differ from each other in many respects. Reinforcement learning learns a policy by
maximizing a reward. The objective of unsupervised learning is to exploit the similarities
and differences in the input data, which are later used for categorization.
While supervised learning leads to regression and classification, unsupervised learning performs
the tasks of pattern recognition, data dimensionality reduction, and clustering. Unsupervised
learning is aimed at discovering patterns in the input data. Recognizing patterns in unlabeled
datasets leads to clustering. Pattern recognition is one of the significant stages of recognition
systems. It has found application in data mining, document classification, disease diagnosis,
face recognition, etc.
Data mining, as its name suggests, involves automatically or semi-automatically extracting
useful information and patterns from huge datasets. Self-organizing maps are artificial neural
network algorithms used for data mining. Huge data sets can be analyzed and visualized
efficiently with self-organizing maps. Unsupervised neural networks based on the
self-organizing map have been used to cluster medical data with three subspaces: patients'
drugs, body locations, and physiological abnormalities. The self-organizing map has also been
used to analyze and visualize yeast gene expression, and was identified as an excellent, quick,
and convenient procedure for organizing and interpreting huge data sets of that kind.
Unsupervised learning also performs the task of reducing the number of variables in
high-dimensional data, a process known as dimensionality reduction. Dimensionality reduction
tasks can be further divided into feature extraction and feature selection. Feature selection
involves selecting a subset of the significant variables from the original dataset. Transforming
the dataset from a high-dimensional space to a low-dimensional space is considered feature
extraction. Principal component analysis is one of the best strategies for extracting linear
features. Autoencoders with effectively initialized weights have been shown to be a better tool
than principal component analysis for data dimensionality reduction. Dimensionality reduction
is normally performed in the pre-processing phase of other tasks to minimize computational
complexity and improve the performance of machine learning models. For example, an
unsupervised learning algorithm has been used to reduce the dimension of the data before
classification, improving performance and computational speed.
A Restricted Boltzmann Machine (RBM) is a type of generative stochastic artificial neural
network that can learn a probability distribution over its set of inputs. RBMs are particularly
well-suited for unsupervised learning tasks, such as dimensionality reduction, feature learning,
and collaborative filtering.
Architecture:
RBMs consist of two layers of nodes: a visible layer and a hidden layer. Each node in one layer
is connected to each node in the other layer, but nodes within the same layer are not connected.
The nodes in each layer can be binary (0 or 1) or real-valued, depending on the type of RBM
(binary or Gaussian RBM).
Energy-Based Model:
RBMs are based on an energy-based model, where each configuration of the visible and hidden
units is assigned an energy value. Lower energy configurations are more likely to occur. The
probability distribution over the visible and hidden units is defined by the Boltzmann
distribution, which assigns a probability to each configuration based on its energy.
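The link between energy and probability can be made concrete on a very small binary RBM. The sketch below assumes the commonly used energy function E(v, h) = -a·v - b·h - vᵀWh and enumerates every configuration so that the normalizing constant Z can be computed exactly; the sizes and random weights are hypothetical.

```python
import itertools
import numpy as np


def energy(v, h, W, a, b):
    """Energy of one joint configuration of a binary RBM."""
    return -(a @ v) - (b @ h) - (v @ W @ h)


rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2                       # tiny sizes so Z can be enumerated
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
a = np.zeros(n_visible)                          # visible biases
b = np.zeros(n_hidden)                           # hidden biases

# Enumerate all 2^(3+2) = 32 binary configurations.
configs = [
    (np.array(v), np.array(h))
    for v in itertools.product([0, 1], repeat=n_visible)
    for h in itertools.product([0, 1], repeat=n_hidden)
]
energies = np.array([energy(v, h, W, a, b) for v, h in configs])

# Boltzmann distribution: p(v, h) = exp(-E(v, h)) / Z
unnormalized = np.exp(-energies)
Z = unnormalized.sum()
probs = unnormalized / Z

lowest = int(np.argmin(energies))
print(probs.sum())
print(lowest == int(np.argmax(probs)))  # the lowest-energy configuration is the most probable
```

For realistic layer sizes the number of configurations grows exponentially, which is why Z cannot be enumerated in practice and approximate training methods are used.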
Training (Learning):
RBMs are trained using a technique called Contrastive Divergence (CD) or its variants, which is
a form of stochastic gradient descent. During training, the model learns to adjust its weights to
minimize the difference between the observed data distribution and the distribution of the
model's generated samples.
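A single Contrastive Divergence step (CD-1) for a binary RBM can be sketched as follows. This is a simplified illustration under assumed settings (full-batch updates, a learning rate of 0.1, toy binary patterns); the function name cd1_update is hypothetical, and practical implementations typically add details such as mini-batches and momentum.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, a, b, rng, lr=0.1):
    """One CD-1 update for a binary RBM on a batch of visible vectors v0."""
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Move weights toward data statistics and away from model statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a = a + lr * (v0 - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W, a, b


rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)  # toy binary patterns
W = rng.normal(scale=0.01, size=(4, 2))
a, b = np.zeros(4), np.zeros(2)

for _ in range(200):
    W, a, b = cd1_update(data, W, a, b, rng)

# Probabilities of the visible units after one up-down pass through the trained model.
reconstruction = sigmoid(sigmoid(data @ W + b) @ W.T + a)
print(np.round(reconstruction[:2], 2))
```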
Applications:
RBMs have been used for various tasks, including dimensionality reduction, feature learning,
collaborative filtering, and deep learning pre-training. In deep learning, RBMs have been used as
building blocks for training deep belief networks (DBNs), which are composed of multiple layers
of RBMs stacked on top of each other.
Gaussian RBM:
While the traditional RBM uses binary units, Gaussian RBMs allow for real-valued inputs and
outputs. This makes them suitable for modeling continuous data distributions.
Probabilistic Modeling:
RBMs can be used to model complex probability distributions over the input data, capturing
dependencies and patterns in the data without requiring labeled examples.
Overall, RBMs are
powerful tools for unsupervised learning, especially in scenarios where the underlying structure
of the data is complex and difficult to capture using traditional statistical methods. They have
been instrumental in advancing the field of deep learning and have found applications in various
domains, including computer vision, natural language processing, and recommendation systems.
The general architecture of an autoencoder includes an encoder, a decoder, and a bottleneck layer.
Encoder
The hidden layers progressively reduce the dimensionality of the input, capturing important
features and patterns. These layers compose the encoder. The bottleneck layer (latent space) is
the final hidden layer, where the dimensionality is significantly reduced. This layer represents
the compressed encoding of the input data.
Decoder
The decoder takes the encoded representation from the bottleneck layer and expands it back to
the dimensionality of the original input. The hidden layers progressively increase the
dimensionality and aim to reconstruct the original input. The output layer produces the
reconstructed output, which ideally should be as close as possible to the input data.
The loss function used during training is typically a reconstruction loss, measuring the
difference between the input and the reconstructed output. Common choices include mean
squared error (MSE) for continuous data or binary cross-entropy for binary data. During
training, the autoencoder learns to minimize the reconstruction loss, forcing the network to
capture the most important features of the input data in the bottleneck layer.
After the training process, only the encoder part of the autoencoder is retained, to encode data
of a similar type to that used in training.
The different ways to constrain the network are:
Keep small Hidden Layers: If the size of each hidden layer is kept as small as possible, then
the network will be forced to pick up only the representative features of the data, thus encoding
the data.
Regularization: In this method, a loss term is added to the cost function which encourages the
network to train in ways other than copying the input.
Denoising: Another way of constraining the network is to add noise to the input and teach the
network how to remove the noise from the data.
Tuning the Activation Functions: This method involves changing the activation functions of
various nodes so that a majority of the nodes are dormant, thus effectively reducing the size of
the hidden layers.
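The reconstruction losses and the denoising constraint described above can be sketched in NumPy. The function names, the Gaussian noise model, and the noise level of 0.1 are assumptions made for illustration; the data is a random stand-in with values in [0, 1].

```python
import numpy as np


def mse(x, x_hat):
    """Mean squared error, a common reconstruction loss for continuous data."""
    return np.mean((x - x_hat) ** 2)


def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Binary cross-entropy, a common reconstruction loss for data in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))


def add_gaussian_noise(x, rng, std=0.1):
    """Corrupt the input for a denoising setup; the target stays the clean x."""
    return np.clip(x + rng.normal(scale=std, size=x.shape), 0.0, 1.0)


rng = np.random.default_rng(0)
x = rng.random((4, 8))                 # hypothetical batch with values in [0, 1]
x_noisy = add_gaussian_noise(x, rng)

# A perfect reconstruction has zero MSE; the noisy copy does not.
print(mse(x, x), mse(x, x_noisy))
print(binary_cross_entropy(x, x_noisy))
```

In a denoising setup the network would receive x_noisy as input while the loss is still computed against the clean x.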
There are diverse types of autoencoders. Below we analyze the advantages and disadvantages
associated with the different variations:
A denoising autoencoder works on a partially corrupted input and trains to recover the original
undistorted image. As mentioned above, this method is an effective way to constrain the
network from simply copying the input, so that it learns the underlying structure and important
features of the data.
Advantages
This type of autoencoder can extract important features and reduce the noise or the useless
features. Denoising autoencoders can be used as a form of data augmentation: the restored
images can be used as augmented data, thus generating additional training samples.
Disadvantages
Selecting the right type and level of noise to introduce can be challenging and may require
domain knowledge. The denoising process can result in the loss of some information that is
needed from the original input. This loss can impact the accuracy of the output.
Advantages
The sparsity constraint in sparse autoencoders helps in filtering out noise and irrelevant
features during the encoding process. These autoencoders often learn important and
meaningful features due to their emphasis on sparse activations.
Disadvantages
The choice of hyperparameters plays a significant role in the performance of this autoencoder.
Different inputs should result in the activation of different nodes of the network.
A variational autoencoder makes strong assumptions about the distribution of latent variables
and uses the Stochastic Gradient Variational Bayes (SGVB) estimator in the training process. It
assumes that the data is generated by a directed graphical model and tries to learn an
approximation q_φ(z|x) to the conditional distribution p_θ(z|x), where φ and θ are the
parameters of the encoder and the decoder respectively.
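Two ingredients commonly used when training a variational autoencoder can be sketched in NumPy: sampling the latent variable with the reparameterization trick, and the closed-form KL divergence to a standard normal prior. The sketch assumes a diagonal Gaussian encoder with a standard normal prior, which the text above does not state explicitly; mu and log_var are hypothetical encoder outputs.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)


rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, -1.0]])        # hypothetical encoder means
log_var = np.array([[0.0, 0.0], [0.0, 0.0]])    # log-variance 0 means sigma = 1

z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape)   # (2, 2)
print(kl)        # 0 for the first row, 1 for the second
```

The first row already matches the prior, so its KL term is zero; shifting the mean away from zero in the second row increases it.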
Advantages
Variational autoencoders are used to generate new data points that resemble the original
training data. These samples are drawn from the learned latent space. The variational
autoencoder is a probabilistic framework that learns a compressed representation of the data
that captures its underlying structure and variations, so it is useful for anomaly detection and
data exploration.
Disadvantages
Variational autoencoders use approximations to estimate the true distribution of the latent
variables. This approximation introduces some level of error, which can affect the quality of
generated samples. The generated samples may only cover a limited subset of the true data
distribution. This can result in a lack of diversity in generated samples.
Advantages
A convolutional autoencoder can compress high-dimensional image data into lower-dimensional
data. This improves the storage efficiency and transmission of image data. A convolutional
autoencoder can also reconstruct missing parts of an image, and it can handle images with slight
variations in object position or orientation.
Disadvantages
This autoencoder is prone to overfitting. Proper regularization techniques should be used to
tackle this issue. Compression of data can cause data loss, which can result in the
reconstruction of a lower-quality image.
We’ve created an autoencoder comprising two Dense layers: an encoder responsible for
condensing the images into a 64-dimensional latent vector, and a decoder tasked with
reconstructing the initial image based on this latent space.
For the implementation, we are going to import matplotlib, numpy, pandas, sklearn and Keras.
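The two-layer structure described above can be sketched as a forward pass in plain NumPy. This is an untrained stand-in for the Keras model, not the original implementation: the class name DenseAutoencoder, the ReLU and sigmoid activations, and the input size of 784 (a flattened 28x28 image) are assumptions made for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class DenseAutoencoder:
    """Untrained two-layer autoencoder: one dense encoder, one dense decoder."""

    def __init__(self, n_inputs, n_latent, rng):
        self.W_enc = rng.normal(scale=0.01, size=(n_inputs, n_latent))
        self.b_enc = np.zeros(n_latent)
        self.W_dec = rng.normal(scale=0.01, size=(n_latent, n_inputs))
        self.b_dec = np.zeros(n_inputs)

    def encode(self, x):
        # Condense each input into an n_latent-dimensional vector.
        return relu(x @ self.W_enc + self.b_enc)

    def decode(self, z):
        # Expand the latent vector back to the input dimensionality.
        return sigmoid(z @ self.W_dec + self.b_dec)


rng = np.random.default_rng(0)
images = rng.random((5, 784))  # stand-in for five flattened 28x28 images (assumed size)
model = DenseAutoencoder(n_inputs=784, n_latent=64, rng=rng)

latent = model.encode(images)
reconstructed = model.decode(latent)
print(latent.shape, reconstructed.shape)  # (5, 64) (5, 784)
```

Training is omitted here; in practice the weights would be fitted by minimizing a reconstruction loss between images and reconstructed.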
Autoencoders are a type of artificial neural network used for unsupervised learning of efficient
codings. The key idea of an autoencoder is to learn a representation (encoding) for a set of data,
typically for the purpose of dimensionality reduction, feature extraction, or feature learning.
The autoencoder aims to encode the input data into a lower-dimensional representation and then
decode it back to reconstruct the original input as accurately as possible. Here are some
common types of autoencoders and their applications in deep learning:
Basic autoencoder: A basic form of autoencoder with a symmetrical architecture, where the
number of neurons in the hidden layer(s) is less than the number of input/output neurons.
Applications: Dimensionality reduction, data denoising, feature learning.
Denoising autoencoder: Trained to reconstruct the original, clean input from a corrupted version
of the input (e.g., by adding noise). Applications: Removing noise from images, data
preprocessing for improved generalization.
Variational autoencoder: A type of generative model that learns a probabilistic mapping from
input data to a latent space, where each point represents a probability distribution over the input
data. Applications: Generating new data samples, image generation, unsupervised
representation learning.
Convolutional autoencoder: Utilizes convolution layers for both the encoder and decoder parts of
the network, making it well-suited for handling high-dimensional input data such as images.
Applications: Image compression, image denoising, feature learning in computer vision tasks.
Recurrent autoencoder: Designed for sequential data (e.g., time series, natural language) and
incorporates recurrent neural network (RNN) layers in the encoder and/or decoder.
Applications: Sequence-to-sequence learning, time series forecasting, natural language
processing.
Autoencoders have a wide range of applications in deep learning, including but not limited to
dimensionality reduction, data denoising, feature learning, generative modeling, and
representation learning. They are particularly useful in scenarios where labeled data is scarce or
when the goal is to learn useful representations of the input data in an unsupervised manner.
Deep learning has a wide range of applications across various fields due to its ability to learn
from large amounts of data and make complex decisions. Some common applications of deep
learning include:
Computer Vision: Deep learning models are used for tasks like image recognition, object
detection, facial recognition, and image generation.
Natural Language Processing (NLP): Deep learning is used for language translation, sentiment
analysis, chatbots, and text generation.
Speech Recognition: Deep learning powers speech recognition systems used in virtual
assistants, voice-controlled devices, and speech-to-text applications.
Healthcare: Deep learning is used for medical image analysis, disease detection, drug discovery,
and personalized medicine.
Autonomous Vehicles: Deep learning is used in self-driving cars for tasks like object detection,
lane detection, and decision-making.
Finance: Deep learning is used for fraud detection, risk assessment, algorithmic trading, and
customer service automation.
Robotics: Deep learning is used in robotics for object recognition, motion planning, and
autonomous navigation.
Gaming: Deep learning is used in game development for tasks like character behavior modeling
and game testing.
Manufacturing: Deep learning is used for quality control, predictive maintenance, and process
optimization in manufacturing.
These are just a few examples, and the list continues to grow as researchers and practitioners
explore new ways to apply deep learning to solve complex problems.