Deep Learning Notes

The document provides information about deep learning modules 1 through 5. Module 1 discusses recurrent neural networks (RNNs) with an example, advantages and disadvantages of RNNs, and the steps to train a neural network with RNNs. Module 2 defines self-organizing maps and explains their working. Module 3 discusses echo state networks, their characteristics and applications. Module 4 explains training RNNs with backpropagation through time. Module 5 explains long short-term memory (LSTM) networks, their components, and real-world applications such as text prediction and stock prediction.

Uploaded by

AJAY SINGH NEGI

DEEP LEARNING MODULE-1 TO MODULE-5

MODULE-1

1) Briefly explain RNN with an example?

Ans: Recurrent Neural Networks (RNNs) are a type of neural network in which
the output from the previous step is fed as input to the current step. In
traditional neural networks, all the inputs and outputs are independent of each
other, but in cases such as predicting the next word of a sentence, the previous
words are required, so the network needs to remember them. RNNs address this
with the help of a hidden layer. The main and most important feature of an RNN
is its hidden state, which remembers some information about a sequence.
An RNN has a “memory” which retains information about what has been calculated
so far. It uses the same parameters for each input, as it performs the same task
on all the inputs or hidden layers to produce the output. This keeps the number
of parameters lower than in other neural networks.
The working of an RNN can be understood with the help of the example below:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and
one output layer. As in other neural networks, each hidden layer has its own
set of weights and biases: say (w1, b1) for the first hidden layer, (w2, b2) for
the second and (w3, b3) for the third. This means that these layers are
independent of each other, i.e. they do not memorize the previous outputs.
Now the RNN will do the following:
• The RNN converts the independent activations into dependent activations by
providing the same weights and biases to all the layers. This limits the growth
in the number of parameters, and each previous output is memorized by giving it
as input to the next hidden layer.
• Hence these three layers can be joined into a single recurrent layer in which
the weights and biases of all the hidden layers are the same.
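
The weight sharing described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the names W_xh, W_hh and b_h, the tanh activation and the layer sizes are not taken from the notes.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple recurrent layer over a sequence.

    The same weights (W_xh, W_hh, b_h) are reused at every time step.
    """
    h = np.zeros(W_hh.shape[0])              # initial hidden state
    states = []
    for x_t in inputs:                       # one time step at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(4, 3))    # input -> hidden (assumed sizes)
W_hh = rng.normal(scale=0.5, size=(4, 4))    # hidden -> hidden (the recurrence)
b_h = np.zeros(4)
sequence = rng.normal(size=(5, 3))           # 5 time steps, 3 features each

hidden_states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 4)
```

Each row of hidden_states depends on all earlier inputs through h, which is the "memory" the answer refers to.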

2) What are the advantages and disadvantages of RNN and mention the
steps involved to train neural network through RNN?

Ans: Advantages of Recurrent Neural Network

• An RNN retains information through time. It is useful in time series
prediction precisely because it can also remember previous inputs. The Long
Short-Term Memory (LSTM) network is an RNN variant built around this ability.

• Recurrent neural networks are also used with convolutional layers to extend
the effective pixel neighbourhood.

Disadvantages of Recurrent Neural Network

• Vanishing and exploding gradient problems.

• Training an RNN is a very difficult task.

• It cannot process very long sequences if tanh is used as the activation
function.

The following steps are performed to train a neural network through RNN:

• A single time step of the input is provided to the network.
• The current state is calculated from the current input and the previous
state.
• The current state h_t becomes h_(t-1) for the next time step.
• One can go through as many time steps as the problem requires and join the
information from all the previous states.
• Once all the time steps are completed, the final current state is used to
calculate the output.
• The output is then compared to the actual output, i.e. the target output, and
the error is generated.
• The error is then back-propagated through the network to update the weights,
and hence the network (RNN) is trained.

3) Explain the echo state networks and their applications?

Ans: An echo state network is a type of Recurrent Neural Network that is part of
the reservoir computing framework. Reservoir computing is an extension of neural
networks in which the input signal is connected to a fixed (non-trainable),
random dynamical system (the reservoir), thus creating a higher-dimensional
representation (embedding). This embedding is then connected to the desired
output via trainable units. An echo state network has the following
particularities:

• the weights between the input and the hidden layer (the ‘reservoir’), W_in,
and also the weights inside the reservoir are randomly assigned and not
trainable

• the weights of the output neurons (the ‘readout’ layer) are trainable and can
be learned so that the network can reproduce specific temporal patterns

• the hidden layer (or the ‘reservoir’) is very sparsely connected (typically
< 10% connectivity)

• the reservoir architecture creates a recurrent non-linear embedding (H) of the
input, which can then be connected to the desired output; these final weights
are trainable

• it is possible to connect the embedding to a different predictive model (a
trainable NN or a ridge regressor/SVM for classification problems).

The following are the reasons why and when to use an echo state network:
• Traditional NN architectures suffer from the vanishing/exploding gradient
problem, so the parameters in the hidden layers either barely change or lead to
numeric instability and chaotic behavior. Echo state networks don’t suffer from
this problem, because the reservoir weights are never trained.

• Traditional NN architectures are computationally expensive, whereas echo state
networks are very fast, as there is no backpropagation phase on the reservoir.
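
A minimal NumPy sketch of these ideas is given below. The reservoir size, the spectral radius of 0.9, the washout length, the ridge strength and the sine-wave task are all assumptions made for illustration; only the readout weights W_out are fitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100

# Fixed random input weights and sparse reservoir weights (never trained).
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W_res[rng.random((n_res, n_res)) > 0.1] = 0.0             # keep about 10% of connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius 0.9 (assumed)

def run_reservoir(u):
    """Collect reservoir states H for an input sequence u of shape (T, n_in)."""
    h = np.zeros(n_res)
    H = []
    for u_t in u:
        h = np.tanh(W_in @ u_t + W_res @ h)
        H.append(h)
    return np.array(H)

# Toy task: predict the next value of a sine wave.
t = np.arange(0, 60, 0.1)
signal = np.sin(t)
u, y = signal[:-1, None], signal[1:]

H = run_reservoir(u)
washout = 50                                  # discard the initial transient states
H_fit, y_fit = H[washout:], y[washout:]

# Train only the readout, using ridge regression in closed form.
ridge = 1e-6
W_out = np.linalg.solve(H_fit.T @ H_fit + ridge * np.eye(n_res), H_fit.T @ y_fit)

mse = np.mean((H_fit @ W_out - y_fit) ** 2)
print(f"training MSE: {mse:.2e}")
```

Because the readout is a linear least-squares fit, no gradients flow through the recurrent weights, which is why the vanishing/exploding gradient problem does not arise here.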

4) Explain the training of RNN with backpropagation through time algorithm?

Ans: Backpropagation Through Time (BPTT) is the algorithm that is used to
update the weights in a recurrent neural network. One of the common examples
of a recurrent neural network is LSTM. Understanding backpropagation is
essential if you want to effectively frame sequence prediction problems for a
recurrent neural network. You should also be aware of the effect of
Backpropagation Through Time on the stability and speed of training.
The ultimate goal of the backpropagation algorithm is to minimize the error of
the network outputs.

The general algorithm is

1. First, present the input pattern and propagate it through the network to get
the output.

2. Then compare the predicted output to the expected output and calculate the
error.

3. Then calculate the derivatives of the error with respect to the network
weights.

4. Adjust the weights so that the error is minimized.

The backpropagation algorithm is suitable for feed-forward neural networks on
fixed-size input-output pairs.

Backpropagation Through Time is the application of the backpropagation training
algorithm to sequence data such as time series. It is applied to the recurrent
neural network. The recurrent neural network is shown one input each timestep
and predicts the corresponding output. So, we can say that BPTT works by
unrolling all input timesteps. Each timestep has one input, one output and one
copy of the network. Then the errors are calculated and accumulated for each
timestep. The network is then rolled back up and the weights are updated.
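
The unrolling and error accumulation can be sketched for a very small RNN as follows. This is a simplified illustration, not a general implementation: it assumes a tanh hidden layer, a linear output, a squared-error loss on the final output only, and made-up sizes and learning rate.

```python
import numpy as np

def bptt(xs, target, W_xh, W_hh, W_hy):
    """Forward pass, then backpropagation through time for a tiny RNN.

    Loss is 0.5 * ||y - target||^2 on the output after the last time step.
    Returns the loss and the gradients for the three weight matrices.
    """
    # Forward: unroll the network over all time steps, storing every state.
    hs = [np.zeros(W_hh.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1]))
    y = W_hy @ hs[-1]
    loss = 0.5 * np.sum((y - target) ** 2)

    # Backward: walk the unrolled copies in reverse, accumulating gradients.
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    dy = y - target
    dW_hy = np.outer(dy, hs[-1])
    dh = W_hy.T @ dy
    for t in reversed(range(len(xs))):
        dz = dh * (1.0 - hs[t + 1] ** 2)      # derivative of tanh
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t])
        dh = W_hh.T @ dz                      # pass the error to the previous step
    return loss, dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))                  # 6 time steps, 3 features (assumed)
target = np.array([0.5, -0.5])
W_xh = rng.normal(scale=0.3, size=(4, 3))
W_hh = rng.normal(scale=0.3, size=(4, 4))
W_hy = rng.normal(scale=0.3, size=(2, 4))

losses = []
for _ in range(200):                          # plain gradient descent
    loss, dW_xh, dW_hh, dW_hy = bptt(xs, target, W_xh, W_hh, W_hy)
    losses.append(loss)
    W_xh -= 0.05 * dW_xh
    W_hh -= 0.05 * dW_hh
    W_hy -= 0.05 * dW_hy
print(losses[0], losses[-1])
```

The repeated multiplication by W_hh.T in the backward loop is where gradients can shrink or grow across many time steps.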

5) Explain LSTM and its uses in real world applications?

Ans: LSTM is a special type of Recurrent Neural Network (RNN) capable of
learning long-term dependencies, which is useful for certain types of
prediction that require the network to retain information over longer time
periods, a task that traditional RNNs struggle with.

The chain-like architecture of LSTM allows it to retain information for
longer time periods, solving challenging tasks that traditional RNNs struggle
to or simply cannot solve.

The three major parts of the LSTM include:

• Forget gate: removes information that is no longer necessary for the
completion of the task. This step is essential to optimizing the performance
of the network.

• Input gate: responsible for adding information to the cells.

• Output gate: selects and outputs necessary information.
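
One time step of an LSTM cell with these three gates can be sketched as below. The stacked weight layout, the sizes and the random initialization are assumptions for illustration; libraries arrange their LSTM weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and holds the weights of the
    forget gate, input gate, candidate values and output gate, stacked in
    that order; b has shape (4 * hidden,).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])            # forget gate: what to drop from the cell state
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to add
    g = np.tanh(z[2 * n:3 * n])    # candidate values to add
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as the hidden state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state c is carried forward almost unchanged, which is how information survives over long spans.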

Applications of Long Short-Term Memory Networks

LSTMs can be applied to a variety of deep learning tasks that mostly involve
prediction based on previous information. Two noteworthy examples are text
prediction and stock prediction:

• Text Prediction

The long-term memory capabilities of LSTM mean it excels at predicting text
sequences. In order to predict the next word in a sentence, the network has to
retain all the words that preceded it. One of the most common applications of
text prediction is in chatbots used by ecommerce sites.

• Stock Prediction

Simple Machine Learning (SML) models are able to predict stock values and
prices based on inputs such as the opening value and the volume of the stock.
While these values do take part in stock prediction, they lack a key component.
To properly predict a stock value with high accuracy, the model needs to take
into account one of the biggest factors: the trend of the stock. To do so, the
model needs to identify the trend based on the values recorded over the
preceding days, a task suited to an LSTM network.

MODULE-2

1. Define self organizing maps. Explain the working of SOM.

A self-organizing map (SOM) is a type of artificial neural network (ANN) that is
trained using unsupervised learning to produce a low-dimensional (typically
two-dimensional), discretized representation of the input space of the training
samples, called a map, and is therefore a method for dimensionality reduction.
The data points in the data set compete for representation on the map. The SOM
mapping steps start with initializing the weight vectors. From there a sample
vector is selected randomly and the map of weight vectors is searched to find
which weight best represents that sample. Each weight vector has neighboring
weights that are close to it. The weight that is chosen is rewarded by being
able to become more like that randomly selected sample vector. The neighbors of
that weight are also rewarded by being able to become more like the chosen
sample vector. This allows the map to grow and form different shapes. Most
generally, they form square/rectangular/hexagonal/L shapes in 2D feature space.

2. Explain the algorithm involved in self organizing maps. Explain the cons of
SOM.

The Algorithm:

1. Each node’s weights are initialized.

2. A vector is chosen at random from the set of training data.

3. Every node is examined to calculate which one’s weights are most like the
input vector. The winning node is commonly known as the Best Matching Unit
(BMU).

4. Then the neighbourhood of the BMU is calculated. The number of neighbors
decreases over time.

5. The winning weight is rewarded by becoming more like the sample vector.
The neighbors also become more like the sample vector. The closer a node is to
the BMU, the more its weights get altered, and the farther away the neighbor is
from the BMU, the less it learns.

6. Repeat from step 2 for N iterations.

Cons of Kohonen Maps:

1. It does not build a generative model for the data, i.e. the model does not
understand how data is created.

2. It does not behave well with categorical data, and even worse with
mixed-type data.

3. Preparing the model is slow, and it is hard to train against slowly evolving
data.

3. Explain the training processes of SOM.

SOM doesn’t use backpropagation with SGD to update weights; this type of
unsupervised artificial neural network uses competitive learning to update its
weights.

Competitive learning is based on three processes:

• Competition

• Cooperation

• Adaptation

Competition: each neuron in a SOM is assigned a weight vector with the same
dimensionality as the input space. For an input of dimension n, each neuron of
the output layer holds a weight vector of dimension n. We compute the distance
between each neuron of the output layer and the input data, and the neuron with
the lowest distance is the winner of the competition. The Euclidean metric is
commonly used to measure distance.

Cooperation: the vector of the winner neuron will be updated in the final
process (adaptation), but it is not the only one: its neighbors will be updated
as well. To choose neighbors we use a neighborhood kernel function. This
function depends on two factors: time (incremented with each new input) and the
distance between the winner neuron and the other neuron (how far the neuron is
from the winner neuron).

Adaptation: after choosing the winner neuron and its neighbors, we compute the
neuron updates. The chosen neurons are not all updated by the same amount: the
farther a neuron is from the input data, the less it is adjusted.
The winner neuron and its neighbors are updated with a rule scaled by a learning
rate. This learning rate indicates how much we want to adjust our weights. As
time t goes to positive infinity, the learning rate converges to zero, so there
is no further update even for the winner neuron.
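
The three processes can be sketched together in NumPy. The update used here, weights += lr * h * (x - weights), is the commonly used Kohonen rule and is an assumption, since the notes do not reproduce their formula; the Gaussian kernel, the exponential decay schedule, the grid size and the toy data are also illustrative choices.

```python
import numpy as np

def train_som(data, rows=5, cols=5, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small self-organizing map with competitive learning."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))      # one weight vector per neuron
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]                  # random training sample
        # Competition: the neuron with the smallest Euclidean distance wins.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Both the learning rate and the neighborhood radius shrink over time.
        decay = np.exp(-3.0 * t / n_iter)
        lr, sigma = lr0 * decay, sigma0 * decay
        # Cooperation: Gaussian neighborhood kernel centered on the winner.
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # Adaptation: move each neuron toward x, scaled by the kernel.
        weights += lr * h[..., None] * (x - weights)
    return weights

def quantization_error(data, weights):
    """Mean distance from each sample to its best matching unit."""
    flat = weights.reshape(-1, weights.shape[-1])
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Toy data: three tight clusters in 3-D (made up for this example).
rng = np.random.default_rng(1)
centers = np.array([[0.2, 0.2, 0.2], [0.8, 0.2, 0.5], [0.5, 0.8, 0.8]])
data = centers[rng.integers(3, size=300)] + 0.03 * rng.normal(size=(300, 3))

som = train_som(data)
print(quantization_error(data, som))
```

The quantization error is one simple way to check that the map has moved toward the data; it is not part of the training procedure itself.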

4. Define and explain the working of K means algorithm.

The K-means algorithm is an iterative algorithm that tries to partition the
dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where
each data point belongs to only one group. It tries to make the intra-cluster
data points as similar as possible while also keeping the clusters as different
(far apart) as possible. It assigns data points to a cluster such that the sum
of the squared distance between the data points and the cluster’s centroid
(arithmetic mean of all the data points that belong to that cluster) is at the
minimum. The less variation we have within clusters, the more homogeneous
(similar) the data points are within the same cluster.

The way the K-means algorithm works is as follows:

1. Specify the number of clusters K.

2. Initialize the centroids by first shuffling the dataset and then randomly
selecting K data points for the centroids without replacement.

3. Keep iterating steps 4 to 6 until there is no change to the centroids, i.e.
the assignment of data points to clusters isn’t changing.

4. Compute the sum of the squared distance between data points and all
centroids.

5. Assign each data point to the closest cluster (centroid).

6. Compute the centroids for the clusters by taking the average of all the data
points that belong to each cluster.

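The steps above can be sketched in NumPy as follows. The function name, the iteration cap, the handling of empty clusters and the two-blob toy data are assumptions added for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)                 # assign to the closest centroid
        # New centroid = mean of the points assigned to it
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stop when nothing changes
            break
        centroids = new_centroids
    inertia = d2.min(axis=1).sum()                 # sum of squares from the last assignment
    return centroids, labels, inertia

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels, inertia = kmeans(X, k=2)
print(np.round(centroids, 1))
```

Running the function with different seed values shows how the random initialization can change which label each cluster receives.
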
5. Explain the application and distance measure of k means clustering.

Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real
life, like:

1. Academic performance
2. Diagnostic systems
3. Search engines
4. Wireless sensor networks

Academic Performance

Based on the scores, students are categorized into grades like A, B, or C.

Diagnostic systems

The medical profession uses k-means in creating smarter medical decision
support systems, especially in the treatment of liver ailments.

Search engines

Clustering forms a backbone of search engines. When a search is performed, the
search results need to be grouped, and the search engines very often use
clustering to do this.

Wireless sensor networks

The clustering algorithm plays the role of finding the cluster heads, each of
which collects all the data in its respective cluster.

Distance Measure

The distance measure determines the similarity between two elements and
influences the shape of clusters.
K-Means clustering supports various kinds of distance measures, such as:

1. Euclidean distance measure
2. Manhattan distance measure
3. Squared Euclidean distance measure
4. Cosine distance measure

Euclidean Distance Measure

The most common case is determining the distance between two points. If we
have a point P and a point Q, the Euclidean distance is the length of the
ordinary straight line between them. It is the distance between the two points
in Euclidean space.

Squared Euclidean Distance Measure

This is identical to the Euclidean distance measurement but does not take the
square root at the end.

Manhattan Distance Measure

The Manhattan distance is the simple sum of the horizontal and vertical
components, or the distance between two points measured along axes at right
angles.

Cosine Distance Measure

In this case, we take the angle between the two vectors formed by joining the
points from the origin.
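
The four measures can be written directly in NumPy. Cosine distance is expressed here as 1 minus the cosine of the angle, which is one common convention and an assumption, since the notes only mention taking the angle; the points p and q are made up.

```python
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def squared_euclidean(p, q):
    return np.sum((p - q) ** 2)

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def cosine_distance(p, q):
    # 1 - cos(angle between p and q); undefined for zero vectors.
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
print(cosine_distance(p, q))
```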

6. Explain the pros and cons of K means clustering.

Pros:

1. Simple: It is easy to implement k-means and identify unknown groups of data
from complex data sets. The results are presented in an easy and simple manner.

2. Flexible: The K-means algorithm can easily adjust to changes. If there are
any problems, adjusting the cluster segment allows changes to be made to the
algorithm easily.

3. Suitable in a large dataset: K-means is suitable for large datasets, where it
is computed much faster than methods such as hierarchical clustering. It can
also produce a larger number of clusters.

4. Efficient: The algorithm is good at segmenting large data sets. Its
efficiency depends on the shape of the clusters. K-means works well with
hyperspherical clusters.

5. Time complexity: K-means segmentation is linear in the number of data
objects, which keeps execution time low. Unlike hierarchical algorithms, it
doesn’t take more time to classify similar characteristics in data.

6. Tight clusters: Compared to hierarchical algorithms, k-means produces tighter
clusters, especially with globular clusters.

7. Easy to interpret: The results are easy to interpret. It generates cluster
descriptions in a form minimized to ease understanding of the data.

8. Computation cost: Compared to other clustering methods, the k-means
clustering technique is fast and efficient in terms of its computational cost,
O(K*n*d) per iteration for K clusters, n data points and d dimensions.

9. Accuracy: K-means analysis improves clustering accuracy and ensures
information about a particular problem domain is available. Modification of the
k-means algorithm based on this information improves the accuracy of the
clusters.

10. Spherical clusters: This mode of clustering works well when dealing with
spherical clusters. It operates with an assumption about the joint distribution
of features, since each cluster is spherical: all the cluster features have
equal variance and are independent of each other.

Cons:

1. No optimal set of clusters: K-means doesn’t guarantee an optimal set of
clusters, and for effective results you should decide on the number of clusters
beforehand.

2. Lacks consistency: K-means clustering gives varying results on different runs
of the algorithm. A random choice of initial cluster patterns yields different
clustering results, resulting in inconsistency.

3. Uniform effect: It produces clusters of uniform size even when the groups in
the input data have different sizes.

4. Order of values: The way in which data is ordered in building the algorithm
affects the final results of the data set.

5. Sensitivity to scale: Changing or rescaling the dataset, either through
normalization or standardization, can substantially change the final results.

6. Crash computer: When dealing with a large dataset, using a dendrogram
technique (for example to help choose the number of clusters) can crash the
computer due to the computational load and RAM limits.

7. Handle numerical data: The K-means algorithm can be performed on numerical
data only.

8. Operates in assumption: The K-means clustering technique assumes that we deal
with spherical clusters and that each cluster has an equal number of
observations. The spherical assumptions have to be satisfied. The algorithm
doesn’t work well with clusters of unusual size.

9. Specify K-values: For K-means clustering to be effective, you have to specify
the number of clusters (K) at the beginning of the algorithm.

10. Prediction issues: It is difficult to predict the k-values or the number of
clusters. It is also difficult to compare the quality of the produced clusters.

MODULE-3

1. What is the idea behind autoencoders? Explain with the general structure

An autoencoder is a neural network that is trained to attempt to copy its input to its
output. Internally, it has a hidden layer h that describes a code used to represent the
input. The network may be viewed as consisting of two parts: an encoder function
h = f(x) and a decoder that produces a reconstruction r = g(h).

The autoencoder's objective is to minimize the reconstruction error between the input
and the output. This helps autoencoders to learn important features present in the data.
When a representation allows a good reconstruction of its input, it has retained much of
the information present in the input.

Modern autoencoders have generalized the idea of an encoder and a decoder beyond
deterministic functions to stochastic mappings p_encoder(h | x) and p_decoder(x | h).

The idea of autoencoders has been part of the historical landscape of neural networks for
decades. Traditionally, autoencoders were used for dimensionality reduction or feature
learning. Recently, theoretical connections between autoencoders and latent variable
models have brought autoencoders to the forefront of generative modeling. Autoencoders
may be thought of as being a special case of feedforward networks and may be trained
with all the same techniques, typically mini-batch gradient descent following gradients
computed by back-propagation. Unlike general feedforward networks, autoencoders may
also be trained using recirculation, a learning algorithm based on comparing the
activations of the network on the original input to the activations on the reconstructed
input. Recirculation is regarded as more biologically plausible than back-propagation but
is rarely used for machine learning applications.
In its general structure, an autoencoder maps an input x to an output (called the
reconstruction) r through an internal representation or code h. The autoencoder has two
components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
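
The encoder/decoder structure can be illustrated with a tiny linear autoencoder trained by gradient descent. Everything specific here (linear layers without biases, the toy data, code size 2, learning rate and step count) is an assumption for the sketch rather than something stated in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 5 dimensions that really vary along only 2 directions.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5))
X = X / X.std()                                      # rescale for stable training

n_in, n_code = 5, 2
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))   # encoder f: h = x W_enc
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))   # decoder g: r = h W_dec

def reconstruction_loss(X, W_enc, W_dec):
    R = (X @ W_enc) @ W_dec                          # r = g(f(x))
    return np.mean(np.sum((R - X) ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(2000):
    H = X @ W_enc                                    # codes
    R = H @ W_dec                                    # reconstructions
    err = 2.0 * (R - X) / len(X)                     # d(loss)/dR
    grad_dec = H.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Because the data here lies in a 2-dimensional subspace, a code of size 2 is enough for the reconstruction error to fall sharply.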

2. Explain the procedure of training of autoencoders.

An important characterization of a manifold is the set of its tangent planes. At a point x
on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the
local directions of variation allowed on the manifold. These local directions specify how
one can change x infinitesimally while staying on the manifold.

All autoencoder training procedures involve a compromise between two forces:

1. Learning a representation h of a training example x such that x can be
approximately recovered from h through a decoder. The fact that x is drawn from
the training data is crucial, because it means the autoencoder need not
successfully reconstruct inputs that are not probable under the data-generating
distribution.
2. Satisfying the constraint or regularization penalty. This can be an architectural
constraint that limits the capacity of the autoencoder, or it can be a regularization
term added to the reconstruction cost. These techniques generally prefer solutions
that are less sensitive to the input.
An important principle is that the autoencoder can afford to represent only the variations
that are needed to reconstruct training examples. If the data-generating distribution
concentrates near a low-dimensional manifold, this yields representations that implicitly
capture a local coordinate system for this manifold: only the variations tangent to the
manifold around x need to correspond to changes in h = f(x). Hence the encoder learns a
mapping from the input space x to a representation space, a mapping that is only sensitive
to changes along the manifold directions, but that is insensitive to changes orthogonal to
the manifold.

3. What are the different types of autoencoders? Explain briefly.

The different types of autoencoders are:
1. Undercomplete autoencoders: An autoencoder whose code dimension is less
than the input dimension is called undercomplete. Undercomplete autoencoders
have a smaller dimension for the hidden layer than for the input layer. This helps
to obtain important features from the data. When the decoder is linear and we use a
mean squared error loss function, an undercomplete autoencoder generates a
reduced feature space similar to PCA. Undercomplete autoencoders do not need
any regularization, as they maximize the probability of the data rather than copying
the input to the output.
2. Overcomplete autoencoders: This is when the encoding output's dimension is
larger than the input's dimension. The same problem arises if the hidden code is
allowed to have dimension equal to the input, and in the overcomplete case in
which the hidden code has dimension greater than the input: even a linear encoder
and a linear decoder can learn to copy the input to the output without learning
anything useful about the data distribution.
3. Sparse autoencoders: Sparse autoencoders can have more hidden nodes than input
nodes and still discover important features from the data. A sparsity constraint
is introduced on the hidden layer. This prevents the output layer from simply
copying the input data. Sparse autoencoders are typically used to learn features for
another task, such as classification. An autoencoder that has been regularized to be
sparse must respond to unique statistical features of the dataset it has been trained
on, rather than simply acting as an identity function. In this way, training to
perform the copying task with a sparsity penalty can yield a model that has learned
useful features as a byproduct. Sparse autoencoders take the highest activation
values in the hidden layer and zero out the rest of the hidden nodes. This prevents
the autoencoder from using all of the hidden nodes at a time and forces only a
reduced number of hidden nodes to be used.
4. Denoising autoencoders: Denoising refers to intentionally adding noise to the
raw input before providing it to the network. Denoising can be achieved using a
stochastic mapping. Denoising autoencoders create a corrupted copy of the input
by introducing some noise. This keeps the autoencoder from copying the input
to the output without learning features about the data. Corruption of the input can
be done randomly by setting some of the input entries to zero; the remaining
entries are copied unchanged into the noised input. Denoising autoencoders must
remove the corruption to generate an output that is similar to the input. The output
is compared with the input and not with the noised input. Training continues,
minimizing the loss function, until convergence.
5. Contractive autoencoders: The objective of a contractive autoencoder (CAE) is to
have a robust learned representation which is less sensitive to small variations in
the data. Robustness of the representation is achieved by applying a penalty
term to the loss function. The penalty term is the Frobenius norm of the Jacobian
matrix of the hidden layer, calculated with respect to the input. The squared
Frobenius norm of the Jacobian matrix is the sum of the squares of all its elements.
The contractive autoencoder is another regularization technique, like sparse
autoencoders and denoising autoencoders. CAE has been reported to surpass results
obtained by regularizing an autoencoder using weight decay or by denoising, and
to be a better choice than the denoising autoencoder for learning useful feature
extraction.
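
The corruption step of a denoising autoencoder, and the fact that the loss is measured against the clean input, can be sketched as follows. The function names, the masking-noise choice and the `reconstruct` callable are hypothetical and stand in for whatever network is being trained.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Masking noise: set a random subset of input entries to zero."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def denoising_loss(x, reconstruct, drop_prob, rng):
    """Mean squared error between the reconstruction and the clean input."""
    x_noisy = corrupt(x, drop_prob, rng)
    r = reconstruct(x_noisy)          # the network sees only the noisy input
    return np.mean((r - x) ** 2)      # but is scored against the clean input

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
identity = lambda v: v                # stand-in "network" that copies its input
print(denoising_loss(x, identity, drop_prob=0.3, rng=rng))
```

With corruption switched on, simply copying the input no longer gives zero loss, which is the point of the technique.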

4. Differentiate between undercomplete and overcomplete autoencoders.

Undercomplete autoencoder: this is when our encoding output's dimension is smaller than
our input's dimension.

One way to obtain useful features from the autoencoder is to constrain h to have a smaller
dimension than x. An autoencoder whose code dimension is less than the input dimension
is called undercomplete. Learning an undercomplete representation forces the
autoencoder to capture the most salient features of the training data. The learning process
is described simply as minimizing a loss function L(x, g(f(x))), where L is a loss function
penalizing g(f(x)) for being dissimilar from x, such as the mean squared error. When the
decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to
span the same subspace as PCA. In this case, an autoencoder trained to perform the
copying task has learned the principal subspace of the training data as a side effect.
Undercomplete autoencoders, with code dimension less than the input dimension, can
learn the most salient features of the data distribution. However, these autoencoders fail
to learn anything useful if the encoder and decoder are given too much capacity.

Overcomplete autoencoder: this is when our encoding output's dimension is larger than
our input's dimension. A similar problem occurs if the hidden code is allowed to have
dimension equal to the input, and in the overcomplete case in which the hidden code has
dimension greater than the input. In these cases, even a linear encoder and a linear
decoder can learn to copy the input to the output without learning anything useful about
the data distribution.

5. List various applications of autoencoders.

● Autoencoders have been successfully applied to dimensionality reduction and
information retrieval tasks. Dimensionality reduction was one of the first applications of
representation learning and deep learning. It was one of the early motivations for studying
autoencoders.
● Lower-dimensional representations can improve performance on many tasks, such as
classification. Models of smaller spaces consume less memory and runtime.
● The hints provided by the mapping to the lower-dimensional space aid generalization.
● Information retrieval: the task of finding entries in a database that resemble a query
entry. It gains the additional benefit that search can become extremely efficient in
certain kinds of low-dimensional spaces.
● If we train the dimensionality reduction algorithm to produce a code that is
low-dimensional and binary, then we can store all database entries in a hash table that maps
binary code vectors to entries. This hash table allows us to perform information retrieval
by returning all database entries that have the same binary code as the query.
● We can also search over slightly less similar entries very efficiently, just by flipping
individual bits from the encoding of the query. This approach to information retrieval via
dimensionality reduction and binarization is called semantic hashing and has been
applied to both textual input and images.
● To produce binary codes for semantic hashing, one typically uses an encoding function
with sigmoids on the final layer. The sigmoid units must be trained to be saturated to
nearly 0 or nearly 1 for all input values. One trick that can accomplish this is simply to
inject additive noise just before the sigmoid nonlinearity during training. The magnitude
of the noise should increase over time. To fight that noise and preserve as much
information as possible, the network must increase the magnitude of the inputs to the
sigmoid function, until saturation occurs.
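
The hash-table lookup and bit-flipping search behind semantic hashing can be sketched in plain Python. The document ids and 4-bit codes below are hypothetical, standing in for codes obtained by thresholding an encoder's sigmoid outputs.

```python
from collections import defaultdict
from itertools import combinations

def build_index(codes):
    """Map each binary code (a tuple of 0/1 bits) to the entries that share it."""
    index = defaultdict(list)
    for entry_id, code in codes.items():
        index[tuple(code)].append(entry_id)
    return index

def lookup(index, query_code, max_flips=1):
    """Return entries whose code differs from the query in at most max_flips bits."""
    query_code = tuple(query_code)
    results = list(index.get(query_code, []))          # exact matches first
    for n_flips in range(1, max_flips + 1):
        for positions in combinations(range(len(query_code)), n_flips):
            flipped = list(query_code)
            for p in positions:
                flipped[p] = 1 - flipped[p]            # flip one bit
            results.extend(index.get(tuple(flipped), []))
    return results

# Hypothetical codes, as if produced by a trained encoder.
codes = {
    "doc_a": (1, 0, 1, 1),
    "doc_b": (1, 0, 1, 1),
    "doc_c": (1, 0, 0, 1),
    "doc_d": (0, 1, 0, 0),
}
index = build_index(codes)
print(lookup(index, (1, 0, 1, 1), max_flips=0))   # ['doc_a', 'doc_b']
print(lookup(index, (1, 0, 1, 1), max_flips=1))   # ['doc_a', 'doc_b', 'doc_c']
```

Each lookup costs a handful of dictionary accesses regardless of how many entries are stored, which is the efficiency gain the list describes.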

6. What is CAE? Discuss the prominence of CAE.


The contractive autoencoder (CAE) introduces an explicit regularizer on the code h = f(x),
encouraging the derivatives of f to be as small as possible:

\Omega(h) = \lambda \left\| \frac{\partial f(x)}{\partial x} \right\|_F^2

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the
Jacobian matrix of partial derivatives associated with the encoder function. There is a
connection between the denoising autoencoder and the contractive autoencoder: Alain
and Bengio showed that in the limit of small Gaussian input noise, the denoising
reconstruction error is equivalent to a contractive penalty on the reconstruction function
that maps x to r = g(f(x)).

In other words, denoising autoencoders make the reconstruction function resist small
but finite-sized perturbations of the input, while contractive autoencoders make the feature
extraction function resist infinitesimal perturbations of the input. When using the
Jacobian-based contractive penalty to pretrain features f(x) for use with a classifier, the
best classification accuracy usually results from applying the contractive penalty to f(x)
rather than to g(f(x)). The name contractive arises from the way that the CAE warps space.
Specifically, because the CAE is trained to resist perturbations of its input, it is
encouraged to map a neighborhood of input points to a smaller neighborhood of output
points. We can think of this as contracting the input neighborhood to a smaller output
neighborhood.
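
As a rough illustration of the contractive penalty, the sketch below computes the squared Frobenius norm of the encoder Jacobian for a single-layer sigmoid encoder and checks it against a finite-difference estimate. The sigmoid encoder, the layer sizes and the random weights are assumptions chosen for the example; the notes do not fix a particular encoder. In training, this value would be multiplied by a weighting coefficient and added to the reconstruction loss.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def encoder(x, weights, bias):
    """Hypothetical encoder f(x) = sigmoid(W x + b)."""
    return sigmoid(weights @ x + bias)


def contractive_penalty(x, weights, bias):
    """Squared Frobenius norm of the Jacobian df/dx, using the sigmoid derivative h * (1 - h)."""
    h = encoder(x, weights, bias)
    jacobian = (h * (1.0 - h))[:, None] * weights  # shape (n_hidden, n_input)
    return float(np.sum(jacobian ** 2))


def numerical_penalty(x, weights, bias, eps=1e-6):
    """Same quantity from central finite differences, used only as a sanity check."""
    jacobian = np.zeros((weights.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jacobian[:, i] = (encoder(x + step, weights, bias)
                          - encoder(x - step, weights, bias)) / (2.0 * eps)
    return float(np.sum(jacobian ** 2))


rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 5))   # 5 inputs, 3 code units (arbitrary sizes)
bias = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(x, weights, bias)
numeric = numerical_penalty(x, weights, bias)
```

The two values should agree closely, and the penalty shrinks as the sigmoid units saturate, which is how the regularizer makes the code insensitive to small changes in x.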

The contractive autoencoder is another regularization technique, like sparse autoencoders and
denoising autoencoders. The CAE has been reported to surpass results obtained by regularizing
an autoencoder using weight decay or by denoising, which makes it a strong alternative to the
denoising autoencoder for learning useful feature extractors. The penalty term produces a
mapping that strongly contracts the data, hence the name contractive autoencoder.

7. What is DAE? Explain its training process.


The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point
as input and is trained to predict the original, uncorrupted data point as its output. We
introduce a corruption process C(x̃ | x), which represents a conditional distribution over
corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction
distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃) as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x = x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction
distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃)
and p_decoder typically defined by a decoder g(h).

In a denoising autoencoder, noise is intentionally added to the raw input before it is given
to the network. The corruption can be implemented as a stochastic mapping: the autoencoder
creates a corrupted copy of the input by introducing some noise. This prevents the
autoencoder from simply copying the input to the output without learning features of the data.

Corruption of the input can be done randomly by setting some of the inputs to zero.
The remaining nodes copy the input to the noised input. Denoising autoencoders must remove
the corruption to generate an output that is similar to the input. The output is compared with
the original input and not with the noised input. Training continues, minimizing the loss
function, until convergence.

Denoising autoencoders therefore minimize the loss between the output and the original,
uncorrupted input. Denoising helps the autoencoder learn the latent representation
present in the data. A good representation is one that can
be derived robustly from a corrupted input and that will be useful for recovering the
corresponding clean input.
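
The training procedure above (corrupt the input, reconstruct, compare with the clean input) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: masking noise as the corruption process, a sigmoid encoder with a linear decoder, mean squared error as the loss, and random toy data. None of these choices are prescribed by the notes.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def corrupt(x, drop_probability, rng):
    """Masking noise: each input value is set to zero with probability drop_probability."""
    mask = rng.random(x.shape) >= drop_probability
    return x * mask


def train_step(x, params, rng, drop_probability=0.3, learning_rate=0.5):
    """One gradient step; the loss compares the output with the CLEAN input x."""
    x_tilde = corrupt(x, drop_probability, rng)
    hidden = sigmoid(x_tilde @ params["w_enc"] + params["b_enc"])
    output = hidden @ params["w_dec"] + params["b_dec"]

    # Backpropagate the mean squared error between output and clean input.
    d_output = 2.0 * (output - x) / x.size
    d_hidden = d_output @ params["w_dec"].T
    d_pre = d_hidden * hidden * (1.0 - hidden)

    params["w_dec"] -= learning_rate * hidden.T @ d_output
    params["b_dec"] -= learning_rate * d_output.sum(axis=0)
    params["w_enc"] -= learning_rate * x_tilde.T @ d_pre
    params["b_enc"] -= learning_rate * d_pre.sum(axis=0)
    return float(np.mean((output - x) ** 2))


rng = np.random.default_rng(0)
data = rng.random((64, 8))  # toy data: 64 examples with 8 features (assumption)
params = {
    "w_enc": rng.normal(scale=0.1, size=(8, 4)),
    "b_enc": np.zeros(4),
    "w_dec": rng.normal(scale=0.1, size=(4, 8)),
    "b_dec": np.zeros(8),
}
losses = [train_step(data, params, rng) for _ in range(300)]
```

A fresh corrupted copy is sampled at every step, so the network never sees the same noise pattern twice and cannot succeed by memorizing the corruption.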
MODULE-4
1. What is a Boltzmann Machine? Explain with its Testing and Training Algorithm.
● A Boltzmann Machine is a network of symmetrically connected, neuron-like units that
make stochastic decisions about whether to be on or off. Boltzmann machines have a
simple learning algorithm that allows them to discover interesting features in datasets
composed of binary vectors. The learning algorithm is very slow in networks with many
layers of feature detectors, but it can be made much faster by learning one layer of feature
detectors at a time.
● Boltzmann machines are used to solve two quite different computational problems.
● For a search problem, the weights on the connections are fixed and are used to represent
the cost function of an optimization problem. The stochastic dynamics of a Boltzmann
machine then allow it to sample binary state vectors that represent good solutions to the
optimization problem.
● For a learning problem, the Boltzmann machine is shown a set of binary data vectors and
it must find weights on the connections so that the data vectors are good solutions to the
optimization problem defined by those weights. To solve a learning problem, Boltzmann
machines make many small updates to their weights, and each update requires them to
solve many different search problems.
● The following diagram shows the architecture of a Boltzmann machine. It is clear
from the diagram that it is a two-dimensional array of units. Here, weights on
interconnections between units are –p where p > 0. The weights of
self-connections are given by b where b > 0.
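
The stochastic on/off decisions described above can be sketched as a Gibbs sweep over the units. This is a minimal sketch assuming the common textbook formulation (symmetric weights with a zero diagonal, a bias per unit, and a unit switching on with probability given by the logistic function of its energy gap); the weights here are random, not learned, and the function names are hypothetical.

```python
import numpy as np


def energy(state, weights, biases):
    """E(s) = -1/2 * s^T W s - b^T s, for symmetric W with zero diagonal."""
    return float(-0.5 * state @ weights @ state - biases @ state)


def gibbs_sweep(state, weights, biases, rng, temperature=1.0):
    """Visit every unit once in random order and resample its binary state."""
    state = state.copy()
    for i in rng.permutation(len(state)):
        energy_gap = weights[i] @ state + biases[i]  # energy saved by turning unit i on
        p_on = 1.0 / (1.0 + np.exp(-energy_gap / temperature))
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state


rng = np.random.default_rng(0)
n_units = 5
raw = rng.normal(size=(n_units, n_units))
weights = (raw + raw.T) / 2.0      # symmetric connections
np.fill_diagonal(weights, 0.0)     # no self-connections in this sketch
biases = rng.normal(size=n_units)

state = rng.integers(0, 2, size=n_units).astype(float)
for _ in range(100):
    state = gibbs_sweep(state, weights, biases, rng)
final_energy = energy(state, weights, biases)
```

Run for long enough with fixed weights, sweeps like this visit low-energy states more often than high-energy ones, which is what lets the machine sample good solutions to a search problem.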


2. What is Restricted Boltzmann machine? Explain its working in detail.

RBMs are two-layered artificial neural networks with generative capabilities. They have the
ability to learn a probability distribution over their set of inputs. RBMs were first proposed by
Paul Smolensky (under the name Harmonium) and popularized by Geoffrey Hinton, and can be
used for dimensionality reduction, classification, regression,
collaborative filtering, feature learning, and topic modeling. RBMs are a special class
of Boltzmann Machines and they are restricted in terms of the connections between the visible
and the hidden units. This makes them easier to implement than general Boltzmann
Machines. As stated earlier, they are two-layered neural networks (one being the visible layer
and the other one being the hidden layer) and these two layers are connected by a fully
connected bipartite graph. This means that every node in the visible layer is connected to every
node in the hidden layer but no two nodes in the same group are connected to each other. This
restriction allows for more efficient training algorithms than what is available for the general
class of Boltzmann machines, in particular, the gradient-based contrastive divergence
algorithm.

Working of Restricted Boltzmann Machine

Each visible node takes a low-level feature from an item in the dataset to be learned. At node
1 of the hidden layer, x is multiplied by a weight and added to a bias. The result of those two
operations is fed into an activation function, which produces the node’s output, or the
strength of the signal passing through it, given input x.
Next, let’s look at how several inputs would combine at one hidden node. Each x is
multiplied by a separate weight, the products are summed, added to a bias, and again the result
is passed through an activation function to produce the node’s output.

At each hidden node, each input x is multiplied by its respective weight w. That is, a single
input x would have three weights here, making 12 weights altogether (4 input nodes x 3 hidden
nodes). The weights between the two layers will always form a matrix with one row per input
node and one column per output node.
Each hidden node receives the four inputs multiplied by their respective weights. The sum of
those products is again added to a bias (which forces at least some activations to happen), and
the result is passed through the activation function, producing one output for each hidden node.
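
The 4-input, 3-hidden-node example above can be written as a single matrix product. This is a small sketch assuming a sigmoid activation and random initial weights; the input values are made up.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(0)

visible = np.array([1.0, 0.0, 1.0, 1.0])         # four input values (made up)
weights = rng.normal(scale=0.1, size=(4, 3))     # rows = input nodes, columns = hidden nodes
hidden_bias = np.zeros(3)

# Each hidden node sums its four weighted inputs, adds its bias and applies the activation.
hidden_output = sigmoid(visible @ weights + hidden_bias)
```

`weights.size` is 12, matching the count in the text, and `hidden_output` holds one value per hidden node.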


3. Explain the training process of Restricted Boltzmann Machine?

The training of the Restricted Boltzmann Machine differs from the training of regular neural
networks via stochastic gradient descent.

The two main training steps are:

● Gibbs Sampling

The first part of the training is called Gibbs sampling. Given an input vector v, we use p(h|v)
to predict the hidden values h. Knowing the hidden values, we use p(v|h) to predict new
input values v. For binary units these conditionals are

p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)

where σ is the logistic sigmoid, a_i are the visible biases and b_j are the hidden biases.
This process is repeated k times. After k iterations, we
obtain another input vector v_k which was recreated from the original input values v_0.

● Contrastive Divergence step

The update of the weight matrix happens during the Contrastive Divergence step.
Vectors v_0 and v_k are used to calculate the activation probabilities for hidden
values h_0 and h_k, that is, p(h_0 = 1 | v_0) and p(h_k = 1 | v_k).

The difference between the outer products of those probabilities with input
vectors v_0 and v_k results in the update matrix:

\Delta W = v_0 \otimes p(h_0 = 1 \mid v_0) - v_k \otimes p(h_k = 1 \mid v_k)

Using the update matrix, the new weights can be calculated with gradient ascent, given by:

W_{new} = W_{old} + \Delta W
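
The two training steps (k rounds of Gibbs sampling followed by the contrastive divergence update) can be sketched in NumPy as below. The sketch assumes binary units with sigmoid conditionals, and it scales the update by a learning rate, which is an added assumption; the layer sizes and the input vector are made up.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cd_k_update(v0, weights, visible_bias, hidden_bias, rng, k=1):
    """Return the CD-k update matrix for one binary input vector v0."""
    p_h0 = sigmoid(v0 @ weights + hidden_bias)            # p(h_0 = 1 | v_0)
    vk, p_hk = v0, p_h0
    for _ in range(k):
        h = (rng.random(p_hk.shape) < p_hk).astype(float)  # sample hidden states
        p_v = sigmoid(h @ weights.T + visible_bias)        # p(v = 1 | h)
        vk = (rng.random(p_v.shape) < p_v).astype(float)   # sample reconstructed input
        p_hk = sigmoid(vk @ weights + hidden_bias)         # p(h_k = 1 | v_k)
    # Difference of outer products: positive phase minus negative phase.
    return np.outer(v0, p_h0) - np.outer(vk, p_hk)


rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
weights = rng.normal(scale=0.1, size=(n_visible, n_hidden))
visible_bias = np.zeros(n_visible)
hidden_bias = np.zeros(n_hidden)

v0 = np.array([1.0, 0.0, 1.0, 1.0])
learning_rate = 0.1  # assumption: the update is usually scaled by a small step size
delta_w = cd_k_update(v0, weights, visible_bias, hidden_bias, rng, k=2)
new_weights = weights + learning_rate * delta_w  # gradient ascent step
```

In practice the update is averaged over a mini-batch of input vectors, and the biases are updated with the analogous differences.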


4. Explain Deep Boltzmann Machine. How does it differ from Deep Belief Network?
Deep Boltzmann Machine

● Unsupervised, probabilistic, generative model with entirely undirected connections between
different layers

● Contains visible units and multiple layers of hidden units

● Like an RBM, a DBM has no intralayer connections. Connections exist only between units
of neighboring layers

● Network of symmetrically connected stochastic binary units

● A DBM can be organized as a bipartite graph with odd layers on one side and even layers on
the other side

● Units within a layer are independent of each other but are dependent on neighboring layers

● Learning is made efficient by greedy layer-by-layer pre-training, which is
slightly different from the pre-training done in a DBN

● After learning the binary features in each layer, the DBM is fine-tuned by backpropagation.

Difference between Deep Belief Networks (DBN) and Deep Boltzmann Machines (DBM)

● A Deep Belief Network (DBN) has undirected connections between its top two layers, and its
lower layers have directed connections

● A Deep Boltzmann Machine (DBM) has entirely undirected connections.

● The approximate inference procedure for a DBM uses top-down feedback in addition to the usual
bottom-up pass, allowing Deep Boltzmann Machines to better incorporate uncertainty about
ambiguous inputs.

● A disadvantage of the DBM is that approximate inference based on the mean-field approach is
slower than the single bottom-up pass used in Deep Belief Networks. Mean-field inference needs
to be performed for every new test input.
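
To make the contrast concrete, here is a minimal sketch of mean-field inference in a DBM with two hidden layers, set against a single bottom-up pass. The layer sizes, random weights, zero biases and the number of iterations are all assumptions for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def bottom_up_pass(visible, w1, w2, b1, b2):
    """Single feed-forward pass, as used for inference in a DBN."""
    mu1 = sigmoid(visible @ w1 + b1)
    mu2 = sigmoid(mu1 @ w2 + b2)
    return mu1, mu2


def mean_field_inference(visible, w1, w2, b1, b2, n_iterations=10):
    """Iterate the mean-field updates; layer 1 also receives top-down input from layer 2."""
    mu1, mu2 = bottom_up_pass(visible, w1, w2, b1, b2)  # initial guess
    for _ in range(n_iterations):
        mu1 = sigmoid(visible @ w1 + mu2 @ w2.T + b1)   # bottom-up plus top-down feedback
        mu2 = sigmoid(mu1 @ w2 + b2)
    return mu1, mu2


rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.5, size=(6, 4))  # visible -> hidden layer 1
w2 = rng.normal(scale=0.5, size=(4, 3))  # hidden layer 1 -> hidden layer 2
b1, b2 = np.zeros(4), np.zeros(3)
visible = rng.integers(0, 2, size=6).astype(float)

dbn_mu1, dbn_mu2 = bottom_up_pass(visible, w1, w2, b1, b2)
dbm_mu1, dbm_mu2 = mean_field_inference(visible, w1, w2, b1, b2)
```

The loop is what makes DBM inference more expensive: it has to be rerun for each new input, whereas the bottom-up pass is a single sweep through the layers.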

5. Explain Deep Belief Networks along with its training algorithm.

Deep belief nets are probabilistic generative models that are composed of multiple layers of
stochastic, latent variables. The latent variables typically have binary values and are often
called hidden units or feature detectors. The top two layers have undirected, symmetric
connections between them and form an associative memory. The lower layers receive top-down,
directed connections from the layer above. The states of the units in the lowest layer represent a
data vector.
The two most significant properties of deep belief nets are:

● There is an efficient, layer-by-layer procedure for learning the top-down, generative weights
that determine how the variables in one layer depend on the variables in the layer above.
● After learning, the values of the latent variables in every layer can be inferred by a single,
bottom-up pass that starts with an observed data vector in the bottom layer and uses the
generative weights in the reverse direction.
Deep belief nets are learned one layer at a time by treating the values of the latent variables in
one layer, when they are being inferred from data, as the data for training the next layer. This
efficient, greedy learning can be followed by, or combined with, other learning procedures that
fine-tune all of the weights to improve the generative or discriminative performance of the whole
network.

Discriminative fine-tuning can be performed by adding a final layer of variables that represent
the desired outputs and backpropagating error derivatives. When networks with many hidden
layers are applied to highly-structured input data, such as images, backpropagation works much
better if the feature detectors in the hidden layers are initialized by learning a deep belief net that
models the structure in the input data.
The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as
the building blocks for each layer. The process is as follows:

1. Train the first layer as an RBM that models the raw input x = h^(0) as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second
layer. Two common solutions exist. This representation can be chosen as being the mean
activations p(h^(1) = 1 | h^(0)) or samples of p(h^(1) | h^(0)).
3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as
training examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples
or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-
likelihood, or with respect to a supervised training criterion (after adding extra learning machinery
to convert the learned representation into supervised predictions, e.g. a linear classifier).
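
Steps 1 to 4 of the process above can be sketched as a loop that trains one RBM per layer and passes the mean activations upward. This is a simplified sketch: it uses CD-1 on the full batch, probabilities instead of samples in the reconstruction, toy binary data and arbitrary layer sizes, all of which are assumptions. The fine-tuning in step 5 is not shown.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_rbm(data, n_hidden, rng, epochs=20, learning_rate=0.1):
    """Train one RBM with CD-1 on the whole batch; returns weights and hidden biases."""
    n_samples, n_visible = data.shape
    weights = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    visible_bias = np.zeros(n_visible)
    hidden_bias = np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ weights + hidden_bias)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ weights.T + visible_bias)   # reconstruction (probabilities)
        p_h1 = sigmoid(p_v1 @ weights + hidden_bias)
        weights += learning_rate * (data.T @ p_h0 - p_v1.T @ p_h1) / n_samples
        visible_bias += learning_rate * (data - p_v1).mean(axis=0)
        hidden_bias += learning_rate * (p_h0 - p_h1).mean(axis=0)
    return weights, hidden_bias


def pretrain_dbn(data, layer_sizes, rng):
    """Greedy layer-wise pre-training: each RBM is trained on the layer below's output."""
    layers, representation = [], data
    for n_hidden in layer_sizes:
        weights, hidden_bias = train_rbm(representation, n_hidden, rng)
        layers.append((weights, hidden_bias))
        # Propagate mean activations upward as training data for the next RBM.
        representation = sigmoid(representation @ weights + hidden_bias)
    return layers, representation


rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(20, 6)).astype(float)  # toy binary data (assumption)
layers, top_representation = pretrain_dbn(data, layer_sizes=[4, 2], rng=rng)
```

Swapping the mean activations for binary samples in `pretrain_dbn` gives the other of the two common choices mentioned in step 2.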

6. What are Energy-based Models? Explain Implicit Generation and Generalization
Methods for Energy-Based Models.

The main purpose of statistical modeling and machine learning is to encode
dependencies between variables. By capturing those dependencies, a model can be used
to answer questions about the values of unknown variables given the values of known
variables. Energy-Based Models (EBMs) capture dependencies by associating a scalar
energy (a measure of compatibility) to each configuration of the variables. Inference,
i.e., making a prediction or decision, consists in setting the value of observed variables
and finding values of the remaining variables that minimize the energy. Learning
consists in finding an energy function that associates low energies to correct values of
the remaining variables, and higher energies to incorrect values. A loss functional,
minimized during learning, is used to measure the quality of the available energy
functions. Within this common inference/learning framework, the wide choice of
energy functions and loss functionals allows for the design of many types of statistical
models, both probabilistic and non-probabilistic. Energy-based learning provides a
unified framework for many probabilistic and non-probabilistic approaches to learning,
particularly for non-probabilistic training of graphical models and other structured
models. Energy-based learning can be seen as an alternative to probabilistic estimation
for prediction, classification, or decision-making tasks. Because there is no requirement
for proper normalization, energy-based approaches avoid the problems associated with
estimating the normalization constant in probabilistic models. Furthermore, the
absence of the normalization condition allows for much more flexibility in the design of
learning machines. Most probabilistic models can be viewed as special types of energy-
based models in which the energy function satisfies certain normalizability conditions,
and in which the loss function, optimized by learning, has a particular form.

Generative modeling is the task of observing data, such as images or text, and learning
to model the underlying data distribution. Accomplishing this task leads models to
understand high level features in data and synthesize examples that look like real data.
Generative models have many applications in natural language, robotics, and
computer vision. Energy-based models represent probability distributions over data by
assigning an unnormalized probability scalar (or “energy”) to each input data point.
This provides useful modeling flexibility—any arbitrary model that outputs a real
number given an input can be used as an energy model. The difficulty, however, lies in
sampling from these models. To generate samples from EBMs, we use an iterative
refinement process based on Langevin dynamics. Informally, this involves performing
noisy gradient descent on the energy function to arrive at low-energy configurations.
Unlike GANs, VAEs, and Flow-based models, this approach does not require an explicit
neural network to generate samples; samples are generated implicitly. The
combination of EBMs and iterative refinement has the following benefits:

● Adaptive computation time. We can run sequential refinement for a long amount of time
to generate sharp, diverse samples or for a short amount of time for coarse, less diverse
samples. In the limit of infinite time, this procedure is known to generate true samples
from the energy model.
● Not restricted by a generator network. In both VAEs and Flow-based models, the
generator must learn a map from a continuous space to a possibly disconnected space
containing different data modes, which requires large capacity and may not be possible to
learn. EBMs, by contrast, can easily learn to assign low energies at disjoint regions.
● Built-in compositionality. Since each model represents an unnormalized probability
distribution, models can be naturally combined through product of experts or other
hierarchical models.
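
The iterative refinement based on Langevin dynamics can be sketched as noisy gradient descent on an energy function. The sketch below uses a hypothetical quadratic energy E(x) = 0.5 * x^2, whose samples should look like a standard normal distribution; a real EBM would use a neural network for the energy and obtain its gradient by automatic differentiation. The step size and number of steps are arbitrary choices.

```python
import numpy as np


def energy_gradient(x):
    """Gradient of the hypothetical energy E(x) = 0.5 * x**2."""
    return x


def langevin_sample(n_chains, n_steps, step_size, rng):
    """Run independent Langevin chains starting from uniform random points."""
    x = rng.uniform(-3.0, 3.0, size=n_chains)
    for _ in range(n_steps):
        noise = rng.normal(size=n_chains)
        # Gradient step towards low energy plus Gaussian noise.
        x = x - 0.5 * step_size * energy_gradient(x) + np.sqrt(step_size) * noise
    return x


rng = np.random.default_rng(0)
samples = langevin_sample(n_chains=2000, n_steps=1000, step_size=0.01, rng=rng)
sample_mean, sample_std = samples.mean(), samples.std()
```

Running fewer steps leaves the chains closer to their starting points, which corresponds to the trade-off between computation time and sample quality described above.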

Generation

Studies found energy-based models are able to generate qualitatively and quantitatively high-
quality images, especially when running the refinement process for a longer period at test time.
By running iterative optimization on individual images, we can auto-complete images and morph
images from one class (such as truck) to another (such as frog).

In addition to generating images, the same studies found that energy-based models are able to
generate stable robot dynamics trajectories across a large number of timesteps. EBMs can
generate a diverse set of possible futures, while feedforward models collapse to a mean prediction.
MODULE-5
1) What is a Generative Adversarial Network (GAN), and why were GANs developed?

Generative Adversarial Networks (GANs) are a powerful class of neural networks that
are used for unsupervised learning. They were introduced by Ian J.
Goodfellow and colleagues in 2014. GANs are basically made up of a system of two neural
network models which compete with each other and are able to analyze, capture and
copy the variations within a dataset.

GANs were developed after it was noticed that most mainstream neural networks
can be easily fooled into misclassifying things when only a small amount of noise is added
to the original data. After the noise is added, the model often has higher confidence in the
wrong prediction than it had when it predicted correctly. The reason for this adversarial
weakness is that most machine learning models learn from a limited amount of data, which is
a huge drawback, as it makes them prone to overfitting. Also, the mapping between the input
and the output is almost linear. It may seem that the boundaries of separation between the
various classes are linear, but in reality they are composed of linearities, and even a
small change in a point in the feature space might lead to misclassification of data.

2) How do GANs work?

Generative Adversarial Networks (GANs) can be broken down into three parts:
● Generative: To learn a generative model, which describes how data is generated
in terms of a probabilistic model.
● Adversarial: The training of a model is done in an adversarial setting.
● Networks: Deep neural networks are used as the artificial intelligence (AI) algorithms
for training purposes.
In GANs, there is a Generator and a Discriminator.
The Generator generates fake samples of data (be it an image, audio, etc.) and tries
to fool the Discriminator. The Discriminator tries to distinguish between the real
and fake samples. The Generator and the Discriminator are both neural networks
and they run in competition with each other in the training phase. The steps
are repeated several times, and the Generator and Discriminator get better and better at
their respective jobs after each repetition. The working can be visualized by the
diagram given.

The generative model captures the distribution of data and is trained in such a
manner that it tries to maximize the probability of the Discriminator making a mistake.
The Discriminator, on the other hand, is based on a model that estimates the probability
that the sample it received came from the training data and not from the Generator.
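
The competition described above is usually written as a two-player minimax game. The formula below follows the standard formulation of the GAN objective; the symbols (D for the Discriminator, G for the Generator, p_data for the data distribution and p_z for the noise distribution) are notation introduced here, not taken from these notes.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The Discriminator tries to make D(x) close to 1 on real samples and D(G(z)) close to 0 on generated ones, while the Generator tries to push D(G(z)) towards 1.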

Training a GAN has two parts:

● Part 1: The Discriminator is trained while the Generator is idle. In this phase, the
Generator is only forward propagated and no back-propagation is done through it. The
Discriminator is trained on real data for n epochs to see if it can correctly predict
them as real. In this phase, the Discriminator is also trained on the fake data
generated by the Generator to see if it can correctly predict them as fake.
● Part 2: The Generator is trained while the Discriminator is idle. After the
Discriminator has been trained on the fake data generated by the Generator, we can take its
predictions and use the results to train the Generator, so that it improves on its
previous state and tries to fool the Discriminator.

The above method is repeated for a few epochs, and then the fake data is checked manually
to see if it seems genuine. If it seems acceptable, the training is stopped; otherwise, it is
allowed to continue for a few more epochs.
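
The alternating two-part schedule can be sketched on a one-dimensional toy problem with hand-written gradients. Everything specific here is an assumption for illustration: the real data is drawn from a normal distribution with mean 4, the Generator is a linear map of the noise, the Discriminator is a logistic regression, and the Generator is trained to maximize log D(G(z)), which is a commonly used variant rather than anything stated in the notes.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_toy_gan(n_steps=200, batch_size=64, learning_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    gen_scale, gen_shift = 1.0, 0.0      # Generator: G(z) = gen_scale * z + gen_shift
    disc_weight, disc_bias = 0.0, 0.0    # Discriminator: D(x) = sigmoid(disc_weight * x + disc_bias)

    for _ in range(n_steps):
        # Part 1: train the Discriminator; the Generator is only run forward.
        real = rng.normal(4.0, 1.0, size=batch_size)
        noise = rng.normal(size=batch_size)
        fake = gen_scale * noise + gen_shift
        d_real = sigmoid(disc_weight * real + disc_bias)
        d_fake = sigmoid(disc_weight * fake + disc_bias)
        # Gradient ascent on log D(real) + log(1 - D(fake)).
        disc_weight += learning_rate * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
        disc_bias += learning_rate * (np.mean(1 - d_real) - np.mean(d_fake))

        # Part 2: train the Generator; the Discriminator is held fixed.
        noise = rng.normal(size=batch_size)
        fake = gen_scale * noise + gen_shift
        d_fake = sigmoid(disc_weight * fake + disc_bias)
        grad_wrt_fake = (1 - d_fake) * disc_weight   # d log D(fake) / d fake
        gen_scale += learning_rate * np.mean(grad_wrt_fake * noise)
        gen_shift += learning_rate * np.mean(grad_wrt_fake)

    return {"gen_scale": gen_scale, "gen_shift": gen_shift,
            "disc_weight": disc_weight, "disc_bias": disc_bias}


params = train_toy_gan()
first_step = train_toy_gan(n_steps=1)
```

After the very first round the Discriminator has learned that larger values look more real, so the Generator's shift moves in the positive direction, towards the real data. Later rounds can oscillate, which is typical of this kind of adversarial training.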
3) What are the different types of GANs?

Many different types of GAN have been implemented. Some of the important ones that
are actively used are described below:
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the
Discriminator are simple multi-layer perceptrons. In a vanilla GAN, the algorithm is
really simple: it tries to optimize the mathematical equation using stochastic
gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in
which some conditional parameters are put into place. In CGAN, an additional
parameter 'y' is added to the Generator for generating the corresponding data.
Labels are also fed into the input of the Discriminator to help it
distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the
most successful implementations of GAN. It is composed of ConvNets in place of
multi-layer perceptrons. The ConvNets are implemented without max pooling,
which is replaced by convolutional stride. Also, the layers are not fully
connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible
image representation consisting of a set of band-pass images, spaced an octave
apart, plus a low-frequency residual. This approach uses multiple
Generator and Discriminator networks at different levels of the Laplacian
pyramid. It is mainly used because it produces very high-quality
images. The image is first down-sampled at each layer of the pyramid and then
up-scaled again at each layer in a backward pass, where the image acquires
some noise from the Conditional GAN at these layers until it reaches its original
size.
5. Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of
designing a GAN in which a deep neural network is used along with an adversarial
network in order to produce higher-resolution images. This type of GAN is
particularly useful for optimally up-scaling native low-resolution images to enhance
their details while minimizing errors.
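
The "linear invertible image representation" behind LAPGAN can be illustrated with a small NumPy sketch that builds a Laplacian pyramid and reconstructs the image from it. The sketch is simplified on purpose: it uses 2x2 average pooling for down-sampling and pixel repetition for up-sampling, where real implementations typically use Gaussian filtering, and the 8x8 random image is made up. In a LAPGAN the band-pass images would be produced by the generators at each level rather than computed from a real image.

```python
import numpy as np


def downsample(image):
    """Halve the resolution by averaging each 2x2 block."""
    height, width = image.shape
    return image.reshape(height // 2, 2, width // 2, 2).mean(axis=(1, 3))


def upsample(image):
    """Double the resolution by repeating each pixel."""
    return image.repeat(2, axis=0).repeat(2, axis=1)


def build_laplacian_pyramid(image, levels):
    """Return the band-pass images (fine to coarse) and the low-frequency residual."""
    bands, current = [], image
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))  # detail lost by down-sampling
        current = smaller
    return bands, current


def reconstruct(bands, residual):
    """Invert the pyramid: up-scale and add back the detail at each level."""
    current = residual
    for band in reversed(bands):
        current = upsample(current) + band
    return current


rng = np.random.default_rng(0)
image = rng.random((8, 8))
bands, residual = build_laplacian_pyramid(image, levels=2)
restored = reconstruct(bands, residual)
```

Because each band stores exactly what down-sampling removed, `restored` matches `image` up to floating-point rounding.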

4) Describe the GAN architecture.

A generative adversarial network consists of two parts: a generator and a discriminator. The
generator model produces synthetic examples (e.g., images) from random noise
sampled from a distribution, which along with real examples from a training data set
are fed to the discriminator, which attempts to distinguish between the two. Both the
generator and discriminator improve in their respective abilities until the discriminator is
unable to tell the real examples from the synthesized examples with better than the 50%
accuracy expected of chance.

GANs train in an unsupervised fashion, meaning that they infer the patterns within data
sets without reference to known, labelled, or annotated outcomes. Interestingly, the
discriminator's work informs that of the generator: every time the discriminator correctly
identifies a synthesized work, it tells the generator how to tweak its output so that it
might be more realistic in the future.

GANs suffer from a number of shortcomings owing to their architecture. The
simultaneous training of generator and discriminator models is inherently unstable.
Sometimes the parameters (the configuration values internal to the model) oscillate or
destabilize, which isn't surprising given that after every parameter update, the nature of
the optimization problem being solved changes. Alternatively, the generator collapses, and it
begins to produce data samples that are largely homogeneous in appearance.
The generator and discriminator also run the risk of overpowering each other. If the
generator becomes too accurate, it'll exploit weaknesses in the discriminator that lead to
undesirable results, whereas if the discriminator becomes too accurate, it'll impede the
generator's progress toward convergence.

5) Explain GAN applications?

The applications are listed below:

1) Image and video synthesis- GANs are perhaps best known for their
contributions to image synthesis. StyleGAN, a model Nvidia developed,
has generated high-resolution head shots of fictional people by learning
attributes like facial pose, freckles, and hair. A newly released version,
StyleGAN 2, makes improvements with respect to both architecture and training
methods, redefining the state of the art in terms of perceived quality.

In June 2019, Microsoft researchers detailed ObjGAN, a novel GAN that could
understand captions, sketch layouts, and refine the details based on the wording. The
co-authors of a related study proposed a system, StoryGAN, that synthesizes
storyboards from paragraphs. GANs have been applied to the problems of
super-resolution and pose estimation (object transformation).

Tang says one of his teams used GANs to train a model to upscale 200-by-200-pixel
satellite imagery to 1,000 by 1,000 pixels, and to produce images that appear as though
they were captured from alternate angles.

Scientists at Carnegie Mellon last year demoed Recycle-GAN, a data-driven approach
for transferring the content of one video or photo to another. When trained on footage of
human subjects, the GAN generated clips that captured subtle expressions like dimples
and lines that formed when subjects smiled and moved their mouths.

More recently, researchers at Seoul-based Hyperconnect published Marionette, which
synthesizes a reenacted face animated by a person's movement while preserving the
face's appearance.

2) Video- Predicting future events from only a few video frames, a task
once considered impossible, is nearly within grasp thanks to
state-of-the-art approaches involving GANs and novel data sets.

One of the newest papers on the subject from DeepMind details recent advances in the
budding field of AI clip generation. Using “computationally efficient” components and
techniques and a new custom-tailored data set, researchers say their best-performing
model, Dual Video Discriminator GAN (DVD-GAN), can generate coherent
256 x 256-pixel videos of “notable fidelity” up to 48 frames in length.

3) Artwork- GANs are capable of more than generating images and video
footage. When trained on the right data sets, they’re able to produce de
novo works of art.

Researchers at the Indian Institute of Technology Hyderabad and the Sri Sathya Sai
Institute of Higher Learning devised a GAN, dubbed SkeGAN, that generates stroke-
based vector sketches of cats, fire trucks, mosquitoes, and yoga poses.

Scientists at Maastricht University in the Netherlands created a GAN that
produces logos from one of 12 different colours.

Victor Dibia, a human-computer interaction researcher and Carnegie Mellon graduate,
trained a GAN to synthesize African tribal masks.

Meanwhile, a team at the University of Edinburgh's Institute for Perception and Institute
for Astronomy designed a model that generates images of fictional galaxies that closely
follow the distributions of real galaxies.

In March during its GPU Technology Conference (GTC) in San Jose, California, Nvidia
took the wraps off of GauGAN, a generative adversarial AI system that lets users create
lifelike landscape images that never existed.

4) Music- GANs are architecturally well-suited to generating media, and
that includes music. In a paper published in August, researchers hailing
from the National Institute of Informatics in Tokyo describe a system
that’s able to generate “lyrics-conditioned” melodies from learned
relationships between syllables and notes.

Not to be outdone, in December, Amazon Web Services detailed DeepComposer, a
cloud-based service that taps a GAN to fill in compositional gaps in songs.

5) Speech- Google and Imperial College London researchers recently set
out to create a GAN-based text-to-speech system capable of matching
(or besting) state-of-the-art methods.

GAN-TTS consists of a neural network that learned to produce raw audio
by training on a corpus of speech with 567 pieces of encoded phonetic,
duration, and pitch data. To enable the model to generate sentences of
arbitrary length, the co-authors sampled 44 hours’ worth of two-second
snippets together with the corresponding linguistic features computed
for five-millisecond snippets. An ensemble of 10 discriminators, some of
which assess linguistic conditioning while others assess general
realism, attempts to distinguish between real and synthetic speech.

6) Medicine- In the medical field, GANs have been used to produce data
on which other AI models (in some cases, other GANs) might train, and to
invent treatments for rare diseases that to date haven’t received much
attention.

In April, researchers from Imperial College London, the University of Augsburg, and the
Technical University of Munich sought to synthesize data to fill in gaps in real data with a
model dubbed Snore-GAN. In a similar vein, researchers from Nvidia, the Mayo Clinic, and the MGH
and BWH Centre for Clinical Data Science proposed a model that generates synthetic
magnetic resonance images (MRIs) of brains with cancerous tumours.

Baltimore-based Insilico Medicine pioneered the use of GANs in molecular structure
creation for diseases with a known ligand but no target. Its team of researchers is
actively working on drug discovery programs in cancer, dermatological diseases,
fibrosis, Parkinson's, Alzheimer's, ALS, diabetes, sarcopenia, and aging.

7) Robotics- The field of robotics has a lot to gain from GANs, as it
turns out. A tuned discriminator can determine whether a machine’s
trajectory has been drawn from a distribution of human demonstrations
or from synthesized examples. In that way, it’s able to train agents to
complete tasks accurately, even when it has access only to the robot’s
positional information.

“The idea of using adversarial loss for training agent trajectories is not new, but what's
new is allowing it to work with a lot less data,” Tang said. “The trick to applying these
adversarial learning approaches is figuring out which inputs the discriminator has
access to (what information is available to avoid being tricked). Discriminators need
access to data alone, allowing us to train with expert demonstrations where all we have
are the state data.”

8) Deepfake detection- GANs’ ability to generate convincing photos
and videos of people makes them ripe targets for abuse. Researchers
suggest that GANs could root out deepfakes just as effectively as they
produce them. A paper published on the preprint server Arxiv.org in
March describes spamGAN, which learns from a limited corpus of
annotated and unannotated data. In experiments, the researchers say
that spamGAN outperformed existing spam detection techniques with
limited labelled data, achieving accuracy of between 71% and 86% when
trained on as little as 10% of labelled data.
