Introduction To Recurrent Neural Network
This article introduces a new variation of neural network, the Recurrent Neural Network (RNN), which works better than a simple neural network when the data is sequential, such as time-series data and text data.
Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other. However, when we need to predict the next word of a sentence, the previous words are required, so the network must remember them. RNNs solve this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers information about the sequence seen so far.
This state is also referred to as the memory state, since it remembers previous inputs to the network.
An RNN uses the same parameters for each input because it performs the same task on all inputs and hidden states to produce the output. This parameter sharing keeps the number of parameters low, unlike other neural networks.
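To make this concrete, below is a minimal NumPy sketch of a single recurrent step. The weight names (W_x, W_h, b), the tanh non-linearity, and the toy dimensions are illustrative assumptions, not something defined in this article.
Python
import numpy as np

# One recurrent step: the new hidden state depends on the current input
# and on the previous hidden state (the network's memory).
def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Toy example: 3 time steps of 4-dimensional inputs, hidden size 5.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 5))
W_h = rng.normal(size=(5, 5))
b = np.zeros(5)

h = np.zeros(5)
for x_t in rng.normal(size=(3, 4)):
    # The same W_x, W_h and b are reused at every step (parameter sharing).
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)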
(Figure: Recurrent Neuron)
(Figure: RNN Unfolding)
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves the same as a simple neural network and is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One to Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.
Many to One
In this type of network, many inputs are fed to the network at several steps, and the network generates only one output. This type of network is used in problems such as sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs. One example of this problem is language translation, where we provide multiple words from one language as input and predict multiple words in the second language as output.
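As a rough illustration of how the Many to One and Many to Many cases differ in code, the sketch below uses Keras SimpleRNN layers; the layer width, number of time steps, feature size, and number of classes are made-up values for demonstration only.
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed

timesteps, features, n_classes = 10, 8, 5  # hypothetical sizes

# Many to One: the RNN returns only its final hidden state,
# so the network makes a single prediction for the whole sequence.
many_to_one = Sequential([
    SimpleRNN(32, input_shape=(timesteps, features)),
    Dense(n_classes, activation='softmax'),
])

# Many to Many: return_sequences=True keeps one output per time step,
# so the network makes a prediction at every position in the sequence.
many_to_many = Sequential([
    SimpleRNN(32, input_shape=(timesteps, features), return_sequences=True),
    TimeDistributed(Dense(n_classes, activation='softmax')),
])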
(Figure: comparison of the number of parameters in an RNN and in a simple DNN.)
Model Building:
Build the RNN model using the 'relu' and 'softmax' activation functions.
Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
# seq_length and chars are assumed to come from the earlier data-preparation steps.
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
Model Compilation:
The model.compile line configures the model for training by specifying the optimizer (Adam), the loss function (categorical crossentropy), and the training metric (accuracy).
Python
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Model Training:
The model.fit line trains the model on the input sequences (X_one_hot) and the corresponding labels (y_one_hot) for 100 epochs, optimising the model parameters to minimise the categorical crossentropy loss.
Python
model.fit(X_one_hot, y_one_hot, epochs=100)
Output:
Epoch 1/100
2/2 [==============================] - 2s 54ms/step - loss: 2.8327 - accuracy:
0.0000e+00
Epoch 2/100
2/2 [==============================] - 0s 16ms/step - loss: 2.8121 - accuracy:
0.0000e+00
Epoch 3/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7944 - accuracy: 0.0208
Epoch 4/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7766 - accuracy: 0.0208
Epoch 5/100
2/2 [==============================] - 0s 15ms/step - loss: 2.7596 - accuracy: 0.0625
Epoch 6/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7424 - accuracy: 0.0833
Epoch 7/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7254 - accuracy: 0.1042
Epoch 8/100
2/2 [==============================] - 0s 12ms/step - loss: 2.7092 - accuracy: 0.1042
Epoch 9/100
2/2 [==============================] - 0s 11ms/step - loss: 2.6917 - accuracy: 0.1458
Epoch 10/100
2/2 [==============================] - 0s 12ms/step - loss: 2.6742 - accuracy: 0.1667
Epoch 11/100
2/2 [==============================] - 0s 10ms/step - loss: 2.6555 - accuracy: 0.1667
Epoch 12/100
2/2 [==============================] - 0s 16ms/step - loss: 2.6369 - accuracy: 0.1667
Model Prediction:
Generate text using the trained model.
Python
import numpy as np
import tensorflow as tf

# char_to_index, index_to_char, seq_length, chars and the trained model are
# assumed to come from the earlier data-preparation and training steps.
start_seq = "This is G"
generated_text = start_seq

for i in range(50):
    # encode the last seq_length characters generated so far
    x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    # predict the next character and append it to the running text
    prediction = model.predict(x_one_hot)
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char

print("Generated Text:")
print(generated_text)
Output:
1/1 [==============================] - 1s 517ms/step
1/1 [==============================] - 0s 75ms/step
1/1 [==============================] - 0s 101ms/step
1/1 [==============================] - 0s 93ms/step
1/1 [==============================] - 0s 132ms/step
1/1 [==============================] - 0s 143ms/step
1/1 [==============================] - 0s 140ms/step
details of the research methodology and dataset used in this paper. The experimental details and results are discussed in Section 5. Finally, the paper is concluded in Section 6.
2. Related Work
GANs were first introduced by Goodfellow [7] in 2014, but Reed et al. [8] was the first to use
them for text-to-image generation in 2016. Salimans et al. [9] proposed training stabilizing
techniques for previously untrainable models and achieved better results on the MNIST, CIFAR-
10, and SVHN datasets. The attention-based recurrent neural network was developed by Zia et
al. [10]. In their model, word-to-pixel dependencies were learned by an attention-based auto-
encoder and pixel-to-pixel dependencies were learned by an autoregressive-based decoder. Liu et
al. [11] offered a diverse conditional image synthesis model and performed large-scale
experiments for different conditional generation tasks. Gao et al. [12] proposed an effective
approach known as lightweight dynamic conditional GAN (LD-CGAN), which disentangled the
text attributes and provided image features by capturing multi-scale features. Dong et al. [13]
trained a model for generating images from text in an unsupervised manner. Berrahal et al. [14]
focused on the development of text-to-image conversion applications. They used deep fusion
GAN (DF-GAN) for generating human face images from textual descriptions. The cross-domain
feature fusion GAN (CFGAN) was proposed by Zhang et al. [15] for converting textual
descriptions into images with more semantic detail. In general, the existing methods of text-to-
image generation use wide-ranging parameters and heavy computations for generating high-
resolution images, which result in unstable and high-cost training.
Images were loaded for resizing to the same dimensions. All training images and
testing images were resized to a resolution of 128 × 128.
For training purposes, the images were converted into arrays, and both the vocabulary and images were loaded onto the model.
4. Proposed Methodology
This section describes the training
details of deep learning-based generative models. Conditional GANs were used with recurrent
neural networks (RNNs) and convolutional neural networks (CNNs) for generating meaningful
images from a textual description. The dataset used consisted of images of flowers and their
relevant textual descriptions. For generating plausible images from text using a GAN,
preprocessing of textual data and image resizing was performed. We took textual descriptions
from the dataset, preprocessed these caption sentences, and created a list of their vocabulary.
Then, these captions were stored with their respective ids in the list.
The images were loaded and resized to a fixed dimension. These data were then given as input
to our proposed model. RNN was used for capturing the contextual information of text sequences
by defining the relationship between words at different time stamps. Text-to-image mapping was
performed using an RNN and a CNN. The CNN recognized useful characteristics from the
images without the need for human intervention. An input sequence was given to the RNN,
which converted the textual descriptions into word embeddings with a size of 256. These word
embeddings were concatenated with a 512-dimensional noise vector. To train our model, we
took a batch size of 64 with gated-feedback 128 and fed the input noise and text input to a
generator. The architecture of the proposed model is presented in Figure 1.
Figure 1. Architecture of the proposed method, which can generate images from text
descriptions. Semantic information from the textual description was used as input in the
generator model, which converts characteristic information to pixels and generates the images.
This generated image was used as input in the discriminator along with real/wrong textual
descriptions and real sample images from the dataset. A sequence of distinct (picture and text) pairings is then provided as input to the model to meet the goals of the discriminator: input pairs of real images and real textual descriptions, wrong images and mismatched textual descriptions, and generated images and real textual descriptions. The real photo and real text combinations are provided so that the model can determine whether a particular image and text combination aligns. An incorrect picture and real text description indicates that the image does not match the caption. The discriminator is trained to identify real and generated images. At the start of training, the discriminator was good at classifying real/wrong images. Loss was calculated to update the weights and to provide training feedback to the generator and discriminator models. As training proceeded, the generator produced more realistic images and fooled the discriminator when distinguishing between real and generated images.
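The paper does not include code, but a rough sketch of the three discriminator input pairs described above is given below. Here D stands for a hypothetical discriminator model that takes an image and a text embedding; the names and loss formulation are assumptions for illustration, not the authors' implementation.
Python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(D, real_img, wrong_img, fake_img, text_emb):
    # Real image + matching text should be classified as real (label 1).
    d_real = D([real_img, text_emb])
    # Mismatched image + real text and generated image + real text
    # should both be classified as fake (label 0).
    d_wrong = D([wrong_img, text_emb])
    d_fake = D([fake_img, text_emb])
    return (bce(tf.ones_like(d_real), d_real)
            + bce(tf.zeros_like(d_wrong), d_wrong)
            + bce(tf.zeros_like(d_fake), d_fake))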
Content-based recommendation systems operate on the premise of suggesting items to users based
on the content attributes of those items and a user’s past preferences. These systems focus on
features and characteristics associated with items, such as text descriptions, genres, keywords, or
metadata.
The recommendations generated are aligned with the user’s historical interactions and
preferences. Content-based systems excel in providing recommendations that are closely related
to the user’s demonstrated interests. For example, a content-based movie recommendation system
might suggest films with similar genres or themes to those the user has previously enjoyed.
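As an illustrative sketch of this idea (not taken from the text above), the snippet below recommends the item whose description has the highest TF-IDF cosine similarity to an item the user already liked; the toy descriptions are invented for demonstration.
Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy item descriptions and a user who previously liked item 0.
item_descriptions = [
    "space adventure science fiction",
    "romantic comedy set in paris",
    "alien invasion science fiction thriller",
    "historical war drama",
]
liked = [0]

tfidf = TfidfVectorizer().fit_transform(item_descriptions)
scores = cosine_similarity(tfidf[liked], tfidf).mean(axis=0)
scores[liked] = -1  # do not recommend items the user has already seen
print("Recommended item:", int(scores.argmax()))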
Collaborative filtering recommendation systems, on the other hand, rely on the collective
behavior and preferences of a large user base to make suggestions. This approach assumes that
users who have exhibited similar preferences in the past will continue to do so in the future.
Collaborative filtering can be further categorized into two subtypes: user-based and item-based.
User-based collaborative filtering recommends items to a user based on the preferences of users
who are similar to them. Item-based collaborative filtering suggests items similar to those the user
has shown interest in, based on the behavior of other users. These systems are effective at
suggesting items that are trending or popular among users with similar preferences.
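In the same spirit, a minimal item-based collaborative filtering sketch on a toy user-item rating matrix (all values invented) could look like this:
Python
import numpy as np

# Rows are users, columns are items; 0 means the user has not rated the item.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
item_sim = (ratings.T @ ratings) / (norms.T @ norms + 1e-9)

# Score items for user 0 as a similarity-weighted sum of their own ratings.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # exclude items the user has already rated
print("Recommended item:", int(np.argmax(scores)))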
Two-Tower Architecture
The user tower processes user data, such as profiles and historical interactions, while the item
tower encodes item features like metadata and content descriptors. By separately encoding user
and content information, the Two-Tower architecture excels in delivering highly personalized
recommendations. It is particularly adept at addressing challenges like the cold start problem,
where it must recommend to new users or new items with limited interaction data. This
architecture is highly efficient, scalable, and capable of fine-tuning recommendations based on
nuanced user preferences.
In the realm of retrieval systems, Two-Tower Neural Networks (NNs) hold a special significance.
Our retrieval approach, grounded in machine learning, harnesses the power of the Word2Vec
algorithm to create embeddings for both users and media/authors based on their unique identifiers.
The Two Towers model expands upon the Word2Vec algorithm, permitting the incorporation of
diverse user or media/author characteristics. This adaptation also facilitates concurrent learning
across multiple objectives, enhancing its utility for multi-objective retrieval tasks. Notably, this
model retains the scalability and real-time capabilities inherent in Word2Vec, making it an
excellent choice for candidate sourcing algorithms.
Here’s a high-level overview of how Two-Tower retrieval operates in conjunction with a schema:
1. The Two Tower model comprises two distinct neural networks — one for users and one for
items.
2. Each neural network exclusively processes features pertinent to its respective entity and
generates an embedding.
3. The primary objective is to predict engagement events (e.g., user likes on a post) by measuring
the similarity between user and item embeddings.
4. Following training, user embeddings are optimized to closely match embeddings of relevant
items, enabling the use of nearby item embeddings for ranking purposes.
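To make the two-tower idea concrete, here is a minimal Keras sketch; the feature dimensions, layer widths, and embedding size are illustrative assumptions rather than details of the system described above.
Python
import tensorflow as tf
from tensorflow.keras import layers, Model

emb_dim = 64  # shared embedding size (hypothetical)

# User tower: encodes user features into an embedding.
user_in = layers.Input(shape=(32,), name="user_features")
u = layers.Dense(128, activation="relu")(user_in)
user_emb = layers.Dense(emb_dim)(u)

# Item tower: encodes item features into an embedding of the same size.
item_in = layers.Input(shape=(48,), name="item_features")
v = layers.Dense(128, activation="relu")(item_in)
item_emb = layers.Dense(emb_dim)(v)

# Engagement (e.g., a like) is predicted from the cosine similarity
# between the user and item embeddings.
score = layers.Dot(axes=1, normalize=True)([user_emb, item_emb])
model = Model([user_in, item_in], score)
model.compile(optimizer="adam", loss="binary_crossentropy")
After training, the item-tower embeddings can be precomputed and indexed so that, for a given user embedding, nearby item embeddings can be retrieved efficiently for ranking.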
learning using four popular machine learning algorithms, namely Random Forest Classifier, KNN, Decision Tree Classifier, and Naive Bayes classifier. We will jump directly into the implementation, step by step.
classification using machine learning and machine learning image classification. However, the work demonstrated here will be useful for research purposes if one wishes to compare their CNN image classifier model with some machine learning algorithms.
Learning Objective:
• Showcase how to test the trained models on custom input images and evaluate their
performance.
Dataset Acquisition
Source: cs.toronto
The dataset utilized in this blog is the CIFAR-10 dataset, which is a Keras dataset that can be
easily downloaded using the following code. The dataset includes ten classes: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck, indicating that we will be
addressing a multi-class classification problem.
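A minimal snippet for this, assuming the standard tensorflow.keras.datasets loader, is shown below.
Python
from tensorflow.keras.datasets import cifar10

# Downloads CIFAR-10 on first use: 50,000 training and 10,000 test images,
# each a 32x32 RGB image labelled with one of the ten classes listed above.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)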