
Introduction to Recurrent Neural Network


This article introduces the Recurrent Neural Network (RNN), a variation of the neural network that works better than a simple neural network when the data is sequential, such as time-series data and text data.
Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other. However, to predict the next word of a sentence, the previous words are required, so the network needs a way to remember them. RNNs solve this problem with a hidden state, their main and most important feature, which retains information about the sequence seen so far.
The hidden state is also referred to as the memory state, since it remembers previous inputs to the network.
An RNN uses the same parameters at every time step, because it performs the same task on each input (or hidden state) to produce the output. This parameter sharing reduces the number of parameters compared with other neural networks.

How an RNN differs from a Feedforward Neural Network


Artificial neural networks that do not have looping connections are called feedforward neural networks. Because all information is only passed forward, this kind of network is also referred to as a multi-layer neural network.
In a feedforward neural network, information moves unidirectionally from the input layer, through any hidden layers, to the output layer. These networks are appropriate for tasks such as image classification, where the inputs and outputs are independent. However, their inability to retain previous inputs makes them less useful for sequential data analysis.

Recurrent Vs Feedforward networks


Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN) is the recurrent unit, often informally called a recurrent neuron. This unit maintains a hidden state, allowing the network to capture sequential dependencies by remembering previous inputs while processing the current one. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants improve the RNN's ability to handle long-term dependencies.

Recurrent Neuron

RNN Unfolding

Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves like a standard neural network and is also known as a vanilla neural network. There is only one input and one output.
One to Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence consisting of multiple words.

Many to One
In this type of network, many inputs are fed to the network at several time steps, and only one output is generated. This type of network is used for problems such as sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.

Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation, where we provide multiple words from one language as input and predict multiple words in the second language as output.

Recurrent Neural Network Architecture


RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep feedforward networks, where each dense layer has its own weight matrix, in an RNN the weights are shared across the whole network. It calculates a hidden state h_t for every input x_t using the following formulas:

h_t = σ(U x_t + W h_{t-1} + b)
y_t = O(V h_t + c)

where O denotes the output activation. Hence

y_t = f(x_t, h_{t-1}, W, U, V, b, c)

Here the state matrix S has element s_i, the state of the network at time step i, and the parameters W, U, V, b, c are shared across time steps.

Recurrent Neural Architecture

How does RNN work?


The Recurrent Neural Network consists of multiple units with a fixed activation function, one for each time step. Each unit has an internal state called the hidden state, which represents the past knowledge that the network holds at a given time step. The hidden state is updated at every time step to reflect the change in the network's knowledge about the past, using the following recurrence relation.
The formula for calculating the current state:

h_t = f(h_{t-1}, x_t)

where,
• h_t -> current state
• h_{t-1} -> previous state
• x_t -> input at the current time step
Formula for applying the activation function (tanh):

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)

where,
• W_hh -> weight at the recurrent neuron
• W_xh -> weight at the input neuron
The formula for calculating the output:

y_t = W_hy · h_t

• y_t -> output
• W_hy -> weight at the output layer
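To make these formulas concrete, here is a minimal NumPy sketch of the forward pass over a toy sequence (the sizes, random weights, and inputs are made up purely for illustration):
Python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 3, 4, 2, 5   # hypothetical sizes

W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # weight at the input neuron
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # weight at the recurrent neuron
W_hy = rng.normal(0, 0.1, (output_size, hidden_size))  # weight at the output layer

xs = rng.normal(size=(T, input_size))                  # toy input sequence x_1 ... x_T
h = np.zeros(hidden_size)                              # initial hidden state h_0

for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ xs[t])               # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y = W_hy @ h                                       # y_t = W_hy h_t
    print(f"step {t}: y = {y}")

Note that the same three weight matrices are reused at every time step; this is the parameter sharing described earlier.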
These parameters are updated using Backpropagation. However, since RNN works on sequential
data here we use an updated backpropagation which is known as Backpropagation through time.
Backpropagation Through Time (BPTT)
An RNN forms an ordered computation: each variable is computed one at a time in a specified order (first h1, then h2, then h3, and so on). Hence backpropagation is applied sequentially through all of these hidden states.

• L(θ) (the loss function) depends on h3
• h3 in turn depends on h2 and W
• h2 in turn depends on h1 and W
• h1 in turn depends on h0 and W
• where h0 is a constant starting state.
∂L(θ)/∂W = Σ_{t=1}^{T} ∂L_t(θ)/∂W

where the total loss is the sum of the per-time-step losses L_t(θ). For simplicity, we will apply backpropagation to only one of these terms:

∂L(θ)/∂W = (∂L(θ)/∂h3) · (∂h3/∂W)
We already know how to compute ∂L(θ)/∂h3, as it is the same as in any simple deep neural network backpropagation. However, we still need to see how to compute the term ∂h3/∂W.
As we know, h3 = σ(W h2 + b).
In such an ordered network, we cannot compute ∂h3/∂W by treating h2 as a constant, because h2 also depends on W. The total derivative ∂h3/∂W therefore has two parts:
1. Explicit: ∂⁺h3/∂W, treating all other inputs as constant (∂⁺ denotes this direct derivative)
2. Implicit: summing over all indirect paths from h3 to W
Let us see how to do this:

∂h3/∂W = ∂⁺h3/∂W + (∂h3/∂h2)(∂h2/∂W)
       = ∂⁺h3/∂W + (∂h3/∂h2)[∂⁺h2/∂W + (∂h2/∂h1)(∂h1/∂W)]
       = ∂⁺h3/∂W + (∂h3/∂h2)(∂⁺h2/∂W) + (∂h3/∂h2)(∂h2/∂h1)(∂⁺h1/∂W)
For simplicity, we will short-circuit some of the paths:

∂h3/∂W = ∂⁺h3/∂W + (∂h3/∂h2)(∂⁺h2/∂W) + (∂h3/∂h1)(∂⁺h1/∂W)
Finally, we have

∂L(θ)/∂W = (∂L(θ)/∂h3) · (∂h3/∂W)

where

∂h3/∂W = Σ_{k=1}^{3} (∂h3/∂hk) · (∂⁺hk/∂W)    (with ∂h3/∂h3 taken as the identity)

Hence,

∂L(θ)/∂W = (∂L(θ)/∂h3) · Σ_{k=1}^{3} (∂h3/∂hk) · (∂⁺hk/∂W)
This algorithm is called backpropagation through time (BPTT), as we backpropagate through all previous time steps.
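The accumulation above can be illustrated with a small NumPy sketch for a toy RNN whose loss depends only on the final hidden state (the squared-error loss, the extra input term U x_t, and all sizes are illustrative assumptions, not part of the derivation itself):
Python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 4, 3, 3                        # hidden size, input size, time steps

W = rng.normal(0, 0.5, (H, H))           # recurrent weights (shared across steps)
U = rng.normal(0, 0.5, (H, D))           # input weights
b = np.zeros(H)

xs = rng.normal(size=(T, D))             # toy input sequence x_1 ... x_T
hs = [np.zeros(H)]                       # h_0 is a constant starting state
for t in range(T):
    hs.append(np.tanh(W @ hs[-1] + U @ xs[t] + b))

# Toy loss: L = 0.5 * ||h_T - target||^2, so dL/dh_T = h_T - target
target = np.ones(H)
grad_h = hs[-1] - target                 # running factor dL/dh_k, starting at k = T

# BPTT: accumulate the explicit contribution of W at each step, then push the gradient back
dL_dW = np.zeros_like(W)
for k in range(T, 0, -1):
    dtanh = 1.0 - hs[k] ** 2             # derivative of tanh at step k
    dL_dW += np.outer(grad_h * dtanh, hs[k - 1])   # explicit (direct) contribution at step k
    grad_h = W.T @ (grad_h * dtanh)      # gradient with respect to h_{k-1}

print(dL_dW)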
Issues of Standard RNNs
1. Vanishing gradient: Time-dependent and sequential data problems, such as text generation, machine translation, and stock market prediction, can be modelled with recurrent neural networks. However, during training the gradients are multiplied by the recurrent weights at every time step, so they can shrink exponentially as they are propagated back through long sequences, which makes training RNNs difficult.
2. Exploding gradient: An exploding gradient occurs when, during training, the gradients grow exponentially rather than decay. Large error gradients accumulate and lead to very large updates to the model weights, which makes training unstable.
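A quick NumPy experiment illustrates both issues: repeatedly multiplying a gradient by the recurrent weight matrix, as BPTT does, makes its norm shrink or grow exponentially with the number of time steps (this sketch ignores the activation-function derivative for simplicity; the sizes and scales are arbitrary):
Python
import numpy as np

rng = np.random.default_rng(0)
H, T = 8, 50
h_grad = np.ones(H)

for scale in (0.5, 1.5):                 # "small" vs. "large" recurrent weights
    W = scale * np.linalg.qr(rng.normal(size=(H, H)))[0]   # scaled orthogonal matrix
    g = h_grad.copy()
    for _ in range(T):                   # repeated multiplication by W^T, as in BPTT
        g = W.T @ g
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")

With a scale of 0.5 the norm collapses toward zero (vanishing), and with 1.5 it grows to the order of 10^9 (exploding).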
Training through RNN
1. A single time step of the input is provided to the network.
2. The current state is calculated from the current input and the previous state.
3. The current state h_t becomes h_{t-1} for the next time step.
4. This can be repeated for as many time steps as the problem requires, combining information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual (target) output, and the error is computed.
7. The error is back-propagated through the network to update the weights, and the network is thus trained using backpropagation through time.
Advantages and Disadvantages of Recurrent Neural Network
Advantages
1. An RNN retains information through time, which is what makes it useful for time-series prediction, where previous inputs matter. Variants such as Long Short-Term Memory (LSTM) extend this memory over longer ranges.
2. Recurrent neural networks can even be combined with convolutional layers to extend the effective pixel neighborhood.
Disadvantages
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Variation Of Recurrent Neural Network (RNN)
To overcome problems such as vanishing and exploding gradients, several advanced versions of the RNN have been developed, including:
1. Bidirectional Neural Network (BiNN)
2. Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of the recurrent neural network in which the input information flows in both directions, and the outputs of both directions are combined to produce the output. BiNNs are useful when the context of the input matters, such as in NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on a read-write-forget principle: given the input, the network reads and writes the most useful information from the data and forgets the information that is not important for predicting the output. To do this, three gates are introduced into the recurrent unit, so that only the selected information is passed through the network.
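For reference, the standard LSTM update equations are (stated here for completeness; they are not derived in this article):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)      (input gate: what to write)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)      (forget gate: what to discard)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)      (output gate: what to read out)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)   (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

Here σ is the sigmoid function and ⊙ denotes element-wise multiplication; the three gates implement the read-write-forget behaviour described above.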
Difference between RNN and Simple Neural Network
An RNN is considered the better choice of deep network when the data is sequential. The significant differences between an RNN and a simple deep neural network are listed below:

Recurrent Neural Network vs. Deep Neural Network
1. Weights: in an RNN, the same weights are shared across all time steps; in a deep neural network, each layer has its own weights.
2. Sequential data: RNNs are used when the data is sequential and the number of inputs is not predefined; a simple deep neural network has no special mechanism for sequential data, and its number of inputs is fixed.
3. Parameters: the number of parameters in an RNN is higher than in a simple DNN, which has fewer parameters than an RNN.
4. Gradients: exploding and vanishing gradients are a major drawback of RNNs; these problems also occur in DNNs, but they are not as severe there.
RNN Code Implementation
Imported libraries:
We import some necessary libraries, such as numpy and tensorflow, for numerical calculation and model building.
Python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
Input Generation:
Generated some example data using text.
Python
text = "This is GeeksforGeeks a software training institute"
chars = sorted(list(set(text)))
char_to_index = {char: i for i, char in enumerate(chars)}
index_to_char = {i: char for i, char in enumerate(chars)}
Created input sequences and corresponding labels for further implementation.
Python
seq_length = 3
sequences = []
labels = []

for i in range(len(text) - seq_length):
    seq = text[i:i+seq_length]
    label = text[i+seq_length]
    sequences.append([char_to_index[char] for char in seq])
    labels.append(char_to_index[label])
Converted the sequences and labels into numpy arrays and used one-hot encoding to convert them into vectors.
Python
X = np.array(sequences)
y = np.array(labels)

X_one_hot = tf.one_hot(X, len(chars))


y_one_hot = tf.one_hot(y, len(chars))
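As a quick sanity check, X_one_hot should have shape (number of sequences, seq_length, vocabulary size) and y_one_hot shape (number of sequences, vocabulary size):
Python
print(X_one_hot.shape, y_one_hot.shape)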

Model Building:
Build the RNN model using the 'relu' and 'softmax' activation functions.
Python
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
Model Compilation:
The model.compile line builds the neural network for training by specifying the optimizer
(Adam), the loss function (categorical crossentropy), and the training metric (accuracy).
Python
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Model Training:
Using the input sequences (X_one_hot) and corresponding labels (y_one_hot) for 100 epochs,
the model is trained using the model.fit line, which optimises the model parameters to minimise
the categorical crossentropy loss.
Python
model.fit(X_one_hot, y_one_hot, epochs=100)
output:
Epoch 1/100
2/2 [==============================] - 2s 54ms/step - loss: 2.8327 - accuracy:
0.0000e+00
Epoch 2/100
2/2 [==============================] - 0s 16ms/step - loss: 2.8121 - accuracy:
0.0000e+00
Epoch 3/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7944 - accuracy: 0.0208
Epoch 4/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7766 - accuracy: 0.0208
Epoch 5/100
2/2 [==============================] - 0s 15ms/step - loss: 2.7596 - accuracy: 0.0625
Epoch 6/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7424 - accuracy: 0.0833
Epoch 7/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7254 - accuracy: 0.1042
Epoch 8/100
2/2 [==============================] - 0s 12ms/step - loss: 2.7092 - accuracy: 0.1042
Epoch 9/100
2/2 [==============================] - 0s 11ms/step - loss: 2.6917 - accuracy: 0.1458
Epoch 10/100
2/2 [==============================] - 0s 12ms/step - loss: 2.6742 - accuracy: 0.1667
Epoch 11/100
2/2 [==============================] - 0s 10ms/step - loss: 2.6555 - accuracy: 0.1667
Epoch 12/100
2/2 [==============================] - 0s 16ms/step - loss: 2.6369 - accuracy: 0.1667

Model Prediction:
Generated text using pre-trained model.
Python
start_seq = "This is G"
generated_text = start_seq

for i in range(50):
    x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    prediction = model.predict(x_one_hot)
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char

print("Generated Text:")
print(generated_text)
output:
1/1 [==============================] - 1s 517ms/step
1/1 [==============================] - 0s 75ms/step
1/1 [==============================] - 0s 101ms/step
1/1 [==============================] - 0s 93ms/step
1/1 [==============================] - 0s 132ms/step
1/1 [==============================] - 0s 143ms/step
1/1 [==============================] - 0s 140ms/step

Application: Text-to-Image Generation


1. Introduction
When people listen to or read a narrative, they quickly create pictures in their mind to visualize the content. Many cognitive functions, such as memorization, reasoning ability, and thinking, rely on visual mental imaging or "seeing with the mind's eye" [1]. Developing a technology that recognizes the connection between vision and words and can produce pictures that represent the meaning of written descriptions is a big step toward user intellectual ability. Image-processing techniques and applications of computer vision (CV) have grown immensely in recent years from advances made possible by artificial intelligence and deep learning's success. One of these growing fields is text-to-image generation.
The term text-to-image (T2I) is the generation of visually realistic pictures from text inputs.
T2I generation is the reverse process of image captioning, also known as image-to-text (I2T)
generation [2–4], which is the generation of textual description from an input image. In T2I
generation, the model takes an input in the form of human written description and produces a
RGB image that matches the description. T2I generation has been an important field of study due
to its tremendous capability in multiple areas. Photo-searching, photo-editing, art generation,
captioning, portrait drawing, industrial design, and image manipulation are some common
applications of creating photo-realistic images from text. The evolution of generative adversarial
networks (GANs) has demonstrated exceptional performance in image synthesis, image super-
resolution, data augmentation, and image-to-image conversion. GANs are deep learning-based
convolution neural networks
(CNNs) [5,6]. A GAN consists of two neural networks: one for generating data and the other for classifying real/fake data. GANs are based on game theory for learning generative models. Their major purpose is to train a generator (G) to generate samples and a discriminator (D) to discern between true and false data. For generating better-quality realistic images, we performed text encoding using recurrent neural networks (RNN), and convolutional layers were used for image decoding. We developed recurrent convolution GAN (RC-GAN), a simple and effective framework for appealing image synthesis from human-written textual descriptions. The model was trained on the Oxford-102 Flowers Dataset and ensures the identity of the synthesized pictures. The key contributions of this research include the following:
• Building a deep learning model, RC-GAN, for generating more realistic images.
• Generating more realistic images from given textual descriptions.
• Improving the inception score and PSNR value of images generated from text.
The rest of the paper is arranged as follows: In Section 2, related work is described. The dataset and its preprocessing are discussed in Section 3. Section 4 explains the details of the research methodology and dataset used in this paper. The experimental details and results are discussed in Section 5. Finally, the paper is concluded in Section 6.
2. Related Work
GANs were first introduced by Goodfellow [7] in 2014, but Reed et al. [8] was the first to use
them for text-to-image generation in 2016. Salimans et al. [9] proposed training stabilizing
techniques for previously untrainable models and achieved better results on the MNIST, CIFAR-
10, and SVHN datasets. The attention-based recurrent neural network was developed by Zia et
al. [10]. In their model, word-to-pixel dependencies were learned by an attention-based auto-
encoder and pixel-to-pixel dependencies were learned by an autoregressive-based decoder. Liu et
al. [11] offered a diverse conditional image synthesis model and performed large-scale
experiments for different conditional generation tasks. Gao et al. [12] proposed an effective
approach known as lightweight dynamic conditional GAN (LD-CGAN), which disentangled the
text attributes and provided image features by capturing multi-scale features. Dong et al. [13]
trained a model for generating images from text in an unsupervised manner. Berrahal et al. [14]
focused on the development of text-to-image conversion applications. They used deep fusion
GAN (DF-GAN) for generating human face images from textual descriptions. The cross-domain
feature fusion GAN (CFGAN) was proposed by Zhang et al. [15] for converting textual
descriptions into images with more semantic detail. In general, the existing methods of text-to-
image generation use wide-ranging parameters and heavy computations for generating high-
resolution images, which result in unstable and high-cost training.

The dataset used consisted of images of flowers and their relevant textual descriptions. For generating plausible images from text using a GAN, preprocessing of the textual data and image resizing was performed. We took textual descriptions from the dataset, preprocessed these caption sentences, and created a list of their vocabulary. Then, these captions were stored with their respective ids in the list. The images were loaded and resized to the same dimensions: all training and testing images were resized to a resolution of 128 × 128. For training purposes, the images were converted into arrays, and both the vocabulary and the images were loaded onto the model.

4. Proposed Methodology
This section describes the training details of the deep learning-based generative models. Conditional GANs were used with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for generating meaningful images from a textual description. The preprocessed data described above were given as input to our proposed model. The RNN was used for capturing the contextual information of the text sequences by defining the relationship between words at different time stamps. Text-to-image mapping was performed using an RNN and a CNN. The CNN recognized useful characteristics from the images without the need for human intervention. An input sequence was given to the RNN, which converted the textual descriptions into word embeddings with a size of 256. These word embeddings were concatenated with a 512-dimensional noise vector. To train the model, we used a batch size of 64 with gated feedback of 128 and fed the input noise and the text input to the generator. The architecture of the proposed model is presented in Figure 1.
Figure 1. Architecture of the proposed method, which can generate images from text descriptions.

Semantic information from the textual description is used as input in the generator model, which converts the characteristic information to pixels and generates the images. The generated image is then used as input to the discriminator along with real/wrong textual descriptions and real sample images from the dataset. A sequence of distinct (picture, text) pairings is then provided as input to the model to meet the goals of the discriminator: pairs of real images with real textual descriptions, wrong images with mismatched textual descriptions, and generated images with real textual descriptions. The real photo and real text combinations are provided so that the model can determine whether a particular image and text combination align. An incorrect picture paired with a real text description indicates that the image does not match the caption. The discriminator is trained to identify real and generated images. At the start of training, the discriminator was good at classifying real/wrong images. A loss was calculated to update the weights and provide training feedback to the generator and discriminator models. As training proceeded, the generator produced more realistic images and fooled the discriminator when distinguishing between real and generated images.
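As a rough sketch of the generator input path described above (only the 256-dimensional text embedding and the 512-dimensional noise vector come from the text; the GRU encoder, vocabulary size, and up-sampling layers are assumptions for illustration, not the paper's exact RC-GAN architecture):

import tensorflow as tf
from tensorflow.keras import layers, Model

EMBED_DIM, NOISE_DIM = 256, 512          # from the description above; everything else is assumed

# Text encoder: an RNN (here a GRU, as an assumption) turns token ids into a text embedding.
tokens = layers.Input(shape=(None,), dtype=tf.int32, name="caption_tokens")
text_emb = layers.GRU(EMBED_DIM)(layers.Embedding(5000, 128)(tokens))

# The text embedding is concatenated with a noise vector and decoded to a 128 x 128 RGB image.
noise = layers.Input(shape=(NOISE_DIM,), name="noise")
x = layers.Concatenate()([text_emb, noise])              # (batch, 256 + 512)
x = layers.Dense(8 * 8 * 256, activation="relu")(x)
x = layers.Reshape((8, 8, 256))(x)
for filters in (128, 64, 32, 16):                        # 8 -> 16 -> 32 -> 64 -> 128
    x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same", activation="relu")(x)
image = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)   # 128 x 128 x 3 output

generator = Model(inputs=[tokens, noise], outputs=image)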

Video Recommendation System

Video recommendation systems are a fundamental component of many popular streaming platforms, such as YouTube and Netflix. These systems are tasked with providing users with engaging and personalized content recommendations. They come in various flavors, each offering a unique approach to suggesting personalized content to users. Three prominent types of recommendation systems are content-based, collaborative filtering, and the innovative two-tower architecture.

Content-Based Recommendation Systems

Content-based recommendation systems operate on the premise of suggesting items to users based
on the content attributes of those items and a user’s past preferences. These systems focus on
features and characteristics associated with items, such as text descriptions, genres, keywords, or
metadata.

The recommendations generated are aligned with the user’s historical interactions and
preferences. Content-based systems excel in providing recommendations that are closely related
to the user’s demonstrated interests. For example, a content-based movie recommendation system
might suggest films with similar genres or themes to those the user has previously enjoyed.
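A tiny sketch of this idea (the item descriptions, the user profile text, and the use of TF-IDF with cosine similarity are illustrative choices, not a specific production system):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions and a profile built from content the user liked before.
items = ["space opera adventure", "romantic comedy in paris", "sci-fi thriller about AI"]
user_profile = "space sci-fi adventure"

vec = TfidfVectorizer()
item_vecs = vec.fit_transform(items)            # content features of the items
user_vec = vec.transform([user_profile])        # same feature space for the user profile

scores = cosine_similarity(user_vec, item_vecs).ravel()
print("Best match:", items[scores.argmax()])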

Collaborative Filtering Recommendation Systems

Collaborative filtering recommendation systems, on the other hand, rely on the collective
behavior and preferences of a large user base to make suggestions. This approach assumes that
users who have exhibited similar preferences in the past will continue to do so in the future.
Collaborative filtering can be further categorized into two subtypes: user-based and item-based.
User-based collaborative filtering recommends items to a user based on the preferences of users
who are similar to them. Item-based collaborative filtering suggests items similar to those the user
has shown interest in, based on the behavior of other users. These systems are effective at
suggesting items that are trending or popular among users with similar preferences.
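A toy user-based example (the ratings matrix and the cosine-similarity weighting below are illustrative assumptions):

import numpy as np

# Rows = users, columns = items; 0 means the user has not watched/rated the item yet.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

# Cosine similarity between user 0 and every user.
norms = np.linalg.norm(ratings, axis=1)
sims = ratings @ ratings[0] / (norms * norms[0])

# Score items for user 0 by similarity-weighted ratings from the other users.
weights = sims.copy()
weights[0] = 0.0                                  # exclude the user themselves
scores = weights @ ratings / weights.sum()
unseen = ratings[0] == 0
print("Recommended item index:", np.argmax(np.where(unseen, scores, -np.inf)))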

Two-Tower Architecture

The Two-Tower architecture is a cutting-edge recommendation system design that leverages neural networks to enhance recommendation quality. In this architecture, two separate "towers" are used to encode user and item (content) information independently.

The user tower processes user data, such as profiles and historical interactions, while the item
tower encodes item features like metadata and content descriptors. By separately encoding user
and content information, the Two-Tower architecture excels in delivering highly personalized
recommendations. It is particularly adept at addressing challenges like the cold start problem,
where it must recommend to new users or new items with limited interaction data. This
architecture is highly efficient, scalable, and capable of fine-tuning recommendations based on
nuanced user preferences.

two tower architecture

Exploring Two-Tower Neural Networks for Enhanced Retrieval

In the realm of retrieval systems, Two-Tower Neural Networks (NNs) hold a special significance.
Our retrieval approach, grounded in machine learning, harnesses the power of the Word2Vec
algorithm to create embeddings for both users and media/authors based on their unique identifiers.

The Two Towers model expands upon the Word2Vec algorithm, permitting the incorporation of
diverse user or media/author characteristics. This adaptation also facilitates concurrent learning
across multiple objectives, enhancing its utility for multi-objective retrieval tasks. Notably, this
model retains the scalability and real-time capabilities inherent in Word2Vec, making it an
excellent choice for candidate sourcing algorithms.

Here’s a high-level overview of how Two-Tower retrieval operates in conjunction with a schema:
1. The Two Tower model comprises two distinct neural networks — one for users and one for
items.

2. Each neural network exclusively processes features pertinent to its respective entity and
generates an embedding.

3. The primary objective is to predict engagement events (e.g., user likes on a post) by measuring
the similarity between user and item embeddings.

4. Following training, user embeddings are optimized to closely match embeddings of relevant
items, enabling the use of nearby item embeddings for ranking purposes.
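Below is a minimal Keras sketch of steps 1-3 (using raw ids as the only features; the vocabulary sizes, layer widths, and binary engagement labels are illustrative assumptions rather than the production design described here):

import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_USERS, NUM_ITEMS, EMBED_DIM = 10_000, 50_000, 64    # hypothetical sizes

# User tower: processes user features (here just a user id) into an embedding.
user_id = layers.Input(shape=(1,), dtype=tf.int32, name="user_id")
u = layers.Flatten()(layers.Embedding(NUM_USERS, 128)(user_id))
u = layers.Dense(EMBED_DIM)(layers.Dense(128, activation="relu")(u))

# Item tower: processes item/content features (here just an item id) into an embedding.
item_id = layers.Input(shape=(1,), dtype=tf.int32, name="item_id")
v = layers.Flatten()(layers.Embedding(NUM_ITEMS, 128)(item_id))
v = layers.Dense(EMBED_DIM)(layers.Dense(128, activation="relu")(v))

# An engagement event (e.g., a like) is predicted from the similarity of the two embeddings.
score = layers.Dot(axes=-1)([u, v])
model = Model(inputs=[user_id, item_id], outputs=score)
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# After training, the item tower alone can embed the whole catalogue, so nearest-neighbour
# search over item embeddings retrieves candidates close to a given user embedding (step 4).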

Image Classification Using Machine Learning

In this section, we look at image classification using four popular machine learning algorithms, namely the Random Forest classifier, KNN, the Decision Tree classifier, and the Naive Bayes classifier, and jump directly into a step-by-step implementation.

Traditional machine learning is not the usual choice for image classification compared with deep learning. However, the work demonstrated here will help serve research purposes if one desires to compare a CNN image classifier model with some machine learning algorithms.

Learning Objectives:

• Provide a step-by-step guide to implementing image classification using popular machine learning algorithms such as Random Forest, KNN, Decision Tree, and Naive Bayes.

• Demonstrate the limitations of traditional machine learning algorithms for image classification tasks and highlight the need for deep learning approaches.

• Showcase how to test the trained models on custom input images and evaluate their performance.


Dataset Acquisition

The dataset utilized in this blog is the CIFAR-10 dataset, which is a Keras dataset that can be
easily downloaded using the following code. The dataset includes ten classes: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck, indicating that we will be
addressing a multi-class classification problem.

First, let’s import the required packages as follows:

from tensorflow import keras


import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
import numpy as np
import cv2
The dataset can be loaded using the code below:

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
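As a minimal sketch of the next step (the flattening, pixel scaling, training subset, and choice of Random Forest are illustrative; the other three classifiers can be swapped in the same way), the images can be turned into feature vectors and fed to one of the classifiers listed above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Flatten the 32x32x3 images into 3072-dimensional vectors and scale pixels to [0, 1],
# since these classifiers expect 1-D feature vectors rather than image tensors.
x_train_flat = x_train.reshape(len(x_train), -1) / 255.0
x_test_flat = x_test.reshape(len(x_test), -1) / 255.0

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(x_train_flat[:10000], y_train[:10000].ravel())   # train on a subset to keep it quick

y_pred = rf.predict(x_test_flat)
print("Test accuracy:", accuracy_score(y_test.ravel(), y_pred))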
