Video Captioning Using Neural Networks
https://doi.org/10.22214/ijraset.2022.42506
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC VALUE: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue V May 2022- Available at www.ijraset.com
Abstract: Researchers in the fields of computer vision and natural language processing have in recent years concentrated their efforts on automatically generating natural language descriptions for videos. Although video comprehension has a variety of applications, such as video retrieval and indexing, video captioning remains a difficult problem because of the complex and diverse nature of video content. Understanding the relationship between video content and natural language sentences is still a work in progress, and several approaches for improved video analysis are being developed. Owing to their superior performance and high-speed computing capabilities, deep learning approaches have become the focus of video processing research. This work presents an end-to-end deep learning based encoder-decoder network for creating natural language descriptions of video sequences. The use of a CNN-RNN model paired with beam search to generate captions for the MSVD dataset is explored, and the results of the beam search and greedy search approaches are compared.
The captions generated by this model are often grammatically incorrect, so our paper also focuses on correcting these grammatical errors using an encoder-decoder model. Grammatical errors include spelling mistakes; incorrect use of articles, prepositions, pronouns, and nouns; and poor sentence construction. Using beam search with k=3, the captions generated by our algorithm achieve a BLEU score of 0.72. After passing the generated captions through a grammar error correction mechanism, the BLEU score improves to 0.76, an increase of 5.55%. The BLEU score decreases as the value of k decreases, but the time it takes to generate captions decreases as well.
Index Terms: Video captioning, end-to-end structure, MSVD dataset, encoder-decoder model, beam search, grammar correction
I. INTRODUCTION
Text, music, images, video, and other forms of multimedia are all part of today's digital content. With the spread of sensor-rich mobile devices, video has become a new means of communication between Internet users. Video data is generated, published, and distributed rapidly as a result of the significant rise in Internet speed and storage capacity, and it has become an integral part of today's big data. This has led to improved methodologies for a wide range of video comprehension applications, including online advertising, video retrieval, and video surveillance. Understanding video content is a critical challenge that underpins the success of these technological advancements. In this paper, we focus on video captioning using deep neural networks. The goal of video captioning is to go beyond the single label of video classification and produce a complete, natural sentence that captures the most informative dynamics in a video. For video captioning, there are two basic approaches: template-based language models and sequence learning models (e.g., RNNs). Template-based models take structured data as input and generate natural language sentences that describe the data in a human-like manner; their main drawback is that they predefine a rigid sentence structure. This first approach divides the sentence into different sections and predefines the rules of language grammar (e.g., subject, verb, object); many works match each portion with words recognised from the visual content by object recognition and then create a sentence from such fragments under linguistic constraints. Sequence learning methods, in contrast, learn a direct mapping between video content and text. Hindi-language captions are also being generated by translating the English captions. In our project we use a sequence learning based end-to-end encoder-decoder model.
This model has two parts:
1) Encoder: The video features are encoded into an array. In our project, we take 80 frames from each video and extract 4096 features per frame using the pre-trained VGG16[10] CNN model, which yields an array of dimensions 80 × 4096.
2) Decoder: The hidden features generated by the encoder are fed into the decoder, along with the input captions, to generate video captions. Because this is a sequential task, an LSTM chain is used.
In our project, a CNN-RNN[9] model is used with beam search to create video captions. The data is first pre-processed, and the features are then extracted from the videos using a pre-trained VGG16 CNN model: each video is cut into 80 frames, which are run through the VGG16 model. The training model is then built using LSTMs.
The problem with this work is that using an existing dataset to train an action classifier limits the number of action classes; it may be better to localise the actions in an unsupervised manner. It is also hard to tell, with the existing pipeline, whether 'a man' appearing in two different frames refers to the same person, so the present system would have to be combined with a facial recognition system.
Lee et al. [7] capture long-range dependencies in video captioning. In this paper, the temporal capacity of a video captioning network with a non-local block is examined. It provides a non-local block video captioning method for capturing long-range temporal dependencies. Local and non-local features are computed separately in the suggested model, and both types of features are used. The method is evaluated on the Microsoft Video Description corpus (MSVD, also known as YouTube2Text). The results suggest that, on video captioning datasets, a non-local block applied along the temporal axis helps alleviate the LSTM's long-range dependency problem. The primary limitation of this paper is that the sentences produced by the model are not optimised; during training, reinforcement learning could be used to directly optimise the phrases generated by the model.
Xiao et al. [8] achieve video captioning with a Temporal Graph Network (TGN) and a Region Graph Network (RGN), proposing a novel video captioning paradigm. The temporal sequential information and the region object information of frames are encoded using a graph convolution network. TGN primarily focuses on leveraging the sequential information of frames, which is often ignored by existing approaches, while RGN was created to investigate the connections between salient objects: the prominent boxes of each frame are extracted using an object detection algorithm, and a region graph is then built based on their positions. The context features that are fed into the decoder are obtained using an attention mechanism.
Hoxha et al. [9] propose a new framework for image captioning that combines generation and retrieval. A new CNN-RNN framework has been developed for remote sensing image captioning: multiple captions for a target image are generated using a CNN-RNN architecture paired with beam search, and the best caption is then chosen based on its linguistic similarity to the reference captions of the most similar images. The results are generated on the RSICD dataset.
B. Data Collection
The MSVD[2] dataset released by Microsoft is used in this study. It includes 1970 short YouTube clips for training and 100 videos for testing, all hand-labelled. Each video has a unique ID, and each ID has approximately 15-20 captions. The dataset contains two directories, training_data and testing_data. Each directory has a video subfolder containing the videos used for training or testing, and a feat subfolder (short for "features") containing the extracted video features. There are also training_label and testing_label JSON files, which contain the captions for each ID.
The MSVD corpus was created from a collection of YouTube video clips, each of which records a single activity performed by one or more people, and every video segment is represented by a single sentence. We began by deleting any descriptions that were not in English or that contained misspelt phrases, and then converted the remaining descriptions to lower-case tokens. For unedited video, we sample an image every 10 frames.
In MSVD each video is divided into multiple segments, each corresponding to one or more captions. We split the dataset into two portions by video name, producing training and testing sets with an 8:2 ratio, so that segments from the same video (even when they capture different behaviours) do not appear in both the training and testing sets and bias our assessment scores.
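To make the split procedure concrete, a minimal sketch of an 8:2 split grouped by video name is given below; the pairs list and variable names are placeholders standing in for however the labels are actually loaded, not the project's code.

import random

# Placeholder (video_name, caption) pairs standing in for the loaded labels.
pairs = [("vid0001", "a man is running"),
         ("vid0001", "a person runs outside"),
         ("vid0002", "a woman is cooking")]

video_names = sorted({name for name, _ in pairs})
random.seed(42)                      # fixed seed so the split is reproducible
random.shuffle(video_names)

cut = int(0.8 * len(video_names))
train_videos = set(video_names[:cut])

# Every caption of a given video lands on the same side of the split, so clips
# from the same video never appear in both the training and testing sets.
train_pairs = [p for p in pairs if p[0] in train_videos]
test_pairs  = [p for p in pairs if p[0] not in train_videos]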
C. Feature Extraction
Each video in the dataset is treated as a series of image frames from which the video's features are extracted. All images are recovered frame by frame from the video stream. Depending on the duration of the video, the number of captured frames differs; for simplicity, only eighty frames from every video are kept. Each of the eighty frames is passed through a pre-trained VGG16 network, which yields 4096 attributes per frame. The VGG16 convolutional neural network design uses 3×3 convolutional filters with stride 1 and the same padding, together with 2×2 max-pooling with stride 2; the convolutional and max-pooling layers are arranged in the same manner throughout the architecture, and the network finishes with two fully connected layers and a softmax. Stacking these features forms an array of shape (80, 4096): 80 frames per video, each yielding 4096 features.
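As an illustration of this step, the following is a minimal sketch of frame sampling and VGG16 feature extraction, assuming TensorFlow/Keras and OpenCV; the helper name video_to_features and the evenly spaced frame selection are illustrative choices, not a transcription of the project's code.

import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Pre-trained VGG16; the 4096-dimensional 'fc2' layer output is used as the
# per-frame descriptor.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def video_to_features(video_path, num_frames=80):
    # Read every frame, convert to RGB and resize to the VGG16 input size.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    # Keep num_frames evenly spaced frames for simplicity.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    batch = preprocess_input(np.array([frames[i] for i in idx], dtype=np.float32))
    return extractor.predict(batch, verbose=0)      # array of shape (80, 4096)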
D. Preprocessing of Captions
The captions and video IDs are stored in two columns in the train_list. Each caption in the train list is prefixed with a 'bos' (beginning of sentence) token and suffixed with an 'eos' (end of sentence) token. The data is then divided into training and validation sets: the training list contains 85% of the data, while the validation list contains the remaining 15%.
Because we only tokenize the words in the training data, the vocabulary list contains only the captions from the training_list. After tokenizing, the sentences are padded so that they all have the same word length; in our project they are padded to ten words.
The longest caption in the entire dataset has 39 words, but the majority of captions have between 6 and 10 words. The captions are filtered to keep the dataset free of outliers, because otherwise we would need to pad the sentences with whitespace up to a length of 39 words, which can result in wrong predictions. If most sentences are 10 words long and we pad them to 39 words, there will be a lot of padding. These heavily padded sentences would be used for training, causing the model to predict padding tokens almost exclusively: because padding essentially amounts to adding white space, most of the sentences predicted by the model would simply have more blank spaces and fewer words, resulting in incomplete phrases.
For the captions, we only used the top 1500 words as our vocabulary, so any generated caption must be composed of words within this 1500-word limit. Although there are well over 1500 distinct terms, most of them appear only once, twice, or three times, making the vocabulary susceptible to outliers. As a result, we kept only the 1500 most frequently occurring words.
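A minimal sketch of this tokenization and padding step, using the standard Keras text utilities, is shown below; the example captions and variable names are placeholders rather than the project's data.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Placeholder training captions already wrapped with the 'bos'/'eos' markers.
train_captions = ["bos a man is playing a guitar eos",
                  "bos a woman is cooking eos"]

# Keep only the 1500 most frequent words, as described above.
tokenizer = Tokenizer(num_words=1500, oov_token="<unk>")
tokenizer.fit_on_texts(train_captions)

sequences = tokenizer.texts_to_sequences(train_captions)
# Pad (or truncate) every caption to a fixed length of ten tokens.
padded = pad_sequences(sequences, maxlen=10, padding="post")
print(padded.shape)    # (number_of_captions, 10)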
In Equations 3.1 and 3.2, φ is an element-wise non-linear function such as ReLU or the hyperbolic tangent, W_h and U_h denote the weight matrices, and b denotes the bias. o_t is the output at time t, h_t is the hidden state with M hidden units, and x_t is the input.
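For reference, the conventional vanilla-RNN recurrence that these symbols describe can be written as follows; this is the standard form, and the output projection weights W_o and b_o are introduced here only for illustration and may be named differently in the original equations.

\begin{align}
h_t &= \phi\!\left(W_h x_t + U_h h_{t-1} + b\right) \tag{3.1}\\
o_t &= \phi\!\left(W_o h_t + b_o\right) \tag{3.2}
\end{align}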
Traditional RNNs have been demonstrated to be successful in text generation and speech recognition. Long-range temporal associations in videos, on the other hand, are difficult to model because of the exploding and vanishing gradient problem. Hochreiter and Schmidhuber[17] devised the LSTM network, which has been shown to successfully prevent gradient vanishing and explosion during backpropagation through time (BPTT).
This is due to the network's memory blocks, which allow it to acquire long temporal relationships, forget the formerly stored state, and update it with fresh content. The LSTM, as implemented in our framework, consists of several controlling gates and a storage cell. At every time step t, let x_t, c_t, and h_t denote the input, cell memory, and hidden state, respectively. Given an input sequence (x_1, ..., x_T), the LSTM computes the hidden state sequence (h_1, ..., h_T) and the cell memory sequence (c_1, ..., c_T).
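For completeness, the standard LSTM update that this description corresponds to is reproduced below in its conventional form; the gate weight names follow the usual convention and are not taken from the paper's own equations.

\begin{align}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell memory)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align}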
a) Encoder Vector
This is the model's final hidden state, produced by the encoder and calculated using Equation 3.5. This vector seeks to encapsulate the information of all input elements in order to help the decoder make correct predictions. It serves as the initial hidden state of the decoder.
b) Decoder
The decoder is a stack of recurrent units, each of which predicts an output y_t at time step t. Each recurrent unit takes the hidden state from the preceding unit as input and produces an output as well as its own hidden state. The output sequence forms the caption generated for the frames. The output y_t at time step t is calculated as

y_t = softmax(W_s h_t)   (3.6)

where the hidden state at the current time step is combined with the weight matrix W_s, and the softmax produces a probability vector from which the final predicted word is obtained.
Initially, the first frame's features are supplied to the encoder's first LSTM cell, followed by the second frame's features, and so on up to the 80th frame. Because we are only interested in the encoder's end state, all of the encoder's other outputs are ignored. The final state of the encoder is used as the decoder's initial state. The decoder's first LSTM cell receives the 'bos' token as input to begin the sentence, and each word of the caption from the training data is then fed in one at a time until 'eos' is reached.
The encoder's number of time steps is equal to the number of LSTM cells we chose for the encoder, which is 80, and the number of features per encoder token is 4096 in our case. The decoder's number of time steps is equal to the number of LSTM cells in the decoder, which is ten, and its number of tokens is equal to the vocabulary length, which is 1500.
The final model follows this encoder-decoder layout; a minimal sketch of its construction is given below.
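The following Keras sketch uses the dimensions stated in the text above (80 × 4096 encoder inputs, a 10-step decoder, and a 1500-word vocabulary); the hidden size of 512 and the training configuration are assumptions, not values reported in this paper.

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

hidden_size = 512            # assumed; the text does not state the hidden size

# Encoder: 80 time steps, 4096 VGG16 features per step; only its final state is kept.
encoder_inputs = Input(shape=(80, 4096))
_, state_h, state_c = LSTM(hidden_size, return_state=True)(encoder_inputs)

# Decoder: 10 time steps over one-hot caption tokens from a 1500-word vocabulary,
# initialised with the encoder's final state.
decoder_inputs = Input(shape=(10, 1500))
decoder_seq = LSTM(hidden_size, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(1500, activation="softmax")(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()

During training the decoder input is the reference caption shifted by one position (teacher forcing), so that at each step the network predicts the next word of the caption.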
We tested our model by training it for different numbers of epochs, i.e., 25, 50, 75, and 150.
H. Generation of Captions
The candidate words from the vocabulary are scored on the basis of their likelihood. Two approaches are used to predict the most probable next word: greedy search and beam search.
1) Greedy Search: After analyzing and extracting the features, we use the decoder to estimate the probability of occurrence of consecutive terms of the video caption from the vocabulary. In the decoder component we implemented a greedy search algorithm to estimate term probabilities efficiently. The main goal of this approach is to select, at each step, the term with the maximum conditional probability given the terms chosen so far. The algorithm terminates when the sequence of terms selected by greedy search reaches a length limit, say 'L'; the resulting output sequence is assigned as the video caption for the given input encoded features and vocabulary.
2) Beam Search Decoder: Beam search explores multiple candidate sequences in parallel, unlike greedy search, which keeps only the single most probable option at each step. The number of candidates is limited by a parameter 'k': the algorithm performs k parallel searches and then comparatively analyses the resulting candidate sequences. We experimented with k values of 2 and 3 and observed that, as the value of 'k' increases, the time complexity grows rapidly. A sketch of this decoding step is given after this list.
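The following is an illustrative beam-search decoder written over a generic per-step probability function; step_probs, the token-id arguments, and the stopping rule are hypothetical stand-ins for the trained decoder, so this is a sketch of the technique rather than the project's implementation.

import numpy as np

def beam_search(step_probs, bos_id, eos_id, k=3, max_len=10):
    # Each beam is a (token_ids, cumulative log-probability) pair.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:              # finished captions are carried over
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)            # next-word distribution, shape (vocab,)
            top_ids = np.argsort(probs)[-k:]   # k most probable next words
            for wid in top_ids:
                candidates.append((seq + [int(wid)],
                                   score + float(np.log(probs[wid] + 1e-12))))
        # Keep only the k best partial captions at every step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]                         # highest-scoring token sequence

With k=1 this procedure reduces to greedy search, which matches the observation that larger beam widths are slower but tend to produce more accurate captions.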
VII. RESULTS
We trained our model using the LSTM and noted the loss and accuracy for each epoch.
The training model loss and accuracy for the training and testing datasets are shown in Figures 7.1 and 7.2, respectively. The curves in these graphs approach a constant value, which is why training is stopped after approximately 78 epochs. To generate the most accurate captions for a video, we combined the encoder-decoder architecture with beam search. In comparison to greedy search, beam search produces more accurate results; however, it takes longer to process. Greedy search can be used to obtain faster results for real-time prediction.
Table 7.1: Sample Results using beam search
The captions created using beam search for k=2 and k=3 are shown in Table 7.1. The captions created with k=2 are less accurate than those created with k=3; in the first example, the k=2 caption is less grammatically correct than the k=3 caption. Example 3 in Table 7.1 shows a case of incorrect caption generation: the right caption for that video would have been "Men are running", but "Men are playing soccer" and "Men are playing a ball" were generated instead. This occurs because the accuracy of our model suffers when the main object in the frame is small; in the first two examples the significant object is large, and correct captions are generated.
In Table 7.2 we compare the captions generated by the grammar error correction network combined with beam search against the captions generated without the grammar correction network. The BLEU score of the captions generated using greedy search is lower than that of the captions generated using beam search with k=2 and k=3.
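For reference, BLEU scores of the kind reported here can be computed with NLTK as sketched below; the candidate and reference token lists are placeholders, not outputs of our model.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "playing", "a", "guitar"]]   # tokenised ground truths
candidate = ["a", "man", "is", "playing", "guitar"]           # generated caption

smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(round(score, 2))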
VIII. CONCLUSION
We concentrated on creating captions for the video dataset in this project. Initially, we focused on developing and training our model using the LSTM encoder: the video dataset's features were extracted, the captions were analysed, and the model was built. To create the captions, we employed several search algorithms and analysed their correctness. We found that beam search is more accurate than greedy search, and that accuracy grows as the beam width increases, although the time it takes to generate a caption also increases with the number of beams. We built a Grammar Error Correction network, based on an encoder-decoder architecture, to correct the grammatical errors in the generated captions. The BLEU score of the generated captions increases after they are passed through the Grammar Error Correction network; our methodology results in a 5.55% boost in the quality of generated captions. Furthermore, we have found that as the number of beams in the beam search rises, the time it takes to generate captions climbs exponentially, while the model's results become more accurate.
REFERENCES
[1] Yang Yang, Jiangbo Ai, Yi Bin, Alan Hanjalic, "Video Captioning by Adversarial LSTM", IEEE Transactions on Image Processing, 2019.
[2] Bin Zhao, Xuelong Li, Xiaoqiang Lu, "CAM-RNN: Co-Attention Model Based RNN for Video Captioning", IEEE Transactions on Image Processing, Vol. 28, No. 11, November 2019.
[3] Chien-Yao Wang, Pei-Sin Liaw, Kai-Wen Liang, Jai-Ching Wang, Pao-Chi Chang, "Video Captioning Based On Joint Image–Audio Deep Learning Techniques", 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin).
[4] Soichiro Oura, Tetsu Matsukawa, Einoshin Suzuki, "Multimodal Deep Neural Network with Image Sequence Features for Video Captioning", 2018 International Joint Conference on Neural Networks (IJCNN).
[5] Thang Nguyen, Shagan Sah, Raymond Ptucha, "Multistream Hierarchical Boundary Network For Video Captioning", 2017 IEEE Western New York Image and Signal Processing Workshop (WNYISPW).
[6] Andrew Shin, Katsunori Ohnishi, Tatsuya Harada, "Beyond Caption To Narrative: Video Captioning With Multiple Sentences", 2016 IEEE International Conference on Image Processing (ICIP).
[7] Jaeyoung Lee, Yekang Lee, Sihyeon Seong, Kyungsu Kim, Sungjin Kim, Junmo Kim, "Capturing Long-range Dependencies In Video Captioning", 2019 IEEE International Conference on Image Processing (ICIP).
[8] Xinlong Xiao, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao, Weiguo Fan, "Video Captioning With Temporal And Region Graph Convolution Network", 2020 IEEE International Conference on Multimedia and Expo (ICME).
[9] Genc Hoxha, Farid Melgani, Jacopo Slaghenauffi, "A New CNN RNN Framework For Remote Sensing Image Captioning", 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS).
[10] Jiahui Tao, Yuehan Gu, JiaZheng Sun, Yuxuan Bie, Hui Wang, "Research on VGG16 Convolutional Neural Network Feature Classification Algorithm Based on Transfer Learning", 2021 2nd China International SAR Symposium (CISS).