AI Optics: Object Recognition and Caption Generation For Blinds Using Deep Learning Methodologies
Abstract—With the exponential development in the field of artificial intelligence in recent years, many researchers have focused their attention on the topic of image caption generation. Because the topic is both arduous and interesting, people take it up as a challenge in order to excel in the field of AI. Automatic generation of natural language descriptions, or 'captions', according to the composition detected in an image, i.e., scene understanding, is the main part of image caption generation, and it can be achieved by combining natural language processing with computer vision. In this paper, we tackle the task of generating captions by using the concepts of Deep Learning.

Keywords—Artificial Intelligence, Deep Learning, RNN, CNN, LSTM.

I. INTRODUCTION

Millions of people around the world face the major disability of visual impairment. Vision provides the information needed for reading, body movement and mobility, and its loss can severely affect an individual's professional and social advancement. The World Health Organization (WHO) reported that out of 1.3 billion people who suffer from one or another form of visual impairment, 36 million suffer from complete blindness [1].

Problems are often faced by people with impaired vision or complete blindness once they are out of their familiarized environments. Corporeal movement is one of the major issues for people suffering from impaired vision [2]. They are also unable to recognize an object without physically feeling it and cannot savor the beauty of nature. Many assistive devices have been made commercially available for the visually impaired community to help them read and recognize objects, enhancing their experience [3].

Various research works are still being carried out for the visually impaired community. A thorough analysis of a few papers has been done to understand the ongoing work and technology. A system worn in a shoe was proposed by K. Patil et al. that contains ultrasonic sensors on all sides along with a vibration sensor, a liquid detector and a down-step sensor [4]. Other works include an Android application for navigation, weather forecasting and news reading using speech recognition and artificial intelligence. Electronic intelligent eye, a device developed to implement a range finder and camera for obstacle detection and navigation, also used a solar panel for charging.

Computer Vision can be used to implement purposeful navigation and object detection for developing a technology for visual aid. Purposeful navigation refers to guided movement through free space to reach the desired location while preventing collisions with obstacles [5]. The major challenge is to fast-forward the results from the sensors to the processing algorithm and further to the accessible device.

In this paper we propose an end-to-end accessible solution that provides purposeful acknowledgement and guidance using object recognition and caption generation, enabling video-to-audio aid for the visually impaired community. The aim of purposeful acknowledgement and guidance is to extract the range and direction of the obstacles within a finite and defined free space captured by the camera of the device [6]. The results of the object detection and recognition algorithm are fed to the caption generation algorithm, which explains the surrounding scene to the visually impaired in audio format. The same algorithm is also responsible for providing guidance support to the blind. The paper specifically contributes the following:

1. A real-time algorithm for mapping motion in free space using object detection.

2. A modified and explored version of a general caption generator for feedback about the surroundings.

3. A capable system for providing guidance support through free space along with prevention from harmful and specific objects like fire, heavy-traffic roads, pointed ends, etc.

Automatic generation of a caption for an image is itself a big hurdle in artificial intelligence that involves connecting computer vision with natural language processing [7]. But a solution to this problem could provide a better understanding of the outside world for visually impaired people. The task involves object detection and classification with high accuracy and accessibility, with a large degree of flexibility in inputs, along with priorities for various situations [8].
Previous studies have majorly focused on stitching together solutions of the sub-problems to form a larger solution and have, therefore, failed to provide an appropriate description of the image [9]. In this paper, we propose a neural-network-based probabilistic model for the generation of captions for images, using a combination of a recurrent neural network (RNN) and a deep convolutional neural network (DCNN) along with advanced statistical machine translation to obtain higher accuracy [10].

The results from the caption generator are fed as input to the guidance support system to map the purposeful motion of the user in free space, using a human body motion and mapping algorithm developed through modifications of a CNN and statistical distance mapping done in Python. The collective results of the proposed models are rendered to a text-to-speech conversion algorithm in Python to provide the final output accessibility to the visually impaired.
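As a concrete illustration of this last step, the sketch below renders a generated caption to speech in Python. The paper does not name a specific text-to-speech library, so pyttsx3 is used here purely as an assumed example of the conversion stage.

```python
# Hypothetical sketch of the final text-to-speech step; pyttsx3 is an assumed
# choice of library, not one specified in the paper.
import pyttsx3

def speak_caption(caption: str) -> None:
    engine = pyttsx3.init()   # initialise an offline TTS engine
    engine.say(caption)       # queue the caption text for speech
    engine.runAndWait()       # block until the audio has finished playing

if __name__ == "__main__":
    speak_caption("black dog is running in the water")
```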
II. RELATED WORK

Computer vision technology has been used for a long time to produce descriptions of visual data in natural language [11,12]. Various types of systems have been developed for this purpose; in one such type of system, structured formal language is joined with a compound system. Systems of this type are not very reliable, as they are largely hand-crafted, cover very few domains and are useful only in specific realms [13].

Recently, object detection (or image detection) has gained a lot of popularity and the interest of a large number of people. There are various advances in the domain which aid natural language generation from detections, but they are restricted in their outcome. Li et al. [15] start from detections and combine them together to form a final output consisting of the detected objects and their relationships. Similarly, Farhadi et al. [14] used detections to produce a triplet for the image and converted it into text phrases with the help of templates. Larger models based on language parsing have also been utilized. The above methods have proved useful in various conditions, but one issue that remains is that they are highly hand-designed and rigid when used for text generation.

Various approaches have also been framed around the problem of ranking descriptions [16,17,18]. In these kinds of methods, the approach is to embed text and images in the same vector space; hence, when an image query is passed, those descriptions are fetched which lie nearest to the image in the vector space. This approach cannot be used to describe new compositions of objects, even if the individual objects have been observed in the training data [19].

The latest image description methods can also be viewed as caption generation based on language modelling with recurrent neural networks (RNNs) [25,26,27]. The basic RNN model is a language model whose basic function is to capture the probability of producing a string from the words generated so far [20]. Here the RNN is used to generate the next term of the string conditioned not only on the previous words but also on a set of image features; so the RNN is a hybrid model that relies on both linguistic and visual features.

Kiros et al. [21] proposed a multimodal log-bilinear model that is biased by the features of the image. The method was later improved to allow a natural process of both generation and ranking. Donahue et al. applied LSTMs (Long Short-Term Memory networks) to video to generate video captions. An LSTM is an artificial recurrent neural network used in deep learning; it has feedback connections and is able to process single data points as well as entire sequences of data [22,23].

Fig. 1. Hybrid model overview design.

Fang et al. described a three-step pipeline for generation that incorporates object detection. The first step of their model is to learn detectors for various visual concepts; a trained language model is then applied to the detector results, along with an image-text embedding space.

In our work we combine image classification with deep convolutional networks and recurrent networks to develop a single model that produces a description (caption) of the image. The model is motivated by sequence-to-sequence generation, where instead of a source sentence an image, encoded by the CNN, is provided. A recent work by Mao et al. [24] used a neural network for the same idea and outcome. Our method is somewhat similar to Mao's approach, with significant differences: we use a more impactful RNN model and the image is provided directly to the model. As a result, our system obtains a better result. We then provide a multimodal embedding space with an RNN and LSTM that is used to remember the text; hence there are two separate pathways, one for the image and the other for the text, which construe a joint embedding to produce the speech outcome.

III. BACKGROUND: TYPES OF ARCHITECTURE

We first draw a prominent distinction between architectures that integrate linguistic and image attributes in a multimodal layer and those which inject the attributes of the image straight into the caption-prefix encoding process.

We can also differentiate four possibilities emerging from these architectures, as depicted in the figures and briefly described as follows:

A. Init-Inject Technique

The initial state vector of the RNN is set to be the image vector (or a vector extracted from the image vector). This requires the image vector to have the same size as the RNN's hidden state vector. This is a static binding framework, which also enables the image representation to be altered by the RNN.
Fig. 2. Multiple techniques of constraining a neural language model alongside an image.

B. Pre-Inject Technique

The RNN takes two inputs: the first is the image vector (or a vector extracted from it), and the second input, the word vector, comes into play later. Therefore, in the prefix the image vector is treated as the primary (first) word. The magnitude of both inputs, the image vector and the word vectors, must be equal. This is also a static binding framework and enables the image representation to be altered by the RNN.

C. Par-Inject Technique

The image vector (or a vector extracted from it) and the word vector of the caption prefix both simultaneously serve as inputs to the RNN, in one of two ways:

a) Both inputs are combined to form a single input (the image vector is combined with the word vector that is forwarded to the RNN).

b) The RNN handles two discrete inputs.

As in the previous possibilities, the image vector and the word vectors of the caption prefix need to be the same size, but in this case it is not required that every word vector has a corresponding image vector, nor that every image vector be identical. Therefore this is not a static binding architecture but rather a mixed one, unlike the previous cases. A little variation in the representation of the image is also allowed, as it would be quite a task for the RNN if the image fed to it were exactly the same at every time step while its hidden state vector is refreshed with the same image every single time.

D. Merge Technique

The image vector (or a vector derived from the image) is not exposed to the RNN at any instance. Instead, the image is introduced into the language model after the prefix has been encoded by the RNN. This is an example of a late binding architecture [28].

We also do not need to alter the image representation at every time step. Given these variations or possibilities, we have to consider a selection process among the above options.
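To make the contrast between early and late binding concrete, the sketch below builds two toy Keras models: one that injects the image vector as the RNN's initial state (init-inject) and one that keeps the image out of the RNN and merges it afterwards (merge). The layer sizes, vocabulary size and sequence length are illustrative assumptions, not values prescribed by this section.

```python
# Illustrative sketch only: init-inject vs. merge conditioning of a caption prefix
# on an image vector. Sizes (4096-d image features, 256 units, vocab/length) are assumed.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34

image_in = Input(shape=(4096,), name="image_features")
image_vec = Dense(256, activation="relu")(image_in)        # project image to the RNN size

prefix_in = Input(shape=(max_len,), name="caption_prefix")
prefix_emb = Embedding(vocab_size, 256, mask_zero=True)(prefix_in)

# Init-inject (early binding): the image vector becomes the LSTM's initial state,
# so the RNN can alter the image representation while encoding the prefix.
rnn_init = LSTM(256)(prefix_emb, initial_state=[image_vec, image_vec])
init_inject = Model([image_in, prefix_in],
                    Dense(vocab_size, activation="softmax")(rnn_init))

# Merge (late binding): the RNN never sees the image; image and prefix encodings
# are only combined in the layers that follow the RNN.
rnn_out = LSTM(256)(prefix_emb)
merged = Dense(256, activation="relu")(add([image_vec, rnn_out]))
merge_model = Model([image_in, prefix_in],
                    Dense(vocab_size, activation="softmax")(merged))
```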
IV. PROCEDURE

A. Prepare Photo And Text Data

For the following experiment we have used the Flickr8k dataset, which consists of two parts: 'Flickr8k_Dataset', which contains 8,092 photographs in JPEG ('.jpg') format, and 'Flickr8k_text', which contains a number of text (.txt) files containing various sources of raw descriptions for the given photographs.

The Flickr8k dataset is separated into three sections:

a) For training purposes, we are provided with 6,000 images,

b) For testing purposes, we are provided with 1,000 images,

c) And for validation purposes, we are provided with 1,000 images.

There are five different captions for each image.
Using the VGG (Visual Geometry Group) class, we load the VGG model in Keras. We are interested in the photo's internal representation produced prior to classification, not in the classification of the images themselves. From the pre-trained VGG CNN we extract the 4,096-element image feature vectors, which are also available in the distributed datasets. During pre-processing these image vectors are normalized to unit length.
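A minimal sketch of this feature-extraction step is shown below, assuming the Keras VGG16 application with ImageNet weights and a local 'Flickr8k_Dataset' directory (the directory name and the exact VGG variant are assumptions).

```python
# Sketch: load VGG16 in Keras, drop the classification layer, and keep the
# 4096-element fc2 activations as the photo representation, normalized to unit length.
import os
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16()                                            # pre-trained on ImageNet
extractor = Model(base.inputs, base.layers[-2].output)    # output of the 4096-d fc2 layer

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    feat = extractor.predict(x, verbose=0)[0]
    return feat / np.linalg.norm(feat)                    # normalize to unit length

features = {}
for name in os.listdir("Flickr8k_Dataset"):               # directory name is an assumption
    features[os.path.splitext(name)[0]] = extract_features(os.path.join("Flickr8k_Dataset", name))
```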
There is a unique identifier for each photograph which maps to a list of one or more textual descriptions. These description texts need to be cleaned. The descriptions are easy to work with and already tokenized. Finally, we can summarize the size of the vocabulary once we have cleaned the texts.
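The cleaning step could look roughly as follows; the dictionary of raw descriptions (photo identifier mapped to caption strings) is assumed to have been parsed from the Flickr8k_text files, and the single entry shown is illustrative.

```python
# Sketch of description cleaning: lowercase, strip punctuation and non-alphabetic
# or single-character tokens, then summarize the vocabulary of the cleaned texts.
import string

def clean_descriptions(descriptions):
    table = str.maketrans("", "", string.punctuation)
    for caption_list in descriptions.values():
        for i, caption in enumerate(caption_list):
            tokens = caption.lower().translate(table).split()
            caption_list[i] = " ".join(w for w in tokens if len(w) > 1 and w.isalpha())

def vocabulary(descriptions):
    vocab = set()
    for caption_list in descriptions.values():
        for caption in caption_list:
            vocab.update(caption.split())
    return vocab

descriptions = {"example_photo": ["A child in a pink dress is climbing up a set of stairs."]}
clean_descriptions(descriptions)
print(descriptions, "vocabulary size:", len(vocabulary(descriptions)))
```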
B. Develop Deep Learning Model

This section is divided into the following parts: a) loading the datasets, and b) defining the caption generation model.

1) Loading the Datasets: All of the photographs of the training dataset, along with their captions, will be used to train the model. We can extract the photo identifiers from the file names; these identifiers are used to filter out the descriptions and photos for each set. The model generates a caption for a photograph passed as input, given the sequence of previously generated words as the other input, i.e., the caption is generated one word at a time. In the final step we remove the 'startseq' and 'endseq' tokens, and we have the base of our automatic caption generation model. For example, for "black dog is running in the water" as the input sequence we would have 8 input-output pairs for training the model:
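A short sketch of this expansion is given below; the caption is wrapped in the 'startseq' and 'endseq' tokens, so the seven-word example yields exactly eight prefix-to-next-word pairs (integer encoding, padding and the photo feature vector are omitted here).

```python
# Sketch: expanding one caption into (prefix -> next word) training pairs.
caption = "startseq black dog is running in the water endseq"
tokens = caption.split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
print(len(pairs), "input-output pairs")   # prints 8
```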
2) Defining the Caption Generation Model: The model is defined in three parts. a) Photo Feature Extractor: the 4,096-element feature vector extracted by the pre-trained VGG model. b) Sequence Processor: a word embedding layer that handles the text input, which is followed by a special kind of RNN layer called the Long Short-Term Memory (LSTM) layer. c) Decoding Layer: we receive a fixed-length vector as output from both the sequence processor and the feature extractor; finally, both are integrated together and processed by a Dense layer to make the concluding prediction.

Fig. 4. Basic description of the model.

The model follows the merge architecture, in which the image vector is connected sequentially with the final LSTM state. The Photo Feature Extractor model expects an input of photographic features as a vector of 4,096 elements, which is further processed by a dense layer. This layer compresses the 4,096 elements used to represent the photograph down to 256 elements.

The Sequence Processor model expects an input sequence with a predefined length of 34 words. This sequence is fed into the Embedding layer, in which the padded values are masked, after which a Long Short-Term Memory layer with 256 memory units is attached.

A 256-element vector is produced by both of the input models. To reduce overfitting on the training dataset, a 50% dropout regularization is applied to both input models, which results in faster model configuration and learning.

The output vectors of both input models are merged in the Decoder model using an addition operation. The product of the addition operation is fed to a 256-neuron dense layer, followed by another dense layer for the final output. This final layer makes a recursive prediction of the next word in the sequence over the entire output vocabulary using a 'softmax' activation.
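A sketch of this merge model in Keras is shown below; the sizes follow the description above (4,096-element photo features, 34-word sequences, 256-unit layers, 50% dropout, addition merge, softmax output), while the vocabulary size is left as a parameter derived from the cleaned captions.

```python
# Sketch of the described merge model: photo-feature branch + sequence branch,
# merged by addition and decoded through dense layers to a softmax over the vocabulary.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length=34):
    # photo feature extractor branch: 4096 -> 256
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)
    # sequence processor branch: embedding (padded values masked) -> LSTM(256)
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder: merge by addition, dense layer, softmax over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    return Model(inputs=[inputs1, inputs2], outputs=outputs)
```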
The Adam optimization algorithm was used to train this model with default parameters and 50 captions as the mini-batch size, while summed cross-entropy was used as the cost function. An early stopping criterion was applied during training: after each training epoch the program measures the validation performance, and once the performance on the validation data begins to deteriorate, training is terminated.
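Continuing the model sketch above, the training setup could be expressed as follows. The random arrays are dummy stand-ins for the real prepared arrays of photo features, padded caption prefixes and one-hot next words; the vocabulary size and number of epochs are illustrative.

```python
# Sketch of training: Adam with default parameters, categorical cross-entropy,
# mini-batches of 50 captions, and early stopping on the validation loss.
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import to_categorical

vocab_size, max_length = 5000, 34
model = define_model(vocab_size, max_length)              # from the previous sketch
model.compile(loss="categorical_crossentropy", optimizer="adam")

def dummy_split(n):                                       # stand-in for the real data arrays
    X1 = np.random.rand(n, 4096)
    X2 = np.random.randint(1, vocab_size, (n, max_length))
    y = to_categorical(np.random.randint(1, vocab_size, n), num_classes=vocab_size)
    return X1, X2, y

X1t, X2t, yt = dummy_split(200)
X1v, X2v, yv = dummy_split(50)

callbacks = [EarlyStopping(monitor="val_loss", patience=1),       # stop once val loss worsens
             ModelCheckpoint("model.h5", monitor="val_loss", save_best_only=True)]
model.fit([X1t, X2t], yt, validation_data=([X1v, X2v], yv),
          epochs=20, batch_size=50, callbacks=callbacks)
```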
V. RESULT

A. Evaluate the Model

We generated descriptions for all of the photographs present in the Flickr8k testing dataset and then evaluated our model by comparing these predictions with the expected descriptions using a standard metric. We evaluate the actual and the generated descriptions by summarizing how closely the predicted text resembles the expected text. We accomplish this task using the corpus BLEU score, which is used in text translation, i.e., for the evaluation of generated translated text against a set of reference translations. To evaluate the skill of a model, we calculate the BLEU scores for the cumulative 1-, 2-, 3- and 4-grams. We have some ball-park BLEU scores as a reference for skillful models when evaluated on the test dataset used in our experiment:

Fig. 6. BLEU score range of a good model.

The following are the BLEU scores obtained by our model:

Fig. 7. BLEU score values of the model created.

We can see that the scores fit within the expected range of an appropriate model on the specified query, and close to the top of that range.
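A sketch of this corpus-level evaluation with NLTK is given below; `actual` holds the tokenized reference captions and `predicted` the generated ones for the test photographs, with the two entries shown being illustrative only.

```python
# Sketch: cumulative 1- to 4-gram corpus BLEU, as used to evaluate the generated captions.
from nltk.translate.bleu_score import corpus_bleu

actual = [[["black", "dog", "is", "running", "in", "the", "water"]]]   # references per photo
predicted = [["dog", "is", "running", "in", "the", "water"]]           # generated caption

print("BLEU-1: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3: %f" % corpus_bleu(actual, predicted, weights=(0.33, 0.33, 0.33, 0)))
print("BLEU-4: %f" % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
```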
B. Generate Captions

We load the photograph for which we want to generate the caption and extract its features. We achieved this by implementing the VGG-16 model after redefining our existing model; alternatively, we can predict the features using the VGG model and provide them to the existing model as input.

Fig. 8. Input images provided to the model.

But how do we generate a caption using our trained model? Initially we pass 'startseq', i.e., the starting description token, generate one word, and then recursively call the model again and again, passing the generated words as input, until the maximum description length is reached or 'endseq', i.e., the end-of-sequence token, is generated. In the final step we remove the 'startseq' and 'endseq' tokens, and we have our caption for the photograph passed for caption generation.
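The generation loop described above can be sketched as follows; `model`, a fitted Keras `Tokenizer`, the 4,096-element photo feature vector and the maximum caption length are assumed to come from the earlier steps.

```python
# Sketch of greedy caption generation: start from 'startseq', repeatedly predict the
# next word and append it until 'endseq' or the maximum length, then strip the tokens.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length=34):
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = int(np.argmax(model.predict([photo_features.reshape(1, -1), seq], verbose=0)))
        word = index_word.get(yhat)
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()   # 'endseq' is never appended
```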
Fig. 9. Output captions generated by the model.

VI. CONCLUSION

Various kinds of models are available today for video captioning, image retrieval and image captioning, each with its own performance capability, and the test results show that this system achieves strong performance. The model focuses on three important criteria: first, the generation of complete natural language sentences; second, making the generated sentences semantically and grammatically correct; and third, making the caption consistent with the image.

REFERENCES

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proc. ICCV'15, pages 2425–2433, Santiago, Chile. IEEE.
[2] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72.
[3] Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 55:409–442.
[4] Xinlei Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In Proc. CVPR'15. Institute of Electrical and Electronics Engineers (IEEE), June.
[5] Manchanda, C., Rathi, R., & Sharma, N. (2019). Traffic Density Investigation & Road Accident Analysis in India using Deep Learning. 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). doi:10.1109/icccis48478.2019.8974528
[6] Grover, M., Verma, B., Sharma, N., & Kaushik, I. (2019). Traffic Control using V-2-V Based Method using Reinforcement Learning. 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). doi:10.1109/icccis48478.2019.8974540
[7] Harjani, M., Grover, M., Sharma, N., & Kaushik, I. (2019). Analysis of Various Machine Learning Algorithm for Cardiac Pulse Prediction. 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). doi:10.1109/icccis48478.2019.8974519
[8] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Institute of Electrical and Electronics Engineers (IEEE), June.
[9] Desmond Elliott and Frank Keller. 2013. Image Description using Visual Dependency Representations. In Proc. EMNLP'13, pages 1292–1302, Seattle, WA. Association for Computational Linguistics.
[10] Manchanda, C., Sharma, N., Rathi, R., Bhushan, B., & Grover, M. (2020). Neoteric Security and Privacy Sanctuary Technologies in Smart Cities. 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT). doi:10.1109/csnt48778.2020.9115780
[11] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.
[12] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8), 2010.
[13] Rustagi, A., Manchanda, C., & Sharma, N. (2020). IoE: A Boon & Threat to the Mankind. 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT). doi:10.1109/csnt48778.2020.9115748
[14] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[15] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In Conference on Computational Natural Language Learning, 2011.
[16] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
[17] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
[18] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[19] Sharma, N., Kaushik, I., Rathi, R., & Kumar, S. (2020). Evaluation of Accidental Death Records Using Hybrid Genetic Algorithm. SSRN Electronic Journal. doi:10.2139/ssrn.3563084
[20] Rathi, R., Sharma, N., Manchanda, C., Bhushan, B., & Grover, M. (2020). Security Challenges & Controls in Cyber Physical System. 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT). doi:10.1109/csnt48778.2020.9115778
[21] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.
[22] Rustagi, A., Manchanda, C., Sharma, N., & Kaushik, I. (2020). Depression Anatomy Using Combinational Deep Neural Network. Advances in Intelligent Systems and Computing, International Conference on Innovative Computing and Communications, 19–33. doi:10.1007/978-981-15-5148-2_3
[23] Grover, M., Sharma, N., Bhushan, B., Kaushik, I., & Khamparia, A. (2020). Malware Threat Analysis of IoT Devices Using Deep Learning Neural Network Methodologies. Security and Trust Issues in Internet of Things: Blockchain to the Rescue, 123.
[24] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[25] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[26] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[27] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[28] Denkowski, Michael and Lavie, Alon. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.