Smartphone-Based Image Captioning For Visually and Hearing Impaired
Electrical and Electronics Engineering Graduate Program, Izmir Katip Celebi University, Izmir, Turkey
[email protected], [email protected]
Abstract

Visually and hearing impaired people face difficulties due to inaccessible infrastructure and social challenges in daily life. To improve the quality of life of these people, we report a portable and user-friendly smartphone-based platform capable of generating captions and text descriptions, including a narrator option, from images obtained with a smartphone camera. Image captioning is the task of generating a sentence that describes the visual content of an image in natural language, and it has attracted an increasing amount of attention in the fields of computer vision and natural language processing due to its potential applications. Generating image captions with proper linguistic properties is a challenging task, as it requires combining advanced image understanding algorithms with natural language processing methods. In this study, we propose to use a Long Short-Term Memory (LSTM) model to generate a caption after images are processed with the VGG16 deep learning architecture. The visual attributes of images, which convey rich content, are extracted with the VGG16 and then fed into the LSTM model for caption generation. This system is integrated with our custom-designed Android application, named "Eye of Horus", which transfers images from the smartphone to a remote server via a cloud system and displays the captions after the images are processed with the proposed captioning approach. The results show that the integrated platform has great potential to be used for image captioning by visually and hearing impaired people, with advantages such as portability, simple operation and rapid response.

1. Introduction

…attributes extracted from images affect the overall captioning performance, which leads researchers to run sophisticated and complex methodologies on images. Deep learning is currently a popular method used in state-of-the-art studies, and various deep learning architectures have been reported in the literature, such as ZFNet [7], AlexNet [8], GoogLeNet [9] and VGGNet [10]. In this study, VGG16, a member of the VGGNet family, is employed due to the success of VGGNet over other architectures. On the other hand, researchers on the NLP side focus on better description of visual attributes in natural language and have proposed models such as "Nearest Neighbor" (NN) [11], "Recurrent neural network (RNN)" [12], "Random" [13], "1NNfc7" [14], "Human" [15], "Stanford" [16] and Long Short-Term Memory (LSTM) [17].

In this study, we propose to use the VGG16 deep learning architecture followed by the LSTM model to generate a caption. We show in our experiments that combining the VGG16 architecture and the LSTM model in this way improves the captioning performance significantly. Moreover, we develop a custom-designed Android application, named "Eye of Horus", capable of generating a caption for an image taken with the smartphone camera. Eye of Horus transmits images via a cloud system to a remote server, which runs our proposed image captioning approach. After a caption is generated on the remote server, it is sent back to the cloud, and Eye of Horus receives the caption and displays it on the screen. With the narrator option, the user can hear the caption.

The rest of this paper is organized as follows: the next section introduces the proposed approach for caption generation. Section 3 presents the dataset, Eye of Horus and a discussion of the results. Closing remarks are given in Section 4.
2. Proposed Approach

Figure 1: The architecture of VGG16 [10]

…tecture. The VGGNet is simply a convolutional neural network model, and VGG16 was found adequate for image classification in experimental studies. VGG16 comprises 16 weight layers: thirteen convolutional layers, two fully connected layers and one output layer with softmax activation. The convolutional layers are organized into five groups, each ending with a max-pooling layer. An illustration of the VGG16 architecture is given in Figure 1.

The outstanding performance of VGG16 over the previous generation of models such as AlexNet and GoogLeNet leads us to employ it in our captioning approach to extract the visual attributes.
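The attribute-extraction step can be sketched as follows. This is a minimal illustration rather than the authors' exact pipeline: it assumes the Keras implementation of VGG16 and takes the output of the second fully connected layer ("fc2") as a 4096-dimensional attribute vector; the paper does not state which layer or framework was used.

```python
# Minimal sketch: extracting visual attributes with a pre-trained VGG16 (Keras).
# Assumptions: TensorFlow/Keras and the 'fc2' layer as the feature output;
# the paper does not specify the exact layer or framework.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")                         # 16 weight layers, softmax output
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc2").output)  # drop the classifier head

def extract_attributes(img_path):
    """Return a 4096-dimensional visual attribute vector for one image."""
    img = image.load_img(img_path, target_size=(224, 224))  # VGG16 input size
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return extractor.predict(x)[0]

features = extract_attributes("sample.jpg")              # hypothetical image path
print(features.shape)                                     # (4096,)
```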
LSTM networks are recurrent neural networks with a specific gating mechanism that controls access to memory cells [17]. Since the gates can prevent the rest of the network from changing the contents of the memory cells over many time steps, the LSTM preserves signals and propagates errors for a much longer period than ordinary recurrent neural networks. The gates can also learn to attend to certain sections of the input signals and to ignore other sections by reading, writing and deleting content from the memory cells. These features allow LSTM networks to process data with complex, long-range dependencies, enabling, for example, speech recognition [18], offline handwriting recognition [19], machine translation [20] and image captioning [13, 21]. Thus, we follow the LSTM architecture to generate the captions from the visual attributes with rich semantic content.
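To make the coupling of visual attributes and the LSTM concrete, the sketch below shows one common "merge" arrangement in Keras, in which the VGG16 attribute vector and the partial caption generated so far are combined to predict the next word. The vocabulary size, maximum caption length and layer widths are placeholder assumptions; the paper does not report its exact decoder configuration.

```python
# Minimal sketch of a CNN-feature + LSTM caption decoder (Keras functional API).
# VOCAB_SIZE, MAX_LEN and the layer widths are placeholders, not the paper's values.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000      # assumed vocabulary size
MAX_LEN = 34           # assumed maximum caption length (in tokens)

# Branch 1: the 4096-d VGG16 visual attributes, projected to the LSTM width.
img_input = Input(shape=(4096,))
img_embed = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Branch 2: the partial caption generated so far, embedded and encoded by an LSTM.
txt_input = Input(shape=(MAX_LEN,))
txt_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_input)
txt_state = LSTM(256)(Dropout(0.5)(txt_embed))

# Merge both branches and predict the next word of the caption.
merged = add([img_embed, txt_state])
output = Dense(VOCAB_SIZE, activation="softmax")(Dense(256, activation="relu")(merged))

caption_model = Model(inputs=[img_input, txt_input], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
caption_model.summary()
```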
3. Experimental Results

In the previous section, the VGG16 and LSTM models were introduced. In this section, the dataset, the Android application and the results are discussed.

3.1. Dataset

Flickr [22] and MSCOCO [23] are the datasets most commonly used in image captioning. In this study, the MSCOCO dataset is chosen, as it contains approximately one hundred sixty thousand images, compared with thirty thousand in Flickr. Additionally, MSCOCO contains five reference captions per image. The number of images in the dataset and the number of reference captions per image are important, as they are used to train the overall system.
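As an illustration of how the five reference captions per image are organized, the following sketch pairs each MSCOCO image with its captions using the standard annotation file. The file path is a placeholder for a local copy of the dataset, not a path used in this work.

```python
# Minimal sketch: pairing each MSCOCO image with its five reference captions.
# The annotation file name follows the standard MSCOCO release; adjust the path
# to your local copy (placeholder only).
import json
from collections import defaultdict

with open("annotations/captions_train2014.json") as f:
    coco = json.load(f)

# Map image_id -> file name, then collect the reference captions per image.
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
captions = defaultdict(list)
for ann in coco["annotations"]:
    captions[id_to_file[ann["image_id"]]].append(ann["caption"])

# Each training example is an (image file, caption) pair; five per image.
pairs = [(name, cap) for name, caps in captions.items() for cap in caps]
print(len(id_to_file), "images,", len(pairs), "image-caption pairs")
```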
A sample image from the MSCOCO dataset is given in Figure 2. The reference captions entered for this image are as follows:

Figure 2: Sample image from MSCOCO

• A lady sitting at an enormous dining table with lots of food.
• A woman with eye glasses sitting at a table covered with food.
• Several plates of food on a dining table.
• A guest looks over the plates of fruit on the table.
• A woman standing near a table with plates covered in food.

3.2. Eye of Horus

Here, we demonstrate a portable smartphone-based platform for image captioning controlled by software, named Eye of Horus, developed in Android Studio. A simple and user-friendly interface is designed to provide simple operation for the visually and hearing impaired. Screenshots of the Eye of Horus app, given in Figure 3, present the flow of the running procedure.

Figure 3: Steps of image captioning with Eye of Horus
When the user runs Eye of Horus, a "tap me to select" page is displayed after the opening page. On this page, the user can choose an image from the gallery or capture a new image using the camera. After the image is selected, the user taps the "upload" button to send the image to the remote server via the Firebase cloud system. On the remote server, a script coded in Python downloads the image from Firebase and generates a caption. The generated caption is sent back to Eye of Horus via Firebase. Eye of Horus then displays the caption under the image together with a narrator button; if the user taps this button, the caption is read out loud.
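The server-side part of this flow might be implemented along the following lines. This is a hedged sketch only: it assumes the firebase_admin Python SDK, a Cloud Storage bucket for the uploaded images and a Realtime Database node for caption requests. The bucket name, database URL, node names and the generate_caption helper are hypothetical, as the paper does not document the exact Firebase layout or script.

```python
# Sketch of a server-side captioning loop, assuming the firebase_admin SDK.
# Bucket name, database URL and node names are placeholders.
import time
import firebase_admin
from firebase_admin import credentials, db, storage

cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred, {
    "storageBucket": "eye-of-horus.appspot.com",            # placeholder bucket
    "databaseURL": "https://eye-of-horus.firebaseio.com",   # placeholder URL
})

def caption_pending_images():
    """Download newly uploaded images, caption them, and write the result back."""
    requests = db.reference("requests").get() or {}
    bucket = storage.bucket()
    for key, req in requests.items():
        if req.get("caption"):
            continue                                     # already processed
        local_path = f"/tmp/{key}.jpg"
        bucket.blob(req["image"]).download_to_filename(local_path)
        features = extract_attributes(local_path)        # VGG16 sketch above
        caption = generate_caption(features)             # hypothetical LSTM decoding helper
        db.reference(f"requests/{key}/caption").set(caption)

while True:                                              # simple polling loop
    caption_pending_images()
    time.sleep(2)
```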
3.3. Results

In this study, the proposed approach was tested on the MSCOCO dataset. First, the pre-trained VGG16 model is used to extract the visual attributes of an image; these are then fed to the LSTM model to generate a caption. The flowchart of the overall system is illustrated in Figure 4.

Figure 4: Flowchart of overall system

The model is trained with two configurable parameters, the epoch and the batch size. An epoch represents one iteration over the entire training set, processing each image and caption pair exactly once. The batch size is the number of training examples processed together in one parameter update within an epoch. The obtained results may vary according to these parameters. The parameter values used for model training are set to 55 epochs and a batch size of 1024.
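Continuing the hypothetical decoder sketch given earlier, training with these reported parameter values might look like the following; X_img, X_seq and y_word stand for prepared attribute vectors, padded partial captions and one-hot next-word targets, which are assumed to exist.

```python
# Minimal sketch: training the (hypothetical) caption_model from the earlier
# sketch with the parameter values reported in the paper.
# X_img: visual attribute vectors, X_seq: padded partial captions,
# y_word: one-hot next-word targets -- all assumed to be prepared beforehand.
history = caption_model.fit(
    [X_img, X_seq], y_word,
    epochs=55,        # one epoch = one pass over every image-caption pair
    batch_size=1024,  # number of training examples per gradient update
)
```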
After our proposed approach is trained, it is integrated with Eye of Horus. In Figure 5, sample captions generated with the proposed approach are given. The caption in Figure 5a is "a zebra standing in a field with tall grass", which is very close to the visual content. The generated caption in Figure 5b is "a tennis player is swinging her racket during a serve", which reads like natural text. These results show that the proposed system has the potential to be used for image captioning by visually and hearing impaired people.

Figure 5: Captioning results of the proposed approach

4. Conclusion

In this paper, we presented a smartphone-based image captioning platform for the visually and hearing impaired using the VGG16 deep learning architecture and an LSTM model. Our proposed approach was tested on the MSCOCO dataset and then integrated with our custom-designed Android application "Eye of Horus" to provide a user-friendly interface that visually and hearing impaired people can use with simple operation. The user either selects an image from the gallery or captures a new image using the smartphone camera. The selected image is uploaded to Firebase in order to be transferred to the remote server, which runs our proposed image captioning approach. The generated caption is sent back to the app via Firebase and displayed; the user also has the option to listen to the caption. The app will be further improved to include capabilities such as translating the English captions into Turkish or other languages and running on the iOS platform.
5. References

[1] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko, "Multimodal video description," in Proceedings of the 24th ACM International Conference on Multimedia. ACM, 2016, pp. 1092–1096.

[2] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, "Pointing novel objects in image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12497–12506.

[3] L. S. Batt, M. S. Batt, J. A. Baguley, and P. D. McGreevy, "Factors associated with success in guide dog training," Journal of Veterinary Behavior, vol. 3, no. 4, pp. 143–151, 2008.

[4] J. Bai, S. Lian, Z. Liu, K. Wang, and D. Liu, "Smart guiding glasses for visually impaired people in indoor environment," IEEE Transactions on Consumer Electronics, vol. 63, no. 3, pp. 258–266, 2017.

[5] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 375–383.

[6] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.

[7] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[11] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

[12] X. Chen and C. Lawrence Zitnick, "Mind's eye: A recurrent visual representation for image caption generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.

[13] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[14] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.

[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.

[16] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.

[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[18] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.

[19] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545–552.

[20] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[21] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.

[22] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.

[23] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.