


Smartphone-based Image Captioning for Visually and Hearing Impaired
Burak Makav, Volkan Kılıç

Electrical and Electronics Engineering Graduate Program, Izmir Katip Celebi University, Izmir, Turkey
[email protected], [email protected]

Abstract

Visually and hearing impaired people face difficulties due to inaccessible infrastructure and social challenges in daily life. To increase the quality of life of these people, we report a portable and user-friendly smartphone-based platform capable of generating captions and text descriptions, including the option of a narrator, from images obtained with a smartphone camera. Image captioning is the task of generating a sentence that describes the visual content of an image in natural language, and it has attracted an increasing amount of attention in the fields of computer vision and natural language processing due to its potential applications. Generating image captions with proper linguistic properties is a challenging task, as it needs to combine advanced image understanding algorithms with natural language processing methods. In this study, we propose to use a Long Short-Term Memory (LSTM) model to generate a caption after images are processed with the VGG16 deep learning architecture. The visual attributes of the images are extracted with the VGG16, which conveys richer content, and they are then fed into the LSTM model for caption generation. This system is integrated with our custom-designed Android application, named "Eye of Horus", which transfers the images from the smartphone to a remote server via a cloud system and displays the captions after the images are processed with the proposed captioning approach. The results show that the integrated platform has great potential to be used for image captioning by visually and hearing impaired people, with advantages such as portability, simple operation and rapid response.

1. Introduction

Automated caption generation has received increasing interest in the field of computer vision as a result of recent progress in artificial intelligence (AI) and natural language processing (NLP). Image captioning is the task of understanding a visual scene and expressing it in terms of natural language descriptions [1, 2]. It has found many industrial and practical applications, such as visual intelligence in chatting robots, photo tagging and sharing on social media, image indexing and retrieval, virtual assistants, image understanding and assistive activities for visually and hearing impaired people.

In order to improve the quality of life of those people, several studies such as the "guide dog" [3], "smart glasses" [4] and "image captioning" [5] have been reported. Generating a meaningful natural language description of an image from its visual content requires advanced algorithms beyond image classification and object detection, which attracts the interest of two major areas of AI: computer vision and NLP [6]. Researchers from the computer vision field focus on extracting visual attributes with richer content to feed the NLP model, which constructs a sentence in the form of natural-looking text. The visual attributes extracted from images affect the overall captioning performance, which leads researchers to run sophisticated and complex methodologies on the images. Deep learning is currently a popular method in state-of-the-art studies, and various deep learning architectures have been reported in the literature, such as ZFNet [7], AlexNet [8], GoogLeNet [9] and VGGNet [10]. In this study, VGG16, a member of the VGGNet family, is employed due to the success of VGGNet over other architectures. On the other hand, researchers from the NLP side focus on a better description of the visual attributes in natural language and have proposed models such as "Nearest Neighbor" (NN) [11], "Recurrent neural network (RNN)" [12], "Random" [13], "1NNfc7" [14], "Human" [15], "Stanford" [16] and Long Short-Term Memory (LSTM) [17].

In this study, we propose to use the VGG16 deep learning architecture followed by the LSTM model to generate a caption. We show in our experiments that incorporating the VGG16 architecture and the LSTM model in this way improves the captioning performance significantly. Moreover, we develop a custom-designed Android application, named "Eye of Horus", capable of generating a caption for an image taken by a smartphone camera. Eye of Horus transmits images via a cloud system to the remote server, which runs our proposed image captioning approach. After a caption is generated on the remote server, it is sent back to the cloud, and Eye of Horus receives the caption and displays it on the screen. With the narrator option, the user can hear the caption.

The rest of this paper is organized as follows: the next section introduces the proposed approach for caption generation. Section 3 presents the dataset, Eye of Horus and a discussion of the results. Closing remarks are given in Section 4.
2. Proposed Captioning Approach

In this section, we describe our proposed approach using the VGG16 deep learning architecture and the LSTM model. The image captioning process has two steps: first, the visual attributes of the images need to be extracted with richer content; second, these attributes need to be sent to an NLP model to generate the most human-like captions. To accomplish both steps, we propose to incorporate the VGG16 architecture with the LSTM model, as presented in the following subsections.

2.1. VGG16 Deep Learning Architecture

After the VGGNet [10] became the first runner-up in the image classification category and the winner in the localization category of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2014, improved versions of VGGNet such as VGG11, VGG13, VGG16 and VGG19 were released, with the number at the end indicating the number of weight layers used in the architecture. The VGGNet is simply a convolutional neural network model, and the VGG16 was found adequate for image classification based on experimental studies. The VGG16 includes 16 weight layers: thirteen convolutional layers, two fully connected layers and one output layer with softmax activation. The convolutional layers are categorized into five groups, each followed by a max-pooling layer. An illustration of the VGG16 architecture is given in Figure 1.

Figure 1: The architecture of VGG16 [10]

The outstanding performance of the VGG16 over the previous generation of models such as AlexNet and GoogLeNet leads us to employ it in our captioning approach to extract the visual attributes.
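As a minimal sketch of this first step, the Python snippet below loads a pre-trained VGG16 and reuses one of its fully connected layers as the visual attribute vector. The use of Keras and the choice of the fc2 layer are illustrative assumptions; the paper does not specify these implementation details.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Drop the 1000-way softmax output and keep the second fully connected
# layer (fc2, 4096-dimensional) as the visual attribute vector.
base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer("fc2").output)

def extract_features(img_path):
    # VGG16 expects 224x224 RGB input.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return feature_extractor.predict(x)  # shape: (1, 4096)

The extracted vector can then be passed to the language model described in the next subsection.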

2.2. LSTM Architecture

LSTM networks are recurrent neural networks with a specific gating mechanism that controls access to memory cells [17]. Since the gates can prevent the rest of the network from changing the contents of the memory cells for many time steps, the LSTM preserves signals and propagates errors over a much longer period than ordinary recurrent neural networks. The gates can also learn to attend to certain sections of the input signals and to ignore others by reading, writing and deleting content from the memory cells. These features allow LSTM networks to process data with complex dependencies, which enables applications such as speech recognition [18], offline handwriting recognition [19], machine translation [20] and image captioning [13, 21]. Thus, we follow the LSTM architecture to generate captions from the visual attributes with rich semantic content.
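To make the caption generation step concrete, the sketch below pairs the extracted VGG16 features with an LSTM decoder that predicts the caption word by word. This is a generic merge-style design written with Keras under stated assumptions; the layer sizes, vocabulary size and maximum caption length are placeholders, not values reported in the paper.

from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length
FEAT_DIM = 4096     # size of the VGG16 visual attribute vector

# Image branch: project the visual attributes into the decoder space.
img_input = Input(shape=(FEAT_DIM,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: embed the partial caption and run it through an LSTM.
seq_input = Input(shape=(MAX_LEN,))
seq_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_input)
seq_lstm = LSTM(256)(Dropout(0.5)(seq_embed))

# Merge both branches and predict the next word of the caption.
decoder = Dense(256, activation="relu")(add([img_dense, seq_lstm]))
outputs = Dense(VOCAB_SIZE, activation="softmax")(decoder)

caption_model = Model(inputs=[img_input, seq_input], outputs=outputs)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, the decoder is seeded with a start token and repeatedly fed its own previous predictions until an end token or the maximum length is reached.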

3. Experimental Results

In the previous section, the VGG16 and LSTM models were introduced. In this section, the dataset, the Android application and the results are discussed.

3.1. Dataset

Flickr [22] and MSCOCO [23] are datasets commonly used in image captioning. In this study, the MSCOCO dataset is chosen, as it contains approximately one hundred sixty thousand images, while Flickr contains about thirty thousand. Additionally, MSCOCO provides five reference captions per image. The number of images in the dataset and the number of reference captions per image are important, as they are used to train the overall system.

Figure 2: Sample image from MSCOCO

A sample image from the MSCOCO dataset is given in Figure 2. The reference captions entered for this image are as follows:

• A lady sitting at an enormous dining table with lots of food.
• A woman with eye glasses sitting at a table covered with food.
• Several plates of food on a dining table.
• A guest looks over the plates of fruit on the table.
• A woman standing near a table with plates covered in food.
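The five reference captions per image can be read from the MSCOCO annotation files as sketched below; the file name and JSON layout follow the public captions release, and whether the paper used these files directly is an assumption.

import json
from collections import defaultdict

with open("annotations/captions_train2014.json") as f:
    coco = json.load(f)

# Group the reference captions by image id (five per image in MSCOCO).
captions_by_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

sample_id = coco["images"][0]["id"]
print(captions_by_image[sample_id])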
3.2. Smartphone Application: Eye of Horus

Here, we demonstrate a portable smartphone-based platform for image captioning controlled by software, named Eye of Horus, developed in Android Studio. A simple and user-friendly interface is designed to provide easy operation for visually and hearing impaired users. Screenshots of the Eye of Horus app given in Figure 3 present the flow of the running procedure.

Figure 3: Steps of image captioning with Eye of Horus

When the user runs Eye of Horus, the "tap me to select" page is displayed after the opening page. On this page, the user can choose an image from the gallery or capture a new image using the camera. After the image is selected, the user needs to tap the "upload" button to send the image to the remote server via the Firebase cloud system. On the remote server, a script coded in Python downloads the image from Firebase to generate a caption. The generated caption is sent back to Eye of Horus via Firebase again. Eye of Horus displays the caption under the image together with a narrator button. If the user taps this button, the caption is read out loud.
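The server-side half of this round trip can be pictured with the hedged Python sketch below. The use of the firebase_admin SDK, the storage and database paths, and the generate_caption() helper (standing in for the VGG16 and LSTM pipeline of Section 2) are assumptions for illustration, not details taken from the paper.

import firebase_admin
from firebase_admin import credentials, db, storage

cred = credentials.Certificate("serviceAccountKey.json")  # placeholder key file
firebase_admin.initialize_app(cred, {
    "storageBucket": "example-app.appspot.com",            # placeholder bucket
    "databaseURL": "https://example-app.firebaseio.com",   # placeholder URL
})

def generate_caption(image_path):
    # Hypothetical helper: run the VGG16 feature extractor and the LSTM
    # decoder sketched in Section 2 on the downloaded image.
    raise NotImplementedError

def handle_upload(image_name):
    # Download the image that the app uploaded to Firebase Storage.
    blob = storage.bucket().blob("uploads/" + image_name)
    local_path = "/tmp/" + image_name
    blob.download_to_filename(local_path)

    # Generate the caption and write it back so the app can display and read it.
    caption = generate_caption(local_path)
    db.reference("captions/" + image_name).set(caption)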
3.3. Results

In this study, the proposed approach was tested on the MSCOCO dataset. First, the pre-trained VGG16 model is used to extract the visual attributes of an image. Then they are fed to the LSTM model to generate a caption. A flowchart of the overall system is illustrated in Figure 4.

Figure 4: Flowchart of overall system

The model is trained with configurable values of two parameters, the epoch and the batch size. An epoch represents one iteration over the entire training set, in which each image and caption pair is processed only once. The batch size is the number of training examples processed at a time within an epoch. The obtained results may vary according to these parameters; the values used for model training are set to 55 for the epoch and 1024 for the batch size. After our proposed approach is trained, it is integrated with Eye of Horus. In Figure 5, sample captions generated with the proposed approach are given. The caption in Figure 5a is "a zebra standing in a field with tall grass", which is very close to the visual content. The generated caption in Figure 5b is "a tennis player is swinging her racket during a serve", which is very similar to natural-looking text. The results show that the proposed system has potential to be used for image captioning by visually and hearing impaired people.

Figure 5: Captioning results of the proposed approach
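Continuing the Keras-style sketch from Section 2.2, the reported settings would translate roughly as follows; the arrays are zero-filled placeholders for the prepared MSCOCO training data, not code from the paper.

import numpy as np

EPOCHS = 55        # one epoch is one pass over every image and caption pair
BATCH_SIZE = 1024  # training examples consumed per parameter update

# Placeholders standing in for the prepared training data: VGG16 features,
# tokenised partial captions and one-hot encoded next words.
X_img = np.zeros((BATCH_SIZE, FEAT_DIM), dtype="float32")
X_seq = np.zeros((BATCH_SIZE, MAX_LEN), dtype="int32")
y_word = np.zeros((BATCH_SIZE, VOCAB_SIZE), dtype="float32")

caption_model.fit([X_img, X_seq], y_word, epochs=EPOCHS, batch_size=BATCH_SIZE)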
4. Conclusion

In this paper, we presented a smartphone-based image captioning system for the visually and hearing impaired, using the VGG16 deep learning architecture and the LSTM model. Our proposed approach was tested on the MSCOCO dataset and then integrated with our custom-designed Android application, "Eye of Horus", to provide a user-friendly interface that visually and hearing impaired people can use with simple operation. The user either selects an image from the gallery or captures a new image using the smartphone camera. The selected image is uploaded to Firebase in order to transfer it to the remote server, which runs our proposed image captioning approach. The generated caption is sent back to the app via Firebase, where it is displayed. The user also has the option to listen to the caption. The app will be further improved to include additional capabilities, such as translating the English captions to Turkish or other languages and running on the iOS platform.

5. References

[1] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko, "Multimodal video description," in Proceedings of the 24th ACM International Conference on Multimedia. ACM, 2016, pp. 1092–1096.

[2] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, "Pointing novel objects in image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12497–12506.

[3] L. S. Batt, M. S. Batt, J. A. Baguley, and P. D. McGreevy, "Factors associated with success in guide dog training," Journal of Veterinary Behavior, vol. 3, no. 4, pp. 143–151, 2008.

[4] J. Bai, S. Lian, Z. Liu, K. Wang, and D. Liu, "Smart guiding glasses for visually impaired people in indoor environment," IEEE Transactions on Consumer Electronics, vol. 63, no. 3, pp. 258–266, 2017.

[5] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 375–383.

[6] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.

[7] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[11] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

[12] X. Chen and C. Lawrence Zitnick, "Mind's eye: A recurrent visual representation for image caption generation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.

[13] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[14] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.

[15] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.

[16] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.

[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[18] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.

[19] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545–552.

[20] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[21] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv preprint arXiv:1411.2539, 2014.

[22] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.

[23] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, "Microsoft COCO captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.
