Conference on Smart City; IEEE 4th Intl. Conference on Data Science and Systems
Department of Computer Science and Information Engineering, Feng Chia University, Taiwan, ROC
Abstract—The sensory navigation device is an important trend in the field of machine learning and data science. Nowadays, more and more sensory navigation devices are built for blind people. The core of such sensory navigation devices is usually implemented with an image recognition method. To build an image recognition model, many tools and online machine learning platforms have been proposed. However, these tools and platforms cannot completely satisfy the requirements of a sensory navigation device. To build a sensory navigation device that satisfies the requirements of blind people, the ability to reduce the cost of model training and the capability of user-centric image recognition are the two main issues. Therefore, to address these issues, we propose a novel approach, namely DLSNF (Deep-Learning-based Sensory Navigation Framework). Our proposed DLSNF is built on the YOLO architecture to reduce the cost of model training and on the NVIDIA Jetson TX2 to take user-centric image recognition into account. Based on our proposed DLSNF, a real-time image recognition model can be trained well and used to conduct sensory navigation that helps blind people. At the same time, the trained model is embedded in the NVIDIA Jetson TX2¹, a fast, power-efficient embedded AI computing device. For the experiments, we evaluated our proposed DLSNF with a real-world dataset consisting of 4,570 images collected by part-time workers. The extensive experimental results show that our proposed DLSNF performs more effectively and efficiently than the existing baselines.

Keywords—Sensory Navigation Device, Deep Learning, Residual Convolution Neural Network, Blind Guidance.

I. INTRODUCTION

In blind people's daily life, guidance tools can be realized in many ways, such as guide dogs, guidance tiles, and tactile sticks. However, road conditions are always changing, so these guidance tools might make some irreparable mistakes. Take Figure 1 as an example. The two pictures show the street view of the same place on different dates. If the left picture shows the usual road conditions, a blind person who uses conventional guidance tools might consider the road very safe. When the blind person walks along the street and finds that the real road conditions have become those shown in the right picture, he or she might be hurt while passing around the obstacles. This phenomenon is called time-variate dynamics. As a result, a huge number of blind people suffer terribly from the time-variate dynamics problem. To protect blind people against the potential harm caused by time-variate dynamics, many real-time road condition recognition mechanisms have been proposed, whereas such mechanisms for blind guidance are still in their infancy. Currently, all of the real-time road condition recognition mechanisms for blind guidance more or less rely on an offline modeling manner: the system collects street views via the users' wearable sensors, and the recognition model is trained on a server. However, blind people still use the out-of-date model to detect obstacles before the model is updated. In other words, such an offline modeling manner still suffers from the time-variate dynamics problem.

The Sensory Navigation Device [7] is an important trend in the field of guidance tools for blind people. The core of a sensory navigation device is a computer program that can lead blind and visually impaired people around obstacles through voice, using artificial intelligence, data mining, or customized rules. Nowadays, there are three main groups of Sensory Navigation Devices according to their working principle: radar, global positioning, and stereovision. The most widely known are the Sensory Navigation Devices based on the radar principle [18] [19] [20]. These devices emit laser or ultrasonic beams. When a beam strikes an object surface, it is reflected. Then, the distance between the user and the object can be calculated from the time difference between the emitted and received beam. A second type of Sensory Navigation Device includes devices based on the Global Positioning System (GPS) [14] [15] [16]. These devices aim to guide the blind user through a previously selected route; they also provide user location information such as the street number, street crossings, etc. Unfortunately, although the Sensory Navigation Devices based on the radar principle or the Global Positioning System have been widely applied as guidance tools for blind people, these two types of Sensory Navigation Devices are not able to deal with the time-variate dynamics problem.
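For illustration, the time-of-flight principle behind such radar-style devices can be sketched as follows; the propagation speed below assumes an ultrasonic beam in air and is not taken from the cited devices.

# Sketch of the time-of-flight principle used by radar-style travel aids:
# a beam is emitted, reflected by an obstacle, and received again; the
# one-way distance is half of (propagation speed x round-trip time).
SPEED_OF_SOUND_M_PER_S = 343.0  # ultrasonic beam in air at ~20 degrees C (assumed)

def distance_from_echo(round_trip_time_s: float,
                       speed_m_per_s: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Return the obstacle distance in meters for a measured echo delay."""
    return speed_m_per_s * round_trip_time_s / 2.0

# Example: an echo that returns after 5.8 ms corresponds to roughly 1 m.
print(distance_from_echo(0.0058))  # ~0.99 m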
With the development of the webcam, many researchers [5]
[6] [7] proposed the application of stereovision to develop new
Figure 1. An example of the street view of the same location on two different dates.
1 https://developer.nvidia.com/embedded/buy/jetson-tx2
B. Deep-Learning-based Image Recognition

CNN. In [13], Krizhevsky et al. trained a large, deep convolutional neural network (CNN) to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, they achieved top-1 and top-5 error rates of 37.5% and 17.0%, which was considerably better than the previous state of the art. In [11], Goodfellow et al. propose a unified approach that uses a deep convolutional neural network to recognize arbitrary multi-character text in unconstrained natural photographs. They evaluate this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing complete street numbers.

R-CNN. In [10], Girshick et al. propose an approach that combines two key insights: (1) high-capacity convolutional neural networks (CNNs) can be applied to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. In [9], Girshick proposes Fast R-CNN, which speeds up the training of deep convolutional networks. Compared to R-CNN, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy.

VGG nets. In [24], Simonyan et al. investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

GoogLeNet. In [25], Szegedy et al. propose a deep convolutional neural network architecture codenamed Inception that achieved a new state of the art for classification and detection. The main hallmark of this architecture is the improved utilization of the computing resources inside the network.

Residual-Net. In [12], He et al. present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. They explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

YOLO. In [21], Redmon et al. developed a fast single-shot detection method named You Only Look Once (YOLO). YOLO predicts multiclass bounding box candidates directly from the grids of the full input image, and the combination of the class probabilities and the bounding box confidences provides the resulting detections. The input image is divided into 7 × 7 grids; each grid cell predicts class probabilities and candidate bounding boxes with confidence scores. Each bounding box contains five position indicators: the box coordinates (x, y, w, h) and the position confidence. In [22], Redmon et al. proposed YOLOv2, a faster and more accurate detector that pools a variety of ideas from past work with their own novel concepts to improve YOLO's performance. In [23], Redmon et al. continued to improve YOLO's performance, and YOLOv3 was proposed in 2018.
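For illustration, the size of the YOLO output tensor follows directly from the grid size S, the number of boxes B per grid cell, and the number of classes C; the sketch below uses the values of the original YOLO paper (S = 7, B = 2, C = 20), not those of our model.

# Size of the YOLOv1 prediction tensor: S x S x (B * 5 + C),
# where each of the B boxes per grid cell carries (x, y, w, h, confidence)
# and each cell additionally predicts C class probabilities.
def yolo_v1_output_shape(s: int = 7, b: int = 2, c: int = 20):
    return (s, s, b * 5 + c)

print(yolo_v1_output_shape())  # (7, 7, 30) for the original 7 x 7 grid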
Figure 2. An Illustration of Object Annotation of an Image.

III. OUR PROPOSED METHOD

To build the sensory navigation device for blind guidance, we first collect images through a webcam and annotate the objects shown in the images. Then we utilize the annotated images to train an image recognition model based on the YOLOv3 architecture. Finally, we deploy the trained model on the NVIDIA Jetson TX2.
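As an illustrative sketch of the deployment step, weights trained in the Darknet/YOLOv3 format can be loaded with OpenCV's DNN module on the device; the file names below are placeholders, and the commented CUDA lines assume an OpenCV build with CUDA support.

# Sketch: run a trained YOLOv3 (Darknet) model on a webcam frame with OpenCV.
# "yolov3.cfg" / "yolov3.weights" are placeholder file names for our model.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
# Optional: use the GPU if OpenCV was built with CUDA (e.g., on the Jetson TX2).
# net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
# net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

cap = cv2.VideoCapture(0)  # webcam worn by the user
ok, frame = cap.read()
if ok:
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    # Each row of each output holds (x, y, w, h, objectness, class scores...).
cap.release()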
A. Data Preprocess

As mentioned earlier, the idea of our framework is as follows: 1) the YOLO architecture is used to deal with reducing the cost of model training, and 2) the NVIDIA Jetson TX2 is used to take the capability of user-centric image recognition into account. Therefore, transforming the annotated images into a format that is compatible with the training data used for model building plays a crucial role in building our sensory navigation device.

To do so, we first hire several part-time staff to collect images through a webcam. All objects shown in the collected images are annotated by the staff. Here, we adopt an open-source image labeling tool, LabelImg [27], to annotate the objects that are critical for blind guidance. Figure 2 shows an example of LabelImg. The image shows several chairs and one table. The staff annotate these objects, and LabelImg outputs a text file that records the boundaries of these objects and their labels.
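LabelImg can export annotations either directly in YOLO text format or as Pascal VOC XML. The following sketch illustrates, under the assumption that the XML format is used, how one annotation file could be converted into the normalized "class x_center y_center width height" lines expected for YOLOv3 training; the class list is only illustrative.

# Minimal sketch: convert one LabelImg Pascal VOC XML file into YOLO-format
# label lines (class index plus box center/size normalized by the image size).
# The class list below is only an example of the objects annotated here.
import xml.etree.ElementTree as ET

CLASSES = ["chair", "table"]  # illustrative label set

def voc_to_yolo_lines(xml_path: str) -> list[str]:
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        x_c = (xmin + xmax) / 2.0 / img_w  # normalized box center x
        y_c = (ymin + ymax) / 2.0 / img_h  # normalized box center y
        w = (xmax - xmin) / img_w          # normalized box width
        h = (ymax - ymin) / img_h          # normalized box height
        lines.append(f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines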
B. Sensory Navigation Device Building

As mentioned earlier, we have already collected and annotated images. In this subsection, we detail how we build an object detection model that can help blind people walk around obstacles. To produce the object detection model, we utilize the YOLOv3 architecture, which is one of the popular types of residual CNNs. Figure 3 shows the structure of a neuron of YOLOv3. It has 53 convolutional layers, which use successive 3 × 3 and 1 × 1 convolutional layers but now also have some shortcut connections, and it is significantly larger.
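To make the combination of 1 × 1 and 3 × 3 convolutions with shortcut connections concrete, the following is a minimal PyTorch-style sketch of a Darknet-53-style residual block; it illustrates the building block only and is not our original implementation.

# Minimal sketch of a Darknet-53-style residual block: a 1x1 convolution that
# halves the channels, a 3x3 convolution that restores them, and a shortcut
# connection that adds the block input back to its output.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # the shortcut (residual) connection

# Example: a 256-channel feature map passes through with its shape unchanged.
features = torch.randn(1, 256, 52, 52)
print(ResidualBlock(256)(features).shape)  # torch.Size([1, 256, 52, 52])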
Accordingly, we can see that such a model is too large to be trained within a reasonable time. Fortunately, the residual neural network inherently has shortcut connections, i.e., the Residual layers in Figure 3. Such shortcut connections can ease the training of such a deep network and thus help reduce the cost of model training.
Figure 5. Three types of image samples in our database: (a) a chair sample, (b) a table sample, and (c) an image containing both a table and a chair.
examples and results. We also compared different numbers of iterations in terms of the loss score. We performed both normal training and fine-tuning with a batch size of 64 to see the effect of the image preprocessing.

and the fluctuation is very unstable. The reason might be that there is too little training data; that is, the training data are not rich. Therefore, we can say that a small amount of data results in poor performance.
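As an illustrative sketch of how such a comparison can be produced, the loss values of the two runs can be plotted against the iteration count; the CSV file names below are placeholders and not artifacts of our experiments.

# Sketch of the loss-versus-iteration comparison, assuming the loss values of
# both runs were exported to simple two-column CSV files (iteration, loss).
import csv
import matplotlib.pyplot as plt

def load_loss(path: str):
    with open(path, newline="") as f:
        rows = [(int(it), float(loss)) for it, loss in csv.reader(f)]
    return zip(*rows)  # -> (iterations, losses)

for path, label in [("normal_training_loss.csv", "normal training"),
                    ("fine_tuning_loss.csv", "fine-tuning")]:
    iterations, losses = load_loss(path)
    plt.plot(iterations, losses, label=label)

plt.xlabel("iteration")
plt.ylabel("loss")
plt.legend()
plt.show()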
Figure 9. Detection of the chair by our model. The left screen output is the left camera shot, and the right screen output is the right camera shot.
the proposed YOLO offers a speedup. It sometimes generates bounding boxes of different sizes because of the different angles of the left and right camera shots, but a slight discrepancy between the predicted bounding boxes is tolerable to some extent.

Figure 10. Output of the chair's distance.

Figure 10 shows the output of the chair's distance on the command line; the distance is determined from the different angles of the two screens. Based on the bounding boxes, the distance is output as a floating-point value, and it can be seen from the command-line output that the distance is output continuously. There may be some slight distance errors, but such a small gap in the distance will not affect users. We can see that the output of the chair's distance is about 1 meter, which matches the actual distance.
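The exact distance formula is not spelled out above; one plausible sketch is the standard stereo relation, distance = focal length × baseline / disparity, applied to the horizontal offset between the chair's bounding-box centers in the left and right shots. The focal length and baseline values below are placeholders.

# Hedged sketch of stereo distance estimation from the two camera shots:
# the horizontal offset (disparity) between the same object's bounding-box
# centers in the left and right images gives depth = f * baseline / disparity.
FOCAL_LENGTH_PX = 700.0  # placeholder focal length of the webcams, in pixels
BASELINE_M = 0.12        # placeholder distance between the two cameras

def distance_from_boxes(left_box, right_box,
                        focal_px: float = FOCAL_LENGTH_PX,
                        baseline_m: float = BASELINE_M) -> float:
    """Boxes are (x, y, w, h) in pixels; returns the estimated distance in meters."""
    left_cx = left_box[0] + left_box[2] / 2.0
    right_cx = right_box[0] + right_box[2] / 2.0
    disparity = abs(left_cx - right_cx)  # pixel shift between the two views
    return focal_px * baseline_m / disparity

# Example: an 84-pixel disparity corresponds to about 1 meter here.
print(distance_from_boxes((300, 200, 80, 120), (216, 200, 80, 120)))  # ~1.0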
VI. CONCLUSIONS

In this paper, we propose a novel Deep-Learning-based Sensory Navigation Framework (DLSNF) to build a Sensory Navigation Device for blind guidance. We also tackled the problem of object detection, which is a crucial prerequisite for a blind guidance tool. The core task of model learning is conveniently transformed into the problem of learning an object detection model. We develop a Residual-CNN architecture to detect the objects shown in a snapshot captured by a webcam. Through a series of experiments using a dataset collected by our staff, we have validated the Residual-CNN for building an object detection model and shown that it has excellent performance under various conditions. In future work, we plan to design more sophisticated methods and compare them with state-of-the-art methods.

ACKNOWLEDGMENT

This research was partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under grant no. MOST 106-2218-E-126-001 and MOST 106-2221-E-035-094.

REFERENCES

[1] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), 110(3):346-359, 2008.
[2] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary Robust Independent Elementary Features," in Proceedings of the European Conference on Computer Vision (ECCV), 2010.
[3] I. Culjak, "A brief introduction to OpenCV," in Proceedings of the 35th International MIPRO Convention, IEEE, pp. 2142-2147, 2012.
[4] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference, 2005.
[5] L. Dunai, G. P. Fajarnes, V. S. Praderas, and B. D. Garcia, "Electronic Travel Aid systems for visually impaired people," in Proceedings of the DRT4ALL 2011 Conference, IV Congreso Internacional de Diseño, Redes de Investigación y Tecnología para Todos, Madrid, Spain, 2011.
[6] L. Dunai, G. P. Fajarnes, V. S. Praderas, B. D. Garcia, and I. Lengua, "Real-Time assistance prototype – a new navigation aid for blind people," in Proceedings of the IEEE Industrial Electronics Society Conference (IECON 2010), Phoenix, Arizona, pp. 1173-1178, 2010.
[7] L. Dunai, G. P. Fajarnes, V. S. Praderas, B. D. Garcia, and I. Lengua, "EYE2021 - Acoustical cognitive system for navigation," AEGIS 2nd International Conference, Brussels, 2011.
[8] S. Emami and V. P. Suciu, "Facial Recognition using OpenCV," Journal of Mobile, Embedded and Distributed Systems, 4(1), 38-43, 2012.
[9] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks," arXiv:1312.6082, 2014.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[14] R. Kuc, "Binaural Sonar Electronic Travel Aid Provides Vibrotactile Cues for Landmark, Reflector Motion and Surface Texture Classification," IEEE Transactions on Biomedical Engineering, 49, 1173-1180, 2002.
[15] J. M. Loomis, R. G. Golledge, and R. L. Klatzky, "GPS-Based Navigation Systems for the Visually Impaired," in Fundamentals of Wearable Computers and Augmented Reality, W. Barfield and T. Caudell, Eds., pp. 429-446, Mahwah, NJ: Lawrence Erlbaum Associates, 2001.
[16] J. Loomis and R. Golledge, "Personal Guidance System using GPS, GIS, and VR technologies," in Proceedings of the CSUN Conference on Virtual Reality and Persons with Disabilities, San Francisco, 2003.
[17] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV, 60(2), pp. 91-110, 2004.
[18] R. W. Mann, "Mobility aids for the blind – An argument for a computer-based, man-device environment, interactive, simulation system," in Proceedings of the Conference on Evaluation of Mobility Aids for the Blind, Washington, DC: Committee on the Interplay of Engineering with Biology and Medicine, National Academy of Engineering, pp. 101-116, 1970.
[19] D. L. Morrissette, G. L. Goddrich, and J. J. Henesey, "A follow-up study of the Mowat sensor's applications, frequency of use and maintenance reliability," Journal of Visual Impairment and Blindness, 75, 244-247, 1981.
[20] L. Russell, "Travel Path Sounder," in Proceedings of the Rotterdam Mobility Research Conference, New York: American Foundation for the Blind, 1965.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," arXiv:1506.02640, 2015.
[22] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in Computer Vision and Pattern Recognition (CVPR), 2017.
[23] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767v1, 2018.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[26] G. Xie and W. Lu, "Image Edge Detection Based on OpenCV," International Journal of Electronics and Electrical Engineering, 1(2): 104-106, 2013.
[27] LabelImg, https://github.com/tzutalin/labelImg