
Multimed Tools Appl

DOI 10.1007/s11042-016-3617-6

A computer vision-based perception system for visually impaired

Ruxandra Tapu 1,2 & Bogdan Mocanu 1,2 & Titus Zaharia 1

Received: 29 October 2015 / Revised: 25 March 2016 / Accepted: 12 May 2016
© Springer Science+Business Media New York 2016

Abstract In this paper, we introduce a novel computer vision-based perception system, dedicated to the autonomous navigation of visually impaired people. A first feature concerns the real-time detection and recognition of obstacles and moving objects present in potentially cluttered urban scenes. To this purpose, a motion-based, real-time object detection and classification method is proposed. The method requires no a priori information about the obstacle type, size, position or location. In order to enhance the navigation/positioning capabilities offered by traditional GPS-based approaches, which are often unreliable in urban environments, a building/landmark recognition approach is also proposed. Finally, for the specific case of indoor applications, the system has the possibility to learn a set of user-defined objects of interest. Here, multi-object identification and tracking is applied in order to guide the user to localize such objects of interest. The feedback is presented to the user through audio warnings/alerts/indications. Bone conduction headphones are employed in order to allow the visually impaired to hear the system's warnings without obstructing the sounds from the environment. At the hardware level, the system is fully integrated on an Android smartphone, which makes it easy to wear, non-invasive and low-cost.

Keywords Obstacle detection . BoVW / VLAD image representation . Relevant interest points .
A-HOG descriptor . Visually impaired people

* Ruxandra Tapu
[email protected]

1
ARTEMIS Department, Institut Mines-Télécom / Télécom SudParis, UMR CNRS MAP5 8145, 9 rue
Charles Fourier, 91000 Évry, France
2
Telecommunication Department, Faculty of ETTI, University "Politehnica" of Bucharest, Splaiul
Independentei 313, 060042 Bucharest, Romania

1 Introduction

Recent statistics of the World Health Organization (WHO) [51] have shown that in 2012 about 0.5 % of the world population was visually impaired. Among this population, 10 % of the concerned people are completely blind. Independent navigation in outdoor environments is of extreme importance for visually impaired (VI) people. In order to perform daily activities, VI people attempt to memorize all the locations they have been through, so they can recognize them afterwards.
In an unknown setting, VI people rely on white canes and guide dogs as primary assistive devices. Although the white cane is the simplest and cheapest mobility tool, it has a restricted searching range and cannot infer additional information such as the speed and nature of the obstacle a VI user is encountering, or the distance and time to collision. On the other hand, guide dogs are highly expensive, require an extensive training phase and are effectively operational for about 5 years only [7]. Both the white cane and the guide dog provide short-range information and cannot detect overhanging obstructions.
The task of route planning in environments with unforeseen obstacles can severely impede the independent travel of VI people and thus reduce their willingness to travel [28].
In this context, in order to improve cognition and assist the navigation of VI users, it is
necessary to develop real-time systems, able to provide guidance and to recognize both static
and moving objects, in highly dynamic and potentially cluttered urban scenes. The goal of
such a technology is not to replace the white cane, but to complement it in an intelligent
manner, by alerting the user to obstacles a few meters ahead or by providing direction/localization information. For acceptability reasons, a major constraint is imposed on the system: it should not interfere with the other senses, such as hearing or touch.
Motivated by the above-mentioned considerations, in this paper, we introduce a novel VI-
dedicated navigational assistant, developed using computer vision techniques that improve and complement the human senses (i.e., hearing and touch). The
approach is designed as a real-time, standalone application, running on a regular smartphone.
In this context, our system provides a low-cost, non-intrusive and simple device for VI
navigation.
The rest of the paper is organized as follows. Section 1 presents a state of the art review
(Section 1.1) and also introduces the general framework of our developments, which have
been made within the context of the AAL (Ambient Assisted Living) European project ALICE
(Section 1.2). The system includes a novel obstacle detection and classification module
(Section 2), a landmark recognition module (Section 3) and an indoor detection/localization module for objects of interest (Section 4). Section 5 presents the experimental results conducted with real
VI users traveling in challenging environments, with various arbitrary moving objects and
obstacles. Finally, Section 6 concludes the paper and opens some perspectives of future work.

1.1 Related work

In the last couple of years, various assistive technologies for blind and visually impaired
people have been proposed.
Existing commercial approaches exploit the Global Positioning System (GPS), in order to
provide guidance and to localize a VI person. However, in the context of people with special needs, such systems prove to be sensitive to signal loss and have reduced accuracy in estimating the user's position [56]. In urban areas with a high density of buildings, the GPS
sensors can exhibit positioning errors of about 15 m [56]. Moreover, the GPS signal can be
frequently lost in certain areas. Such limitations strongly affect the reliability of the proposed
systems and severely penalize the GPS-based approaches within the context of VI navigation
applications.
Due mostly to limitations in the available computational power and to the lack of robustness of vision algorithms, until recently computer vision techniques have been relatively little used for developing VI-dedicated mobility assistants.
Nevertheless, in the past years, significant advances in computing and vision techniques
have been achieved. It is now possible to run in real-time reliable algorithms on embedded
computers and even smartphones that are equipped with powerful, multi-core processors. In
addition, computer vision systems, unlike ultrasonic, infrared or laser technologies, offer a superior level of reproduction and interpretation of real scenes, at the price of a higher computational complexity. Let us now describe and analyze the existing state-of-the-art systems, with their main features, advantages and limitations.

1.1.1 CCD camera systems

The tactile vision system (TVS), first introduced in [33], is designed as a compact, wearable device, able to detect obstacles in real time and provide directional information during indoor navigation. The alerting messages are sent to the VI user by using fourteen vibrating motors attached to a flexible belt. In this way, the hands-free and ears-free conditions are always satisfied.
However, the system is not able to differentiate between ground and overhead obstacles.
The NAVI navigational assistant dedicated to obstacle detection introduced in [61] is
composed of a processing unit, a regular video camera, stereo headphones and a supporting
vest. The camera captures a gray-scale, re-sampled video stream. Then, by using a fuzzy neural
network the system discriminates objects (foreground elements) from the background. The
framework is operational in real-time. However, no audio feedback is sent to the user and no
information about the distance to the object is provided.
In [71], the authors introduce an obstacle detection system in order to determine a safe walking
area, in an outdoor environment, for a VI user. By applying the Canny-Edge detection
algorithm, vanishing points and sidewalk boundaries are extracted. Next, obstacles are iden-
tified by applying Gabor filters in order to determine quasi-vertical lines. The system is easy to
wear, light and compact. The major drawbacks are related to its sensitivity to the user's movement and the violation of the ears-free constraint. In addition, the solution has never been tested in real-life conditions.
The SmartVision system [34] is designed to offer global indoor and outdoor navigation
guidance (using GIS with GPS) with the avoidance of static and dynamic obstacles. Diago-
nally distributed sidewalk boundaries are extracted to determine the path area. Then, the
objects inside the path area are detected by quasi-vertical edges or changes in texture patterns.
The system is sensitive to GPS signal loss and to the initial path positioning, notably when the VI user leaves the path or reaches intersections or crossings.
In [42], by using a regular camera mounted on the user’s waist, the authors introduce an obstacle detection system that differentiates between background and foreground objects through an inhomogeneous image re-sampling process: the background edges are sub-sampled while the obstacle edges are over-sampled in the top-view domain.
To the best of our knowledge, the only system designed to incorporate a navigational
assistant on a regular smartphone is proposed in [52]. By using computer vision techniques
(color histograms and object edge detection), the prototype is able to detect with high
confidence objects situated at arbitrary height levels. However, the evaluation was performed
solely in indoor spaces and with no VI users. In addition, the hands-free condition [46] imposed by the VI user is violated because the smartphone needs to be hand-held.
A regular camera is more compact and easier to maintain than stereo cameras. However, it
is much more difficult to estimate distances or to distinguish between background and foreground
objects. Despite the efforts made to detect obstacles from a single image without depth cues,
the appearance and geometry models used in these systems are valid only in limited scenarios.

1.1.2 Stereo camera systems

Stereo cameras are more commonly used for building mobility aid systems, because depth can
be computed directly from pairs of stereo images.
By using electro-tactile stimulation, GPS localization and visual sensors, the electronic
neural vision system (ENVS) introduced in [48] is designed as a real-time application that
facilitates the navigation of VI users and also alerts them to potential hazards along the way. The warning messages are transmitted to the VI user by electrical nerve stimulation gloves. In this case, the user's hands are always occupied. Moreover, the ground and overhead objects are not detected, while
the walking path needs to be flat, which limits the domain of applicability of the method.
The navigation assistant Tyflos, first introduced in [14] and extended in [15], is designed
to detect surrounding obstacles. The system is composed of two video cameras for depth
image estimation, a microphone, ear headphones, a processing unit and a 2D vibration vest.
The architecture satisfies the hands-free constraint and the VI user can be alerted about obstacles
situated at various levels of height. However, the necessity of wearing a vibration vest situated
near the skin makes the entire framework invasive.
A wearable stereo device for indoor VI user navigation is proposed in [60]. The system is
composed of a processing unit, a stereo camera and a chest mounted harness. The algorithm
yields solely a metric map, which is difficult to exploit by blind people. Furthermore, the
system is not able to perform in real-time.
A stereo based navigational assistant device for VI is introduced in [54]. The system offers
obstacle detection and feature localization capabilities using stereo reconstruction. The video
cameras are head mounted and the warnings are sent through vibro-tactile stimulation. The
micro vibration motors are put on a vest situated near the skin which makes the entire
framework invasive.
A stereo vision aerial obstacle detection system is introduced in [59]. The method builds a 3D map of the user's vicinity in outdoor environments, predicts the VI user's motion using a 6DOF egomotion algorithm and evaluates possible aerial obstacles in the next pose. Nothing is said
about the acoustic feedback sent to the user and no information about the distance to the object
is provided.
The stereo vision system introduced in [12] detects obstacles by threshold segmentation of
the scene saliency map. The 3D information of the obstacle is also computed. Finally, voice
messages are transmitted to the VI user.
Although many stereo-vision-based systems have been introduced, such as [12, 14, 15, 48, 54, 59, 60], some inherent problems still need to be solved. First, stereo-matching algorithms fail to estimate large depths correctly, especially in weakly textured regions. Second, the quality and accuracy of the estimated depth map are sensitive to artifacts in the scene and to abrupt changes in the illumination.

1.1.3 RGB-D camera systems

More recently, the emergence of RGB-D cameras enabled a new family of VI guidance systems exploiting such technologies. An RGB-D camera provides a depth map of the whole scene in real time, as well as an RGB color map. Therefore, it may be used
conveniently for both object detection and scene understanding purposes.
By using a Kinect depth sensor combined with acoustic feedback, the KinDetect system [35] aims at detecting obstacles and humans in real time. Obstacles situated at the level of the head or feet can be identified by processing the depth information on a backpack computer. However, because regular headphones are used to transmit acoustic warnings, the user's ears are always occupied.
The system introduced in [64] can recognize a 3D object from depth data generated by the
Kinect sensor. The VI users are informed not only about the existence of the detected
obstacles, but also about their semantic category (chairs and upward stairs are here supported).
In a similar manner, the framework proposed in [8] identifies nearby structures from the depth
map and uses audio cues to convey obstacle information to the user.
In [56], 3D scene points recovered from the depth map (in indoor and outdoor scenarios) are classified as belonging either to the ground or to an object, by estimating the ground plane with the help of a RANSAC filtering technique. A polar accumulative grid is then built to represent the
scene. The design is completed with acoustic feedback to assist visually impaired users. The
system was tested on real VI users and satisfies the hands-free and ear-free constraints.
The RGB-D camera depends on emitted infrared rays to generate a depth map. In outdoor
environments, the infrared rays can be easily affected by sunlight. Therefore, guidance systems
developed using RGB-D cameras can only be used in indoor environments, which limits their range of use in mobility aid systems. Moreover, due to the limited computational resources,
the development of accurate and dense depth maps is still expensive.
The analysis of the state of the art shows that the existing systems have their own
advantages and limitations but do not meet all the features and functionalities needed by VI
users. Existing systems focus on automatic obstacle detection, without proposing a joint
detection/recognition that can provide a valuable feedback to the users with respect to the
surrounding environment. In addition, motion information, which is essential for the compre-
hension of the scene dynamics, is not available. Concerning the localization capabilities, most techniques still rely on GPS-based approaches, in the case of outdoor positioning, or include a
priori knowledge (e.g., detailed maps) for more specific indoor scenarios.
In recent years, the emerging deep learning strategies showed promising results in various
computer vision/image classification areas, including object detection and recognition and 2D/
3D scene reconstruction.

1.1.4 Emerging deep-learning strategies

The deep learning approach can be interpreted as a method of hierarchical learning that uses a
set of multiple layers of representation to transform data to high level concepts [23]. At each
individual layer of transformation, higher level features are derived from the lower level
features, leading to a hierarchical representation of information. Let us analyze how the
emerging deep learning approaches have been considered for VI-dedicated applications.
In [69], a system is introduced for automatic understanding and representation of image
content. The system relies on a generative model that updates the system memory based on a non-linear function and on an image vocabulary determined using Convolutional Neural Networks (CNN). Even though the system returns good results on traditional image datasets such as PASCAL, Flickr30k or SBU, no experiments were conducted on video datasets.
Furthermore, the authors provide no information regarding the computational complexity
which is of crucial importance when considering embedded VI applications.
In [70], the authors introduce a navigation assistant for an unknown environment based on
SLAM estimation and text recognition in natural scenes. The system improves the traditional
SLAM camera pose and scene estimation by integrating real-time text extraction and repre-
sentation for quick rejection of candidate regions. The text extraction method is based on a deep learning network, trained with 10^7 low-resolution input examples.
A mobility assistant for VI users, completely integrated on a mobile device, with face
detection, gender classification and sound representation of images is introduced in [10]. The
system uses the video camera integrated on the smartphone to capture pictures of the
environment in order to detect faces by using CNN. Then, for gender estimation the authors
use Logistic Regression as the classification method on feature vectors from the CNN.
However, no validation of the system with real VI users has been considered. In addition, no information regarding the transmission of acoustic signals to users is presented.
Concerning the object detection techniques, various deep learning-based approaches have
been introduced [21, 26, 63]. In [63] the authors propose replacing the last layer of a deep convolutional network with a regression layer in order to estimate the object’s bounding box, while in [26] a bottom-up object detection system based on a deep model is introduced.
Similarly, in [21] a saliency-inspired deep neural network is proposed.
The deep learning approaches offer highly interesting perspectives of development. The
main issue that still needs to be solved relates to the computational complexity, which conditions the relevance of such approaches within the context of real-time, embedded
VI-dedicated systems. The emergence of new smartphones equipped with powerful graphical
boards (e.g. NVIDIA TX1) may offer a solution to this problem and enable in the future the
deployment of such deep learning techniques for real-time object detection and classification
purposes, within various contexts of application.
As a general conclusion on the above state-of-the-art methods, we can say that the difficulty is not in developing a system that has all the "bells and whistles", but in conceiving a technology that can last over time and remain useful. For the moment, the VI users cannot be completely confident about the robustness, reliability or overall performance of the existing prototypes. Any new technology should be designed not to replace the cane or the guide dog, but to complement them by alerting the user to obstacles a few meters ahead and by providing guidance.
Our work has notably been carried out within the framework of the European project
ALICE (www.alice-project.eu), supported by the AAL (Ambient Assisted Living) program.
The ALICE project had as its ambitious objective the development of a VI-dedicated navigational
assistant. The main features of the ALICE navigational assistant are briefly described in the
following section.

1.2 The ALICE framework

ALICE aims at offering to visually impaired users a cognitive description of the scenes they are evolving in, based on a fusion of perceptions gathered from a range of sensors, including image/video, GPS, audio and other mobile-embedded sensors. The ALICE system, illustrated in Fig. 1, is composed of
a regular smartphone attached to a chest mounted harness and bone conduction headphones.

Fig. 1 The hardware components of ALICE device

The harness has two major roles: it makes it possible to satisfy the hands-free requirement
imposed by the VI and improves the video acquisition process, by reducing the instabilities
related to cyclic pan and tilt oscillations. The system can be described as a wearable and user-friendly device, ready to be used by the VI without any training.
The proposed solution is low-cost, since it does not require any expensive, dedicated
hardware architecture, but solely general public components available at affordable prices on
the market.
In addition, the system is also non-intrusive, satisfying the hands-free and ears-free
requirements imposed by VI users.
The main functionalities offered by ALICE are the following:

– Real-time detection of obstacles and moving objects (cars, pedestrians, bicycles),
– Automatic identification of crossings, traffic lights…
– Landmark recognition and specification of annotated itineraries,
– Precise localization through enhanced GPS navigation techniques,
– Adapted human-machine interfaces: non-invasive feedback with minimum verbalization, and enactive, earconic/haptic signals.

Within this framework, our developments notably concern the computer vision-related capabilities integrated in the ALICE device (Fig. 2). The following section summarizes the proposed contributions.

Fig. 2 Computer vision capabilities of the ALICE system

1.3 Contributions

The main contributions presented in this paper concern:

– an obstacle detection method (Section 2.1), based on apparent motion analysis. The
originality of the approach comes from the whole chain of apparent motion analysis
proposed. Semi-dense interest point extraction, motion-based agglomerative clustering,
and motion description are the key ingredients involved at this stage. The method makes it
possible to reliably detect both static and dynamic obstacles, without any a priori knowl-
edge of the type of object considered. Moreover, the motion analysis makes it possible to
acquire useful information that is exploited for prioritizing the alerts sent to the user.
– an object recognition/classification approach (Section 2.2), which introduces the concept
of relevant interest points extraction, adaptive HOG descriptors and shows how it can be
exploited in a dedicated BoVW / VLAD image representation. The strong point of the
method relates to its ability to deal with multiple categories, without the need of using different templates and/or sliding windows. The object detection and recognition method
is able to run in real-time.
– a landmark recognition module (Section 3), dedicated to the improvement of the GPS
accuracy localization through computer vision methods. The main originality of the
approach concerns the two-step matching procedure, which makes it possible to benefit from both the FLANN matcher's search speed and the BruteForce matcher's consistency. Remarkably, the system can work entirely on a smartphone in off-line mode, with query response times of about 2 s (for 10 to 15 landmarks).
– an indoor object localization module (Section 4). The user has the possibility to pre-learn a
set of objects of interest. At the run-time stage, a detection and tracking technique makes it
possible to detect and identify such objects in cluttered indoor scenes. The main
contribution concerns the spatial layout verification scheme which makes it possible to
achieve robustness without impacting the computational burden.

All the methods involved were specifically designed and tuned under the constraint of
achieving real-time processing on regular smartphones. To the best of our knowledge, no other state-of-the-art approach can offer such a complete set of computer vision functionalities
dedicated to VI assistance and adapted to real-time processing on light devices.
The validation of the proposed methodology is presented in Section 5. We have objectively
evaluated each of the involved methods on ground truth data sets and with recognized
performance measures (Sections 5.1, 5.2 and 5.3).
Let us first detail the obstacle detection/recognition approach, which is the core of the
proposed methodology.

2 Obstacle detection and classification

When navigating in urban environments, the user can encounter a variety of obstacles, which
can be either static (e.g., objects in the street that can cause injuries and should be avoided) or
dynamic (e.g., other pedestrians, vehicles, bicycles…). In order to ensure a safe navigation, it
is important to be able to detect such obstacles in real-time and alert the user. Our approach
makes it possible to both detect such elements and to semantically interpret them.
Let us first describe the obstacle detection method proposed.

2.1 Static and dynamic object detection

We start by extracting interest points regularly sampled over the video frame. Let us mention that we have also considered more powerful, content-related interest point extractors such as SIFT [44] and SURF [5]. However, we have empirically observed that, generally, in outdoor environments the background gathers a significantly higher number of interest points than the obstacles/objects themselves.
Gauglitz et al. present in [25] a complete evaluation of interest point detectors (i.e., corner, blob and affine-invariant detectors) and feature descriptors (i.e., SIFT, PCA-SIFT, SURF) in the context of visual tracking. The evaluation is performed on a video dataset of planar textures, which includes inconsistent movement, different levels of motion blur, geometric changes
(panning, rotation, perspective distortion, zoom), and lighting variation (static and dynamic).
In the context of obstacle detection the following conclusions can be highlighted:

– The execution time when tracking interest points between consecutive frames and random frame pairs is significant (about 100 ms for SIFT).
– For increased temporal distances between images, the repeatability of all detectors decreases significantly, which poses a problem for object tracking.
– Large camera motion leads to strong variations in the background. Consequently, the neighborhoods of interest points between adjacent images can be significantly different.
– None of the detectors copes well with the increased noise level of darker, static lighting conditions.
– Furthermore, in the case of low-resolution videos or weakly textured regions, the SIFT or SURF detectors extract a reduced number of interest points.

Based on this analysis, we have privileged a semi-dense sampling approach, which fits
the computational complexity requirements without degrading the detection performances.
A uniform grid is constructed. The grid step is defined as Γ = (W · H) / Npoints, where W and H are the dimensions of the image and Npoints is the maximum number of points to be considered.
The value of the parameter Npoints determines a trade-off between detection accuracy and computational speed. In our case, a good compromise has been achieved for a value of Npoints set to 1000 interest points, for videos acquired at a resolution of 320 × 240 pixels.
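As an illustration, the following minimal Python/NumPy sketch shows how such a semi-dense grid of candidate interest points could be generated. It is not the authors' implementation; in particular, interpreting the grid step as the square root of the average image area per point is our own assumption, and the function name is hypothetical.

```python
import numpy as np

def grid_interest_points(width=320, height=240, n_points=1000):
    """Semi-dense sampling: place candidate interest points on a uniform grid.

    The grid step is derived here from the image area and the maximum number
    of points (interpreted as the square root of the area per point)."""
    step = int(np.sqrt((width * height) / float(n_points)))  # ~8-9 px for 320x240 and 1000 points
    xs = np.arange(step // 2, width, step)
    ys = np.arange(step // 2, height, step)
    grid = np.array([(x, y) for y in ys for x in xs], dtype=np.float32)
    return grid.reshape(-1, 1, 2)  # shape expected by cv2.calcOpticalFlowPyrLK

points = grid_interest_points()
print(points.shape)  # roughly 1000 grid positions
```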
In order to identify static or dynamic obstacles, we consider a motion-based analysis
approach. Thus, the objective is to determine all objects that exhibit an apparent motion
different from the background. This makes it possible to identify moving objects, but also
static objects (e.g., obstacles) that appear in the foreground while the user is moving within the
scene.
First, we need to determine the interest point displacements (e.g., 2D apparent motion
vectors) between successive frames. To this purpose, we retained the multiscale Lucas-Kanade
algorithm (LKA) [45]. The main limitations of the LKA come from its underlying brightness constancy and spatial coherence assumptions.

Let us note that more recent methods such as [6] are able to increase the estimation
accuracy and become robust to abrupt changes in the illumination. However, in our case
where the computational burden is an important constraint, we cannot adopt this strategy.
Thus, we prefer to exploit a "relatively good" estimation of the motion vectors, rather than a highly accurate one that would compromise the real-time capability. As we will see in the following,
when combined with a motion clustering approach, this is sufficient to ensure high detection
performances.
The LKA tracking process is initialized with the set of interest points of the uniform
grid considered. Then, these points are tracked between successive images. However, in
practice the LKA cannot determine a trajectory for all interest points (e.g., when the video
camera is moving, obstacles disappear or new objects appear). So, when the density of points in a certain area of the image falls below the grid resolution, we locally reinitialize the tracker with points from the grid. The new points are then assigned to
existing objects.
Let us denote by p1i = (x1i, y1i) the ith keypoint in the reference image and by p2i = (x2i, y2i) its correspondent, determined with the LKA in the successive frame. The associated motion vector vi = (vix, viy) is also expressed in polar coordinates, with angular value θi and magnitude Di.
The availability of the motion vectors makes it possible to determine first the global motion
of the scene, modeled as a homographic transform between successive frames.

2.1.1 Global motion estimation

We robustly determine the global homographic transform (H) between adjacent frames with
the help of the RANSAC (Random Sample Consensus) [38] algorithm. If we consider a
reference point expressed in homogeneous coordinates p1i = [x1i, y1i, 1]^T and a homographic matrix H, we can estimate the new position of the point p2i^est = [x2i^est, y2i^est, 1]^T in the successive frame.
For each interest point, we compute the L2 distance between the estimated position p2i^est and the tracked position p2i of that interest point (determined using the LKA):

$$ E(p_{1i}, H) = \left\| p_{2i}^{est} - p_{2i} \right\| \qquad (1) $$

In order to determine the background interest points, we compare E(p1i, H) to a predefined threshold ThBG. The interest points for which this distance is below the threshold are marked as inliers (i.e., belonging to the background), while the outliers represent keypoints associated with the various moving objects present in the scene (i.e., foreground objects). In our experiments we fixed the background/foreground separation threshold ThBG to 2 pixels.
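A minimal Python/OpenCV sketch of this tracking and background/foreground separation step is given below. It only illustrates the chain described above (pyramidal Lucas-Kanade tracking, RANSAC homography, reprojection-error thresholding with ThBG = 2 pixels); the function name and structure are our assumptions, not the original code.

```python
import cv2
import numpy as np

TH_BG = 2.0  # background/foreground separation threshold, in pixels

def separate_background(prev_gray, curr_gray, prev_pts):
    """Track grid points with pyramidal Lucas-Kanade, estimate the global
    homography with RANSAC and split the keypoints into background inliers
    and foreground outliers based on the reprojection error E(p1i, H)."""
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, prev_pts, None)
    ok = status.ravel() == 1
    p1, p2 = prev_pts[ok], curr_pts[ok]

    # Global motion: homographic transform between successive frames (RANSAC)
    H, _ = cv2.findHomography(p1, p2, cv2.RANSAC, TH_BG)

    # Reprojection error E(p1i, H) = || H*p1i - p2i ||
    p1_est = cv2.perspectiveTransform(p1, H)
    err = np.linalg.norm(p1_est - p2, axis=2).ravel()

    background = p2[err < TH_BG]   # inliers: consistent with the camera motion
    foreground = p2[err >= TH_BG]  # outliers: candidate moving/static obstacles
    return H, background, foreground
```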
In outdoor scenes, multiple moving objects can be encountered. For this reason, we focused
next on the detection of foreground objects.

2.1.2 Foreground object identification

Let us note that, due to the apparent motion induced by the user's movement, even static obstacles situated in the foreground can act like moving objects relative to the background. So, we further cluster the
set of outlier points in different classes of motion. To this purpose, we exploit an agglomerative
clustering technique described in the following.

The principle consists of considering first each interest point as an individual cluster. Then,
adjacent clusters are successively merged together based on a motion similarity criterion. The operation stops when no remaining interest point satisfies the similarity constraint. The sensitivity of
the method is notably determined by the considered similarity measure between interest point
motion vectors assigned to different clusters. In this paper, we propose the following strategy:

– Phase I – Construct the histogram of the motion vectors' angular coordinates. To this purpose, the angular coordinates are quantized as integer degrees (from 0° to 360°). An interest point arbitrarily chosen among the points with the most frequent angular value determines a new motion cluster MCl. Let θ(MCl) denote its angular value.
– Phase II – For all the keypoints not yet assigned to any cluster, compute the angular deviation by taking the cluster value as reference:

$$ \delta(\theta_i, \theta(MC_l)) = \left| \theta_i - \theta(MC_l) \right| \qquad (2) $$

If the angular deviation δ(θi, θ(MCl)) is inferior to a predefined threshold ThAD and if the corresponding motion magnitudes are equal, then the ith point is assigned to the MCl cluster.
Let us note that the motion magnitude values are here quantized to their nearest integer value,
for comparison purposes. For the remaining outlier interest points, the process is repeated
recursively until all the points are assigned to a motion class. In our experiments, we set the
grouping threshold ThAD to 15°.
To each motion cluster, a centroid point is assigned, defined as the point of the cluster whose motion vector angular coordinate is the median value over the set of all points assigned to that cluster.
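The following is a minimal Python sketch of the two-phase grouping, assuming ThAD = 15° as stated above and quantized magnitudes; it omits the centroid computation and the subsequent k-NN spatial verification, and the function name is our own.

```python
import numpy as np

TH_AD = 15.0  # angular grouping threshold, in degrees

def cluster_by_motion(angles_deg, magnitudes):
    """Greedy agglomerative grouping of foreground keypoints by motion.

    Phase I: seed a cluster with a point having the most frequent (integer)
    angular value. Phase II: attach the points whose angular deviation is
    below TH_AD and whose quantized magnitude equals that of the seed.
    Repeat until all points are assigned to a motion class."""
    angles = np.asarray(angles_deg, dtype=float)
    mags = np.rint(magnitudes).astype(int)
    labels = -np.ones(len(angles), dtype=int)
    cluster_id = 0
    while np.any(labels == -1):
        free = np.where(labels == -1)[0]
        # Phase I: most represented integer angle among the unassigned points
        hist_angles = np.rint(angles[free]).astype(int) % 360
        seed_angle = np.bincount(hist_angles, minlength=360).argmax()
        seed = free[hist_angles == seed_angle][0]
        labels[seed] = cluster_id
        # Phase II: angular deviation + equal quantized magnitude
        dev = np.abs(angles[free] - angles[seed])
        dev = np.minimum(dev, 360.0 - dev)  # wrap-around angular deviation
        attach = free[(dev < TH_AD) & (mags[free] == mags[seed])]
        labels[attach] = cluster_id
        cluster_id += 1
    return labels
```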
In a final stage, the k-NN algorithm [73] is applied in order to verify the spatial consistency of the determined motion classes. Thus, we determine for each point its k nearest neighbors using the Euclidean distance (between their corresponding spatial positions). If at least half of these neighbors do not belong to the same motion class, we consider that the point's assignment to the current cluster is due to an error in the grouping process.
Consequently, the point is removed from the motion class and assigned to the background.
In cluttered outdoor environments objects can disappear, stop or be occluded for a period of
time. In such situations, incomplete trajectories can be obtained or even worse, the same object
can be identified as a new entity in the scene. In order to deal with such situations, the object
detection process is reinforced with a multi-frame, long term fusion scheme. By saving the
object location and its average velocity within a temporal sliding window of size Twindow, we can predict its new, globally motion-compensated position, as described in Eq. (3):

$$ p_i(t_j) = p_i(t_{j-1}) + \frac{1}{T} \sum_{k=1}^{T} \left( v_i(t_{j-k}) - p_i^{est}(t_j) \right) \qquad (3) $$

where pi(tj) is the position of the ith interest point at frame tj, vi is its motion vector (velocity), and pi^est(tj) is the position estimated from the camera motion, obtained by applying the current homographic transform H (cf. Section 2.1.1) to pi(tj−1).
So, when a previously detected object is occluded, a discontinuity appears in its trajectory. By estimating the object's position in the frames where it is hidden, we can determine, at the moment it re-enters the scene, that it corresponds to an obstacle already detected.
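A literal, minimal sketch of the prediction of Eq. (3) is shown below; the bookkeeping of the sliding window and the helper name are our assumptions, and the camera-motion term is assumed to have been obtained by applying H to the previous position, as described above.

```python
import numpy as np

def predict_position(prev_pos, velocity_history, est_from_camera):
    """Predict the position of a temporarily occluded object (Eq. 3):
    previous position plus the average, over the sliding window of size T,
    of the stored velocities corrected by the camera-motion estimate."""
    v = np.asarray(velocity_history, dtype=float)        # shape (T, 2): v_i(t_{j-k})
    correction = v - np.asarray(est_from_camera, float)  # v_i(t_{j-k}) - p_i^est(t_j)
    return np.asarray(prev_pos, dtype=float) + correction.mean(axis=0)
```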
Once the obstacles are identified, we determine their degree of danger and classify them
accordingly, as described in the following section. Let us underline that no a priori knowledge
about the size, shape or position of an obstacle is required.

2.1.3 Motion-based object description

This stage is performed by using the object position and motion direction, relative to the VI
user. For each motion cluster, the analysis is performed on the corresponding centroid point
previously determined. A global motion compensation procedure is first applied, in order to
characterize the centroid movement independently of the camera movement.
Let P1 denote the centroid position at frame N and P2 its motion-compensated position at
frame N + 1. We considered a reference point P0 as the camera focus of attention defined by
convention in the middle of the bottom row of each frame (Fig. 3).
Then, we compute the object's angular displacement α = ∠(P0, P1, P2) (Fig. 3). The object is labeled as approaching (AP) if the angle α is inferior to a specified threshold (ThAP/DE). Otherwise, the user is considered to be moving away from the obstacle, or the object itself is departing (DE). The ThAP/DE parameter helps to perform a first, preliminary classification
based on the degree of dangerousness of various objects existent in the scene. A higher value
of ThAP/DE threshold will signify that a larger set of obstacles will be considered as ap-
proaching the user and vice-versa.
Because the human body might shake or slightly rotate over time, we included a reinforcement strategy based on temporal motion consistency. So, by saving the object directions α within the temporal sliding window of size Twindow (cf. Section 2.1.2), we can predict and verify its new directions relative to the general object movement. The AP/DE decision is then taken as the majority label detected in the considered sliding window. In our experiments, we have tested values of the ThAP/DE parameter in the interval [32, 40] and obtained equivalent performances (±5 % of objects detected as approaching/departing). We finally set ThAP/DE to 45°, which leads to reasonable performances in a majority of situations.
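For illustration, a short Python sketch of this decision is given below, assuming that α is the angle at vertex P1 and that ThAP/DE = 45°; the function names and the example coordinates are hypothetical.

```python
import numpy as np
from collections import deque, Counter

TH_AP_DE = 45.0  # degrees

def displacement_angle(p0, p1, p2):
    """Angle at P1 formed by the reference point P0 (bottom-centre of the
    frame) and the motion-compensated centroid positions P1 -> P2."""
    a = np.asarray(p0, float) - np.asarray(p1, float)
    b = np.asarray(p2, float) - np.asarray(p1, float)
    cos_a = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def ap_de_decision(angle_history):
    """Majority vote over the temporal sliding window: AP (approaching)
    if the angle is below the threshold, DE (departing) otherwise."""
    labels = ['AP' if a < TH_AP_DE else 'DE' for a in angle_history]
    return Counter(labels).most_common(1)[0][0]

window = deque(maxlen=10)  # sliding window of recent angles (illustrative size)
window.append(displacement_angle((160, 240), (150, 170), (148, 160)))
print(ap_de_decision(window))
```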
We propose to use a trapezium region projected onto the image in order to define the user’s
proximity area.
For video acquisition, we used the camera embedded on a regular smartphone, with an angle of view α = 69°. The smartphone is attached to the user at an average elevation E of 1.3 m.

Fig. 3 Obstacle direction estimation



For the trapezium, the height is set to a third of the total image height (i.e., the ST segment in Fig. 4). We can establish the distance between the user and the bottommost pixel of the trapezium as the RT segment (Fig. 4):

$$ RT = E / \tan(\alpha/2) = 1.85 \; \mathrm{m} \qquad (4) $$

Nevertheless, the size of the trapezium can be adjusted by the user in a pre-calibration step. A warning message will be generated only for obstacles situated at a maximum distance of about five meters from the user:

$$ RN = RT + TN = 1.85 + 2 \cdot E / \tan(\alpha/2) \cong 5.5 \; \mathrm{m} \qquad (5) $$

Fig. 4 Real distance estimation

An obstacle is marked as urgent (U) if it is situated within the proximity area of the visually impaired person. Otherwise, if located outside the trapezium, the obstacle is categorized as non-urgent or normal (N). By employing the proximity area, we prevent the system from continuously warning the subject about every object present in the scene: a warning is issued only for objects situated in the urgent region.
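The arithmetic of Eqs. (4)-(5) and the urgent/normal labeling can be summarised in the following hedged Python sketch; it assumes the obstacle's ground distance and its position relative to the trapezium have already been estimated, and the function names are our own.

```python
import math

def proximity_limits(elevation_m=1.3, fov_deg=69.0):
    """Distances delimiting the proximity area (Eqs. 4 and 5):
    RT is the distance to the bottom of the trapezium, RN the maximum
    distance at which warnings are generated."""
    half_fov = math.radians(fov_deg / 2.0)
    rt = elevation_m / math.tan(half_fov)              # Eq. (4), about 1.85-1.9 m
    rn = rt + 2.0 * elevation_m / math.tan(half_fov)   # Eq. (5), about 5.5 m
    return rt, rn

def classify_obstacle(distance_m, inside_trapezium):
    """Urgent (U) if inside the proximity area, normal (N) otherwise."""
    rt, rn = proximity_limits()
    if inside_trapezium and distance_m <= rn:
        return 'U'
    return 'N'

print(proximity_limits())  # approximately (1.9, 5.7) for E = 1.3 m and alpha = 69 degrees
```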
The downside of this assumption is that warnings are suppressed for dynamic objects (e.g., vehicles) approaching the user very fast, or for obstacles situated high, at head level, such as tree branches, arcades or banners. To avoid such situations it is necessary to distinguish
and recognize the various types of objects. Using this information we can then generate
warnings for objects situated outside the proximity trapezium, whenever such action is
required. To achieve this purpose, we propose an obstacle recognition/classification method,
further described in Section 2.2.
Let us underline that the above-described approach depends only weakly on the technical characteristics of the considered smartphone and can be easily implemented on various mobile devices. In our case, we have successfully tested the approach on multiple devices, including the LG G3, HTC One and Samsung Galaxy S2, S3 and S4. Actually, ALICE can be optimally integrated on any mobile device running Android as its operating system, with a processor faster than 1.3 GHz and 2 GB of RAM. The angle of view of a video camera embedded on a regular smartphone is greater than 60° in most cases.
Let us also note that the VI user's height is a parameter that has little effect on the overall system performance. Our only constraint is to attach the smartphone at an average elevation of about 1.3 m, which can be adjusted with the help of the chest mounted harness. If this constraint cannot be satisfied, the static obstacles will be detected at a maximum distance
relative to the user of less than five meters. For dynamic objects, the smartphone elevation has no effect on the detection efficiency.

2.2 Obstacle classification

Each frame of the video stream can be considered as a hierarchical structure with increasingly
higher levels of abstraction. The objective is to capture the semantic meaning of the objects in
the scene. In this framework, we have considered the following four major categories:
vehicles, bicycles, pedestrians and static obstacles. A training dataset has been constituted
for learning purposes, extracted from the PASCAL repository [22]. The training set includes
4500 images as follows: 1700 vehicles, 500 bicycles, 1100 pedestrians and 1200 static
obstacles. Some samples are illustrated in Fig. 5.

Fig. 5 Samples from the considered training set
The considered categories were selected according to the most often encountered obstacles
in outdoor navigation. When creating a vocabulary of visual words an important concern is the
size and the choice of data used to construct it. As indicated in the state of the art [29], the most
accurate results are obtained when developing the vocabulary with the same data source that is
going to appear in the classification task. However, in the ALICE framework, the categoriza-
tion phase can be considered as a more focused task of recognition of a relatively reduced set
of specific objects, rather than a generic classification task. Because the number of categories is
known in advance, we have sampled the descriptors from the training data to have a good
coverage over all the 4 considered classes. To this purpose, we used 3300 image patches
selected from the PASCAL corpus, enriched with 1200 image patches representing: fences,
pylons, trees, garbage cans, traffic signs, overhanging branches, edge of pavements, ramps,
bumps, steps… selected from our own dataset. We can naturally expect that an increase of the
training dataset (e.g. Imagenet database) can: (1) enhance the performances of the classifica-
tion stage, and (2) be useful for a finer categorization into an extended number of categories
(notably by refining the obstacle class into sub-categories). However, for the time being we
focused solely on the 4 categories retained and thus considered a reduced training set. In order
to deal with the situation when the training images do not perfectly match the images captured
in real life, we introduce an extra Outlier category, which gathers the image patches that cannot
be reliably assigned to one of the 4 categories.
The proposed obstacle classification framework is illustrated in Fig. 6.

Fig. 6 Obstacle recognition/classification

(1) Firstly, for each image in the dataset low level image descriptors are extracted (i.e.,
relevant interest points or A-HOG).
(2) Then, by using these descriptors an unsupervised learning step is performed in order to
create a vocabulary. Each descriptor is mapped to the nearest word in order to develop a
global image representation. Two different approaches are here retained. The first one


concerns the Bag of Visual Words (BoVW) representation built upon low level descrip-
tors, while the second adopts the Vector of Locally Aggregated Descriptors (VLAD) methodology.
(3) In the final stage the image patch classification is performed. Here we adopted a strategy
based on Support Vector Machines (SVM) with Radial Basis Functions (RBF) kernels.

2.2.1 Feature extraction and description

Relevant interest points All images in the dataset are mapped onto a common format by
resizing them to a maximum of 16 k pixels, while preserving the original aspect ratio. For each
image, we extract interest points using the pyramidal FAST [67] algorithm. Then, we have privileged a simple, semi-dense sampling approach, which fits the computational complexity requirements without degrading the retrieval performances.
After the interest points are extracted we overlap a regular grid onto the image. Then, we
propose to characterize each rectangle of the grid using only one key-point. Instead of applying
a traditional interest points selection strategy (most often, the center of each grid cell), we
propose to determine which points are within the rectangle, filter and rank them based on the
Harris algorithm [31] and then select the most representative one (the one with the highest
value of the Laplacian operator). The process is controlled by the grid step parameter, defined as Γ = (W · H) / Npoints, where W and H are the dimensions of the image and Npoints is the maximum number of points.
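A minimal Python/OpenCV sketch of this grid-based selection is given below. It uses plain FAST rather than its pyramidal variant and ranks keypoints by their detector response as a simplified stand-in for the Harris/Laplacian ranking described above; the function name and the defaults are our assumptions.

```python
import cv2
import numpy as np

def relevant_interest_points(gray, n_points=400):
    """Detect FAST keypoints, overlay a regular grid and keep, for each
    grid cell, the keypoint with the strongest response (a simplified
    stand-in for the Harris/Laplacian ranking), then describe the
    retained points with SIFT as in the proposed pipeline."""
    h, w = gray.shape
    step = max(1, int(np.sqrt((w * h) / float(n_points))))
    keypoints = cv2.FastFeatureDetector_create().detect(gray)
    best = {}
    for kp in keypoints:
        cell = (int(kp.pt[0]) // step, int(kp.pt[1]) // step)
        if cell not in best or kp.response > best[cell].response:
            best[cell] = kp
    selected = list(best.values())
    _, descriptors = cv2.SIFT_create().compute(gray, selected)
    return selected, descriptors
```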
Figure 7 illustrates the results obtained on a given image patch with the proposed interest point extraction method (Fig. 7c), the traditional FAST algorithm (Fig. 7a), and the FAST points filtered with the Harris-Laplacian procedure (Fig. 7b).

Fig. 7 Interest point extraction using: a. traditional FAST, b. FAST filtered using Harris-Laplacian, c. proposed technique
Let us underline that the same number of interest points is retained for the examples in
Fig. 7b and c. As it can be observed, the set of interest points obtained after applying the
proposed method is more uniformly distributed within the image patch and thus potentially
better suited to capture the underlying informational content. The retained interest points are
further described using the SIFT descriptor [44].
Let us note that the technological advances of mobile devices have led in recent years to a
variety of dedicated light-weight, binary interest point descriptors, adapted to the computa-
tional capacities of such devices. Among the most representative approaches, let us mention
the BRISK (Binary Robust Invariant Scalable Keypoints) [40], BRIEF (Binary Robust
Independent Elementary Features) [39], ORB (Oriented FAST and Rotated BRIEF) [58],
and FREAK (Fast REtinA Keypoint) [1] descriptors. Apart from their fast extraction proce-
dures, such descriptors offer very quick matching times with the help of distance measures adapted to binary values, such as the Hamming distance. This actually represents the biggest
advantage of binary descriptors, as it replaces the more costly Euclidean distance. The major
drawback of such approaches concerns the matching performances that are significantly lower
than those of the SIFT descriptor, in particular when the scale variations of objects become
significant (less than 0,5 or greater than 2) [1].
For this reason, in our work we have retained the standard SIFT descriptors which ensure
high matching performances, reproducibility and scale invariance at a computational cost that
can still fulfill the real-time constraint required by our application.

Adaptive HOG descriptor extraction In the traditional HOG [16] approach, an image
I(x,y) is divided into a set of non-overlapping cells. For each cell, a 1D histogram of gradient
directions is computed. Next, a normalization step is performed on every cell block in order to
form a histogram that is invariant to shadows or changes in illumination.
The traditional HOG descriptor was initially developed for human detection. In [16], the authors propose using an analysis window of 64 × 128 pixels for an accurate localization and recognition of pedestrians.
Some improvements are proposed by Rosa et al. [57], where the HOG descriptor is used for people detection by employing detection windows of fixed resolution (64 × 128 pixels or 64 × 64 pixels). In our case, such an approach is not suitable because our system is designed to detect objects with a high variability of instances and shapes.
Directly applying the traditional HOG method to our case would require constraining the
size of the image patch (extracted using the obstacle detection method described in Section 2.1
and representing the object’s bounding box) to a fixed resolution. However, a fixed resolution
of the analysis window would alter significantly the aspect ratio of the patch and thus the
corresponding HOG descriptors, with an impact on their discriminative power (Fig. 8b).
Consequently, the system would return high recall rates only for the pedestrian class.

Fig. 8 Low-level descriptor extraction: a image patch at the original resolution; b traditional HOG, c adaptive HOG (A-HOG)
Different research works [17] propose solutions for overcoming this limitation. The
principle consists of modifying the size of the patch to a pre-established value, appropriate
for each category (e.g., for bicycles 120 × 80 pixels, for cars 104 × 56 pixels). Even so, the high variability of instances considered in our case makes it impossible to select a specific resolution adequate for each element (e.g., garbage cans, traffic signs). Moreover, because of
the real-time constraint of our application it is intractable to use a multiple window size
decision approach.
In order to avoid such limitations, we introduce a novel version of the HOG descriptor, denoted adaptive HOG (A-HOG). The A-HOG approach dynamically modifies the patch resolution but conserves its original aspect ratio (Fig. 8c). We also limit the maximum number of cells for which we extract the descriptor. So, when the algorithm receives as input a new image patch, it starts by computing its associated aspect ratio. Then, it modifies the patch width and height as:

$$ w = \left[ \sqrt{ar \cdot ncell \cdot csize} \right], \qquad h = \left[ w / ar \right] \qquad (6) $$

where ar is the patch aspect ratio, ncell is the maximum number of cells for which we extract the HOG descriptor, csize is the dimension of a cell, and [·] denotes rounding to an integer value. The size of the patch is adapted in such a way as to meet both requirements: conserving the initial aspect ratio and matching the fixed number of cells imposed.
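A hedged Python/OpenCV sketch of the A-HOG idea follows. The resizing uses one possible reading of Eq. (6) (with the cell size contributing its square to the total area), and the cell/block configuration, default parameters and function name are our assumptions rather than the authors' settings.

```python
import cv2
import numpy as np

def a_hog(patch, ncell=64, csize=8, nbins=9):
    """Adaptive HOG: resize the patch so that it fits at most `ncell`
    cells of size `csize` while preserving its aspect ratio (Eq. 6),
    then compute a standard HOG descriptor on the resized patch."""
    h0, w0 = patch.shape[:2]
    ar = w0 / float(h0)
    w = int(round(np.sqrt(ar * ncell) * csize))  # one reading of Eq. (6)
    h = int(round(w / ar))
    # Snap to multiples of the cell size so the HOG grid is well defined
    w = max(2 * csize, (w // csize) * csize)
    h = max(2 * csize, (h // csize) * csize)
    resized = cv2.resize(patch, (w, h))
    hog = cv2.HOGDescriptor((w, h),                    # window size
                            (2 * csize, 2 * csize),    # block size
                            (csize, csize),            # block stride
                            (csize, csize),            # cell size
                            nbins)
    return hog.compute(resized)
```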
Whatever the low-level descriptors considered, either the SIFT descriptors of the relevant interest points or A-HOG, an aggregation procedure is required in order to obtain a global image representation.

2.2.2 Global image representation

In order to represent an image as a high-dimensional descriptor, we evaluated the two methods described below: BoVW and VLAD.

BoVW image representation In the classical BoVW (Bag of Visual Words) [13] framework, each image in the dataset can be described as a set of SIFT (associated to the detected interest points) or A-HOG descriptors Di = {di1, di2, …, din}, where dij represents the jth descriptor of the ith image (i.e., the descriptor associated to the jth interest point/cell of image Ii) and n is the total number of interest points/cells in an image.
The extracted descriptors are clustered with the help of the k-means algorithm [19], which makes it possible to obtain a codebook W of visual words.

An arbitrary visual word descriptor dij can then be mapped onto its nearest prototype w(dij) in the vocabulary W:

$$ w(d_{ij}) = \arg\min_{w \in W} \left\| w - d_{ij} \right\|_1 \qquad (7) $$

where ‖·‖1 denotes the L1 norm in the descriptor space.
This procedure makes it possible to represent each image in the dataset as a histogram of visual words. The total number of bins that compose the histogram is equal to the number of words K included in the vocabulary. Each bin bk represents the number of occurrences of the visual word wk in W: bk = Card(Dk), with

$$ D_k = \left\{ d_{ij},\; j \in \{1, \ldots, n\} \;\middle|\; w(d_{ij}) = w_k \right\} \qquad (8) $$

where Dk is the set of descriptors associated to a specific visual word wk in the considered image and Card(Dk) is the cardinality of the set Dk.
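A minimal Python sketch of the BoVW construction (Eqs. 7-8), using scikit-learn's k-means for the codebook, is shown below; the vocabulary size and the final normalization of the histogram are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=200):
    """Cluster the training descriptors with k-means to obtain the
    codebook W of K visual words."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)
    return km.cluster_centers_

def bovw_histogram(descriptors, vocabulary):
    """Map each descriptor to its nearest visual word using the L1 norm
    (Eq. 7) and accumulate the word occurrences (Eq. 8)."""
    k = vocabulary.shape[0]
    hist = np.zeros(k, dtype=np.float32)
    for d in descriptors:
        word = np.argmin(np.abs(vocabulary - d).sum(axis=1))  # L1 distance
        hist[word] += 1.0
    return hist / max(hist.sum(), 1.0)  # normalized occurrence histogram
```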

VLAD image representation The vector of locally aggregated descriptors (VLAD) [32] is
designed to characterize an image Ii in the dataset by the difference between its local features
Di = {di1, di2, …, din} and the codebook words (ck) learned off-line. Then, the residuals of all
the descriptors corresponding to the same visual word in the codebook are accumulated. So,
for an image, the original VLAD representation encodes the interest point / A-HOG descriptors as follows:

$$ v_k = \sum_{d_{ij} \in D_i,\; d_{ij} \cong c_k} \left( d_{ij} - c_k \right) \qquad (9) $$

where dij ≅ ck denotes the descriptors whose nearest codebook word is ck.

Each centroid ck in the codebook determines a vector of aggregated residuals. The final
VLAD signature v is determined by the concatenation of all residual vectors vk. The
VLAD dimension is given by the product between the codebook size and the descriptor
dimension.
As proposed by Delhumeau et al. [18], in order to reduce the influence of bursty features (caused by repetitive structures in the image) that might otherwise dominate the other descriptors, we apply a power-law normalization to the VLAD descriptor as a down-weighting factor:

$$ \tilde{v}_l = \mathrm{sign}(v_l) \cdot \left| v_l \right|^{\alpha} \qquad (10) $$

where α is the normalization parameter. It was suggested in [18] that α = 0.2 is a good choice.
To further improve the performance of power law normalization, the descriptor is transformed
to a different coordinate system using PCA, without any dimensionality reduction.
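A compact Python sketch of the VLAD encoding (Eq. 9) with power-law normalization (Eq. 10, α = 0.2) is given below; the final L2 normalization is a common convention we add for illustration, and the PCA rotation mentioned above is omitted.

```python
import numpy as np

def vlad(descriptors, codebook, alpha=0.2):
    """VLAD encoding: accumulate residuals of the descriptors with respect
    to their nearest codebook word (Eq. 9), apply the power-law
    normalization (Eq. 10) and L2-normalize the final signature."""
    k, d = codebook.shape
    v = np.zeros((k, d), dtype=np.float64)
    # Assign each descriptor to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    for i, c in enumerate(nearest):
        v[c] += descriptors[i] - codebook[c]
    v = v.ravel()
    v = np.sign(v) * np.abs(v) ** alpha  # power-law down-weighting (Eq. 10)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```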

2.2.3 Image classification

The final step of the obstacle categorization framework can be divided into two stages: an offline process (SVM training) and an online process (SVM prediction).
The offline process consists of a supervised learning strategy based on SVM (Support Vector Machine) training. The BoVW / VLAD image representation is fed into an SVM that uses a statistical decision procedure in order to differentiate between categories.

We adopted the strategy first introduced in [66], designed to find a separating hyperplane between two classes by maximizing the margin:

$$ \phi(x) = \mathrm{sign}\left( \sum_{i} y_i \alpha_i K(x, x_i) + b \right) \qquad (11) $$

where K is the SVM kernel, xi are the training features from the data set, yi is the label of xi, b is the hyperplane bias term, and αi is the weight associated with training sample xi, learned during training. In our case, we have adopted the RBF kernel and used the implementation available in [36].
The SVM training completes the offline process of our object classification framework.
In the online phase, for each image patch extracted using our obstacle detection method (Section 2.1), we construct the BoVW histogram / VLAD descriptor using either the A-HOG descriptors or the SIFT descriptors of the relevant interest points. The SVM classification is then performed to establish the object's class.
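For illustration, a minimal offline/online sketch with an RBF-kernel SVM is shown below using scikit-learn; the paper relies on the implementation of [36], so this is an equivalent stand-in, and the hyperparameter values are illustrative only.

```python
from sklearn.svm import SVC

# Offline stage: train an RBF-kernel SVM on the BoVW / VLAD signatures.
# X_train: one BoVW histogram or VLAD vector per training patch;
# y_train: labels (vehicle, bicycle, pedestrian, static obstacle, outlier).
def train_classifier(X_train, y_train, C=10.0, gamma='scale'):
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    clf.fit(X_train, y_train)
    return clf

# Online stage: label an image patch returned by the obstacle detection module.
def classify_patch(clf, signature):
    return clf.predict([signature])[0]
```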
The proposed technique requires a reduced computational power because we are not
performing an exhaustive sliding window search within the current frame in order to determine
objects and their associated positions.
In our case, the obstacle classification receives as input the location and size of the object
that we want to label, determined with the help of the object detection approach. The various
elements detected and recognized are finally transmitted and presented to the VI user, with the
help of an acoustic feedback.
The proposed object detection and classification framework strongly relies on a motion analysis process. However, it cannot handle the detection and classification of objects of interest that cannot be distinguished from the background on a motion basis alone. Such objects are nevertheless interesting to consider and exploit, notably in two scenarios. The first one
concerns the recognition of landmarks/ buildings in outdoor environments that can improve/
complement the GPS localization abilities. The second one concerns the identification of user-
defined objects of interest in indoor environments.
Let us note that in such less critical scenarios, the real-time condition can be slightly
relaxed, as long as the system ensures interactive response rates. Thus, a latency of several
seconds can be admitted for both landmark recognition and indoor object detection. For this
reason, we have adopted a different methodological framework, based on interest point
representations, which makes it possible to enrich the capabilities of the system.
The proposed solutions are presented in the following sections.

3 Landmark/building recognition

Although geo-localization solutions have been largely improved in recent years, orientation in
dense urban environments is still a complicated task. Since building façades provide key
information in this context, the visual approach has become a suitable alternative. Our prime
goal is to design a mobile application for automatically identifying buildings in a city.
Numerous attempts were made in solving the building recognition problem. Some of the
recent approaches [4] use textured 3D city-models. However, such 3D information is not
always available for all cities. Moreover, memory requirements exclude the possibility of a
mobile phone implementation. Gronàt et al. [30] suggest treating the place recognition
problem as a classification task. They use the available GPS tags to train a classifier for each

location in the database in a similar manner to per-example SVMs in object recognition. The
approach is efficient only when a big dataset and a large-sized vocabulary are used. On the
contrary, in our case, we propose a minimum number of images per class along with a very
small-sized vocabulary. Most of the approaches presented in the literature are based on feature
representations. In the work of Chen et al. [11], a large-sized vocabulary-tree was used and a tf-
idf indexing algorithm scores a reduced number of GPS-tagged images. Many similar approaches use a GPS-tag-based pruning method along with custom feature representations. In [2],
a rapid window detection and localization method was proposed. The building detection is
considered here as a pattern recognition task. In our work, we address this task as a
classification problem.
Our contribution is related to the way that a small-sized vocabulary tree is used, combined
with the spatial verification step. Instead of matching a query descriptor to its nearest visual
word, we perform the matching with respect to a reduced list of interest points extracted from
an image in the dataset. This approach makes it possible to overcome the dimensionality
problem, which is critical when considering mobile implementations.

3.1 The training phase

First, all the images in the training dataset are used to train a visual word vocabulary,
determined with the help of a k-means clustering algorithm [3] performed on the whole set
of SIFT descriptors [44] extracted from the training images. Each element in the vocabulary is
considered as a visual word consisting of a vector of length 128 (i.e., the length of a SIFT
descriptor). The size of the vocabulary was set to 4000 clusters. Then, for each labeled image
in the dataset, we extract Difference of Gaussians (DoG) [43] interest points and save their
corresponding SIFT descriptors. Then, we divide these vectors into relatively small groups.
Each group consists of the list of interest points corresponding to a certain visual word from
the vocabulary. The FLANN matcher [49] is used to determine the nearest neighbour (from the
considered vocabulary) for each interest point. In order to reduce quantization errors, we
propose to include the same interest point in the groups of the 3 nearest neighbours visual
words. Next, we save klandmark spatial neighbours for each interest point.
After tests with different values of klandmark we set klandmark to a value of 20, which offers a
good trade-off between computational cost and robustness. In fact, we found that with higher
values of klandmark the performance is consistent while the time of execution increases
exponentially. However, for values of klandmark lower than 20, the number of incorrect results
becomes penalizing. In addition, for each interest point we select the spatially nearest 20 points
and store their corresponding visual words.
At the end of the training phase, the following five components are recorded: the vocabulary file including the visual word vectors, the descriptors of all the interest points in the training set, the descriptor clusters grouped by visual word, the spatial neighbors of each interest point and, finally, the class name of each interest point.
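A simplified Python sketch of this training stage is given below. It follows the description above, but the concrete data structures and the use of OpenCV's FlannBasedMatcher, scikit-learn's MiniBatchKMeans and SciPy's cKDTree are our own assumptions (the sketch also assumes every image yields at least K_LANDMARK + 1 keypoints):

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import MiniBatchKMeans

VOCAB_SIZE, SOFT_ASSIGN, K_LANDMARK = 4000, 3, 20
sift = cv2.SIFT_create()

def train_landmark_model(images, labels):
    """images: grayscale training images; labels: one landmark class per image."""
    per_image = []
    for img, lbl in zip(images, labels):
        kps, desc = sift.detectAndCompute(img, None)     # DoG keypoints + SIFT descriptors
        per_image.append((np.float32([kp.pt for kp in kps]), desc.astype(np.float32), lbl))

    # 1) Visual vocabulary: k-means over the pooled SIFT descriptors.
    all_desc = np.vstack([d for _, d, _ in per_image])
    vocabulary = MiniBatchKMeans(n_clusters=VOCAB_SIZE).fit(all_desc).cluster_centers_.astype(np.float32)

    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    groups = {w: [] for w in range(VOCAB_SIZE)}          # interest point indices per visual word
    point_words, point_neighbours, point_labels, descriptors = [], [], [], []

    offset = 0
    for pts, desc, lbl in per_image:
        # 2) Soft assignment: each point joins the groups of its 3 nearest visual words.
        for i, matches in enumerate(flann.knnMatch(desc, vocabulary, k=SOFT_ASSIGN)):
            for m in matches:
                groups[m.trainIdx].append(offset + i)
            point_words.append(matches[0].trainIdx)      # primary visual word of the point
            point_labels.append(lbl)
        # 3) Spatial neighbours: indices of the 20 spatially closest points of the same image
        #    (their visual words can then be looked up through point_words).
        _, nn = cKDTree(pts).query(pts, k=K_LANDMARK + 1)   # +1: the query point itself is returned
        point_neighbours += [(offset + idx[1:]).tolist() for idx in nn]
        descriptors.append(desc)
        offset += len(pts)

    # The five recorded components: vocabulary, descriptors, groups per visual word,
    # spatial neighbours of each point and class label of each point.
    return vocabulary, np.vstack(descriptors), groups, point_neighbours, point_labels
```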

3.2 The test phase

In the test phase (Fig. 9), as an initialization step, we build a kd-tree over the visual words of the vocabulary using the FLANN approach.
For each query image, we extract DoG interest points and their corresponding SIFT
descriptors. The trained kd-tree is used as a first step to find the corresponding visual word for each interest point.

Fig. 9 Descriptor matching and spatial consistency scheme. Stars represent interest points and squares are visual words from the vocabulary

Then, a BruteForce matcher is applied in order to determine the most
similar descriptor from the cluster of descriptors corresponding to the same visual word. The BruteForce matcher computes the Euclidean distance between the query descriptor and each descriptor in the cluster.
This two-phase matching method allows us to benefit from both the FLANN matcher's search speed and the BruteForce matcher's consistency. In fact, the BruteForce search becomes computationally expensive only when the number of comparisons is too high.
In our case, the average number of descriptors in a given cluster is around 250, which ensures a relatively low computational cost.
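A possible sketch of this two-stage matching, reusing the structures produced by the training sketch above (a NumPy brute-force search is used here in place of OpenCV's BFMatcher; all names are illustrative):

```python
def match_query_descriptor(q_desc, vocabulary, groups, train_descriptors, flann):
    """Two-stage matching for a single query SIFT descriptor of shape (1, 128), float32.

    Stage 1: approximate search of the nearest visual word on the FLANN kd-tree.
    Stage 2: exhaustive (brute-force) Euclidean search restricted to the cluster of
             descriptors attached to that word (about 250 descriptors on average).
    Returns the index of the closest training descriptor, or None.
    """
    word = flann.knnMatch(q_desc, vocabulary, k=1)[0][0].trainIdx
    candidate_ids = groups[word]
    if not candidate_ids:
        return None
    distances = np.linalg.norm(train_descriptors[candidate_ids] - q_desc, axis=1)
    return candidate_ids[int(np.argmin(distances))]
```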

3.3 Spatial consistency check

To avoid mismatches, a spatial consistency algorithm is further proposed. The principle consists of evaluating the visual words associated with the k_NN-SC nearest spatial neighbors of each query interest point. Since we have already computed the corresponding visual word of each interest point in an earlier step, we check whether, among these visual words, there exists a minimum number of similarities within the spatial neighbors of the matched interest point. The same number of spatial neighbors as in the training phase is used here. A candidate point is considered spatially consistent if it has more than k_NN-SC/4 similarities. Each interest point that fails the spatial consistency test is considered irrelevant. Otherwise, a vote is cast for its corresponding class label.
As a final step, a class histogram collecting the different scores obtained is constructed.
Assuming that an image can only be labeled with one class or none, we require the confidence measure of the top-ranked class to be greater than a predefined threshold (Th_top-ranked). This measure is defined as the ratio between the best score and the number of keypoints
in the image. If the top ranked class has a confidence measure less than the fixed threshold, we
assume that none of the known classes exists in the query image. A negative label is then
returned. Otherwise, the label of the class with the best score is returned. In our experiments,
we have considered threshold values between 5 and 15 % which yield relatively stable results.
Therefore, we have selected the value of 10 % for the results retained in this paper.
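One possible reading of this voting scheme, expressed as a Python sketch on top of the previous data structures, is shown below; the set-intersection interpretation of "similarities" and all variable names are our own assumptions:

```python
from collections import Counter

KNN_SC = 20           # spatial neighbours examined (same value as in training)
TH_TOP_RANKED = 0.10  # confidence threshold for the top-ranked class

def classify_query_image(n_query_points, query_words, query_neighbours, matched_ids,
                         point_words, point_neighbours, point_labels):
    """Vote for a landmark class using only spatially consistent matches.

    query_words[i]      : visual word of query interest point i
    query_neighbours[i] : indices of its KNN_SC spatial neighbours in the query image
    matched_ids[i]      : index of the matched training point (or None)
    """
    votes = Counter()
    for i, t in enumerate(matched_ids):
        if t is None:
            continue
        q_words = {query_words[j] for j in query_neighbours[i]}
        t_words = {point_words[j] for j in point_neighbours[t]}
        # Spatially consistent if more than KNN_SC / 4 neighbour words agree.
        if len(q_words & t_words) > KNN_SC // 4:
            votes[point_labels[t]] += 1

    if not votes:
        return None                                   # negative label
    best_class, best_score = votes.most_common(1)[0]
    confidence = best_score / max(n_query_points, 1)  # best score / number of keypoints
    return best_class if confidence > TH_TOP_RANKED else None
```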
The proposed approach has been totally implemented on an Android smartphone without
the need of any server-client communication. It is an offline application that the user can run
even without an internet connection. Let us underline that the training phase should be done off-line, on a regular server. Once the training phase is completed, the resulting vocabulary and
corresponding descriptors are stored on the smartphone. In this case, we have to limit the
number of possible building categories to about 10–15 landmarks. This is however sufficient
to deal with a given itinerary, and the response time to a query is about 2 s.
Let us now describe how such an interest point-based method can be extended/adapted for
the detection, localization and tracking of objects of interest in indoor environments.

4 Indoor object of interest detection/localization

Multi-object detection and tracking requires both efficient object recognition and efficient localization approaches. Sliding-window techniques seem to be well-suited for such a task, specifically
for classification and localization purposes. Recently, branch-and-bound [37] approaches have
been introduced to speed up object recognition and localization. Such techniques rely on a
linear scan of all the models, which can be computationally costly when the number of models
is large. Other recent methods abandon the sliding window paradigm in favor of segmentation-
based approaches. The principle consists of using multiple types of over-segmentations to
generate object candidates [27, 68]. Despite the promising results reported, the related
limitations come from the relatively high computational burden involved, which makes them
inappropriate for mobile applications.
Unlike classification and categorization problems, which need to learn for each category
considered a set of objects with a certain amount of variability, our recognition and localization
framework is designed to recognize single object models, specified by the user. For this
purpose, an alternative to sliding windows and segmentation consists of extracting local interest points and grouping them into pertinent regions. Such methods critically rely on the
matching algorithm involved, which establishes correspondences between feature points in the
given test image and those present in the training data set. The main advantage is related to the
low computational requirements involved.
In our work, we have adopted a feature point matching approach in order to design a
simultaneous multi-object recognition and localization framework. The main contribution concerns the proposed matching algorithm. Based on an efficient spatial layout consistency test, the method achieves significant savings in computational time.
The recognition approach is preceded by an offline stage designed to learn a set of pre-
defined objects specified by the user. Local sparse feature points are extracted from a set of
training images covering all objects from different views and at various scales. Local interest points are detected using the Difference of Gaussians (DoG) detector [43]. Each local
region is described with the help of SIFT descriptors [44]. For further processing and less
expensive computing, we assign to each descriptor a visual word. Therefore, we build a
vocabulary of visual words using the k-means clustering algorithm [3], as described in
Section 3.
The same procedure is applied to extract local interest points from the given video frame.
The interest points are put into correspondence with feature points obtained from the set of
training images. A local interest point is classified as belonging to an object instance if it is
matched with a feature point of the object from the training set. In order to boost the performance of the method, we introduce a novel matching algorithm. The proposed procedure is based on the verification of the spatial consistency of the matched interest point pairs (cf. Section 3.3).

After classifying interest points, the next stage consists of grouping feature points belonging
to the same object class into spatial clusters, with the help of hierarchical clustering. Each time a spatial cluster exceeds a pre-defined number of points, it is detected as a new object. A
bounding box covering the spatial cluster defines the localization of the object in the test
image. The main steps of the object recognition and localization process in the test stage are
illustrated in Fig. 10. Let us now detail the various stages involved.

4.1 Interest point matching

4.1.1 Nearest neighbor searching

Our matching method aims at determining for each interest point in the test image its
correspondents in the set of training images. To enable fast matching, we search for the k
nearest visual words of each local interest point. The feature points in the training set
associated with the determined k nearest visual words are then identified and stored. This
preliminary, rough matching based on visual words makes it possible to decrease the number
of candidate matches and thus to significantly reduce the computational complexity. The
corresponding descriptors are then matched with that of the considered interest point. The closest m matches are retained in this case. Furthermore, in order to improve the reliability of the matching procedure, we investigate the spatial layout similarity between the regions around each matched feature point pair.
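A compact sketch of this preliminary matching step is given below; the values k_words = 3 and m_best = 2 are illustrative defaults (the exact values of k and m are not fixed here), and word_to_points is assumed to map each visual word to the training points assigned to it:

```python
def candidate_matches(q_desc, vocabulary, word_to_points, train_descriptors,
                      flann, k_words=3, m_best=2):
    """Rough-to-fine matching of one query descriptor (shape (1, 128), float32).

    1) find its k nearest visual words on the small vocabulary;
    2) gather the training feature points assigned to those words;
    3) keep the m closest descriptor matches among these candidates.
    """
    words = flann.knnMatch(q_desc, vocabulary, k=k_words)[0]
    candidate_ids = []
    for w in words:
        candidate_ids += word_to_points.get(w.trainIdx, [])
    if not candidate_ids:
        return []
    dists = np.linalg.norm(train_descriptors[candidate_ids] - q_desc, axis=1)
    order = np.argsort(dists)[:m_best]
    return [candidate_ids[int(i)] for i in order]
```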

4.1.2 Spatial layout consistency

The spatial layout consistency procedure proposed extends the method previously introduced
in Section 3.3. The availability of interest points and of associated images in the training set
makes it possible to perform a finer analysis, by taking into consideration the angles between
matched interest points. For each interest point, we define a search area by the r spatial
nearest neighbor interest points. The interest point to be classified is called central point,
while spatial nearest neighbors are called secondary interest points. Let us consider a central
pair match (A, B). The set of secondary feature points that are assigned to the same visual
word are considered as matched. We accept a secondary match pair by strictly checking its relative position with respect to the SIFT orientation of the central interest points. Let us recall that each SIFT keypoint has an assigned orientation reflecting the dominant gradient orientation.

Fig. 10 Overview of the object recognition/localization approach

A correct secondary match pair (A_1, B_1) should satisfy the following condition:

$$\left| \angle\!\left(\vec{O}_A, \overrightarrow{AA_1}\right) - \angle\!\left(\vec{O}_B, \overrightarrow{BB_1}\right) \right| < \theta_{th} \qquad (12)$$

Here, the central match pair is (A, B) and the secondary match pair is (A_1, B_1), while O_A and O_B respectively represent the SIFT orientations of the central interest points A and B. The parameter θ_th is a pre-defined threshold (set in our experiments to 30°). Figure 11 illustrates a
correct and a rejected secondary match pair.
In the case where a correct match satisfying the spatial layout consistency condition was
identified, we label the interest point in the test image with the same class of the matched
interest point of the training dataset. In the opposite case, the interest point in the test image
remains unlabeled.
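A small sketch of condition (12) is shown below; signed angles are measured relative to each keypoint's SIFT orientation (OpenCV stores KeyPoint.angle in degrees, so a conversion to radians is assumed by the caller):

```python
import numpy as np

THETA_TH = np.deg2rad(30.0)   # angular tolerance used in the experiments (30 degrees)

def relative_angle(orientation_rad, p_from, p_to):
    """Signed angle of the vector p_from -> p_to, measured from the keypoint orientation."""
    v = np.asarray(p_to, dtype=float) - np.asarray(p_from, dtype=float)
    a = np.arctan2(v[1], v[0]) - orientation_rad
    return np.arctan2(np.sin(a), np.cos(a))            # wrap to (-pi, pi]

def secondary_match_consistent(A, orient_A, A1, B, orient_B, B1, theta_th=THETA_TH):
    """Condition (12): keep the secondary pair (A1, B1) only if its position relative to
    the SIFT orientation of the central points A and B is similar on both sides."""
    diff = relative_angle(orient_A, A, A1) - relative_angle(orient_B, B, B1)
    diff = np.arctan2(np.sin(diff), np.cos(diff))      # wrap the difference as well
    return abs(diff) < theta_th
```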

4.2 Spatial clustering

The spatial clustering process is performed separately for each pre-defined object model. The
purpose of this stage is to examine the locations of the same labeled interest points in order to
detect the presence of an object and identify its localization. This operation includes two main
stages.

4.2.1 Building spatial clusters

We begin by grouping same-labeled feature points into spatial clusters with the help of hierarchical clustering. The proposed algorithm initializes each point as a spatial cluster. Two spatial clusters with a distance below a pre-defined threshold are then merged. The distance between two spatial clusters is defined in Eq. (13), where C_1 and C_2 are two spatial clusters, P_1 and P_2 are two interest points, and d is the Euclidean distance between two points:

$$d(C_1, C_2) = \min_{P_1 \in C_1,\; P_2 \in C_2} d(P_1, P_2) \qquad (13)$$

This process is repeated iteratively and stopped when all distances between clusters are
superior to a pre-defined threshold. Spatial clusters with more than a predefined number of
points are finally retained.
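A possible SciPy sketch of this step follows; the merge distance and minimum cluster size are illustrative values, since the paper only specifies that they are pre-defined thresholds:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def spatial_clusters(points, merge_threshold=40.0, min_points=6):
    """Single-linkage clustering of same-labeled interest points, following Eq. (13).

    points          : (N, 2) pixel coordinates of the points of one object class
    merge_threshold : clusters closer than this distance (in pixels) are merged
    min_points      : clusters smaller than this are discarded
    """
    if len(points) < 2:
        return []
    # 'single' linkage uses the minimum point-to-point distance, i.e. Eq. (13).
    Z = linkage(points, method="single", metric="euclidean")
    labels = fcluster(Z, t=merge_threshold, criterion="distance")
    return [np.where(labels == lbl)[0] for lbl in np.unique(labels)
            if np.count_nonzero(labels == lbl) >= min_points]
```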

4.2.2 Object localization

Fig. 11 Illustration of the spatial consistency condition: (A_1, B_1) is a correct secondary match pair since condition (12) is verified; the same condition is not satisfied for the secondary match pair (A_2, B_2), which is rejected

In some cases, more than one spatial cluster associated with an object model can be identified. Under the assumption that only one instance of an object model can be detected in the test image, two situations can occur. First, there is the case when the spatial clusters represent different
parts of an object. In this situation, they should be merged to define its accurate localization. However, an off-target spatial cluster should be discarded. To tackle this issue, we test all possible combinations generated by merging spatial clusters of an object class. Each combination represents a candidate window for the object's localization. The purpose of the following process is to identify the candidate window that best covers the localization of the object.
Let us define the similarity measure S between two images I_1 and I_2 as the cosine between the vectors V_1 and V_2 of their Bag of Words histograms:

$$S(I_1, I_2) = \frac{V_1^T V_2}{\|V_1\| \, \|V_2\|} \qquad (14)$$

We compute the similarity measure S between a candidate window W, covering a combination of spatial clusters in the test image, and each image of the training set T(C) associated with the target object class C. The score assigned to a candidate window is defined as follows:

$$\mathrm{Score}(W) = \max_{I \in T(C)} S(W, I) \qquad (15)$$

The candidate window with the highest score defines the bounding box localization of the
detected object.
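A sketch of the window selection of Eqs. (14)–(15) is given below; the helper bow_of_window, which builds the BoW histogram of the bounding box covering a set of clusters, is an assumed callable provided by the caller:

```python
import numpy as np
from itertools import combinations

def cosine_similarity(v1, v2):
    """Eq. (14): cosine between two Bag-of-Words histograms."""
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)

def localize_object(clusters, bow_of_window, training_histograms):
    """Eq. (15): select the combination of spatial clusters whose bounding box best
    matches one of the training images T(C) of the target class."""
    best_combo, best_score = None, -1.0
    for r in range(1, len(clusters) + 1):            # every non-empty combination
        for combo in combinations(range(len(clusters)), r):
            hist = bow_of_window([clusters[i] for i in combo])
            score = max(cosine_similarity(hist, t) for t in training_histograms)
            if score > best_score:
                best_score, best_combo = score, combo
    return best_combo, best_score
```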
The final stage of the method concerns a tracking procedure, which makes it possible to
consistently follow the detected objects along multiple frames. The same method as the one
described in Section 2.1 was employed here.
The object localization information is sent to the VI user by acoustic feedback. The warning messages are sent in the same order as objects are detected, with no priorities. When multiple objects of interest are identified in the scene, only one warning at a time is generated in order not to confuse the user. After 2 s, if another unannounced object is still present in the scene, a new warning is launched. Let us note that a speech recognition module can easily be integrated in the system in order to allow the VI user to specify the desired object of interest.

5 Experimental results

Let us first present the experimental evaluation of the object detection / classification frame-
work presented in Section 2.1.

5.1 Object/obstacle detection and classification

We tested our system in multiple complex outdoor urban environments with the help of visually impaired users. The videos were also recorded and used to develop a testing database of 30 items. The average duration of each video is around 10 min, acquired at 30 fps at an image resolution of 320 × 240 pixels.
The image sequences are highly challenging because they contain in the same scene
multiple static and dynamic obstacles including vehicles, pedestrians or bicycles.
Also, because the recording process is performed by VI users, the videos are shaky and noisy, and include dark, cluttered and dynamic scenes. In addition, different types of camera/background
motions are present.

The annotation of each video was performed frame by frame by a set of human observers.
Once the ground truth image data set was developed, we objectively evaluated the proposed methodology with the help of two error parameters, denoted by FD and MD, representing respectively the number of falsely detected (FD) and missed (MD) obstacles. Let us denote by D the total number of correctly detected obstacles. Based on these quantities, the most relevant evaluation metrics are the so-called recall (R) and precision (P) rates [53].
The recall and precision rates can be combined in order to define a unique evaluation measure, the so-called F1 score [53]. Table 1 summarizes the results obtained by the obstacle detection module.
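For reference, these metrics can be computed directly from D, MD and FD, as in the short sketch below (the values in the comment correspond to the Cars row of Table 1):

```python
def detection_metrics(D, MD, FD):
    """Recall, precision and F1 score from correctly detected (D), missed (MD)
    and falsely detected (FD) obstacles."""
    recall = D / (D + MD)
    precision = D / (D + FD)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Example (Cars row of Table 1): detection_metrics(778, 73, 58) ≈ (0.914, 0.931, 0.922)
```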
On the considered database, the resulting F1 scores are at or above 84 % for all the considered categories. Particularly high detection rates are obtained for the cars and static obstacles classes.
Let us note that the image resolution is an important parameter, directly affecting the
detection performances and conditioning the real-time processing constraint. In our case, the
image resolution retained was selected based on a trade-off between the detection rates and the
processing speed. With an increased video resolution (1280 × 780 pixels), even low-contrast or small objects can be identified, but the computational burden becomes prohibitively
high. On the contrary, with the reduction in image size (176 × 144 pixels) various artifacts
appear, caused mostly by camera or object motion. In this case, the motion vectors associated
to low resolution objects are reduced or have low amplitude values. Thus, we have finally
considered an image resolution of (320 × 240) pixels which, on the considered devices,
represents the maximal resolution that still enables real-time performance.
We compared our obstacle detection method with one of the most relevant algorithms in the
state of the art, introduced in [50]. On the considered data set, the method in [50] yields an
average F1-score of 91 %. However, even though the detection performance increases, the
computational time becomes prohibitive (more than 60 s/frame), which makes it unsuited for a
real-time scenario.
In Fig. 12 we give a graphical representation of the performance of the obstacle detection
module. For each considered video we present four representative images. Various static or
dynamic objects present in the scene are represented with different colors (associated with each
individual motion class). Due to our temporal consistency step, the motion class associated to
an object remains constant between successive frames.
From the videos presented in Fig. 12 we can observe that the proposed framework is
able to detect and classify, with high accuracy, both dynamic obstacles (e.g. vehicles,
pedestrians and bikes) as well as static obstacles (e.g., pillars, road signs and fences). In the case of an obstruction, our system is able to identify whether the object is situated at head level or down at foot level.

Table 1 Experimental evaluation of the obstacle detection module

Category     | No. Obj. | D   | MD | FD | R (%) | P (%) | F1 (%)
Cars         | 851      | 778 | 73 | 58 | 91.4  | 93.1  | 92.2
People       | 678      | 584 | 94 | 87 | 86.1  | 87.0  | 86.5
Bikes        | 315      | 262 | 53 | 43 | 83.2  | 85.9  | 84.5
Static Obs.  | 587      | 511 | 76 | 62 | 87.1  | 89.2  | 88.1

Fig. 12 Experimental results of the obstacle detection and classification framework using ALICE device

Regarding the sensitivity of the detection method, as can be observed from Fig. 12, a high density of obstacles situated in the near surroundings of the user does not influence the system performance. At the same time, the system shows robustness to abrupt changes in the illumination conditions.
In the second part, we have evaluated the performance of the obstacle classification
module. We conducted multiple tests on a set of 2432 image patches that were extracted from
the video database using our obstacle detection method.
There are five parameters involved in our object recognition framework: the maximum number of extracted relevant interest points (N_points), the number of cells (n_cell) of the A-HOG descriptor, the sizes of the codebooks used in the BoVW and VLAD image representations, and the gamma parameter (γ) of the SVM-RBF kernel.
In Fig. 13a a comparative diagram of the results is presented, with respect to the variation of
the total number of retained interest points after applying our filtering strategy (Section 2.2.1).

Fig. 13 System performance evaluation with respect to the parameters involved: a. maximum number of interest
points; b. maximum number of cells for A-HOG; c. codebook size used in BoVW; d. codebook size used in
VLAD; e. γ parameter of the SVM-RBF kernel

Comparable results, in terms of F1 score, are obtained for values ranging between 300 and
2000 interest points. In Fig. 13b we illustrate the F1 variation with respect to the total number
of cells used to extract the adaptive HOG descriptor. The increase in the number of cells can be
translated as a zoom effect over the image patch. So, the discriminative power of the descriptor
is in this case reduced.
For further experiments, we set N_points to 300 interest points and n_cell to 128, which offers the best compromise between classification accuracy and computational speed.
We have studied next the impact of the vocabulary size on the overall system performance.
In Fig. 13c and d we present the F1 score variation with different values of the BoVW and
VLAD vocabulary.
As can be observed, in the case of BoVW a vocabulary with 4000 words returns the best
results, while for VLAD the best results are obtained for a codebook with 512 words.
However, we have to recall that in the context of VI applications, our objective is to achieve
real-time processing.
The classification speed is in this respect a crucial parameter. With the increase of the vocabulary size, the computational complexity significantly increases. Due to this constraint, we adopted a BoVW vocabulary of 1000 words, while for VLAD we set the codebook size to 128 words.
In Fig. 13e we show the system performance when varying γ in the SVM-RBF kernel. As
expected, different optimal values are obtained for BoVW and VLAD representations. We
fixed γ to 50 when adopting the BoVW representation and set γ to 1 in the case of VLAD.
Based on these results, our next goal is to determine the best mix of methods that return
high classification rates, in the context of VI application, without extensively increasing the
computational time.
The obtained results are summarized in Table 2. Regarding low-level image descriptors, it is better to use relevant interest points rather than A-HOG, because this choice returns high classification rates without excessively increasing the computational time.
For the global image representation, even though the BoVW approach can return good results for large vocabularies, it cannot scale to more than a few thousand images on a regular smartphone. The VLAD representation is more discriminative and describes images using typically smaller and denser vectors. Beyond the optimized vector representation, high retrieval accuracy and significantly better results are obtained when compared to a classical BoVW, even for small codes consisting of a few hundred bits.
However, regarding the computational burden, the VLAD image representation significantly increases the processing time. A solution to this problem is to reduce the size of the vocabulary from 128 to 64 words.

Table 2 Experimental evaluation of the different systems considered

Selected framework                      | F1 score (%) | Processing time (ms)
A-HOG + BoVW + SVM                      | 78.9         | 41
A-HOG + VLAD + SVM                      | 82.9         | 157
Relevant interest points + BoVW + SVM   | 83.1         | 55
Relevant interest points + VLAD + SVM   | 86.6         | 140

The best results are an F1 score of 86.6 % (relevant interest points + VLAD + SVM) and a processing time of 41 ms (A-HOG + BoVW + SVM).

From the perspective of an application dedicated to VI people, we believe that the optimal solution is the combination of relevant interest points, VLAD and SVM, because it offers the best compromise between performance and processing speed.
In Table 3 we illustrate the performance of the classification system (using relevant interest
points, with VLAD image representation and SVM-RBF classification) for each category (i.e.
vehicles, bicycles, pedestrians and static obstacles) along with the confusion matrix. As can be noticed, we introduced an extra category called Outlier in order to make sure that our system assigns an image patch to a category because of its high resemblance to that class and not merely because it must be assigned to one of the existing classes.
The obtained results show high F1 scores, which validates our approach. The recognition
scores are particularly high, superior to 83 % in all the cases. Slightly lower performances are
obtained in the case of the bicycle class. This is mainly due to the fact that the part of the image
corresponding to a bike is quite reduced when compared to the entire detected object (e.g.,
person riding the bike plus the bike). Thus, confusion appears between the bike and the
pedestrian categories (as also shown in Table 3).
Let us note that the classification performances can be enhanced by considering a late
fusion approach [24]. In our case, multiple fusion strategies can be considered by combining
both multiple cues of information (i.e., relevant interest points descriptors and adapted HOG)
and results from various classifiers (i.e., BoVW and VLAD). As indicated in [24], we can
expect that the late fusion classification will outperform the results of the best individual
descriptors and classifiers. The late fusion process itself is of low complexity. However, it still
requires computing the results of different approaches. This additional computational burden
remains the main limitation of such an approach in a real-time framework.
In terms of computational complexity, the average processing time of the entire framework
(obstacle detection and classification) when run on a regular Android smartphone is around
240 ms per frame, which leads to a processing speed around 5 frames per second.
Regarding battery consumption, our system can run continuously for 1–1.5 h. We estimate that by equipping the ALICE system with general-public external batteries, the system's autonomy can easily be extended up to 4–5 h of continuous running.

Discussion One limitation of the obstacle detection framework is given by the failure to
identify large, flat structures (e.g., walls, doors). When the user progressively approaches a large obstacle and its size exceeds half of the video frame, the system will not be able to correctly distinguish the background information from the foreground objects. In this
case, the obstacle will be considered as a part of the background. Moreover, because the
algorithm exploits a LKA point tracking procedure, we expect that aliasing problems can occur
in such cases.

Table 3 Obstacle classification module performance evaluation: confusion matrix and MC/FC per category

Actual class     | Static obstacles | People | Bikes | Cars | Outliers | GT  | MC | FC | Precision | Recall | F1 score
Static obstacles | 108              | 8      | 4     | 2    | 5        | 127 | 7  | 19 | 0.939     | 0.851  | 0.89
People           | 4                | 355    | 8     | 3    | 7        | 377 | 17 | 22 | 0.954     | 0.941  | 0.94
Bikes            | 10               | 8      | 88    | 3    | 6        | 115 | 9  | 27 | 0.907     | 0.765  | 0.83
Cars             | 3                | 0      | 3     | 297  | 2        | 305 | 13 | 8  | 0.958     | 0.973  | 0.96

GT Ground Truth, MC Missed Classified, FC False Classified (False Alarms)



The only solution for dealing with such structures would be to consider specific, dedicated
detection algorithms. Extraction of vertical, flat and/or repetitive elements such as the method
presented in [65] can provide hints for building up such a solution.
The evaluation of the standalone building/ landmark recognition approach is presented in
the following section.

5.2 Landmark/building recognition

To objectively evaluate our approach, we used two publicly available datasets which are the
Zürich Building dataset [62] and Sheffield Building dataset [41]. The Zurich Building
Database (ZuBuD) includes 1005 images with a resolution of 640 × 480 pixels,
representing 5 views of 201 different building facades in Zurich (Fig. 14). For testing purposes,
we have considered the 115 query images indicated in [20, 41, 72]. Our method achieves a 99.13 % accuracy rate on this dataset, which outperforms the state-of-the-art approaches designed for mobile applications. In addition, we can observe that, despite significant occlusions, the method is able to correctly identify the considered building.
On the Sheffield dataset (Fig. 15), we have no labeled test images, so we run a 5-fold cross-validation scheme: for each fold, we randomly select one fifth of the dataset images as test samples and the rest as training samples (Table 4).
This dataset consists of 3192 JPEG images with a resolution of 160 × 120 pixels, covering 40 buildings in Sheffield. We applied the same evaluation scheme to the Zürich dataset as well. The average accuracy rate obtained in this case is 98 %.
We compared our results on the 115 query images from ZuBuD dataset with those reported
in other papers. The recognition rates range from 80 % in [41] to 96.5 % in [72] and 99.1 % in
[20].
The recognition rate of our method is 99.13 %. The only method that exceeds this rate is
the one presented in [47], where a 100 % accuracy rate is reported. Their algorithm consists of
detecting Maximally Stable Extremal Regions in the images and then describing them by
affine co-variant local coordinate systems (called Local Affine Frames, LAFs). A keypoint
matching scheme is here performed between the query image and the whole dataset. It is
obvious that such a method requires significant computational and memory loads. Therefore, it
is not well-suited for mobile phone-dedicated implementations. Moreover, the recognition rate
in the experiments reported in [47] represents the percentage of finding a correct image match
in a top-five ranked list of candidates. In our case, we state the percentage of finding exactly
the correct class of buildings.

Fig. 14 An example from the Zurich Building dataset. a and c represent the five labeled images used for the training phase. b and d represent two test samples that were successfully matched to the correct class of
buildings. The blue interest points were labeled to the correct class while the red interest points were labeled to
the wrong class

Fig. 15 An example from Sheffield Building dataset. a and c five labeled images per building used for the
training phase. b and d two test samples that were successfully matched to the correct class of buildings. The blue
interest points were labeled to the correct class while the red interest points were labeled to the wrong class

To the best of our knowledge, only the paper in [74] reports significant results on the Sheffield building dataset. Zhao et al. [74] extract multi-scale GIST (MS-GIST) features that represent the structural information of the building images and use an enhanced fuzzy local maximal marginal embedding (EFLMME) algorithm to project the MS-GIST feature manifold onto a low-dimensional subspace. The method achieves a maximum recognition rate of 96.90 % on a subset of the dataset with images that were selected manually.
Finally, let us present the results obtained for the indoor object localization/identification
and tracking approach introduced in Section 4.

5.3 Indoor object identification and tracking

In order to constitute a ground truth data set, we have retained 40 book covers downloaded
from the Stanford mobile visual search data set [9]. For each book cover, we have acquired
20 training images (at resolution 720 × 480 pixels) representing the object at different poses
and scales.
In addition, we have constituted our own data set, composed of a set of 10 target objects
commonly used in daily life (books, bottles, remote control, keyboard). Figure 16 shows some
examples of the images used in our experiments. Here again, 20 training images per object have been acquired, representing each object from different views and at various scales in order to learn the object model.
Starting from these objects, we have constructed a test data set composed of 50 images with a resolution of 720 × 480 pixels. Unlike the Stanford mobile visual search dataset, the images captured for evaluation contain multiple objects of interest, placed against a significantly cluttered background. In addition, target objects can cover less than 10 % of the full image.
The test data set counts a total of 142 target objects. A correct recognition and localization of an object of interest is obtained when the ground truth bounding box covers more than half of the detected bounding box. The recall and the precision rates obtained are

Table 4 5-fold cross-validation results

Round                     | ZuBuD   | Sheffield
1st round accuracy rate   | 99.00 % | 99.63 %
2nd round accuracy rate   | 99.00 % | 99.87 %
3rd round accuracy rate   | 99.00 % | 99.51 %
4th round accuracy rate   | 99.50 % | 99.87 %
5th round accuracy rate   | 94.02 % | 99.87 %
Average accuracy rate     | 98.00 % | 99.75 %

Fig. 16 Examples of training images used in experiments

of 92 and 93 %, respectively. This demonstrates the pertinence of the proposed approach for
efficient object detection and recognition.
Figure 17 shows some examples of the objects detected and tracked in cluttered
environments.
We have run our experiments on a mobile device. The running time of our recognition and localization approach is about 2 s. The CPU-only SIFT feature detection and description, which takes about 1.5 s, is by far the most time-consuming stage in the process. A GPU implementation [55] can be a future solution for significantly reducing the recognition execution time. We execute the recognition algorithm at regular frame intervals. For the tracking process, the running time is estimated at 60 ms per frame, which corresponds to a frame rate of about 16 fps.

6 Conclusions and perspectives

In this paper, we introduce a novel computer vision-based perception system, dedicated to the
autonomous navigation of visually impaired people. The core of the proposed framework is
the obstacle detection and classification methodology designed to identify, in real time, both static obstacles and dynamic objects without any prior knowledge about the object type, position or location. Finally, the acoustic signals are sent to the VI user through bone conduction headphones. At the hardware level, the entire framework is embedded on a regular smartphone attached to the VI user with a chest-mounted harness.
In addition, a landmark/building recognition approach has been proposed, which aims at
improving the GPS-based navigation capabilities with the help of visual analysis.
Finally, a localization/tracking of objects of interest defined by the user has been proposed.
We tested our method on different outdoor scenarios with visually impaired participants.
The system shows robustness and consistency even for important camera and background
movement or for crowded scenes with multiple obstacles.
Fig. 17 Examples of multi-object localization results

From the experimental evaluation we determined that our system works in real time, returning warning messages fast enough for the user to walk normally. The algorithms were carefully designed and optimized in order to work efficiently on a low-power processing unit. In our opinion, this characteristic is one of the major differences between our system and most state-of-the-art algorithms. The system is completely integrated on a regular smartphone, thus it can be described as a wearable and friendly device, ready to be used by the VI. The system is
low-cost, since it does not require any expensive, dedicated hardware architecture, but solely
general public components available at affordable prices on the market. By using a chest-mounted harness, a regular smartphone and bone conduction headphones, our system is wearable
and portable. In addition, it is also non-intrusive, satisfying the hands-free and ears-free
requirements imposed by VI users. Because the entire processing is performed on the
smartphone, no connection to a processing unit is required.
For future work, we propose integrating our system into a more developed assistant which includes navigation information, stairs and crossing detectors, and a people recognizer (which helps identify familiar persons). The use of portable stereoscopic acquisition devices that can serve real-time 2D/3D reconstruction purposes, and notably provide more precise information on the distances to the detected objects/obstacles, is also an interesting axis of
future development.
With the emergence of graphical boards integrated on regular smartphones (e.g. NVIDIA
TX1) we envisage the integration of deep learning strategies within the object detection and
classification processes.

Acknowledgments This work has been partially supported by the AAL (Ambient Assisted Living) ALICE
project (AAL-2011-4-099), co-financed by ANR (Agence Nationale de la Recherche) and CNSA (Conseil
National pour la Solidarité et l’Autonomie).
This work was supported by a grant of the Romanian National Authority for Scientific Research and
Innovation, CNCS - UEFISCDI, project number PN-II-RU-TE-2014-4-0202.

References

1. Alahi A, Ortiz R, Vandergheynst P (2012) FREAK: fast retina keypoint. In: IEEE Conference on Computer
Vision and Pattern Recognition, 2012. CVPR
2. Ali H, Paar G, Paletta L (2007) Semantic indexing for visual recognition of buildings, 5th Int Symp Mob
Mapp Technol. 6–9
3. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the
Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, pp. 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494
4. Baatz G, Köser K, Chen D, Grzeszczuk R, Pollefeys M (2010) Handling urban location recognition as a 2D
homothetic problem. In: Daniilidis K, Maragos P, Paragios N (eds) Computer Vision – ECCV 2010 SE - 20,
Springer Berlin Heidelberg, pp. 266–279. doi: 10.1007/978-3-642-15567-3_20
5. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-Up Robust Features (SURF). Comput Vis Image
Underst 110:346–359. doi:10.1016/j.cviu.2007.09.014
6. Black M, Anandan P (1993) A framework for robust estimation of optical flow. In: International Conference
on Computer Vision CVPR, 231–236
7. Blasch BB, Wiener WR, Welsh RL (1997) Foundations of orientation and mobility. In: American
Foundation for the Blind, 2nd ed., Press: New York
8. Brock M, Kristensson PO (2013) Supporting blind navigation using depth sensing and sonification. In
Proceedings of the ACM Conference on Pervasive and Ubiquitous Computing, Switzerland
9. Chandrasekhar VR, Chen DM, Tsai SS, Cheung NM, Chen H, Takacs G et al (2011) The Stanford Mobile
Visual Search Data Set, in: Proceedings of the Second Annual ACM Conference on Multimedia Systems,
ACM, New York, NY, USA, pp. 117–122. doi: 10.1145/1943552.1943568
10. Chaudhry, Chandra R (2015) Design of a mobile face recognition system for visually impaired persons.
CoRR, vol. abs/1502.00756
11. Chen DM, Baatz G, Koser K, Tsai SS, Vedantham R, Pylvanainen T et al (2011) City-scale landmark
identification on mobile devices, Computer Vision and Pattern Recognition (CVPR), 2011 I.E. Conference
on. 737–744. doi: 10.1109/CVPR.2011.5995610
12. Chen L, Guo B, Sun W (2010) Obstacle detection system for visually impaired people based on stereo
vision. In Proceedings of the 4th International Conference on Genetic and Evolutionary Computing,
Shenzhen, China, 13–15

13. Csurka G, Bray C, Dance C, Fan L (2004) Visual categorization with bags of keypoints, Workshop on
Statistical Learning in Computer Vision, ECCV. 1–22
14. Dakopoulos D, Boddhu SK, Bourbakis N (2007) A 2D vibration array as an assistive device for visually
impaired, bioinformatics and bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International
Conference on. 930–937. doi: 10.1109/BIBE.2007.4375670
15. Dakopoulos D, Bourbakis N (2008) Preserving visual information in low resolution images during naviga-
tion of visually impaired. In: Proceedings of the 1st International Conference on PErvasive Technologies
Related to Assistive Environments, ACM, New York, NY, USA, pp. 27:1–27:6. doi: 10.1145/1389586.
1389619
16. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection, Computer Vision and
Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. 1 886–893 vol. 1. doi: 10.
1109/CVPR.2005.177
17. Dalal N, Triggs B (2006) Object detection using histograms of oriented gradients. In: European Conference
on Computer Vision
18. Delhumeau J, Gosselin P-H, Jégou H, Pérez P (2013) Revisiting the VLAD image representation. In: ACM
Multimedia, 653–656
19. Ding C, He X (2004) K-means clustering via principal component analysis. In: Proceedings of the Twenty-
First International Conference on Machine Learning, ACM, New York, NY, USA, pp. 29–. doi: 10.1145/
1015330.1015408
20. El Mobacher A, Mitri N, Awad M (2013) Entropy-based and weighted selective SIFT clustering as an energy
aware framework for supervised visual recognition of man-made structures. Math Probl Eng
21. Erhan D, Szegedy C, Toshev A, Anguelov D (2014) Scalable object detection using deep neural networks.
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2155–2162
22. Everingham M, Gool L, Williams CK, Winn J, Zisserman A (2010) The Pascal Visual Object Classes (VOC)
challenge. Int J Comput Vis 88:303–338. doi:10.1007/s11263-009-0275-4
23. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. Pattern
Analysis and MachineIntelligence, IEEE Transactions on, pp. 1–15
24. Fernando B, Fromont E, Muselet D, Sebban M (2012) Discriminative feature fusion for image classification.
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3434–3441
25. Gauglitz S, Hollerer T, Turk M (2011) Evaluation of interest point detectors and feature descriptors for visual
tracking. Int J Comput Vis, pages 1–26
26. Girshick RB, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and
semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587
27. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and
semantic segmentation, Computer Vision and Pattern Recognition (CVPR), 2014 I.E. Conference on. 580–
587. doi: 10.1109/CVPR.2014.81
28. Golledge RG, Marston JR, Costanzo CM (1997) Attitudes of visually impaired persons towards the use of
public transportation. J Vis Impair Blindness 90:446–459
29. Grauman K, Bastian L (2011) Visual object recognition. Morgan & Claypool, San Francisco
30. Gronat P, Obozinski G, Sivic J, Pajdla T (2013) Learning and calibrating per-location classifiers for visual
place recognition, Computer Vision and Pattern Recognition (CVPR), 2013 I.E. Conference on. 907–914.
doi: 10.1109/CVPR.2013.122
31. Harris C, Stephens M (1988) A combined corner and edge detector. In: Alvey Vision Conference, 147–151
32. Jegou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. In PAMI 33(1):117–128
33. Johnson LA, Higgins CM (2006) A navigation aid for the blind using tactile-visual sensory substitution. Eng
Med Biol Soc 2006. EMBS’06. 28th Annual International Conference of the IEEE. 6289–6292. doi: 10.
1109/IEMBS.2006.259473.
34. José J, Farrajota M, Rodrigues João MF, Hans du Buf JM (2011) The smart vision local navigation aid for
blind and visually impaired persons. Int J Digit Content Technol Appl 5:362–375
35. Khan A, Moideen F, Lopez J, Khoo WL, Zhu Z (2012) KinDetect: kinect detection objects. In: Computer
Helping People with Special Needs, LNCS7382, 588–595
36. Kuo BC, Ho HH, Li CH, Hung CC, Taur JS (2014) A kernel-based feature selection method for SVM with
RBF kernel for hyperspectral image classification, selected topics in applied earth observations and remote
sensing. IEEE J 7:317–326. doi:10.1109/JSTARS.2013.2262926
37. Lampert CH, Blaschko MB, Hofmann T (2008) Beyond sliding windows: object localization by efficient
subwindow search, Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. 1–8.
doi: 10.1109/CVPR.2008.4587586
38. Lee JJ, Kim G (2007) Robust estimation of camera homography using fuzzy RANSAC. In: Proceedings of
the 2007 International Conference on Computational Science and Its Applications - Volume Part I, Springer-
Verlag, Berlin, Heidelberg, pp. 992–1002. http://dl.acm.org/citation.cfm?id=1802834.1802930

39. Calonder M, Lepetit V, Strecha C, Fua P (2010) BRIEF: binary robust independent elementary features. 11th European Conference on Computer Vision (ECCV), Heraklion, Crete. LNCS Springer, September 2010
40. Leutenegger S, Chli M, Siegwart R (2011) Brisk: binary robust invariant scalable keypoints. IEEE
International Conference on Computer Vision (ICCV)
41. Li J, Allinson NM (2009) Dimensionality reduction-based building recognition. In: 9th IASTED
International Conference on Visualization
42. Lin Q, Hahn HS, Han YJ (2013) Top-view based guidance for blind people using directional ellipse model.
Int J Adv Robot Syst 1:1–10
43. Lowe DG (1999) Object recognition from local scale-invariant features, Computer Vision, 1999. The
Proceedings of the Seventh IEEE International Conference on. 2, 1150–1157 vol.2. doi: 10.1109/ICCV.
1999.790410
44. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110. doi:
10.1023/B:VISI.0000029664.99615.94
45. Lucas B, Kanade T (1981) An iterative technique of image registration and its application to stereo. In:
IJCAI’81 Proceedings of the 7th international joint conference on Artificial intelligence, 2, 674–679
46. Manduchi R (2012) Mobile vision as assistive technology for the blind: an experimental study. In:
Proceedings of the 13th International Conference on Computers Helping People with Special Needs -
Volume Part II, Springer-Verlag, Berlin, Heidelberg, pp. 9–16. doi: 10.1007/978-3-642-31534-3_2
47. Matas J, Chum O, Urban M, Pajdla T (2004) Robust wide-baseline stereo from maximally stable extremal
regions. Image Vis Comput 22:761–767. doi:10.1016/j.imavis.2004.02.006
48. Meers S, Ward K (2005) A substitute vision system for providing 3D perception and GPS navigation via
electro-tactile stimulation. In: 1st International Conference on Sensing Technology, 21–23
49. Muja M, Lowe DG (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In:
VISAPP International Conference on Computer Vision Theory and Applications, pp. 331–340
50. Oneata D, Revaud J, Verbeek J, Schmid C (2014) Spatio-temporal object detection proposals. Europeean
Conference on Computer Vision, ECCV 2014 - European Conference on Computer Vision, volume 8691,
pages 737–752, Zurich, Switzerland, Springer
51. Pascolini D, Mariotti SP (2012) Global data on visual impairments 2010, in: World Health Organization,
Geneva
52. Peng E, Peursum P, Li L, Venkatesh S (2010) A smartphone-based obstacle sensor for the visually impaired.
In: Yu Z, Liscano R, Chen G, Zhang D, Zhou X (eds) Ubiquitous Intelligence and Computing SE - 45,
Springer Berlin Heidelberg, pp. 590–604. doi: 10.1007/978-3-642-16355-5_45
53. Powers DMW (2011) Evaluation: from precision, recall and F measure to roc, informedness, markedness and
correlation. J Mach Learn Technol 2(1):37–63
54. Pradeep V, Medioni G, Weiland J (2010) Robot vision for the visually impaired, Computer Vision and
Pattern Recognition Workshops (CVPRW), 2010 I.E. Computer Society Conference on. 15–22. doi: 10.
1109/CVPRW.2010.5543579
55. Rister B, Wang G, Wu M, Cavallaro JR (2013) A fast and efficient sift detector using the mobile GPU,
Acoustics, Speech and Signal Processing (ICASSP), 2013 I.E. International Conference on. 2674–2678. doi:
10.1109/ICASSP.2013.6638141
56. Rodríguez A, Yebes JJ, Alcantarilla PF, Bergasa LM, Almazán J, Cela A (2012) Assisting the visually
impaired: obstacle detection and warning system by acoustic feedback. Sensors 12:17476–17496. doi:10.
3390/s121217476
57. Rosa S, Paleari M, Ariano P, Bona B (2012) Object tracking with adaptive HOG detector and adaptive Rao-
Blackwellised particle filter. Proceedings of SPIE 8301, Intelligent Robots and Computer Vision XXIX:
Algorithms and Techniques, 83010 W. doi:10.1117/12.911991
58. Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: an efficient alternative to SIFT or SURF.
Computer Vision (ICCV), 2011 I.E. International Conference on, vol., no., pp. 2564–2571, 6–13
59. Saez JM, Escolano F (2008) Stereo-based aerial obstacle detection for the visually impaired. In: Workshop
on Computer Vision Applications for the Visually Impaired, Marselle, France
60. Saez JM, Escolano F, Penalver A (2005) First steps towards stereo-based 6DOF SLAM for the visually
impaired, computer vision and pattern recognition - workshops, 2005. CVPR Workshops. IEEE Computer
Society Conference on. 23. doi: 10.1109/CVPR.2005.461
61. Sainarayanan G, Nagarajan R, Yaacob S (2007) Fuzzy image processing scheme for autonomous navigation
of human blind. Appl Soft Comput 7:257–264
62. Shao H, Svoboda T, Tuytelaars T, Van Gool L (2003) HPAT indexing for fast object/scene recognition
based on local appearance. In: E. Bakker, M. Lew, T. Huang, N. Sebe, X. Zhou (Eds.), Image and Video
Retrieval SE - 8, Springer Berlin Heidelberg, pp. 71–80. doi: 10.1007/3-540-45113-7_8
63. Szegedy C, Toshev A, Erhan D (2013) Deep neural networks for object detection. In: Annual Conference on
Neural Information Processing Systems, pp. 2553–2561

64. Takizawa H, Yamaguchi S, Aoyagi M, Ezaki N, Mizuno S (2012) Kinect cane: an assistive system for the
visually impaired based on three-dimensional object recognition. In Proceedings of IEEE International
Symposium on System Integration, Japan
65. Tian Y, Yang X, Arditi A (2010) Computer vision-based door detection for accessibility of unfamiliar
environments to blind persons. In: Proceedings of the 12th International Conference on Computers Helping
People with Special Needs, Springer LNCS, vol. 6180, pp. 263–270
66. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the
Ninth ACM International Conference on Multimedia, ACM, New York, NY, USA, pp. 107–118. doi: 10.
1145/500141.500159
67. Tuzel O, Porikli F, Meer P (2006) Region covariance: a fast descriptor for detection and classification. In ECCV 3952:589–600
68. van de Sande KEA, Uijlings JRR, Gevers T, Smeulders AWM (2011) Segmentation As Selective Search for
Object Recognition, in: Proceedings of the 2011 International Conference on Computer Vision, IEEE
Computer Society, Washington, DC, USA, pp. 1879–1886. doi: 10.1109/ICCV.2011.6126456
69. Vinyals A, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In:
International Conference on Computer Vision and Pattern Recognition (CVPR)
70. Wang HC et al (2015) Bridging text spotting and SLAM with junction features. Intelligent Robots and
Systems (IROS), 2015 IEEE/RSJ International Conference on, Hamburg, pp. 3701–3708
71. Yu JH, Chung HI, Hahn HS (2009) Walking assistance system for sight impaired people based on a multimodal
transformation technique. In Proceedings of the ICROS-SICE International Joint Conference, Japan
72. Zhang W (2005) Localization based on building recognition. In: IEEE Workshop on Applications for
Visually Impaired, pp. 21–28
73. Zhang M, Zhou Z (2005) A k-nearest neighbor based algorithm for multilabel classification. In: IEEE
International Conference on Granular Computing 2, 718–721
74. Zhao C, Liu C, Lai Z (2011) Multi-scale gist feature manifold for building recognition. Neurocomput 74:
2929–2940. doi:10.1016/j.neucom.2011.03.035

Ruxandra TAPU (M'09) received as valedictorian her B.S. degree in Electronics, Telecommunications and Information Technology from University "Politehnica" of Bucharest, Romania in 2008. In 2011 she received her Ph.D. in Electronics and Telecommunications from University "Politehnica" of Bucharest and in 2012 her second Ph.D. in Informatics from University Paris VI - Pierre et Marie Curie, Paris, France, with the distinction "très honorables". Since 2012 she has been a researcher within the ARTEMIS department, Institut Mines-Telecom/Telecom SudParis, having as major research interests content-based video indexing and retrieval, pattern recognition and machine learning techniques.

Bogdan MOCANU (M'09) received his B.S. degree in Electronics, Telecommunications and Information Technology from University "Politehnica" of Bucharest, Romania in 2008. In 2011 he received his Ph.D. in Electronics and Telecommunications from University "Politehnica" of Bucharest and in 2012 his second Ph.D. in Informatics from University Paris VI - Pierre et Marie Curie, Paris, France. Since 2012 he has been a researcher within the ARTEMIS department, Institut Mines-Telecom/Telecom SudParis. His major research interest is computer application technology, such as 3D model compression and algorithm analysis in image processing.

Titus ZAHARIA (M’97) received an Engineer degree in Electronics and Telecommunications, and a MSc.
degree from University POLITEHNICA (Bucharest, Romania) in 1995 and 1996, respectively. In 2001, he
obtained a Ph.D. in Mathematics and Computer Science from University Paris V – Rene Descartes (Paris,
France). He joined the ARTEMIS Department at Institut Télécom, Télécom Sudparis as an Associate Professor in
2002 and became a full Professor in 2011. His research interests include visual content representation
methods, with 2D/3D compression, reconstruction, recognition and indexing applications. Since 1998, Titus
Zaharia actively contributes to the ISO/MPEG-4 and MPEG-7 standards.
