
University of Wollongong
Research Online
University of Wollongong Thesis Collections

2012

Head-controlled perception via electro-neural stimulation
Simon Meers
University of Wollongong

Recommended Citation
Meers, Simon, Head-controlled perception via electro-neural stimulation, Doctor of Philosophy thesis, School of Computer Science and Software Engineering, University of Wollongong, 2012. http://ro.uow.edu.au/theses/3769

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: research-pubs@uow.edu.au

Head-Controlled Perception
via Electro-Neural Stimulation

A thesis submitted in fulfilment of the
requirements for the award of the degree

DOCTOR OF PHILOSOPHY

from

UNIVERSITY OF WOLLONGONG

by

SIMON MEERS, BCompSci(HonI)

School of Computer Science and Software Engineering

2012
Certification
I, Simon Meers, declare that this thesis, submitted in fulfilment of the re-
quirements for the award of Doctor of Philosophy, in the School of Computer
Science and Software Engineering, University of Wollongong, is wholly my
own work unless otherwise referenced or acknowledged. This document has
not been submitted for qualifications at any other academic institution.

Simon Meers
8th November 2012

Publications
• Meers, S., and Ward, K., “Head-tracking haptic computer interface
for the blind”, Advances in Haptics, pp. 143-154, INTECH, ISBN:
9789533070933, January 2010
• Meers, S., and Ward, K., “Face recognition using a time-of-flight cam-
era”, Sixth International Conference on Computer Graphics, Imaging
and Visualisation, Tianjin, China, August 2009
• Meers, S., and Ward, K., “Head-Pose Tracking with a Time-of-Flight
Camera”, Australasian Conference on Robotics and Automation, Can-
berra, Australia, December 2008
• Meers, S., and Ward, K., “Substitute Three-Dimensional Perception
using Depth and Colour Sensors”, Australasian Conference on Robotics
and Automation, Brisbane, Australia, December 2007
• Meers, S., and Ward, K., “Haptic Gaze-Tracking Based Perception of
Graphical User Interfaces”, 11th International Conference Information
Visualisation, Zurich, Switzerland, July 2007
• Meers, S., Ward, K. and Piper, I., “Simple, Robust and Accurate Head-
Pose Tracking Using a Single Camera”, The Thirteenth IEEE Confer-
ence on Mechatronics and Machine Vision in Practice, Toowoomba,
Australia, December 2006 – Best paper award
• Meers, S. and Ward, K., “A Vision System for Providing the Blind with
3D Colour Perception of the Environment”, Asia-Pacific Workshop on
Visual Information Processing, Hong Kong, December 2005
• Meers, S. and Ward, K., “A Substitute Vision System for Providing
3D Perception and GPS Navigation via Electro-Tactile Stimulation”,
International Conference on Sensing Technology, Palmerston North,
New Zealand, November 2005 – Best paper award
• Meers, S. and Ward, K., “A Vision System for Providing 3D Perception
of the Environment via Transcutaneous Electro-Neural Stimulation”,
IV04 IEEE 8th International Conference on Information Visualisa-
tion, London, UK, July 2004

Abstract
This thesis explores the use of head-mounted sensors combined with haptic
feedback for providing effective and intuitive perception of the surround-
ing environment for the visually impaired. Additionally, this interaction
paradigm is extended to providing haptic perception of graphical computer
interfaces. To achieve this, accurate sensing of the head itself is required for
tracking the user’s “gaze” position instead of sensing the environment. Trans-
cutaneous electro-neural feedback is utilised as a substitute for the retina’s
neural input, and is shown to provide a rich and versatile communication
interface without encumbering the user’s auditory perception.
Systems are presented for:
• facilitating obstacle avoidance and localisation via electro-neural stimu-
lation of intensity proportional to distance (obtained via head-mounted
stereo cameras or infrared range sensors);
• encoding of colour information (obtained via a head-mounted video
camera or dedicated colour sensor) for landmark identification using
electro-neural frequency;
• navigation using GPS data by encoding landmark identifiers into short
pulse patterns and mapping fingers/sensors to bearing regions (aided
by a head-mounted digital compass);
• tracking human head-pose using a single video camera with accuracy
within 0.5◦ ;
• utilising time-of-flight sensing technology for head-pose tracking and
facial recognition;
• non-visual manipulation of a typical software “desktop” graphical user
interface using point-and-click and drag-and-drop interactions; and
• haptic perception of the spatial layout of pages on the World Wide Web,
contrasting output via electro-neural stimulation and Braille displays.

Preliminary experimental results are presented for each system. References
to many new research endeavours building upon the concepts pioneered in
this project are also provided.

Acknowledgements
This document only exists thanks to:

• Dr. Koren Ward’s direction and inspiration;

• the gracious guidance and motivation of Prof. John Fulcher and Dr. Ian Piper in compiling, refining and editing over many iterations;

• the University of Wollongong and its research infrastructure and environment;

• the Australian Research Council grants which provided access to equipment such as time-of-flight cameras, and also aided travel to present this research around the globe;

• the Trailblazer innovation competition, a prize from which also helped fund research equipment;

• the helpful guidance of A/Prof. Phillip McKerrow and Prof. Yi Mu;

• assistance in equation wrangling provided by Dr. Ian Piper and the Mathematica software package;

• LaTeX making document management bearable;

• the support and encouragement of my wife Gillian, and our three children Maya, Cadan and Saxon, who were born during this research;

• family and friends’ encouragement to persist;

• the giants of the research community on whose shoulders we stand; and

• The Engineer of the human body – for provision of the inspirational designs on which these systems are based; and the remarkable recovery of my own body from the debilitating illness which greatly delayed, and very nearly prevented, the completion of this PhD.

Contents

1 Introduction 1

2 Literature Review 6

3 Perceiving Depth 13
3.1 Electro-Neural Vision System . . . . . . . . . . . . . . . . . . 13
3.1.1 Extracting depth data from stereo video . . . . . . . . 19
3.1.2 Range sampling . . . . . . . . . . . . . . . . . . . . . . 22
3.1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Using infrared sensors . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Wireless ENVS . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Obstacle avoidance . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Localisation . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Perceiving Colour 35
4.1 Representing colours with electro-neural signal frequencies . . 36
4.1.1 Mapping the colour spectrum to frequency . . . . . . . 36
4.1.2 Using a lookup-table for familiar colours . . . . . . . . 37
4.2 Colour perception experimentation . . . . . . . . . . . . . . . 38
4.2.1 Navigating corridors . . . . . . . . . . . . . . . . . . . 38
4.2.2 Navigating the laboratory . . . . . . . . . . . . . . . . 40
4.3 Colour sensing technology . . . . . . . . . . . . . . . . . . . . 42
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 High-Level Navigation 45
5.1 Interpreting landmarks via TENS . . . . . . . . . . . . . . . . 47
5.2 Landmark bearing protocols . . . . . . . . . . . . . . . . . . . 49
5.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Navigating the car park . . . . . . . . . . . . . . . . . 52
5.3.2 Navigating the University campus . . . . . . . . . . . . 54
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Haptic Perception of Graphical User Interfaces 57


6.1 The virtual screen . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Gridded desktop interface . . . . . . . . . . . . . . . . . . . . 59
6.2.1 Zoomable web browser interface . . . . . . . . . . . . . 62
6.3 Haptic output . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3.1 Electro-tactile stimulation . . . . . . . . . . . . . . . . 64
6.3.2 Vibro-tactile interface . . . . . . . . . . . . . . . . . . 65
6.3.3 Refreshable Braille display . . . . . . . . . . . . . . . . 66
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7 Head-Pose Tracking 72
7.1 Head-pose tracking using a single camera . . . . . . . . . . . . 74
7.1.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1.2 Processing . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1.3 Experimental results . . . . . . . . . . . . . . . . . . . 89
7.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Time-of-flight camera technology . . . . . . . . . . . . . . . . 93
7.2.1 The SwissRanger . . . . . . . . . . . . . . . . . . . . . 93
7.2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.4 Nose tracking . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.5 Finding orientation . . . . . . . . . . . . . . . . . . . . 107
7.2.6 Building a mesh . . . . . . . . . . . . . . . . . . . . . . 112
7.2.7 Facial recognition . . . . . . . . . . . . . . . . . . . . . 113
7.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8 Conclusions 124
8.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.2 Suggestions for further research . . . . . . . . . . . . . . . . . 128
8.2.1 Research already commenced . . . . . . . . . . . . . . 128
8.2.2 Other areas . . . . . . . . . . . . . . . . . . . . . . . . 129
8.3 Closing remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 130

List of Figures

3.1 Schematic of the first ENVS prototype . . . . . . . . . . . . . 14


3.2 Photograph of the first ENVS prototype . . . . . . . . . . . . 15
3.3 Internal ENVS TENS hardware . . . . . . . . . . . . . . . . . 16
3.4 TENS output waveform . . . . . . . . . . . . . . . . . . . . . 17
3.5 The ENVS control panel . . . . . . . . . . . . . . . . . . . . . 18
3.6 Videre Design DCAM . . . . . . . . . . . . . . . . . . . . . . 20
3.7 Stereo disparity geometry . . . . . . . . . . . . . . . . . . . . 21
3.8 Disparity map of featureless surface . . . . . . . . . . . . . . . 24
3.9 The prototype Infrared-ENVS . . . . . . . . . . . . . . . . . . 26
3.10 Wireless TENS patch . . . . . . . . . . . . . . . . . . . . . . . 27
3.11 ENVS user surveying a doorway . . . . . . . . . . . . . . . . . 30
3.12 ENVS user negotiating an obstacle . . . . . . . . . . . . . . . 32

4.1 Comparison of colour-to-frequency mapping strategies . . . . . 37


4.2 ENVS user negotiating a corridor using colour information . . 39
4.3 ENVS user negotiating a laboratory using colour information . 41

5.1 Electronic compass mounted on headset . . . . . . . . . . . . 46


5.2 GPS landmarks in visual field only . . . . . . . . . . . . . . . 49
5.3 225◦ GPS perception protocol . . . . . . . . . . . . . . . . . . 50
5.4 360◦ GPS perception protocol . . . . . . . . . . . . . . . . . . 51
5.5 ENVS user negotiating a car park . . . . . . . . . . . . . . . . 53
5.6 ENVS user surveying a paved path in the campus environment 55

6.1 Experimental desktop grid interface . . . . . . . . . . . . . . . 59


6.2 Mapping of fingers to grid cells . . . . . . . . . . . . . . . . . 60

6.3 Grid cell transit stabilisation . . . . . . . . . . . . . . . . . . . 61
6.4 Example of collapsing a web page for faster perception . . . . 63
6.5 Wired TENS system . . . . . . . . . . . . . . . . . . . . . . . 65
6.6 Vibro-tactile keyboard design . . . . . . . . . . . . . . . . . . 66
6.7 Papenmeier BRAILLEX EL 40s refreshable Braille display . . 67
6.8 Example glyphs . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.9 Braille text displaying details of central element . . . . . . . . 68

7.1 Infrared LED hardware . . . . . . . . . . . . . . . . . . . . . . 76


7.2 Infrared blob-tracking . . . . . . . . . . . . . . . . . . . . . . 79
7.3 Virtual screen geometry . . . . . . . . . . . . . . . . . . . . . 81
7.4 Relationship between parameters t and u . . . . . . . . . . . . 84
7.5 Triangle baseline distance . . . . . . . . . . . . . . . . . . . . 85
7.6 Triangle height . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.7 Triangle height/baseline ratio . . . . . . . . . . . . . . . . . . 86
7.8 z-coordinates of F and M . . . . . . . . . . . . . . . . . . . . 87
7.9 Triangle ratio graph with limits displayed . . . . . . . . . . . . 89
7.10 ‘Gaze’ angle resolution graphs . . . . . . . . . . . . . . . . . . 91
7.11 SwissRanger SR-3000 . . . . . . . . . . . . . . . . . . . . . . . 94
7.12 Sample amplitude image and corresponding depth map . . . . 95
7.13 SwissRanger point clouds . . . . . . . . . . . . . . . . . . . . . 97
7.14 Tracing a spherical intersection profile . . . . . . . . . . . . . 101
7.15 Example facial region-of-interest . . . . . . . . . . . . . . . . . 105
7.16 Amplitude image with curvature data . . . . . . . . . . . . . . 106
7.17 Sample frame with nose-tracking data . . . . . . . . . . . . . . 107
7.18 Example spherical intersection profiles . . . . . . . . . . . . . 111
7.19 Example faceprint . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.20 Running-average faceprint . . . . . . . . . . . . . . . . . . . . 119
7.21 Example faceprints . . . . . . . . . . . . . . . . . . . . . . . . 120
7.22 Example ‘gaze’ projected onto a ‘virtual screen’ . . . . . . . . 122

1 Introduction

In 2011 the World Health Organisation estimated the number of visually
impaired people worldwide to be 285 million [88]. Sight is considered to be
the predominant human sense, and is often taken for granted by people with
full vision. Whilst visually impaired people are generally able to compensate
for their lack of visual perception remarkably well by enhancing their other
sensory skills, loss of sight remains a severe handicap.
Visually impaired people are further handicapped by the recent prolif-
eration of desktop and hand-held communications and computing devices
which generally have vision-based graphical user interfaces (GUIs). Although
“screen readers” (e.g. [19, 25]) are able to provide limited assistance, they do
not enable perception of the layout of the screen content or effective inter-
action with graphical user interfaces. The range of applications that blind
people are able to use effectively is therefore quite limited.
Rapid advances in the fields of sensor technology, data processing and
haptic feedback are continually creating new possibilities for utilising non-
visual senses to compensate for having little or no visual perception. How-
ever, determining the best methods for providing substitute visual perception
to the visually impaired remains a challenge. Chapter 2 provides a review of
existing systems and evaluates their benefits and limitations.
This research aims to utilise these technological advances to:

1. Devise a wearable substitute vision system for effective and intuitive perception of the immediate environment to facilitate obstacle avoidance, localisation and navigation.

2. Devise a perception system that can enable a blind computer user to perceive the spatial screen layout and interact with the graphical user interface via point-and-click and drag-and-drop interactions.

To achieve perception of the environment, as stated in Item 1 above,
head-mounted stereo video cameras are used to construct a depth map of
the immediate environment. This range data is then delivered to the fingers
via electro-neural stimulation. To interpret this information, each finger is
designated to represent a region within the camera’s field of view. The in-
tensity felt by each finger indicates the distance to physical objects in the
corresponding region of the environment, as explained in Chapter 3. It is
also possible to represent the colour of perceived objects by modulating the
frequency of the stimulation, as described in Chapter 4. Furthermore, Chap-
ter 5 shows how locations, based on Global Positioning System (GPS) data,
can also be encoded into the electrical stimulation to facilitate navigation.
To enable a blind computer user to perceive the computer screen, as stated
in Item 2 above, the user’s “gaze” point on a virtual screen is estimated by
tracking the head position and orientation. Electro-neural stimulation of the
fingers is then used to indicate to the user what is located at the gaze po-
sition on the screen. In this case, the frequency of the pulses delivered to
the fingers indicates the type of screen object at the gaze position and the
intensity of the pulses indicates the screen object’s “depth” (e.g. if an object
is in a foreground or background window). Chapter 6 provides details of the
interface and experimental results. Chapter 7 outlines the novel head-pose
sensor research undertaken to provide tracking suitable for this application.
Chapter 6 also provides details of how a Braille display or vibro-tactile inter-
face can be used to replace the electro-neural stimulation and provide more
convenience to the user.
Although the above proposed perception systems use different sensory
devices for performing different perception tasks, both are based on the same
principle. Namely, utilising the head to control perception and the skin to
interpret what is being perceived. As most blind people retain the use of
their head for directing their sense of hearing they can be expected to also
be capable of using their head for directing substitute visual perception. Use
of haptic feedback provides rich and versatile communication while leaving
the user’s auditory senses unencumbered.

The main discoveries and developments of the research are summarised here:

• Mapping range data to the fingers (or other areas of the body) via
electro-neural pulses where the intensity is proportional to the distance
to objects.

• Perception of colour via electro-neural frequency mapping.

• Identification of known landmarks via electro-neural pulse coding.

• 360◦ bearing perception of GPS landmarks via electro-neural pulses.

• Development of a head-mounted infrared sensor array for eliminating the vision processing overheads and limitations of stereo cameras.

• Electro-neural perception of a computer desktop environment, including manipulation of screen objects via the graphical user interface.

• Development of a wireless electro-neural stimulation patch for communicating perception data to the user.

• Head-pose tracking system with a single-camera and LEDs.

• Head-pose tracking system with a time-of-flight camera.

• Facial recognition with a time-of-flight camera.

• Head-pose-based haptic perception of the spatial layout of web pages.

• Head-pose-based spatial interface perception using refreshable Braille.

It should be noted that the experimental results outlined throughout
this thesis are of a preliminary nature, i.e. proof of concept, and involve
only sighted (blindfolded) laboratory staff (author and supervisor). Testing
on blind subjects is outlined in Section 8.2 as future research. Based on
the assumption that blind individuals have more neural activity allocated
to tactile sensing, it is anticipated that they may in fact acquire greater
perception skills using the proposed devices than blindfolded sighted users.
Since the first Electro-Neural Vision System (ENVS) research was pub-
lished in 2004 [58] considerable research interest has been shown in the work
(e.g. [51, 12, 49, 81, 32, 4, 22, 65, 10, 68, 11, 79, 16, 35, 35, 42, 47, 1, 27,
14, 46, 66, 3, 85, 71]). Section 8.2.1 highlights the pioneering nature of this
research and provides examples of similar work built upon it.

2 Literature Review

Significant research effort has been invested in developing substitute vision
systems for the blind. See [51, 52, 12, 84, 5] for reviews. The most relevant
and significant work is outlined here.
Bionic vision in the form of artificial silicon retinas, or external cameras
that stimulate the retina, optic nerve or visual cortex via tiny implanted
electrodes, has been developed [86, 89, 13]. Implanted devices (generally
with input from external cameras) can provide visual perception in the form
of a number of points of light. The resultant information is of low resolution,
but has been found to enable subjects to identify simple objects and detect
motion [86]. Some researchers have suggested that the limited resolution of
such devices would be better utilised in delivering a depth map [48]. Whilst
the effectiveness of prosthetic devices will likely increase in the near future,
the cost and surgery involved render them inaccessible to most. Also, some
forms of blindness, such as brain or optic nerve damage, may be unsuitable
for implants.
A number of wearable devices have been developed for providing the
blind with some means of sensing or visualising the environment (see [84]
for a survey). For example, Meijer’s vOICe [60] compresses a camera image
into a coarse 2D array of greyscale values and delivers this information to the
ears as a sequence of sounds with varying frequency. However it is difficult to
mentally reconstruct the sounds into a three-dimensional (3D) representation
of the environment, which is needed for obstacle avoidance and navigation.
Sonar mobility aids for the blind have been developed by Kay [41]. Kay’s
system delivers frequency modulated sounds, using pitch to represent dis-
tance and timbre to indicate surface features. However, to an inexperienced
user, these combined sounds can be confusing and difficult to interpret. Also,
the sonar beam from these systems is specular and can be reflected off many
surfaces or absorbed, resulting in uncertain perception. Nonetheless, Kay’s
sonar blind aids can help to identify landmarks by resolving some object
features, and can facilitate a degree of object classification for experienced
users.
A major disadvantage of auditory substitute vision systems is that they
can diminish a blind person’s capacity to hear sounds in the environment
(e.g. voices, traffic, footsteps, etc.). Consequently, these devices are not
widely used in public places because they can reduce a blind person’s auditory
perception of the environment and could potentially cause harm or injury if
impending danger is not detected by the ears.
Computer vision technology such as object recognition and optical char-
acter recognition has also been utilised in recent navigational aid research [83,
21, 76].
The use of infrared distance sensors for detecting objects has been pro-
posed [56] but little explored in practice. Laser range scanners can provide
a high level of accuracy, but have limited portability [22].
Recently released consumer electronics are equipped with complex sensing
technology, creating new opportunities for the development of devices for
perception assistance [9, 53].
Electro-tactile displays for interpreting the shape of images on a com-
puter screen with the fingers, tongue or abdomen have been developed by
Kaczmarek et al. [36]. These displays work by mapping black and white pix-
els to a matrix of closely-spaced pulsed electrodes that can be felt by the
fingers. These electro-tactile displays can give a blind user the capacity to
recognise the shape of certain objects, such as black alphabetic characters on
a white background. More recently, experiments have been conducted using
electro-tactile stimulation of the roof of the mouth to provide directional cues
[82].
Researchers continue to investigate novel methods of haptic feedback.
Recently the use of “skin-stretch tactors” to provide directional cues for
mobile navigation was proposed by Provancher et al. [31, 44]. Mann et al.
use a head-mounted array of vibro-tactile actuators [53]. Samra et al. have
experimented with varying the speed and direction of rotating brushes for
texture synthesis within a virtual environment [73].
In addition to sensing the surrounding environment, it is of considerable
benefit if a perception aid can provide the position of the user or nearby land-
marks. Currently, a number of GPS devices are available for the blind that
provide the position of the user or specific locations using voice or Braille
interfaces (e.g. [28, 39]). However, as with audio substitute vision systems,
voice interfaces occupy the sense of hearing which can diminish a blind per-
son’s capacity to hear important environmental sounds. Braille interfaces
avoid this problem, but interacting via typing and reading in Braille is slower
and requires higher cognitive effort.
Others have explored the use of alternative location tracking technology
such as infrared [15] or Wi-Fi beacons [72] or RFID tags [8, 77, 87], which
can prove advantageous for navigating indoor environments. Yelamarthi et
al. have combined both GPS and RFID information in producing robots
designed to act as replacements for guide dogs [90].
In addition to inhibiting the navigation of physical environments, visual
impairment also restricts interaction with virtual environments such as soft-
ware interfaces. A number of “assistive technology” avenues have been pur-
sued to enable non-visual interaction with computers [78].
Use of “screen-reading” software [19, 25] is the predominant way in which
visually impaired users currently interact with computers. This software typ-
ically requires the user to navigate the computer interface in a linearised fash-

10
ion using memorised keystrokes (numbering in the hundreds 1 ). The currently
“focused” piece of information is conveyed to the user via speech synthesis
or a refreshable Braille display. A study of 100 users of screen-reading soft-
ware determined that the primary frustration encountered when using this
technology was “page layout causing confusing screen reader feedback” [45].
Tactile devices for enabling blind users to perceive graphics or images on
the computer have been under development for some time. For example the
haptic mouse (e.g. [33, 75, 6, 24]) can produce characteristic vibrations when
the mouse cursor is moved over screen icons, window controls and application
windows. This can indicate to a blind user what is currently beneath the
mouse pointer, but is of limited value when the locations of the objects and
pointer on the screen are not easily ascertained.
Force feedback devices, like the PHANToM [20], and tactile (or electro-
tactile) displays, e.g. [34, 36, 40, 55], can enable three-dimensional graphical
models or two-dimensional black-and-white images to be visualised by using
the sense of touch (see [7] for a survey). However, little success has been
demonstrated with these devices toward enabling blind users to interact with
typical GUIs beyond simple memory and visualisation experiments like “the
memory house” [80], which involves discovering buttons on different planes
via force feedback and remembering the buttons that play the same sounds
when found.
1 http://www.freedomscientific.com/doccenter/archives/training/JAWSKeystrokes.htm

Upon reviewing the progress in the research field to date, it was deter-
mined that the following areas have been explored little (if at all) despite
holding significant potential for assisting the visually impaired using cur-
rently available technology:

• Electro-neural stimulation as a navigational aid

• Haptic perception of colour information

• GPS navigation using haptic signals

• Haptic interaction with the spatial layout of graphical user interfaces

It was further noted that all four of these areas could utilise the head
as a natural and intuitive pan-and-tilt controller for scanning the region of
interest, and electro-neural stimulation as a versatile haptic feedback channel.
The following chapters outline new research and experimentation in these
areas.

3 Perceiving Depth

3.1 Electro-Neural Vision System

The structure of the first ENVS prototype is illustrated in Figure 3.1. The
prototype is comprised of:

• a stereo video camera headset for scanning the environment,

• a laptop computer for processing the video data,

• a Transcutaneous Electro-Neural Stimulation (TENS) unit for convert-
ing the output from the computer into appropriate electrical pulses that
can be felt via the skin, and

• special gloves fitted with electrodes for delivering the electrical pulses
to the fingers.

The ENVS works by using the laptop computer to obtain a disparity
depth map of the immediate environment from the head-mounted stereo
cameras. This is then converted into electrical pulses by the TENS unit that
stimulates nerves in the skin via electrodes located in the TENS data gloves.
To achieve electrical conductivity between the electrodes and skin, a small
amount of conductive gel is applied to the fingers prior to fitting the gloves.
For testing purposes, the stereo camera headset is designed to completely
block out the user’s eyes to simulate blindness. A photograph of the first
ENVS prototype is shown in Figure 3.2, while Figure 3.3 reveals the internal
hardware of a subsequent ENVS prototype.

Figure 3.1: Schematic of the first ENVS prototype

Figure 3.2: Photograph of the first ENVS prototype

Figure 3.3: Internal ENVS TENS hardware

The key to obtaining useful environmental information from the electro-
neural data gloves lies in representing the range data delivered to the fingers
in an intuitive manner. To interpret this information the user imagines their
hands are positioned in front of the abdomen with fingers extended. The
amount of stimulation felt by each finger is directly proportional to the dis-
tance of objects in the direction pointed by each finger.
Figure 3.4 shows an oscilloscope snapshot of a typical TENS pulse. For
conducting experiments the TENS pulse frequency was set to 20Hz and the
amplitude to between 40–80V depending on individual user comfort. To
control the intensity felt by each finger the ENVS adjusts the pulse width to
between 10–100µs.

Figure 3.4: TENS output waveform

Adjusting the signal intensity by varying the pulse width was found
preferable to varying the pulse amplitude for two reasons. Firstly, it en-
abled the overall intensity of the electro-neural stimulation to be easily set to
a comfortable level by presetting the pulse amplitude. It also simplified the
TENS hardware considerably by not requiring digital-to-analogue converters
or analogue output drivers on the output circuits.
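As a concrete illustration of this encoding, the short sketch below converts a sampled range reading into a TENS pulse width, following the convention described in the later experiments that close range readings stimulate the fingers more strongly than distant ones. The 10–100µs pulse-width bounds come from the text above; the linear mapping, the 0.5–4 metre sensing limits and all identifiers are illustrative assumptions rather than the actual ENVS implementation.

```python
# Illustrative sketch only: convert a range reading (metres) into a TENS pulse
# width (microseconds). Closer objects produce wider, stronger-feeling pulses.
# The linear mapping and the 0.5-4.0 m limits are assumptions, not ENVS values.

MIN_PULSE_US, MAX_PULSE_US = 10, 100   # pulse-width range stated in the text
NEAR_M, FAR_M = 0.5, 4.0               # assumed usable sensing range

def pulse_width_us(distance_m: float) -> int:
    if distance_m >= FAR_M:
        return MIN_PULSE_US            # distant or absent objects: weakest pulse
    d = max(distance_m, NEAR_M)
    strength = (FAR_M - d) / (FAR_M - NEAR_M)   # 0.0 far away, 1.0 very close
    return round(MIN_PULSE_US + strength * (MAX_PULSE_US - MIN_PULSE_US))
```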
To enable the stereo disparity algorithm parameters and the TENS output
waveform to be altered for experimental purposes, the ENVS is equipped with
the control panel shown in Figure 3.5. This was also designed to monitor the
image data coming from the cameras and the signals being delivered to the
fingers via the TENS unit. The ENVS software was built upon the “SVS”
software [43] provided with the stereo camera hardware (see Section 3.1.1).

Figure 3.5: The ENVS control panel

Figure 3.5 shows a typical screenshot of the ENVS control panel while in
operation. The top-left image shows a typical environment image obtained
from one of the cameras in the stereo camera headset. The corresponding
disparity depth map, derived from both cameras, can be seen in the top-right
image (in which lighter pixels represent a closer range than darker pixels).
Also, the ten disparity map sample regions, used to obtain the ten range
readings delivered to the fingers, can be seen spread horizontally across the
centre of the disparity map image. These regions are also adjustable via the
control panel for experimentation.
To calculate the amount of stimulation delivered to each finger, the min-
imum depth of each of the ten sample regions is taken. The bar graph, at
the bottom-left of Figure 3.5, shows the actual amount of stimulation de-
livered to each finger. Using a 450MHz Pentium 3 computer, a frame rate
of 15 frames per second was achieved, which proved more than adequate for
experimental purposes.

3.1.1 Extracting depth data from stereo video

The ENVS works by using the principle of stereo disparity. Just as human
eyes capture two slightly different images and the brain combines them to
provide a sense of depth, the stereo cameras in the ENVS capture two images
and the laptop computer computes a depth map by estimating the disparity
between the two images. However, unlike binocular vision in humans and
animals, which relies on independently moveable eyeballs, typical stereo vision
systems use parallel-mounted video cameras positioned at a set distance from
each other.

The stereo camera head

Experimentation utilised a pair of parallel mounted DCAM video cameras
manufactured by Videre Design 1 , as shown in Figure 3.6b. The stereo
DCAMs interface with the computer via an IEEE 1394 (FireWire) port.

(a) Single DCAM board

(b) Stereo DCAM Unit

Figure 3.6: Videre Design DCAM

1 http://users.rcn.com/mclaughl.dnai/

Calculating disparity

The process of calculating a depth map from a pair of images using parallel
mounted stereo cameras is well known [54]. Given the baseline distance
between the two cameras and their focal lengths (shown in Figure 3.7), the
coordinates of corresponding pixels in the two images can be used to derive
the distance to the object from the cameras at that point in the images.

Figure 3.7: Stereo disparity geometry

Calculating the disparity between two images involves finding correspond-
ing features in both images and measuring their displacement on the pro-
jected image planes. For example, given the camera setup shown in Fig-
ure 3.7, the distance from the cameras to the subject can be calculated quite
simply. Let the horizontal offsets of the pixel in question from the centre of
the image planes be xl and xr for the left and right images respectively and
the focal length be f with the baseline b. By using the properties of the sim-
ilar triangles denoted in Figure 3.7, then z = f (b/d), where z is the distance
to the subject and d is the disparity (xl − xr). Computing a complete depth
map of the observed image in real time is computationally expensive because
corresponding features must be detected and their disparity calculated for
every pixel of every frame at the frame rate.
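To make the relation concrete, the sketch below evaluates z = f(b/d) for a single matched pixel pair. The focal length, baseline and pixel coordinates are invented for illustration and are not the calibration values of the DCAM rig used here.

```python
# Worked example of the stereo relation z = f * b / d described above.
# All numbers are made up for illustration; they are not DCAM calibration data.

F_PIXELS = 500.0     # assumed focal length expressed in pixels
BASELINE_M = 0.09    # assumed baseline between the two cameras, in metres

def depth_from_disparity(x_left: float, x_right: float) -> float:
    """Distance (metres) to the point observed at x_left and x_right."""
    d = x_left - x_right            # disparity in pixels
    if d <= 0:
        raise ValueError("a valid correspondence must have positive disparity")
    return F_PIXELS * BASELINE_M / d

# A feature at x_left = 130 px and x_right = 100 px gives d = 30 px,
# so z = 500 * 0.09 / 30 = 1.5 metres.
```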

3.1.2 Range sampling

A number of methods for reducing the dense stereo map to ten scalar values
were trialled in the experimentation. These included:

• sampling regions spanning the full height of the camera viewport for
maximising the perception window;

• sampling a narrow band of regions for maximal focus and acuity when
building a mental map using head/sensor movement;

• simulation of the human eye’s foveal perception by providing greater
acuity at the centre of the viewport (with smaller regions) and broader
peripheral vision with larger regions toward the extremities;

• sampling the average distance within each region (high stability, but
small obstacles can be easily missed);

• sampling the predominant distance within each region (generally an
improvement on taking the average, however small obstacles were still
problematic);

• sampling the minimum distance within each region (proved to be safest
and most effective, providing appropriate thresholds were used; see the
sketch after this list); and

• maintaining exponential moving averages for mitigating data space lim-
itations, and allowing longer and more accurate windowing.
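The following minimal sketch illustrates the adopted combination of minimum-distance sampling and exponential moving averages over a single horizontal band of depth values. The number of regions matches the ten fingers; the validity thresholds, smoothing factor and identifiers are assumptions for illustration, not the parameters used in the ENVS.

```python
# Sketch of the adopted region-sampling strategy: take the minimum valid depth
# in each of ten horizontal regions and smooth it with an exponential moving
# average. Thresholds and the smoothing factor are illustrative assumptions.

NUM_REGIONS = 10
MIN_VALID_M, MAX_VALID_M = 0.3, 10.0   # assumed plausibility limits for a reading
ALPHA = 0.5                            # assumed smoothing factor

smoothed = [MAX_VALID_M] * NUM_REGIONS

def sample_regions(depth_band):
    """depth_band: list of per-pixel depths (metres) across the sampling band."""
    width = len(depth_band)
    readings = []
    for i in range(NUM_REGIONS):
        region = depth_band[i * width // NUM_REGIONS:(i + 1) * width // NUM_REGIONS]
        valid = [d for d in region if MIN_VALID_M <= d <= MAX_VALID_M]
        # Minimum distance is the safest choice: small obstacles are not missed.
        current = min(valid) if valid else MAX_VALID_M
        smoothed[i] = ALPHA * current + (1 - ALPHA) * smoothed[i]
        readings.append(smoothed[i])
    return readings   # one value per finger
```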

3.1.3 Limitations

Calibration

A tedious amount of fine-tuning of the stereo camera focus and alignment
was required to maximise the visual field and achieve an accurate depth
map. The tiniest deviation from these settings could greatly diminish the
quality of the resultant depth profile, necessitating recalibration. Even with
the camera hardware aligned as precisely as possible, unreliable software
calibration procedures could still lead to a poor quality depth map.

Featureless surfaces

The stereo disparity algorithm requires automated detection of corresponding
pixels in the two images, using feature recognition techniques, in order to
calculate the disparity between the pixels. Consequently, featureless surfaces
can pose a problem for the disparity algorithm due to a lack of identifiable
features. For example, Figure 3.8 illustrates this problem with a disparity
map of a whiteboard. As the whiteboard surface has no identifiable features
on it, the disparity of this surface and its range cannot be calculated. To
make the user aware of this, the ENVS maintains a slight signal if a region
contains only distant features and no signal at all if the disparity cannot be
calculated due to a lack of features in a region.

Figure 3.8: Disparity map of a featureless surface, displayed via the custom-
built ENVS software interface

As noted by others [68], intensity-based stereo matching algorithms can
overcome this problem, but incur substantially increased processing over-
heads.

Power consumption and processing overheads

The power required to operate the stereo camera unit and vision processing
software was found to be less than optimal for an application designed for
extended usage, during which the user would need to carry a battery of
considerable size.

3.2 Using infrared sensors

As stereo cameras are bulky, computationally expensive, power hungry and
need regular calibration, the use of eight Sharp GP2D120 2 infrared sensors
was assessed as an alternative means of measuring range data. The use of
infrared sensors in place of stereo cameras also overcomes previous limita-
tions perceiving featureless surfaces (see Section 3.1.3). The infrared sensor
produces an analogue voltage output proportional to the range of detected
objects and is easily interfaced to low power microprocessors. Although the
GP2D120 infrared sensor performs poorly in sunlight, it was found to be
capable of measuring 3–4 metres indoors under most artificial lighting con-
ditions and is accurate to within 5% of the distance being measured. Fur-
thermore, when configured as shown in Figure 3.9, very little interference or
cross-talk occurred between the sensors due to the narrow width of the infrared
detector array within each sensor. The range data and object perception
achieved with the infrared headset and ENVS data gloves was found to be
comparable with that achieved using stereo cameras. The outdoor limitations
of the infrared sensors could be overcome by developing custom infrared sen-
sors with more powerful infrared LEDs or laser diodes.
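As an illustration of how such a sensor's analogue output can be converted into a range estimate, the sketch below applies a simple calibrated inverse model. Sharp's output voltage falls off roughly as the inverse of distance; the constants, the exact model and the function name are placeholder assumptions that would need to be fitted against measured distances, not values taken from the GP2D120 datasheet or the ENVS firmware.

```python
# Rough sketch: convert an infrared range sensor's analogue voltage to distance.
# The inverse model and the constants A and B are illustrative placeholders that
# would be obtained by calibration, not datasheet or ENVS values.

A = 30.0   # assumed calibration gain
B = 0.1    # assumed calibration offset (volts)

def voltage_to_range_cm(volts: float) -> float:
    """Approximate distance in centimetres from the sensor output voltage."""
    if volts <= B:
        return float("inf")        # reading too weak: treat as out of range
    return A / (volts - B)

# Example: with these placeholder constants, a 1.1 V reading maps to 30 cm.
```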

2 http://www.sharp-world.com/products/device

Figure 3.9: The prototype Infrared-ENVS

3.3 Wireless ENVS

Whilst the crude gloves built for the initial prototype proved effective for ex-
perimentation, they are less than ideal for real-world usage. Having to wear
gloves limits the use of the hands whilst hooked up to the system. Further-
more, there is no reason for the haptic feedback to be limited to the fingers –
many other nerves in the body (e.g. arms or torso) may prove more conve-
nient or effective, perhaps varying from user to user. Some users may prefer
a less conspicuous location for the electrodes. The wires on the prototype sys-
tem also present potential risk of snagging and can restrict movement. These
issues prompted some initial research into alternative TENS hardware.

Figure 3.10: Wireless TENS patch

Figure 3.10 shows a prototype wireless TENS patch developed to over-
come the above issues. The wireless patches communicate with the system
via radio transmission. This not only allows the user to walk away from the
system without having to detach electrodes, but also enables the electrodes
to be placed anywhere on the body. They can also be placed inconspicuously
beneath clothing. When not limited to placement on fingers, the system
can be used with more than ten patches, providing higher communication
bandwidth.

3.4 Experimental results

A number of experiments were conducted to determine the extent to which
users could navigate the laboratory environment and recognise their location
within this environment without any use of the eyes. To simulate blindness
with sighted users, the stereo camera headset was designed to be fitted over
the user’s eyes so that no light whatsoever could enter the eyes.

3.4.1 Obstacle avoidance

The objective of the first experiment was to determine whether the user could identify
and negotiate obstacles while moving around in the cluttered laboratory en-
vironment. It was found that after five minutes of use within the unknown
environment, users could estimate the direction and range of obstacles lo-
cated in the environment, with sufficient accuracy to enable approaching
objects and then walking around them by interpreting the range data de-
livered to the fingers via the ENVS. As the environment contained many
different-sized obstacles, it was also necessary for users to regularly scan the
immediate environment (mostly with up and down head movements) to
ensure all objects were detected regardless of their size. After ten minutes
moving around in the environment, while avoiding obstacles, users were also
able to identify features like the open doorway, shown in Figure 3.11a, and even
walk through the doorway by observing this region of the environment with
the stereo cameras. Figure 3.11 shows a photo of a user and a screenshot of
the ENVS control panel at one instant while the user was performing this
task. The 3D profile of the doorway can be plainly seen in the depth map
shown at the top-right of Figure 3.11b. Also, the corresponding intensity of
the TENS pulses felt by each finger can be seen on the bar graphs shown at
the bottom-left corner of Figure 3.11b.
Ten range readings delivered to the fingers in this manner may not seem
like much environmental information. The real power of the ENVS lies in
the user’s ability to easily interpret the ten range readings, and, by fusing
this information over time, produce a mental 3D model of the environment.
The ability to remember the locations of obstacles was found to improve with
continued use of the ENVS, eliminating much of the need to regularly scan the
environment comprehensively. With practice, users could also interpret the
range data without any need to hold the hands in front of the abdomen.

(a) Photograph

(b) Interface screenshot

Figure 3.11: ENVS user surveying a doorway

3.4.2 Localisation

Localisation experiments were also conducted to determine if the user could
recognise their location within the laboratory environment after becoming
disoriented. This was performed by rotating the user a number of times on
a swivel chair, in different directions, while moving the chair. Care was also
taken to eliminate all noises in the environment that might enable the user
to recognise the locations of familiar sounds. As long as the environment
had significant identifiable objects that were left unaltered and the user had
previously acquired a mental 3D map of the environment, the user could
recognise significant objects, recall their mental map of the environment and
describe approximately where they were located in the environment after
surveying the environment for a few seconds. However, this task becomes
more difficult if the environment lacks significant perceivable features or is
symmetrical in shape.
Figure 3.12 shows a photo of a user and a screenshot of the ENVS control
panel at one instant while a user was surveying the environment to determine
his location. The approximated height, width and range of the table in the
foreground of Figure 3.12a can be plainly seen in the depth map, shown at the
top-right of Figure 3.12b. The corresponding intensity of the TENS pulses
felt by each finger can be seen on the bar graphs shown at the bottom-left
corner of Figure 3.12b.

(a) Photograph

(b) Screenshot

Figure 3.12: ENVS user negotiating an obstacle

The inability of stereo cameras to resolve the depth of featureless surfaces
was not a problem within the cluttered laboratory environment because there
were sufficient edges and features on the lab’s objects and walls for the dis-
parity to be resolved from the stereo video images. In fact, not resolving
the depth of the floor benefited the experiments to some extent by enabling
objects located on the floor to be more clearly identifiable, as can be seen in
Figure 3.12b.

3.5 Conclusions

Initial experimentation with the use of electro-neural stimulation as a means
of communicating the profile of the surrounding environment has proved the
potential of the concept. Stereo cameras worked well as range sensors, but
required substantial calibration effort and could not resolve featureless
surfaces. Use of infrared sensors overcame these problems, however they
performed less well in outdoor environments in daylight.
Mapping the range data to electro-neural output with one finger per sam-
ple region worked well. Users were able to intuitively build a mental map of
the corresponding environmental profile, which was enhanced over time by
scanning the environment with the head-mounted range sensing devices. It
is anticipated that results with blind users might prove even more successful
given their heightened non-visual sensory abilities, though this might vary
between users who were born blind and those who have previously experi-
enced sight, due to differences in how the environment is visualised.
Use of electro-neural intensity to represent proportional distance was
found to be effective and intuitively interpreted.
Wireless hardware was developed to overcome the limitations and incon-
veniences of the crude wired prototypes, and expanded the future potential
of the system in regards to sensor placement and resolution.
The prototype system was successfully tested by blindfolded (sighted)
users, proving its potential for facilitating spatial awareness, detection of
stationary and moving objects, obstacle avoidance and localisation.

4 Perceiving Colour

Although environmental range readings can enable blind users to avoid ob-
stacles and recognise their relative position by perceiving the profile of the
surrounding environment, a considerable improvement in localisation, navi-
gation and object recognition can be achieved by incorporating colour percep-
tion into the ENVS. Colour perception is important because it can facilitate
the recognition of significant objects which can also serve as landmarks when
navigating the environment.

4.1 Representing colours with electro-neural
signal frequencies

To encode colour perception into the ENVS, the frequency of the electrical
signals delivered to the fingers was adjusted according to the corresponding
colour sample. Two methods of achieving colour perception were investi-
gated. The first was to represent the continuous colour spectrum with the
entire available frequency bandwidth of the ENVS signals. Thus, red objects
detected by the ENVS would be represented with low frequency signals, vio-
let colours would be represented with high frequency signals and any colours
between these limits would be represented with a corresponding proportion-
ate frequency. The second method was to only represent significant colours
with specific allocated frequencies via a lookup table.

4.1.1 Mapping the colour spectrum to frequency

It was found that the most useful frequencies for delivering depth and colour
information to the user via transcutaneous electro-neural stimulation were
frequencies between 10–120Hz. Frequencies above this range tended to result
in nerves becoming insensitive by adapting to the stimulation. Frequencies
below this range tended to be too slow to respond to changed input. Con-
sequently, mapping the entire colour spectrum to the frequency bandwidth
available to the ENVS signals proved infeasible due to the limited band-
width available. Furthermore, ENVS experiments involving detection and
delivery of all colours within a specific environment via frequencies proved
ineffective for accurate interpretation of the range and colour information.
Rapid changes in frequency often made the intensity difficult to determine
accurately, and vice versa.

4.1.2 Using a lookup-table for familiar colours

Due to the infeasibility of mapping the entire colour spectrum to frequencies,
an eight-entry lookup table was implemented in the ENVS for mapping sig-
nificant colours, selectable from the environment, to selectable frequencies.
Figure 4.1 illustrates the differences between the two mapping strategies.
“Significant” colours represent specific colours possessed by objects in the
user’s familiar environment that can aid the user in locating their position
in the environment or identifying regularly used items – for example, doors,
kitchen table, refrigerator, pets, people (i.e. skin colour), home, etc. Al-
though these colours may be taken from regularly encountered environments,
such as the home or workplace, they are also likely to be often encountered
on objects in unfamiliar environments which can be used as landmarks to
aid in navigation.
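As an illustration of this lookup approach, the sketch below matches a sampled RGB value against a small table of familiar colours and returns the allocated signal frequency, staying within the 10–120Hz band discussed above. The example colours, frequencies, distance metric and matching threshold are all assumptions for illustration and not the values used in the ENVS trials.

```python
# Illustrative sketch of the significant-colour lookup: match a sampled RGB
# value to the nearest familiar colour and return its allocated TENS frequency.
# Table entries, the distance metric and the threshold are assumed values.

SIGNIFICANT_COLOURS = {
    "blue door":         ((30, 60, 150), 30),    # (reference RGB, frequency in Hz)
    "red extinguisher":  ((170, 30, 30), 60),
    "grey cabinet":      ((120, 120, 120), 90),
}
MATCH_THRESHOLD = 60.0   # assumed maximum RGB distance accepted as a match

def colour_frequency(rgb):
    """Return the frequency for a familiar colour, or None if nothing matches."""
    best_name, best_dist = None, float("inf")
    for name, (reference, _frequency) in SIGNIFICANT_COLOURS.items():
        dist = sum((a - b) ** 2 for a, b in zip(rgb, reference)) ** 0.5
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > MATCH_THRESHOLD:
        return None              # unfamiliar colour: no frequency modulation
    return SIGNIFICANT_COLOURS[best_name][1]
```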

Figure 4.1: Comparison of colour-to-frequency mapping strategies

4.2 Colour perception experimentation

A number of experiments were conducted to determine if users could navigate
indoor environments without using the eyes, by perceiving the 3D structure of
the environment and by recognising the location of landmarks by their colour
using the ENVS. As for previous experiments, the headset covered the user’s
eyes to simulate blindness with sighted users. To avoid the possibility of
users remembering any sighted positions of objects in the environment prior
to conducting trials, users were led blindfolded some considerable distance
to the starting point of each experiment.

4.2.1 Navigating corridors

The first experiment was to determine if the ENVS users could navigate
the corridor and find the correct door leading into the laboratory from a
location in another wing of the building. Prior to conducting the trials,
users were familiarised with the entrance to the laboratory and had practised
negotiating the corridor using the ENVS.
The lab entrance was characterised by having a blue door with a red fire
extinguisher mounted on the wall to the right of the door. To the left of the
door was a grey cabinet. The colour of the door, fire extinguisher and cabinet
were stored in the ENVS as significant colours and given distinguishable
frequencies. Figure 4.2 shows a photo of a user observing the entrance of
the laboratory with the ENVS and a corresponding screenshot of the ENVS
control panel. The level of stimulation delivered to the fingers can be seen on
the bars at the bottom-left of Figure 4.2b. As close range readings stimulate
the fingers more than distant range readings, the finger stimulation levels felt
by the user at this instant indicate that a wall is being observed on the right.

(a) Photograph

(b) Screenshot

Figure 4.2: ENVS user negotiating a corridor using colour information

Furthermore, the three significant colours of the door, fire extinguisher
and cabinet can also be seen in the range bars in Figure 4.2b. This indicates
that the ENVS has detected these familiar colours and is indicating this to the
user by stimulating the left middle finger, left pointer finger, both thumbs and
the right ring finger with frequencies corresponding to the detected familiar
colours.
After being familiarised with the laboratory entrance, the ENVS users
were led to a location in another wing of the building and asked to find their
way back to the lab by using only the ENVS. Users could competently ne-
gotiate the corridor, locate the laboratory entrance and enter the laboratory
unassisted and without difficulty, demonstrating the potential of this form of
colour and depth perception as an aid for the visually disabled.

4.2.2 Navigating the laboratory

Experiments were also performed within the laboratory (see Figure 4.3) to
determine if the ENVS users could recognise their location and navigate the
laboratory to the doorway without using the eyes or other blind aids. The
colours of a number of objects were encoded into the ENVS as significant
colours. These included the blue door and a red barrier stand located near
the door, as can be seen in Figure 4.3.
Before being given any navigational tasks in the laboratory, each user was
given approximately three minutes with the ENVS to become familiarised
with the doorway vicinity and other objects that possessed familiar colours
that were stored in the ENVS. To ensure that the starting location and
direction were unknown to the ENVS users, each user was rotated a number of
times on a swivel chair and moved to an undisclosed location in the laboratory
immediately after being fitted with the ENVS. Furthermore, the doorway
happens to be concealed from view from most positions in the laboratory
by partitions. Consequently, users had the added task of first locating their
position using other familiar coloured landmarks and the perceived profile of
the laboratory before deciding on which direction to head toward the door.

(a) Photograph

(b) Screenshot

Figure 4.3: ENVS user negotiating a laboratory using colour information

It was found that users were generally able to quickly determine their
location within the laboratory based on the perceived profile of the environ-
ment and the location of familiar objects. Subsequently, users were able to
approach the door, identify it by its familiar colour (as well as the barrier
stand near the door) and proceed to the door. Figure 4.3 shows a photo of
a user observing the laboratory door with the ENVS and a corresponding
screenshot of the ENVS control panel. The level of stimulation delivered to
the fingers and the detected familiar colours can be seen on the display at
the bottom-left of Figure 4.3b.

4.3 Colour sensing technology

When experimenting with infrared sensors rather than stereo cameras, colour
information was obtained using a miniature CMOS camera, like the one
shown in the centre of the infrared headset in Figure 3.9, and colour sensors like
the TAOS TCS230 [1] and the Hamamatsu S9706 [2]. These technologies pro-
vided equivalent perception of colour information to the stereo video camera
input, but with a far smaller power consumption and processing footprint.
The TAOS TCS230 sensor was found to perform poorly for this appli-
cation under fluorescent lighting conditions. This sensor samples the three
colour channels in sequence, and receives inconsistent exposure to each chan-
nel due to motion or the strobe effect of fluorescent lights.
Experiments with the Hamamatsu S9706 sensor proved it to be suitable
for the ENVS if fitted with an appropriate lens to maximise the amount of
light captured. This sensor samples red, green and blue light levels simulta-
neously, so does not suffer from the fluorescent strobing or fast motion issues
discovered with the TAOS sensor.
The hardware prototype uses custom colour matching algorithms encoded
on a PIC microprocessor [3], and experimental results have shown that this
sensor arrangement is capable of performing as well as stereo cameras for
this application.

[1] http://www.taosinc.com/product_detail.asp?cateid=11&proid=12
[2] http://www.sales.hamamatsu.com/en/products/solid-state-division/color_sensors/part-s9706.php
[3] http://www.microchip.com

4.4 Conclusions

ENVS depth and colour detection experiments were conducted with stereo
cameras and infrared range sensors combined with various colour sensors.
Colour sensors such as the TAOS TCS230 and the Hamamatsu S9706, com-
bined with a suitable focusing lens, were found to be comparable in perfor-
mance to stereo cameras at detecting colours at distance, with less power
and processing overheads. The TAOS TCS230 sensor was found to be less
effective under fluorescent light and with moving objects.
TENS frequencies in the range of 10–120Hz were found to be effective
in resolving and communicating both the colour and range of objects to the
user to a limited extent. Attempts to resolve the colour of all objects in the
environment proved ineffective due to bandwidth limitations of the TENS
signal and confusion that can occur when both the frequency and intensity
of the TENS signal vary too often.
Encoding a limited number of colours into the ENVS with a frequency
lookup table for detecting significant coloured objects was found to be effec-
tive for both helping to identify familiar objects and facilitating navigation
by establishing the location of known landmarks in the environment.

5 High-Level Navigation

Whilst perception of range and colour data within the immediate environ-
ment can help users to avoid collisions and identify landmarks, it does not
address the entirety of a blind person’s navigational difficulties. The visible
environment beyond the short distance detectable by range sensing hardware
should also be considered, as well as utilisation of data available from tech-
nology that human senses cannot naturally detect. For example, additional
information from GPS and compass technology could be used to facilitate
navigation in the broader environment by determining the user’s location in
relation to known landmarks and the destination.

To enable landmarks to be perceived by blind users, the ENVS is equipped
with a GPS unit, a digital compass and a database of landmarks. The
digital compass (Geosensory: RDCM-802 [1]) is mounted on the stereo camera
headset, as shown in Figure 5.1, and is used to determine if the user is looking
in the direction of any landmarks.

Figure 5.1: Electronic compass mounted on headset

Landmarks can be loaded from a file or entered by the user by pressing a
button on the ENVS and are associated with their GPS location and an ID
number. Landmarks are considered to be any significant object or feature
in the environment from which the user can approximate their position. A
landmark can also be a location that the user wishes to remember, for ex-
ample a bus stop or the location of a parked vehicle.

[1] http://www.geosensory.com/rdcm-802.htm

By using the GPS unit
to obtain the user’s location, the ENVS can maintain a list of direction vec-
tors to landmarks that are within a set radius from the user. The landmark
radius can be set to short or long range (e.g. 200–600m) by the user via a
switch on the ENVS unit.

5.1 Interpreting landmarks via TENS

When a landmark is calculated to be within the user’s vicinity (as determined
by the GPS coordinates, headset compass and the set landmark radius), the
ID of the perceived landmark is encoded into a sequence of pulses and deliv-
ered to the user via the finger which represents the direction of the landmark.
To encode the ID of a landmark, a five-bit sequence of dots and dashes carried
by a 200Hz signal is used to represent binary numbers. To avoid interfering
with the range readings of objects, which are also delivered to the fingers via
the ENVS, locations are delivered to the fingers in five-second intervals. For
example, if a landmark is detected, the user will receive range readings via
the fingers for four seconds followed by approximately one second of land-
mark ID information. If more than one landmark is present within the set
landmark radius and the field of view of landmarks, the landmark nearest
to the centre of the field of view will be output to the user. If there are
additional landmarks in the same region as the most central one, these are
transmitted sequentially in order of proximity.
By using five bits to represent landmark IDs, the ENVS is able to store
up to 32 locations, which proved more than adequate for experimentation.
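
By way of illustration, the following C++ sketch shows one way such an ID could be serialised: the five bits are emitted most-significant-bit first as long and short pulses on the 200Hz carrier. The pulse timings and the HapticPulse structure are assumptions for illustration, not the actual ENVS firmware.

    #include <cstdint>
    #include <vector>

    struct HapticPulse {
        double frequencyHz;  // carrier frequency of the pulse
        int durationMs;      // pulse length ('dot' or 'dash')
        int gapMs;           // silence following the pulse
    };

    // Encode a 5-bit landmark ID (0..31) as a dot/dash pulse sequence on a
    // 200Hz carrier, most significant bit first. Timings are illustrative.
    std::vector<HapticPulse> encodeLandmarkId(uint8_t id)
    {
        const double carrierHz = 200.0;
        const int dotMs = 60, dashMs = 180, gapMs = 60;
        std::vector<HapticPulse> sequence;
        for (int bit = 4; bit >= 0; --bit) {
            bool one = (id >> bit) & 1;
            sequence.push_back({carrierHz, one ? dashMs : dotMs, gapMs});
        }
        return sequence;
    }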

The distance to the landmark is indicated by the intensity of the pulses.
Weak sensations indicate that the landmark is near to the set maximum
landmark radius. Strong sensations indicate that the landmark is only metres
away from the user. If the user has difficulty recognising landmarks by their
pulse sequence, a button is available on the ENVS unit to output the name,
distance and direction of the perceived landmark as speech.
The RDCM-802 digital compass mounted on the ENVS headset has three-
bit output, providing an accuracy of 22.5◦ . This low resolution was not found
to be problematic as the user could obtain higher accuracy by observing the
point at which the bearing changed whilst moving their head. Also, the eight
compass point output mapped well to the fingers for providing the user with
a wide field of view for landmarks and proved effective for maintaining spatial
awareness of landmarks.
Communicating the landmark ID to the user via the frequency of the
landmark pulse (rather than the ‘dots and dashes’ protocol) was also tested
and found to allow faster interpretation of the ID. However, this was only
effective for a small number of landmarks due to the difficulty involved in
differentiating frequencies. It was also found that the delay between provid-
ing landmark information could be extended from five seconds to ten seconds
because landmarks tend to be more distant than objects, which allows spatial
awareness of landmarks to be maintained for longer periods than for nearby
objects.

5.2 Landmark bearing protocols

A number of protocols for mapping the relative direction of landmarks to
fingers were tested. The most obvious was to simply create a direct spa-
tial relationship between the ‘visual field’ of each of the ten fingers and the
landmark information, as shown in Figure 5.2.

Figure 5.2: GPS landmarks in visual field only

However, transmitting the landmark locations to the user periodically,
and separately to the range readings, made it unnecessary to map landmark
directions to the fingers in the same manner as used for objects. This pro-
vided the user with a greater field of view for landmarks than for objects,
which proved more effective for maintaining spatial awareness of landmarks.

Figure 5.3 illustrates one landmark perception protocol tested which al-
lows the user to perceive the direction of landmarks within 112.5◦ of the
direction they are facing. For example, landmarks within 22.5◦ of the centre
of vision are communicated via the two index fingers simultaneously. Land-
marks within the ‘peripheral vision’ of the user are communicated via the
appropriate pinkie finger.

Figure 5.3: 225◦ GPS perception protocol

It was also found that this perception field could effectively be extended
to 360◦ (see Figure 5.4), allowing the user to be constantly aware of land-
marks in their vicinity regardless of the direction they were facing. Both
these protocols were found effective, and users demonstrated no problems
interpreting the different fields of perception.

Figure 5.4: 360◦ GPS perception protocol
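
For illustration, the following C++ sketch maps a landmark’s relative bearing to one of the ten fingers for a 360° protocol by dividing the circle into ten equal sectors. The even 36° sectors and the finger numbering are assumptions; the actual protocol is that shown in Figure 5.4.

    #include <cmath>

    // Map a landmark's relative bearing (degrees, 0 = straight ahead,
    // positive = clockwise) to one of ten fingers, 0 = left pinkie ...
    // 9 = right pinkie. The even 36-degree sectors are an illustrative
    // assumption, not the exact protocol used by the ENVS.
    int fingerForBearing(double headingDeg, double landmarkBearingDeg)
    {
        double rel = std::fmod(landmarkBearingDeg - headingDeg, 360.0);
        if (rel < 0) rel += 360.0;                     // rel is now in [0, 360)
        // Rotate so that sector 0 starts directly behind the user's left side
        // and sectors sweep left-to-right across the front.
        double fromLeftRear = std::fmod(rel + 180.0, 360.0);
        int sector = static_cast<int>(fromLeftRear / 36.0);   // 0..9
        return sector;
    }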

5.3 Experimental results

To test the ENVS a number of experiments were conducted within the Uni-
versity campus grounds, to determine the ability of users to navigate the
campus environment without any use of their eyes. The ENVS users were
familiar with the campus grounds and the landmarks stored in the ENVS and
had no visual impairments. The headset acted as a blindfold as in previous
experiments.

5.3.1 Navigating the car park

The first experiment was performed to determine if the ENVS users could
navigate a car park and arrive at a target vehicle location that was encoded
into the ENVS as a landmark. Each user was fitted with the ENVS and
led blindfolded to a location in the car park that was unknown to them and
asked to navigate to the target vehicle using only the ENVS electro-tactile
signals. The car park was occupied to approximately 75% of its full capacity
and also contained some obstacles such as grass strips and lampposts.
Users were able to perceive and describe their surroundings and the loca-
tion of the target vehicle in sufficient detail for them to be able to navigate
to the target vehicle without bumping into cars or lampposts. With practice,
users could also interpret the ENVS output without needing to extend their
hands to assist with visualisation, and could walk between closely spaced
vehicles without colliding with the vehicles.
Figure 5.5 shows a user observing two closely spaced vehicles in the car
park with the ENVS. The profile of the space between the vehicles can be
seen in the disparity map, shown in the top-right of Figure 5.5b, and in
the finger pulse bars shown at the lower-left. The highlighted bar at the
left forefinger position of the intensity display indicates that the target vehicle is
located slightly to the left of where the user is looking and at a distance of
approximately 120m.

Figure 5.5: ENVS user negotiating a car park – (a) photograph, (b) screenshot showing the perceived vehicles and the target landmark’s direction and distance

5.3.2 Navigating the University campus

Experiments were also performed within the University campus to determine
the potential of the ENVS to enable blindfolded users to navigate the campus
without using other aids. The main test was to see if users could navigate
between two locations some distance apart (approximately 500m) and avoid
any obstacles happening to be in the way. The path was flat and contained
no stairs between the two locations. A number of familiar landmarks were
stored in the ENVS in the vicinity of the two locations.
It was found that users were able to avoid obstacles, report their ap-
proximate location and orientation, and arrive at the destination without
difficulty. Unlike the car park, the paved path was highly textured making
it clearly visible to the stereo cameras. Consequently, this delivered clearly
defined range readings of the paved path to the user via the ENVS unit as
shown in Figure 5.6.

Figure 5.6: ENVS user surveying a paved path in the campus environment – (a) photograph, (b) screenshot

In some areas, users were unable to determine where the path ended and
the grass began, using the ENVS range stimulus alone. However, this did
not cause any collisions and the users became quickly aware of the edge of
the path whenever their feet made contact with the grass. This problem
could be overcome by encoding colour into the range signals delivered to the
fingers by varying the frequency of the tactile signals.

5.4 Conclusions

To enable the user to perceive and navigate the broader environment, the
ENVS was fitted with a compass and GPS unit.
By encoding unique landmark identifiers with five-bit morse-code-style
binary signals and delivering this information to the fingers, it was shown
that the user was able to perceive the approximate bearing and distance of
up to 32 known landmarks.
It was also found that the region-to-finger mapping need not necessarily
correspond exactly with the ENVS range and colour data regions, and could
in fact be expanded to make full 360◦ perception of landmarks possible.
These experimental results demonstrate that by incorporating GPS and
compass information into the ENVS output, it may be possible for blind
users to negotiate the immediate environment and navigate to more distant
destinations without additional assistance.

6 Haptic Perception of Graphical User Interfaces

Given the successful real-world navigation experiments using the ENVS, it
was anticipated that the extension of the concept to the navigation of “vir-
tual” environments (such as computer interfaces) would prove similarly ef-
fective. Similar haptic feedback principles, based on perception directed by
the head, could be applied. The same TENS glove equipment could be used
for experimentation. However, the means of sensory input and control (for
example head-mounted range sensors) would differ significantly.

A substantial advantage of virtual environments is the elimination of sen-
sor inaccuracies and limitations involved in determining the state of the envi-
ronment. This is because the software can have a complete map available at
all times. More difficult is accurately determining the portion of the environ-
ment toward which the user’s perception is directed during each moment in
time. Details of the custom systems developed to deliver head-pose tracking
suitable for this application can be found in Chapter 7. The interface itself
is described in the following sections.
The primary goal of the gaze-tracking [1] haptic interface is to maintain the
spatial layout of the interface so that the user can perceive and interact with
it in two-dimensions as it was intended, rather than enforcing linearisation
with the loss of spatial and format data, as is the case with screen readers.
In order to maintain spatial awareness, the user must be able to control
the “region of interest” and understand its location within the interface as
a whole. Given the advantages of keeping the hands free for typing and
perception, the use of the head as a pointing device was an obvious choice
– a natural and intuitive pan/tilt input device which is easy to control and
track for the user (unlike mouse devices).
The graphical user interface experimentation was performed using the
infrared-LED-based head-pose tracking system described in Section 7.1.

[1] In this context, gaze-tracking is analogous to head-pose tracking.

6.1 The virtual screen

Once the user’s head-pose is determined, a vector is projected through space
to determine the gaze position on the virtual screen. The main problems
lie in deciding what comprises a screen element, how screen elements can be
interpreted quickly, and the manner by which the user’s gaze passes from one
screen element to another. Two approaches to solving these problems were
tested as explained in the following sections.
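
The projection itself is straightforward. The following C++ sketch intersects the gaze ray with a virtual screen plane perpendicular to the z-axis; the structure names are illustrative, and the more general case of an arbitrarily transformed screen is omitted.

    #include <cmath>
    #include <optional>

    struct Vec3 { double x, y, z; };
    struct GazePoint { double x, y; };   // 2D point on the virtual screen

    // Intersect the gaze ray (origin + s * direction) with the plane z = screenZ.
    // Returns nothing if the gaze is parallel to the screen or directed away from it.
    std::optional<GazePoint> projectGaze(const Vec3& origin, const Vec3& direction,
                                         double screenZ)
    {
        if (std::fabs(direction.z) < 1e-9)
            return std::nullopt;                   // parallel to the screen plane
        double s = (screenZ - origin.z) / direction.z;
        if (s <= 0.0)
            return std::nullopt;                   // screen lies behind the gaze
        return GazePoint{origin.x + s * direction.x,
                         origin.y + s * direction.y};
    }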

6.2 Gridded desktop interface

The initial experiments involved the simulation of a typical “desktop” inter-
face, comprising a grid of file/directory/application icons at the “desktop”
level, with cascading resizable windows able to “float” over the desktop (see
Figure 6.1).

Figure 6.1: Experimental desktop grid interface

The level of the window being perceived (from frontmost window
to desktop-level) was mapped to the intensity of haptic feedback provided to
the corresponding finger, so that “depth” could be conveyed in a similar fash-
ion to the ENVS. The frequency of haptic feedback was used to convey the
type of element being perceived (file/folder/application/control/empty cell).
Figure 6.2 illustrates the mapping between adjacent grid cells and the user’s
fingers. The index fingers were used to perceive the element at the gaze point,
while adjacent fingers were optionally mapped to neighbouring elements to
provide a form of peripheral perception. This was found to enable the user
to quickly acquire a mental map of the desktop layout and content. By gaz-
ing momentarily at an individual element, the user could acquire additional
details such as the file name, control type, etc. via synthetic speech output
or Braille text on a refreshable display.
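
As a simplified illustration of this mapping (not the exact encoding used), the following C++ sketch derives a per-finger signal from the element type and window layer; the frequency table and layer scaling are assumptions, chosen only to fall within the TENS range discussed earlier.

    #include <algorithm>

    enum class ElementType { Empty, File, Folder, Application, Control };

    struct HapticSignal {
        double frequencyHz;   // conveys the element type
        double intensity;     // conveys the window layer (0..1, 1 = frontmost)
    };

    // Illustrative mapping: the frequencies and the layer scaling are assumptions.
    HapticSignal signalForCell(ElementType type, int layer, int layerCount)
    {
        double freq = 0.0;
        switch (type) {
            case ElementType::Empty:       freq = 0.0;   break;
            case ElementType::File:        freq = 20.0;  break;
            case ElementType::Folder:      freq = 40.0;  break;
            case ElementType::Application: freq = 70.0;  break;
            case ElementType::Control:     freq = 110.0; break;
        }
        // Frontmost window (layer 0) gives the strongest stimulation,
        // desktop-level elements the weakest.
        double intensity = layerCount > 1
            ? 1.0 - static_cast<double>(layer) / (layerCount - 1) * 0.7
            : 1.0;
        return {freq, std::clamp(intensity, 0.0, 1.0)};
    }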

Figure 6.2: Mapping of fingers to grid cells

A problem discovered early in experimentation with this interface was
the confusion caused when the user’s gaze meandered back and forth across
cell boundaries, as shown in Figure 6.3. To overcome this problem, a subtle
auditory cue was provided when the gaze crossed boundaries to make the user
aware of the grid positioning, which also helped to distinguish contiguous
sections of homogeneous elements. In addition, a stabilisation algorithm was
implemented to minimise the number of incidental cell changes as shown in
Figure 6.3.

Figure 6.3: Gaze travel cell-visiting sequence unstabilised (left) and with
stabilisation applied (right)
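
The stabilisation can be thought of as a dwell filter: a cell change is only reported once the gaze has remained in the new cell for a minimum time. The following C++ sketch is one plausible formulation; the dwell threshold is an assumption rather than the parameter actually used.

    #include <chrono>

    // Report a cell change only after the gaze has dwelt in the new cell for
    // 'dwell' milliseconds; brief excursions across a boundary are ignored.
    class CellStabiliser {
    public:
        explicit CellStabiliser(std::chrono::milliseconds dwell
                                    = std::chrono::milliseconds(150))
            : dwell_(dwell) {}

        // 'cell' is the raw cell under the gaze point for this frame.
        // Returns the stabilised cell to present to the user.
        int update(int cell, std::chrono::steady_clock::time_point now)
        {
            if (cell != candidate_) {           // gaze entered a different cell
                candidate_ = cell;
                candidateSince_ = now;
            }
            if (candidate_ != current_ && now - candidateSince_ >= dwell_)
                current_ = candidate_;          // change confirmed
            return current_;
        }

    private:
        std::chrono::milliseconds dwell_;
        int current_ = -1;
        int candidate_ = -1;
        std::chrono::steady_clock::time_point candidateSince_{};
    };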

6.2.1 Zoomable web browser interface

With the ever-increasing popularity and use of the World Wide Web, a web-
browser interface is arguably more important to a blind user than a desktop
or file management system. Attempting to map web pages into grids similar
to the desktop interface proved difficult due to the more free-form nature
of interface layouts used. Small items such as radio buttons were forced to
occupy an entire cell, thus beginning to lose the spatial information critical
to the system’s purpose. The grid was therefore discarded altogether, and
the native borders of the HyperText Markup Language (HTML) elements
used instead.
Web pages can contain such a wealth of tightly-packed elements, however,
that it can take a long time to scan them all and find the item of interest.
To alleviate this problem, the system takes advantage of the natural
Document Object Model (DOM) element hierarchy inherent in HTML and
“collapses” appropriate container elements to reduce the complexity of the
page. For example, a page containing three bulleted lists, each containing
text and links, and two tables of data might easily contain hundreds of el-
ements. If, instead of rendering all of these individually, they are simply
collapsed into the three lists and two tables, the user can much more quickly
perceive the layout, and then opt to “zoom” into whichever list or table in-
terests them to perceive the contained elements (see Figures 6.4a and 6.4b
for another example).

Figure 6.4: Example of collapsing a web page for faster perception – (a) raw page, (b) collapsed page
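
Although the real implementation operates on the browser’s DOM from JavaScript, the collapsing idea can be sketched on a generic tree. In the following illustrative C++ sketch, container elements with many descendants are presented as a single collapsed entry that can later be ‘zoomed’ into; the tag names and threshold are assumptions.

    #include <string>
    #include <vector>

    // Generic stand-in for a DOM node (the real implementation works on the
    // browser's DOM from JavaScript; this C++ tree is purely illustrative).
    struct Node {
        std::string tag;              // e.g. "ul", "table", "a", "p"
        std::vector<Node> children;
    };

    static int descendantCount(const Node& n)
    {
        int count = 0;
        for (const Node& c : n.children)
            count += 1 + descendantCount(c);
        return count;
    }

    // Build the list of items actually presented to the user: container
    // elements with many descendants are shown as a single collapsed entry
    // that can later be 'zoomed' into. The threshold is an assumption.
    void presentableItems(const Node& n, std::vector<const Node*>& out,
                          int collapseThreshold = 8)
    {
        for (const Node& c : n.children) {
            bool container = (c.tag == "ul" || c.tag == "ol" ||
                              c.tag == "table" || c.tag == "div");
            if (container && descendantCount(c) > collapseThreshold)
                out.push_back(&c);            // collapsed: one entry only
            else if (c.children.empty())
                out.push_back(&c);            // leaf element
            else
                presentableItems(c, out, collapseThreshold);  // descend
        }
    }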
The experimental interface has been developed as an extension for the
Mozilla Firefox web browser [2], and uses BRLTTY [3] for Braille communication
and Orca [4] for speech synthesis. It uses JavaScript to analyse the
page structure and coordinate gaze-interaction in real-time. Communication
with the Braille display (including input polling) is performed via a separate
Java application.

6.3 Haptic output

A number of modes of haptic output were trialled during experimentation, in-
cluding glove-based electro-tactile stimulation, vibro-tactile actuators, wire-
less TENS patches and refreshable Braille displays. The following sections
discuss the merits and shortcomings of each system.

6.3.1 Electro-tactile stimulation

The wired and wireless TENS interfaces developed for mobile ENVS usage
were also able to be used for stationary perception of virtual environments.
This interface proved effective in experimentation and allowed the user’s
fingers to be free to use the keyboard (see Figure 6.5). However, being
physically connected to the TENS unit proved inconvenient for general use.

[2] http://www.mozilla.com/firefox/
[3] http://mielke.cc/brltty/
[4] http://live.gnome.org/Orca

Figure 6.5: Wired TENS system

6.3.2 Vibro-tactile interface

Although the TENS interface is completely painless, it still requires wire-
less TENS electrodes to be placed on the skin in a number of places which
can be inconvenient. To overcome this problem, and to trial another mode
of haptic communication, a vibro-tactile keyboard interface was proposed,
as illustrated in Figure 6.6. This device integrates vibro-tactile actuators,
constructed from speakers capable of producing vibration output of the fre-
quency and amplitude specified by the system, analogous to the TENS pulse-
train output.
This system has clear advantages over the TENS interface. Firstly, the
user is not “attached” to the interface and can move around as they please.
Furthermore, no TENS electrodes need to be worn, and users are generally
more comfortable with the idea of vibration feedback than electrical stimu-
lation.
Whilst this interface was found to be capable of delivering a wide range
of sensations, the range and differentiability of TENS output was superior.
Furthermore, the TENS interface allows users to simultaneously perceive and
use the keyboard, whilst the vibro-tactile keyboard would require movement
of the fingers between the actuators and the keys.

Figure 6.6: Vibro-tactile keyboard design

6.3.3 Refreshable Braille display

Experimentation was also performed to assess the potential of refreshable
Braille displays for haptic perception. This revolved mainly around a Pa-
penmeier BRAILLEX EL 40s [67] as seen in Figure 6.7. It consists of 40
eight-dot Braille cells, each with an input button above, a scroll button at
either end of the cell array, and an “easy access bar” (joystick-style bar)
across the front of the device. This device was found to be quite versatile,
and capable of varying the “refresh-rate” up to 25Hz.

Figure 6.7: Papenmeier BRAILLEX EL 40s refreshable Braille display

A refreshable Braille display can be used in a similar fashion to the TENS
and electro-tactile output arrays for providing perception of adjacent ele-
ments. Each Braille cell has a theoretical output resolution of 256 differen-
tiable pin combinations. Given that the average user’s finger width occupies
two to three Braille cells, multiple adjacent cells can be combined to further
increase the per-finger resolution.
Whilst a blind user’s highly tuned haptic senses may be able to differen-
tiate so many different dot-combinations, sighted researchers have significant
difficulty doing so without extensive training. The preliminary experimen-
tation adopted simple “glyphs” for fast and intuitive perception. Figure 6.8
shows some example glyphs representing HTML elements for web page per-
ception.

Figure 6.8: Example glyphs – link, text, text

A further advantage of using a Braille display is the ability to display
element details using traditional Braille text. Suitably trained users are
able to quickly read Braille text rather than listening to synthetic speech
output. Experimentation has shown that using half the display for element-
type perception using glyphs and the other half for instantaneous reading of
further details of the central element using Braille text is an effective method
of quickly scanning web pages and other interfaces (see Figure 6.9).

Figure 6.9: Braille text displaying details of central element

The Papenmeier “easy access bar” has also proven to be a valuable asset
for interface navigation. In the prototype browser, vertical motions allow
the user to quickly “zoom” in or out of element groups (as described in
Section 6.2.1), and horizontal motions allow the display to toggle between
“perception mode” and “reading” mode once an element of significance has
been discovered.

6.4 Results

This research involved devising and testing a number of human–computer
interaction paradigms capable of enabling the two-dimensional screen inter-
face to be perceived without use of the eyes. These systems involve head-pose
tracking for obtaining the gaze position on a virtual screen, and various meth-
ods of receiving haptic feedback for interpreting screen content at the gaze
position.
Preliminary experimental results have shown that using the head as an
interface pointing device is an effective means of selecting screen regions
for interpretation, and for manipulating screen objects without use of the
eyes. When combined with haptic feedback, a user is able to perceive the
location and approximate dimensions of the virtual screen, as well as the
approximate locations of objects located on the screen after briefly browsing
over the screen area.
The use of haptic signal intensity to perceive window edges and their layer
is also possible to a limited extent with the TENS interface. After continued
use, users were able to perceive objects on the screen without any use of the
eyes, differentiate between files, folders and controls based on their signal
frequency, locate specific items and drag and drop items into open windows.
With practice, users were also able to operate pull-down menus and move
and resize windows without sight.
The interpretation of screen objects involves devising varying haptic feed-
back signals for identifying different screen objects. Learning to identify var-
ious screen elements based on their haptic feedback proved time consuming
on all haptic feedback devices. However, this learning procedure can be facil-
itated by providing speech or Braille output to identify elements when they
are ‘gazed’ at for a brief period.
As far as screen element interpretation was concerned, haptic feedback
via the Braille display surpassed the TENS and vibro-tactile interfaces. This
was mainly because the pictorial nature of glyphs used is more intuitive to
inexperienced users. It is also possible to encode more differentiable elements
by using two Braille cells per finger. The advantages of the Braille interface
would be presumably even more pronounced for users with prior experience
in Braille usage.
Preliminary experiments with the haptic web browser also demonstrated
promising results. For example, users were given the task of using a search
engine to find the answer to a question without sight. They showed that they
were able to locate the input form element with ease and enter the search
keywords. They were also able to locate the search results, browse over them
and navigate to web pages by clicking on links at the gaze position. Further-
more, users could describe the layout of unfamiliar web pages according to
where images, text, links, etc. were located.

6.5 Conclusions

This work presents a novel, haptic head-pose tracking computer interface that
enables the two-dimensional screen interface to be perceived and accessed
without any use of the eyes.
Three haptic output paradigms were tested, namely: TENS, vibro-tactile
and a refreshable Braille display. All three haptic feedback methods proved
effective to varying degrees. The Braille interface provided greater versatility
in terms of rapid identification of screen objects. The TENS system provided
improved perception of depth (for determining window layers). The vibro-
tactile output proved convenient but with limited resolution.
Preliminary experimental results have demonstrated that considerable
screen-based interactivity is able to be achieved with haptic gaze-tracking
systems including point-and-click and drag-and-drop manipulation of screen
objects. The use of varying haptic feedback can also allow screen objects at
the gaze position to be identified and interpreted. Furthermore, the prelim-
inary experimental results using the haptic web browser demonstrate that
this means of interactivity holds potential for improved human–computer
interactivity for the blind.

7 Head-Pose Tracking

Providing consistent, effective and intuitive perception of virtual environ-
ments, using the interface described in the previous chapter, requires accurate
detection and tracking of the user’s “head pose”. Since none of the available
head-pose tracking systems were found to be suitable for this application,
customised systems were built, as explained below.

7.1 Simple, robust and accurate head-pose tracking using a single camera

Tracking the position and orientation of the head in real time is finding
increasing application in avionics, virtual reality, augmented reality, cine-
matography, computer games, driver monitoring and user interfaces for the
disabled. Although many head-pose tracking systems and techniques have
been developed, existing systems either added considerable complexity and
cost, or were not accurate enough for the application. For example, systems
described in [30], [38] and [64] use feature detection and tracking to mon-
itor the position of the eyes, nose and/or other facial features in order to
determine the orientation of the head. Unfortunately these systems require
considerable processing power, additional hardware or multiple cameras to
detect and track the facial features in 3D space. Although monocular sys-
tems (like [30], [38] and [92]) can reduce the cost of the system, they generally
performed poorly in terms of accuracy when compared with stereo or multi-
camera tracking systems [64]. Furthermore, facial feature tracking methods
introduce inaccuracies and the need for calibration or training into the sys-
tem due to the inherent image processing error margins and diverse range of
possible facial characteristics of different users.
To avoid the cost and complexity of facial feature tracking methods a
number of head-pose tracking systems have been developed that track LEDs
or infrared reflectors mounted on the user’s helmet, cap or spectacles (see
[63], [17], [18], and [29] ). However the pointing accuracy of systems utilising
reflected infrared light [63] was found to be insufficient for this research.

The other LED-based systems, like [17], [18], and [29], still require multiple
cameras for tracking the position of the LEDs in 3D space which adds cost
and complexity to the system as well as the need for calibration.
In order to track head-pose with high accuracy whilst minimising cost
and complexity, methods were researched for pinpointing the position of in-
frared LEDs using an inexpensive USB camera and efficient algorithms for
estimating the 3D coordinates of the LEDs based on known geometry. The
system is comprised of a single low-cost USB camera and a pair of specta-
cles fitted with three battery-powered LEDs concealed within the spectacle
frame. Judging by the experimental results, the system appears to be the
most accurate low-cost head-pose tracking system developed to date. Fur-
thermore, it is robust and requires no calibration. Experimental results are
provided, demonstrating head-pose tracking accurate to within 0.5◦ when the
user is within one meter of the camera.

7.1.1 Hardware

The prototype infrared LED-based head-pose tracking spectacles are shown in
Figure 7.1a. Figure 7.1b shows the experimental rig, which incorporates a
laser pointer (mounted below the central LED) for testing the ‘gaze’ accuracy.
The baseline distance between the outer LEDs is 147mm; the perpendicular
distance of the front LED from the baseline is 42mm.

Figure 7.1: Infrared LED hardware – (a) prototype LED spectacles, (b) LED testing rig

Although the infrared light cannot be seen with the naked eye, the LEDs
appear quite bright to a digital camera. The experiments were carried out
using a low-cost, standard ‘Logitech QuickCam Express V-UH9’ USB cam-
era [1], providing a maximum resolution of 640x480 pixels with a horizontal
lens angle of approximately 35◦ . The video data captured by this camera
is quite noisy, compared with more expensive cameras, though this proved
useful for testing the robustness of the system. Most visible light was filtered
out by fitting the lens with a filter comprising several layers of fully-exposed
colour photographic negative. Removal of the camera’s internal infrared fil-
ter was found to be unnecessary. This filtering, combined with appropriate
adjustments of the brightness, contrast and exposure settings of the camera,
allowed the raw video image to be completely black, with the infrared LEDs
appearing as bright white points of light. Consequently the image processing
task is simplified considerably.
The requirement of the user to wear a special pair of spectacles may
appear undesirable when compared to systems which use traditional image
processing to detect facial features. However, the advantage of being a ro-
bust, accurate and low-cost system which is independent of individual facial
variations, plus the elimination of any training or calibration procedures can
outweigh any inconvenience caused by wearing special spectacles. Further-
more, the LEDs and batteries could be mounted on any pair of spectacles,
headset, helmet, cap or other head-mounted accessory, provided that the
geometry of the LEDs is entered into the system.

[1] http://www.logitech.com/en-us/support/webcams/legacy-devices/3403

7.1.2 Processing

The data processing involved in the system comprises two stages:

1. determining the two-dimensional LED image blob coordinates, and

2. the projection of the two-dimensional points into three-dimensional
   space to derive the real-world locations of the LEDs in relation to the
   camera.

Blob tracking

Figure 7.2a shows an example raw video image of the infrared LEDs which
appear as three white blobs on a black background.
The individual blobs are detected by scanning the image for contigu-
ous regions of pixels over an adjustable brightness threshold. Initially, the
blobs were converted to coordinates simply by calculating the centre of the
bounding-box; however the sensitivity of the three-dimensional transforma-
tions to even single-pixel changes proved this method to be unstable and
inaccurate. Consequently a more accurate method was adopted — calculat-
ing the centroid of the area using the intensity-based weighted average of
the pixel coordinates, as illustrated in Figure 7.2b. This method provides a
surprisingly high level of accuracy even with low-resolution input and distant
LEDs.

Figure 7.2: Infrared blob-tracking – (a) raw video input (showing the infrared LEDs at close range – 200mm), (b) example LED blob (with centroid marked) and corresponding intensity data
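
A condensed C++ sketch of the centroid calculation is given below. It assumes an 8-bit greyscale image stored row-major and a list of pixel coordinates already identified as belonging to one blob; the thresholding and contiguous-region scan are omitted.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Pixel { int x, y; };

    // Sub-pixel blob centre: average of member pixel coordinates weighted by
    // their brightness. 'image' is 8-bit greyscale, row-major, 'width' wide.
    std::pair<double, double> weightedCentroid(const std::vector<Pixel>& blob,
                                               const std::vector<uint8_t>& image,
                                               int width)
    {
        double sumX = 0.0, sumY = 0.0, sumW = 0.0;
        for (const Pixel& p : blob) {
            double w = image[p.y * width + p.x];
            sumX += w * p.x;
            sumY += w * p.y;
            sumW += w;
        }
        if (sumW == 0.0) return {0.0, 0.0};   // degenerate (all-black) blob
        return {sumX / sumW, sumY / sumW};
    }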
Head-pose calculation

Once the two-dimensional blob coordinates have been calculated, the points
must be projected back into three-dimensional space in order to recover the
original LED positions. Solving this problem is not straightforward. Fig-
ure 7.3 illustrates the configuration of the problem. The camera centre (C)
is the origin of the coordinate system, and it is assumed to be facing directly
down the z-axis. The ‘gaze’ of the user is projected onto a ‘virtual screen’
which is also centred on the z-axis and perpendicular to it. The dimensions
and z-translation of the virtual screen are controllable parameters and do not
necessarily have to correspond with a physical computer screen, particularly
for blind users and virtual reality applications. In fact, the virtual screen
can be easily transformed to any size, shape, position or orientation relative
to the camera. Figure 7.3 also displays the two-dimensional image plane,
scaled for greater visibility. The focal length (z) of the camera is required
to perform the three-dimensional calculations. The LED points are labelled
L, R and F (left, right and front respectively, ordered from the camera’s
point of view). Their two-dimensional projections onto the image plane are
labelled l, r and f . L, R and F must lie on vectors from the origin through
their two-dimensional counterparts.

Figure 7.3: Perspective illustration of the virtual screen (located at the cam-
era centre), the 2D image plane, the 3D LED model and its projected ‘gaze’

Given knowledge of this model, the exact location of the LEDs along the
projection rays can be determined. The front LED is equidistant to the outer
LEDs, thus providing Equation 7.1.

d(L, F ) = d(R, F ) (7.1)

The ratio r between these distances and the baseline distance is also known.

d(L, F ) = rd(L, R) (7.2)

These constraints are sufficient for determining a single solution orien-
tation for the model. Once the orientation has been calculated, the exact
physical coordinates of the points can be derived, including the depth from
the camera, by utilising the model measurements (provided in Section 7.1.1).
The distance of the model from the camera is irrelevant for determining
the model’s orientation, since it can simply be scaled in perspective along the
projection vectors. Thus it is feasible to fix one of the points at an arbitrary
location along its projection vector, calculate the corresponding coordinates
of the other two points, and then scale the solution to its actual size and
distance from the camera.
Parametric equations can be used to solve the problem. Thus the position
of point L is expressed as:
L_x = t l_x    (7.3a)

L_y = t l_y    (7.3b)

L_z = t z    (7.3c)

Since z is the focal length, a value of 1 for the parameter t will position L
on the image plane.
Thus there are only three unknowns — the three parameters of the LED
points on their projection vectors. In fact one of these unknowns is elimi-
nated, since the location of one of the points can be fixed — in this solution,
the location of R was fixed at depth R_z = z, thus making its x- and y-
coordinates equal to r_x and r_y respectively.
The position of the point F is expressed as:

F_x = u f_x    (7.4a)

F_y = u f_y    (7.4b)

F_z = u z    (7.4c)

Substituting these six parametric coordinate equations for L and F into
Equation 7.1 yields:

\sqrt{(t l_x - u f_x)^2 + (t l_y - u f_y)^2 + (t z - u z)^2} = \sqrt{(r_x - u f_x)^2 + (r_y - u f_y)^2 + (z - u z)^2}    (7.5)

which can be rewritten as:

u(t) = \frac{z^2(t^2 - 1) + l_x^2 t^2 + l_y^2 t^2 - r_x^2 - r_y^2}{2\left(z^2(t - 1) + l_x f_x t + l_y f_y t - r_x f_x - r_y f_y\right)}    (7.6)

Figure 7.4: Relationship between parameters t and u (front-point parameter u plotted against the left-point parameter t)

Figure 7.4 shows a plot of Equation 7.6. It should be noted that the asymp-
tote is at:
t = \frac{r_x f_x + r_y f_y + z^2}{l_x f_x + l_y f_y + z^2}    (7.7)

and that the function has a root after the asymptote.


Now the point on the front-point projection vector which is equidistant
to L and R can be calculated, given a value for t. Of course, not all of these
points are valid — the ratio constraint specified in Equation 7.2 must be
satisfied. Thus it is necessary to also calculate the dimensions of the triangle
formed by the three points and find the parameter values for which the ratio
matches the model.
The baseline distance of the triangle is given by Equation 7.8 and plotted
in Figure 7.5.

b(t) = \sqrt{(r_x - t l_x)^2 + (r_y - t l_y)^2 + (z - t z)^2}    (7.8)

Figure 7.5: Triangle baseline distance (plotted against the left-point parameter t)

The height of the triangle is given by:

h(t) = \sqrt{\left((u(t) f_x - t l_x)^2 + (u(t) f_y - t l_y)^2 + (u(t) z - t z)^2\right) - (b(t)/2)^2}    (7.9)

Figure 7.6: Triangle height (height from baseline to front-point, plotted against the left-point parameter t)

Figure 7.6 shows a plot of Equation 7.9. It should be noted that this
function, since it is dependent on u(t), shares the asymptote defined in Equa-
tion 7.7.

Figure 7.7: Triangle height/baseline ratio (plotted against the left-point parameter t)

At this stage the actual baseline distance or height of the triangle is not
relevant — only their relationship. Figure 7.7 shows a plot of h(t)/b(t). This
function has a near-invisible ‘hump’ just after it reaches its minimum value
after the asymptote (around t = 1.4 in this case). This graph holds the key
to the solution, and can reveal the value of t for which the triangle has a ratio
which matches the model. Unfortunately, it is too complex to be analytically
inverted, thus requiring root-approximation techniques to find the solution
values. Thankfully, the solution range can be further reduced by noting two
more constraints inherent in the problem.

Firstly, only solutions in which the head is facing toward the camera are
relevant. Rearward facing solutions are considered to be invalid as the user’s
head would obscure the LEDs. Thus the following constraint is added:

F_z < M_z    (7.10)

where M is the midpoint of line LR. This can be restated as:

u(t) f_z < (t l_z + z)/2    (7.11)

Figure 7.8: z-coordinates of F and M (plotted against the left-point parameter t)

Figure 7.8 shows the behaviour of the z-coordinates of F and M as t varies.
It can be seen that Equation 7.10 holds true only between the asymptote and
the intersection of the two functions. Thus these points form the limits of
the values for t which are of interest. The lower-limit allows disregarding
all values of t less than the asymptote, while the upper-limit crops the ratio
function to avoid problems with its ‘hump’. What remains is a well-behaved,
continuous piece of curve on which to perform root approximation.
The domain could be further restricted by noting that not only rearward-

facing solutions are invalid, but also solutions beyond the rotational range
of the LED configuration; that is, the point at which the front LED would
occlude one of the outer LEDs. The prototype LED configuration allows
rotation (panning) of approximately 58◦ to either side before this occurs.
The upper-limit (intersection between the Fz and Mz functions) can be
expressed as:
t \le \frac{-S - \sqrt{-4(-l_x^2 - l_y^2 + l_x f_x + l_y f_y)(r_x^2 + r_y^2 - r_x f_x - r_y f_y) + S^2}}{2(-l_x^2 - l_y^2 + l_x f_x + l_y f_y)}    (7.12)

where S = f_x (l_x - r_x) + f_y (l_y - r_y).


Note that this value is undefined if l_x and l_y are both zero (l is at the
origin) or one of them is zero and the other is equal to the corresponding
f coordinate. This follows from the degeneracy of the parametric equations
which occurs when the projection of one of the control points lies on one or
both of the x- and y-axes. Rather than explicitly detecting this problem and
solving a simpler equation for the specific case, all two-dimensional coordi-
nates were instead jittered by a very small amount so that they will never
lie on the axes.
The root approximation domain can be further reduced by noting that all
parameters should be positive so that the points cannot appear behind the
camera. Note that the positive root of Equation 7.6 (illustrated in Figure 7.4)
is after the asymptote. Since u must be positive, this root can be used as
the new lower-limit for t. Thus the lower-limit is now:
t \ge \sqrt{\frac{r_x^2 + r_y^2 + z^2}{l_x^2 + l_y^2 + z^2}}    (7.13)

Figure 7.9: Triangle height/baseline ratio with the upper and lower limits for root approximation (plotted against the left-point parameter t)

Figure 7.9 illustrates the upper and lower limits for root-approximation in
finding the value of t for which the triangle ratio matches the model geometry.
Once t has been approximated, u can be easily derived using Equation 7.6,
and these parameter values substituted into the parametric coordinate equa-
tions for L and F . Thus the orientation has been derived. The solution can
now be simply scaled to the appropriate size using the dimensions of the
model. This provides accurate three-dimensional coordinates for the model
in relation to the camera. Thus the user’s ‘gaze’ (based on head-orientation)
can be projected onto a ‘virtual screen’ positioned relative to the camera.

7.1.3 Experimental results

Even using as crude a method of root-approximation as the bisection method,
the prototype system implemented in C++ on a 1.3GHz Pentium processor
took less than a microsecond to perform the entire three-dimensional trans-
formation, from two-dimensional coordinates to three-dimensional head-pose
coordinates. The t parameter was calculated to 10-decimal-place precision,
in approximately 30 bisection iterations.
To test the accuracy of the system, the camera was mounted in the centre
of a piece of board measuring 800x600mm. A laser-pointer was mounted just
below the centre LED position to indicate the ‘gaze’ position on the board.
The system was tested over a number of different distances, orientations and
video resolutions. The accuracy was monitored over many frames in order to
measure the system’s response to noise introduced by the dynamic camera
image. Table 7.1 and Figure 7.10 report the variation in calculated ‘gaze’
x- and y-coordinates when the position of the spectacles remained static.
Note that this variation increases as the LEDs are moved further from the
camera, because the resolution effectively drops as the blobs become smaller
(see Table 7.2). This problem could be avoided by using a camera with optical
zoom capability providing the varying focal length could be determined.

Resolution      |        320x240 pixels          |        640x480 pixels
Distance (mm)   |  500    1000    1500    2000   |  500    1000    1500    2000
Avg. x-error    |  0.09◦  0.29◦   0.36◦   1.33◦  |  0.08◦  0.23◦   0.31◦   0.98◦
Max. x-error    |  0.13◦  0.40◦   0.57◦   2.15◦  |  0.12◦  0.34◦   0.46◦   1.43◦
Avg. y-error    |  0.14◦  0.32◦   0.46◦   2.01◦  |  0.10◦  0.20◦   0.38◦   1.46◦
Max. y-error    |  0.22◦  0.46◦   0.69◦   2.86◦  |  0.15◦  0.29◦   0.54◦   2.15◦

Table 7.1: Horizontal and vertical ‘gaze’ angle (degrees) resolution

Figure 7.10: ‘Gaze’ angle resolution graphs (horizontal and vertical angular error in degrees versus distance from the camera in mm, at 320x240 and 640x480 pixels)

Resolution       |  500mm   1000mm   1500mm   2000mm
640x480 pixels   |  20      13       10       8
320x240 pixels   |  7       5        4        3

Table 7.2: LED ‘blob’ diameters (pixels) at different resolutions and camera distances

To ascertain the overall accuracy of the system’s ‘gaze’ calculation, the
LEDs were aimed at fixed points around the test board using the laser
pointer, and the calculated gaze coordinates were compared over a number of
repetitions. The test unit’s base position, roll, pitch and yaw were modified
slightly between readings to ensure that whilst the laser gaze position was
the same between readings, the positions of the LEDs were not. The aver-
ages and standard deviations of the coordinate differences were calculated,
and found to be no greater than the variations caused by noise reported in
Table 7.1 and Figure 7.10 at the same distances and resolutions. Conse-
quently it can be deduced that the repeatability accuracy of the system is
approximately equal to, and limited by, the noise introduced by the sensing
device.

As an additional accuracy measure, the system’s depth resolution was
measured at a range of distances from the camera. As with the ‘gaze’ reso-
lution, the depth resolution was limited by the video noise. In each case, the
spectacles faced directly toward the camera. These results are tabulated in
Table 7.3.

Distance (mm)                |  500       1000     1500    2000
Accuracy at 320x240 pixels   |  ±0.3mm    ±2mm     ±5mm    ±15mm
Accuracy at 640x480 pixels   |  ±0.15mm   ±1.5mm   ±3mm    ±10mm

Table 7.3: Distance-from-camera calculation resolution

7.1.4 Summary

The experimental results demonstrate that the proposed LED-based head-
pose tracking system is very accurate considering the quality of the camera
used for the experiments. At typical computer operating distances the ac-
curacy is within 0.5◦ using an inexpensive USB camera. If longer range
or higher accuracy is required a higher quality camera could be employed.
The computational cost is also extremely low, at less than one microsecond
processing time per frame on an average personal computer for the entire
three-dimensional calculation. The system can therefore easily keep up with
whatever frame rate the video camera is able to deliver. The system is inde-
pendent of the varying facial features of different users, needs no calibration
and is immune to changes in illumination. It even works in complete dark-
ness. This is particularly useful for human–computer interface applications
involving blind users as they have little need to turn on the room lights.

Other applications include scroll control of head-mounted virtual reality dis-
plays, or any application where the head position and orientation is to be
monitored.

7.2 Time-of-flight camera technology

The infrared LED tracking solution indeed proved highly accurate and ex-
tremely low-cost; however, requiring the user to wear tracking spectacles was
far from ideal. A more recent development in sensor technology presented
an opportunity to maintain the high level of accuracy without requiring a
tracking target with known geometry: time-of-flight cameras.

7.2.1 The SwissRanger

In 2006, Swiss company MESA Imaging announced the release of the SR-
3000 “SwissRanger” time-of-flight camera [61]. The camera (pictured in
Figure 7.11) is surrounded by infrared LEDs which illuminate the scene,
and allows the depth of each pixel to be measured based on the time of
arrival of the frequency modulated infrared light in real-time. Thus, for each
frame it is able to provide a depth map in addition to a standard greyscale
amplitude image (see examples in Figure 7.12). The amplitude image is
based on reflected infrared light, and therefore is not affected by external
lighting conditions.

Figure 7.11: SwissRanger SR-3000

Figure 7.12: Sample amplitude image and corresponding depth map – (a) SR-3000 amplitude image, (b) depth map

Despite the technological breakthrough that the SwissRanger has pro-
vided, it has a number of limitations. The sensor is QCIF (Quarter Common
Intermediate Format, 176x144 pixels), so the resolution of the data is low.
The sensor also has a limited ‘non-ambiguity range’ before the signals get
out of phase. At the standard 20MHz frequency, this range is 7.5 metres.
However, given the comparatively short-range nature of the application, this
limitation does not pose a problem for the head-pose tracking system. The
main limitation of concern is noise associated with rapid movement. The
SR-3000 sensor is controlled as a so-called one-tap sensor. This means that
in order to obtain distance information, four consecutive exposures have to
be performed. Fast moving targets in the scene may therefore cause errors
in the distance calculations; see [62]. Whilst the accuracy of a depth map
of a relatively stationary target is quite impressive (see Figure 7.13a), the
depth map of a target in rapid motion is almost unusable by itself (see Fig-
ure 7.13b). This problem may be overcome to a considerable extent by using
a combination of median filtering, time fusion and by combining the intensity
image data with the depth map.
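
For reference, the quoted non-ambiguity range follows from the standard phase-based time-of-flight relation (general background rather than anything specific to the SR-3000 documentation):

    d_{\max} = \frac{c}{2 f_{\mathrm{mod}}} = \frac{3 \times 10^{8}\ \mathrm{m/s}}{2 \times 20 \times 10^{6}\ \mathrm{Hz}} = 7.5\ \mathrm{m}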

Figure 7.13: SwissRanger point clouds – (a) stationary subject, (b) subject in motion

7.2.2 Overview

The system is required to be able to track an arbitrary human face. The
diverse range of faces, hairstyles and accessories makes this task consider-
ably difficult. A feature-based approach would require the availability of at
least three identifiable features in order to obtain an unambiguous three-
dimensional orientation. For example, locating the eyes and nose would be
sufficient to calculate the orientation. However the user’s eyes may not al-
ways be visible to the camera, especially if the user is wearing sunglasses or
spectacles. The ears could be used, however they may not be visible if the
user has long hair. Both the eyes and ears could also be partially occluded
from the camera when the user is looking away from the camera. The edges of
the mouth could be used, though they can change with the facial expression,
and could be obscured by facial hair. Likewise the chin or jaw-line (which are
very easily detected in a depth map) cannot be guaranteed to be available on
all users. Furthermore, a user may have a combination of feature-obscuring
characteristics (e.g. long hair, facial hair, glasses, etc.).
Consequently, the most universally available and identifiable feature on
the human face is the nose, for a number of reasons. Firstly, it is rarely
obscured. Secondly, if it is occluded the user can be assumed to be facing
away from the camera. Thirdly, it is advantageously positioned near the
centre of the face on an approximate axis of symmetry. Other researchers
also consider the nose to be important in facial tracking, e.g. [23, 91], and
have devised systems to reliably detect the nose in both amplitude and depth
images [23, 26]. The ‘tip’ of the nose (furthest protrusion from face) is
considered to be the most important point in this system.
Although the nose can be considered to be the best facial feature to
track due to its availability and universality, more than this single feature
is needed to obtain the orientation of the face in three-dimensional space.
One approach would be to use an algorithm such as Iterative Closest Point
(ICP) [2] to match the facial model obtained in the previous frame with the
current frame. This method may work but is expensive to do in real time. It
may also fail if the head moves too quickly or if some frames are noisy and
the initial fit is a considerable distance from optimal.
Alternatively, an adaptive feature selection algorithm could be formu-
lated which automatically detects identifiable features within the depth map
or amplitude image (or ideally a combination of the two) in order to de-
tect additional features which can be used to perform matching. Here, a
redundant set of features could theoretically provide the ability to match the
orientation between two models with high accuracy. In practice however,
the low resolution of the SwissRanger camera combined with the noisy na-
ture of the depth map have caused this approach to prove unsuccessful. The
features obtained in such an adaptive feature selection algorithm would also
need to be coerced to conform to the target spatial configuration. To over-
come these difficulties, a novel approach was developed that simplified the
feature selection process whilst simultaneously removing the need for spatial
coercion.
The premise is relatively simple: with only one feature so far (i.e. the
nose tip) more are required, preferably of a known spatial configuration. In-
tersecting the model with a sphere of radius r centred on the nose feature
results in an intersection profile containing all points on the model which
are r units away from the central feature, as shown in Figure 7.14. Because
of the spherical nature of the intersection, the resulting intersection pro-
file is completely unaffected by the orientation of the model, and thus ideal
for tracking purposes. It could simply be analysed for symmetry, if it were
safe to assume that the face is sufficiently symmetrical and that the central
feature lies on the axis of symmetry, and an approximate head-pose could
be calculated based on symmetry alone. However, given that many human
noses are far from symmetrical, and up to 50% of the face may be occluded
due to rotation, this approach will not always succeed. But if the model is
saved from the first frame, spherical intersections can be used to match it
against subsequent frames and thus obtain the relative positional and rota-
tional transformation. Multiple spherical intersections can be performed to
increase the accuracy of the system.

Figure 7.14: Illustration of tracing a spherical intersection profile starting
from the nose tip
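
A simplified C++ sketch of extracting such a profile from a point cloud is shown below; it merely collects the points whose distance from the nose tip falls within a thin shell around the chosen radius, whereas the actual system traces an ordered profile over the mesh surface as illustrated in Figure 7.14. The shell tolerance is an assumption.

    #include <cmath>
    #include <vector>

    struct Point3 { double x, y, z; };

    // Collect the cloud points lying (approximately) on the sphere of radius
    // 'radius' centred on the nose tip. 'shell' is the tolerance either side
    // of the radius; both values are illustrative.
    std::vector<Point3> sphericalIntersection(const std::vector<Point3>& cloud,
                                              const Point3& noseTip,
                                              double radius,
                                              double shell = 0.002)
    {
        std::vector<Point3> profile;
        for (const Point3& p : cloud) {
            double dx = p.x - noseTip.x;
            double dy = p.y - noseTip.y;
            double dz = p.z - noseTip.z;
            double d = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (std::fabs(d - radius) <= shell)
                profile.push_back(p);
        }
        return profile;
    }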

Subsequently, the orientation matching problem is now reduced to that
of aligning paths on spherical surfaces. This can be performed using ICP
or a similar alignment optimisation algorithm. A good initial fit can be
performed regardless of the orientation by simply optimising the alignment
of the latitudinal extrema of the profiles (i.e. the top-most and bottom-most
points). These should always be present, because at least 50% of the face is
visible. If not, their absence can be easily detected by the fact that they lie on
the end-points of the path, thus indicating that the latitudinal traversal was
cut short. The latitudinal extrema are reliable features of the intersection
profile due to the ‘roll’ rotational limits of the human head.
After having matched a subsequent 3D model to the position and orienta-
tion of the previous one, additional data becomes available. If the orientation
of the face has changed, some regions that were previously occluded (e.g. the
far-side of the nose) may now be visible. Thus, merging the new 3D data
into the existing 3D model improves the accuracy of subsequent tracking.
Consequently, with every new matched frame, more data is added to the
target 3D model making it into a continuously evolving mesh.
To prevent the data comprising the 3D model from becoming too large,
the use of polygonal simplification algorithms is proposed to adaptively re-
duce the complexity of the model. If a sufficient number of frames indicate
that some regions of the target model are inaccurate, points can be adap-
tively moulded to match the majority of the data, thus filtering out noise,
occlusion, expression changes, etc. In fact, regions of the model can be iden-
tified as being rigid (reliably robust) or noisy / fluid (such as hair, regions
subject to facial expression variation, etc.) and appropriately labelled. Thus,
matching can be performed more accurately by appropriately weighting the
robust areas of the 3D model.
This approach to head pose tracking depends heavily on the accuracy of
the estimation of the initial central feature point (i.e. the nose tip). If this
is offset, the entire intersection profile changes. Fortunately, the spherical
intersection profiles themselves can be used to improve the initial central
point position. By using a hill climbing approach in three-dimensional space,
the central point can be adjusted slightly in each direction to check for a more
accurate profile match. This will converge upon the best approximation of
the target centre point provided the initial estimate is relatively close to the
optimal position.
Furthermore, the system can be used to differentiate or identify users.
Each evolving mesh can be stored in a database, and a new model can be
generated for a face which does not sufficiently match any existing models.
Due to the simplified nature of spherical intersection profile comparisons,
a database of faces can be searched with considerable efficiency. Spherical
intersection profiles for each model could also be cached for fast searching.

7.2.3 Preprocessing

Several steps are used to prepare the acquired data for processing.

Median Filtering

As discussed in Section 7.2.1, the SwissRanger depth map is subject to
considerable noise, particularly if the subject is in motion. Median filtering is
applied to reduce the effects of noise in the depth map. The amount of noise
in the depth map is also measured to identify frames which are likely to
produce inaccurate results due to excessive noise.
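For illustration, a minimal sketch of this preprocessing step is given below.
Python/NumPy is used purely for exposition; the window size and the particular
noise metric (mean absolute difference between the raw and filtered maps) are
assumptions for this sketch, not necessarily the implementation used in this work.

import numpy as np
from scipy.ndimage import median_filter

def denoise_depth(depth, window=3):
    # Median-filter a depth map and estimate its noise level. The noise
    # estimate is the mean absolute difference between the raw and filtered
    # maps; frames whose estimate exceeds a chosen limit can be flagged as
    # too noisy to produce reliable tracking results.
    filtered = median_filter(depth, size=window)
    noise_level = float(np.mean(np.abs(depth - filtered)))
    return filtered, noise_level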

Distance and amplitude thresholds

Some cross-correlation with the amplitude image can help eliminate erro-
neous data in the depth map. For example, pixels in the depth map which
correspond to zero values in the amplitude image (most frequently found
around the edges of objects) are likely to have been affected by object
motion, and can be filtered out. Minimum and maximum distance thresholds
can also be applied to the depth map to eliminate objects in the foreground
or background.
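A hedged sketch of this cross-correlation and thresholding step is shown below;
the threshold values and array layout are illustrative assumptions only.

import numpy as np

def threshold_depth(depth, amplitude, min_d=0.3, max_d=1.5):
    # Invalidate pixels with zero amplitude (typically motion artefacts at
    # object edges) or depths outside the expected working range.
    valid = (amplitude > 0) & (depth >= min_d) & (depth <= max_d)
    cleaned = np.where(valid, depth, 0.0)   # zero marks an invalid pixel
    return cleaned, valid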

Region of interest

Whilst the user’s torso can be of some value to a head-pose tracking ap-
plication, it is generally desirable to perform calculations based solely upon
the facial region of the images. Identifying the facial region robustly is non-
trivial. The initial prototype system performed accurate jaw-line detection
based on the greatest discontinuity in each column of the depth map in order
to separate the head from the body. This approach sometimes failed due
to extreme head rotation, excessive noise, presence of facial hair, occlusion,
etc. Consequently, a more robust approach was developed using a simple
bounding-box. Given that a depth map is available, it is straightforward to
determine the approximate distance of the user from the camera. This is
achieved by sampling the first n non-empty rows in the depth map (the top
of the user’s head) and then calculating the average depth. By taking the
approximate distance of the user’s head, and anthropometric statistics [69],
the maximum number of rows the head is likely to occupy within the images
can be determined. The centroid of the depth map pixels within these rows
is then calculated, and a region of interest of the appropriate dimensions
is centred on this point (see Figure 7.15). This method has proved 100%
reliable in all sample sequences recorded to date.
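A sketch of this bounding-box procedure follows. The anthropometric constants,
the pinhole-projection conversion via an assumed focal length in pixels, and
the parameter names are illustrative assumptions rather than the values used
in the implementation described above.

import numpy as np

HEAD_HEIGHT_M = 0.24   # assumed average head height
HEAD_WIDTH_M = 0.18    # assumed average head width

def face_region_of_interest(depth, focal_px, n_top_rows=5):
    # Rows containing any valid depth data (0 marks an invalid pixel).
    nonempty = np.where((depth > 0).any(axis=1))[0]
    top_rows = nonempty[:n_top_rows]                 # top of the user's head
    head_depth = depth[top_rows][depth[top_rows] > 0].mean()

    # Convert head dimensions from metres to pixels at this distance.
    box_h = int(focal_px * HEAD_HEIGHT_M / head_depth)
    box_w = int(focal_px * HEAD_WIDTH_M / head_depth)

    # Centroid of valid pixels within the rows the head is likely to occupy.
    rows, cols = np.nonzero(depth[top_rows[0]:top_rows[0] + box_h] > 0)
    centre_r = top_rows[0] + int(rows.mean())
    centre_c = int(cols.mean())
    return (centre_r - box_h // 2, centre_c - box_w // 2, box_h, box_w)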

Figure 7.15: Example facial region-of-interest

7.2.4 Nose tracking

Once the acquired data has been preprocessed, the central feature must be
found in order to localise the models in preparation for orientation calcu-
lations. The rationale for choosing the nose tip as the central feature was
discussed in Section 7.2.2. As Gorodnichy points out [23], the nose tip can
be robustly detected in an amplitude image by assuming it is surrounded
by a spherical Lambertian surface of constant albedo. However, using a
SwissRanger sensor provides an added advantage. Since the amplitude image
is illuminated solely by the integrated infrared light source, calculating
complex reflectance maps to handle differing angles of illumination is unnec-
essary. Additional data from the depth map such as proximity to camera and
curvature (see Figure 7.16) can also be used to improve the search and assist
with confirming the location of the nose tip. Figure 7.17 shows a typical
frame with nose localisation data overlaid. Preliminary results have shown
that this approach is fast and robust enough to locate the nose within typical
frame sequences with sufficient accuracy for this application.
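A possible realisation of this candidate search is sketched below; the equal
weighting of the three cues is an assumption made for illustration, as the
exact combination used is not specified here.

import numpy as np

def locate_nose(depth, amplitude, curvature, roi):
    # Score pixels in the facial region: the nose tip reflects the camera's
    # own infrared source strongly, is usually the closest facial point, and
    # exhibits high local curvature in the depth map.
    r0, c0, h, w = roi
    d = depth[r0:r0 + h, c0:c0 + w]
    a = amplitude[r0:r0 + h, c0:c0 + w]
    k = curvature[r0:r0 + h, c0:c0 + w]

    def norm(x):
        rng = np.ptp(x)
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    score = norm(a) + norm(-d) + norm(k)   # bright, near and highly curved
    score[d == 0] = -np.inf                # ignore invalid pixels
    r, c = np.unravel_index(np.argmax(score), score.shape)
    return r0 + r, c0 + c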

Figure 7.16: Amplitude image with curvature data overlaid. Greener pixels
indicate higher curvature (calculated from depth map).

Figure 7.17: Sample frame with nose-tracking data displayed. Green pixels
are candidate nose pixels; red cross indicates primary candidate.

7.2.5 Finding orientation

Once the central feature (nose tip) has been located, the dimensionality of
the problem has been reduced considerably by removing the translational
component of the transformation. Now a single rotation about a specific
axis in three-dimensional space (passing through the central feature point)
will be sufficient to match the orientation of the current models to the saved
target. Small errors in the initial location of the nose point can be iteratively
improved using three-dimensional hill-climbing to optimise the matching of
the spherical intersection profiles, as discussed in Section 7.2.2.

Spherical intersection algorithm

The intersection of a three-dimensional mesh with an arbitrary sphere might
sound like an expensive operation to perform repeatedly; however, the algo-
rithm achieves this very efficiently (see Algorithm 1). In essence, it traverses
the depth map from the centre point in the direction of the facial centroid
until it finds a pair of projected pixels spanning the sphere boundary. It
then adds the interpolated intersection point to a vector (SIP ) and con-
tinues traversing the depth map along the intersection boundary until the
end-points are found or the loop is closed. The average execution time of the
algorithm on the sample sequence shown in the figures and Video 1 [57] on
a dual-core 1.8GHz processor was 140µs. The interpolation allows the pro-
files to be calculated with sub-pixel accuracy. Super-sampling of four-pixel
groups was used to reduce the influence of noise.

Algorithm 1 Find intersection profile of projected depth map with sphere of
radius radius centred on pixel projected from depthMap[noseRow][noseCol]
r ⇐ noseRow, c ⇐ noseCol
if noseCol > faceCentroidCol then
  direction ⇐ LEFT
else
  direction ⇐ RIGHT
end if
found ⇐ false {No intersection found yet}
centre ⇐ projectInto3d(depthMap, r, c)
inner ⇐ centre
while r and c are within region of interest, and not found do
  (r, c) ⇐ translate(r, c, direction)
  outer ⇐ projectInto3d(depthMap, r, c)
  if distance(outer, centre) > radius then
    found ⇐ true
  else
    inner ⇐ outer
  end if
end while
if not found then
  return No intersection
end if

Algorithm 1 (continued)
for startDirection = UP to DOWN do
  direction ⇐ startDirection
  while Loop not closed do
    (r2, c2) ⇐ translate(inner.r, inner.c, direction)
    inner2 ⇐ projectInto3d(depthMap, r2, c2)
    (r2, c2) ⇐ translate(outer.r, outer.c, direction)
    outer2 ⇐ projectInto3d(depthMap, r2, c2)
    if inner2 or outer2 are invalid then
      break
    else if distance(inner2, centre) > radius then
      outer ⇐ inner2
    else if distance(outer2, centre) < radius then
      inner ⇐ outer2
    else
      inner ⇐ inner2
      outer ⇐ outer2
    end if
    id ⇐ distance(inner, centre)
    t ⇐ (radius − id)/(distance(outer, centre) − id)
    if startDirection = UP then
      Append (inner + (outer − inner) × t) to SIP
    else
      Prepend (inner + (outer − inner) × t) to SIP
    end if
    Update direction
  end while
end for
return SIP

Figure 7.18: Example spherical intersection profiles overlaid on depth (top-
left), amplitude (bottom-left) and 3D point cloud (right).

Profile matching algorithm

Implementation of a full profile matching algorithm should be considered for
future research. The gaze calculation in Video 1 [57] is a crude approximation
based on a least-squares fit of a line through the central feature and the
point cloud formed by taking the midpoint of the latitudinal extrema of each
intersection profile. Yet it can be seen that even this provides a relatively
accurate gaze estimate.
It is envisaged that the latitudinal extrema of each profile would be used
only to provide an initial alignment for each profile pair, after which a new
algorithm will measure the fit and optimise it in a manner similar to ICP [2].
The fit metric provided by this algorithm could be used in the hill-climbing-
based optimisation of the central point (nose tip).
Ideally, each profile pair should yield the same three-dimensional axis and
magnitude of rotation required to align the model. The resultant collection of
rotational axes therefore provides a redundant approximation of the rotation
required to align the entire model; it can be analysed to remove outliers and
then averaged to produce the best approximation of the overall transformation.
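One way of realising the per-pair rotation estimate, assuming the two profiles
have already been resampled into corresponding point pairs and expressed
relative to the nose tip, is a standard SVD-based orthogonal fit (essentially a
single ICP iteration with known correspondences). This is a sketch under those
assumptions, not the full matching algorithm proposed above for future work.

import numpy as np

def profile_rotation(src, dst):
    # Least-squares rotation (Kabsch method) mapping profile src onto dst.
    # src, dst: (N, 3) arrays of corresponding points, both relative to the
    # central feature, so no translation term is required.
    h = src.T @ dst
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflection
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def rotation_axis_angle(r):
    # Convert a rotation matrix to (axis, angle) so the per-profile results
    # can be screened for outliers and averaged.
    angle = np.arccos(np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([r[2, 1] - r[1, 2], r[0, 2] - r[2, 0], r[1, 0] - r[0, 1]])
    n = np.linalg.norm(axis)
    return (axis / n if n > 0 else axis), angle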

7.2.6 Building a mesh

Once the current frame has been aligned with the target 3D model, any addi-
tional information can be used to evolve the 3D model. For example, regions
which were originally occluded in the target model may now be visible, and
can be used to extend the model. Thus a near-360◦ model of the user’s head
can be obtained over time. Adjacent polygons with similar normals can be
combined to simplify the mesh. Areas with higher curvature can be sub-
divided to provide greater detail. For each point in the model, a running
average of the closest points in the frames is maintained. This can be used
to push or pull points which stray from the average and allow the mesh to
‘evolve’ and become more accurate with each additional frame. The contri-
bution of a given frame to the running averages is weighted by the quality
of that frame, which is simply a measure of depth map noise combined with
the mean standard deviation of the intersection profile rotational transforms.
Furthermore, a running standard deviation can be maintained for each point
to allow the detection of noisy regions of the model, such as hair, glasses
which might reflect the infrared light, facial regions subject to expression
variation, etc. These measures of rigidity can then be used to weight regions
of the intersection profiles to make the algorithm tolerant of fluid regions,
and able to focus on the rigid regions. Non-rigid regions on the extremities
of the model can be ignored completely. For example, the neck will show up
as a non-rigid region due to the fact that its relation to the face changes as
the user’s head pose varies.
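A weighted running-statistics sketch of this per-point bookkeeping is given
below (a Welford-style update; the exact update rule and the mapping from
variance to rigidity weight are assumptions for illustration only).

import numpy as np

class PointStatistics:
    # Weighted running mean and variance for every vertex of the evolving
    # mesh: low variance marks rigid regions, high variance marks fluid ones.

    def __init__(self, n_points):
        self.weight = np.zeros(n_points)
        self.mean = np.zeros((n_points, 3))
        self.m2 = np.zeros(n_points)      # weighted sum of squared deviations

    def update(self, points, frame_quality):
        # Fold one aligned frame (n_points x 3) in, weighted by its quality.
        self.weight += frame_quality
        delta = points - self.mean
        self.mean += (frame_quality / self.weight[:, None]) * delta
        self.m2 += frame_quality * np.einsum('ij,ij->i', delta, points - self.mean)

    def rigidity_weights(self):
        # Map per-point variance to a (0, 1] weight: rigid -> 1, fluid -> 0.
        var = self.m2 / np.maximum(self.weight, 1e-9)
        return 1.0 / (1.0 + var)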

7.2.7 Facial recognition

Having obtained a three-dimensional model of the user’s face in the process
of eliminating the requirement for infrared-LED tracking spectacles, the op-
portunity to extend the spherical-intersection-profile research to facial recog-
nition could not be ignored. Facial recognition could also be utilised in the
virtual-environment perception system for automatic loading of user prefer-
ences or authentication purposes in a multi-user environment.

Initial alignment

The most obvious method of calculating the orientation is to find the centroid
of each spherical intersection profile, and then project a line through the cen-
troids from the centre of the spheres using a least-squares fit. This provides
a reasonable approximation in most cases but performs poorly when the face
orientation occludes considerable regions of the spherical intersection profiles
from the camera. A spherical intersection profile which is 40% occluded will
produce a poor approximation of the true centroid and orientation vector
using this method.
Instead, the system finds the average of the latitudinal extrema of each
spherical intersection profile (i.e. the topmost and bottommost points). This
proved effective over varying face orientations for two reasons. Firstly, these
points are unlikely to be occluded due to head rotation. Secondly, in most
applications the subject’s head is unlikely to ‘roll’ much (as opposed to ‘pan’
and ‘tilt’), so these points are likely to be the true vertical extremities of
the face. If the head is rotated so far that these points are not visible on
the spherical intersection profile, the system detects that the spherical inter-
section profile is still rising/falling at the upper/lower terminal points and
therefore dismisses it as insufficient. A least-squares fit of a vector from the
nose tip passing through the latitudinal extrema midpoints provides a good
initial estimate of the face orientation. Several further optimisations are sub-
sequently performed by utilising additional information, as discussed in the
following sections.
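A compact sketch of this initial least-squares fit is shown below; the
camera-axis convention used to orient the resulting vector is an assumption.

import numpy as np

def initial_gaze_direction(nose_tip, extrema_midpoints):
    # Best-fit line through the nose tip and the latitudinal-extrema
    # midpoints: the first right singular vector of the (uncentred) offsets
    # maximises the sum of squared projections, i.e. the least-squares
    # direction of a line constrained to pass through the nose tip.
    offsets = np.asarray(extrema_midpoints) - np.asarray(nose_tip)
    _, _, vt = np.linalg.svd(offsets)
    direction = vt[0]
    if direction[2] > 0:          # orient towards the camera (assumed -z axis)
        direction = -direction
    return direction / np.linalg.norm(direction)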

Symmetry optimisation

Given that human faces tend to be highly symmetric, the orientation of a
“faceprint” can be optimised by detecting the plane of bilateral symmetry.
Note that this does not require the subject’s face to be perfectly symmetric
in order to be recognised. The optimisation is performed by first transforming
the faceprint using the technique described above to produce a good orientation
estimate. Given the observation of limited roll discussed in Section 7.2.7, it
is reasonable to assume that the orientation plane will intersect the faceprint
approximately vertically. The symmetry of the faceprint is then measured
using Algorithm 2. This algorithm returns a set of midpoints for each spher-
ical intersection profile, which can be used to measure the symmetry of the
current faceprint orientation. Algorithm 2 also provides an appropriate trans-
formation which can be used to optimise the symmetry by performing a least-
squares fit to align the plane of symmetry approximated by the midpoints.
Figure 7.19 illustrates an example faceprint with symmetry midpoints visible.

Algorithm 2 Calculate the symmetry of a set of SIPs
midpoints ⇐ new Vector[length(SIPs)]
for s ⇐ 0 to length(SIPs) − 1 do
  for p ⇐ 0 to length(SIPs[s]) − 1 do
    other ⇐ ∅
    Find the other point on SIPs[s] at the same ‘height’ as p:
    for q ⇐ 0 to length(SIPs[s]) − 1 do
      next ⇐ q + 1
      if next >= length(SIPs[s]) then
        next ⇐ 0 {Wrap}
      end if
      if q = p ∨ next = p then
        continue
      end if
      if (q.y < p.y < next.y) ∨ (q.y > p.y > next.y) then
        other ⇐ interpolate(q, next)
        break
      end if
    end for
    if other ≠ ∅ then
      midpoints[s].append((p + other)/2)
    end if
  end for
end for
return midpoints

Figure 7.19: Example faceprint with symmetry midpoints (in yellow), and
per-sphere midpoint averages (cyan)

Temporal optimisation

Utilising data from more than one frame provides opportunities to increase
the quality and accuracy of the faceprint tracking and recognition. This is
achieved by maintaining a faceprint model over time, where each new frame
contributes to the model by an amount weighted by the quality of the current
frame compared to its predecessors. The quality of a frame is assessed using
two parameters. Firstly, the noise in the data is measured during the median
filtering process mentioned in Section 7.2.3. This is an important parameter,
as frames captured during fast motions will be of substantially lower quality
due to the sensor issues discussed in Section 7.2.1. In addition, the measure
of symmetry of the faceprint (see Section 7.2.7) provides a good assessment
of the quality of the frame. These parameters are combined to create an
estimate of the overall quality of the frame using Equation 7.14.

quality = √(noiseFactor × symmetryFactor)        (7.14)

Thus, for each new frame captured, the accuracy of the stored faceprint
model can be improved. The system maintains a collection of the n high-
est quality faceprints captured up to the current frame, and calculates the
optimal average faceprint based on the quality-weighted average of those n
frames. This average faceprint quickly evolves as the camera captures the
subject in real-time, and forms a robust and symmetric representation of the
subject’s 3D facial profile (see Figure 7.20).
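The frame-selection and averaging step can be sketched as follows; the buffer
size and data layout are illustrative assumptions, while the quality score is
taken directly from Equation 7.14.

import numpy as np

class AverageFaceprint:
    # Keep the n highest-quality frames seen so far and expose their
    # quality-weighted average faceprint.

    def __init__(self, n_best=10):
        self.n_best = n_best
        self.frames = []                        # list of (quality, faceprint)

    @staticmethod
    def quality(noise_factor, symmetry_factor):
        return np.sqrt(noise_factor * symmetry_factor)       # Equation 7.14

    def add_frame(self, faceprint, noise_factor, symmetry_factor):
        q = self.quality(noise_factor, symmetry_factor)
        self.frames.append((q, np.asarray(faceprint, dtype=float)))
        self.frames.sort(key=lambda f: f[0], reverse=True)
        del self.frames[self.n_best:]           # discard the worst frames

    def average(self):
        weights = np.array([q for q, _ in self.frames])
        stack = np.stack([fp for _, fp in self.frames])
        return np.average(stack, axis=0, weights=weights)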

Figure 7.20: Faceprint from noisy sample frame (white) with running-average
faceprint (red) superimposed

Temporal information can also be gained by analysing the motion of the
subject’s head-pose over a sequence of frames. The direction, speed and
acceleration of the motion can be analysed, and used to predict the nose
position, head-pose and the resultant ‘gaze’ position for the next frame. This
is particularly useful when the system is used as a ‘gaze’-tracker as the system
can utilise this data to smooth the ‘gaze’ path and further eliminate noise.
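A simple predictor consistent with this description is quadratic extrapolation
from the three most recent samples; the form of the predictor actually used is
not specified here, so this is an illustrative sketch only.

import numpy as np

def predict_next(p_prev2, p_prev1, p_current):
    # Constant-acceleration extrapolation of a tracked quantity (e.g. the
    # nose position or 'gaze' point): p_next = p + v + a, where v and a are
    # the first and second differences of the last three samples.
    velocity = p_current - p_prev1
    acceleration = (p_current - p_prev1) - (p_prev1 - p_prev2)
    return p_current + velocity + acceleration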

Comparing faceprints

Using the averaging technique discussed in the previous section, a single
faceprint can be maintained for each subject. These are stored by the system
and accessed when identifying a subject, which requires efficient and accurate
comparison of faceprints. In order to standardise the comparison of faceprints
and increase the efficiency, quantisation is performed prior to storage and
comparison. Each spherical intersection profile is sampled at a set number of
angular divisions, and the resultant interpolated points form the basis of the
quantised faceprint. Thus each faceprint point has a direct relationship to the
corresponding point in every other faceprint. Comparison of faceprints can
then be simplified to the average Euclidean distance between corresponding
point pairs.
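A sketch of the quantisation and comparison steps follows; it assumes each
profile has already been transformed into a face-aligned coordinate frame with
the nose tip at the origin, and the angular sampling scheme is an illustrative
choice rather than the exact procedure used.

import numpy as np

def quantise_profile(profile_points, n_divisions=20):
    # Resample one spherical intersection profile at fixed angular divisions
    # so that every faceprint has the same number of points in the same order.
    rel = np.asarray(profile_points, dtype=float)
    angles = np.arctan2(rel[:, 1], rel[:, 0])
    order = np.argsort(angles)
    angles, rel = angles[order], rel[order]
    targets = np.linspace(-np.pi, np.pi, n_divisions, endpoint=False)
    quantised = np.empty((n_divisions, 3))
    for axis in range(3):
        quantised[:, axis] = np.interp(targets, angles, rel[:, axis],
                                       period=2 * np.pi)
    return quantised

def faceprint_distance(fp_a, fp_b):
    # Average Euclidean distance between corresponding quantised points.
    return float(np.mean(np.linalg.norm(fp_a - fp_b, axis=-1)))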

Figure 7.21: Example faceprints

Facial recognition results

Figure 7.21 shows some actual 2D transforms of faceprints taken from differ-
ent subjects. Although the system is yet to be tested with a large sample of
subjects, the preliminary results suggest that faceprints derived from spher-
ical intersection profiles vary considerably between individuals.
The number of spherical intersection profiles comprising faceprints, their
respective radii and the number of quantised points in each spherical inter-
section profile are important parameters that define how a faceprint is stored
and compared. Experimentation has shown that increasing the number of
spherical intersection profiles beyond six and the number of quantised points
in each spherical intersection profile beyond twenty did not significantly im-
prove the system’s ability to differentiate between faceprints. This appears
to be due mainly to the continuous nature of the facial landscape and residual
noise in the processed data. Consequently, the experiments were conducted
with faceprints comprised of six spherical intersection profiles. Each spher-
ical intersection profile was divided into twenty points. All faceprints were
taken from ten human subjects positioned approximately one metre from the
camera and at various angles.
Experiments showed that 120-point faceprints provided accurate recognition
of the test subjects from most viewpoints. Processing and comparing two
faceprints took an average of 6.7µs on an Intel dual-core 1.8GHz Centrino
processor. This equates to a comparison rate of almost 150,000 faceprints per
second, which clearly demonstrates the potential search speed of the system.
The execution time of each procedure (running on the same processor) involved
in capturing faceprints from the SR3000 camera is outlined in Table 7.4.

Processing Stage Execution Time
Distance Thresholding 210µs
Brightness Thresholding 148µs
Noise Filtering 6,506µs
Region of Interest 122µs
Locate Nose 17,385µs
Trace Profiles 1,201µs
Quantise Faceprint 6,330µs
Facial Recognition 91µs
Update Average Faceprint 694µs
Total per frame 32,687µs

Table 7.4: Average execution times for each processing stage

Figure 7.22: Example ‘gaze’ projected onto a ‘virtual screen’

7.3 Conclusions

To enable the users’ head-pose to be tracked with sufficient accuracy for
controlling the perception region, several head-pose tracking systems were
developed. The simple single-camera-based system for tracking three infrared
LEDs demonstrated unparalleled accuracy (within 0.5◦) compared to existing
systems.
The research involving time-of-flight camera technology prompted the
development of novel spherical-intersection-profile tracking, which eliminates
the requirement for the user to wear tracking LEDs. It also opened up
promising avenues for further research into facial recognition, with impressive
preliminary results.

8
Conclusions

This research sought to determine methods of utilising new technologies to
provide enhanced perception for the visually impaired. It was based upon
the hypothesis that designing, building and testing devices for perception of
physical and virtual environments, using novel applications of technology,
would determine whether or not such technologies are likely to be of use in
assisting visually-impaired people. The work was focussed around four areas
identified as possessing significant potential:

• The use of electro-neural stimulation as a navigational aid

• Perception of colour information via haptic feedback

• GPS navigation using haptic signals

• Haptic interaction with the spatial layout of software graphical user interfaces

and specifically the use of the head for controlling the region of interest and
electro-neural stimulation for haptic communication.

8.1 Achievements

A number of devices were designed, constructed and tested. The preliminary
experimental results were generally positive, both validating the hypothesised
potential of the research avenues and paving the way for future research and
development.
The primary discoveries of the research are enumerated below:

1. Mapping of range data to electro-neural pulses of proportional intensity, delivered to fingers or other areas (see Section 3.1)

2. Use of an infrared sensor array to obtain one-to-one range-electrode mappings and eliminate vision processing overheads and inaccuracies (see Section 3.2)

3. Electro-neural presentation of GPS data (see Section 5)

4. Morse-code style communication of landmark identifiers via electro-neural pulses (see Section 5.1)

5. 360◦ perception protocol for communication of landmark bearings (see Section 5.2)

6. Communication of colour information via electro-neural frequency mapping (see Section 4.1)

7. Wireless electro-neural patches for convenience, flexibility, and broadened configuration possibilities for number of regions and arrangement of electrodes (see Section 3.3)

8. Electro-neural perception of a computer desktop environment, including basic manipulation of graphical user interfaces (see Section 6.2)

9. Comparison of vibro-tactile perception to electro-neural stimulation for perceiving virtual environments (see Section 6.3.2)

10. Enabling head-pose-based interaction with virtual environments (see Section 7)

11. Head-pose tracking with a single camera (LED-based, see Section 7.1)

12. Head-pose tracking with a time-of-flight camera (see Section 7.2.2)

13. Facial recognition with a time-of-flight camera (see Section 7.2.7)

14. Head-pose-based haptic perception of the spatial layout of web pages (see Section 6.2.1)

15. Head-pose-based spatial interface perception using a refreshable Braille display (see Section 6.3.3)

8.2 Suggestions for further research

As can be seen by the breadth of the achievement list above, most avenues
of research and experimentation in these areas open up several more exciting
possibilities.

8.2.1 Research already commenced

Many others have followed this research and already built upon it. In 2004,
the ENVS pioneered research in this field and was the only device of its kind
for a number of years. In 2010, Pradeep et al. developed an ENVS-like
device using stereo cameras and vibration motors mounted on the shoulders
and waist [70]. In 2011, Mann et al. built a device [53] very similar to the
original ENVS prototype, but utilising a Microsoft Kinect¹ and an array of
vibro-tactile sensors mounted within a helmet. Several others (for example
[16], [68] and [50]) have also continued research into stereo-matching based
navigation aids. Reviews such as [51] and [12] also note the significance of
the ENVS.
Very little research has been conducted into the communication of colour
information via haptic feedback. As noted in [37], at the time of publication
of [59] there were only a couple of references to the mapping of colours to
textures in the literature. Since then researchers such as [74] have explored
the concept a little further based on this work, but there is great scope for
further research.

¹ http://www.xbox.com/en-AU/Kinect

8.2.2 Other areas

It would be of great interest to determine the differences in experimental
results achieved by visually impaired users rather than blindfolded sighted
users. It has been hypothesised that heightened non-visual sensory skills
would allow the system to be utilised to greater effect, but this has yet to be
proven. The differences between results for users who were born blind and
those who have previously experienced sight would also be of great value.
Theoretically, the latter would be more akin to blindfolded sighted subjects,
whereas comprehension of the visual mapping paradigms employed may be
more difficult for those who have never experienced vision.
The development of custom infrared sensors which perform better for
the application in both indoor and outdoor environments was proposed in
Section 3.2.
Whilst this research found the vibro-tactile interface proposed in Sec-
tion 6.3.2 to be inferior to the TENS and Braille communication, the devel-
opment of the full keyboard device may be worthy of exploration.
Section 7.2.5 highlighted the potential for developing a full faceprint
matching algorithm which ought to greatly outperform the crude approx-
imations employed in the preliminary experimentation.

8.3 Closing remarks

All four of the target research areas were explored extensively, and each re-
sulted in the successful design, development and testing of devices to assist
the visually impaired. Use of the head for controlling the environmental sen-
sors (or the region of interest within the virtual environment) indeed proved
natural and intuitive, harnessing the existing design within the human body
for controlling visual receptors. Electro-neural stimulation was shown to be
highly effective in providing rich haptic feedback using intensity, frequency,
placement and pulse-encoding. Areas of difficulty and limitations were dis-
covered, as well as areas of great potential for future development. Thus, in
a way, this research has acted as a navigational aid for research in this area –
exposing obstacles, and illuminating navigable paths in the direction of the
destination: use of technology to not only replace the missing visual sense,
but provide capabilities above and beyond the standard human apparatus.

Bibliography

[1] S. Bauer, J. Wasza, K. Muller, and J. Hornegger. 4d photogeometric


face recognition with time-of-flight sensors. In Applications of Computer
Vision (WACV), 2011 IEEE Workshop on, pages 196–203. IEEE, 2011.

[2] P. J. Besl and H. D. McKay. A method for registration of 3-D


shapes. IEEE Transactions on pattern analysis and machine intelli-
gence, 14(2):239–256, 1992.

[3] A. Bleiweiss and M. Werman. Robust head pose estimation by fus-


ing time-of-flight depth and color. In Multimedia Signal Processing
(MMSP), 2010 IEEE International Workshop on, pages 116–121. IEEE,
2010.

[4] G. Bologna, B. Deville, J. Diego Gomez, and T. Pun. Toward local


and global perception modules for vision substitution. Neurocomputing,
74(8):1182–1190, 2011.

[5] N. Bourbakis. Sensing surrounding 3-d space for navigation of the


blind. Engineering in Medicine and Biology Magazine, IEEE, 27(1):49–
55, 2008.

[6] D. C. Chang, L. B. Rosenberg, and J. R. Mallett. Design of force
sensations for haptic feedback computer interfaces, 2010. US Patent
7,701,438.

[7] V. G. Chouvardas, A. N. Miliou, and M. K. Hatalis. Tactile displays:


Overview and recent advances. Displays, 29(3):185–194, 2008.

[8] S. Chumkamon, P. Tuvaphanthaphiphat, and P. Keeratiwintakorn. A


blind navigation system using rfid for indoor environments. In Electrical
Engineering/Electronics, Computer, Telecommunications and Informa-
tion Technology, 2008. ECTI-CON 2008. 5th International Conference
on, volume 2, pages 765 –768, May 2008.

[9] Johnny Chung. Nintendo wii projects, 2012. http://johnnylee.net/


projects/wii/.

[10] P. Costa, H. Fernandes, V. Vasconcelos, P. Coelho, J. Barroso, and


L. Hadjileontiadis. Fiducials marks detection to assist visually impaired
people navigation. International Journal of Digital Content Technology
and its Applications, 5(5), 2011.

[11] P. Costa, H. Fernandes, V. Vasconcelos, P. Coelho, J. Barroso, and


L. Hadjileontiadis. Landmarks detection to assist the navigation of vi-
sually impaired people. Human-Computer Interaction. Towards Mobile
and Intelligent Interaction Environments, pages 293–300, 2011.

[12] D. Dakopoulos and N. G. Bourbakis. Wearable obstacle avoidance elec-


tronic travel aids for blind: a survey. Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, 40(1):25–35,
2010.

[13] W. H. Dobelle. Artificial vision for the blind by connecting a televi-


sion camera to the visual cortex. ASAIO journal (American Society for
Artificial Internal Organs : 1992), 46(1):3–9, 2000.

[14] O. Ebers, T. Ebers, M. Plaue, T. Radüntz, G. Bärwolff, and


H. Schwandt. Study on three-dimensional face recognition with
continuous-wave time-of-flight range cameras. Optical Engineering,
50(6), 2011.

[15] S. Ertan, C. Lee, A. Willets, H. Tan, and A. Pentland. A wearable haptic


navigation guidance system. In Wearable Computers, 1998. Digest of
Papers. Second International Symposium on, pages 164 –165, 1998.

[16] S. Fazli and H. Mohammadi. Collision-free navigation for blind persons
using stereo matching. International Journal of Scientific and Engi-
neering Research, Volume 2, December 2011. http://www.ijser.org/
viewPaperDetail.aspx?DEC1171.

[17] M. Foursa. Real-time infrared tracking system for virtual environments.


In VRCAI ’04: Proceedings of the 2004 ACM SIGGRAPH international
conference on Virtual Reality continuum and its applications in industry,
pages 427–430, New York, NY, USA, 2004. ACM Press.

[18] Eric Foxlin, Yury Altshuler, Leonid Naimark, and Mike Harrington. A
3d motion and structure estimation algorithm for optical head tracker
system. In ISMAR ’04: Proceedings of the Third IEEE and ACM In-
ternational Symposium on Mixed and Augmented Reality (ISMAR’04),
pages 212–221, Washington, DC, USA, 2005. IEEE Computer Society.

[19] Freedom Scientific. Job Access With Speech (JAWS), 2012. http:
//www.freedomscientific.com/products/fs/jaws-product-page.
asp.

[20] J. P. Fritz and K. E. Barner. Design of a haptic data visualization system


for people with visual impairments. Rehabilitation Engineering, IEEE
Transactions on, 7(3):372 –384, 1999.

[21] R. M. Gibson, S. G. McMeekin, A. Ahmadinia, L. M. Watson, N. C.


Strang, and V. Manahilov. Augmented reading utilizing edge detection
for the visually impaired. PGBIOMED 2011, page 3, 2011.

[22] J. V. Gomeza and F. E. Sandnesb. Roboguidedog: Guiding blind users


through physical environments with laser range scanners. DSAI 2012.
Elsevier B.V., 2012.

[23] Dmitry O. Gorodnichy. On importance of nose for face tracking. In


Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE
International Conference on, pages 181–186, May 2002.

[24] R. Gouzman, I. Karasin, and A. Braunstein. The virtual touch system


by virtouch ltd: Opening new computer windows graphically for the
blind. In Proceedings of” Technology and Persons with Disabilities”
Conference, Los Angeles March, pages 20–25, 2000.

[25] GW Micro. Window-Eyes, 2012. http://www.gwmicro.com/


Window-Eyes/.

[26] M. Haker, M. Bohme, T. Martinetz, and E. Barth. Geometric invariants
for facial feature tracking with 3D TOF cameras. Signals, Circuits and
Systems, 2007. ISSCS 2007. International Symposium on, 1:1–4, July
2007.

[27] N. Haubner, U. Schwanecke, R. Dörner, S. Lehmann, and J. Luder-


schmidt. Recognition of dynamic hand gestures with time-of-flight cam-
eras. In Proceedings of ITG/GI Workshop on Self-Integrating Systems
for Better Living Environments, volume 2010, pages 33–39, 2010.

[28] A. Helal, S. E. Moore, and B. Ramachandran. Drishti: an integrated


navigation system for visually impaired and disabled. In Wearable Com-
puters, 2001. Proceedings. Fifth International Symposium on, pages 149
–156, 2001.

[29] S. K. Hong and C. G. Park. A 3D Motion and Structure Estimation


Algorithm for Optical Head Tracker System. American Institute of Aero-
nautics and Astronautics: Guidance, Navigation, and Control Confer-
ence and Exhibit, 2005.

[30] T. Horprasert, Y. Yacoob, and L. S. Davis. Computing 3-d head orien-


tation from a monocular image sequence. In Proceedings of the Second
International Conference on Automatic Face and Gesture Recognition,
pages 242–247, 1996.

[31] S. K. Horschel, B. T. Gleeson, and W. R. Provancher. A fingertip shear


tactile display for communicating direction cues. In EuroHaptics con-
ference, 2009 and Symposium on Haptic Interfaces for Virtual Environ-
ment and Teleoperator Systems. World Haptics 2009. Third Joint, pages
611 –612, March 2009.

[32] E. Hossain, M. R. Khan, R. Muhida, and A. Ali. State of the art


review on walking support system for visually impaired people. Inter-
national Journal of Biomechatronics and Biomedical Robotics, 1(4):232–
251, 2011.

[33] R. G. Hughes and A. R. Forrest. Perceptualisation using a tactile mouse.


In Visualization ’96. Proceedings., pages 181–188, 1996.

[34] Y. Ikei, K. Wakamatsu, and S. Fukuda. Texture display for tactile


sensation. Advances in human factors/ergonomics, pages 961–964, 1997.

[35] D. J. Jacques, R. Rodrigo, K. A. McIsaac, and J. Samarabandu. An


application framework for measuring the performance of a visual servo
control of a reaching task for the visually impaired. In Systems, Man
and Cybernetics, 2007. ISIC. IEEE International Conference on, pages
894–901. IEEE, 2007.

[36] K. A. Kaczmarek, J. G. Webster, P. Bach-y Rita, and W. J. Tompkins.


Electrotactile and vibrotactile displays for sensory substitution systems.
Biomedical Engineering, IEEE Transactions on, 38(1):1–16, 1991.

[37] Kanav Kahol, Jamieson French, Laura Bratton, and Sethuraman Pan-
chanathan. Learning and perceiving colors haptically. In Proceedings
of the 8th international ACM SIGACCESS conference on Computers
and accessibility, Assets ’06, pages 173–180, New York, NY, USA, 2006.
ACM. http://doi.acm.org/10.1145/1168987.1169017.

[38] J. Y. Kaminski, M. Teicher, D. Knaan, and A. Shavit. Head orientation


and gaze detection from a single image. In Proceedings of International
Conference Of Computer Vision Theory And Applications, 2006.

[39] L. Kaminski, R. Kowalik, Z. Lubniewski, and A. Stepnowski. “Voice
Maps” - portable, dedicated GIS for supporting the street navigation and
self-dependent movement of the blind. In Information Technology
(ICIT), 2010 2nd International Conference on, pages 153–156, June
2010.

[40] Yoshihiro Kawai and Fumiaki Tomita. Interactive tactile display sys-
tem: a support system for the visually disabled to recognize 3d objects.
In Assets ’96: Proceedings of the second annual ACM conference on
Assistive technologies, pages 45–50, New York, NY, USA, 1996. ACM.

[41] L Kay. Auditory perception of objects by blind persons, using a bioa-


coustic high resolution air sonar. Journal of the Acoustical Society of
America, 107(6):3266–3275, 2000.

[42] A. Khan, F. Moideen, J. Lopez, W. Khoo, and Z. Zhu. Kindectect:


kinect detecting objects. Computers Helping People with Special Needs,
pages 588–595, 2012.

[43] K. Konolige. Small vision systems: Hardware and implementation. In


Robotics Research – International Symposium, volume 8, pages 203–212.
MIT PRESS, 1998.

[44] Rebecca L. Koslover, Brian T. Gleeson, Joshua T. de Bever, and


William R. Provancher. Mobile navigation using haptic, audio, and
visual direction cues with a handheld test platform. Haptics, IEEE
Transactions on, 5(1):33 –38, 2012.

[45] Jonathan Lazar, Aaron Allen, Jason Kleinman, and Chris Malarkey.
What frustrates screen reader users on the web: A study of 100
blind users. International Journal of Human-Computer Interac-
tion, 22(3):247–269, 2007. http://www.tandfonline.com/doi/abs/
10.1080/10447310709336964.

[46] C. H. Lee, Y. C. Su, and L. G. Chen. An intelligent depth-based ob-


stacle detection for mobile applications. In Consumer Electronics-Berlin
(ICCE-Berlin), 2012 IEEE International Conference on, pages 223–225.
IEEE, 2012.

[47] J. H. Lee, Y. J. Lee, E. S. Lee, J. S. Park, and B. S. Shin. Adaptive


power control of obstacle avoidance system using via motion context for
visually impaired person. In Cloud Computing and Social Networking
(ICCCSN), 2012 International Conference on, pages 1–4. IEEE, 2012.

[48] P. Lieby, N. Barnes, C. McCarthy, Nianjun Liu, H. Dennett, J. G.


Walker, V. Botea, and A. F. Scott. Substituting depth for intensity
and real-time phosphene rendering: Visual navigation under low vision
conditions. In Engineering in Medicine and Biology Society,EMBC, 2011
Annual International Conference of the IEEE, pages 8017 –8020, 2011.

[49] T. Limna, P. Tandayya, and N. Suvanvorn. Low-cost stereo vision sys-


tem for supporting the visually impaired’s walk. In Proceedings of the
3rd International Convention on Rehabilitation Engineering & Assistive
Technology, page 4. ACM, 2009.

[50] Kai Wun Lin, Tak Kit Lau, Chi Ming Cheuk, and Yunhui Liu. A wear-
able stereo vision system for visually impaired. In Mechatronics and
Automation (ICMA), 2012 International Conference on, pages 1423 –
1428, 2012.

[51] Jihong Liu and Xiaoye Sun. A survey of vision aids for the blind. In In-
telligent Control and Automation, 2006. WCICA 2006. The Sixth World
Congress on, volume 1, pages 4312 –4316, 2006.

[52] Jin Liu, Jingbo Liu, Luqiang Xu, and Weidong Jin. Electronic travel
aids for the blind based on sensory substitution. In Computer Science
and Education (ICCSE), 2010 5th International Conference on, pages
1328 –1331, 2010.

[53] Steve Mann, Jason Huang, Ryan Janzen, Raymond Lo, Valmiki Ram-
persad, Alexander Chen, and Taqveer Doha. Blind navigation with a
wearable range camera and vibrotactile helmet. In Proceedings of the
19th ACM international conference on Multimedia, MM ’11, pages 1325–
1328, New York, NY, USA, 2011. ACM.

[54] D. Marr and T. Poggio. Cooperative computation of stereo disparity.


Science, 194(4262):283–287, 1976.

[55] T. Maucher, K. Meier, and J. Schemmel. An interactive tactile graphics


display. In Signal Processing and its Applications, Sixth International,
Symposium on. 2001, volume 1, pages 190–193 vol.1, 2001.

[56] Bernhard Mayerhofer, Bettina Pressl, and Manfred Wieser. Odilia - a


mobility concept for the visually impaired. In Proceedings of the 11th in-
ternational conference on Computers Helping People with Special Needs,
ICCHP ’08, pages 1109–1116, Berlin, Heidelberg, 2008. Springer-Verlag.
http://dx.doi.org/10.1007/978-3-540-70540-6_166.

[57] Simon Meers. Head-pose tracking with a time-of-flight camera: Demon-


stration video, 2008. http://acra08.simonmeers.com.

[58] Simon Meers and Koren Ward. A vision system for providing 3D percep-
tion of the environment via transcutaneous electro-neural stimulation.
In Information Visualisation, 2004. IV 2004. Proceedings. Eighth Inter-
national Conference on, pages 546–552, 2004.

[59] Simon Meers and Koren Ward. A vision system for providing the blind
with 3D colour perception of the environment. In Proceedings of the
Asia-Pacific Workshop on Visual Information Processing, December
2005.

[60] P. B. L. Meijer. An experimental system for auditory image representa-


tions. Biomedical Engineering, IEEE Transactions on, 39(2):112 –121,
1992.

[61] MESA Imaging. SwissRanger SR3000 - miniature 3D time-of-flight


range camera, 2006. http://www.mesa-imaging.ch/prodview3k.php.

[62] MESA Imaging. SwissRanger SR3000 manual, version 1.02, 2006.

[63] NaturalPoint Inc. TrackIR, 2006. http://www.naturalpoint.com/


trackir.

[64] R. Newman, Y. Matsumoto, S. Rougeaux, and A. Zelinsky. Real-time
stereo tracking for head pose and gaze estimation. In Proceedings. Fourth
IEEE International Conference on Automatic Face and Gesture Recog-
nition, pages 122–128, 2000.

[65] M. Niitsuma and H. Hashimoto. Comparison of visual and vibration dis-


plays for finding spatial memory in intelligent space. In Robot and Hu-
man Interactive Communication, 2009. RO-MAN 2009. The 18th IEEE
International Symposium on, pages 587–592. IEEE, 2009.

[66] Y. Ohmoto, H. Ohashi, D. Lala, S. Mori, K. Sakamoto, K. Kinoshita,


and T. Nishida. Icie: immersive environment for social interaction based
on socio-spacial information. In Technologies and Applications of Ar-
tificial Intelligence (TAAI), 2011 International Conference on, pages
119–125. IEEE, 2011.

[67] Papenmeier. BRAILLEX EL 40s, 2009. http://www.papenmeier.de/


rehatechnik/en/produkte/braillex_el40s.html.

[68] Thanathip Limna, Pichaya Tandayya, and Nikom Suvonvorn. Stereo
Vision Utilizing Parallel Computing for the Visually Impaired,
chapter 5. InTech, 2009. http://intechopen.com/books/
rehabilitation-engineering.

[69] Alan Poston. Department of defense human factors engineering technical


advisory group (dod hfe tag), 2000. Human Engineering Design Data
Digest – http://hfetag.dtic.mil/hfs_docs.html, Accessed June 24,
2008.

[70] V. Pradeep, G. Medioni, and J. Weiland. In Computer Vision and Pat-


tern Recognition Workshops (CVPRW), 2010 IEEE Computer Society
Conference on.

[71] J. Richardson, B. Zimmer, and E. W. Davison. Systems and methods


for using a movable object to control a computer, 2012. US Patent
8,179,366.

[72] David A. Ross and Alexander Lightman. Talking braille: a wireless


ubiquitous computing network for orientation and wayfinding. In Pro-
ceedings of the 7th international ACM SIGACCESS conference on Com-
puters and accessibility, Assets ’05, pages 98–105, New York, NY, USA,
2005. ACM. http://doi.acm.org/10.1145/1090785.1090805.

[73] R. Samra, D. Wang, and M. H. Zadeh. On texture perception in a haptic-
enabled virtual environment. In Haptic Audio-Visual Environments and
Games (HAVE), 2010 IEEE International Symposium on, pages 1 –4,
2010.

[74] H. N. Schwerdt, J. Tapson, and R. Etienne-Cummings. A color detection


glove with haptic feedback for the visually disabled. In Information
Sciences and Systems, 2009. CISS 2009. 43rd Annual Conference on,
pages 681 –686, March 2009.

[75] E. J. Shahoian and L. B. Rosenberg. Low-cost haptic mouse implemen-


tations, 2009. US Patent RE40,808.

[76] Huiying Shen and James M. Coughlan. Towards a real-time system


for finding and reading signs for visually impaired users. In Proceed-
ings of the 13th international conference on Computers Helping People
with Special Needs - Volume Part II, ICCHP’12, pages 41–47, Berlin,
Heidelberg, 2012. Springer-Verlag.

[77] Y. Shiizu, Y. Hirahara, K. Yanashima, and K. Magatani. The develop-


ment of a white cane which navigates the visually impaired. In Engineer-
ing in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual
International Conference of the IEEE, pages 5005 –5008, 2007.

[78] Kristen Shinohara and Josh Tenenberg. A blind person’s interactions


with technology. Commun. ACM, 52(8):58–66, 2009. http://doi.acm.
org/10.1145/1536616.1536636.

[79] P. M. Silva, T. N. Pappas, J. Atkins, J. E. West, and W. M. Hartmann.


Acoustic-tactile rendering of visual information. In Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Series, volume
8291, page 16, 2012.

[80] Calle Sjöström. Designing haptic computer interfaces for blind people.
In Proceedings of ISSPA 2001, pages 1–4, 2001.

[81] Mu-Chun Su, Yi-Zeng Hsieh, and Yu-Xiang Zhao. A simple approach
to stereo matching and its application in developing a travel aid for the
blind. In JCIS. Atlantis Press, 2006. http://dblp.uni-trier.de/db/
conf/jcis/jcis2006.html#SuHZ06.

[82] Hui Tang and D. J. Beebe. An oral tactile interface for blind navigation.
Neural Systems and Rehabilitation Engineering, IEEE Transactions on,
14(1):116 –123, 2006.

[83] YingLi Tian, Xiaodong Yang, Chucai Yi, and Aries Arditi. Toward a
computer vision-based wayfinding aid for blind persons to access unfa-
miliar indoor environments. Machine Vision and Applications, pages
1–15, 2012. 10.1007/s00138-012-0431-7 http://dx.doi.org/10.1007/
s00138-012-0431-7.

[84] R. Velázquez. Wearable assistive devices for the blind. Wearable and
Autonomous Biomedical Devices and Systems for Smart Environment,
pages 331–349, 2010.

[85] S. Verstockt, S. Van Hoecke, T. Beji, B. Merci, B. Gouverneur, A. E.


Cetin, P. De Potter, and R. Van de Walle. A multi-modal video analysis
approach for car park fire detection. Fire Safety Journal, 2012.

[86] J. D. Weiland and M. S. Humayun. Visual prosthesis. Proceedings of


the IEEE, 96(7):1076 –1084, 2008.

[87] S. Willis and S. Helal. Rfid information grid for blind navigation and
wayfinding. In Wearable Computers, 2005. Proceedings. Ninth IEEE
International Symposium on, pages 34 – 37, 2005.

[88] World Heath Organization. Visual impairment and blindness, October


2011. http://www.who.int/mediacentre/factsheets/fs282/en/.

[89] J. Wyatt and J. Rizzo. Ocular implants for the blind. Spectrum, IEEE,
33(5):47 –53, May 1996.

[90] K. Yelamarthi, D. Haas, D. Nielsen, and S. Mothersell. Rfid and gps


integrated navigation system for the visually impaired. In Circuits and
Systems (MWSCAS), 2010 53rd IEEE International Midwest Sympo-
sium on, pages 1149 –1152, 2010.

[91] L. Yin and A. Basu. Nose shape estimation and tracking for model-
based coding. Acoustics, Speech, and Signal Processing, 2001. Proceed-
ings.(ICASSP’01). 2001 IEEE International Conference on, 3, 2001.

[92] Z. Zhu and Q. Ji. Real time 3d face pose tracking from an uncalibrated
camera. In First IEEE Workshop on Face Processing in Video, in con-
junction with IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR’04), Washington DC, June 2004.
