Head-Controlled Perception Via Electro-Neural Stimulation
Recommended Citation
Meers, Simon, Head-controlled perception via electro-neural stimulation, Doctor of Philosophy thesis, School of Computer Science
and Software Engineering, University of Wollongong, 2012. http://ro.uow.edu.au/theses/3769
DOCTOR OF PHILOSOPHY
from
UNIVERSITY OF WOLLONGONG
by
Simon Meers
2012
Certification
I, Simon Meers, declare that this thesis, submitted in fulfilment of the re-
quirements for the award of Doctor of Philosophy, in the School of Computer
Science and Software Engineering, University of Wollongong, is wholly my
own work unless otherwise referenced or acknowledged. This document has
not been submitted for qualifications at any other academic institution.
Simon Meers
8th November 2012
ii
Publications
• Meers, S., and Ward, K., “Head-tracking haptic computer interface
for the blind”, Advances in Haptics, pp. 143-154, INTECH, ISBN:
9789533070933, January 2010
• Meers, S., and Ward, K., “Face recognition using a time-of-flight cam-
era”, Sixth International Conference on Computer Graphics, Imaging
and Visualisation, Tianjin, China, August 2009
• Meers, S., and Ward, K., “Head-Pose Tracking with a Time-of-Flight
Camera”, Australasian Conference on Robotics and Automation, Can-
berra, Australia, December 2008
• Meers, S., and Ward, K., “Substitute Three-Dimensional Perception
using Depth and Colour Sensors”, Australasian Conference on Robotics
and Automation, Brisbane, Australia, December 2007
• Meers, S., and Ward, K., “Haptic Gaze-Tracking Based Perception of
Graphical User Interfaces”, 11th International Conference Information
Visualisation, Zurich, Switzerland, July 2007
• Meers, S., Ward, K. and Piper, I., “Simple, Robust and Accurate Head-
Pose Tracking Using a Single Camera”, The Thirteenth IEEE Confer-
ence on Mechatronics and Machine Vision in Practice, Toowoomba,
Australia, December 2006 – Best paper award
• Meers, S. and Ward, K., “A Vision System for Providing the Blind with
3D Colour Perception of the Environment”, Asia-Pacific Workshop on
Visual Information Processing, Hong Kong, December 2005
• Meers, S. and Ward, K., “A Substitute Vision System for Providing
3D Perception and GPS Navigation via Electro-Tactile Stimulation”,
International Conference on Sensing Technology, Palmerston North,
New Zealand, November 2005 – Best paper award
• Meers, S. and Ward, K., “A Vision System for Providing 3D Perception
of the Environment via Transcutaneous Electro-Neural Stimulation”,
IV04 IEEE 8th International Conference on Information Visualisa-
tion, London, UK, July 2004
iii
Abstract
This thesis explores the use of head-mounted sensors combined with haptic
feedback for providing effective and intuitive perception of the surround-
ing environment for the visually impaired. Additionally, this interaction
paradigm is extended to providing haptic perception of graphical computer
interfaces. To achieve this, accurate sensing of the head itself is required for
tracking the user’s “gaze” position instead of sensing the environment. Trans-
cutaneous electro-neural feedback is utilised as a substitute for the retina’s
neural input, and is shown to provide a rich and versatile communication
interface without encumbering the user’s auditory perception.
Systems are presented for:
• facilitating obstacle avoidance and localisation via electro-neural stimu-
lation of intensity proportional to distance (obtained via head-mounted
stereo cameras or infrared range sensors);
• encoding of colour information (obtained via a head-mounted video
camera or dedicated colour sensor) for landmark identification using
electro-neural frequency;
• navigation using GPS data by encoding landmark identifiers into short
pulse patterns and mapping fingers/sensors to bearing regions (aided
by a head-mounted digital compass);
• tracking human head-pose using a single video camera with accuracy
within 0.5◦ ;
• utilising time-of-flight sensing technology for head-pose tracking and
facial recognition;
• non-visual manipulation of a typical software “desktop” graphical user
interface using point-and-click and drag-and-drop interactions; and
• haptic perception of the spatial layout of pages on the World Wide Web,
contrasting output via electro-neural stimulation and Braille displays.
iv
Acknowledgements
This document only exists thanks to:
• the gracious guidance and motivation of Prof. John Fulcher and Dr.
Ian Piper in compiling, refining and editing over many iterations;
• the support and encouragement of my wife Gillian, and our three chil-
dren Maya, Cadan and Saxon, who were born during this research;
v
Contents
1 Introduction 1
2 Literature Review 6
3 Perceiving Depth 13
3.1 Electro-Neural Vision System . . . . . . . . . . . . . . . . . . 13
3.1.1 Extracting depth data from stereo video . . . . . . . . 19
3.1.2 Range sampling . . . . . . . . . . . . . . . . . . . . . . 22
3.1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Using infrared sensors . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Wireless ENVS . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Obstacle avoidance . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Localisation . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Perceiving Colour 35
4.1 Representing colours with electro-neural signal frequencies . . 36
4.1.1 Mapping the colour spectrum to frequency . . . . . . . 36
4.1.2 Using a lookup-table for familiar colours . . . . . . . . 37
4.2 Colour perception experimentation . . . . . . . . . . . . . . . 38
4.2.1 Navigating corridors . . . . . . . . . . . . . . . . . . . 38
4.2.2 Navigating the laboratory . . . . . . . . . . . . . . . . 40
4.3 Colour sensing technology . . . . . . . . . . . . . . . . . . . . 42
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
vi
5 High-Level Navigation 45
5.1 Interpreting landmarks via TENS . . . . . . . . . . . . . . . . 47
5.2 Landmark bearing protocols . . . . . . . . . . . . . . . . . . . 49
5.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Navigating the car park . . . . . . . . . . . . . . . . . 52
5.3.2 Navigating the University campus . . . . . . . . . . . . 54
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7 Head-Pose Tracking 72
7.1 Head-pose tracking using a single camera . . . . . . . . . . . . 74
7.1.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.1.2 Processing . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1.3 Experimental results . . . . . . . . . . . . . . . . . . . 89
7.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Time-of-flight camera technology . . . . . . . . . . . . . . . . 93
7.2.1 The SwissRanger . . . . . . . . . . . . . . . . . . . . . 93
7.2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.4 Nose tracking . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.5 Finding orientation . . . . . . . . . . . . . . . . . . . . 107
7.2.6 Building a mesh . . . . . . . . . . . . . . . . . . . . . . 112
7.2.7 Facial recognition . . . . . . . . . . . . . . . . . . . . . 113
7.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 Conclusions 124
8.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.2 Suggestions for further research . . . . . . . . . . . . . . . . . 128
8.2.1 Research already commenced . . . . . . . . . . . . . . 128
8.2.2 Other areas . . . . . . . . . . . . . . . . . . . . . . . . 129
8.3 Closing remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 130
vii
List of Figures
viii
6.3 Grid cell transit stabilisation . . . . . . . . . . . . . . . . . . . 61
6.4 Example of collapsing a web page for faster perception . . . . 63
6.5 Wired TENS system . . . . . . . . . . . . . . . . . . . . . . . 65
6.6 Vibro-tactile keyboard design . . . . . . . . . . . . . . . . . . 66
6.7 Papenmeier BRAILLEX EL 40s refreshable Braille display . . 67
6.8 Example glyphs . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.9 Braille text displaying details of central element . . . . . . . . 68
ix
x
1
Introduction
1
“screen readers” (e.g. [19, 25]) are able to provide limited assistance, they do
not enable perception of the layout of the screen content or effective inter-
action with graphical user interfaces. The range of applications that blind
people are able to use effectively is therefore quite limited.
Rapid advances in the fields of sensor technology, data processing and
haptic feedback are continually creating new possibilities for utilising non-
visual senses to compensate for having little or no visual perception. How-
ever, determining the best methods for providing substitute visual perception
to the visually impaired remains a challenge. Chapter 2 provides a review of
existing systems and evaluates their benefits and limitations.
This research aims to utilise these technological advances to:
2
also possible to represent the colour of perceived objects by modulating the
frequency of the stimulation, as described in Chapter 4. Furthermore, Chap-
ter 5 shows how locations, based on Global Positioning System (GPS) data,
can also be encoded into the electrical stimulation to facilitate navigation.
To enable a blind computer user to perceive the computer screen, as stated
in Item 2 above, the user’s “gaze” point on a virtual screen is estimated by
tracking the head position and orientation. Electro-neural stimulation of the
fingers is then used to indicate to the user what is located at the gaze po-
sition on the screen. In this case, the frequency of the pulses delivered to
the fingers indicates the type of screen object at the gaze position and the
intensity of the pulses indicates the screen object’s “depth” (e.g. if an object
is in a foreground or background window). Chapter 6 provides details of the
interface and experimental results. Chapter 7 outlines the novel head-pose
sensor research undertaken to provide tracking suitable for this application.
Chapter 6 also provides details of how a Braille display or vibro-tactile inter-
face can be used to replace the electro-neural stimulation and provide more
convenience to the user.
Although the above proposed perception systems use different sensory
devices for performing different perception tasks, both are based on the same
principle: namely, utilising the head to control perception and the skin to
interpret what is being perceived. As most blind people retain the use of
their head for directing their sense of hearing, they can be expected to also
be capable of using their head for directing substitute visual perception. Use
of haptic feedback provides rich and versatile communication while leaving
the user’s auditory senses unencumbered.
3
The main discoveries and developments of the research are summarised here:
• Mapping range data to the fingers (or other areas of the body) via
electro-neural pulses where the intensity is proportional to the distance
to objects.
4
only sighted (blindfolded) laboratory staff (author and supervisor). Testing
on blind subjects is outlined in Section 8.2 as future research. Based on
the assumption that blind individuals have more neural activity allocated
to tactile sensing, it is anticipated that they may in fact acquire greater
perception skills using the proposed devices than blindfolded sighted users.
Since the first Electro-Neural Vision System (ENVS) research was pub-
lished in 2004 [58] considerable research interest has been shown in the work
(e.g. [51, 12, 49, 81, 32, 4, 22, 65, 10, 68, 11, 79, 16, 35, 42, 47, 1, 27,
14, 46, 66, 3, 85, 71]). Section 8.2.1 highlights the pioneering nature of this
research and provides examples of similar work built upon it.
5
6
2
Literature Review
7
motion [86]. Some researchers have suggested that the limited resolution of
such devices would be better utilised in delivering a depth map [48]. Whilst
the effectiveness of prosthetic devices will likely increase in the near future,
the cost and surgery involved render them inaccessible to most. Also, some
forms of blindness, such as brain or optic nerve damage, may be unsuitable
for implants.
A number of wearable devices have been developed for providing the
blind with some means of sensing or visualising the environment (see [84]
for a survey). For example, Meijer’s vOICe [60] compresses a camera image
into a coarse 2D array of greyscale values and delivers this information to the
ears as a sequence of sounds with varying frequency. However, it is difficult to
mentally reconstruct the sounds into a three-dimensional (3D) representation
of the environment, which is needed for obstacle avoidance and navigation.
Sonar mobility aids for the blind have been developed by Kay [41]. Kay’s
system delivers frequency modulated sounds, using pitch to represent dis-
tance and timbre to indicate surface features. However, to an inexperienced
user, these combined sounds can be confusing and difficult to interpret. Also,
the sonar beam from these systems is specular and can be reflected off many
surfaces or absorbed, resulting in uncertain perception. Nonetheless, Kay’s
sonar blind aids can help to identify landmarks by resolving some object
features, and can facilitate a degree of object classification for experienced
users.
A major disadvantage of auditory substitute vision systems is that they
can diminish a blind person’s capacity to hear sounds in the environment
(e.g. voices, traffic, footsteps, etc.). Consequently, these devices are not
8
widely used in public places because they can reduce a blind person’s auditory
perception of the environment and could potentially cause harm or injury if
impending danger is not detected by the ears.
Computer vision technology such as object recognition and optical char-
acter recognition has also been utilised in recent navigational aid research [83,
21, 76].
The use of infrared distance sensors for detecting objects has been pro-
posed [56] but little explored in practice. Laser range scanners can provide
a high level of accuracy, but have limited portability [22].
Recently released consumer electronics are equipped with complex sensing
technology, creating new opportunities for the development of devices for
perception assistance [9, 53].
Electro-tactile displays for interpreting the shape of images on a com-
puter screen with the fingers, tongue or abdomen have been developed by
Kaczmarek et al. [36]. These displays work by mapping black and white pix-
els to a matrix of closely-spaced pulsed electrodes that can be felt by the
fingers. These electro-tactile displays can give a blind user the capacity to
recognise the shape of certain objects, such as black alphabetic characters on
a white background. More recently, experiments have been conducted using
electro-tactile stimulation of the roof of the mouth to provide directional cues
[82].
Researchers continue to investigate novel methods of haptic feedback.
Recently the use of “skin-stretch tactors” to provide directional cues for
mobile navigation was proposed by Provancher et al. [31, 44]. Mann et al.
use a head-mounted array of vibro-tactile actuators [53]. Samra et al. have
9
experimented with varying the speed and direction of rotating brushes for
texture synthesis within a virtual environment [73].
In addition to sensing the surrounding environment, it is of considerable
benefit if a perception aid can provide the position of the user or nearby land-
marks. Currently, a number of GPS devices are available for the blind that
provide the position of the user or specific locations using voice or Braille
interfaces (e.g. [28, 39]). However, as with audio substitute vision systems,
voice interfaces occupy the sense of hearing which can diminish a blind per-
son’s capacity to hear important environmental sounds. Braille interfaces
avoid this problem, but interacting via typing and reading in Braille is slower
and requires higher cognitive effort.
Others have explored the use of alternative location tracking technology
such as infrared [15] or Wi-Fi beacons [72] or RFID tags [8, 77, 87], which
can prove advantageous for navigating indoor environments. Yelamarthi et
al. have combined both GPS and RFID information in producing robots
designed to act as replacements for guide dogs [90].
In addition to inhibiting the navigation of physical environments, visual
impairment also restricts interaction with virtual environments such as soft-
ware interfaces. A number of “assistive technology” avenues have been pur-
sued to enable non-visual interaction with computers [78].
Use of “screen-reading” software [19, 25] is the predominant way in which
visually impaired users currently interact with computers. This software typ-
ically requires the user to navigate the computer interface in a linearised fashion
using memorised keystrokes (numbering in the hundreds 1 ). The currently
“focused” piece of information is conveyed to the user via speech synthesis
or a refreshable Braille display. A study of 100 users of screen-reading soft-
ware determined that the primary frustration encountered when using this
technology was “page layout causing confusing screen reader feedback” [45].
Tactile devices for enabling blind users to perceive graphics or images on
the computer have been under development for some time. For example, the
haptic mouse (e.g. [33, 75, 6, 24]) can produce characteristic vibrations when
the mouse cursor is moved over screen icons, window controls and application
windows. This can indicate to a blind user what is currently beneath the
mouse pointer, but is of limited value when the locations of the objects and
pointer on the screen are not easily ascertained.
Force feedback devices, like the PHANToM [20], and tactile (or electro-
tactile) displays, e.g. [34, 36, 40, 55], can enable three-dimensional graphical
models or two-dimensional black-and-white images to be visualised by using
the sense of touch (see [7] for a survey). However, little success has been
demonstrated with these devices toward enabling blind users to interact with
typical GUIs beyond simple memory and visualisation experiments like “the
memory house” [80], which involves discovering buttons on different planes
via force feedback and remembering the buttons that play the same sounds
when found.
1 http://www.freedomscientific.com/doccenter/archives/training/JAWSKeystrokes.htm

Upon reviewing the progress in the research field to date, it was determined
that the following areas have been explored little (if at all) despite
holding significant potential for assisting the visually impaired using cur-
rently available technology:
It was further noted that all four of these areas could utilise the head
as a natural and intuitive pan-and-tilt controller for scanning the region of
interest, and electro-neural stimulation as a versatile haptic feedback channel.
The following chapters outline new research and experimentation in these
areas.
12
3
Perceiving Depth
The structure of the first ENVS prototype is illustrated in Figure 3.1. The
prototype comprises:
13
ing the output from the computer into appropriate electrical pulses that
can be felt via the skin, and
• special gloves fitted with electrodes for delivering the electrical pulses
to the fingers.
14
Figure 3.2: Photograph of the first ENVS prototype
15
Figure 3.3: Internal ENVS TENS hardware
16
Figure 3.4: TENS output waveform
Adjusting the signal intensity by varying the pulse width was found
preferable to varying the pulse amplitude for two reasons. Firstly, it en-
abled the overall intensity of the electro-neural stimulation to be easily set to
a comfortable level by presetting the pulse amplitude. It also simplified the
TENS hardware considerably by not requiring digital-to-analogue converters
or analogue output drivers on the output circuits.
To enable the stereo disparity algorithm parameters and the TENS output
waveform to be altered for experimental purposes, the ENVS is equipped with
the control panel shown in Figure 3.5. This was also designed to monitor the
image data coming from the cameras and the signals being delivered to the
fingers via the TENS unit. The ENVS software was built upon the “SVS”
17
Figure 3.5: The ENVS control panel
software [43] provided with the stereo camera hardware (see Section 3.1.1).
Figure 3.5 shows a typical screenshot of the ENVS control panel while in
operation. The top-left image shows a typical environment image obtained
from one of the cameras in the stereo camera headset. The corresponding
disparity depth map, derived from both cameras, can be seen in the top-right
image (in which lighter pixels represent a closer range than darker pixels).
Also, the ten disparity map sample regions, used to obtain the ten range
18
readings delivered to the fingers, can be seen spread horizontally across the
centre of the disparity map image. These regions are also adjustable via the
control panel for experimentation.
To calculate the amount of stimulation delivered to each finger, the min-
imum depth of each of the ten sample regions is taken. The bar graph, at
the bottom-left of Figure 3.5, shows the actual amount of stimulation de-
livered to each finger. Using a 450MHz Pentium 3 computer, a frame rate
of 15 frames per second was achieved, which proved more than adequate for
experimental purposes.
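As an illustration of the region sampling and intensity mapping described above, the following sketch (in Python) reduces a depth map to one pulse width per finger; the band geometry, range limits and pulse-width values are placeholder assumptions rather than the parameters of the ENVS prototype.

    import numpy as np

    def finger_pulse_widths(depth_map, n_fingers=10, band_height=40,
                            min_range=0.3, max_range=5.0,
                            min_pulse_us=10.0, max_pulse_us=300.0):
        """Reduce a depth map (in metres) to one TENS pulse width per finger.

        A horizontal band across the centre of the map is split into n_fingers
        regions; the minimum depth in each region sets the stimulation intensity
        (closer objects produce wider, i.e. stronger, pulses). All numeric limits
        here are illustrative placeholders.
        """
        h, w = depth_map.shape
        top = (h - band_height) // 2
        band = depth_map[top:top + band_height, :]

        widths = []
        for i in range(n_fingers):
            region = band[:, i * w // n_fingers:(i + 1) * w // n_fingers]
            nearest = float(np.min(region))                    # closest point in this region
            nearest = min(max(nearest, min_range), max_range)  # clamp to the usable range
            proximity = (max_range - nearest) / (max_range - min_range)
            widths.append(min_pulse_us + proximity * (max_pulse_us - min_pulse_us))
        return widths

    if __name__ == "__main__":
        scene = np.full((480, 640), 4.0)     # a mostly distant scene
        scene[200:280, 300:360] = 1.0        # one nearby obstacle near the centre of view
        print([round(w) for w in finger_pulse_widths(scene)])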
The ENVS works by using the principle of stereo disparity. Just as human
eyes capture two slightly different images and the brain combines them to
provide a sense of depth, the stereo cameras in the ENVS capture two images
and the laptop computer computes a depth map by estimating the disparity
between the two images. However, unlike binocular vision in humans and
animals, which have independently moveable eyeballs, typical stereo vision
systems use parallel mounted video cameras positioned at a set distance from
each other.
19
The stereo camera head
1 http://users.rcn.com/mclaughl.dnai/
20
Calculating disparity
The process of calculating a depth map from a pair of images using parallel
mounted stereo cameras is well known [54]. Given the baseline distance
between the two cameras and their focal lengths (shown in Figure 3.7), the
coordinates of corresponding pixels in the two images can be used to derive
the distance to the object from the cameras at that point in the images.
The distance is given by D = bf /d, where b is the baseline distance between the cameras, f is the focal length, D is the distance
to the subject and d is the disparity (xl − xr). To compute a complete depth
map of the observed image in real time is computationally expensive because
the detection of corresponding features and calculating their disparity has to
be done at the frame rate for every pixel on each frame.
A number of methods for reducing the dense stereo map to ten scalar values
were trialled in the experimentation. These included:
• sampling regions spanning the full height of the camera viewport for
maximising the perception window;
• sampling a narrow band of regions for maximal focus and acuity when
building a mental map using head/sensor movement;
• sampling the average distance within each region (high stability, but
small obstacles can be easily missed);
22
• maintaining exponential moving averages for mitigating data space lim-
itations, and allowing longer and more accurate windowing (a sketch of
this smoothing follows the list).
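By way of illustration, such smoothing could be realised as a simple per-region exponential moving average; the smoothing factor below is an assumed value rather than one used in the experiments.

    class RegionSmoother:
        """Exponential moving average over the ten per-region range samples."""

        def __init__(self, n_regions=10, alpha=0.3):
            self.alpha = alpha                 # larger alpha: faster response, less smoothing
            self.state = [None] * n_regions

        def update(self, samples):
            for i, value in enumerate(samples):
                if self.state[i] is None:
                    self.state[i] = value      # seed each region with its first reading
                else:
                    self.state[i] += self.alpha * (value - self.state[i])
            return list(self.state)

    smoother = RegionSmoother()
    print(smoother.update([4.0] * 10))
    print(smoother.update([4.0] * 9 + [1.0]))  # a sudden close reading is damped, not lost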
3.1.3 Limitations
Calibration
Featureless surfaces
23
Figure 3.8: Disparity map of a featureless surface, displayed via the custom-
built ENVS software interface
The amount of power required to operate the stereo camera unit and vision
processing software was found to be too high for an application
designed for extended usage, during which the user would need to carry a
battery of considerable size.
24
3.2 Using infrared sensors
2 http://www.sharp-world.com/products/device
25
Figure 3.9: The prototype Infrared-ENVS
Whilst the crude gloves built for the initial prototype proved effective for ex-
perimentation, they are less than ideal for real-world usage. Having to wear
gloves limits the use of the hands whilst hooked up to the system. Further-
more there is no reason for the haptic feedback to be limited to the fingers –
many other nerves in the body (e.g. arms or torso) may prove more conve-
nient or effective, perhaps varying from user to user. Some users may prefer
a less conspicuous location for the electrodes. The wires on the prototype sys-
tem also present potential risk of snagging and can restrict movement. These
issues prompted some initial research into alternative TENS hardware.
26
Figure 3.10: Wireless TENS patch
27
3.4 Experimental results
The objective of the first experiment was to determine whether the user could identify
and negotiate obstacles while moving around in the cluttered laboratory en-
vironment. It was found that after five minutes of use within the unknown
environment, users could estimate the direction and range of obstacles lo-
cated in the environment, with sufficient accuracy to enable approaching
objects and then walking around them by interpreting the range data de-
livered to the fingers via the ENVS. As the environment contained many
different sized obstacles, it was also necessary for users to regularly scan the
immediate environment (mostly with up and down head movements) to
ensure all objects were detected regardless of their size. After ten minutes
moving around in the environment, while avoiding obstacles, users were also
able to identify features like the open doorway shown in Figure 3.11a, and even
walk through the doorway by observing this region of the environment with
the stereo cameras. Figure 3.11 shows a photo of a user and a screenshot of
the ENVS control panel at one instant while the user was performing this
task. The 3D profile of the doorway can be plainly seen in the depth map
28
shown at the top-right of Figure 3.11b. Also, the corresponding intensity of
the TENS pulses felt by each finger can be seen on the bar graphs shown at
the bottom-left corner of Figure 3.11b.
Ten range readings delivered to the fingers in this manner may not seem
like much environmental information. The real power of the ENVS lies in
the user’s ability to easily interpret the ten range readings, and, by fusing
this information over time, produce a mental 3D model of the environment.
Remembering the locations of obstacles was found to increase with contin-
ued use of the ENVS, eliminating much of the need to regularly scan the
environment comprehensively. With practice, users could also interpret the
range data without any need to hold the hands in front of the abdomen.
29
(a) Photograph
30
3.4.2 Localisation
31
(a) Photograph
(b) Screenshot
32
The inability of stereo cameras to resolve the depth of featureless surfaces
was not a problem within the cluttered laboratory environment because there
were sufficient edges and features on the lab’s objects and walls for the dis-
parity to be resolved from the stereo video images. In fact, not resolving
the depth of the floor benefited the experiments to some extent by enabling
objects located on the floor to be more clearly identifiable, as can be seen in
Figure 3.12b.
3.5 Conclusions
33
found to be effective and intuitively interpreted.
Wireless hardware was developed to overcome the limitations and incon-
veniences of the crude wired prototypes, and expanded the future potential
of the system in regards to sensor placement and resolution.
The prototype system was successfully tested by blindfolded (sighted)
users, proving its potential for facilitating spatial awareness, detection of
stationary and moving objects, obstacle avoidance and localisation.
34
4
Perceiving Colour
Although environmental range readings can enable blind users to avoid ob-
stacles and recognise their relative position by perceiving the profile of the
surrounding environment, a considerable improvement in localisation, navi-
gation and object recognition can be achieved by incorporating colour percep-
tion into the ENVS. Colour perception is important because it can facilitate
the recognition of significant objects which can also serve as landmarks when
navigating the environment.
35
4.1 Representing colours with electro-neural
signal frequencies
To encode colour perception into the ENVS, the frequency of the electrical
signals delivered to the fingers was adjusted according to the corresponding
colour sample. Two methods of achieving colour perception were investi-
gated. The first was to represent the continuous colour spectrum with the
entire available frequency bandwidth of the ENVS signals. Thus, red objects
detected by the ENVS would be represented with low frequency signals, vio-
let colours would be represented with high frequency signals and any colours
between these limits would be represented with a corresponding proportion-
ate frequency. The second method was to only represent significant colours
with specific allocated frequencies via a lookup table.
It was found that the most useful frequencies for delivering depth and colour
information to the user via transcutaneous electro-neural stimulation were
between 10 and 120Hz. Frequencies above this range tended to result
in nerves becoming insensitive by adapting to the stimulation. Frequencies
below this range tended to be too slow to respond to changed input. Con-
sequently, mapping the entire colour spectrum to the frequency bandwidth
available to the ENVS signals proved infeasible due to the limited band-
width available. Furthermore, ENVS experiments involving detection and
delivery of all colours within a specific environment via frequencies proved
ineffective for accurate interpretation of the range and colour information.
36
Rapid changes in frequency often made the intensity difficult to determine
accurately, and vice versa.
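The lookup-table approach can be sketched as follows: each colour sample is compared against a small table of stored significant colours and, when one matches within a tolerance, its allocated frequency is selected. The colours, frequencies and tolerance below are hypothetical examples chosen to stay within the usable band discussed above.

    import math

    # Hypothetical table of significant colours (RGB) and their allocated TENS
    # frequencies (Hz), kept within the usable 10-120 Hz band discussed above.
    COLOUR_TABLE = {
        "blue door":        ((40, 60, 160), 30),
        "red extinguisher": ((180, 40, 40), 60),
        "grey cabinet":     ((120, 120, 120), 90),
    }

    def colour_to_frequency(rgb, tolerance=60):
        """Return (label, frequency) of the nearest stored colour, or None if no match."""
        best = None
        for label, (reference, freq) in COLOUR_TABLE.items():
            distance = math.dist(rgb, reference)        # Euclidean distance in RGB space
            if distance <= tolerance and (best is None or distance < best[0]):
                best = (distance, label, freq)
        return None if best is None else (best[1], best[2])

    print(colour_to_frequency((50, 70, 150)))   # close to the stored 'blue door' colour
    print(colour_to_frequency((10, 200, 10)))   # unfamiliar colour -> None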
37
4.2 Colour perception experimentation
The first experiment was to determine if the ENVS users could navigate
the corridor and find the correct door leading into the laboratory from a
location in another wing of the building. Prior to conducting the trials,
users were familiarised with the entrance to the laboratory and had practised
negotiating the corridor using the ENVS.
The lab entrance was characterised by having a blue door with a red fire
extinguisher mounted on the wall to the right of the door. To the left of the
door was a grey cabinet. The colours of the door, fire extinguisher and cabinet
were stored in the ENVS as significant colours and given distinguishable
frequencies. Figure 4.2 shows a photo of a user observing the entrance of
the laboratory with the ENVS and a corresponding screenshot of the ENVS
control panel. The level of stimulation delivered to the fingers can be seen on
the bars at the bottom-left of Figure 4.2b. As close range readings stimulate
38
(a) Photograph
(b) Screenshot
39
the fingers more than distant range readings, the finger stimulation levels felt
by the user at this instant indicate that a wall is being observed on the right.
Furthermore, the three significant colours of the door, fire extinguisher
and cabinet can also be seen in the range bars in Figure 4.2b. This indicates
that the ENVS has detected these familiar colours and is indicating this to the
user by stimulating the left middle finger, left pointer finger, both thumbs and
the right ring finger with frequencies corresponding to the detected familiar
colours.
After being familiarised with the laboratory entrance, the ENVS users
were led to a location in another wing of the building and asked to find their
way back to the lab by using only the ENVS. Users could competently ne-
gotiate the corridor, locate the laboratory entrance and enter the laboratory
unassisted and without difficulty, demonstrating the potential of this form of
colour and depth perception as an aid for the visually disabled.
Experiments were also performed within the laboratory (see Figure 4.3) to
determine if the ENVS users could recognise their location and navigate the
laboratory to the doorway without using the eyes or other blind aids. The
colours of a number of objects were encoded into the ENVS as significant
colours. These included the blue door and a red barrier stand located near
the door, as can be seen in Figure 4.3.
Before being given any navigational tasks in the laboratory, each user was
given approximately three minutes with the ENVS to become familiarised
40
(a) Photograph
(b) Screenshot
41
with the doorway vicinity and other objects that possessed familiar colours
that were stored in the ENVS. To ensure that the starting location and
direction was unknown to the ENVS users, each user was rotated a number of
times on a swivel chair and moved to an undisclosed location in the laboratory
immediately after being fitted with the ENVS. Furthermore, the doorway
happens to be concealed from view from most positions in the laboratory
by partitions. Consequently, users had the added task of first locating their
position using other familiar coloured landmarks and the perceived profile of
the laboratory before deciding on which direction to head toward the door.
It was found that users were generally able to quickly determine their
location within the laboratory based on the perceived profile of the environ-
ment and the location of familiar objects. Subsequently, users were able to
approach the door, identify it by its familiar colour (as well as the barrier
stand near the door) and proceed to the door. Figure 4.3 shows a photo of
a user observing the laboratory door with the ENVS and a corresponding
screenshot of the ENVS control panel. The level of stimulation delivered to
the fingers and the detected familiar colours can be seen on the display at
the bottom-left of Figure 4.3b.
When experimenting with infrared sensors rather than stereo cameras, colour
information was obtained using a miniature CMOS camera, like the one
shown in the centre of the infrared headset in Figure 3.9, and colour sensors like
42
the TAOS TCS230 1 and the Hamamatsu S9706 2 . These technologies pro-
vided equivalent perception of colour information to the stereo video camera
input, but with a far smaller power consumption and processing footprint.
The TAOS TCS230 sensor was found to perform poorly for this appli-
cation under fluorescent lighting conditions. This sensor samples the three
colour channels in sequence, and receives inconsistent exposure to each chan-
nel due to motion or the strobe effect of fluorescent lights.
Experiments with the Hamamatsu S9706 sensor proved it to be suitable
for the ENVS if fitted with an appropriate lens to maximise the amount of
light captured. This sensor samples red, green and blue light levels simulta-
neously, so does not suffer from the fluorescent strobing or fast motion issues
discovered with the TAOS sensor.
The hardware prototype uses custom colour matching algorithms encoded
on a PIC microprocessor 3 , and experimental results have shown that this
sensor arrangement is capable of performing as well as stereo cameras for
this application.
1 http://www.taosinc.com/product_detail.asp?cateid=11&proid=12
2 http://www.sales.hamamatsu.com/en/products/solid-state-division/color_sensors/part-s9706.php
3 http://www.microchip.com
43
4.4 Conclusions
ENVS depth and colour detection experiments were conducted with stereo
cameras and infrared range sensors combined with various colour sensors.
Colour sensors such as the TAOS TCS230 and the Hamamatsu S9706, com-
bined with a suitable focusing lens, were found to be comparable in perfor-
mance to stereo cameras at detecting colours at distance, with less power
and processing overheads. The TAOS TCS230 sensor was found to be less
effective under fluorescent light and with moving objects.
TENS frequencies in the range of 10–120Hz were found to be effective
in resolving and communicating both the colour and range of objects to the
user to a limited extent. Attempts to resolve the colour of all objects in the
environment proved ineffective due to bandwidth limitations of the TENS
signal and confusion that can occur when both the frequency and intensity
of the TENS signal varies too often.
Encoding a limited number of colours into the ENVS with a frequency
lookup table for detecting significant coloured objects was found to be effec-
tive for both helping to identify familiar objects and facilitating navigation
by establishing the location of known landmarks in the environment.
44
5
High-Level Navigation
Whilst perception of range and colour data within the immediate environ-
ment can help users to avoid collisions and identify landmarks, it does not
address the entirety of a blind person’s navigational difficulties. The visible
environment beyond the short distance detectable by range sensing hardware
should also be considered, as well as utilisation of data available from tech-
nology that human senses cannot naturally detect. For example, additional
information from GPS and compass technology could be used to facilitate
navigation in the broader environment by determining the user’s location in
relation to known landmarks and the destination.
45
To enable landmarks to be perceived by blind users, the ENVS is equipped
with a GPS unit, a digital compass and a database of landmarks. The
digital compass (Geosensory: RDCM-802 1 ) is mounted on the stereo camera
headset, as shown in Figure 5.1, and is used to determine if the user is looking
in the direction of any landmarks.
1 http://www.geosensory.com/rdcm-802.htm
46
ample a bus stop or the location of a parked vehicle. By using the GPS unit
to obtain the user’s location, the ENVS can maintain a list of direction vec-
tors to landmarks that are within a set radius from the user. The landmark
radius can be set to short or long range (e.g. 200–600m) by the user via a
switch on the ENVS unit.
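A minimal sketch of how such a list of direction vectors might be maintained from GPS fixes is given below, using a flat-earth approximation for distance and bearing; the coordinates, landmark names and radius are made-up examples rather than data from the experiments.

    import math

    def distance_and_bearing(lat1, lon1, lat2, lon2):
        """Approximate distance (m) and bearing (degrees clockwise from north)
        between two GPS fixes; the flat-earth approximation is adequate over a
        few hundred metres."""
        r_earth = 6371000.0
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
        distance = r_earth * math.hypot(dlat, dlon)
        bearing = math.degrees(math.atan2(dlon, dlat)) % 360
        return distance, bearing

    def landmarks_in_range(user_fix, landmarks, radius_m=600):
        """Return (name, distance, bearing) for stored landmarks within the set radius."""
        visible = []
        for name, fix in landmarks.items():
            d, b = distance_and_bearing(*user_fix, *fix)
            if d <= radius_m:
                visible.append((name, d, b))
        return visible

    # Hypothetical landmark database and user position
    landmarks = {"bus stop": (-34.4056, 150.8784), "car": (-34.4070, 150.8790)}
    print(landmarks_in_range((-34.4060, 150.8780), landmarks))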
47
The distance to the landmark is indicated by the intensity of the pulses.
Weak sensations indicate that the landmark is near to the set maximum
landmark radius. Strong sensations indicate that the landmark is only metres
away from the user. If the user has difficulty recognising landmarks by their
pulse sequence, a button is available on the ENVS unit to output the name,
distance and direction of the perceived landmark as speech.
The RDCM-802 digital compass mounted on the ENVS headset has three-
bit output, providing an accuracy of 22.5◦ . This low resolution was not found
to be problematic as the user could obtain higher accuracy by observing the
point at which the bearing changed whilst moving their head. Also, the eight
compass point output mapped well to the fingers for providing the user with
a wide field of view for landmarks and proved effective for maintaining spatial
awareness of landmarks.
Communicating the landmark ID to the user via the frequency of the
landmark pulse (rather than the ‘dots and dashes’ protocol) was also tested
and found to allow faster interpretation of the ID. However, this was only
effective for a small number of landmarks due to the difficulty involved in
differentiating frequencies. It was also found that the delay between provid-
ing landmark information could be extended from five seconds to ten seconds
because landmarks tend to be more distant than objects, which allows spatial
awareness of landmarks to be maintained for longer periods than for nearby
objects.
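The encoding just described might be sketched as follows: a five-bit landmark identifier is turned into a short/long ("dots and dashes") pulse sequence, and the landmark's bearing relative to the compass heading selects the finger that receives it. The pulse durations and the particular finger layout are illustrative assumptions, and the 225° field corresponds to the 112.5°-either-side protocol described in the next section.

    def id_to_pulse_pattern(landmark_id, short_ms=80, long_ms=240):
        """Encode a 5-bit landmark ID (0-31) as a 'dots and dashes' pulse sequence."""
        assert 0 <= landmark_id < 32
        bits = format(landmark_id, "05b")
        return [long_ms if bit == "1" else short_ms for bit in bits]

    def finger_for_bearing(landmark_bearing, head_bearing, fingers=8, field_deg=360):
        """Map a landmark's bearing, relative to the compass heading, onto one of
        the fingers spread across the chosen perception field (e.g. 225 or 360 deg)."""
        relative = (landmark_bearing - head_bearing + 180) % 360 - 180   # -180..180, 0 = ahead
        half_field = field_deg / 2
        if abs(relative) > half_field:
            return None                                    # outside the perception field
        sector = (relative + half_field) / field_deg * fingers
        return min(int(sector), fingers - 1)               # 0 = leftmost finger

    print(id_to_pulse_pattern(19))                 # 10011 -> long, short, short, long, long
    print(finger_for_bearing(90, 45, fingers=8))   # a landmark 45 degrees right of gaze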
48
5.2 Landmark bearing protocols
49
Figure 5.3 illustrates one landmark perception protocol tested which al-
lows the user to perceive the direction of landmarks within 112.5◦ of the
direction they are facing. For example, landmarks within 22.5◦ of the centre
of vision are communicated via the two index fingers simultaneously. Land-
marks within the ‘peripheral vision’ of the user are communicated via the
appropriate pinkie finger.
It was also found that this perception field could effectively be extended
to 360◦ (see Figure 5.4), allowing the user to be constantly aware of land-
marks in their vicinity regardless of the direction they were facing. Both
these protocols were found effective, and users demonstrated no problems
interpreting the different fields of perception.
50
Figure 5.4: 360◦ GPS perception protocol
To test the ENVS, a number of experiments were conducted within the Uni-
versity campus grounds to determine the ability of users to navigate the
campus environment without any use of their eyes. The ENVS users were
familiar with the campus grounds and the landmarks stored in the ENVS and
51
had no visual impairments. The headset acted as a blindfold as in previous
experiments.
The first experiment was performed to determine if the ENVS users could
navigate a car park and arrive at a target vehicle location that was encoded
into the ENVS as a landmark. Each user was fitted with the ENVS and
led blindfolded to a location in the car park that was unknown to them and
asked to navigate to the target vehicle using only the ENVS electro-tactile
signals. The car park was occupied to approximately 75% of its full capacity
and also contained some obstacles such as grass strips and lampposts.
Users were able to perceive and describe their surroundings and the loca-
tion of the target vehicle in sufficient detail for them to be able to navigate
to the target vehicle without bumping into cars or lampposts. With practice,
users could also interpret the ENVS output without needing to extend their
hands to assist with visualisation, and could walk between closely spaced
vehicles without colliding with the vehicles.
Figure 5.5 shows a user observing two closely spaced vehicles in the car
park with the ENVS. The profile of the space between the vehicles can be
seen in the disparity map, shown in the top-right of Figure 5.5b, and in
the finger pulse bars shown at the lower-left. The highlighted bar at the
left forefinger position of the intensity display indicates that the target vehicle is
located slightly to the left of where the user is looking and at a distance of
approximately 120m.
52
(a) Photograph
(b) Screenshot showing the perceived vehicles and the target landmark’s
direction and distance.
53
5.3.2 Navigating the University campus
54
(a) Photograph
(b) Screenshot
Figure 5.6: ENVS user surveying a paved path in the campus environment
55
In some areas, users were unable to determine where the path ended and
the grass began, using the ENVS range stimulus alone. However, this did
not cause any collisions, and the users quickly became aware of the edge of
the path whenever their feet made contact with the grass. This problem
could be overcome by encoding colour into the range signals delivered to the
fingers by varying the frequency of the tactile signals.
5.4 Conclusions
To enable the user to perceive and navigate the broader environment, the
ENVS was fitted with a compass and GPS unit.
By encoding unique landmark identifiers with five-bit morse-code-style
binary signals and delivering this information to the fingers, it was shown
that the user was able to perceive the approximate bearing and distance of
up to 32 known landmarks.
It was also found that the region-to-finger mapping need not necessarily
correspond exactly with the ENVS range and colour data regions, and could
in fact be expanded to make full 360◦ perception of landmarks possible.
These experimental results demonstrate that by incorporating GPS and
compass information into the ENVS output, it may be possible for blind
users to negotiate the immediate environment and navigate to more distant
destinations without additional assistance.
56
6
Haptic Perception of
Graphical User Interfaces
57
A substantial advantage of virtual environments is the elimination of sen-
sor inaccuracies and limitations involved in determining the state of the envi-
ronment. This is because the software can have a complete map available at
all times. More difficult is accurately determining the portion of the environ-
ment toward which the user’s perception is directed during each moment in
time. Details of the custom systems developed to deliver head-pose tracking
suitable for this application can be found in Chapter 7. The interface itself
is described in the following sections.
The primary goal of the gaze-tracking 1 haptic interface is to maintain the
spatial layout of the interface so that the user can perceive and interact with
it in two-dimensions as it was intended, rather than enforcing linearisation
with the loss of spatial and format data, as is the case with screen readers.
In order to maintain spatial awareness, the user must be able to control
the “region of interest” and understand its location within the interface as
a whole. Given the advantages of keeping the hands free for typing and
perception, the use of the head as a pointing device was an obvious choice
– a natural and intuitive pan/tilt input device which is easy to control and
track for the user (unlike mouse devices).
The graphical user interface experimentation was performed using the
infrared-LED-based head-pose tracking system described in Section 7.1.
1 In this context, gaze-tracking is analogous to head-pose tracking.
58
6.1 The virtual screen
59
Figure 6.1). The level of the window being perceived (from frontmost window
to desktop-level) was mapped to the intensity of haptic feedback provided to
the corresponding finger, so that “depth” could be conveyed in a similar fash-
ion to the ENVS. The frequency of haptic feedback was used to convey the
type of element being perceived (file/folder/application/control/empty cell).
Figure 6.2 illustrates the mapping between adjacent grid cells and the user’s
fingers. The index fingers were used to perceive the element at the gaze point,
while adjacent fingers were optionally mapped to neighbouring elements to
provide a form of peripheral perception. This was found to enable the user
to quickly acquire a mental map of the desktop layout and content. By gaz-
ing momentarily at an individual element, the user could acquire additional
details such as the file name, control type, etc. via synthetic speech output
or Braille text on a refreshable display.
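A sketch of the gaze-to-grid mapping and the neighbouring-cell "peripheral" view might look like the following; the grid dimensions, virtual screen size and the particular finger assignment are assumptions for illustration rather than the prototype's actual parameters.

    def gaze_to_cell(gaze_x, gaze_y, screen_w=1.0, screen_h=0.75, cols=8, rows=6):
        """Convert a gaze intersection point on the virtual screen into a grid cell."""
        col = min(max(int(gaze_x / screen_w * cols), 0), cols - 1)
        row = min(max(int(gaze_y / screen_h * rows), 0), rows - 1)
        return col, row

    def fingers_view(cell, desktop, cols=8):
        """Map the gazed cell and its horizontal neighbours onto eight fingers.

        Both index fingers carry the central element; the adjacent fingers carry
        neighbouring cells as a crude 'peripheral' view (thumbs omitted here).
        """
        col, row = cell
        offsets = [-3, -2, -1, 0, 0, 1, 2, 3]      # left pinkie .. right pinkie
        view = []
        for off in offsets:
            c = col + off
            view.append(desktop.get((c, row)) if 0 <= c < cols else None)
        return view

    # Hypothetical desktop model: (col, row) -> (element type, window depth)
    desktop = {(3, 2): ("folder", 0), (4, 2): ("file", 0), (5, 2): ("application", 1)}
    cell = gaze_to_cell(0.55, 0.35)
    print(cell, fingers_view(cell, desktop))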
A problem discovered early in experimentation with this interface was
the confusion caused when the user’s gaze meandered back and forth across
60
cell boundaries, as shown in Figure 6.3. To overcome this problem, a subtle
auditory cue was provided when the gaze crossed boundaries to make the user
aware of the grid positioning, which also helped to distinguish contiguous
sections of homogeneous elements. In addition, a stabilisation algorithm was
implemented to minimise the number of incidental cell changes as shown in
Figure 6.3.
Figure 6.3: Gaze travel cell-visiting sequence unstabilised (left) and with
stabilisation applied (right)
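The stabilisation algorithm itself is not spelled out here; one plausible realisation, sketched below under that assumption, is simple hysteresis: the reported cell changes only once the gaze point has moved a chosen margin beyond the boundary of the currently reported cell.

    import math

    class CellStabiliser:
        """Hysteresis filter suppressing incidental cell changes from a noisy gaze point."""

        def __init__(self, cell_size=1.0, margin=0.2):
            self.cell = None
            self.size = cell_size
            self.margin = margin * cell_size   # how far past the boundary the gaze must move

        def update(self, x, y):
            candidate = (int(x // self.size), int(y // self.size))
            if self.cell is None:
                self.cell = candidate
            elif candidate != self.cell:
                # distance from the gaze point to the currently-reported cell's rectangle
                x0, y0 = self.cell[0] * self.size, self.cell[1] * self.size
                dx = max(x0 - x, 0.0, x - (x0 + self.size))
                dy = max(y0 - y, 0.0, y - (y0 + self.size))
                if math.hypot(dx, dy) > self.margin:
                    self.cell = candidate      # a decisive move: accept the new cell
            return self.cell

    stabiliser = CellStabiliser()
    for x in (0.9, 1.05, 0.95, 1.3):           # gaze meandering around the x = 1.0 boundary
        print(stabiliser.update(x, 0.5))       # only the last, decisive move changes the cell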
61
6.2.1 Zoomable web browser interface
With the ever-increasing popularity and use of the World Wide Web, a web-
browser interface is arguably more important to a blind user than a desktop
or file management system. Attempting to map web pages into grids similar
to the desktop interface proved difficult due to the more free-form nature
of interface layouts used. Small items such as radio buttons were forced to
occupy an entire cell, thus beginning to lose the spatial information critical
to the system’s purpose. The grid was therefore discarded altogether, and
the native borders of the HyperText Markup Language (HTML) elements
used instead.
Web pages can contain such a wealth of tightly-packed elements, however,
that it can take a long time to scan them all and find the item of interest.
To alleviate this problem, the system takes advantage of the natural
Document Object Model (DOM) element hierarchy inherent in HTML and
“collapses” appropriate container elements to reduce the complexity of the
page. For example, a page containing three bulleted lists, each containing
text and links, and two tables of data might easily contain hundreds of el-
ements. If, instead of rendering all of these individually, they are simply
collapsed into the three lists and two tables, the user can much more quickly
perceive the layout, and then opt to “zoom” into whichever list or table in-
terests them to perceive the contained elements (see Figures 6.4a and 6.4b
for another example).
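A much-simplified sketch of the collapsing idea is shown below, operating on a generic element tree rather than a live browser DOM; the container tags and the size threshold are assumptions made for the sketch.

    from dataclasses import dataclass, field

    @dataclass
    class Element:
        tag: str
        children: list = field(default_factory=list)

    CONTAINER_TAGS = {"ul", "ol", "table", "div"}   # containers eligible for collapsing

    def count_elements(node):
        return 1 + sum(count_elements(child) for child in node.children)

    def collapse(page, max_children=5):
        """Present large containers as single collapsed items the user can 'zoom'
        into, so the overall page layout can be perceived quickly."""
        view = []
        for child in page.children:
            if child.tag in CONTAINER_TAGS and count_elements(child) > max_children:
                view.append(f"[collapsed {child.tag}: {count_elements(child) - 1} items]")
            else:
                view.append(child.tag)
        return view

    # A toy page: a heading, a large bulleted list and a short paragraph
    page = Element("body", [
        Element("h1"),
        Element("ul", [Element("li") for _ in range(30)]),
        Element("p"),
    ])
    print(collapse(page))    # ['h1', '[collapsed ul: 30 items]', 'p']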
62
(a) Raw page
63
The experimental interface has been developed as an extension for the
Mozilla Firefox web browser 2 , and uses the BRLTTY 3 for Braille commu-
nication and Orca 4 for speech synthesis. It uses JavaScript to analyse the
page structure and coordinate gaze-interaction in real-time. Communication
with the Braille display (including input polling) is performed via a separate
Java application.
The wired and wireless TENS interfaces developed for mobile ENVS usage
were also able to be used for stationary perception of virtual environments.
This interface proved effective in experimentation and allowed the user’s
fingers to be free to use the keyboard (see Figure 6.5). However, being
physically connected to the TENS unit proved inconvenient for general use.
2 http://www.mozilla.com/firefox/
3 http://mielke.cc/brltty/
4 http://live.gnome.org/Orca
64
Figure 6.5: Wired TENS system
65
more comfortable with the idea of vibration feedback than electrical stimu-
lation.
Whilst this interface was found to be capable of delivering a wide range
of sensations, the range and differentiability of TENS output was superior.
Furthermore, the TENS interface allows users to simultaneously perceive and
use the keyboard, whilst the vibro-tactile keyboard would require movement
of the fingers between the actuators and the keys.
66
and capable of varying the “refresh-rate” up to 25Hz.
67
Figure 6.8: Example glyphs – link, text, text
The Papenmeier “easy access bar” has also proven to be a valuable asset
for interface navigation. In the prototype browser, vertical motions allow
the user to quickly “zoom” in or out of element groups (as described in
Section 6.2.1), and horizontal motions allow the display to toggle between
“perception mode” and “reading” mode once an element of significance has
been discovered.
68
6.4 Results
69
on all haptic feedback devices. However, this learning procedure can be facil-
itated by providing speech or Braille output to identify elements when they
are ‘gazed’ at for a brief period.
As far as screen element interpretation was concerned, haptic feedback
via the Braille display surpassed the TENS and vibro-tactile interfaces. This
was mainly because the pictorial nature of glyphs used is more intuitive to
inexperienced users. It is also possible to encode more differentiable elements
by using two Braille cells per finger. The advantages of the Braille interface
would presumably be even more pronounced for users with prior experience
in Braille usage.
Preliminary experiments with the haptic web browser also demonstrated
promising results. For example, users were given the task of using a search
engine to find the answer to a question without sight. They showed that they
were able to locate the input form element with ease and enter the search
keywords. They were also able to locate the search results, browse over them
and navigate to web pages by clicking on links at the gaze position. Further-
more, users could describe the layout of unfamiliar web pages according to
where images, text, links, etc. were located.
70
6.5 Conclusions
This work presents a novel haptic head-pose tracking computer interface that
enables the two-dimensional screen interface to be perceived and accessed
without any use of the eyes.
Three haptic output paradigms were tested, namely: TENS, vibro-tactile
and a refreshable Braille display. All three haptic feedback methods proved
effective to varying degrees. The Braille interface provided greater versatility
in terms of rapid identification of screen objects. The TENS system provided
improved perception of depth (for determining window layers). The vibro-
tactile output proved convenient but with limited resolution.
Preliminary experimental results have demonstrated that considerable
screen-based interactivity is able to be achieved with haptic gaze-tracking
systems including point-and-click and drag-and-drop manipulation of screen
objects. The use of varying haptic feedback can also allow screen objects at
the gaze position to be identified and interpreted. Furthermore, the prelim-
inary experimental results using the haptic web browser demonstrate that
this means of interactivity holds potential for improved human–computer
interactivity for the blind.
71
72
7
Head-Pose Tracking
73
7.1 Simple, robust and accurate head-pose
tracking using a single camera
Tracking the position and orientation of the head in real time is finding
increasing application in avionics, virtual reality, augmented reality, cine-
matography, computer games, driver monitoring and user interfaces for the
disabled. Although many head-pose tracking systems and techniques have
been developed, existing systems either added considerable complexity and
cost, or were not accurate enough for the application. For example, systems
described in [30], [38] and [64] use feature detection and tracking to mon-
itor the position of the eyes, nose and/or other facial features in order to
determine the orientation of the head. Unfortunately these systems require
considerable processing power, additional hardware or multiple cameras to
detect and track the facial features in 3D space. Although monocular sys-
tems (like [30], [38] and [92]) can reduce the cost of the system, they generally
performed poorly in terms of accuracy when compared with stereo or multi-
camera tracking systems [64]. Furthermore, facial feature tracking methods
introduce inaccuracies and the need for calibration or training into the sys-
tem due to the inherent image processing error margins and diverse range of
possible facial characteristics of different users.
To avoid the cost and complexity of facial feature tracking methods a
number of head-pose tracking systems have been developed that track LEDs
or infrared reflectors mounted on the user’s helmet, cap or spectacles (see
[63], [17], [18], and [29] ). However the pointing accuracy of systems utilising
reflected infrared light [63] was found to be insufficient for this research.
74
The other LED-based systems, like [17], [18], and [29], still require multiple
cameras for tracking the position of the LEDs in 3D space which adds cost
and complexity to the system as well as the need for calibration.
In order to track head-pose with high accuracy whilst minimising cost
and complexity, methods were researched for pinpointing the position of in-
frared LEDs using an inexpensive USB camera and efficient algorithms for
estimating the 3D coordinates of the LEDs based on known geometry. The
system is comprised of a single low-cost USB camera and a pair of specta-
cles fitted with three battery-powered LEDs concealed within the spectacle
frame. Judging by the experimental results, the system appears to be the
most accurate low-cost head-pose tracking system developed to date. Fur-
thermore, it is robust and requires no calibration. Experimental results are
provided, demonstrating head-pose tracking accurate to within 0.5◦ when the
user is within one meter of the camera.
7.1.1 Hardware
75
(a) Prototype LED spectacles
76
Although the infrared light cannot be seen with the naked eye, the LEDs
appear quite bright to a digital camera. The experiments were carried out
using a low-cost, standard ‘Logitech QuickCam Express V-UH9’ USB cam-
era 1 , providing a maximum resolution of 640x480 pixels with a horizontal
lens angle of approximately 35◦ . The video data captured by this camera
is quite noisy, compared with more expensive cameras, though this proved
useful for testing the robustness of the system. Most visible light was filtered
out by fitting the lens with a filter comprising several layers of fully-exposed
colour photographic negative. Removal of the camera’s internal infrared fil-
ter was found to be unnecessary. This filtering, combined with appropriate
adjustments of the brightness, contrast and exposure settings of the camera,
allowed the raw video image to be completely black, with the infrared LEDs
appearing as bright white points of light. Consequently the image processing
task is simplified considerably.
The requirement for the user to wear a special pair of spectacles may appear undesirable when compared to systems which use traditional image processing to detect facial features. However, the advantages of a robust, accurate and low-cost system that is independent of individual facial variations, together with the elimination of any training or calibration procedures, can outweigh the inconvenience of wearing special spectacles. Furthermore, the LEDs and batteries could be mounted on any pair of spectacles, headset, helmet, cap or other head-mounted accessory, provided that the geometry of the LEDs is entered into the system.
1 http://www.logitech.com/en-us/support/webcams/legacy-devices/3403
7.1.2 Processing
Blob tracking
Figure 7.2a shows an example raw video image of the infrared LEDs which
appear as three white blobs on a black background.
The individual blobs are detected by scanning the image for contigu-
ous regions of pixels over an adjustable brightness threshold. Initially, the
blobs were converted to coordinates simply by calculating the centre of the
bounding-box; however the sensitivity of the three-dimensional transforma-
tions to even single-pixel changes proved this method to be unstable and
inaccurate. Consequently a more accurate method was adopted — calculat-
ing the centroid of the area using the intensity-based weighted average of
the pixel coordinates, as illustrated in Figure 7.2b. This method provides a
surprisingly high level of accuracy even with low-resolution input and distant
LEDs.
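As an illustration of this centroid calculation (a minimal sketch only, not the thesis implementation; the function name, the default brightness threshold and the use of SciPy’s connected-component labelling are assumptions), the intensity-weighted centre of each blob could be computed as follows:

import numpy as np
from scipy import ndimage

def led_blob_centroids(image, threshold=128):
    # image: 2D greyscale array; pixels below threshold are ignored.
    # Returns a list of sub-pixel (x, y) centroids, one per bright blob.
    mask = image >= threshold
    labels, count = ndimage.label(mask)
    centroids = []
    for blob in range(1, count + 1):
        ys, xs = np.nonzero(labels == blob)
        weights = image[ys, xs].astype(float)
        # The intensity-weighted average of the pixel coordinates gives a
        # sub-pixel centre, far more stable than the bounding-box midpoint.
        cx = np.sum(xs * weights) / np.sum(weights)
        cy = np.sum(ys * weights) / np.sum(weights)
        centroids.append((cx, cy))
    return centroids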
Figure 7.2: (a) Raw video input (showing the infrared LEDs at close range – 200mm); (b) Example LED blob (with centroid marked) and corresponding intensity data
Head-pose calculation
Once the two-dimensional blob coordinates have been calculated, the points
must be projected back into three-dimensional space in order to recover the
original LED positions. Solving this problem is not straightforward. Fig-
ure 7.3 illustrates the configuration of the problem. The camera centre (C)
is the origin of the coordinate system, and it is assumed to be facing directly
down the z-axis. The ‘gaze’ of the user is projected onto a ‘virtual screen’
which is also centred on the z-axis and perpendicular to it. The dimensions
and z-translation of the virtual screen are controllable parameters and do not
necessarily have to correspond with a physical computer screen, particularly
for blind users and virtual reality applications. In fact, the virtual screen
can be easily transformed to any size, shape, position or orientation relative
to the camera. Figure 7.3 also displays the two-dimensional image plane,
scaled for greater visibility. The focal length (z) of the camera is required
to perform the three-dimensional calculations. The LED points are labelled
L, R and F (left, right and front respectively, ordered from the camera’s
point of view). Their two-dimensional projections onto the image plane are
labelled l, r and f . L, R and F must lie on vectors from the origin through
their two-dimensional counterparts.
Figure 7.3: Perspective illustration of the virtual screen (located at the cam-
era centre), the 2D image plane, the 3D LED model and its projected ‘gaze’
Given knowledge of this model, the exact location of the LEDs along the projection rays can be determined. The front LED is equidistant from the outer LEDs, providing Equation 7.1: |LF| = |RF|. The ratio r between this distance and the baseline distance |LR| is also known.
These constraints are sufficient for determining a single solution orien-
tation for the model. Once the orientation has been calculated, the exact
physical coordinates of the points can be derived, including the depth from
the camera, by utilising the model measurements (provided in Section 7.1.1).
The distance of the model from the camera is irrelevant for determining
the model’s orientation, since it can simply be scaled in perspective along the
projection vectors. Thus it is feasible to fix one of the points at an arbitrary
location along its projection vector, calculate the corresponding coordinates
of the other two points, and then scale the solution to its actual size and
distance from the camera.
Parametric equations can be used to solve the problem. Thus the position
of point L is expressed as:
Lx = tlx (7.3a)
Ly = tly (7.3b)
Lz = tz (7.3c)
Since z is the focal length, a value of 1 for the parameter t will position L
on the image plane.
Thus there are only three unknowns — the three parameters of the LED
points on their projection vectors. In fact one of these unknowns is elimi-
nated, since the location of one of the points can be fixed — in this solution,
the location of R was fixed at depth Rz = z, thus making its x- and y-
coordinates equal to rx and ry respectively.
The position of the point F is expressed as:
Fx = ufx (7.4a)
Fy = ufy (7.4b)
Fz = uz (7.4c)
Substituting these parametric forms into the equidistance constraint gives Equation 7.5:

sqrt((t lx − u fx)^2 + (t ly − u fy)^2 + (t z − u z)^2) = sqrt((rx − u fx)^2 + (ry − u fy)^2 + (z − u z)^2)   (7.5)

Squaring both sides and solving for u yields Equation 7.6:

u(t) = [z^2 (t^2 − 1) + lx^2 t^2 + ly^2 t^2 − rx^2 − ry^2] / [2(z^2 (t − 1) + lx fx t + ly fy t − rx fx − ry fy)]   (7.6)
Figure 7.4: Front-point parameter (u) plotted against the left-point parameter (t)
Figure 7.4 shows a plot of Equation 7.6. It should be noted that the asymp-
tote is at:
t = (rx fx + ry fy + z^2) / (lx fx + ly fy + z^2)   (7.7)
The baseline distance b(t) between the outer points L and R is given by Equation 7.8:

b(t) = sqrt((rx − t lx)^2 + (ry − t ly)^2 + (z − t z)^2)   (7.8)
Figure 7.5: Baseline distance b(t) plotted against the left-point parameter (t)
The height h(t) of the triangle, from the midpoint of the baseline to the front point F, is given by Equation 7.9:

h(t) = sqrt((u(t) fx − t lx)^2 + (u(t) fy − t ly)^2 + (u(t) z − t z)^2 − (b(t)/2)^2)   (7.9)
Figure 7.6: Height from the baseline to the front point, h(t), plotted against the left-point parameter (t)
Figure 7.6 shows a plot of Equation 7.9. It should be noted that this
function, since it is dependent on u(t), shares the asymptote defined in Equa-
tion 7.7.
Figure 7.7: Triangle height/baseline ratio h(t)/b(t) plotted against the left-point parameter (t)
At this stage the actual baseline distance or height of the triangle is not
relevant — only their relationship. Figure 7.7 shows a plot of h(t)/b(t). This
function has a near-invisible ‘hump’ just after it reaches its minimum value
after the asymptote (around t = 1.4 in this case). This graph holds the key
to the solution, and can reveal the value of t for which the triangle has a ratio
which matches the model. Unfortunately, it is too complex to be analytically
inverted, thus requiring root-approximation techniques to find the solution
values. Thankfully, the solution range can be further reduced by noting two
more constraints inherent in the problem.
Firstly, only solutions in which the head is facing toward the camera are
relevant. Rearward facing solutions are considered to be invalid as the user’s
head would obscure the LEDs. Thus the constraint is added that:

Fz < Mz   (7.10)

where Mz is the z-coordinate of the midpoint of the baseline between the outer LEDs.
Figure 7.8: Front-point and midpoint z-coordinates plotted against the left-point parameter (t)
This constraint excludes not only rearward-facing solutions, but also solutions beyond the rotational range
of the LED configuration; that is, the point at which the front LED would
occlude one of the outer LEDs. The prototype LED configuration allows
rotation (panning) of approximately 58◦ to either side before this occurs.
The upper limit (the intersection between the Fz and Mz functions) can be expressed as:

t ≤ [−S − sqrt(S^2 − 4(−lx^2 − ly^2 + lx fx + ly fy)(rx^2 + ry^2 − rx fx − ry fy))] / [2(−lx^2 − ly^2 + lx fx + ly fy)]   (7.12)
Figure 7.9: Triangle height/baseline ratio plotted against the left-point parameter (t), showing the lower and upper limits for root-approximation
Figure 7.9 illustrates the upper and lower limits for root-approximation in
finding the value of t for which the triangle ratio matches the model geometry.
Once t has been approximated, u can be easily derived using Equation 7.6,
and these parameter values substituted into the parametric coordinate equa-
tions for L and F . Thus the orientation has been derived. The solution can
now be simply scaled to the appropriate size using the dimensions of the
model. This provides accurate three-dimensional coordinates for the model
in relation to the camera. Thus the user’s ‘gaze’ (based on head-orientation)
can be projected onto a ‘virtual screen’ positioned relative to the camera.
Bisection root-approximation was used to find t and hence recover the LED coordinates. The t parameter was calculated to 10-decimal-place precision, in approximately 30 bisection iterations.
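To make this numerical solution concrete, the sketch below implements Equations 7.6, 7.8 and 7.9 and bisects the height/baseline ratio between the bracketing limits. It is illustrative only: the function and parameter names are assumptions, the model ratio and the limits t_lower and t_upper would come from the measured LED geometry and Equations 7.7 and 7.12, and the ratio is assumed to cross the model value exactly once inside the bracket.

import math

def solve_head_pose(l, r, f, z, model_ratio, t_lower, t_upper, iters=30):
    # l, r, f are the 2D image-plane blob coordinates (lx, ly), (rx, ry), (fx, fy);
    # z is the focal length; model_ratio is h/b for the physical LED triangle.
    lx, ly = l
    rx, ry = r
    fx, fy = f

    def u(t):
        # Equation 7.6: front-point parameter as a function of t.
        num = z * z * (t * t - 1) + lx * lx * t * t + ly * ly * t * t - rx * rx - ry * ry
        den = 2 * (z * z * (t - 1) + lx * fx * t + ly * fy * t - rx * fx - ry * fy)
        return num / den

    def b(t):
        # Equation 7.8: baseline length between L and R.
        return math.sqrt((rx - t * lx) ** 2 + (ry - t * ly) ** 2 + (z - t * z) ** 2)

    def h(t):
        # Equation 7.9: height from the baseline midpoint to the front point F.
        ut = u(t)
        lf2 = (ut * fx - t * lx) ** 2 + (ut * fy - t * ly) ** 2 + (ut * z - t * z) ** 2
        return math.sqrt(lf2 - (b(t) / 2) ** 2)

    def ratio_err(t):
        return h(t) / b(t) - model_ratio

    lo, hi = t_lower, t_upper
    for _ in range(iters):  # ~30 bisection iterations give ~10 decimal places
        mid = (lo + hi) / 2
        if ratio_err(lo) * ratio_err(mid) <= 0:
            hi = mid
        else:
            lo = mid
    t = (lo + hi) / 2
    return t, u(t)

The returned parameters place the LEDs (up to scale) at t(lx, ly, z), u(t)(fx, fy, z) and (rx, ry, z); scaling this solution to the physical model dimensions then yields the absolute coordinates, as described above.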
To test the accuracy of the system, the camera was mounted in the centre
of a piece of board measuring 800x600mm. A laser-pointer was mounted just
below the centre LED position to indicate the ‘gaze’ position on the board.
The system was tested over a number of different distances, orientations and
video resolutions. The accuracy was monitored over many frames in order to
measure the system’s response to noise introduced by the dynamic camera
image. Table 7.1 and Figure 7.10 report the variation in calculated ‘gaze’
x- and y-coordinates when the position of the spectacles remained static.
Note that this variation increases as the LEDs are moved further from the
camera, because the resolution effectively drops as the blobs become smaller
(see Table 7.2). This problem could be avoided by using a camera with optical
zoom capability providing the varying focal length could be determined.
Figure 7.10: Variation in calculated ‘gaze’ x-coordinates (left) and y-coordinates (right), in degrees, plotted against distance from the camera (mm), at 320x240 and 640x480 pixel resolutions
Table 7.2: LED ‘blob’ diameters (pixels) at different resolutions and camera
distances
As an additional accuracy measure, the system’s depth resolution was
measured at a range of distances from the camera. As with the ‘gaze’ reso-
lution, the depth resolution was limited by the video noise. In each case, the
spectacles faced directly toward the camera. These results are tabulated in
Table 7.3.
7.1.4 Summary
Other applications include scroll control of head-mounted virtual reality dis-
plays, or any application where the head position and orientation are to be monitored.
The infrared LED tracking solution indeed proved highly accurate and extremely low-cost; however, requiring the user to wear tracking spectacles was far from ideal. A more recent development in sensor technology presented an opportunity to maintain the high level of accuracy without requiring a tracking target with known geometry: time-of-flight cameras.
In 2006, Swiss company MESA Imaging announced the release of the SR-
3000 “SwissRanger” time-of-flight camera [61]. The camera (pictured in
Figure 7.11) is surrounded by infrared LEDs which illuminate the scene,
and allows the depth of each pixel to be measured based on the time of
arrival of the frequency modulated infrared light in real-time. Thus, for each
frame it is able to provide a depth map in addition to a standard greyscale
amplitude image (see examples in Figure 7.12). The amplitude image is
based on reflected infrared light, and therefore is not affected by external
lighting conditions.
Figure 7.11: SwissRanger SR-3000
Figure 7.12: (a) SR-3000 amplitude image
Despite the technological breakthrough that the SwissRanger has pro-
vided, it has a number of limitations. The sensor is QCIF (Quarter Common
Intermediate Format, 176x144 pixels), so the resolution of the data is low.
The sensor also has a limited ‘non-ambiguity range’ before the signals get
out of phase. At the standard 20MHz frequency, this range is 7.5 metres.
However, given the comparatively short-range nature of the application, this
limitation does not pose a problem for the head-pose tracking system. The
main limitation of concern is noise associated with rapid movement. The
SR-3000 sensor is controlled as a so-called one-tap sensor. This means that
in order to obtain distance information, four consecutive exposures have to
be performed. Fast moving targets in the scene may therefore cause errors
in the distance calculations; see [62]. Whilst the accuracy of a depth map
of a relatively stationary target is quite impressive (see Figure 7.13a), the
depth map of a target in rapid motion is almost unusable by itself (see Fig-
ure 7.13b). This problem may be overcome to a considerable extent by using a combination of median filtering, time fusion and cross-referencing of the intensity image data with the depth map.
Figure 7.13: (a) Stationary subject
7.2.2 Overview
The tip of the nose is considered to be the most important point in this system.
Although the nose can be considered to be the best facial feature to
track due to its availability and universality, more than this single feature
is needed to obtain the orientation of the face in three-dimensional space.
One approach would be to use an algorithm such as Iterative Closest Point
(ICP) [2] to match the facial model obtained in the previous frame with the
current frame. This method may work, but is computationally expensive to run in real time. It
may also fail if the head moves too quickly or if some frames are noisy and
the initial fit is a considerable distance from optimal.
Alternatively, an adaptive feature selection algorithm could be formu-
lated which automatically detects identifiable features within the depth map
or amplitude image (or ideally a combination of the two) in order to de-
tect additional features which can be used to perform matching. Here, a
redundant set of features could theoretically provide the ability to match the
orientation between two models with high accuracy. In practice however,
the low resolution of the SwissRanger camera combined with the noisy na-
ture of the depth map have caused this approach to prove unsuccessful. The
features obtained in such an adaptive feature selection algorithm would also
need to be coerced to conform to the target spatial configuration. To over-
come these difficulties, a novel approach was developed that simplified the
feature selection process whilst simultaneously removing the need for spatial
coercion.
The premise is relatively simple: with only one feature so far (i.e. the
nose tip) more are required, preferably of a known spatial configuration. In-
tersecting the model with a sphere of radius r centred on the nose feature
results in an intersection profile containing all points on the model which
are r units away from the central feature, as shown in Figure 7.14. Because
of the spherical nature of the intersection, the resulting intersection pro-
file is completely unaffected by the orientation of the model, and thus ideal
for tracking purposes. It could simply be analysed for symmetry, if it were
safe to assume that the face is sufficiently symmetrical and that the central
feature lies on the axis of symmetry, and an approximate head-pose could
be calculated based on symmetry alone. However, given that many human
noses are far from symmetrical, and up to 50% of the face may be occluded
due to rotation, this approach will not always succeed. But if the model is
saved from the first frame, spherical intersections can be used to match it
against subsequent frames and thus obtain the relative positional and rota-
tional transformation. Multiple spherical intersections can be performed to
increase the accuracy of the system.
Figure 7.14: Illustration of tracing a spherical intersection profile starting
from the nose tip
profile due to the ‘roll’ rotational limits of the human head.
After having matched a subsequent 3D model to the position and orienta-
tion of the previous one, additional data becomes available. If the orientation
of the face has changed, some regions that were previously occluded (e.g. the
far-side of the nose) may now be visible. Thus, merging the new 3D data
into the existing 3D model improves the accuracy of subsequent tracking.
Consequently, with every new matched frame, more data is added to the
target 3D model making it into a continuously evolving mesh.
To prevent the data comprising the 3D model from becoming too large,
the use of polygonal simplification algorithms is proposed to adaptively re-
duce the complexity of the model. If a sufficient number of frames indicate
that some regions of the target model are inaccurate, points can be adap-
tively moulded to match the majority of the data, thus filtering out noise,
occlusion, expression changes, etc. In fact, regions of the model can be iden-
tified as being rigid (reliably robust) or noisy / fluid (such as hair, regions
subject to facial expression variation, etc.) and appropriately labelled. Thus,
matching can be performed more accurately by appropriately weighting the
robust areas of the 3D model.
This approach to head pose tracking depends heavily on the accuracy of
the estimation of the initial central feature point (i.e. the nose tip). If this
is offset, the entire intersection profile changes. Fortunately, the spherical
intersection profiles themselves can be used to improve the initial central
point position. By using a hill climbing approach in three-dimensional space,
the central point can be adjusted slightly in each direction to check for a more
accurate profile match. This will converge upon the best approximation of
the target centre point provided the initial estimate is relatively close to the
optimal position.
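A simple sketch of such a refinement is given below; the scoring function match_error, the step sizes and the iteration cap are assumptions supplied for illustration.

def refine_nose_tip(centre, match_error, step=2.0, min_step=0.25, max_iters=100):
    # centre: initial (x, y, z) estimate of the nose tip.
    # match_error: caller-supplied function scoring how well the spherical
    # intersection profiles about a candidate centre match the stored model
    # (lower is better).
    best = tuple(centre)
    best_err = match_error(best)
    iters = 0
    while step >= min_step and iters < max_iters:
        iters += 1
        improved = False
        for axis in range(3):
            for delta in (-step, step):
                candidate = list(best)
                candidate[axis] += delta
                err = match_error(tuple(candidate))
                if err < best_err:
                    best, best_err, improved = tuple(candidate), err, True
        if not improved:
            step /= 2.0  # no neighbouring offset improved; refine the step size
    return best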
Furthermore, the system can be used to differentiate or identify users.
Each evolving mesh can be stored in a database, and a new model can be
generated for a face which does not sufficiently match any existing models.
Due to the simplified nature of spherical intersection profile comparisons,
a database of faces can be searched with considerable efficiency. Spherical
intersection profiles for each model could also be cached for fast searching.
7.2.3 Preprocessing
Several steps are used to prepare the acquired data for processing.
Median Filtering
Some cross-correlation with the amplitude image can help eliminate erro-
neous data in the depth map. For example, pixels in the depth map which
correspond to zero values in the amplitude image (most frequently found
around the edges of objects) are likely to have been affected by object mo-
tion, and can be filtered out. Minimum and maximum distance thresholds
can also be applied to the depth map to eliminate objects in the foreground
or background.
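A minimal sketch of this preprocessing stage is given below; the distance thresholds, the 3x3 filter size and the function name are illustrative assumptions rather than the thesis’s exact implementation.

import numpy as np
from scipy import ndimage

def preprocess_depth(depth, amplitude, min_dist=300.0, max_dist=1500.0):
    # Discard pixels with zero amplitude (typically motion artefacts around
    # object edges) and depths outside the working range, then apply a 3x3
    # median filter to suppress the remaining shot noise.  Depths are in mm.
    cleaned = depth.astype(float).copy()
    invalid = (amplitude == 0) | (cleaned < min_dist) | (cleaned > max_dist)
    cleaned[invalid] = 0.0  # zero marks "no data"
    return ndimage.median_filter(cleaned, size=3)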
Region of interest
Whilst the user’s torso can be of some value to a head-pose tracking ap-
plication, it is generally desirable to perform calculations based solely upon
the facial region of the images. Identifying the facial region robustly is non-
trivial. The initial prototype system performed accurate jaw-line detection
based on the greatest discontinuity in each column of the depth map in order
to separate the head from the body. This approach sometimes failed due
to extreme head rotation, excessive noise, presence of facial hair, occlusion,
etc. Consequently, a more robust approach was developed using a simple
bounding-box. Given that a depth map is available, it is straightforward to
determine the approximate distance of the user from the camera. This is
achieved by sampling the first n non-empty rows in the depth map (the top
of the user’s head) and then calculating the average depth. By taking the
approximate distance of the user’s head, and anthropometric statistics [69],
the maximum number of rows the head is likely to occupy within the images
can be determined. The centroid of the depth map pixels within these rows
is then calculated, and a region of interest of the appropriate dimensions
is centred on this point (see Figure 7.15). This method has proved 100%
reliable in all sample sequences recorded to date.
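The sketch below outlines this bounding-box approach; the anthropometric head height, the assumed width-to-height ratio and the number of sampled rows are illustrative placeholders, not the thesis’s exact constants.

import numpy as np

def facial_region_of_interest(depth, focal_length_px, head_height_mm=260.0, top_rows=5):
    # Returns (row_min, row_max, col_min, col_max) for the head, or None.
    nonempty_rows = np.where((depth > 0).any(axis=1))[0]
    if len(nonempty_rows) == 0:
        return None

    # Average depth of the first few non-empty rows (the top of the head).
    sample = nonempty_rows[:top_rows]
    head_distance = depth[sample][depth[sample] > 0].mean()

    # Perspective projection: height in pixels = focal length * height / distance.
    head_rows = int(focal_length_px * head_height_mm / head_distance)
    top = nonempty_rows[0]
    bottom = min(top + head_rows, depth.shape[0])

    # Centre a box of the appropriate size on the centroid of the head pixels.
    ys, xs = np.nonzero(depth[top:bottom] > 0)
    cy, cx = int(ys.mean()) + top, int(xs.mean())
    half_h = head_rows // 2
    half_w = int(head_rows * 0.75) // 2  # assume head width ~0.75 of its height
    return (cy - half_h, cy + half_h, cx - half_w, cx + half_w)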
Figure 7.15: Example facial region-of-interest
Once the acquired data has been preprocessed, the central feature must be
found in order to localise the models in preparation for orientation calcu-
lations. The rationale for choosing the nose tip as the central feature was
discussed in Section 7.2.2. As Gorodnichy points out [23], the nose tip can
be robustly detected in an amplitude image by assuming it is surrounded
by a spherical Lambertian surface of constant albedo. However, using a
SwissRanger sensor provides an added advantage. Since the amplitude im-
age is illuminated solely by the integrated infrared light source, calculating
complex reflectance maps to handle differing angles of illumination is unnec-
essary. Additional data from the depth map such as proximity to camera and
curvature (see Figure 7.16) can also be used to improve the search and assist
with confirming the location of the nose tip. Figure 7.17 shows a typical
frame with nose localisation data overlaid. Preliminary results have shown
that this approach is fast and robust enough to locate the nose within typical
frame sequences with sufficient accuracy for this application.
Figure 7.16: Amplitude image with curvature data overlaid. Greener pixels
indicate higher curvature (calculated from depth map).
Figure 7.17: Sample frame with nose-tracking data displayed. Green pixels
are candidate nose pixels; red cross indicates primary candidate.
Once the central feature (nose tip) has been located, the dimensionality of
the problem has been reduced considerably by removing the translational
component of the transformation. Now a single rotation about a specific
axis in three-dimensional space (passing through the central feature point)
will be sufficient to match the orientation of the current models to the saved
target. Small errors in the initial location of the nose point can be iteratively
improved using three-dimensional hill-climbing to optimise the matching of
the spherical intersection profiles, as discussed in Section 7.2.2.
Spherical intersection algorithm
Algorithm 1 Find intersection profile of projected depth map with sphere of radius radius centred on pixel projected from depthMap[noseRow][noseCol]
r ⇐ noseRow, c ⇐ noseCol
if noseCol > faceCentroidCol then
  direction ⇐ LEFT
else
  direction ⇐ RIGHT
end if
found ⇐ false {No intersection found yet}
centre ⇐ projectInto3d(depthMap, r, c)
inner ⇐ centre
while r and c are within region of interest, and not found do
  (r, c) ⇐ translate(r, c, direction)
  outer ⇐ projectInto3d(depthMap, r, c)
  if distance(outer, centre) > radius then
    found ⇐ true
  else
    inner ⇐ outer
  end if
end while
if not found then
  return No intersection
end if
Algorithm 1 (continued)
for startDirection = UP to DOWN do
  direction ⇐ startDirection
  while Loop not closed do
    (r2, c2) ⇐ translate(inner.r, inner.c, direction)
    inner2 ⇐ projectInto3d(depthMap, r2, c2)
    (r2, c2) ⇐ translate(outer.r, outer.c, direction)
    outer2 ⇐ projectInto3d(depthMap, r2, c2)
    if inner2 or outer2 are invalid then
      break
    else if distance(inner2, centre) > radius then
      outer ⇐ inner2
    else if distance(outer2, centre) < radius then
      inner ⇐ outer2
    else
      inner ⇐ inner2
      outer ⇐ outer2
    end if
    id ⇐ distance(inner, centre)
    t ⇐ (radius − id)/(distance(outer, centre) − id)
    if startDirection = UP then
      Append (inner + (outer − inner) × t) to SIP
    else
      Prepend (inner + (outer − inner) × t) to SIP
    end if
    Update direction
  end while
end for
return SIP
Figure 7.18: Example spherical intersection profiles overlaid on depth (top-
left), amplitude (bottom-left) and 3D point cloud (right).
only to provide an initial alignment for each profile pair, after which a new
algorithm will measure the fit and optimise it in a manner similar to ICP [2].
The fit metric provided by this algorithm could be used in the hill-climbing-
based optimisation of the central point (nose tip).
Ideally, each profile pair should lead to the same three-dimensional axis and magnitude of rotation required to align the model. Thus
the resultant collection of rotational axes provides a redundant approxima-
tion of the rotation required to align the entire model. This can be analysed
to remove outliers, etc., and then averaged to produce the best approximation
of the overall transformation.
Once the current frame has been aligned with the target 3D model, any addi-
tional information can be used to evolve the 3D model. For example, regions
which were originally occluded in the target model may now be visible, and
can be used to extend the model. Thus a near-360◦ model of the user’s head
can be obtained over time. Adjacent polygons with similar normals can be
combined to simplify the mesh. Areas with higher curvature can be sub-
divided to provide greater detail. For each point in the model, a running
average of the closest points in the frames is maintained. This can be used
to push or pull points which stray from the average and allow the mesh to
‘evolve’ and become more accurate with each additional frame. The contri-
bution of a given frame to the running averages is weighted by the quality
of that frame, which is simply a measure of depth map noise combined with
the mean standard deviation of the intersection profile rotational transforms.
Furthermore, a running standard deviation can be maintained for each point
to allow the detection of noisy regions of the model, such as hair, glasses
which might reflect the infrared light, facial regions subject to expression
variation, etc. These measures of rigidity can then be used to weight regions
of the intersection profiles to make the algorithm tolerant of fluid regions,
and able to focus on the rigid regions. Non-rigid regions on the extremities
of the model can be ignored completely. For example, the neck will show up
as a non-rigid region due to the fact that its relation to the face changes as
the user’s head pose varies.
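One way to realise such per-point running statistics (a sketch only; the class name, the rigidity threshold and the use of a weighted variant of Welford’s online algorithm are assumptions) is shown below.

import numpy as np

class EvolvingVertex:
    # Quality-weighted running mean and spread for one vertex of the model.
    def __init__(self, position):
        self.mean = np.asarray(position, dtype=float)
        self.weight = 0.0
        self.m2 = 0.0  # accumulated squared deviation

    def update(self, observation, quality):
        # Pull the stored point towards the observation, weighted by frame quality.
        obs = np.asarray(observation, dtype=float)
        self.weight += quality
        delta = obs - self.mean
        self.mean += (quality / self.weight) * delta
        self.m2 += quality * float(np.dot(delta, obs - self.mean))

    def deviation(self):
        return float(np.sqrt(self.m2 / self.weight)) if self.weight > 0 else 0.0

    def is_rigid(self, threshold=5.0):  # threshold in model units, illustrative
        # Low spread marks rigid regions; high spread marks hair, expression, etc.
        return self.deviation() < threshold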
Initial alignment
The most obvious method of calculating the orientation is to find the centroid
of each spherical intersection profile, and then project a line through the cen-
troids from the centre of the spheres using a least-squares fit. This provides
a reasonable approximation in most cases but performs poorly when the face
orientation occludes considerable regions of the spherical intersection profiles
from the camera. A spherical intersection profile which is 40% occluded will
produce a poor approximation of the true centroid and orientation vector
using this method.
Instead, the system finds the average of the latitudinal extrema of each
spherical intersection profile (i.e. the topmost and bottommost points). This
proved effective over varying face orientations for two reasons. Firstly, these
points are unlikely to be occluded due to head rotation. Secondly, in most
applications the subject’s head is unlikely to ‘roll’ much (as opposed to ‘pan’
and ‘tilt’), so these points are likely to be the true vertical extremities of
the face. If the head is rotated so far that these points are not visible on
the spherical intersection profile, the system detects that the spherical inter-
section profile is still rising/falling at the upper/lower terminal points and
therefore dismisses it as insufficient. A least-squares fit of a vector from the
nose tip passing through the latitudinal extrema midpoints provides a good
initial estimate of the face orientation. Several further optimisations are sub-
sequently performed by utilising additional information, as discussed in the
following sections.
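A sketch of this latitudinal-extrema alignment is shown below; it assumes the y-axis is vertical and fits the gaze direction as the principal direction of the midpoint offsets from the nose tip, with illustrative names throughout.

import numpy as np

def initial_orientation(nose_tip, profiles):
    # profiles: list of spherical intersection profiles, each an (n, 3) array.
    nose = np.asarray(nose_tip, dtype=float)
    midpoints = []
    for profile in profiles:
        pts = np.asarray(profile, dtype=float)
        top = pts[np.argmax(pts[:, 1])]     # topmost point (y assumed vertical)
        bottom = pts[np.argmin(pts[:, 1])]  # bottommost point
        midpoints.append((top + bottom) / 2.0)

    # Least-squares line through the nose tip: the first right singular vector
    # of the midpoint offsets gives the best-fit direction.
    offsets = np.asarray(midpoints) - nose
    _, _, vt = np.linalg.svd(offsets, full_matrices=False)
    direction = vt[0]
    if np.dot(direction, offsets.mean(axis=0)) < 0:
        direction = -direction  # orient consistently, away from the nose tip
    return direction / np.linalg.norm(direction)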
Symmetry optimisation
faceprint using the technique described above, to produce good orientation
estimation. Given the observation of limited roll discussed in Section 7.2.7, it
is reasonable to assume that the orientation plane will intersect the faceprint
approximately vertically. The symmetry of the faceprint is then measured
using Algorithm 2. This algorithm returns a set of midpoints for each spher-
ical intersection profile, which can be used to measure the symmetry of the
current faceprint orientation. Algorithm 2 also provides an appropriate trans-
formation which can be used to optimise the symmetry by performing a least-
squares fit to align the plane of symmetry approximated by the midpoints.
Figure 7.19 illustrates an example faceprint with symmetry midpoints visible.
Algorithm 2 Calculate the symmetry of a set of SIPs
midpoints ⇐ new Vector[length(SIPs)]
for s ⇐ 0 to length(SIPs) − 1 do
  for p ⇐ 0 to length(SIPs[s]) − 1 do
    other ⇐ ∅
    Find other point on SIPs[s] at same ‘height’ as p:
    for q ⇐ 0 to length(SIPs[s]) − 1 do
      next ⇐ q + 1
      if next >= length(SIPs[s]) then
        next ⇐ 0 {Wrap}
      end if
      if q = p ∨ next = p then
        continue
      end if
      if (q.y < p.y < next.y) ∨ (q.y > p.y > next.y) then
        other ⇐ interpolate(q, next)
        break
      end if
    end for
    if other ≠ ∅ then
      midpoints[s].append((p + other)/2)
    end if
  end for
end for
return midpoints
Figure 7.19: Example faceprint with symmetry midpoints (in yellow), and
per-sphere midpoint averages (cyan)
Temporal optimisation
Utilising data from more than one frame provides opportunities to increase
the quality and accuracy of the faceprint tracking and recognition. This is
achieved by maintaining a faceprint model over time, where each new frame
contributes to the model by an amount weighted by the quality of the current
frame compared to its predecessors. The quality of a frame is assessed using
two parameters. Firstly, the noise in the data is measured during the median
filtering process mentioned in Section 7.2.3. This is an important parameter,
as frames captured during fast motions will be of substantially lower quality
due to the sensor issues discussed in Section 7.2.1. In addition, the measure
of symmetry of the faceprint (see Section 7.2.7) provides a good assessment
of the quality of the frame. These parameters are combined to create an
estimate of the overall quality of the frame using Equation 7.14.
quality = sqrt(noiseFactor × symmetryFactor)   (7.14)
Thus, for each new frame captured, the accuracy of the stored faceprint
model can be improved. The system maintains a collection of the n high-
est quality faceprints captured up to the current frame, and calculates the
optimal average faceprint based on the quality-weighted average of those n
frames. This average faceprint quickly evolves as the camera captures the
subject in real-time, and forms a robust and symmetric representation of the
subject’s 3D facial profile (see Figure 7.20).
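The sketch below shows Equation 7.14 together with one plausible way of maintaining the quality-weighted average faceprint; the value of n and the flat-list representation of a faceprint are assumptions made for illustration.

import math

def frame_quality(noise_factor, symmetry_factor):
    # Equation 7.14: combined quality estimate for one captured frame.
    return math.sqrt(noise_factor * symmetry_factor)

def average_faceprint(faceprints, qualities, keep=5):
    # faceprints: equal-length lists of quantised values (e.g. 120 per frame).
    # Keep the n highest-quality frames and average them, weighted by quality.
    ranked = sorted(zip(qualities, faceprints), key=lambda p: p[0], reverse=True)[:keep]
    total = sum(q for q, _ in ranked)
    length = len(ranked[0][1])
    return [sum(q * fp[i] for q, fp in ranked) / total for i in range(length)]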
Figure 7.20: Faceprint from noisy sample frame (white) with running-average faceprint (red) superimposed
Comparing faceprints
Figure 7.21 shows some actual 2D transforms of faceprints taken from differ-
ent subjects. Although the system is yet to be tested with a large sample of
subjects, the preliminary results suggest that faceprints derived from spher-
ical intersection profiles vary considerably between individuals.
The number of spherical intersection profiles comprising faceprints, their
respective radii and the number of quantised points in each spherical inter-
section profile are important parameters that define how a faceprint is stored
and compared. Experimentation has shown that increasing the number of
spherical intersection profiles beyond six and the number of quantised points
in each spherical intersection profile beyond twenty did not significantly im-
prove the system’s ability to differentiate between faceprints. It seems this
is due mainly to the continuous nature of the facial landscape and residue
noise in the processed data. Consequently, the experiments were conducted
with faceprints comprised of six spherical intersection profiles. Each spher-
ical intersection profile was divided into twenty points. All faceprints were
taken from ten human subjects positioned approximately one metre from the
camera and at various angles.
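As an illustration, two quantised faceprints could be compared with a simple Euclidean score and matched against a stored database as sketched below; the metric, the threshold and the names are assumptions rather than the thesis’s exact method.

import math

def faceprint_distance(fp_a, fp_b):
    # Compare two quantised faceprints (e.g. 6 profiles x 20 points = 120
    # values each); the lower the score, the better the match.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

def identify(probe, database, threshold=10.0):
    # Return the best-matching stored identity, or None if nothing is close enough.
    best_name, best_score = None, float("inf")
    for name, stored in database.items():
        score = faceprint_distance(probe, stored)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None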
Experiments showed that 120-point faceprints provided accurate recog-
nition of the test subjects from most view points. It was found to take an
average of 6.7µs to process and compare two faceprints with an Intel dual-
core 1.8GHz Centrino processor. This equates to a faceprint comparison rate of almost 150,000 per second, which clearly demonstrates the potential search speed of the system. The speed of each procedure (running on the same
processor) involved in capturing faceprints from the SR3000 camera is out-
lined in Table 7.4.
Processing Stage Execution Time
Distance Thresholding 210µs
Brightness Thresholding 148µs
Noise Filtering 6,506µs
Region of Interest 122µs
Locate Nose 17,385µs
Trace Profiles 1,201µs
Quantise Faceprint 6,330µs
Facial Recognition 91µs
Update Average Faceprint 694µs
Total per frame 32,687µs
7.3 Conclusions
8 Conclusions
• Perception of colour information via haptic feedback
and specifically the use of the head for controlling the region of interest and
electro-neural stimulation for haptic communication.
8.1 Achievements
5. 360◦ perception protocol for communication of landmark bearings (see
Section 5.2)
8.2 Suggestions for further research
As can be seen by the breadth of the achievement list above, most avenues
of research and experimentation in these areas open up several more exciting
possibilities.
Many others have followed this research and already built upon it. In 2004,
the ENVS pioneered research in this field and was the only device of its kind
for a number of years. In 2010, Pradeep et al. developed an ENVS-like
device using stereo cameras and vibration motors mounted on the shoulders
and waist [70]. In 2011, Mann et al. built a device [53] very similar to the original ENVS prototype, but utilising a Microsoft Kinect 1 and an array of vibro-tactile actuators mounted within a helmet. Several others (for example
[16], [68] and [50]) have also continued research into stereo-matching based
navigation aids. Reviews such as [51] and [12] also note the significance of
the ENVS.
Very little research has been conducted into the communication of colour
information via haptic feedback. As noted in [37], at the time of publication
of [59] there were only a couple of references to the mapping of colours to
textures in the literature. Since then researchers such as [74] have explored
the concept a little further based on this work, but there is great scope for
further research.
1 http://www.xbox.com/en-AU/Kinect
8.2.2 Other areas
8.3 Closing remarks
All four of the target research areas were explored extensively, and each re-
sulted in the successful design, development and testing of devices to assist
the visually impaired. Use of the head for controlling the environmental sen-
sors (or the region of interest within the virtual environment) indeed proved
natural and intuitive, harnessing the existing design within the human body
for controlling visual receptors. Electro-neural stimulation was shown to be
highly effective in providing rich haptic feedback using intensity, frequency,
placement and pulse-encoding. Areas of difficulty and limitations were dis-
covered, as well as areas of great potential for future development. Thus, in
a way, this research has acted as a navigational aid for research in this area –
exposing obstacles, and illuminating navigable paths in the direction of the
destination: use of technology to not only replace the missing visual sense,
but provide capabilities above and beyond the standard human apparatus.
Bibliography
[6] D. C. Chang, L. B. Rosenberg, and J. R. Mallett. Design of force
sensations for haptic feedback computer interfaces, 2010. US Patent
7,701,438.
[16] S. Fazli and H. Mohammadi. Collision-free navigation for blind persons
using stereo matching. International Journal of Scientific and Engi-
neering Research, Volume 2, December 2011. http://www.ijser.org/viewPaperDetail.aspx?DEC1171.
[18] Eric Foxlin, Yury Altshuler, Leonid Naimark, and Mike Harrington. A
3d motion and structure estimation algorithm for optical head tracker
system. In ISMAR ’04: Proceedings of the Third IEEE and ACM In-
ternational Symposium on Mixed and Augmented Reality (ISMAR’04),
pages 212–221, Washington, DC, USA, 2005. IEEE Computer Society.
[19] Freedom Scientific. Job Access With Speech (JAWS), 2012. http://www.freedomscientific.com/products/fs/jaws-product-page.asp.
[26] M. Haker, M. Bohme, T. Martinetz, and E. Barth. Geometric invariants
for facial feature tracking with 3D TOF cameras. Signals, Circuits and
Systems, 2007. ISSCS 2007. International Symposium on, 1:1–4, July
2007.
control of a reaching task for the visually impaired. In Systems, Man
and Cybernetics, 2007. ISIC. IEEE International Conference on, pages
894–901. IEEE, 2007.
[37] Kanav Kahol, Jamieson French, Laura Bratton, and Sethuraman Pan-
chanathan. Learning and perceiving colors haptically. In Proceedings
of the 8th international ACM SIGACCESS conference on Computers
and accessibility, Assets ’06, pages 173–180, New York, NY, USA, 2006.
ACM. http://doi.acm.org/10.1145/1168987.1169017.
[40] Yoshihiro Kawai and Fumiaki Tomita. Interactive tactile display sys-
tem: a support system for the visually disabled to recognize 3d objects.
In Assets ’96: Proceedings of the second annual ACM conference on
Assistive technologies, pages 45–50, New York, NY, USA, 1996. ACM.
visual direction cues with a handheld test platform. Haptics, IEEE
Transactions on, 5(1):33 –38, 2012.
[45] Jonathan Lazar, Aaron Allen, Jason Kleinman, and Chris Malarkey.
What frustrates screen reader users on the web: A study of 100
blind users. International Journal of Human-Computer Interac-
tion, 22(3):247–269, 2007. http://www.tandfonline.com/doi/abs/10.1080/10447310709336964.
[50] Kai Wun Lin, Tak Kit Lau, Chi Ming Cheuk, and Yunhui Liu. A wear-
able stereo vision system for visually impaired. In Mechatronics and
Automation (ICMA), 2012 International Conference on, pages 1423 –
1428, 2012.
[51] Jihong Liu and Xiaoye Sun. A survey of vision aids for the blind. In In-
telligent Control and Automation, 2006. WCICA 2006. The Sixth World
Congress on, volume 1, pages 4312 –4316, 2006.
[52] Jin Liu, Jingbo Liu, Luqiang Xu, and Weidong Jin. Electronic travel
aids for the blind based on sensory substitution. In Computer Science
and Education (ICCSE), 2010 5th International Conference on, pages
1328 –1331, 2010.
136
[53] Steve Mann, Jason Huang, Ryan Janzen, Raymond Lo, Valmiki Ram-
persad, Alexander Chen, and Taqveer Doha. Blind navigation with a
wearable range camera and vibrotactile helmet. In Proceedings of the
19th ACM international conference on Multimedia, MM ’11, pages 1325–
1328, New York, NY, USA, 2011. ACM.
[58] Simon Meers and Koren Ward. A vision system for providing 3D percep-
tion of the environment via transcutaneous electro-neural stimulation.
In Information Visualisation, 2004. IV 2004. Proceedings. Eighth Inter-
national Conference on, pages 546–552, 2004.
[59] Simon Meers and Koren Ward. A vision system for providing the blind
with 3D colour perception of the environment. In Proceedings of the
Asia-Pacific Workshop on Visual Information Processing, December
2005.
[64] R. Newman, Y. Matsumoto, S. Rougeaux, and A. Zelinsky. Real-time
stereo tracking for head pose and gaze estimation. In Proceedings. Fourth
IEEE International Conference on Automatic Face and Gesture Recog-
nition, pages 122–128, 2000.
[73] R. Samra, D. Wang, and M. H. Zadeh. On texture perception in a haptic-
enabled virtual environment. In Haptic Audio-Visual Environments and
Games (HAVE), 2010 IEEE International Symposium on, pages 1 –4,
2010.
[80] Calle Sjöström. Designing haptic computer interfaces for blind people.
In Proceedings of ISSPA 2001, pages 1–4, 2001.
[81] Mu-Chun Su, Yi-Zeng Hsieh, and Yu-Xiang Zhao. A simple approach
to stereo matching and its application in developing a travel aid for the
blind. In JCIS. Atlantis Press, 2006. http://dblp.uni-trier.de/db/conf/jcis/jcis2006.html#SuHZ06.
[82] Hui Tang and D. J. Beebe. An oral tactile interface for blind navigation.
Neural Systems and Rehabilitation Engineering, IEEE Transactions on,
14(1):116 –123, 2006.
[83] YingLi Tian, Xiaodong Yang, Chucai Yi, and Aries Arditi. Toward a
computer vision-based wayfinding aid for blind persons to access unfa-
miliar indoor environments. Machine Vision and Applications, pages 1–15, 2012. doi:10.1007/s00138-012-0431-7. http://dx.doi.org/10.1007/s00138-012-0431-7.
[84] R. Velázquez. Wearable assistive devices for the blind. Wearable and
Autonomous Biomedical Devices and Systems for Smart Environment,
pages 331–349, 2010.
[87] S. Willis and S. Helal. Rfid information grid for blind navigation and
wayfinding. In Wearable Computers, 2005. Proceedings. Ninth IEEE
International Symposium on, pages 34 – 37, 2005.
[89] J. Wyatt and J. Rizzo. Ocular implants for the blind. Spectrum, IEEE,
33(5):47 –53, May 1996.
[91] L. Yin and A. Basu. Nose shape estimation and tracking for model-
based coding. Acoustics, Speech, and Signal Processing, 2001. Proceed-
ings.(ICASSP’01). 2001 IEEE International Conference on, 3, 2001.
[92] Z. Zhu and Q. Ji. Real time 3d face pose tracking from an uncalibrated
camera. In First IEEE Workshop on Face Processing in Video, in con-
junction with IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR’04), Washington DC, June 2004.