Distance Perception in Interactive Virtual Acoustic Environments Using First and Higher Order Ambisonic Sound Fields

This paper investigates the perception of source distance in interactive virtual auditory environments using First and Higher Order Ambisonic sound fields. It assesses the performance of different Ambisonic orders in representing distance cues through subjective audio perception tests, concluding that 1st order sound fields can adequately represent distance cues for Ambisonic-to-binaural decodes. The study highlights the importance of incorporating Head Related Transfer Functions and head-tracking for improved spatialization in headphone reproduction.

ACTA ACUSTICA UNITED WITH ACUSTICA

Vol. 98 (2012) 61 – 71
DOI 10.3813/AAA.918492

Distance Perception in Interactive Virtual Acoustic Environments using First and Higher Order Ambisonic Sound Fields

Gavin Kearney 1), Marcin Gorzel 2), Henry Rice 3), Frank Boland 2)

1) Department of Theatre, Film and Television, University of York, United Kingdom. [email protected]
2) Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland. [gorzelm, fboland]@tcd.ie
3) Department of Mechanical and Manufacturing Engineering, Trinity College Dublin, Ireland. [email protected]

Summary
In this paper, we present an investigation into the perception of source distance in interactive virtual auditory
environments in the context of First (FOA) and Higher Order Ambisonic (HOA) reproduction. In particular, we
investigate the accuracy of sound field reproduction over virtual loudspeakers (headphone reproduction) with
increasing Ambisonic order. Performance of 1st, 2nd and 3rd order Ambisonics in representing distance cues is
assessed in subjective audio perception tests. Results demonstrate that 1st order sound fields can be sufficient in
representing distance cues for Ambisonic-to-binaural decodes.
PACS no. 43.20.-f, 43.55.-n, 43.58.-e, 43.60.-c, 43.71.-k, 43.75.-z

1. Introduction

Recent advances in interactive entertainment technology have led to visual displays with a convincing perception of source distance, based not only on stereo vision techniques, but also on real time graphics rendering technology for correct motion parallax [1, 2].

Typically, such presentations are accompanied by loudspeaker surround technology based on amplitude panning techniques and aimed at multiple listeners. However, in interactive virtual environments, headphone listening allows for greater control over personalized sound field reproduction. One method of auditory spatialization is to incorporate Head Related Transfer Functions (HRTFs) into the headphone reproduction signals. HRTFs describe the interaction of a listener’s head and pinnae on impinging source wavefronts. It has been shown that for effective externalization and localization to occur, head-tracking should be employed to control this spatialization process [3], particularly where non-individualised HRTFs are used. However, the switching of the directionally dependent HRTFs with head movement can lead to auditory artifacts caused by wave discontinuity in the convolved binaural signals [4]. A more flexible solution is to form ‘virtual loudspeakers’ from HRTFs, where the listener is placed at the centre of an imaginary loudspeaker array. Here, the loudspeaker feeds are changed relative to the head position and any technique for sound source spatialization over loudspeakers can be used. Many different spatialization systems have been proposed for such application in the literature, most notably Vector Based Amplitude Panning (VBAP) [5] and Wavefield Synthesis [6]. However, the Ambisonics system [7], which is based on the spherical harmonic decomposition of the sound field, represents a practical and asymptotically holographic approach to spatialization. It is well known in Ambisonic loudspeaker reproduction, that as the order of sound field representation gets higher, the localization accuracy increases due to greater directional resolution.

However, there are many unanswered questions of the capability of Ambisonic techniques with regard to the perception of depth and distance. In this paper, we want to investigate whether enhanced directional accuracy of direct sound and early reflections in a sound field can possibly lead to better perception of environmental depth and thus better localization of the sound source distance in this environment. We approach the problem by means of subjective listening tests in which we compare the perception of distance of real sound sources to the First Order Ambisonic (FOA) and Higher Order Ambisonic (HOA) sound fields presented over headphones.

This paper is outlined as follows: We will begin by presenting a succinct review of the relevant psychoacoustical aspects of auditory localization and distance perception. We will then outline the incorporation of Ambisonic techniques to virtual loudspeaker reproduction and subsequent re-synthesis of measured FOA sound fields into higher orders. A case study investigating the perception of source distance at higher Ambisonic orders is then presented through subjective listening tests.

Received 25 February 2011, accepted 1 October 2011.

© S. Hirzel Verlag · EAA


2. Distance Perception

It is important to note that throughout the literature there exists a clear distinction between ‘distance’ and ‘depth’, both understood as perceptual attributes of sound. According to [8], ‘distance’ is related to the physical range between the sound source and a listener, whereas ‘depth’ relates to the recreated auditory scene as a whole and concerns a sense of perspective in that scene.

2.1. Distance Perception in a Free Field

Although the human ability to perceive sources at different distances is not fully understood, several key factors are known to contribute to distance perception. In the first case, changes in distance lead to changes in the monaural transfer function (the sound pressure at one ear). This is shown in Figure 1 for a spherical model of a head. We see that for sources at less than 1 m distance, the sound pressure level varies depending on the angle of incidence, due to the shadowing effects of the head. Beyond 1 m, the intensity of the source decays according to the inverse square law.

Figure 1. RMS monaural transfer function for a spherical head model at the left ear for broadband source at different angles with varying source distance (reference = plane wave at (0°, 0°)). [Axes: source distance (m) against RMS monaural transfer function (dB); curves for 0°, 30°, 60° and 90°.]
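The free-field decay beyond 1 m can be illustrated with a short calculation. The following is a minimal sketch, assuming an ideal point source in a free field; the function name `level_change_db` and the 1 m reference distance are illustrative choices, not taken from the paper.

```python
import math

def level_change_db(distance_m, reference_m=1.0):
    """Free-field level relative to the reference distance, in dB.

    Intensity falls with 1/r^2, so the level changes by
    -20 * log10(r / r_ref), roughly -6 dB per doubling of distance.
    """
    return -20.0 * math.log10(distance_m / reference_m)

# Levels at the distances later used in the listening test (2, 4, 6, 8 m).
for d in (1.0, 2.0, 4.0, 6.0, 8.0):
    print(f"{d:.0f} m: {level_change_db(d):+.1f} dB re 1 m")
```

Under these assumptions each doubling of distance lowers the level by about 6 dB, which is the intensity cue the text refers to. The near-field behaviour below 1 m is not modelled here.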
However, absolute monaural cues will only be meaningful if we have some prior knowledge of the source level, i.e., how familiar we are with the source. In other words, a form of semiosis occurs, where the perception of localization is based on anticipation and experience [9]. For example, for normal level speech (approximately 60 dB at 1 m), we expect nearer sources to be louder and more distant sources to be quieter. However, this is more difficult to assess for synthetic sounds or sounds that we are unfamiliar with.

It is interesting to note that for sources in the median plane, the level at distances less than 1 m does not change as dramatically as for sources located at the ipsilateral point. This will not significantly affect the low frequency Interaural Time Difference (ITD), but it is reflected in the Interaural Level Difference (ILD) as shown in Figure 2. We note that the most extreme ILD is exhibited at the side of the head (90°), due to the maximum head shadowing effect. For a similar reason, subconscious head movements may be regarded as another important cue since level changes close to the source will be more apparent than far from it [10]. Thus, near-field ILD cues exist, which aid us in discriminating source distance.

Figure 2. Interaural level difference of spherical head model for broadband source at different angles with varying source distance. [Axes: source distance (m) against ILD (dB); curves for 0°, 30°, 60° and 90°.]

On the other hand, for larger distances and high sound pressure levels, the propagation speed of a sound wave in a medium ceases to be constant with frequency, which may lead to distortion of the waveform [11]. Furthermore, sound waves travelling a substantial distance also undergo a process of energy absorption by water molecules in the atmosphere. This is more apparent for high-frequency energy of the wave and leads to spectral changes (low-pass filtering) of the sound being heard.

2.2. Distance Perception in a Reverberant Field

In reverberant rooms, the ratio of the direct to reverberant sound plays an extremely important role in distance perception. For near sources, where the direct field energy is much greater than the reverberant field, the sound pressure level approximately changes in accordance with the free-field conditions. However, for source-listener distances greater than the critical distance, the level of reverberation is in general independent of the source position due to the homogeneous level of the diffuse field, and the direct to reverberant ratio changes by approximately 6 dB per doubling of distance from the source.

The directions of arrival of the early reflections are another parameter that changes according to the source-listener position and can be regarded as an important factor in creating environmental depth. Whether it is useful to the listeners in determining the distance to the sound source in the presence of other cues like sound intensity, direct to reverberant energy ratio or the arrival pattern of delays, remains an open question that needs to be addressed. Ambisonics allows for enhanced directional reproduction of deterministic components of a sound field by increasing
the order of spherical harmonic decomposition. However, better directional localization can be achieved without affecting other important cues for distance estimation like overall sound intensity or direct to reverberant energy ratio. Thus it can constitute an ideal framework for testing whether less apparent properties of a sound field can influence the perception of distance.

2.3. Former Psychoacoustical Studies on Distance Perception

The perception of distance has been shown to be one that is not linearly proportional to the source distance. For example, both Nielson et al. [12] and Gardner [13] have shown that the distance of speech signals is consistently underestimated in an anechoic environment. This underestimation has also been shown by other authors in the context of reverberant environments, both real and virtual. In [14], Bronkhorst et al. demonstrate that in a damped virtual environment, sources are consistently perceived to be closer than in a reverberant virtual environment, due to the direct to reverberant ratio. In their studies, the room simulation is conducted using simulated Binaural Room Impulse Responses (BRIRs) created from the image source method [15]. They show how perceived distance increases rapidly with the number and amplitude of the reflections.

In a similar study, Rychtarikova et al. [16] investigated the difference in localization accuracy between real rooms and computationally derived BRIRs. Their findings show that at 1 m, localization accuracy in both the virtual and real environments is in good agreement with the true source position. However, at 2.4 m, the accuracy degrades, and high frequency localization errors were found in the virtual acoustic pertaining to the difference in HRTFs between the model and the subject. In the same vein, Chan et al. [17] have shown that distance perception using recordings made from in-ear microphones on individual subjects again leads to underestimation of the source distance in virtual reverberant environments, more so than with real sources.

Waller [18] and Ashmead et al. [10] have identified that one of the factors improving distance perception is the listener movement in the virtual or real space. It is therefore crucial to account for any listener’s movements (or lack thereof) in the experimental design.

Similarly, for headphone reproduction of virtual acoustic environments, small, subconscious head rotations may lead to improvements in distance perception by providing enhanced ILD and ITD cues. Therefore, the sound field transformations should reflect well the small changes of orientation of the listener’s head.

3. Ambisonic Spatialization

Ambisonics was originally developed by Gerzon, Barton and Fellgett [7] as a unified system for the recording, reproduction and transmission of surround sound. The theory of Ambisonics is based on the decomposition of the sound field measured at a single point in space into spherical harmonic functions defined as

$$ Y_{mn}^{\sigma}(\Phi, \Theta) = A_{mn} P_{mn}(\sin\Theta) \cdot \begin{cases} \cos(m\Phi) & \text{if } \sigma = +1 \\ \sin(m\Phi) & \text{if } \sigma = -1, \end{cases} \tag{1} $$

where m is the order and n is the degree of the spherical harmonic and $P_{mn}$ is the fully normalized (N3D) associated Legendre function. The coordinate system used comprises x, y and z axes pointing to the front, left and up respectively, Φ is the azimuthal angle with the clockwise rotation and Θ is the elevation angle from the x-y plane. For each order m there are (2m + 1) spherical harmonics.

In order for plane wave representation over a loudspeaker array we must ensure that

$$ s\, Y_{mn}^{\sigma}(\Phi, \Theta) = \sum_{i=1}^{I} g_i\, Y_{mn}^{\sigma}(\phi_i, \theta_i), \tag{2} $$

where s is the pressure of the source signal from direction (Φ, Θ) and $g_i$ is the ith loudspeaker gain from direction $(\phi_i, \theta_i)$. We can then express the left hand side of equation (2) in vector notation, giving the Ambisonic channels

$$ \mathbf{B} = \mathbf{Y}_{\Phi\Theta}\, s = \left[ Y_{0,0}^{1}(\Phi, \Theta),\; Y_{1,0}^{1}(\Phi, \Theta),\; \ldots,\; Y_{mm}^{\sigma}(\Phi, \Theta) \right]^{T} s. \tag{3} $$

Equation (2) can then be rewritten as

$$ \mathbf{B} = \mathbf{C} \cdot \mathbf{g}, \tag{4} $$

where C are the encoding gains associated with the loudspeaker positions and g is the loudspeaker signal vector. In order to obtain g, we require a decode matrix, D, which is the inverse of C. However, to invert C we need the matrix to be square, which is only possible when the number of Ambisonic channels is equal to the number of loudspeakers. When the number of loudspeaker channels is greater than the number of Ambisonic channels, which is usually the case, we then obtain the pseudo-inverse of C where

$$ \mathbf{D} = \operatorname{pinv}(\mathbf{C}) = \mathbf{C}^{T} (\mathbf{C}\mathbf{C}^{T})^{-1}. \tag{5} $$

Since the sound field is represented by a spherical coordinate system, sound field transformation matrices can be used to rotate, tilt and tumble the sound fields. In this way, the Ambisonic signals themselves can be controlled by the user, allowing for the virtual loudspeaker approach to be employed. For 3-D reproduction, the number I of virtual loudspeakers employed with the Ambisonics approach is dependent on the Ambisonic order m, where

$$ I \geq N = (m + 1)^2. \tag{6} $$
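The pseudo-inverse decode of equations (4) to (6) can be sketched for the first order case. The example below is a minimal illustration, not the authors' implementation. It assumes a cube of eight virtual loudspeakers, N3D first order gains written in Cartesian form (1 and √3 times the direction cosines), a W, X, Y, Z channel order, and hypothetical helper names (`encode_foa_n3d`, `matmul`, `invert`).

```python
import math

def encode_foa_n3d(x, y, z):
    """First order N3D encoding gains [W, X, Y, Z] for a unit direction vector."""
    s3 = math.sqrt(3.0)
    return [1.0, s3 * x, s3 * y, s3 * z]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def invert(a):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Assumed layout: the eight vertices of a cube, as unit vectors.
k = 1.0 / math.sqrt(3.0)
speakers = [(sx * k, sy * k, sz * k)
            for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]

order = 1
n_channels = (order + 1) ** 2           # Eq. (6): N = (m + 1)^2
assert len(speakers) >= n_channels      # Eq. (6): I >= N

# C holds one column of encoding gains per loudspeaker (N x I).
C = transpose([encode_foa_n3d(*v) for v in speakers])
Ct = transpose(C)
D = matmul(Ct, invert(matmul(C, Ct)))   # Eq. (5): D = C^T (C C^T)^-1, I x N

# Loudspeaker feeds for a unit plane wave arriving from straight ahead (+x).
B = [[gain] for gain in encode_foa_n3d(1.0, 0.0, 0.0)]
feeds = [row[0] for row in matmul(D, B)]
print([round(f, 4) for f in feeds])
```

With this layout the four front loudspeakers receive larger feeds than the four rear ones, and C·D reproduces the identity, so the decoded feeds re-encode to the original Ambisonic channels.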

4. Virtual Loudspeaker Reproduction

In the ‘virtual loudspeaker’ approach, HRTFs are measured at the ‘sweet-spot’ (the limited region in the centre of a reproduction array where an adequate spatial impression is generally guaranteed) in a multi-loudspeaker reproduction setup, and the resultant binaural playback is formed from the convolution of the loudspeaker feeds with the virtual loudspeakers. This concept is illustrated in Figure 3. For the left ear we have

$$ L = \sum_{i=1}^{I} h_{Li} * q_i, \tag{7} $$

where ∗ denotes convolution, $h_{Li}$ is the left ear HRIR corresponding to the ith virtual loudspeaker and $q_i$ is the ith loudspeaker feed. Similar relations apply for the right ear signal. This method was first introduced by McKeag and McGrath [19] and examples of its adoption can be found in [20] and [21]. This approach has major computational advantages, since a complex filter kernel is not required and head rotation can be simulated by changing the loudspeaker feeds $q_i$ as opposed to the HRTFs. Whilst the HRTFs in this case play an important role in the spatialization, ultimately it is the sound field creation over the virtual loudspeakers which gives the overall spatial impression. Most existing research uses a block frequency domain approach to this convolution. However, given that the virtual loudspeaker feeds are controlled via head-tracking in real-time, a time-domain filtering approach can also be utilized. For short filter lengths, obtaining the output in a point wise manner avoids the inherent latencies introduced by block convolution in the frequency domain. A strategy for significant reduction of the filter length without artifacts has been proposed in [22].

Figure 3. The virtual loudspeaker reproduction concept.

5. Higher Order Synthesis

In order to compare the distance perception of different orders of Ambisonic sound fields, it is desirable to take real world sound field measurements. However, the formation of higher order spherical harmonic directional patterns is non-trivial. Thus, in order for us to change FOA impulse responses to HOA representations, we will employ a perceptual based approach which will allow us to synthesize the increased directional resolution that would be achieved with a HOA sound field recording. For this we adopt the directional analysis method of Pulkki and Merimaa, found in [23]. Here the B-format signals are analyzed in terms of sound intensity and energy in order to derive time-frequency based direction of arrival and diffuseness. The instantaneous intensity vector is given from the pressure p and particle velocity u as

$$ \mathbf{I}(t) = p(t)\, \mathbf{u}(t). \tag{8} $$

Since we are using FOA impulse response measurements, the pressure can be approximated by the 0th order Ambisonics component w(t), which is omnidirectional,

$$ p(t) = w(t), \tag{9} $$

and the particle velocity by

$$ \mathbf{u}(t) = \frac{1}{\sqrt{2}\, Z_0} \left[ x(t)\, \mathbf{e}_x + y(t)\, \mathbf{e}_y + z(t)\, \mathbf{e}_z \right], \tag{10} $$

where $\mathbf{e}_x$, $\mathbf{e}_y$ and $\mathbf{e}_z$ represent Cartesian unit vectors, x(t), y(t), z(t) are the FOA signals and $Z_0$ is the characteristic acoustic impedance of air.

The instantaneous intensity represents the direction of the energy transfer of the sound field and the direction of arrival can be determined simply by the opposite direction of I. For FOA, we can calculate the intensity for each coordinate axis, and in the frequency domain. Since a portion of the energy will also oscillate locally, a diffuseness estimate can be made from the ratio of the magnitude of the intensity vector to the overall energy density E, given as

$$ \psi = 1 - \frac{\lVert \langle \mathbf{I} \rangle \rVert}{c\, \langle E \rangle}, \tag{11} $$

where $\langle \cdot \rangle$ denotes time averaging, $\lVert \cdot \rVert$ denotes the norm of the vector and c is the speed of sound. The diffuseness estimate will yield a value of zero for incident plane waves from a particular direction, but will give a value of 1 where there is no net transport of acoustic energy, such as in the cases of reverberation or standing waves. Time averaging is used since it is difficult to determine an instantaneous measure of diffuseness.
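Equations (8) to (11) can be sketched in the time domain as follows. This is a minimal, broadband illustration under stated assumptions, not the paper's frequency-domain analysis. The energy density is not defined in the text, so the common form E = (ρ0/2)(p²/Z0² + ‖u‖²) is assumed here, as are the values of ρ0 and c and the function name `diffuseness`.

```python
import math

RHO0 = 1.2             # assumed density of air, kg/m^3
C_SOUND = 343.0        # assumed speed of sound, m/s
Z0 = RHO0 * C_SOUND    # characteristic acoustic impedance of air

def diffuseness(w, x, y, z):
    """Diffuseness estimate of Eq. (11) from equal-length FOA sample lists."""
    n = len(w)
    scale = 1.0 / (math.sqrt(2.0) * Z0)
    mean_i = [0.0, 0.0, 0.0]
    mean_e = 0.0
    for wt, xt, yt, zt in zip(w, x, y, z):
        u = (scale * xt, scale * yt, scale * zt)       # Eq. (10)
        for axis in range(3):
            mean_i[axis] += wt * u[axis] / n           # Eqs. (8) and (9)
        # Assumed energy density: E = rho0 / 2 * (p^2 / Z0^2 + |u|^2)
        u_sq = sum(comp * comp for comp in u)
        mean_e += 0.5 * RHO0 * (wt * wt / Z0 ** 2 + u_sq) / n
    norm_i = math.sqrt(sum(comp * comp for comp in mean_i))
    return 1.0 - norm_i / (C_SOUND * mean_e)

# Plane wave from the front in traditional B-format scaling (W = s / sqrt(2)).
s = [math.sin(0.1 * k) for k in range(1, 200)]
w = [v / math.sqrt(2.0) for v in s]
zeros = [0.0] * len(s)
print("plane wave:", diffuseness(w, s, zeros, zeros))

# Velocity uncorrelated with pressure: no net energy transport.
print("no net transport:", diffuseness([1.0] * 4, [1.0, -1.0, 1.0, -1.0],
                                       [0.0] * 4, [0.0] * 4))
```

Under these assumptions the first case gives a value close to 0 and the second gives 1, matching the limiting behaviour described for equation (11).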

The output of the analysis is then subject to smoothing based on the Equivalent Rectangular Bandwidth (ERB) scale, such that the resolution of the human auditory system is approximated. Since the frequency dependent direction of arrival of the non-diffuse portion of the sound field can be determined, HOA reproduction can be achieved by re-encoding point like sources corresponding to the direction indicated in each temporal average and frequency band into a higher order spherical harmonic representation. The resultant Ambisonic signals are then weighted in each frequency band k according to $1 - \psi_k$. However, it is only vital to re-encode non-diffuse components to higher order and the diffuse field can be obtained by multiplying the FOA signals by $\psi_k$ and forming a first order decode. This is justified since source localisation is dependent on the direction of arrival of the direct sound and early reflections and not on late room reverberation [24]. Thus, from the perceptual point of view, it is questionable whether there is a need to preserve the full directional accuracy of the reverberant field. Furthermore, if there exists a general directional distribution to the diffuse field, this will still be preserved in first order form. On the other hand, the diffuse component should not be simply derived from the 0th order signal. One can easily see that such a solution would provide perfectly correlated versions of the diffuse field to the left and right ear signals, which have no equivalent in a real, physical sound field. Moreover, interaural decorrelation is an important factor in providing spatial impression in enclosed environments [25].
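The weighting scheme described above can be sketched for a single time-frequency sample. This is an illustrative sketch only: it re-encodes to 2nd order rather than 3rd to stay short, uses N3D gains in Cartesian form, and assumes a channel order and function names (`encode_2nd_order_n3d`, `upmix_band`) that do not come from the paper.

```python
import math

def encode_2nd_order_n3d(x, y, z):
    """N3D encoding gains up to 2nd order for a unit direction vector.

    Channel order is an assumption: W, three first order channels,
    then five second order channels.
    """
    s3, s5, s15 = math.sqrt(3.0), math.sqrt(5.0), math.sqrt(15.0)
    return [
        1.0,
        s3 * x, s3 * y, s3 * z,
        s15 * x * y, s15 * y * z, 0.5 * s5 * (3.0 * z * z - 1.0),
        s15 * x * z, 0.5 * s15 * (x * x - y * y),
    ]

def upmix_band(pressure, doa, foa, psi_k):
    """Split one band sample into non-diffuse HOA and diffuse FOA parts.

    pressure: omnidirectional band sample, doa: unit direction vector,
    foa: the four first order band samples, psi_k: diffuseness in band k.
    """
    gains = encode_2nd_order_n3d(*doa)
    non_diffuse = [(1.0 - psi_k) * pressure * g for g in gains]  # weight 1 - psi_k
    diffuse = [psi_k * v for v in foa]                           # weight psi_k
    return non_diffuse, diffuse

hoa, diff = upmix_band(1.0, (1.0, 0.0, 0.0), [1.0, 1.7, 0.0, 0.0], 0.25)
print(len(hoa), len(diff))
```

A fully diffuse band (ψ_k = 1) contributes nothing to the higher order channels, while a band with ψ_k = 0 is carried entirely by the re-encoded point source.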
Figure 4 shows an example of the first 20 ms of a 1st order impulse response taken in a reverberant hall [26]. Here the source was located 3 m from a Soundfield ST350 microphone, and the Spatial Room Impulse Response (SRIR) was captured using the exponentially swept-sine tone technique [27]. In these plots, particular attention is drawn to the direct sound (coming from directly in front of the microphone) and a left wall reflection at approximately 14 ms. It can be seen that the directional resolution increases significantly with HOA representation. It should be noted that the A-format capsules on sound field microphones only display adequate directionality up to 10 kHz [28]. Spatial aliasing is therefore an issue for high frequencies and as a result, the directional information above 10 kHz cannot be relied upon.

Figure 4. Ambisonic sound field from 1st order measurement with a Soundfield ST350: (a) 1st order representation, (b) 3rd order up-mix. [Both panels mark the direct sound and the left wall reflection.]

6. Method: Localization of Distance of Test Sounds

Different protocols have been used in literature for subjective assessment of distance perception, most notably a verbal report [29, 30], direct or indirect blind walking [31, 32] or imagined timed walking [32]. All of these methods have proved to provide reliable and comparable results for both auditory and visual stimuli, with direct blind walking exhibiting the least between-subject variability [31, 32].

In former work [26], authors of this paper developed a method where subjects indicated the perceived distance of real and virtual sound sources by selecting one of several physical loudspeakers lined up (and slightly offset in order to provide ‘acoustic transparency’) in front of their eyes. However, for the present study, in order to completely eliminate any possible anchors as well as visual cues, it was decided to utilize the method of direct blind walking. One of the main concerns in the experiment was a direct comparison of distance perception of real sound sources versus virtual sound sources presented over headphones. Due to different apparatus requirements, the experiment had to be conducted in two separate phases.

6.1. Participants

Seven participants aged 24–58 took part in the experiment. All subjects were of good hearing and were either music technology students or practitioners actively involved in audio research or production. Prior to the test, HRIR data
for all the participants had been obtained in a sound-proof, large (18 × 15 × 10 m³) but quite damped (T60 @ 1000 Hz = 0.57 s) multipurpose room (Black Box) in the Department of Theatre, Film and Television at the University of York. Additional damping was assured by thick, heavy curtains covering all four walls and a carpet on the floor. The measurement process consisted of a standard procedure where miniature, omnidirectional microphones (Knowles FG-23629-P16) were placed at the entrance of a blocked ear canal in order to capture acoustic pressure generated by one loudspeaker at a time located at constant distance and varying angular direction.

Subjects were seated on an elevated platform so that their ears were 2.20 m above the ground and their head was in the centre of a spherical loudspeaker array, arranged in diametrically opposed pairs. The ear height was calibrated using a laser guide, as shown in Figure 5. The array consisted of 16 full range Genelec 8050A loudspeakers since the intention was to reproduce Ambisonic sound fields up to and including 3rd order. This 3-D setup, shown in Figure 6, comprised a flat-front, horizontal octagon and a cube (four loudspeakers on top, and four on the bottom). The radius of the loudspeaker array (and thus the virtual loudspeaker array) was 3.27 m. For FOA-to-binaural decode, only virtual loudspeakers from the cube configuration were utilized, since no directional resolution is gained by using a higher number of loudspeakers. Furthermore, despite careful alignment, oversampling of the sound field with higher numbers of speakers has the potential to yield sound field distortions [33]. Note that for 2nd and 3rd order reproduction, all 16 loudspeakers were used. Although the oversampled configuration was not optimal from the 2nd order reproduction point of view, it was not possible to easily and accurately rearrange the loudspeaker array in order to accommodate a different layout.

Figure 5. Measuring Head Related Impulse Responses with miniature microphones.

Figure 6. Array of 16 loudspeakers used for HRIR measurements.

HRIRs were captured using the exponentially swept-sine tone technique [27] at 44.1 kHz sampling rate and 16-bit resolution. Since the measurement environment was not fully anechoic, further processing of the measured data was necessary. The HRIRs were tapered before the arrival of the first reflection (from the floor) yielding filter kernels with 257 taps and were subsequently diffuse-field equalized.

6.2. Stimuli

The stimuli used in the experiment were pink noise bursts and phonetically balanced phrases selected from the TIMIT Acoustic-Phonetic Continuous Speech Corpus database and recorded by a female reader [34]. A sampling rate of 44.1 kHz and 16 bit resolution was used in both cases. These two sample types were selected in order to represent both unfamiliar and familiar sound sources. They were presented to the subjects in a pseudo-randomized manner to avoid any ordering effects.

For headphone reproduction, prior to the test phase, FOA impulse response measurements were taken from the listener position of each loudspeaker using the exponentially swept-sine tone technique [27]. From these measurements, 2nd and 3rd order impulse response sets were extracted using the directional analysis approach outlined in section 5. 0th order Ambisonics does not provide any directional information which means that it would lack the cues that are investigated in the higher order renderings. Therefore, it was decided not to include it in this comparison.

The only psychoacoustical optimization applied to the Ambisonics decodes was shelf filtering, which was intended to satisfy Gerzon’s localization criteria for maximized velocity decode at low frequencies and energy decode at higher frequencies [35]. This involved changing the ratio
of the pressure to velocity components at low and high frequencies. Whilst the crossover frequency for the high frequency boost in the pressure channel at first order is normally in the region of 400 Hz for regular loudspeaker listening, here, we restore the crossover point to 700 Hz, since the subject is always perfectly centred in the virtual loudspeaker array.

6.3. Test Environment and Apparatus

A series of subjective listening tests was conducted in the Large Rehearsal Room in the Department of Theatre, Film and Television in the University of York. The room dimensions were 12 × 9 × 3.5 m³ and the spatially averaged T60 at 1 kHz was 0.26 s. A low T60 was desired for this study, so the walls were covered with thick, heavy curtains, as shown in Figure 7. Since the up-mix from 1st to 2nd and 3rd order Ambisonics concerned only the deterministic part of the measured SRIRs, it was assumed that no advantage would be gained from using a more reverberant space.

A professional camera dolly track was set up roughly in the direction of the diagonal of the room. It not only allowed for testing distances of the real loudspeaker up to 8 m but its non-symmetrical position also assured that early reflections of the same order from different surfaces did not easily coincide at the subject’s ears, but instead arrived at different times. A single full-range loudspeaker (Genelec 8050A) was mounted on a camera dolly which enabled it to be noiselessly translated by the experiment assistant to different locations. A guiding rope was hung along the dolly track, which was intended to help and guide the participants when walking toward the sound source. Since it was not possible to walk exactly on the dolly track, it was decided that the walking path would be directly next to it, as shown in Figure 7. The only weakness of this solution was that the sound source horizontal angle varied from 14.04 degrees at the closest distance (2 m) to 3.58 degrees at the furthest distance (8 m). However, this did not have any effect on the distance judgments for two reasons: Firstly, the subjects were allowed (or even encouraged) to rotate their head in order to fully utilize the available ITD and ILD cues. Secondly, the initial head orientation was not in any way fixed. This, combined with the fact that there were no clear cues to the subject’s initial orientation in the room at the origin, made this small initial angular offset unimportant. Furthermore, none of the participants reported any bias in their assessment based on the horizontal offset of the sound source.

For trials with binaural presentation, high quality open back headphones (AKG-K601) were used, which exhibit low levels of interaural magnitude and group delay distortion. Sound field rotation, tilt and tumble control was implemented via the TrackIR 5 infra-red head tracking system [36], resulting in stable virtual images with head rotations. The system responsible for playback of virtualized sound sources was completely built in the Pure Data visual programming environment [37] and its combined latency (including head-tracker data porting and audio update rate) was 20 ms.

Figure 7. Participant performing a trial during the experiment.

6.4. Procedure

In the experiment, subjects entered the test environment blindfolded and without any prior expectation regarding the room dimensions, its acoustic properties or the test apparatus. They were guided by the experimenter to the reference point (the ‘origin’). After a short explanation of the experiment objectives, a training session began with a short (3–5 min) walking-only trial until participants felt comfortable with walking blindfolded and using a guide rope. Next, they performed 4–6 training trials in which the same test stimuli to be used in the experiment (speech and pink noise) were played by the loudspeaker at randomly chosen distances. No feedback was given and no results were recorded after each training trial. The end of the training session was clearly announced and after a 1 minute interval, the first phase of the test began.

In test phase I, participants were asked to listen to static sound sources at randomly chosen points, focusing on the perceived distance. They could listen to any audio sample as many times as they wished. During the playback they were instructed to stay still and refrain from any translational head movements. However, they were encouraged to rotate their head freely. After the playback had stopped, they were asked to walk guided by the rope to the point where they thought the sound originated from. The distance walked was subsequently recorded by the assistant using a laser measuring tool, after which the participant walked backwards to the origin. In the meantime, the loudspeaker was noiselessly translated to its new position and
ACTA ACUSTICA UNITED WITH ACUSTICA Kearney et al.: Distance perception
Vol. 98 (2012)

the test proceeded. Similar to the training session, no feed-


back was given at any stage. 9
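The horizontal offset angles quoted for the apparatus (14.04 degrees at 2 m and 3.58 degrees at 8 m) can be checked with simple trigonometry. The sketch below assumes a lateral offset of 0.5 m between the walking path and the dolly track. This value is not stated in the text; it is inferred here only because it reproduces the two reported angles, and the name `LATERAL_OFFSET_M` is purely illustrative.

```python
import math

# Assumed lateral offset between walking path and dolly track, in metres.
# NOT stated in the paper; chosen because it reproduces the reported angles.
LATERAL_OFFSET_M = 0.5


def offset_angle_deg(distance_m: float,
                     lateral_offset_m: float = LATERAL_OFFSET_M) -> float:
    """Horizontal angle of the source as seen from the origin, in degrees.

    distance_m is measured along the walking path, lateral_offset_m
    perpendicular to it.
    """
    return math.degrees(math.atan2(lateral_offset_m, distance_m))


# Angles at the four presentation distances used in the experiment.
for d in (2.0, 4.0, 6.0, 8.0):
    print(f"{d:.0f} m: {offset_angle_deg(d):.2f} degrees")
```

Under this assumption the angle shrinks monotonically with distance, which is why the offset is largest for the nearest source.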
During the first test phase, participants had to indicate the perceived distance for sound sources randomly located at 2 m, 4 m, 6 m or 8 m. Since both speech and pink noise burst samples were used (in a pseudo-random order), the number of trials in the first phase added up to 8. Each subject performed all the trials only once.

Upon completion of the first phase of the test there was a short (approximately 2 minute) interval, which was required in order to put on the headphones and calibrate the head-tracking system. In phase II, subjects were again asked to identify the sound source distance, but this time using Ambisonic sound fields presented over headphones. Other than the use of headphones and the head-tracking system, the test protocol remained the same as in phase I. However, because there were three playback configurations to be tested (1st, 2nd and 3rd order Ambisonics), participants had to perform 24 trials instead of 8. Instead of separate phases for each Ambisonic order, all samples were randomly presented to the subject within the same test phase. Again, subjects performed all the trials only once and no feedback was given at any stage.

Figure 8. Mean localization of real and virtual sound sources (female speech). Axes: real distance [m] against distance walked [m], both 0 to 9 m; series: FOA, SOA, TOA, Real.
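The trial counts given in the procedure (8 trials in phase I, 24 in phase II) follow from crossing the stimuli, distances and playback orders. The sketch below builds both trial lists and shuffles them. The paper does not specify how the pseudo-random order was generated, so the plain shuffle, the fixed seed and all names used here are assumptions for illustration only.

```python
import itertools
import random

DISTANCES_M = (2, 4, 6, 8)           # presentation distances from the paper
STIMULI = ("speech", "pink_noise")   # the two test stimuli
ORDERS = ("FOA", "SOA", "TOA")       # Ambisonic orders used in phase II


def phase_one_trials(rng: random.Random) -> list:
    """Real loudspeaker trials: every stimulus at every distance, once."""
    trials = list(itertools.product(STIMULI, DISTANCES_M))
    rng.shuffle(trials)  # assumed randomization scheme
    return trials


def phase_two_trials(rng: random.Random) -> list:
    """Headphone trials: all orders mixed within one phase, each once."""
    trials = list(itertools.product(ORDERS, STIMULI, DISTANCES_M))
    rng.shuffle(trials)  # assumed randomization scheme
    return trials


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
phase_1 = phase_one_trials(rng)
phase_2 = phase_two_trials(rng)
print(len(phase_1), len(phase_2))
```

Shuffling the full phase II list, rather than shuffling within each order, mirrors the statement that all samples were presented within the same test phase.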
7. Results

The perceived sound source distance (indicated by the distance walked) was collected from 7 subjects for 4 presentation points (2 m, 4 m, 6 m and 8 m), two stimuli (female speech and pink noise bursts) and four playback options: 1st, 2nd and 3rd order Ambisonics and real loudspeakers, which for analysis we will denote FOA, SOA, TOA and REAL respectively. In the headphone trials, none of the participants reported in-head localization; however, there were 3 cases where the proximity of the sound source was so apparent that participants decided not to move at all. In some cases, the virtual sound source was initially localized behind the subjects, but all participants were able to resolve the confusion by applying head rotation.

We computed the mean values of walked distances µ for each test condition along with the corresponding standard errors se(µ). The results are presented separately for each stimulus type with 95% confidence intervals.

As expected, the perception of distance for the real sources was more accurate for near sources. Beyond 4 m, distance was consistently underestimated, which is congruent with the previous studies outlined in section 2. Furthermore, the standard deviation of localization increases as the source moves further into the diffuse field. We also see that unfamiliar stimuli produce greater variability in subjects' answers. The mean localization of the virtual sources follows the reference source localization well. The answers for virtual sources deviate from their means in roughly the same fashion as the answers for reference sources, as localization becomes more difficult within the diffuse field.

Figure 9. Mean localization of real and virtual sound sources (pink noise bursts). Axes: real distance [m] against distance walked [m], both 0 to 9 m; series: FOA, SOA, TOA, Real.

Since the study followed a within-subject factorial design with 2 (stimuli) × 4 (playback conditions), a two-way ANOVA was performed for each presentation distance in order to investigate the effects of these two factors (referred to later as factors A and B) as well as potential interaction effects. The null hypothesis being tested here is that the mean perceived distances for all the stimuli and playback methods do not differ significantly:

$H_0: \mu_{FOA} = \mu_{SOA} = \mu_{TOA} = \mu_{Real} = \mu$,
$H_1$: not all localization means ($\mu_i$) are the same.

No statistically significant effect of stimuli (familiar vs. unfamiliar) on the perception of distance has been found ($F_{2m}(3, 48) = 0.835$, $p = 0.365$; $F_{4m}(3, 48) = 2.0462$, $p = 0.159$; $F_{6m}(3, 48) = 2.575$, $p = 0.115$; $F_{8m}(3, 48) = 2.0462$, $p = 0.159$). For distances of 4 m and more, the playback option also had no statistically significant effect ($F_{4m}(3, 48) = 2.192$, $p = 0.101$; $F_{6m}(3, 48) = 0.665$, $p = 0.577$; $F_{8m}(3, 48) = 0.202$, $p = 0.894$).

However, a statistically significant difference has been detected for the distance of 2 m. In larger study designs with multiple levels it is advisable to use the Honestly Significant Difference (HSD) approach, since there is an increased risk of a spuriously significant difference arising purely by chance. So, in order to investigate further where the difference occurs, an HSD has been computed (HSD = 1.423 m). If we now compile the table of mean perceived distances for the sound sources located at 2 m (Table I), we can see that all of the values lie within a single HSD of each other and cannot be distinguished. We can therefore assume an ANOVA false alarm (type I error) and no statistically significant effect of playback method for the sources at the distance of 2 m either.

Table I. Mean localization [m] of virtual and real sound sources at 2 m.

          µ_FOA    µ_SOA    µ_TOA    µ_Real
Speech    1.119    1.389    0.841    1.638
Noise     0.877    1.001    0.902    1.641

Lastly, for all the distances no interaction effects of factors A (stimuli) and B (playback conditions) have been detected.

Additionally, we calculated correlation coefficients ρ for pairs of distance estimations for real and virtual sound sources (either 1st, 2nd or 3rd order) and the two stimuli. In all cases, high correlation coefficients have been obtained, which confirms our findings that for these particular test conditions, the perception of distance of binaurally rendered Ambisonic sound fields of orders 1 to 3 cannot be distinguished from the perception of distance of the real sound sources.

Table II. Correlation coefficients ρ and corresponding p-values for pairs of distance estimations for real and virtual sound sources (Speech).

               ρ         p-value
Real vs FOA    0.9828    0.0172
Real vs SOA    0.9960    0.0040
Real vs TOA    0.9590    0.0410

Table III. Correlation coefficients ρ and corresponding p-values for pairs of distance estimations for real and virtual sound sources (Noise).

               ρ         p-value
Real vs FOA    0.9913    0.0087
Real vs SOA    0.9857    0.0143
Real vs TOA    0.9972    0.0028

8. Discussion

The results presented for real sources corroborate the classic underestimation of source distance, as reported in the literature. These results were used as a basis with which to measure the ability of Ambisonic sound fields of different orders to present sources at different distances. It was expected that a further underestimation of the source distance would ensue with the binaural rendering, as reported in [17]. However, this was not the case, even for first order presentations, and the apparent distances of the virtual sources matched the real source distances well. One should note that the major difference between this study and that of [17] is our use of head-tracking, indicating the importance of head movements in perceiving source distance, which develops the findings of Waller [18] and Ashmead et al. [10] on user interaction in a virtual space. Further work is required to quantify the effect of this.

Moreover, the presented study demonstrates that the enhanced directional accuracy gained by presenting sound sources in HOA through head-tracked binaural rendering does not yield a significant improvement in the perception of the source distance. What is noteworthy is that for each order, there is no significant difference in the perception of the source location when compared to real-world sources. We therefore conclude that sound field directionality for distance perception is sufficient with 1st order playback.

The presence of the ANOVA false alarm at the 2 m point is of interest. It is noteworthy that the 2 m point represents a source inside the virtual array geometry. It is a known issue that virtual sound sources rendered inside the array of loudspeakers cannot be reproduced in a straightforward way without artifacts. Some of these artifacts include incorrect wavefront curvature and insufficient bass boost.

In the first case, there is ample evidence in the literature to suggest that the wavefront curvature translates to significant binaural cues for sound sources near the head [30, 38]. It was already shown in section 2.1 that as a source moves closer to the head the levels of the monaural transfer function and the ILD both change significantly with source angle. However, this effect is not strong at 1 m and beyond. For sources further away, it has been shown in [39] that it is very difficult to assess distance by binaural cues alone.

In the second case, the requirement for distance compensation filtering due to near field effects for the large loudspeaker radius (3.27 m) and the given source distances (>2 m) is only prominent below 100 Hz. For the female speech test stimuli, this will not have an effect, since the first formant frequencies do not go down below 180 Hz. Also, the current method employed for capturing HRIRs allowed for reliably obtaining filters with a frequency response reaching down to around 170 Hz, thereby also band-limiting the delivery of the pink noise stimuli.

Finally, there was no significant difference in the results presented for different sources, although the greater variance in the results for pink noise suggests that the familiarity of the source does indeed play a role in the perception of source distance, as mentioned in section 2.3. Future studies will investigate the use of these monaural cues further, and will utilize 0th order sound field rendering, since it will remove the influence of any directional information.

Considering the aforementioned study of Bronkhorst et al. [14], where the accuracy of distance perception for binaural playback increases with the number of reflections,


our findings demonstrate that the net effect of the monaural cues of direct to reverberant ratio, level difference and time of arrival of early reflections is of greater importance in distance perception for binaural rendering than Ambisonic directional accuracy beyond 1st order.

9. Conclusions

We have assessed through subjective analysis the perceived source distance in virtual Ambisonic sound fields in comparison to real world sources. The hypothesis tested was that enhanced directional accuracy of the deterministic part of the sound field may lead to better reconstruction of environmental depth and thus improve the perception of sound source distance. However, it was shown that Ambisonic reproduction matches the perceived real world source distances well even at 1st order, and no improvement in this regard was observed when increasing the order. It must be emphasized, though, that this analysis applies to Ambisonic-to-binaural decodes with higher order synthesis achieved using the directional analysis method of [23]. Therefore, further work will examine this topic for loudspeaker reproduction for both centre and off-centre listening, as well as investigate the effectiveness of HOA synthesis in comparison to real world HOA measurements.

Acknowledgments

The authors gratefully acknowledge the participation of the test subjects for both their time and constructive comments, as well as the technical support staff at the Department of Theatre, Film and Television at the University of York for their assistance in the experimental setups. This research is supported by Science Foundation Ireland.

References

[1] L. Fauster: Stereoscopic techniques in computer graphics. Technical paper, TU Wien, 2007.
[2] J. Lee: Head tracking for desktop VR displays using the Wii remote. http://johnnylee.net/projects/wii/, accessed 30th Sept. 2011.
[3] D. R. Begault: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual sound source. J. Audio Eng. Soc. 49 (2001) 904–916.
[4] M. Otani, T. Hirahara: Auditory artifacts due to switching head-related transfer functions of a dynamic virtual auditory display. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E91-A (2008) 1320–1328.
[5] V. Pulkki: Virtual sound source positioning using Vector Base Amplitude Panning. J. Audio Eng. Soc. 45 (1997) 456–466.
[6] A. J. Berkhout: A Holographic Approach to Acoustic Control. J. Audio Eng. Soc. 36 (1988) 977–995.
[7] M. A. Gerzon: Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21 (1973) 2–10.
[8] F. Rumsey: Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm. J. Audio Eng. Soc. 50 (2002) 651–666.
[9] J. Blauert: Communication acoustics. Springer, 2008.
[10] D. H. Ashmead, D. L. Davis, A. Northington: Contribution of listeners' approaching motion to auditory distance perception. J. Exp. Psy: Hum. Percep. and Perform. 21 (1995) 239–256.
[11] E. Czerwinski, A. Voishvillo, S. Alexandrov, A. Terekhov: Propagation distortion in sound systems: Can we avoid it? J. Audio Eng. Soc. 48 (2000) 30–48.
[12] S. H. Nielsen: Auditory distance perception in different rooms. J. Audio Eng. Soc. 41 (1993) 755–770.
[13] M. B. Gardner: Distance estimation of 0° or apparent 0° oriented speech signals in anechoic space. J. Acoust. Soc. Am. 45 (1969) 47–53.
[14] A. W. Bronkhorst, T. Houtgast: Auditory distance perception in rooms. Nature 397 (1999) 517–520.
[15] J. B. Allen, D. A. Berkley: Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65 (1979) 943–950.
[16] M. Rychtarikova, T. V. d. Bogaert, G. Vermeir, J. Wouters: Binaural sound source localization in real and virtual rooms. J. Audio Eng. Soc. 57 (2009) 205–220.
[17] J. S. Chan, C. Maguinness, D. Lisiecka, C. Ennis, M. Larkin, C. O'Sullivan, F. Newell: Comparing audiovisual distance perception in various real and virtual environments. Proc. of the 32nd Euro. Conf. on Vis. Percep., Regensburg, Germany, 2009.
[18] D. Waller: Factors affecting the perception of interobject distances in virtual environments. Presence: Teleoper. Virtual Environ. 8 (1999) 657–670.
[19] A. McKeag, D. McGrath: Sound field format to binaural decoder with head-tracking. Proc. of the 6th Australian Regional Convention of the AES, 1996.
[20] M. Noisternig, A. Sontacchi, T. Musil, R. Höldrich: A 3D Ambisonic based binaural sound reproduction system. Proc. of the 24th Int. Conf. of the Audio Eng. Soc., Alberta, Canada, 2003.
[21] B.-I. Dalenbäck, M. Strömberg: Real time walkthrough auralization - the first year. Proc. of the Inst. of Acous., Copenhagen, Denmark, 2006.
[22] C. Masterson, S. Adams, G. Kearney, F. Boland: A method for head related impulse response simplification. Proc. of the 17th European Signal Processing Conference (EUSIPCO), Glasgow, Scotland, 2009.
[23] J. Merimaa, V. Pulkki: Spatial impulse response rendering I: Analysis and synthesis. J. Audio Eng. Soc. 53 (2005).
[24] W. M. Hartmann: Localization of sound in rooms. J. Acoust. Soc. Am. 74 (1983) 1380–1391.
[25] D. Griesinger: Spatial impression and envelopment in small rooms. Proc. of the 103rd Conv. of the Audio Eng. Soc., New York, USA, 1997.
[26] G. Kearney, M. Gorzel, H. Rice, F. Boland: Depth perception in interactive virtual acoustic environments using higher order ambisonic soundfields. Proc. of the 2nd Int. Ambisonics Symp., Paris, France, 2010.
[27] A. Farina: Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proc. of the 108th Conv. of the Audio Eng. Soc., Paris, France, 2000.
[28] M. Gerzon: The design of precisely coincident microphone arrays for stereo and surround sound. Proc. of the 50th Conv. of the Audio Eng. Soc., London, UK, 1975.


[29] C. Guastavino, B. F. G. Katz: Perceptual evaluation of multi-dimensional spatial audio reproduction. J. Acoust. Soc. Am. 116 (2004) 1105–1115.
[30] P. Zahorik: Assessing auditory distance perception using virtual acoustics. J. Acoust. Soc. Am. 111 (2002) 1832–1846.
[31] J. M. Loomis, R. L. Klatzky, J. W. Philbeck, R. G. Golledge: Assessing auditory distance perception using perceptually directed action. Perception And Psychophysics 60 (1998) 966–980.
[32] T. Y. Grechkin, T. D. Nguyen, J. M. Plumert, J. F. Cremer, J. K. Kearney: How does presentation method and measurement protocol affect distance estimation in real and virtual environments? ACM Trans. Appl. Percept. 7 (2010) 26:1–26:18.
[33] S. Bertet: Formats audio 3d hiérarchiques: Caractérisation objective et perceptive des systèmes ambisonics d'ordres supérieurs. Ph.D. dissertation, INSA Lyon, 2008.
[34] W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall: The DARPA speech recognition research database: Specifications and status. Proc. of the DARPA Workshop on Speech Recognition, 1986.
[35] M. A. Gerzon, G. J. Barton: Ambisonic decoders for HDTV. Proc. of the 92nd Conv. of the Audio Eng. Soc., Vienna, Austria, 1992.
[36] NaturalPoint: Trackir 5. http://www.naturalpoint.com/trackir/, accessed 30th Sept. 2011.
[37] M. Puckette: Pure data. http://puredata.info/, accessed 30th Sept. 2011.
[38] P. Zahorik, D. S. Brungart, A. W. Bronkhorst: Auditory distance perception in humans: A summary of past and present research. Acta Acustica united with Acustica 91 (2005) 409–420.
[39] H. Wittek: Perceptual differences between Wavefield Synthesis and Stereophony. Department of Music and Sound Recording, School of Arts, Communication and Humanities, University of Surrey, UK, 2007.
