
Freely available online

PAPERS
J. Yang, A. Barde and M. Billinghurst, “Audio Augmented Reality:
A Systematic Review of Technologies, Applications, and Future Research Directions”
J. Audio Eng. Soc., vol. 70, no. 10, pp. 788–809, (2022 October).
DOI: https://doi.org/10.17743/jaes.2022.0048

Audio Augmented Reality: A Systematic Review of Technologies, Applications, and Future Research Directions

JING YANG,1 AMIT BARDE,2 AND MARK BILLINGHURST2
([email protected]) ([email protected]) ([email protected])

1 Department of Computer Science, ETH Zurich, Switzerland
2 The Empathic Computing Laboratory, Auckland Bioengineering Institute, The University of Auckland, New Zealand

Audio Augmented Reality (AAR) aims to augment people's auditory perception of the real world by synthesizing virtual spatialized sounds. AAR has begun to attract more research interest in recent years, especially because Augmented Reality (AR) applications are becoming more commonly available on mobile and wearable devices. However, because audio augmentation is relatively under-studied in the wider AR community, AAR needs to be further investigated in order to be widely used in different applications. This paper systematically reports on the technologies used in past studies to realize AAR and provides an overview of AAR applications. A total of 563 publications indexed on Scopus and Google Scholar were reviewed, and from these, 117 of the most impactful papers were identified and summarized in more detail. As one of the first systematic reviews of AAR, this paper presents an overall landscape of AAR, discusses the development trends in techniques and applications, and indicates challenges and opportunities for future research. For researchers and practitioners in related fields, this review aims to provide inspiration and guidance for conducting AAR research in the future.

0 INTRODUCTION

Augmented Reality (AR) technology aims to seamlessly blend computer-generated virtual content with the physical environment so that the virtual content appears fused with the real world [1]. AR can enhance people's perception of and interaction with their surroundings and can also help them more easily perform real-world tasks [2]. Over the past few decades, technological advancements have significantly increased the adoption of AR technology in a wide range of application domains, including industrial maintenance [3, 1, 4, 5], education [6–8], gaming [9, 10], and collaborative work [11–15].

An important affordance of AR technology is its ability to augment human senses [16] so that people can interact with virtual objects and scenes as easily as they do with the physical world. However, the overwhelming majority of AR research has focused on visual augmentation [17–19]. Audio Augmented Reality (AAR) remains relatively under-explored. In AAR, virtual auditory content is blended into the physical world to augment the user's real acoustic environment. To mimic real-world auditory perception, virtual sounds are usually binaurally spatialized (along with reverberation if required) to create a realistic sense of direction and distance.

AAR technology has remarkable potential for creating effective and immersive AR experiences, for the following reasons:

1. Diversity of sound content: Different types of audio content can provide the user with rich information. For example, speech conveys information to answer questions and issue commands, whereas non-speech beacons and alerts indicate the operation status of applications and notify users of new messages [20]. Sounds created for AAR applications are thus capable of conveying a range of information based on the context in which they occur.

2. Localization and immersive experience: Given human binaural hearing in three-dimensional space [21], AAR can provide users with an enhanced sense of immersion by spatializing virtual sounds that recreate distance, direction, and spectral cues, which can play a critical role in several situations. For example, if AAR applications provide alarms, users may first hear a threat to their safety and move away from the danger promptly.
3. Ubiquitous hardware: Hardware capable of delivering AAR experiences is readily available. For example, mobile devices with powerful computing capabilities (like smartphones) can be used to compute virtual sounds and deliver high-quality immersive audio experiences, and users can access these experiences using readily available off-the-shelf headphones [20]. These devices can be conveniently used to deliver AAR experiences [20].

Given these reasons, it is unsurprising that AAR technology has begun to attract a greater amount of research interest. The availability of devices that are capable of delivering real-time AAR experiences has further spurred interest in this field. For example, Apple AirPods Pro,1 Samsung Galaxy Buds Pro,2 and JBL Quantum ONE3 can enable accurate spatialization of virtual sounds because of their integrated modules for head tracking.

1 https://www.apple.com/airpods-pro/
2 https://www.samsung.com/us/mobile/audio/galaxy-buds-pro/
3 https://www.jbl.com.sg/gaming/QUANTUMONE.html

Overall, AAR is a promising field yet still relatively under-studied in the wider AR community. Moreover, the technologies required to realize AAR make it more difficult to implement audio augmentation in AR scenarios than in Virtual Reality (VR) scenarios. More specifically, in VR, using pre-designed virtual scenes can simplify the rendering of audio content, whereas creating virtual sounds in the physical world and adapting them to the user in real time is more complicated in AR. For example, the user's pose with respect to the environment should be tracked to spatialize sounds properly, and the environmental acoustics should be updated according to the user's movements in the space. AAR technologies and AAR usability still need to be investigated and considerably improved in order to be widely applied and accepted by end users.

This paper, as one of the first surveys of its type, aims to provide a systematic overview of AAR. It aims to motivate the wider AR community to actively consider audio augmentation in the delivery of informative and immersive experiences. This review focuses on spatialized rather than monophonic virtual sounds. To better reflect the auralization process in real-world situations, the simulation of real environmental acoustics is also considered in this review. In summary, this paper makes the following contributions:

1. Providing one of the first comprehensive summaries to facilitate a systematic understanding of AAR technology and its development over the past few decades.

2. With a focus on spatial sound-related AAR, this paper identifies five functional components of AAR systems and discusses techniques to implement these functions.

3. Based on published studies, this paper identifies seven application domains where AAR has been shown to be practical or has the potential to make a significant difference.

4. Discussing future research challenges and opportunities for making AAR more beneficial and acceptable.

The rest of this paper first explains the methods employed for paper selection and the review process. Following this, technologies used for developing AAR systems are reviewed. The application domains of AAR technology are then reviewed, and finally, future research directions to advance AAR development are discussed.

1 METHODOLOGY FOR PAPER SELECTION AND REVIEWING

This section outlines the process for selecting and reviewing papers. The potential limitations of this process are also discussed.

1.1 Paper Selection and Review

This survey paper aims to provide a comprehensive review of the existing AAR landscape. The Scopus bibliographic database was searched first, and then Google Scholar was searched to include more related work; both have been commonly used for previous AR reviews [17, 22, 23].

Papers published in conferences and journals up until November 2021 were considered. A start date for the search was not specified in order to cover early works as well. Table 1 lists the search terms used for paper collection. Note that the search terms cover two distinct aspects of AAR:

1. Technologies: This part covers the different technologies that have been used to realize AAR. To binaurally spatialize virtual sounds, AAR systems should include three functional components: user-object pose tracking, room acoustics modeling, and spatial sound synthesis. Two other important technologies for creating AAR systems are also reviewed: interaction technology and display technology. Interaction technology refers to how the user provides input (e.g., touch screen input) to enable or adjust AAR applications. Display technology refers to how the virtual sounds are output to end users (e.g., audio only via earphones, or through handheld displays together with visual content).

2. Application domains: This part covers AAR applications proposed over a given time period in a number of real-world use cases. The use of generic search terms such as "user study" and "experiments" allowed for gathering a larger set of papers and examining the various types of AAR applications proposed and/or implemented by researchers.
Table 1. Search terms used for collecting publications.

Technologies:
  "Audio Augmented Reality"
  "Augmented Reality" AND "Audio Augmentation"
  "Augmented Reality" AND "Head Pose Tracking"
  "Augmented Reality" AND "User Pose Tracking"
  "Augmented Reality" AND "Pose Tracking"
  "Augmented Reality" AND "Acoustics Modeling"
  "Augmented Reality" AND "Room Acoustics"
  "Augmented Reality" AND "Acoustic Effect(s)"
  "Augmented Reality" AND "3D/Spatial Sound Synthesis"
  "Augmented Reality" AND "3D Audio/Sound"
  "Augmented Reality" AND "Spatial Audio/Sound"
  "Augmented Reality" AND "HRTF"
  "Augmented Reality" AND "Audio/Auditory Interaction"
  "Audio Augmented Reality" AND "Interaction"
  "Augmented Reality" AND "Audio/Auditory Display"
  "Audio Augmented Reality" AND "Display"

Application domains:
  "Audio Augmented Reality" AND "Study/-ies"
  "Audio Augmented Reality" AND "User Study/-ies"
  "Audio Augmented Reality" AND "Pilot Study/-ies"
  "Audio Augmented Reality" AND "Experiment(s)"

HRTF = head-related transfer function.

The terms were searched in the title, abstract, and keywords fields to identify relevant literature. The full text of each paper was read to assess its suitability for this survey, and papers were excluded for two reasons: 1) the primary research theme/objective of the paper had little to do with the reviewed topics; 2) a few works added monophonic sounds, whereas this survey focuses on spatial virtual sounds. For example, [24] provided monophonic audio descriptions in a tourist guide application that users heard through earphones.

After filtering out papers according to their research theme and content, the impact of each remaining paper was considered to ensure that representative and influential work was being reviewed. To this end, every paper's average citation count (ACC) [17] was calculated. For papers published before 2020, 96 papers with ACC ≥ 2.0 were included. Papers published in 2020 and 2021 were included regardless of their ACC because most of them were still too new to accrue a significant citation count.

Overall, 117 papers were reviewed, of which 62 presented a complete AAR system through which the user can perceive virtual spatialized sounds in the given scenes. From these 62 papers, the technologies used to implement AAR and the domains in which they were applied are summarized. The remaining 55 papers did not demonstrate the development and/or use of complete AAR systems. Instead, they proposed algorithms, methods, or techniques for implementing one specific AAR component. Because the proposed techniques were not adequately covered in the existing complete AAR systems, these papers are also reviewed in the related sections.

1.2 Limitations

Two limitations that might influence the thoroughness of this review are identified. First, this review focuses on research studies rather than commercial practice, so some relevant works (e.g., some white papers and patents) might not be covered by the Scopus database and Google Scholar. Second, the authors strove to collect all related papers by using the search terms in Table 1. However, some papers might only use other keywords, such as "Mixed Reality," to describe AR-related research. Nevertheless, the search terms should have covered a large proportion of the work that is relevant to AAR technology and its applications.

2 MAJOR TECHNOLOGIES FOR CREATING AAR

This section discusses the technologies used to develop the AAR systems mentioned in the reviewed literature. Previous research [25] has indicated that tracking, interaction, and displays form the main components of typical AR systems. In addition to these, two technology components are specifically needed for AAR: room acoustics modeling and spatial sound synthesis. For papers that presented a complete AAR system, the technologies used for each component are summarized in Table 2. Note that some works did not specify or include some technology components; these are represented by "···" in the table.

Fig. 1 shows the percentages of the methods that were used to realize each technology component of the AAR systems listed in Table 2. In the remainder of this section, a detailed review of each of these five technologies is provided.

2.1 User-Object Pose Tracking

In the context of AAR, to guarantee that virtual sounds are correctly spatialized from real locations, tracking specifically refers to tracking the user's location, orientation, and relative pose to the desired audio source. Table 2 shows that 43% of the works implemented tracking using a single type of sensor, among which visual tracking is the most commonly used method (58%). For AAR systems using visual tracking, a typical approach is to detect and track visual features from input video frames to calculate the current pose. Although some works employ natural images captured from unmodified environments (e.g., [26, 27, 28]), others exploit specifically designed fiducial markers (i.e., image markers that serve as references) that are pre-allocated in the space (e.g., [29, 30, 31]).
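To make the marker-based approach concrete, the following is a minimal sketch of fiducial-marker pose estimation using OpenCV's ArUco module. It is not the pipeline of any reviewed system; it assumes the opencv-contrib-python package is available, and the camera intrinsics and 5-cm marker size are illustrative placeholders.

```python
# Minimal sketch of fiducial-marker pose tracking (not the pipeline of any
# specific reviewed system). Requires opencv-contrib-python; the camera
# intrinsics and 5-cm marker size below are illustrative placeholders.
import cv2
import numpy as np

camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])   # assumed intrinsics
dist_coeffs = np.zeros(5)                     # assume negligible lens distortion
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def track_marker_pose(frame_bgr):
    """Return (rotation_vector, translation_vector) of the first detected
    marker relative to the camera, or None if no marker is visible."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        return None
    # 0.05 m marker edge length; the pose is expressed in camera coordinates.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, 0.05, camera_matrix, dist_coeffs)
    return rvecs[0], tvecs[0]
```

The recovered marker pose can then be inverted to obtain the camera (and hence user) pose relative to a marker placed at a known location, which is what the spatializer ultimately needs.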
Table 2. Summary of the technologies used in the 62 complete AAR systems. Categories in each column are clarified in more detail in the corresponding sections.

Paper | Tracking method | Interaction method | Display method | Acoustics modeling | Sound spatialization
Bederson [55] | Infrared tracking | Implicit | Audio only | ··· | ···
Mynatt et al. [56] | Infrared tracking | Implicit | Audio only | ··· | ···
Behringer et al. [88] | GPS-inertial tracking | Voice input/game pad | HMD | ··· | AudioTechnica ATW-R100 receivers
Sawhney and Schmandt [89] | Static head position | Voice input/button | Audio only | ··· | ···
Walker et al. [83] | ··· | Mouse scroll | Audio only | ··· | Microsoft DirectX with generic HRTFs
Härmä et al. [102, 103] | ··· | Implicit | Audio only | Artificial reverberation | Measured BRIRs
Sundareswaran et al. [98] | GPS-inertial tracking | Implicit | HMD | ··· | Off-the-shelf engines with generic HRTFs
Tachi et al. [99] | Visual tracking | Implicit | HMD | ··· | ···
Hatala et al. [68] | RFID-visual tracking | 3D tangible interface | Audio only | ··· | ···
Terrenghi and Zimmermann [57] | RFID tracking | Implicit | Audio only | Artificial reverberation | Generic HRTFs
Zhou et al. [29] | Visual/acoustic-inertial tracking | Implicit | HMD | ··· | OpenAL with generic HRTFs
Zhou et al. [90] | Visual tracking | Foldable AR book | HMD | ··· | ···
Zotkin et al. [105] | Visual tracking | ··· | PC display | Computed IRs | Selected HRTFs
Hatala and Wakkary [69] | RFID-visual tracking | 3D tangible interface | Audio only | ··· | ···
Walker and Lindsay [81] | ··· | Keyboard input | Audio only | Artificial reverberation | Generic HRTFs
Fröhlich et al. [145] | GPS-inertial tracking | Implicit | Audio only | Outdoor | ···
Sodnik et al. [30] | Visual tracking | Implicit | HMD | ··· | OpenAL using generic HRTFs
Tonnis and Klinker [79] | ··· | Implicit | Head-up display | ··· | Surrounding speakers
Walker and Lindsay [61] | GPS-inertial tracking | Implicit | Audio only | ··· | Generic HRTFs
Grasset et al. [38] | Visual tracking | 3D tangible interface/gaze input | Handheld display | ··· | ···
Liarokapis [82] | Visual tracking | Keyboard/mouse/touch screen | HMD | ··· | OpenAL using generic HRTFs
Stahl [62] | GPS-inertial tracking | Slider on GUI | Mobile device | Outdoor | ···
Wakkary and Hatala [70] | RFID-visual tracking | 3D tangible interface | Audio only | ··· | ···
Wilson et al. [146] | GPS-inertial tracking | 2D scrolling interface | Audio only | Outdoor | ···
Zimmermann and Lorenz [58] | RFID tracking | Implicit | Audio only | Artificial reverberation | ···
Heller et al. [72] | UWB-inertial tracking | Implicit | Audio only | ··· | OpenAL using generic HRTFs
Kern et al. [80] | ··· | Implicit | PC display | ··· | ···
Blum et al. [91] | GPS-inertial tracking | 3D tangible interface | Audio only | Outdoor | OpenAL using generic HRTFs
Katz et al. [65] | Visual-inertial tracking | Implicit | Audio only | Outdoor | ···
McGookin et al. [84] | GPS-inertial tracking | Touch screen | Mobile device | Outdoor | ···
Ribeiro et al. [66] | Visual-inertial tracking | Implicit | Audio only | Pre-modeled room | Generic HRTFs
Vazquez-Alvarez et al. [92] | GPS-inertial tracking | 3D tangible interface | Audio only | Outdoor | JAVA JSR-234 using generic HRTFs
Blum et al. [51] | Inertial tracking | 3D tangible interface | Audio only | ··· | PureData using generic HRTFs
Langlotz et al. [71] | GPS-visual tracking | Touch screen | Mobile device | Outdoor | Stereo sound panning
de Borba Campos et al. [123] | ··· | ··· | Audio only | ··· | Stereo sound panning
Heller et al. [78] | Retroreflective tracking | Implicit | Audio only | Artificial reverberation | OpenAL using generic HRTFs
Blessenohl et al. [26] | Visual tracking | Implicit | Audio only | ··· | Generic HRTFs
Ruminski [31] | Visual tracking | ··· | Mobile device | ··· | ···
Chatzidimitris et al. [59] | GPS tracking | Touch screen | Mobile device | Outdoor | OpenAL using generic HRTFs
Heller et al. [52] | Inertial tracking | Implicit | Audio only | ··· | KLANG using generic HRTFs
Russell et al. [73] | UWB-inertial tracking | Implicit | Audio only | Outdoor | 3DCeption using generic HRTFs
Heller and Schöning [63] | GPS-inertial tracking | Implicit | Audio only | ··· | KLANG using generic HRTFs
Kim et al. [86] | ··· | Touch screen | HMD | ··· | ···
Lim et al. [85] | ··· | Touch screen | Mobile device | Outdoor | ···
Schoop et al. [27] | Visual tracking | Implicit | Audio only | Outdoor | Stereo sound panning
Sikora et al. [64] | GPS-inertial tracking | Touch screen | Audio only | Outdoor | Generic HRTFs
Huang et al. [100] | Visual tracking | ··· | HMD | Outdoor | Generic HRTFs
Rovithis et al. [60] | GPS tracking | Gesture control | Mobile device | Outdoor | SceneKit using generic HRTFs
Yang et al. [37] | Retroreflective tracking | Implicit | Audio only | Pre-modeled room | Generic HRTFs
Bandukda and Holloway [148] | ··· | Implicit | Audio only | ··· | ···
Cliffe et al. [106] | Visual tracking | Implicit | Audio only | Pre-recorded soundscape | Generic HRTFs
Joshi et al. [149] | ··· | ··· | Audio only | ··· | ···
Kaghat et al. [53] | Inertial tracking | Gesture control | Audio only | ··· | Generic HRTFs
Lawton et al. [122] | ··· | ··· | Audio only | Outdoor | Surrounding speakers
Mattheiss et al. [104] | ··· | Implicit | Audio only | Artificial reverberation | Individual and generic HRTFs
May et al. [147] | ··· | Implicit | Audio only | ··· | Generic HRTFs
Sagayam et al. [87] | Visual tracking | Touch screen | Mobile device | Pre-modeled room | Generic HRTFs
Yang et al. [101] | Visual tracking | Implicit | HMD | ··· | Generic HRTFs
Chong and Alimardanov [169] | ··· | Implicit | Audio only | Outdoor | Generic HRTFs
Comunita et al. [67] | Visual-inertial tracking | Implicit | Mobile device | Pre-modeled room | Generic HRTFs
Guarese et al. [54] | Inertial tracking | Implicit | HMD | ··· | Generic HRTFs
Kaul et al. [28] | Visual tracking | Implicit | Audio only | ··· | Generic HRTFs

*AAR = audio augmented reality; HMD = head-mounted display; HRTF = head-related transfer function; IR = impulse response; RFID = radio frequency identification; UWB = ultra wideband.

Fig. 1. Overview of the technologies used in the reviewed audio augmented reality (AAR) systems. BRIR = binaural room impulse
response; HMD = head-mounted display; HRTF = head-related transfer function; RFID = radio frequency identification.

The fact that visual tracking is most popular could be largely due to the development of computer vision techniques. More specifically, advancements in vision sensors (e.g., stereoscopic depth cameras [32] and RGB-D camera sensors [33–36]) have enabled increasingly accurate pose tracking. Moreover, computer vision tracking algorithms have been studied for decades and are relatively mature. Furthermore, visual tracking can be implemented in many forms (e.g., inside-out tracking [26], outside-in tracking [37], tracking for small-scale desktop scenes [38] and large-scale outdoor places [27]). Therefore, visual tracking can be used in many different application scenarios. In the AR community, various visual tracking methods have been used in AR systems for vision augmentation [39, 32, 40–43, 33, 34, 44, 45, 35, 46–50, 36], and they are also feasible for AAR systems.

Inertial sensors appear to be the second-most-used sensors in AAR systems that used a single sensor type for pose tracking (19%) [29, 51, 52, 53, 54]. Inertial sensors usually combine accelerometers and gyroscopes to track the position and orientation of an object relative to a known starting point, orientation, and velocity. Note that because drift may cause significant errors, especially for position tracking, these AAR systems used inertial sensors for three-degrees-of-freedom orientation tracking in a small area where the user barely changed location.

In addition to visual and inertial sensors, some other sensors have also been employed for AAR tracking. Two indoor AAR systems used infrared transmitters and receivers [55, 56], and another two indoor AAR systems used radio frequency identification (RFID) technology [57, 58]. Furthermore, GPS has been used in two outdoor AAR systems to determine the user's location [59, 60].

Although some AAR systems have successfully used a single sensor type to track the user's pose, each sensor type has limitations that might restrict its deployment in some scenarios. For example, visual tracking is not feasible in low-light environments, in spaces with occlusions, or when power consumption is critical [20]. To achieve robust tracking in various environments, researchers have also explored hybrid techniques that fuse several kinds of sensors.

Table 2 shows that 34% of the AAR systems employed hybrid pose tracking approaches, with the most popular combination being GPS and inertial sensors (53%; e.g., [61–64]). These implementations were typically designed for large spaces or outdoor environments, where GPS sensors can be used to localize the user's position and inertial sensors can be used to determine the user's head orientation. The reviewed AAR systems demonstrate another five hybrid tracking approaches. A total of 14% of the AAR systems fused visual and inertial sensors [65–67], and 14% fused visual and RFID sensors [68–70]. Additionally, one AAR system used GPS-visual tracking [71], whereas another demonstrated acoustic-inertial tracking [29]. Two AAR systems used an implementation of ultra-wideband-inertial tracking [72, 73].

It can be seen that hybrid tracking improves tracking accuracy by combining the strengths of different sensor types. For example, inertial sensors are commonly used for measuring orientation in the wider tracking community, and 81% of the hybrid methods exploited inertial sensors. GPS and visual sensors are more commonly used for determining the user's position. Although visual sensors can be used more flexibly for different scales of scenes, GPS fits better in large environments or outdoor spaces.
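As a minimal sketch of this GPS-plus-inertial pattern (not taken from any reviewed system), the snippet below derives the azimuth of a sound source relative to the listener's head: GPS supplies the positions, an IMU supplies the head yaw, and a local flat-earth approximation (adequate over the short ranges typical of such applications) converts the coordinate difference into a bearing. All names and values are illustrative.

```python
# Sketch of the GPS + inertial pattern described above: GPS gives the listener's
# position, an IMU gives head yaw, and the renderer receives the source azimuth
# relative to the head. Uses a local flat-earth approximation; all names and
# coordinates are illustrative, not from any reviewed system.
import math

EARTH_RADIUS_M = 6371000.0

def source_azimuth_deg(user_lat, user_lon, head_yaw_deg, src_lat, src_lon):
    """Azimuth of the source relative to the user's facing direction.
    0 deg = straight ahead, positive values = to the right."""
    d_north = math.radians(src_lat - user_lat) * EARTH_RADIUS_M
    d_east = (math.radians(src_lon - user_lon) * EARTH_RADIUS_M
              * math.cos(math.radians(user_lat)))
    bearing = math.degrees(math.atan2(d_east, d_north))     # bearing from north
    return (bearing - head_yaw_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)

# Example: a source roughly north-east of a user who is facing due north
# would yield an azimuth of about +45 degrees.
```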
Overall, the AAR systems show that mainstream user-object pose tracking approaches are based on visual, inertial, and GPS sensors. This trend is, in general, aligned with the popular tracking methods in the wider AR community. However, the authors find it surprising that acoustic tracking and acoustic-inertial tracking, which have been used in a few AR systems [74–76], were rarely exploited by AAR systems. When using acoustic sensors for pose tracking, acoustic signals can be emitted from one or several sound sources and received by microphones attached to the object or user to be tracked. The authors see the potential to employ acoustic sensors for pose tracking in AAR systems. As some earbuds are now equipped with acoustic and motion sensors [77], acoustic or acoustic-inertial tracking might be a lightweight and convenient solution for implementing AAR systems on wearable devices. Acoustic sensor-based tracking is discussed further as a future research direction in SEC. 4.

2.2 Interaction Technology

Interaction refers to how the user initiates or responds to an AAR system and how the system dynamically reacts to the user's actions. To this end, AAR systems must incorporate input methods that allow users to choose and activate the audio augmentation content or adjust the presentation of the audio augmentation (e.g., adjust the volume).

Among the reviewed AAR systems, 49% implemented "implicit" interaction. Implicit interactions with AAR systems refer to those that are not actively initiated by the user. Instead, the system reacts to the user's surroundings and their actions in the environment and provides the desired virtual sounds. Users' actions serve as the only inputs to the AAR system. Such implicit interaction is typically used in localization and navigation services (e.g., [30, 72, 78, 27]). Because the user does not need to control devices or objects to provide explicit commands, implicit interaction is convenient and helpful for visually impaired individuals (e.g., [26, 54]) or when the user is engaged in an attention-critical task (e.g., driving [79, 80]). However, implicit interaction does not allow users to flexibly control the audio augmentation at will.

More proactive interaction modes are shown in the rest of the AAR systems. Traditional 2D user interfaces were implemented in 22% of the AAR systems, such as a GUI [62], keyboard [81, 82], mouse [83, 82], and touch-screen input [82, 84]. Some AAR systems (8%) designed a mobile application or game for the user's interaction on the touch screen [59, 85, 86, 64, 87]. By actions such as clicking buttons and selecting items on menus, the user can activate the AAR service, change the virtual audio content, and adjust the audio presentation as they prefer.

Some AAR systems (8%) employed more natural user interfaces. For example, the user can directly provide voice commands to select or adjust the audio content [88, 89]. Hand gestures [60], head gestures [53], and eye gaze [38] were other common types of natural body input for AAR systems. For example, the user can control the audio volume by swinging their head to the left or right [53].

Finally, the authors noticed that 12% of the AAR systems employed novel, application-associated 3D tangible user interfaces [68, 90, 69, 38, 70, 91, 92, 51]. For example, in an exhibition, users could choose the virtual audio content at a specific position by rotating a physical cube toward that direction [68, 69]. Some other AAR systems used mobile devices (e.g., smartphones) as the manipulating interface [91, 92, 51]. For example, users can tip the phone to switch between "stop" and "listen" mode [51]. In an AR book application, the book was designed to be foldable for interaction [90]. More specifically, appropriate audio content is played along with visual animations in a pre-defined sequence when the user unfolds the book into a specific state [90].

Overall, a variety of interaction methods have been demonstrated in the reviewed AAR systems. These interaction methods can be used to choose and activate the desired audio augmentation content or adjust the presentation of the audio augmentation. However, most AAR systems integrated pre-designed virtual audio content that was not editable at run-time through the interaction techniques.

2.3 Display Technology

Display technology, in the context of audio reproduction, refers to the hardware used to present sounds to a user. This may be in the form of loudspeakers, headphones, or earphones, among other methods used to deliver sound. Among the reviewed AAR systems, 61% included only audio augmentation ("audio only" in Table 2). In these cases, virtual sounds were rendered on computing devices (e.g., smartphones, tablets, and PCs) and then delivered to the user via wired/wireless headphones, earbuds, or bone-conduction headsets. The virtual sounds can help users finish some tasks or provide users with a better experience. For example, spatialized sounds can indicate direction and distance in a navigation application [92].

For AAR displays, acoustic hear-through is an important functionality for applications in which real environmental sounds already exist apart from those added virtually. If the real sounds are wanted, acoustically transparent devices can be used so that real sounds pass through unaltered for natural fusion with the virtual sounds [93–95]. In some other cases, real sounds might be unwanted and thus need to be reduced or removed. For example, so that users can clearly hear spatialized navigation cues through earphones when riding bicycles outdoors, environmental wind noise was attenuated [96, 97].

Some AAR systems also integrated visual augmentation in addition to audio augmentation, which requires other display methods for the user to comprehensively experience the augmented environment. Some systems (18%) employed head-mounted displays (HMDs) [88, 98, 99, 29, 90, 30, 82, 86, 100, 101, 54], whereas others (16%) implemented handheld displays (such as smartphones, tablets, and some other handheld devices) [38, 62, 84, 71, 31, 59, 85, 60, 87, 67] to enable users to perceive virtual visual and auditory content together. For applications like driving simulation, a head-up display [79] and a PC display [80] were set up in the environment to simulate an in-car situation. When visual augmentation was also included, users could perceive virtual sounds using device-integrated speakers (e.g., Magic Leap One [101]). Alternatively, additional headsets or earbuds could be connected to the display devices to deliver virtual sounds, as in the audio-only cases.

Overall, AAR systems mainly included audio-only displays or audio-visual displays. Off-the-shelf headsets or earbuds have been most commonly used to deliver virtual sounds. Newer display devices, such as miniature loudspeaker arrays placed close to the ears as seen in the HoloLens [54] and Magic Leap [101], have also started to gain popularity. The choice of display technology is heavily dependent on the nature of the application and whether the augmentation of other senses is needed.

2.4 Room Acoustics Modeling

For the purpose of this review, room acoustics modeling technology specifically pertains to AAR systems used in indoor environments. The need for this technology component stems from people's natural auditory perception of the real world. The perception of the same sound in different environments can vary drastically, even if the relative pose stays the same. This is because the acoustic properties of an environment (e.g., room geometry, surface materials, etc.) influence sound propagation and affect how the user perceives the sound source width, externalization, spectral characteristics, etc. These acoustic properties are unique to each environment. For users to perceive virtual sounds as if they physically "belong" in the environment, an AAR system should model the room acoustics when rendering virtual sounds.

A typical approach to room acoustics modeling is to acquire the impulse response (IR) of the environment. An IR is a function that describes how the environment influences the sound propagation from the source to the listener. Convolving the IR with a "dry" version of the sound results in a virtual sound source that is colored by the acoustic properties of the environment it occupies.
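The convolution step can be sketched in a few lines. This is a generic illustration rather than the implementation of any reviewed system; it assumes mono signals and the availability of the scipy and soundfile packages, and the file names are placeholders.

```python
# Sketch of the convolution step described above: a dry (anechoic) signal
# convolved with a room impulse response takes on the room's acoustic
# signature. Assumes mono signals; file names are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, fs = sf.read("dry_speech.wav")        # anechoic source signal
room_ir, fs_ir = sf.read("room_ir.wav")    # measured or simulated room IR
assert fs == fs_ir, "signal and IR must share one sample rate"

wet = fftconvolve(dry, room_ir, mode="full")
wet /= np.max(np.abs(wet)) + 1e-12          # normalize to avoid clipping
sf.write("wet_speech.wav", wet, fs)
```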
Among the reviewed AAR systems, 27% focused on outdoor environments and did not include room acoustics modeling. In fact, to embed virtual sounds seamlessly in a real environment, appropriate reflection modeling is also important for outdoor applications. However, these works did not involve such reflection modeling in their systems.

Among the remaining 45 AAR systems, 73% did not include or specify the room acoustics modeling component, which might be because of the following three reasons: 1) Some systems used virtual spatial sounds for vivid presentations (e.g., to play messages at different spatial locations around the user's head based on their time of arrival [89]). In these applications, a strong association with the environment in which the user was located was not required, so integrating room acoustics did not add much value to the user experience. 2) Some AAR systems aimed to provide localization or navigation services (e.g., [72, 26, 52]). Researchers focused on rendering 3D locations or directions, and room acoustics modeling did not significantly influence the user's perception of the source location, especially when the application was designed for small-scale scenes (e.g., a desktop [30]). 3) Some AAR systems that integrated acoustic effects did not specify how or what was used to simulate the acoustic environment.

The remaining 12 systems involved room acoustics modeling, but most of them did not provide implementation details. Some works mentioned that they modeled artificial reverberation [102, 103, 57, 81, 58, 78, 104]. Artificial reverberation simulates sound wave propagation phenomena in an environment, such as reflection and diffusion, which can create the feeling of being in an indoor environment. However, because these works did not describe how the artificial reverberation was created, it is difficult to determine how well the added reverberation matched the real environment.

Four works implemented room acoustics modeling by first creating a virtual 3D room model that corresponded to the real environment and then simulating room acoustics based on the room model [66, 37, 87, 67]. One work computed IRs in a rectangular room by using an extended image source method for several source-receiver pairs [105]. Another work [106] integrated pre-recorded audio clips with their room acoustics (e.g., concert hall and drama scenes); switching between these audio clips could then create the feeling of being immersed in a different indoor environment.

In general, room acoustics has been ignored or probably not modeled well enough in many of the systems. Although the reviewed AAR systems did not present much about room acoustics modeling, research on acoustics modeling has been conducted for years, and some methods have the potential to be further explored and integrated into future AAR systems.

As mentioned above, one can first create a 3D room model, including geometry modeling and surface material identification, and then model the environmental IRs by simulating sound wave propagation in the space. To this end, visual inputs from cameras can be used to reconstruct 3D environment models [107–111] and recognize materials [109, 111]. Apart from vision-based approaches, acoustics-based methods can also be used for geometry modeling and material classification. For example, smartphones can be used to receive ultrasonic chirps to reconstruct the environment geometry and estimate the sound absorption coefficients of indoor surface materials [112]. Based on the modeled geometry and the classification of materials, computational techniques can be applied to generate the desired IRs [113]. In addition to geometry- and material-based sound propagation simulation, it is also possible to exploit parametric methods that statistically code the desired IRs [114]. Statistically coding a sound field is possible because much of the perceptual quality of virtually rendered sounds can be quantified by a few critical acoustic parameters (e.g., reverberation time) [114]. Compared to a complete sound propagation simulation based on geometry and surface material properties, the parametric coding method may run faster and have lower computational requirements.
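One widely used parameter of this kind is the reverberation time RT60, which Sabine's classic formula estimates from room volume and total surface absorption. The sketch below uses this textbook relation for illustration only; it is not the parametric coding method of [114], and the room dimensions and absorption coefficients are invented.

```python
# Sabine's formula for reverberation time, one of the "critical acoustic
# parameters" mentioned above. Standard textbook relation, not the parametric
# coding method of [114]; the room and absorption values are made up.
def rt60_sabine(volume_m3, surfaces):
    """surfaces: iterable of (area_m2, absorption_coefficient) pairs."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / total_absorption   # seconds

# 6 m x 4 m x 3 m room: lightly absorbing walls, acoustic ceiling, carpeted floor.
room = [(2 * (6 * 3 + 4 * 3), 0.10),   # walls
        (6 * 4, 0.15),                  # ceiling
        (6 * 4, 0.30)]                  # floor
print(round(rt60_sabine(6 * 4 * 3, room), 2))   # roughly 0.7 s for this example
```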

ments in the application environments [111, 114], which a location in the space will reach the eardrum after the
makes them difficult to be used for interactive AAR appli- sound waves interact with the listener’s anatomical struc-
cations in arbitrary environments. From another perspec- ture such as head and torso [125]. There is a pair of HRTFs,
tive, it has been analyzed that room acoustics modeling one for each ear. Because people are anthropometrically
played a less important role in some AAR systems, and different, the HRTFs associated with each individual tend
thus this component was little considered when designing to be unique. Acquiring precise HRTFs for each person
the system. The authors argue that appropriate room acous- usually requires laborious measurements in strictly con-
tics modeling can significantly improve the user’s immer- trolled environments, along with a lot of equipment such as
sive experience in some applications, such as AR-facilitated multiple speakers, etc. Fortunately, because many people’s
remote collaboration and exhibition tour. Thus, exploring anatomical structures are largely similar, previous works
computationally-efficient and conveniently-implementable have shown that using a pair of generic HRTFs of an av-
solutions remains an important future research topic. More erage head or those of a “good localizer” [126, 127] can
discussions about future work will be presented in SEC. 4. produce perceptually adequate virtual sounds for a large
group of users [128–130].
Of the reviewed AAR systems, 54% implemented bin-
2.5 Spatial Sound Synthesis aural spatialization with generic HRTFs. Some of these
Spatial sound synthesis technology aims to synthesize systems used open-source or freely available spatial audio
virtual sounds that can be externalized like they are present engines, such as OpenAL [29, 30, 82, 72, 91, 78, 59] and
in the user’s real environment versus “inside-the-head” KLANG [52, 63]. However, the auralization details of the
[115]. A key aspect of this is being able to process sounds spatial audio engines are not available, and some of these
in a way that the spectral and binaural characteristics of systems did not specify the auralization principles or the
sounds delivered through headphones or earphones mimic audio engines they used for binaural spatialization.
those of sounds that are incident on the ears in the real Instead of using generic HRTFs, users may select the
world. most suitable HRTFs for themselves from a dataset of pre-
A classic technique is ambisonics recording and repro- measured HRTFs (e.g., MIT HRTF database [131], CIPIC
duction [116–118]. The sound of interest is first recorded HRTF database [132], SADIE II HRTF database [133])
using microphone arrays that typically consist of a large [134]. Among the AAR systems that have been reviewed,
number of homogeneously distributed microphones [119– Zotkin et al. [105] selected HRTFs for each user by measur-
121]. Thereafter, depending on the user’s real-time location ing the user’s anthropometric parameters and then finding
in the environment, the recorded sound field can be repro- the closest match in their HRTF database.
duced from the desired source location. Such reproduction Although generic or similar HRTFs can work well for
can be implemented through spatially distributed real loud- many users, some studies have shown that personalized
speakers such as [79] and [122] did in their AAR systems. HRTFs demonstrate significantly better results, especially
In order to conveniently use AAR applications in envi- if users demonstrate a high sensitivity of auditory localiza-
ronments that are not pre-equipped with real loudspeakers, tion or if their anatomical structures are far from average
virtual sounds are better delivered through off-the-shelf or [135–137]. Among the AAR systems that have been re-
specifically designed headsets or earphones. Three works viewed, one work [104] measured HRTFs for some users
implemented a stereo sound panning [71, 123, 27] tech- and used these individual HRTFs to render binaural audio
nique. This technique assigns a piece of monophonic sound for these users.
to the left and right audio channels with time and level dif- So far, it has been summarized how most of the reviewed
ferences, which creates the illusion of width and space for AAR systems used binaural spatialization (with generic,
the user. However, stereo panning sounds are usually per- similar, or individual HRTFs) to synthesize virtual auditory
ceived as localized inside the head rather than outside in the sources. For users to experience an immersive auditory ex-
space, causing an unnatural fusion of virtual sounds with perience in the environment, or to have a better localization
the real environment. In these works, because of their spe- performance [138], one should integrate room acoustics
cific application settings [71] or the use of bone-conduction modeling into spatial sound synthesis. One approach to this
headset in streets [27], the stereo panning technique was is by combining environmental IRs and HRTFs [139] when
able to provide a reasonable performance. In more gen- rendering virtual sounds. Alternatively, researchers can di-
eral cases, more precise localization and externalization of rectly measure a user’s binaural room impulse responses
virtual sounds should be achieved through binaural spatial- (BRIRs) in the environment of interest, which integrate
ization. room acoustics and the user’s personal auditory perception
Binaural spatialization reproduces audio in a manner that [140, 141] in one measurement. To avoid the complexity of
mimics auditory perception with two ears in the real world. in-situ measurements, it is also possible to simulate percep-
Binaural spatialization can be achieved through various tually plausible BRIRs [142]. Among the reviewed AAR
processes such as equalization, delay filters, or convolu- systems, only one of them chose to measure each user’s
tion with head-related transfer functions (HRTFs) [124]. BRIRs for binaural audio rendering [102, 103].
An HRTF is typically formulated as a function of the sound Overall, it can be seen that most AAR systems have
source position and its spectral distribution [125]. More used open-source engines and generic HRTFs when creat-
specifically, an HRTF describes how a sound emitted from ing binaural sounds that users can perceive through normal

796 J. Audio Eng. Soc., Vol. 70, No. 10, 2022 October
PAPERS A REVIEW OF AUDIO AUGMENTED REALITY
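A minimal sketch of this HRTF-based rendering is shown below: the mono signal is convolved with the left- and right-ear head-related impulse responses (HRIRs) of the nearest measured direction. The HRIR dictionary is a placeholder for a database such as CIPIC or SADIE II; a practical renderer would also interpolate between directions as the head moves and could cascade a room IR (or use BRIRs) to add the environment's acoustics.

```python
# Sketch of HRTF-based binaural rendering as described above: convolve the mono
# signal with the left/right head-related impulse responses (HRIRs) of the
# nearest measured direction. hrir_database is a placeholder for a measured set
# (e.g., CIPIC or SADIE II) resampled to the signal's sample rate; a real
# renderer would also crossfade between directions as the head moves and could
# cascade a room IR (or use BRIRs) to include room acoustics.
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, azimuth_deg, elevation_deg, hrir_database):
    """hrir_database: {(az_deg, el_deg): (hrir_left, hrir_right)}.
    Returns a (2, N) binaural signal."""
    # Nearest-neighbor lookup over the measured HRIR grid (no wraparound handling).
    key = min(hrir_database,
              key=lambda k: (k[0] - azimuth_deg) ** 2 + (k[1] - elevation_deg) ** 2)
    hrir_l, hrir_r = hrir_database[key]
    left = fftconvolve(mono, hrir_l, mode="full")
    right = fftconvolve(mono, hrir_r, mode="full")
    return np.stack([left, right])
```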

Of the reviewed AAR systems, 54% implemented binaural spatialization with generic HRTFs. Some of these systems used open-source or freely available spatial audio engines, such as OpenAL [29, 30, 82, 72, 91, 78, 59] and KLANG [52, 63]. However, the auralization details of these spatial audio engines are not available, and some of these systems did not specify the auralization principles or the audio engines they used for binaural spatialization.

Instead of using generic HRTFs, users may select the most suitable HRTFs for themselves from a dataset of pre-measured HRTFs (e.g., the MIT HRTF database [131], the CIPIC HRTF database [132], or the SADIE II HRTF database [133]) [134]. Among the AAR systems that have been reviewed, Zotkin et al. [105] selected HRTFs for each user by measuring the user's anthropometric parameters and then finding the closest match in their HRTF database.

Although generic or similar HRTFs can work well for many users, some studies have shown that personalized HRTFs yield significantly better results, especially if users demonstrate a high sensitivity of auditory localization or if their anatomical structures are far from average [135–137]. Among the AAR systems that have been reviewed, one work [104] measured HRTFs for some users and used these individual HRTFs to render binaural audio for them.

So far, it has been summarized how most of the reviewed AAR systems used binaural spatialization (with generic, similar, or individual HRTFs) to synthesize virtual auditory sources. For users to have an immersive auditory experience in the environment, or better localization performance [138], one should integrate room acoustics modeling into spatial sound synthesis. One approach is to combine environmental IRs and HRTFs [139] when rendering virtual sounds. Alternatively, researchers can directly measure a user's binaural room impulse responses (BRIRs) in the environment of interest, which integrate room acoustics and the user's personal auditory perception [140, 141] in one measurement. To avoid the complexity of in-situ measurements, it is also possible to simulate perceptually plausible BRIRs [142]. Among the reviewed AAR systems, only one chose to measure each user's BRIRs for binaural audio rendering [102, 103].

Overall, it can be seen that most AAR systems have used open-source engines and generic HRTFs when creating binaural sounds that users perceive through normal headsets or earphones. In comparison, personalized HRTFs or BRIRs have not been widely used in the existing AAR systems, which could be partly because of the difficulty of acquiring personalized HRTFs or BRIRs in practice. Because room acoustics modeling was ignored in most of the AAR systems, spatialized virtual sounds might be perceived without appropriate engagement with the real environment. To seamlessly blend virtual sounds with the physical world, the technologies of room acoustics modeling and spatial sound synthesis should go hand in hand in AAR implementations.

2.6 Summary

In this section, the five major technologies for creating AAR systems have been reviewed. From the above discussions, some general trends in AAR technologies over the past decades are summarized.

User-object pose tracking: Around 43% of the AAR systems used a single type of sensor for pose tracking. However, to guarantee more robust pose tracking in different environments, hybrid tracking methods that exploit the strengths of different sensor types were more favored. For hybrid tracking, visual sensors, inertial sensors, and GPS have been used most.

Interaction technology: Implicit interaction was most commonly used in the reviewed AAR systems, followed by traditional 2D interfaces, 3D tangible interfaces, and natural body inputs. Interaction technologies have mainly been used to activate or adjust a pre-designed virtual sound clip rather than to edit the sound signal at run-time.

Display technology: Most AAR systems only augmented the user's auditory sense, with the virtual sounds delivered to the user via headsets or earbuds. Around 40% of the AAR systems combined audio and visual augmentation, for which other display methods (e.g., HMDs and handheld displays) were used.

Room acoustics modeling: Around half of the reviewed AAR systems did not include the room acoustics modeling component. Moreover, those that included room acoustics modeling tended to only create approximate artificial reverberation effects. Additionally, 27% of the AAR systems aimed for outdoor applications, and none of them modeled outdoor reflections.

Spatial sound synthesis: Around half of the reviewed AAR systems created spatial sounds using generic HRTFs. Only a few works employed the user's BRIRs or the user's individual HRTFs.

SEC. 4 will identify and discuss important future research directions to promote AAR technologies and advance AAR applications.

3 APPLICATION DOMAINS OF AAR TECHNOLOGY

In SEC. 2, technologies used to create AAR systems were reviewed. In this section, a range of different real-world applications of AAR technology is reviewed.

Fig. 2. Number of publications from 1995 to 2021 for each application domain.

Based on the literature reviewed, applications of AAR can be broadly divided into seven categories. Most of the reviewed AAR systems have been lab work, and only a handful are available as systems or services in the real world. These application domains are navigation and location-awareness assistance, augmented environmental perception, presentation and display, entertainment and recreation, telepresence applications, education and training, and healthcare. There can be some overlap between these categories. For example, in some telepresence applications (e.g., Mixed Reality remote collaboration), the user needs to localize objects or the other user [101], which involves navigation. In such cases, the work is categorized mainly according to the goal and targeted application scenarios of the original study.

To provide an overview of AAR applications, Fig. 2 shows the number of publications from 1995 to 2021 for each application domain, highlighting an overall increase in the number of AAR studies. This is especially true for applications in navigation and location awareness, followed by entertainment and recreation. This reflects, in many ways, the gradual commercial and research interest in these domains and how it has grown over time with the advent of technology capable of supporting AAR. It was also noticed that several application domains have begun to attract more research interest in recent years, such as education and training and healthcare. In the following subsections, each application domain is reviewed in more detail.

3.1 Navigation and Location-Awareness Assistance

Given the nature of binaural hearing, navigation and location-awareness assistance appears to be the most popular application of AAR technology. Human beings have a limited field-of-view (FoV) of approximately 120° either side of the median plane and about 60° above and below the plane that passes through the eyes. However, this range encompasses the entire FoV, which also includes peripheral vision [143, 144]. Generally, human vision is best within an approximately 60° horizontal and vertical arc in the front. Most of the understanding of the environment outside this "window" tends to come from surrounding sounds [143, 144]. A large part of AAR applications attempts to exploit this natural form of auditory perception to provide information to users.

Spatialization of virtual sound sources can assist with prompt localization of targets, in particular when the targets are outside the user's FoV. Some researchers have used spatialized auditory beacons to provide coarse directional guidance [26, 63, 80], whereas more work aimed to direct the user to specific objects or landmarks [88, 98, 105, 145, 30, 79, 61, 62, 146, 65, 92, 51, 78, 31, 73, 27, 37, 104, 147, 54]. Some experiments have demonstrated adequate localization accuracy when users were directed to an object or position by spatialized virtual sound sources [30, 78, 26, 52, 73, 63, 37, 104]. These experiments demonstrated a promising and intuitive use of AAR technology for navigation and also established the foundation for some other applications, such as augmented environmental perception and telepresence applications.

In this application domain, several projects specifically focused on warning systems. They used spatial sounds to indicate the locations of safety threats [98] or the directions of imminent dangers while driving [79]. Furthermore, navigation and location-awareness assistance can especially help visually impaired people [26, 54].

3.2 Augmented Environmental Perception

AAR technology can assist visually impaired people in understanding ongoing events or noticing surrounding objects in their environment [56, 146, 91, 66, 147, 28]. Note that although environmental perception is largely based on awareness of directions or object locations, it is more about providing users with an overall understanding of their environment. For example, [66] developed an AAR system (see Fig. 3) that can describe surrounding objects (e.g., items on a table and furniture in a hallway) to users, with the audio messages rendered from the object locations.

Fig. 3. Audio augmented reality (AAR) system presented in [66]. This system used visual-inertial tracking, implicit interaction, an audio-only display, pre-modeled room reverberation for acoustics modeling, and generic head-related transfer functions (HRTFs) for binauralization. (a) is adapted from Fig. 2 and (b) is Fig. 3 in [66].

3.3 Presentation and Display

Some research used spatial sounds to vividly convey messages [89, 71] or create a certain ambience [100]. For example, calendar items could be played to the user from different directions to remind them of events at corresponding hours [83]. Similarly, messages could be played at different spatial locations around the user's head based on their time of arrival [89]. In these applications, the spatial sounds are probably not associated with specific objects or the space where the user is, but such a presentation or display can enrich the user's perception of their surrounding activities or events so that they can conveniently select items, receive messages, or monitor progress [89, 83].

Some other researchers have created spatial soundscapes to immerse users in specific scenes [100]. For example, the soundscape of a city scene could be rendered to augment a 360° visual panorama [100]. Overall, AAR technology can enrich the presentation of information and enhance interaction by engaging the user more effectively.

3.4 Entertainment and Recreation

Entertainment and recreation has also been a popular application area for AAR technology. Two major scenarios in this category are summarized.

The first popular scenario is exhibition settings, including museums [55, 68, 57, 69, 58, 123, 106, 53], cultural heritage displays [72, 67], and archaeological sites [84, 64]. In such scenarios, AAR technology could be used to spatialize prefatory or background knowledge and vividly introduce exhibits or re-direct visitors' interest [57, 68, 69, 70, 84, 106, 67]. Alternatively, some implementations use AAR technology to augment the exhibits or scenes themselves by adding content-related virtual sounds [72, 64, 53]. For example, the operating sound of printers was virtually added and spatialized to accompany an old printer exhibit in an exhibition at the MAM Museum [53].

Moreover, AAR technology could also be used in mobile games [59, 60]. In these applications, spatial sounds were used either to provide navigation cues [59] or to render content-related soundscapes [60].

3.5 Telepresence Applications

AAR technology has also been implemented in several telepresence applications [99, 29, 101]. In such applications, spatial audio was commonly used to augment virtual avatars/objects and enable people to easily distinguish who is speaking in a multi-party setting [99, 29, 101]. This assists with remote collaboration tasks and enhances the feeling of human presence.

3.6 Education and Training

AAR technology has been applied in several educational scenarios to impart knowledge and convey information more vividly and effectively [90, 82, 84, 85, 122, 87]. The most popular application is storytelling using AR/MR books [90, 38]. The content and characters in the book can be augmented by story-related spatial sounds, thus improving reader experience and retention of the book. Such a vivid storytelling application was also deployed at cultural heritage sites to enhance visitor experience [85].

Another educational application is to augment teaching material to help students acquire a better understanding of underlying concepts [82, 87]. For example, the revolution pattern of the solar system was presented with 3D audio effects to help students understand the concepts with impressive illustrations [87].

3.6 Education and Training

AAR technology has been applied in several educational scenarios to impart knowledge and convey information more vividly and effectively [90, 82, 84, 85, 122, 87]. The most popular application is storytelling using AR/MR books [90, 38]. The content and characters in the book could be augmented by story-related spatial sounds, thus improving reader experience and retention of the book. Such a vivid storytelling application was also deployed at cultural heritage sites to enhance visitor experience [85].
Another educational application is to augment the teaching material to help students acquire a better understanding of underlying concepts [82, 87]. For example, the revolution pattern of the solar system was presented with 3D audio effects to help students understand the concepts with impressive illustrations [87]. Moreover, AAR technology has also been used to augment natural soundscapes to enhance public understanding of the natural world [122].
Another novel application of AAR technology is tele-coaching for fast-paced tasks such as training users to play tennis [86]. This coaching application is based on the function of navigation and localization using spatial sounds. More specifically, the user's coach could initiate a spatial audio instruction that guided the user to hit the ball toward a specific direction or spot. In the future, it might inspire more applications in sports training, given the advantage of directional and timing guidance by spatial sounds.

3.7 Healthcare

Healthcare is a relatively new application domain of AAR technology. For example, a spatial soundscape that reproduces natural elements of open spaces could be used to enhance people's connection with nature, which could benefit their mental and physical well-being [148]. Such an implementation creates the illusion of being outdoors, which can be especially useful when venturing outside is not possible. Furthermore, it might help visually impaired people who find it difficult to navigate outdoor environments to remain indoors and still experience some aspects of natural outdoor surroundings [148]. In another example, researchers proposed to bring the restorative benefits of outdoor environments to indoor spaces by creating virtual natural soundscapes, which can help to deal with geriatric depression [149].

4 FUTURE RESEARCH DIRECTIONS

The previous two sections reviewed technologies used to build AAR systems and the application domains in which AAR has been studied. Specifically, five technology components are needed for implementing AAR, namely, user-object pose tracking, interaction technology, display technology, room acoustics modeling, and spatial sound synthesis. The increasing research interest in AAR technology has prompted the exploration of its application across seven domains. Among these, navigation and location-awareness assistance has been the most popular application type. In recent years, a significant number of novel applications have also been developed for entertainment and recreation, education and training, healthcare, etc.
From the papers that have been reviewed, a number of important future research directions, which will be discussed in this section, are identified.

4.1 Future AAR Technologies

4.1.1 Tracking

Based on the papers reviewed, it is seen that a number of different sensor types have been used for pose tracking in AAR systems, including visual sensors, inertial sensors, and GPS. However, some pose tracking approaches require an environment to be equipped in advance to enable tracking (e.g., retroreflective tracking using a Vicon system [78, 37]), which limits the use of AAR applications in arbitrary environments. Some pose tracking approaches ask the user to put on some form of obtrusive tracking apparatus (e.g., visual tracking using HMDs [29, 30, 101]). This is necessary in some application scenarios, such as when visual and auditory augmentation are needed together. However, using HMDs might modify the user's HRTFs, thus impacting their AAR experience [150, 151]. Furthermore, using HMDs might also be physically uncomfortable and not socially acceptable [152], which limits the use of AAR.
Therefore, the authors suggest that future research on pose tracking could investigate approaches using lightweight but powerful wearable devices that have already been adopted by consumers. For example, earables, which refer to wearable devices around the ear and head such as hearing aids, earbuds, and electronics-embedded glasses [153], are becoming increasingly popular. Example devices include Nokia eSense [77, 154] and Bose Frames.4 These devices are typically equipped with acoustic and motion sensors that can be exploited for tracking the user's position and orientation. Moreover, such devices typically take the form of earbuds or glasses, which can be conveniently used for audio delivery in everyday work and life. Therefore, it is possible to integrate pose tracking and spatial audio delivery into a single device and perform the required computation on the device too. However, there might be trade-offs in terms of power consumption and latency, which require further research and development in the future.

4 https://www.bose.com/en_us/products/frames.html.
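As a minimal sketch of how an earable's motion sensor could drive spatial audio rendering (an illustrative assumption, not a description of any particular device or reviewed system), the head yaw estimated from gyroscope readings can be subtracted from a virtual source's stored bearing so that the source stays world-anchored while the head turns. A practical tracker would fuse accelerometer and magnetometer data and handle drift; the names and the dead-reckoning integration below are assumptions for illustration.

```python
class HeadYawTracker:
    """Naive yaw-only head tracker driven by gyroscope readings from an earable.

    Integrates the angular rate around the vertical axis (positive = turning
    to the right when viewed from above); real systems would fuse additional
    sensors to limit drift.
    """

    def __init__(self):
        self.yaw_deg = 0.0

    def update(self, gyro_z_dps, dt_s):
        self.yaw_deg = (self.yaw_deg + gyro_z_dps * dt_s) % 360.0
        return self.yaw_deg


def relative_azimuth(source_bearing_deg, head_yaw_deg):
    """Azimuth of a world-anchored source relative to the listener's facing direction.

    Returned angle lies in (-180, 180]: 0 = straight ahead, positive = to the right.
    """
    return (source_bearing_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# Example: a source fixed at a bearing of 30 degrees while the head turns.
tracker = HeadYawTracker()
for gyro_sample in [10.0] * 100:           # 10 deg/s for 100 steps of 10 ms
    yaw = tracker.update(gyro_sample, 0.01)
print(relative_azimuth(30.0, yaw))          # the source now appears about 20 deg to the right
```

The resulting relative azimuth is what a binaural renderer needs each frame to keep virtual sounds stable in the environment rather than glued to the head.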


4.1.2 Interaction and Display Technologies

Approximately half of the AAR systems that have been reviewed implemented implicit interaction. This form of interaction is well suited to certain scenarios, such as when performing an attention-critical task. Moreover, visually impaired people might find implicit interaction helpful, but an implementation of voice input or a well-designed tangible interface is recommended to allow for more flexible and personalized control of the system. Future research can also explore real-time audio editing using interaction technologies, which might open up more opportunities to enrich AAR systems and their applications.
Although interaction and display are two distinct aspects of AAR systems, there exists a close functional relationship between the two. For example, [85] deployed their AAR system on a mobile device while enabling interaction via a mobile application. Most of the current AAR systems use audio-only displays. A future research direction for display technology could be the integration of several senses into one AAR system. For example, if electronics-embedded glasses are used in AAR applications, visual augmentation and vibrations might be added to enrich the user's experience.

4.1.3 Room Acoustics Modeling

As covered in SEC. 2, most of the AAR systems that have been reviewed did not include a room acoustics modeling component. Those that included this component tended to model only rough artificial reverberation to create the feeling of being in a room. However, as has also been discussed, some research has studied room acoustics modeling methods, but these methods have not been employed in AAR systems yet. When modeling environmental acoustics for AAR applications, one must take into account the computational costs and the quality of the modeled acoustic environment. Computationally efficient approaches are favored because the desired room acoustics could vary in real time and need to be adapted to the user's movement in the environment. In other words, online methods for room acoustics modeling can better fit most AAR applications. To obtain satisfactory room acoustics modeling in an efficient manner, it could be worth exploring parametric approaches to code sound fields and improving computational techniques on wearable or mobile devices.
The time efficiency in computation might sacrifice the quality of the modeled room acoustics to some extent. However, it is also not necessary to achieve perfect modeling, for two reasons. First, because human auditory perception is only sensitive to a certain level of difference between sounds, users might not recognize the differences between the modeled sounds and those present in the real world as long as the differences fall within certain perceptual limits. These perceptual limits are known as the just noticeable difference (JND) [155]. JND thresholds are typically measured for different parameters (e.g., reverberation time, early decay time, center time, etc.) [156, 157]. Moreover, when measuring JNDs under different conditions (e.g., different rooms, participants, audio frequencies, etc.), the resultant JNDs may also be different. Therefore, there exist different JND standards. From one perspective, the authors suggest more investigations into JNDs to provide more detailed standards for different parameters under different conditions. From another perspective, researchers can follow relatively strict standards from the literature when designing and evaluating their AAR systems.
The second reason that it is not necessary to achieve perfect room acoustics modeling is that the visual sense tends to work in concert with the auditory sense to perceive the environment as a whole. More specifically, research [158] shows that a perceptually adequate acoustic environment is likely to suffice in AR applications in which the user can also perceive their surroundings by seeing the real space. In the future, more studies are needed to investigate the required precision of acoustics modeling in different application scenarios. Additionally, more efforts should be made to develop new acoustics modeling algorithms, especially for situations in which the movement of the user and the surrounding objects is arbitrary.

4.1.4 Spatial Sound Synthesis

From the reviewed AAR systems, it can be seen that most of them used generic HRTFs to synthesize binaural sounds. As mentioned earlier, individualized HRTFs can produce better localization and a more immersive experience for users. One important future research direction is to explore methods that can enable the convenient capture and implementation of individual HRTFs. To this end, there have been some attempts in recent years. Because a person's anthropometric features (e.g., head width, shoulder width, and pinna height) and the corresponding HRTFs are closely related [159], some researchers have explored techniques that first acquire a user's anthropometric parameters [160, 161] and then approximate the matching HRTFs using a numerical sound propagation solver [160], numerical acoustic simulations [162], or neural network-based regression algorithms [163, 164]. In the commercial space, Sony has implemented personalization using an application that visually scans the ears to enable tailor-made immersive experiences.5 Future research could explore methods that can make this process faster, more accurate, and easier to implement.
SEC. 2.5 reviewed the use of BRIRs for spatial sound synthesis. Although directly measuring BRIRs provides an alternative to acquiring room acoustic effects and individual HRTFs separately, it has some limitations. For example, the measurement needs to be conducted in the desired environment for a specific user. To address this restriction, future research can explore techniques that adapt BRIRs measured in one room to a different room, a different listener, and an arbitrary sound source. Previous work has presented an adaptive algorithm that applied different equalizations to different reverberation stages [165], and more research is needed to advance the generalization of BRIRs.
This review paper has discussed room acoustics modeling and spatial sound synthesis separately. This is because many of the reviewed AAR systems did not include room acoustics modeling. However, combining room acoustics modeling and spatial sound synthesis is necessary for providing an appropriate sense of space and engagement in an environment. The acquisition or simulation of BRIRs could be a way of combining these two approaches, and more research is needed to promote individualized binaural spatialization with environmental acoustics.

5 https://www.sony.co.nz/electronics/360-reality-audio.
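To make the rendering step that generic or individualized HRTFs feed into concrete, the sketch below shows a minimal binaural synthesis step under the assumption that a set of measured head-related impulse responses (HRIRs) is available, for example from a database such as the CIPIC set [132]: the pair measured closest to the target direction is selected and convolved with a mono source. The array layout and function names are illustrative assumptions; a production renderer would interpolate between neighboring measurements and update the filters as the listener or source moves.

```python
import numpy as np
from scipy.signal import fftconvolve

def nearest_hrir(hrir_set, directions_deg, azimuth_deg, elevation_deg):
    """Pick the measured HRIR pair whose direction is closest to the target.

    hrir_set: array of shape (n_directions, 2, ir_length) with left/right impulse responses.
    directions_deg: array of shape (n_directions, 2) holding (azimuth, elevation) in degrees.
    A real renderer would interpolate neighboring measurements instead of snapping.
    """
    target = np.radians([azimuth_deg, elevation_deg])
    dirs = np.radians(np.asarray(directions_deg, dtype=float))
    # Cosine of the angular distance on the sphere between the target and each measurement.
    cos_d = (np.sin(dirs[:, 1]) * np.sin(target[1]) +
             np.cos(dirs[:, 1]) * np.cos(target[1]) * np.cos(dirs[:, 0] - target[0]))
    return hrir_set[int(np.argmax(cos_d))]

def binauralize(mono, hrir_pair):
    """Convolve a mono signal with the left/right HRIRs to obtain a two-channel binaural signal."""
    left = fftconvolve(mono, hrir_pair[0])
    right = fftconvolve(mono, hrir_pair[1])
    return np.stack([left, right], axis=-1)
```

Swapping the measurement set from a generic dummy-head database to filters approximated from a listener's own anthropometry is exactly the personalization step discussed above; the rendering loop itself stays the same.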


4.2 Future AAR Applications

In previous sections, it was seen that the most fundamental affordance of AAR technology is navigation and localization. Given the human nature of binaural hearing, and as the foundation of some other applications (e.g., sports training and telepresence applications), navigation and location-awareness assistance is anticipated to remain one of the most intuitive and popular AAR applications in the future. Along with the development of AAR technologies, such as more comfortable tracking apparatus with a long battery life, navigation-related and localization-related applications might become widely accepted in everyday life.
Another popular AAR application in the future could lie in the field of AR-mediated remote work, education, gaming, and social activities. Nowadays, using video-audio conferencing technologies has become a new normal for conducting business and social activities remotely. The COVID-19 pandemic has resulted in the widespread adoption of conferencing platforms in everyday life. To create the feeling of belonging and connectedness that people experience in face-to-face interactions, AR technologies, involving both visual and audio augmentation, are being explored to enable "natural interactions" that transcend distances.
AR technology has been widely adopted in industrial applications, but audio augmentation is usually ignored. Spatialized sounds could be used for reporting device status in routine maintenance and error diagnostics systems [166, 167]. Compared to vision augmentation, audio augmentation still remains under-employed in industrial scenarios. Future studies can be conducted to investigate the potential of using AAR technology to enhance industrial activities.
In this review, a few novel trials of using AAR for healthcare have been discussed. In the future, the authors see great potential to extend explorations in this direction. For example, mobile music therapy could be a field worth exploring. Clinical research has shown that the spatial configuration of physical instruments could help to attract users' focus and guide their movements in some therapeutic exercises [168]. These insights indicate remarkable potential for exploring AAR technology to create virtual spatial soundscapes for music therapeutic applications.
Overall, existing work has shown a broad landscape of AAR applications, and more extensive use of AAR technology is expected to be seen in the coming years. With the ubiquity of mobile and/or wearable devices on the rise, AAR has the potential to significantly help everyday work and life in several ways.

5 CONCLUSION

In this paper, the development of AAR technology was summarized by reviewing a range of research papers published over the last few decades. Five techniques for implementing AAR were first reviewed. Overall, the quality of audio augmentation appears to have steadily improved over time. This is a result of a greater amount of AAR research that has contributed to a number of modeling methods to replicate human auditory perception, model room acoustics, etc. The development of allied sensing technology, the availability of low-cost high-performance hardware, and exponential increases in computing power have also played a significant role in the advance of AAR technology. Technical advancements have enabled AAR systems to be integrated into mobile devices, such as smartphones and hearables.6 These advances have contributed to, and continue to contribute to, the development of AAR across a range of applications.

6 https://www.bragi.com/.

This review also demonstrated that there appear to be seven domains within which the application of AAR has been studied. The most fundamental and popular application of AAR is navigation and location-awareness assistance, which also provides the basis for extended applications in some other fields. More recently, AAR appears to be gaining a foothold in the healthcare industry. There is also a huge untapped potential for using AAR in remote collaborative environments for work, study, and social activities.
Overall, this survey provided a systematic review of the research that has been conducted in the domain of AAR. After reviewing existing AAR systems, the relevant technological methods, the areas that may benefit from application-based research, and future research directions for advancing each AAR technology component and application were also identified. The authors hope researchers and practitioners can derive inspiration from this review when they plan for related work in the future. AAR has the potential to benefit numerous aspects of people's lives. The authors hope that it becomes more widely used in the future to enable working better, staying more connected, and living healthier.

6 REFERENCES

[1] R. T. Azuma, "A Survey of Augmented Reality," Presence (Camb), vol. 6, no. 4, pp. 355–385 (1997 Aug.). https://doi.org/10.1162/pres.1997.6.4.355.
[2] F. P. Brooks, Jr., "The Computer Scientist as Toolsmith II," Commun. ACM, vol. 39, no. 3, pp. 61–68 (1996 Mar.). https://doi.org/10.1145/227234.227243.
[3] S. Feiner, B. Macintyre, and D. Seligmann, "Knowledge-Based Augmented Reality," Commun. ACM, vol. 36, no. 7, pp. 53–62 (1993 Jul.). https://doi.org/10.1145/159544.159587.
[4] M. Funk, T. Kosch, R. Kettner, O. Korn, and A. Schmidt, "motionEAP: An Overview of 4 Years of Combining Industrial Assembly With Augmented Reality for Industry 4.0," in Proceedings of the International Conference on Knowledge Technologies and Data-Driven Business, paper 4 (Graz, Austria) (2016 Oct.).
[5] M. Funk, A. Bächler, L. Bächler, et al., "Working With Augmented Reality? A Long-Term Analysis of In-Situ Instructions at the Assembly Workplace," in Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, pp. 222–229 (Rhodes, Greece) (2017 Jun.). https://doi.org/10.1145/3056540.3056548.
[6] S. Claudino Daffara, A. Brewer, B. Thoravi Kumaravel, and B. Hartmann, "Living Paper: Authoring AR Narratives Across Digital and Tangible Media," in Extended Abstracts of the Conference on Human Factors in Computing Systems, pp. 1–10 (Honolulu, HI) (2020 Apr.). https://doi.org/10.1145/3334480.3383091.
[7] D. Pérez-López and M. Contero, "Delivering Educational Multimedia Contents Through an Augmented Reality Application: A Case Study on Its Impact on Knowledge Acquisition and Retention," Turkish Online J. Educ. Technol., vol. 12, no. 4, pp. 19–28 (2013 Oct.).
[8] M. Rusiñol, J. Chazalon, and K. Diaz-Chito, "Augmented Songbook: An Augmented Reality Educational Application for Raising Music Awareness," Multimed. Tools


Appl., vol. 77, no. 11, pp. 13773–13798 (2018 Jul.). of ISMAR (2008–2017),” IEEE Trans. Vis. Comput.
https://doi.org/10.1007/s11042-017-4991-4. Graph., vol. 24, no. 11, pp. 2947–2962 (2018 Nov.).
[9] W. Piekarski and B. Thomas, “ARQuake: The https://doi.org/10.1109/TVCG.2018.2868591.
Outdoor Augmented Reality Gaming System,” Com- [20] J. Yang, Audio-Facilitated Human Interaction with
mun. ACM, vol. 45, no. 1, pp. 36–38 (2002 Jan.). the Environment: Advancements in Audio Augmented Re-
https://doi.org/10.1145/502269.502291. ality and Auditory Notification Delivery, Ph.D. thesis, ETH
[10] B. H. Thomas, “A Survey of Visual, Mixed, Zurich, Zurich, Switzerland (2021 Nov.).
and Augmented Reality Gaming,” Comput. En- [21] J. Blauert, Spatial Hearing: The Psychophysics of
tertain., vol. 10, no. 1, paper 3 (2012 Oct.). Human Sound Localization (MIT Press, Cambridge, MA,
https://doi.org/10.1145/2381876.2381879. 1997). https://doi.org/10.7551/mitpress/6391.001.0001.
[11] S. R. Fussell, R. E. Kraut, and J. Siegel, “Co- [22] P. Vávra, J. Roman, P. Zonča, et al., “Recent De-
ordination of Communication: Effects of Shared Vi- velopment of Augmented Reality in Surgery: A Review,”
sual Context on Collaborative Work,” in Proceedings of J. Healthc. Eng., vol. 2017, paper 4574172 (2017 Aug.).
the ACM Conference on Computer Supported Cooper- https://doi.org/10.1155/2017/4574172.
ative Work, pp. 21–30 (Philadelphia, PA) (2000 Dec.). [23] X. Li, W. Yi, H.-L. Chi, X. Wang, and A.
https://doi.org/10.1145/358916.358947. P. C. Chan, “A Critical Review of Virtual and Aug-
[12] G. A. Lee, T. Teo, S. Kim, and M. Billinghurst, mented Reality (VR/AR) Applications in Construction
“A User Study on MR Remote Collaboration Using Safety,” Autom. Constr., vol. 86, pp. 150–162 (2018 Feb.).
Live 360 Video,” in Proceedings of the IEEE In- https://doi.org/10.1016/j.autcon.2017.11.003.
ternational Symposium on Mixed and Augmented Re- [24] D. Szymczak, K. Rassmus-Gröhn, C. Mag-
ality, pp. 153–164 (Munich, Germany) (2018 Oct.). nusson, and P.-O. Hedvall, “A Real-World Study
https://doi.org/10.1109/ISMAR.2018.00051. of an Audio-Tactile Tourist Guide,” in Proceedings
[13] T. Piumsomboon, G. A. Lee, J. D. Hart, et of the 14th International Conference on Human-
al., “Mini-Me: An Adaptive Avatar for Mixed Re- Computer Interaction With Mobile Devices and Ser-
ality Remote Collaboration,” in Proceedings of the vices, pp. 335–344 (San Francisco, CA) (2012 Sep.).
CHI Conference on Human Factors in Computing https://doi.org/10.1145/2371574.2371627.
Systems, paper 46 (Montreal, Canada) (2018 Apr.). [25] M. Billinghurst, A. Clark, and G. Lee, “A Sur-
https://doi.org/10.1145/3173574.3173620. vey of Augmented Reality,” Found. Trends Hum.-Comput.
[14] T. Teo, L. Lawrence, G. A. Lee, M. Billinghurst, and Interact., vol. 8, no. 2-3, pp. 73–272 (2015 Mar.).
M. Adcock, “Mixed Reality Remote Collaboration Com- http://doi.org/10.1561/1100000049.
bining 360 Video and 3D Reconstruction,” in Proceed- [26] S. Blessenohl, C. Morrison, A. Criminisi, and J.
ings of the CHI Conference on Human Factors in Com- Shotton, “Improving Indoor Mobility of the Visually Im-
puting Systems, paper 201 (Glasgow, UK) (2019 May). paired With Depth-Based Spatial Sound,” in Proceedings
https://doi.org/10.1145/3290605.3300431. of the IEEE International Conference on Computer Vision
[15] T. Teo, G. A. Lee, M. Billinghurst, and Workshops, pp. 418–426 (Santiago, Chile) (2015 Dec.).
M. Adcock, “Hand Gestures and Visual Annota- https://doi.org/10.1109/ICCVW.2015.62.
tion in Live 360 Panorama-Based Mixed Reality [27] E. Schoop, J. Smith, and B. Hartmann, “HindSight:
Remote Collaboration,” in Proceedings of the 30th Enhancing Spatial Awareness by Sonifying Detected Ob-
Australian Conference on Computer-Human Interac- jects in Real-Time 360-Degree Video,” in Proceedings
tion, pp. 406–410 (Melbourne, Australia) (2018 Dec.). of the CHI Conference on Human Factors in Comput-
https://doi.org/10.1145/3292147.3292200. ing Systems, paper 143 (Montreal, Canada) (2018 Apr.).
[16] R. Azuma, Y. Baillot, R. Behringer, et al., “Re- https://doi.org/10.1145/3173574.3173717.
cent Advances in Augmented Reality,” IEEE Comput. [28] O. B. Kaul, K. Behrens, and M. Rohs, “Mobile
Graph. Appl., vol. 21, no. 6, pp. 34–47 (2001 Dec.). Recognition and Tracking of Objects in the Environment
https://doi.org/10.1109/38.963459. Through Augmented Reality and 3D Audio Cues for
[17] A. Dey, M. Billinghurst, R. W. Lindeman, People With Visual Impairments,” in Extended Abstracts
and II J. E. Swan, “A Systematic Review of 10 of the CHI Conference on Human Factors in Comput-
Years of Augmented Reality Usability Studies: 2005 to ing Systems, paper 394 (Yokohama, Japan) (2021 May).
2014,” Front. Robot. AI, vol. 5, paper 37 (2018 Apr.). https://doi.org/10.1145/3411763.3451611.
https://doi.org/10.3389/frobt.2018.00037. [29] Z. Zhou, A. D. Cheok, X. Yang, and Y.
[18] F. Zhou, H. B.-L. Duh, and M. Billinghurst, “Trends Qiu, “An Experimental Study on the Role of 3D
in Augmented Reality Tracking, Interaction and Display: Sound in Augmented Reality Environment,” Interact.
A Review of Ten Years of ISMAR,” in Proceedings of Comput., vol. 16, no. 5, pp. 989–1016 (2004 Oct.).
the 7th IEEE/ACM International Symposium on Mixed and https://doi.org/10.1016/j.intcom.2004.06.014.
Augmented Reality, pp. 193–202 (Washington, D.C.) (2008 [30] J. Sodnik, S. Tomazic, R. Grasset, A. Duenser,
Sep.). https://doi.org/10.1109/ISMAR.2008.4637362. and M. Billinghurst, “Spatial Sound Localization in
[19] K. Kim, M. Billinghurst, G. Bruder, H. B.-L. an Augmented Reality Environment,” in Proceedings
Duh, and G. F. Welch, “Revisiting Trends in Aug- of the 18th Australia Conference on Computer-Human
mented Reality Research: A Review of the 2nd Decade Interaction: Design: Activities, Artefacts and Environ-


ments, pp. 111–118 (Sydney, Australia) (2006 Nov.). [42] K. Xu, K. W. Chia, and A. D. Cheok, “Real-
https://doi.org/10.1145/1228175.1228197. Time Camera Tracking for Marker-Less and Unpre-
[31] D. Rumiński, “An Experimental Study pared Augmented Reality Environments,” Image Vis.
of Spatial Sound Usefulness in Searching and Comput., vol. 26, no. 5, pp. 673–689 (2008 May).
Navigating Through AR Environments,” Virtual https://doi.org/10.1016/j.imavis.2007.08.015.
Real., vol. 19, no. 3, pp. 223–233 (2015 Nov.). [43] F. Ababsa and M. Mallem, “A Robust Circular Fidu-
https://doi.org/10.1007/s10055-015-0274-4. cial Detection Technique and Real-Time 3D Camera Track-
[32] G. Gordon, M. Billinghurst, M. Bell, et al., “The ing,” J. Multimed., vol. 3, no. 4, pp. 34–41 (2008 Oct.).
Use of Dense Stereo Range Data in Augmented Reality,” [44] A. Loquercio, M. Dymczyk, B. Zeisl, et al., “Ef-
in Proceedings of the International Symposium on Mixed ficient Descriptor Learning for Large Scale Localization,”
and Augmented Reality, pp. 14–23 (Darmstadt, Germany) in Proceedings of International Conference on Robotics
(2002 Oct.). http://doi.org/10.1109/ISMAR.2002.1115063. and Automation, pp. 3170–3177 (Singapore) (2017 May).
[33] R. A. Newcombe, S. Izadi, O. Hilliges, et https://doi.org/10.1109/ICRA.2017.7989359.
al., “KinectFusion: Real-Time Dense Surface Mapping [45] G. Younes, D. Asmar, I. Elhajj, and H. Al-
and Tracking,” in Proceedings of the 10th IEEE In- Harithy, “Pose Tracking for Augmented Reality Ap-
ternational Symposium on Mixed and Augmented Re- plications in Outdoor Archaeological Sites,” J. Elec-
ality, pp. 127–136 (Basel, Switzerland) (2011 Oct.). tron. Imag., vol. 26, no. 1, paper 011004 (2016 Oct.).
https://doi.org/10.1109/ISMAR.2011.6092378. https://doi.org/10.1117/1.JEI.26.1.011004.
[34] B. W. Babu, S. Kim, Z. Yan, and L. Ren, “σ-DVO: [46] W. Ma, H. Xiong, X. Dai, X. Zheng, and Y. Zhou,
Sensor Noise Model Meets Dense Visual Odometry,” in “An Indoor Scene Recognition-Based 3D Registration
Proceedings of the International Symposium on Mixed and Mechanism for Real-Time AR-GIS Visualization in Mobile
Augmented Reality, pp. 18–26 (Merida, Mexico) (2016 Applications,” ISPRS Int. J. Geo-Inf., vol. 7, no. 3, paper
Oct.). http://doi.org/10.1109/ISMAR.2016.11. 112 (2018 Mar.). https://doi.org/10.3390/ijgi7030112.
[35] H. Jiang, D. Weng, Z. Zhang, Y. Bao, Y. Jia, [47] J. Rambach, C. Deng, A. Pagani, and D. Stricker,
and M. Nie, “HiKeyb: High-Efficiency Mixed Real- “Learning 6DoF Object Poses From Synthetic Single
ity System for Text Entry,” in Proceedings of Inter- Channel Images,” in Proceedings of the IEEE Inter-
national Symposium on Mixed and Augmented Reality national Symposium on Mixed and Augmented Reality
Adjunct, pp. 132–137 (Munich, Germany) (2018 Oct.). Adjunct, pp. 164–169 (Munich, Germany) (2018 Oct.).
http://doi.org/10.1109/ISMAR-Adjunct.2018.00051. https://doi.org/10.1109/ISMAR-Adjunct.2018.00058.
[36] Z. Yuan, K. Cheng, J. Tang, and X. Yang, [48] C.-Y. Tsai, K.-J. Hsu, and H. Nisar, “Effi-
“RGB-D DSO: Direct Sparse Odometry With cient Model-Based Object Pose Estimation Based on
RGB-D Cameras for Indoor Scenes,” IEEE Trans. Multi-Template Tracking and PnP Algorithms,” Al-
Multimed, vol. 24, pp. 4092–4101 (2021 Sep.). gorithms, vol. 11, no. 8, paper 122 (2018 Aug.).
http://doi.org/10.1109/TMM.2021.3114546. https://doi.org/10.3390/a11080122.
[37] J. Yang, Y. Frank, and G. Sörös, “Hearing Is Believ- [49] H. Huang, F. Zhong, Y. Sun, and X. Qin, “An
ing: Synthesizing Spatial Audio From Everyday Objects to Occlusion-Aware Edge-Based Method for Monocular 3D
Users,” in Proceedings of the 10th Augmented Human In- Object Tracking Using Edge Confidence,” in Comput.
ternational Conference, paper 28 (Reims France) (2019 Graph. Forum, vol. 39, no. 7, pp. 399–409 (2020 Nov.).
Mar.). https://doi.org/10.1145/3311823.3311872. https://doi.org/10.1111/cgf.14154.
[38] R. Grasset, A. Duenser, H. Seichter, and [50] M. Ortega, E. Ivorra, A. Juan, et al., “MANTRA:
M. Billinghurst, “The Mixed Reality Book: A New An Effective System Based on Augmented Reality
Multimedia Reading Experience,” in Extended Ab- and Infrared Thermography for Industrial Maintenance,”
stracts on CHI Human Factors in Computing Sys- Appl. Sci., vol. 11, no. 1, paper 385 (2021 Jan.).
tems, pp. 1953–1958 (San Jose, CA) (2007 Apr.). https://doi.org/10.3390/app11010385.
https://doi.org/10.1145/1240866.1240931. [51] J. R. Blum, M. Bouchard, and J. R. Coop-
[39] U. Neumann and S. You, “Natural Feature erstock, “Spatialized Audio Environmental Awareness
Tracking for Augmented Reality,” IEEE Trans. Mul- for Blind Users With a Smartphone,” Mobile Netw.
timed., vol. 1, no. 1, pp. 53–64 (1999 Mar.). Appl., vol. 18, no. 3, pp. 295–309 (2013 Jun.).
http://doi.org/10.1109/6046.748171. https://doi.org/10.1007/s11036-012-0425-8.
[40] Y. K. Yu, K. H. Wong, M. M. Y. Chang, [52] F. Heller, J. Jevanesan, P. Dietrich, and J. Borchers,
and S.-H. Or, “Recursive Camera-Motion Estimation “Where Are We?: Evaluating the Current Rendering Fi-
With the Trifocal Tensor,” IEEE Trans. Syst. Man Cy- delity of Mobile Audio Augmented Reality Systems,”
bern., vol. 36, no. 5, pp. 1081–1090 (2006 Oct.). in Proceedings of the 18th International Conference
http://doi.org/10.1109/TSMCB.2006.874133. on Human-Computer Interaction With Mobile Devices
[41] G. Klein and D. Murray, “Parallel Tracking and and Services, pp. 278–282 (Florence, Italy) (2016 Sep.).
Mapping for Small AR Workspaces,” in Proceedings of the http://doi.org/10.1145/2935334.2935365.
6th IEEE and ACM International Symposium on Mixed and [53] F. Z. Kaghat, A. Azough, M. Fakhour, and M.
Augmented Reality, pp. 225–234 (Nara, Japan) (2007 Oct.). Meknassi, “A New Audio Augmented Reality Interac-
https://doi.org/10.1109/ISMAR.2007.4538852. tion and Adaptation Model for Museum Visits,” Com-


put. Electr. Eng., vol. 84, paper 106606 (2020 Jun.). put. Commun. Appl., vol. 14, no. 3, paper 74 (2018 Aug.).
https://doi.org/10.1016/j.compeleceng.2020.106606. https://doi.org/10.1145/3230652.
[54] R. Guarese, F. Bastidas, J. Becker, et al., [65] B. F. Katz, S. Kammoun, G. Parseihian,
“Cooking in the Dark: Exploring Spatial Au- et al., “NAVIG: Augmented Reality Guidance Sys-
dio as MR Assistive Technology for the Visually tem for the Visually Impaired: Combining Ob-
ImpairedHuman-Computer Interaction – INTERACT ject Localization, GNSS, and Spatial Audio,” Vir-
2021, Lecture Notes in Computer Science, vol. 12936, tual Real., vol. 16, no. 4, pp. 253–269 (2012 Nov.).
pp. 318–322 (Springer, Cham, Switzerland, 2021). https://doi.org/10.1007/s10055-012-0213-6.
http://doi.org/10.1007/978-3-030-85607-6_29. [66] F. Ribeiro, D. Florêncio, P. A. Chou, and Z.
[55] B. B. Bederson, “Audio Augmented Real- Zhang, “Auditory Augmented Reality: Object Sonifica-
ity: A Prototype Automated Tour Guide,” in Con- tion for the Visually Impaired,” in Proceedings of the
ference Companion on Human Factors in Comput- IEEE 14th International Workshop on Multimedia Sig-
ing Systems, pp. 210–211 (Denver, CO) (1995 May). nal Processing, pp. 319–324 (Banff, Canada) (2012 Sep.).
https://doi.org/10.1145/223355.223526. http://doi.org/10.1109/MMSP.2012.6343462.
[56] E. D. Mynatt, M. Back, R. Want, and R. Fred- [67] M. Comunità, A. Gerino, V. Lim, and L. Pic-
erick, “Audio Aura: Light-Weight Audio Augmented inali, “Design and Evaluation of a Web- and Mobile-
Reality,” in Proceedings of the 10th Annual ACM Based Binaural Audio Platform for Cultural Heritage,”
Symposium on User Interface Software and Tech- Appl. Sci., vol. 11, no. 4, paper 1540 (2021 Feb.).
nology, pp. 211–212 (Banff, Canada) (1997 Oct.). http://doi.org/10.3390/app11041540.
https://doi.org/10.1145/263407.264218. [68] M. Hatala, L. Kalantari, R. Wakkary, and K. Newby,
[57] L. Terrenghi and A. Zimmermann, “Tailored Au- “Ontology and Rule Based Retrieval of Sound Objects
dio Augmented Environments for Museums,” in Proceed- in Augmented Audio Reality System for Museum Visi-
ings of the 9th International Conference on Intelligent User tors,” in Proceedings of the ACM Symposium on Applied
Interfaces, pp. 334–336 (Funchal, Portugal) (2004 Jan.). Computing, pp. 1045–1050 (Nicosia, Cyprus) (2004 Mar.).
https://doi.org/10.1145/964442.964523. https://doi.org/10.1145/967900.968114.
[58] A. Zimmermann and A. Lorenz, “LISTEN: A User- [69] M. Hatala and R. Wakkary, “Ontology-Based
Adaptive Audio-Augmented Museum Guide,” User Model. User Modeling in an Augmented Audio Reality Sys-
User-Adap. Interact., vol. 18, no. 5, pp. 389–416 (2008 tem for Museums,” User Model. User-Adap. Inter-
Nov.). https://doi.org/10.1007/s11257-008-9049-x. act., vol. 15, no. 3, pp. 339–380 (2005 Aug.).
[59] T. Chatzidimitris, D. Gavalas, and D. https://doi.org/10.1007/s11257-005-2304-5.
Michael, “SoundPacman: Audio Augmented Re- [70] R. Wakkary and M. Hatala, “Situated Play in a Tan-
ality in Location-Based Games,” in Proceedings gible Interface and Adaptive Audio Museum Guide,” Pers.
of the 18th Mediterranean Electrotechnical Con- Ubiquit. Comput., vol. 11, no. 3, pp. 171–191 (2007 Mar.).
ference, pp. 1–6 (Limassol, Cyprus) (2016 Apr.). https://doi.org/10.1007/s00779-006-0101-8.
http://doi.org/10.1109/MELCON.2016.7495414. [71] T. Langlotz, H. Regenbrecht, S. Zollmann, and D.
[60] E. Rovithis, N. Moustakas, A. Floros, and K. Schmalstieg, “Audio Stickies: Visually-Guided Spatial Au-
Vogklis, “Audio Legends: Investigating Sonic Interac- dio Annotations on a Mobile Augmented Reality Plat-
tion in an Augmented Reality Audio Game,” Multimodal form,” in Proceedings of the 25th Australian Computer-
Technol. Interact., vol. 3, no. 4, paper 73 (2019 Nov.). Human Interaction Conference: Augmentation, Applica-
http://doi.org/10.3390/mti3040073. tion, Innovation, Collaboration, pp. 545–554 (Adelaide,
[61] B. N. Walker and J. Lindsay, “Navigation Per- Australia) (2013 Nov.). https://doi.org/10.1145/2541016.
formance With a Virtual Auditory Display: Effects of 2541022.
Beacon Sound, Capture Radius, and Practice,” Hum. [72] F. Heller, T. Knott, M. Weiss, and J. Borchers,
Factors, vol. 48, no. 2, pp. 265–278 (2006 Jun.). “Multi-User Interaction in Virtual Audio Spaces,” in Ex-
https://doi.org/10.1518/001872006777724507. tended Abstracts on CHI Human Factors in Comput-
[62] C. Stahl, “The Roaring Navigator: A Group ing Systems, pp. 4489–4494 (Boston, MA) (2009 Apr.).
Guide for the Zoo With Shared Auditory Landmark Dis- https://doi.org/10.1145/1520340.1520688.
play,” in Proceedings of the 9th International Confer- [73] S. Russell, G. Dublon, and J. A. Paradiso,
ence on Human Computer Interaction With Mobile De- “HearThere: Networked Sensory Prosthetics Through
vices and Services, pp. 383–386 (Singapore) (2007 Sep.). Auditory Augmented Reality,” in Proceedings of
https://doi.org/10.1145/1377999.1378042. the 7th Augmented Human International Confer-
[63] F. Heller and J. Schöning, “NavigaTone: Seamlessly ence, paper 20 (Geneva, Switzerland) (2016 Feb.).
Embedding Navigation Cues in Mobile Music Listening,” https://doi.org/10.1145/2875194.2875247.
in Proceedings of the SIGCHI Conference on Human Fac- [74] R. Kapoor, S. Ramasamy, A. Gardi, and R. Saba-
tors in Computing Systems, paper 637 (Montreal, Canada) tini, “A Bio-Inspired Acoustic Sensor System for UAS
(2018 Apr.). https://doi.org/10.1145/3173574.3174211. Navigation and Tracking,” in Proceedings of the 36th
[64] M. Sikora, M. Russo, J. −Derek, and A. Jurčević, Digital Avionics Systems Conference, pp. 1–7 (St. Peters-
“Soundscape of an Archaeological Site Recreated With burg, FL) (2017 Sep.). https://doi.org/10.1109/DASC.2017.
Audio Augmented Reality,” ACM Trans. Multimed. Com- 8102080.


[75] C. Evers and P. A. Naylor, “Acoustic Cultural Heritage, Lecture Notes in Computer Science, vol.
SLAM,” IEEE/ACM Trans. Audio Speech Lang. Pro- 10754, pp. 117–129 (Springer, Cham, Switzerland, 2018).
cess., vol. 26, no. 9, pp. 1484–1498 (2018 Sep.). https://doi.org/10.1007/978-3-319-75789-6_9.
https://doi.org/10.1109/TASLP.2018.2828321. [86] Y. Kim, S. Hong, and G. J. Kim, “Augmented
[76] A. Terán Espinoza, Acoustic-Inertial Forward-Scan Reality-Based Remote Coaching for Fast-Paced Physical
Sonar Simultaneous Localization and Mapping, Master’s Task,” Virtual Real., vol. 22, no. 1, pp. 25–36 (2018 Mar.).
thesis, KTH Royal Institute of Technology, Stockholm, https://doi.org/10.1007/s10055-017-0315-2.
Sweden (2020 Sep.). [87] K. M. Sagayam, A. J. Timothy, C. C. Ho, L. E.
[77] F. Kawsar, C. Min, A. Mathur, and A. Montanari, Henesey, and R. Bestak, “Augmented Reality-Based Solar
“Earables for Personal-Scale Behavior Analytics,” IEEE System for E-Magazine With 3-D Audio Effect,” Int. J.
Pervasive Comput., vol. 17, no. 3, pp. 83–89 (2018 Oct.). Simul. Process. Model., vol. 15, no. 6, pp. 524–534 (2021
https://doi.org/10.1109/MPRV.2018.03367740. Jan.). http://doi.org/10.1504/IJSPM.2020.112460.
[78] F. Heller, A. Krämer, and J. Borchers, “Simpli- [88] R. Behringer, C. Tam, J. McGee, S. Sundareswaran,
fying Orientation Measurement for Mobile Audio Aug- and M. Vassiliou, “A Wearable Augmented Reality
mented Reality Applications,” in Proceedings of the Testbed for Navigation and Control, Built Solely With
SIGCHI Conference on Human Factors in Computing Commercial-off-the-Shelf (COTS) Hardware,” in Proceed-
Systems, pp. 615–624 (Toronto, Canada) (2014 Apr.). ings IEEE and ACM International Symposium on Aug-
https://doi.org/10.1145/2556288.2557021. mented Reality, pp. 12–19 (Munich, Germany) (2000 Oct.).
[79] M. Tonnis and G. Klinker, “Effective Control of https://doi.org/10.1109/ISAR.2000.880918.
a Car Driver’s Attention for Visual and Acoustic Guid- [89] N. Sawhney and C. Schmandt, “Nomadic Radio:
ance Towards the Direction of Imminent Dangers,” in Pro- Speech and Audio Interaction for Contextual Messag-
ceedings of the IEEE/ACM International Symposium on ing in Nomadic Environments,” ACM Trans. Comput.
Mixed and Augmented Reality, pp. 13–22 (Santa Barbara, Hum. Interact., vol. 7, no. 3, pp. 353–383 (2000 Sep.).
CA) (2006 Oct.). https://doi.org/10.1109/ISMAR.2006. https://doi.org/10.1145/355324.355327.
297789. [90] Z. Zhou, A. D. Cheok, J. Pan, and Y. Li, “Magic
[80] D. Kern, P. Marshall, E. Hornecker, Y. Rogers, Story Cube: An Interactive Tangible Interface for Sto-
and A. Schmidt, “Enhancing Navigation Information rytelling,” in Proceedings of the ACM SIGCHI Interna-
With Tactile Output Embedded Into the Steering tional Conference on Advances in Computer Entertain-
Wheel,” in H. Tokuda, M. Beigl, A. Friday, A. J. ment Technology, pp. 364–365 (Singapore) (2004 Jun.).
B. Brush, Y. Tobe (Eds.), Pervasive Computing: Per- https://doi.org/10.1145/1067343.1067404.
vasive 2009, Lecture Notes in Computer Science, vol. [91] J. R. Blum, M. Bouchard, and J. R. Cooper-
5538, pp. 42–58 (Springer, Berlin, Germany, 2009). stock, “What’s Around Me? Spatialized Audio Aug-
https://doi.org/10.1007/978-3-642-01516-8_5. mented Reality for Blind Users With a Smartphone,”
[81] B. N. Walker and J. Lindsay, “Navigation Per- in A. Puiatti and T. Gu (Eds.), Mobile and Ubiqui-
formance in a Virtual Environment With Bonephones,” tous Systems: Computing, Networking, and Services, Lec-
in Proceedings of the 11th International Conference ture Notes of the Institute for Computer Sciences, So-
on Auditory Display, pp. 260–263 (Limerick, Ireland) cial Informatics and Telecommunications Engineering,
(2005 Jul.). vol. 104, pp. 49–62 (Springer, Berlin, Germany, 2011).
[82] F. Liarokapis, “An Augmented Reality Interface http://doi.org/10.1007/978-3-642-30973-1_5.
for Visualizing and Interacting With Virtual Content,” [92] Y. Vazquez-Alvarez, I. Oakley, and S. A. Brew-
Virtual Real., vol. 11, no. 1, pp. 23–43 (2007 Mar.). ster, “Auditory Display Design for Exploration in Mo-
https://doi.org/10.1007/s10055-006-0055-1. bile Audio-Augmented Reality,” Pers. Ubiquit. Com-
[83] A. Walker, S. Brewster, D. McGookin, and A. put., vol. 16, no. 8, pp. 987–999 (2012 Dec.).
Ng, “Diary in the Sky: A Spatial Audio Display for a https://doi.org/10.1007/s00779-011-0459-0.
Mobile Calendar,” in A. Blandford, J. Vanderdonckt, P. [93] J. Rämö and V. Välimäki, “Digital Aug-
Gray (Eds.), People and Computers XV—Interaction With- mented Reality Audio Headset,” J. Electric. Com-
out Frontiers, pp. 531–539 (Springer, London, UK, 2001). put. Eng., vol. 2012, paper 457374 (2012 Oct.).
https://doi.org/10.1007/978-1-4471-0353-0_33. https://doi.org/10.1155/2012/457374.
[84] D. McGookin, Y. Vazquez-Alvarez, S. Brewster, [94] R. Gupta, R. Ranjan, J. He, and W. S. Gan,
and J. Bergstrom-Lehtovirta, “Shaking the Dead: Multi- “Parametric Hear Through Equalization for Augmented
modal Location Based Experiences for Un-Stewarded Ar- Reality Audio,” in Proceedings of the IEEE Interna-
chaeological Sites,” in Proceedings of the 7th Nordic Con- tional Conference on Acoustics, Speech and Signal Pro-
ference on Human-Computer Interaction: Making Sense cessing, pp. 1587–1591 (Brighton, UK) (2019 May).
Through Design, pp. 199–208 (Copenhagen, Denmark) https://doi.org/10.1109/ICASSP.2019.8683657.
(2012 Oct.). https://doi.org/10.1145/2399016.2399048. [95] R. Gupta, J. He, R. Ranjan, et al., “Aug-
[85] V. Lim, N. Frangakis, L. M. Tanco, and L. Pici- mented/Mixed Reality Audio for Hearables: Sens-
nali, “PLUGGY: A Pluggable Social Platform for Cultural ing, Control, and Rendering,” IEEE Signal Pro-
Heritage Awareness and Participation,” in M. Ioannides, J. cess. Mag., vol. 39, no. 3, pp. 63–89 (2022 May).
Martins, R. Žarnić, and V. Lim (Eds.), Advances in Digital https://doi.org/10.1109/MSP.2021.3110108.


[96] T. Kitagawa and K. Kondo, “On a Wind Noise A. Ronzhin, R. Potapova, and N. Fakotakis (Eds.), Speech
Countermeasure for Bicycle Audio Augmented Reality and Computer (SPECOM), Lecture Notes in Computer Sci-
Systems,” in Proceedings of the 6th Global Conference ence, vol. 9319, pp. 333–340 (Springer, Cham, Switzerland,
on Consumer Electronics, pp. 1–2 (Las Vegas, NV) (2017 2015). https://doi.org/10.1007/978-3-319-23132-7_41.
Oct.). https://doi.org/10.1109/GCCE.2017.8229227. [108] H. Kim, R. J. Hughes, L. Remaggi, et al., “Acoustic
[97] T. Kitagawa and K. Kondo, “Detailed Evaluation Room Modelling Using a Spherical Camera for Reverberant
of a Wind Noise Reduction Method Using DNN for 3D Spatial Audio Objects,” presented at the 142nd Convention
Audio Navigation System Audio Augmented Reality for of the Audio Engineering Society (2017 May), paper 9705.
Bicycles,” in Proceedings of the 8th Global Conference on [109] H. Kim, L. Remaggi, P. J. B. Jackson, and A.
Consumer Electronics, pp. 863–864 (Osaka, Japan) (2019 Hilton, “Immersive Spatial Audio Reproduction for VR/AR
Oct.). https://doi.org/10.1109/GCCE46687.2019.9015262. Using Room Acoustic Modelling From 360◦ Images,” in
[98] V. Sundareswaran, K. Wang, S. Chen, et al., Proceedings of the IEEE Conference on Virtual Reality
“3D Audio Augmented Reality: Implementation and and 3D User Interfaces, pp. 120–126 (Osaka, Japan) (2019
Experiments,” in Proceedings of the 2nd IEEE and Mar.). https://doi.org/10.1109/VR.2019.8798247.
ACM International Symposium on Mixed and Aug- [110] D. Li, T. R. Langlois, and C. Zheng,
mented Reality, pp. 296–297 (Tokyo, Japan) (2003 Oct.). “Scene-Aware Audio for 360◦ Videos,” ACM Trans.
https://doi.org/10.1109/ISMAR.2003.1240728. Graph., vol. 37, no. 4, paper 111 (2018 Aug).
[99] S. Tachi, K. Komoriya, K. Sawada, et al., https://doi.org/10.1145/3197517.3201391.
“Telexistence Cockpit for Humanoid Robot Control,” [111] C. Schissler, C. Loftin, and D. Manocha, “Acous-
Adv. Robot., vol. 17, no. 3, pp. 199–217 (2003 Apr.). tic Classification and Optimization for Multi-Modal Ren-
https://doi.org/10.1163/156855303764018468. dering of Real-World Scenes,” IEEE Trans. Vis. Com-
[100] H. Huang, M. Solah, D. Li, and L.-F. Yu, put. Graph., vol. 24, no. 3, pp. 1246–1259 (2018 Mar.).
“Audible Panorama: Automatic Spatial Audio Gen- https://doi.org/10.1109/TVCG.2017.2666150.
eration for Panorama Imagery,” in Proceedings of [112] O. Shih and A. Rowe, “Can a Phone Hear the
the CHI Conference on Human Factors in Comput- Shape of a Room?” in Proceedings of the 18th Interna-
ing Systems, paper 621 (Glasgow, UK) (2019 May). tional Conference on Information Processing in Sensor
https://doi.org/10.1145/3290605.3300851. Networks, pp. 277–288 (Montreal, Canada) (2019 Apr.).
[101] J. Yang, P. Sasikumar, H. Bai, A. Barde, G. Sörös, https://doi.org/10.1145/3302506.3310407.
and M. Billinghurst, “The Effects of Spatial Auditory and [113] V. Hulusic, C. Harvey, K. Debattista, et al.,
Visual Cues on Mixed Reality Remote Collaboration,” J. “Acoustic Rendering and Auditory–Visual Cross-Modal
Multimodal User Interfaces, vol. 14, no. 4, pp. 337–352 Perception and Interaction,” in Comput. Graph. Fo-
(2020 Dec.). https://doi.org/10.1007/s12193-020-00331-1. rum, vol. 31, vol. 1, pp. 102–131 (2012 Feb.).
[102] A. Härmä, J. Jakka, M. Tikander, et al., “Tech- https://doi.org/10.1111/j.1467-8659.2011.02086.x.
niques and Applications of Wearable Augmented Reality [114] J. Yang, F. Pfreundtner, A. Barde, K. Heutschi,
Audio,” presented at the 114th Convention of the Audio and G. Sörös, “Fast Synthesis of Perceptually Adequate
Engineering Society (2003 Mar.), paper 5768. Room Impulse Responses from Ultrasonic Measurements,”
[103] A. Härmä, J. Jakka, M. Tikander, et al., “Aug- in Proceedings of the 15th International Conference on
mented Reality Audio for Mobile and Wearable Appli- Audio Mostly, pp. 53–60 (Graz, Austria) (2020 Sep.).
ances,” J. Audio Eng. Soc., vol. 52, no. 6, pp. 618–639 https://doi.org/10.1145/3411109.3412300.
(2004 Jun.). [115] W. G. Gardner, 3-D Audio Using Loudspeakers,
[104] E. Mattheiss, G. Regal, C. Vogelauer, and H. The Springer International Series in Engineering and Com-
Furtado, “3D Audio Navigation - Feasibility and Re- puter Science, vol. 444 (Springer, New York. NY, 1998).
quirements for Older Adults,” in K. Miesenberger, [116] M. A. Gerzon, “Periphony: With-Height Sound
R. Manduchi, M. Covarrubias Rodriguez, and P. Reproduction,” J. Audio Eng. Soc., vol. 21, no. 1, pp. 2–10
Peňáz (Eds.), Computers Helping People With Spe- (1973 Feb.).
cial Needs, Lecture Notes in Computer Science, vol. [117] M. Gorzel, A. Allen, I. Kelly, et al., “Efficient En-
12377, pp. 323–331 (Springer, Cham, Switzerland, 2020). coding and Decoding of Binaural Sound With Resonance
https://doi.org/10.1007/978-3-030-58805-2_38. Audio,” in Proceedings of the AES International Confer-
[105] D. N. Zotkin, R. Duraiswami, and L. S. Davis, ence on Immersive and Interactive Audio (2019 Mar.), paper
“Rendering Localized Spatial Audio in a Virtual Auditory 68.
Space,” IEEE Trans. Multimed., vol. 6, no. 4, pp. 553–564 [118] M. Kentgens and P. Jax, “Translation of a Higher-
(2004 Aug.). https://doi.org/10.1109/TMM.2004.827516. Order Ambisonics Sound Scene by Space Warping,” in
[106] L. Cliffe, J. Mansell, C. Greenhalgh, and Proceedings of AES International Conference on Audio for
A. Hazzard, “Materialising Contexts: Virtual Sound- Virtual and Augmented Reality (2020 Aug.), paper 2-3.
scapes for Real-World Exploration,” Pers. Ubiq- [119] S. Moreau, J. Daniel, and S. Bertet, “3D Sound
uit. Comput., vol. 25, pp. 623–636 (2021 Aug.). Field Recording With Higher Order Ambisonics – Objec-
https://doi.org/10.1007/s00779-020-01405-3. tive Measurements and Validation of a Spherical Micro-
[107] G. Arvanitis, K. Moustakas, and N. Fakotakis, phone,” presented at the 120th Convention of the Audio
“Real-Time Context Aware Audio Augmented Reality,” in Engineering Society (2006 May), paper 6857.


[120] S. Favrot, M. Marschall, J. Käsbach, J. Buchholz, [132] V. R. Algazi, R. O. Duda, D. M. Thompson, and C.
and T. Weller, “Mixed-Order Ambisonics Recording and Avendano, “The CIPIC HRTF Database,” in Proceedings of
Playback for Improving Horizontal Directionality,” pre- the IEEE Workshop on the Applications of Signal Process-
sented at the 131st Convention of the Audio Engineering ing to Audio and Acoustics, pp. 99–102 (New Paltz, NY)
Society (2011 Oct.), paper 8528. (2001 Oct.). https://doi.org/10.1109/ASPAA.2001.969552.
[121] Y. Tanabe, G. Yamauchi, M. Atsushi, and T. [133] C. Armstrong, L. Thresh, D. Murphy, and G.
Kamekawa, “Tesseral Array for Group Based Spatial Audio Kearney, “A Perceptual Evaluation of Individual and
Capture and Synthesis,” in Proceedings of the AES Inter- Non-Individual HRTFs: A Case Study of the SADIE II
national Conference on Audio for Virtual and Augmented Database,” Appl. Sci., vol. 8, no. 11, paper 2029 (2018
Reality (2020 Aug.), paper 2-7. Oct.). https://doi.org/10.3390/app8112029.
[122] M. Lawton, S. Cunningham, and I. Convery, “Na- [134] S. Spagnol, “Auditory Model Based Subsetting
ture Soundscapes: An Audio Augmented Reality Experi- of Head-Related Transfer Function Datasets,” in Proceed-
ence,” in Proceedings of the 15th International Conference ings of the International Conference on Acoustics, Speech
on Audio Mostly, pp. 85–92 (Graz, Austria) (2020 Sep.). and Signal Processing, pp. 391–395 (Online) (2020 May).
https://doi.org/10.1145/3411109.3411142. https://doi.org/10.1109/ICASSP40776.2020.9053360.
[123] M. de Borba Campos, J. Sánchez, A. Cardoso Mar- [135] L. S. Simon, N. Zacharov, and B. F. G.
tins, R. Schneider Santana, and M. Espinoza, “Mobile Nav- Katz, “Perceptual Attributes for the Comparison of
igation Through a Science Museum for Users Who Are Head-Related Transfer Functions,” J. Acoust. Soc.
Blind,” in C. Stephanidis and M. Antona (Eds.), Univer- Am., vol. 140, no. 5, pp. 3623–3632 (2016 Nov.).
sal Access in Human-Computer Interaction. Aging and As- https://doi.org/10.1121/1.4966115.
sistive Environments, Lecture Notes in Computer Science, [136] D. Poirier-Quinot and B. F. G. Katz, “Impact of
vol. 8515, pp. 717–728 (Springer, Berlin, Germany, 2014). HRTF Individualization on Player Performance in a VR
https://doi.org/10.1007/978-3-319-07446-7_68. Shooter Game II,” in Proceedings of the AES International
[124] D. A. Mauro, R. Mekuria, and M. Sanna, “Bin- Conference on Audio for Virtual and Augmented Reality
aural Spatialization for 3D Immersive Audio Communica- (2018 Aug.), paper P4-1.
tion in a Virtual World,” in Proceedings of the 8th Audio [137] Z. Ben-Hur, D. Alon, P. W. Robinson, and R.
Mostly Conference, paper 8 (Piteå, Sweden) (2013 Sep.). Mehra, “Localization of Virtual Sounds in Dynamic Lis-
https://doi.org/10.1145/2544114.2544115. tening Using Sparse HRTFs,” in Proceedings of the AES
[125] B. Xie, Head-Related Transfer Function and Vir- International Conference on Audio for Virtual and Aug-
tual Auditory Display (J. Ross Publishing, Plantation, FL, mented Reality (2020 Aug.), paper 1-1.
2013), 2nd ed. [138] D. R. Begault, E. M. Wenzel, and M. R. Ander-
[126] E. Wenzel, F. Wightman, D. Kistler, and son, “Direct Comparison of the Impact of Head Tracking,
S. Foster, “Acoustic Origins of Individual Differ- Reverberation, and Individualized Head-Related Transfer
ences in Sound Localization Behavior,” J. Acoust. Functions on the Spatial Perception of a Virtual Speech
Soc. Am., vol. 84, no. S1, pp. S79–S79 (1988 Nov.). Source,” J. Audio Eng. Soc., vol. 49, no. 10, pp. 904–916
https://doi.org/10.1121/1.2026486. (2001 Oct.).
[127] D. R. Begault, “Challenges to the Successful Im- [139] C. Pörschmann, P. Stade, and J. M. Arend, “Bin-
plementation of 3-D Sound,” presented at the 89th Conven- aural Auralization of Proposed Room Modifications Based
tion of the Audio Engineering Society (1990 Sep.), paper on Measured Omnidirectional Room Impulse Responses,”
2948. in Proc. Mtgs. Acoust., vol. 30, no. 1, paper 015012 (2017
[128] E. M. Wenzel, M. Arruda, D. J. Kistler, Jun.). https://doi.org/10.1121/2.0000622.
and F. L. Wightman, “Localization Using Nonindivid- [140] B. G. Shinn-Cunningham, N. Kopco, and T. J.
ualized Head-Related Transfer Functions,” J. Acoust. Martin, “Localizing Nearby Sound Sources in a Class-
Soc. Am., vol. 94, no. 1, pp. 111–123 (1993 Sep.). room: Binaural Room Impulse Responses,” J. Acoust.
https://doi.org/10.1121/1.407089. Soc. Am., vol. 117, no. 5, pp. 3100–3115 (2005 May).
[129] Z. Yang, Y.-L. Wei, S. Shen, and R. R. https://doi.org/10.1121/1.1872572.
Choudhury, “Ear-AR: Indoor Acoustic Augmented Re- [141] S. Werner, F. Klein, A. Neidhardt, et al., “Cre-
ality on Earphones,” in Proceedings of the 26th An- ation of Auditory Augmented Reality Using a Position-
nual International Conference on Mobile Computing Dynamic Binaural Synthesis System—Technical Com-
and Networking, paper 56 (London, UK) (2020 Sep.). ponents, Psychoacoustic Needs, and Perceptual Evalua-
https://doi.org/10.1145/3372224.3419213. tion,” Appl. Sci., vol. 11, no. 3, paper 1150 (2021 Jan.).
[130] T. Suenaga, S. Kaneko, and H. Okumura, “De- https://doi.org/10.3390/app11031150.
velopment of Shape-Based Average Head-Related Trans- [142] T. Wendt, S. van de Par, and S. D. Ewert, “A
fer Functions and Their Applications,” Acoust. Sci. Computationally-Efficient and Perceptually-Plausible Al-
Technol., vol. 41, no. 1, pp. 282–287 (2020 Jan.). gorithm for Binaural Room Impulse Response Simulation,”
https://doi.org/10.1250/ast.41.282. J. Audio Eng. Soc., vol. 62, no. 11, pp. 748–766 (2014 Nov.).
[131] B. Gardner and K. Martin, “HRTF Measurements https://doi.org/10.17743/jaes.2014.0042.
of a KEMAR Dummy-Head Microphone,” Tech Rep. 280 [143] D. R. Perrott, T. Sadralodabai, K. Saberi, and
(1994 May). T. Z. Strybel, “Aurally Aided Visual Search in the Cen-


tral Visual Field: Effects of Visual Load and Visual Sensor Systems, pp. 371–372 (Shenzhen, China) (2018
Enhancement of the Target,” Hum. Factors, vol. 33, Nov.). https://doi.org/10.1145/3274783.3275188.
no. 4, pp. 389–400 (1991 Aug.). https://doi.org/10.1177/ [155] S. Klockgether and S. van de Par, “Just No-
001872089103300402. ticeable Differences of Spatial Cues in Echoic and
[144] J. J. Gibson, The Ecological Approach to Visual Anechoic Acoustical Environments,” J. Acoust. Soc.
Perception: Classic Edition (Psychology Press, New York. Am., vol. 140, no. 4, pp. EL352–EL357 (2016 Oct.).
NY, 2014). https://doi.org/10.1121/1.4964844.
[145] P. Fröhlich, R. Simon, L. Baillie, and H. [156] C. L. Christensen, G. Koutsouris, and J. H. Rindel,
Anegg, “Comparing Conceptual Designs for Mo- “The ISO 3382 Parameters: Can We Simulate Them? Can
bile Access to Geo-Spatial Information,” in Proceed- We Measure Them?” in Proceedings of the International
ings of the 8th International Conference on Human- Symposium on Room Acoustics, pp. 9–11 (Toronto, Canada)
Computer Interaction With Mobile Devices and Ser- (2013 Jun.).
vices, pp. 109–112 (Helsinki, Finland) (2006 Sep.). [157] Z. Meng, F. Zhao, and M. He, “The Just No-
https://doi.org/10.1145/1152215.1152238. ticeable Difference of Noise Length and Reverberation
[146] J. Wilson, B. N. Walker, J. Lindsay, C. Cambias, Perception,” in Proceedings of the International Sym-
and F. Dellaert, “SWAN: System for Wearable Audio Navi- posium on Communications and Information Technolo-
gation,” in Proceedings of the 11th IEEE International Sym- gies, pp. 418–421 (Bangkok, Thailand) (2006 Oct.).
posium on Wearable Computers, pp. 91–98 (Boston, MA) https://doi.org/10.1109/ISCIT.2006.339980.
(2007 Oct.). http://doi.org/10.1109/ISWC.2007.4373786. [158] W. Bailey and B. Fazenda, “The Effect of Vi-
[147] K. R. May, B. J. Tomlinson, X. Ma, P. Roberts, sual Cues and Binaural Rendering Method on Plausibil-
and B. N. Walker, “Spotlights and Soundscapes: On the De- ity in Virtual Environments,” presented at the 144th Con-
sign of Mixed Reality Auditory Environments for Persons vention of the Audio Engineering Society (2018 May),
With Visual Impairment,” ACM Trans. Access. Comput., paper 9921.
vol. 13, no. 2, pp. 8 (2020 Apr.). https://doi.org/10.1145/ [159] F. Grijalva, L. Martini, S. Goldenstein, and D. Flo-
3378576. rencio, “Anthropometric-Based Customization of Head-
[148] M. Bandukda and C. Holloway, “Audio AR to Sup- Related Transfer Functions Using Isomap in the Hor-
port Nature Connectedness in People With Visual Disabil- izontal Plane,” in Proceedings of the IEEE Interna-
ities,” in Adjunct Proceedings of the ACM International tional Conference on Acoustics, Speech and Signal Pro-
Joint Conference on Pervasive and Ubiquitous Computing cessing, pp. 4473–4477 (Florence, Italy) (2014 May).
and Proceedings of the ACM International Symposium on https://doi.org/10.1109/ICASSP.2014.6854448.
Wearable Computers, pp. 204–207 (Online) (2020 Sep.). [160] A. Meshram, R. Mehra, H. Yang, et al., “P-
https://doi.org/10.1145/3410530.3414332. HRTF: Efficient Personalized HRTF Computation for
[149] S. Joshi, K. Stavrianakis, and S. Das, “Substitut- High-Fidelity Spatial Sound,” in Proceedings of the
ing Restorative Benefits of Being Outdoors Through Inter- IEEE International Symposium on Mixed and Augmented
active Augmented Spatial Soundscapes,” in Proceedings Reality, pp. 53–61 (Munich, Germany) (2014 Oct.).
of the 22nd International ACM SIGACCESS Conference https://doi.org/10.1109/ISMAR.2014.6948409.
on Computers and Accessibility, paper 80 (Online) (2020 [161] M. T. Islam and I. J. Tashev, “Anthropometric Fea-
Oct.). https://doi.org/10.1145/3373625.3418029. tures Estimation Using Integrated Sensors on a Headphone
[150] R. Gupta, R. Ranjan, J. He, and G. Woon-Seng, for HRTF Personalization,” in Proceedings of the AES In-
“Investigation of Effect of VR/AR Headgear on Head Re- ternational Conference on Audio for Virtual and Augmented
lated Transfer Functions for Natural Listening,” in Pro- Reality (2020 Aug.), paper 1-7.
ceedings of the AES International Conference on Audio for [162] S. Kaneko, T. Suenaga, and S. Sekine, “DeepEar-
Virtual and Augmented Reality (2018 Aug.), paper P3-9. Net: Individualizing Spatial Audio With Photography, Ear
[151] A. Genovese, G. Zalles, G. Reardon, and A. Ro- Shape Modeling, and Neural Networks,” in Proceedings of
ginska, “Acoustic Perturbations in HRTFs Measured on the AES International Conference on Audio for Virtual and
Mixed Reality Headsets,” in Proceedings of International Augmented Reality (2016 Sep.), paper 6-3.
Conference on Audio for Virtual and Augmented Reality [163] F. Shahid, N. Javeri, K. Jain, and S. Badhwar, “AI
(2018 Aug.), paper P8-4. DevOps for Large-Scale HRTF Prediction and Evaluation:
[152] N. Kelly, “All the World’s a Stage: What Makes a An End to End Pipeline,” in Proceedings of the AES Inter-
Wearable Socially Acceptable,” Interactions, vol. 24, no. 6, national Conference on Audio for Virtual and Augmented
pp. 56–60 (2017 Nov.). https://doi.org/10.1145/3137093. Reality (2018 Aug.), paper P9-4.
[153] R. R. Choudhury, “Earable Computing: A New [164] G. W. Lee, J. H. Lee, S. J. Kim, and H. K. Kim, “Di-
Area to Think About,” in Proceedings of the 22nd rectional Audio Rendering Using a Neural Network Based
International Workshop on Mobile Computing Systems Personalized HRTF,” in Proceedings of the Annual Con-
and Applications, pp. 147–153 (Online) (2021 Feb.). ference of the International Speech Communication Asso-
https://doi.org/10.1145/3446382.3450216. ciation (INTERSPEECH), pp. 2364–2365 (Graz, Austria)
[154] F. Kawsar, C. Min, A. Mathur, et al., “eSense: (2019 Sep.).
Open Earable Platform for Human Sensing,” in Proceed- [165] J.-M. Jot and K. S. Lee, “Augmented Reality Head-
ings of the 16th ACM Conference on Embedded Networked phone Environment Rendering,” in Proceedings of the AES

808 J. Audio Eng. Soc., Vol. 70, No. 10, 2022 October
PAPERS A REVIEW OF AUDIO AUGMENTED REALITY

International Conference on Audio for Virtual and Aug- of the IEEE International Conference on Multimedia Com-
mented Reality (2016 Aug.), paper 8-2. puting and Systems, vol. 1, pp. 427–432 (Florence, Italy)
[166] R. Behringer, S. Chen, V. Sundareswaran, K. (1999 Jun.). https://doi.org/10.1109/MMCS.1999.779240.
Wang, and M. Vassiliou, “A Distributed Device Diag- [168] C. M. Tomaino, “Music Therapy and the Brain,” in
nostics System Utilizing Augmented Reality and 3D Au- B. L. Wheeler (Ed.), Music Therapy Handbook, pp. 40–50
dio,” in M. Gervautz, D. Schmalstieg, and A. Hilde- (Guilford Press, New York, NY, 2015).
brand (Eds.), Virtual Environments ’99, Eurograph- [169] U. Chong and S. Alimardanov, “Audio Aug-
ics, pp. 105–114 (Springer, Vienna, Austria, 1999). mented Reality Using Unity for Marine Tourism,” in
https://doi.org/10.1007/978-3-7091-6805-9_11. M. Singh, D. K. Kang, J. H. Lee, U. S. Tiwary, D.
[167] R. Behringer, S. Chen, V. Sundareswaran, K. Singh, W. Y. Chung (Eds.), Intelligent Human Com-
Wang, and M. Vassiliou, “A Novel Interface for Device Di- puter Interaction, Lecture Notes in Computer Science, vol.
agnostics Using Speech Recognition, Augmented Reality 12616, pp. 303–311 (Springer, Cham, Switzerland, 2020).
Visualization, and 3D Audio Auralization,” in Proceedings http://doi.org/10.1007/978-3-030-68452-5_31.

THE AUTHORS


Jing Yang was a research assistant at ETH Zurich and is now a senior researcher at Huawei. She received her Ph.D. in 2021 from ETH Zurich. Her research interests lie at the intersection of augmented reality and human-computer interaction, with a specific focus on spatial audio, room acoustics, music style transfer, and related applications. She has been active in academic and industrial activities at several companies and institutes, including the Idiap Research Institute, The University of Auckland, and Nokia Bell Labs.
•
Amit Barde is a Research Fellow at the Empathic Computing Laboratory, University of Auckland. He received his Ph.D. in Human Interface Technology from the HIT Lab NZ, where he explored the use of spatialized auditory cues for information delivery on wearable devices. His research interests include information delivery using spatialized auditory cues, interactive audio, the effects of sound on empathy, and its use in the management and treatment of tinnitus.
•
Mark Billinghurst is Director of the Empathic Computing Laboratory and Professor at the University of South Australia in Adelaide, Australia, and at the University of Auckland in Auckland, New Zealand. He earned a Ph.D. in 2002 from the University of Washington and conducts research on how virtual and real worlds can be merged, publishing over 650 papers on augmented reality (AR), virtual reality, remote collaboration, empathic computing, and related topics. In 2013, he was elected as a Fellow of the Royal Society of New Zealand and, in 2019, he received the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) Career Impact Award in recognition of his lifetime contribution to AR research and commercialization.
