Streaming Immersive Audio Content
Conference Paper
Presented at the Conference on
Audio for Virtual and Augmented Reality
2016 Sept 30 – Oct 1, Los Angeles, CA, USA
This paper was peer-reviewed as a complete manuscript for presentation at this conference. This paper is available in the AES
E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted
without direct permission from the Journal of the Audio Engineering Society.
ABSTRACT
“Immersion [...] is a perception of being physically present in a non-physical world.” [1] It is critical to think
about immersive audio for live music streaming because giving listeners the illusion of being transported to a
different acoustic environment makes the experience of streaming much more real. This paper describes various
approaches to enable audio engineers to create immersive audio content for streaming, whether using existing tools
and network infrastructure and delivering static binaural audio or getting ready for emerging tools and workflows
for creating immersive audio content for Cinema or Virtual Reality streaming.
AES Conference on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016 Sept 30 – Oct 1
Kares, Larcher Streaming Immersive Audio
side. Through the application of HRTFs at the listening stage, it is possible to use technologies like headtracking and individualized HRTFs. Headtracking is a major component for a realistic immersive audio headphone experience because it helps eliminate the well-known front-back confusions that can occur with static binaural content. [5]
In practice, dynamic binaural listening in streaming is mostly accomplished using ambisonic or object-based signals. Ambisonics has the advantage of a fixed data bandwidth, which is probably why ambisonics is used by major internet platforms for 360-degree video streaming. On the other hand, object-based concepts are mostly used in VR gaming, since the object data does not have to be streamed through a network but is created in the game itself.
3.3 Loudspeaker listening
Immersive audio for loudspeaker playback can be defined as traditional 2D surround sound with height information. Studies showed that the additional height information increases the perception of envelopment. [6]
Loudspeaker listening can be achieved with either channel-based, object-based or scene-based approaches, but for now the market is dominated by channel-based and, more recently, object-based systems.
Since a 3D loudspeaker setup often creates logistical challenges, for example suspending and securing the height loudspeakers, it is mainly found in cinemas, both as channel-based systems, with solutions such as Auro3D since 2005, and as object-based systems such as Atmos, which relies on an object-based codec first released in 2012. [7] More recently these solutions have been integrated in home receivers, enabling end-users to experience immersive content in their living room. Some progress has also been made to integrate object-based immersive audio streams into the current streaming platforms [8] and into broadcast [9].
with. As this is a relatively new field, our goal is to offer a starting point for audio engineers eager to get started with immersive audio recording and streaming. Obviously, fellow practitioners already well versed in this field may have developed valid alternatives.
4.1 Capturing
For stereo recording, many engineers might use close microphones only, others might use microphones for capturing the reverberant information as well as the overall positional balance of the various sources ("main microphone" technique), whereas others use both kinds of capturing techniques at the same time. The technique chosen depends mostly on the type of content that is created. Especially with music, the chosen recording technique heavily depends on the genre. For instance, Rock’n’Roll recordings are mostly done using close miking techniques, whereas Classical Music recordings heavily rely on main microphones (capturing e.g. the whole orchestra).
At the recording stage, the basic principles used for capturing close microphones in stereo also apply to immersive audio. Larger differences between the two are observed at the mixing and processing stage. When recording with close microphones, one still has to position the mics in order to get a nicely balanced frequency response while minimizing bleed from other nearby sound sources.
However, for the ambient/reverberant recording microphones, there are different capturing devices for each category of immersive content, which are introduced and described in the following subsections.
4.1.1 Capturing static binaural content
The best and easiest way to record reverberant information or a complete capture of the acoustic scene for static binaural content is to use a dummy head microphone.
A dummy head has two microphone capsules (pressure transducers) in its ears and delivers binaurally processed audio as an output. Because of the head’s shape and its ears, the signal is very close to what real eardrums would perceive. The size and shape of the head can be derived from an average of many people’s heads or can replicate the head of one individual, but of course a dummy head recording cannot be individualized to the specific HRTFs of each listener.
In a recording the position of the dummy head is important. If no further processing is used, for example to add spot microphones into the mix, the only way to spatially design the soundscape is by placing the dummy head at the desired spot in the action or by moving people around it. This practice was perfected, for instance, in radio dramas in the 1970s. For that purpose, dummy head recording offers a very quick and effective way of creating a 3D audio stream.
Dummy head recordings only replicate the sound present at a certain position in a given environment, and the dummy head is exceptionally good at this, but many audio engineers wish to tweak the sound after the recording, for example to move sound sources around and to control directivity and the amount of room information. We propose a method to address this need using a dummy head combined with close microphone techniques in modern mixing workflows (Section 4.2.1).
4.1.2 Capturing content for multi-loudspeaker layouts
Capturing immersive content for loudspeaker playback can be discouraging because the playback situation heavily depends on the end-user’s setup. Of course, channel-based formats give the content producer an easy way to control which microphones are assigned to which channel in playback, but there is little one can do to anticipate what it will end up sounding like on the end-user’s listening setup. Object-audio formats aim at solving this issue. However, object-audio might work best with close microphone sources, as signals with room information tend to be harder to reproduce with some consistency across different 3D loudspeaker layouts. Our experience indicates that a better spatial experience is obtained by sending independent room information signals to each loudspeaker as opposed to spreading each room information signal across multiple loudspeakers. Luckily, existing object-based formats enable objects to snap to the nearest loudspeaker position so that each microphone signal can be played back by exactly one speaker. This is especially useful when recording with incoherent arrays of microphones, which is a common way of recording immersive audio content in acoustic music. [10] To capture environment and room using a main microphone approach, we have been perfecting the use of microphones in a cube-shaped array, with each microphone corresponding to one loudspeaker channel. With this approach, we have reliably reached natural-sounding results for acoustic music.
4.1.3 Capturing content for dynamic binaural playback
Capturing content for dynamic binaural audio is something quite new, and for 360 videos it mostly relies on ambisonics, the immersive audio technology invented in the 1970s. For recording reverberant information the easiest way is to record straight in ambisonics or in a format easily convertible to ambisonics. Microphone arrays are available that record in various formats. For example, the Nimbus/Halliday microphone array records native first-order ambisonic B-Format and consists of one omni-directional microphone and several figure-of-eight microphones. One example of this microphone category in a 2D configuration, for horizontal capture only, is the Josephson C700S.
A very common alternative consists of capturing signals in A-Format. Such microphone arrays consist of four directional microphones in a tetrahedral arrangement. Examples of this category are the Sennheiser AMBEO VR Mic and the SoundField DSF. Simple matrixing can convert the recorded A-Format to ambisonics B-Format, and additional filters are applied to equalize the diffuse-field response. [11]
It is also possible to record ambisonics with high-order ambisonics microphones, but the patterns above first-order ambisonics are not practical to build with typical microphone capsules. As a result, sophisticated processing is necessary in order to convert an array of regular-pattern microphones to a high-order stream. A representative microphone of this category is the Eigenmike [12].
4.2 Mixing and live processing issues
As stated above, the biggest difference in immersive audio capture is the capture of room information, whereas the close microphones are captured in a very similar
way as in stereo. However, when creating immersive audio content, the processing of close microphones is very different from stereo mixing and varies considerably with the format that is used at recording.
4.2.1 Mixing static binaural audio
The main goal for mixing in static binaural is to create an externalized experience, so that the audio is perceived outside of the listener’s head, while minimizing coloration of the signal. This is challenging as HRTFs alter the frequency response to imprint spatial information in the signal. Although listening to binaurally-processed audio over headphones is intended to be more realistic, switching back and forth between binaural audio and regular stereo can lead to the impression of binaural audio sounding altered or over-processed. Our experiments with dummy head recording showed that the output usually seems less colored than the output of binaural processing tools, i.e. most commonly binaural panning plugins.
A dummy head can indeed be used together with binaurally processed close microphones. In many situations the actual balance of the dummy head output alone is not perfect, but it has a great overall sound. This workflow enables the mixing process to be similar to mixing in stereo, with additional control over the placement and depth of the sources. However, mixing a dummy head recording and binaurally-processed close microphones will sound best if the dummy head’s HRTFs match the HRTFs used in the binaural processing tool. Many binaural panning plugins allow for AES69 / SOFA format import, which makes it possible to use custom HRTF sets in the plugin. Furthermore, some binaural panning plugins allow custom reflection calculation in a virtual room. It remains to be discussed whether using artificial reflections on the spot microphone signals is necessary when they are mixed in with the room information provided by the dummy head.
4.2.2 Mixing dynamic binaural audio
This paper limits its scope to mixing linear VR content using scene-based or ambisonic techniques. Although it is possible to use object-audio mixing for VR, it is not commonly used yet.
Mixing in ambisonics requires integrating the signals of two types of microphones: native ambisonic microphones, which usually record reverberant information or a complete audio scene, and close microphones, which typically record audio objects. As in stereo, engineers may decide to use either one of the two options or both of them, depending on the current constraints of the scene to record and the type of content. With native ambisonics microphones, it is possible to alter the soundfield in many different ways at the mixing stage, for example to rotate, shift, change the directivity of, mirror and widen the soundfield. [13]
Close microphones can be mixed in thanks to simple ambisonic panning plugins, which usually rely on simple matrixing appropriate for the desired ambisonic order. Applying linear effects such as equalizers to an ambisonic signal can be done either in B-Format or in A-Format. For non-linear effects, such as dynamics effects, or to have some control over the spatial area to which an effect should apply, it is most common to first convert to A-Format. Overall, there is a growing number of effects to process native ambisonic audio (see for example the TOA Manipulators VST plugins [14]).
4.2.3 Mixing immersive loudspeaker content
Loudspeaker content is often mixed using either channel- or object-based approaches. For mixing object-based audio, there does not seem to be a common way of mixing natively in any digital audio workstation (DAW) or mixing desk; the process relies on external tools from the separate object-audio licensors. In most cases, the individual objects are sent out of the DAW and stored on a separate machine that also stores metadata about the position of the objects.
Since many DAWs are limited to eight output channels per track, mixing for channel-based loudspeaker setups is also somewhat more difficult. A workaround is to use multiple regular 2D surround sound output buses as different height layers for the 3D mix. By using post-fader sends on the individual tracks to the output buses, it is possible to have a channel signal on different height layers. Third-party plugins supporting 3D audio panning for channel-based loudspeaker systems are available as well.
When recording with a cube-style microphone array for reverberant information, as mentioned earlier, integrating the array output into the mix is unambiguous. However, when mixing the signals from close microphones, there is more artistic freedom and few rules of thumb apply (see 6 Artistic Aspects of
Immersive Audio).
5 Infrastructure and streaming application
The paper describes different ways of capturing and producing immersive audio content, but additional hurdles have to be overcome to stream this content. An easy way of streaming immersive content is to use static binaural audio because it requires support for a stereo signal only, which is provided by almost all existing streaming platforms and broadcast networks.
For the other formats, there are several factors which differentiate them from stereo streaming. First of all, the nature of the transmitted signals can vary (object-based, channel-based, scene-based) and even within these formats, competition between license providers can create a difficult base of operations for both content creators and consumers on the listening side. Most platforms provide their own listening or viewing mobile or web applications and can handle the decoding process themselves. However, in broadcast and cinema, this remains a bottleneck.
In addition, new hardware is required for new listening formats. For example, in the case of dynamic headphone playback, a separate headtracking device may be necessary when not built into VR goggles. For loudspeaker playback, decoding hardware is most often mandatory, though this might be alleviated when decoding software is integrated in listening devices.
Finally, the processing power available on mobile devices is a limiting factor and so is battery life. This is especially applicable to the case of dynamic binaural rendering.
Luckily, there is a growing awareness in the industry of the need to provide easy access to immersive audio to both consumers on the listening side and to creators on the content creation side.
6 Artistic aspects of Immersive Audio
From a creator’s point of view, there are a lot of new creative possibilities with immersive audio. Instead of having only one dimension to position sounds (usually left to right), an audio engineer can take into account the additional dimensions of depth, front and back, and height. They can opt for a naturalistic approach where the localization of sound corresponds to a picture the viewer would see, adding room information in the three-dimensional space. Alternatively, they can opt for an artificial approach and distribute the different audio elements spatially, possibly away from a realistic perspective. This of course depends on the type and genre of music, as well as personal taste. It is worth mentioning that for audio content with multiple sound sources, the human brain will segregate these elements much more easily when they are located around the listener in 3D space than with typical stereo panning. Immersive audio streaming therefore opens up a whole new field of audio creativity. [15]
7 Conclusion
The paper described several different ways of recording and streaming immersive audio content. It explained recording and streaming strategies for regular stereo content, which consist of using microphones with room information and close microphones, and applied them to immersive audio content creation. For capturing immersive content, the close microphones would be captured the same way as in stereo, whereas the capture of room information would be done differently. For creating immersive loudspeaker content, we recommended replacing the room information microphones commonly used in stereo with a cube-shaped array of microphones. For static binaural, we recommended using a dummy head microphone as the room information microphone. For virtual reality immersive audio streaming, we recommended using an ambisonic microphone as the room information microphone.
We showed several ways of processing the close microphones to mix them into the immersive audio streams, using either binaural panning tools, ambisonic panning tools, or channel-based or object-based panning tools, depending on the listening format.
For streaming, the easiest way to achieve immersive audio streaming today is to use a static binaural stream because it can build on current infrastructure. With major companies working on immersive audio for VR and 360 video, it probably remains only a matter of time before ambisonic live streaming is made available to end-users. Object-audio streaming is being heavily explored, with MPEG-H and Atmos leading this effort. Channel-based loudspeaker approaches such as Auro 3D are also looking into further applications for live streaming.
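As a rough illustration of the ambisonic workflow discussed in Section 4.2.2 (panning a close microphone into the scene captured by an ambisonic room microphone, then rotating the soundfield), the following Python sketch processes a single audio sample. It is a minimal sketch under stated assumptions, not the tooling used by the authors: it assumes the traditional first-order B-Format convention (W attenuated by 1/sqrt(2), X front, Y left, Z up), the function names pan_mono, rotate_yaw and mix are hypothetical, and the room-microphone values are made up for the example.

```python
import math

# Assumed convention (not specified in the paper): traditional first-order
# B-Format with W attenuated by 1/sqrt(2), X pointing front, Y left, Z up.
W_GAIN = 1.0 / math.sqrt(2.0)


def pan_mono(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one close-microphone sample into (W, X, Y, Z).

    Azimuth is counter-clockwise from the front, elevation upward from
    the horizontal plane, both in degrees.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        sample * W_GAIN,
        sample * math.cos(az) * math.cos(el),
        sample * math.sin(az) * math.cos(el),
        sample * math.sin(el),
    )


def rotate_yaw(bformat, yaw_deg):
    """Rotate a B-Format sample counter-clockwise about the vertical axis.

    To compensate a tracked head yaw of h degrees, rotate by -h.
    """
    w, x, y, z = bformat
    t = math.radians(yaw_deg)
    return (
        w,
        x * math.cos(t) - y * math.sin(t),
        x * math.sin(t) + y * math.cos(t),
        z,
    )


def mix(*streams):
    """Sum several B-Format samples channel by channel."""
    return tuple(sum(channel) for channel in zip(*streams))


if __name__ == "__main__":
    room = (0.20, 0.05, -0.03, 0.01)       # hypothetical room-microphone sample
    vocal = pan_mono(0.5, azimuth_deg=30)  # close microphone, 30 degrees left
    scene = mix(room, vocal)
    # Listener turns the head 30 degrees to the left: rotate the scene by -30.
    print(rotate_yaw(scene, -30.0))
```

Because the whole scene is carried by the same four channels regardless of how many close microphones are panned in, the rotation is applied once to the mixed stream rather than per source. Gain matching between the room microphone and the panned sources is ignored here.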
References