Representing Inner Voices in VR
ABSTRACT
The inner auditory experience comprises sounds that, rather than originating from sources in the environment, arise from internal processes in the observer's brain; examples include verbal thoughts and auditory hallucinations. Traditional audiovisual media have tended to represent inner voices with an emphasis on impact and storytelling rather than on reflecting a true-to-life experience. In virtual reality (VR) environments, where plausibility is favoured over such hyper-real sound design, the question remains how best to recreate realistic, or alternatively entertaining, inner and imagined voices via head-tracked headphones and spatial audio tools. This paper first presents a questionnaire, completed by 70 participants, on their own experience of inner voices. The results of the questionnaire are then used to inform a VR experiment in which different methods of rendering inner voices are compared, using a short film created for this project. The results show that when the focus is on believability, people mostly expect realism from the rendering of inner voices and auditory hallucinations. Expectations for inner voice did not change considerably in an entertainment context, whereas for hallucinations, exaggerated reverberation was preferred.
increase engagement and storytelling narratives. For example, film dialogue is expected to sound clear and crisp, and similar sound effects are chosen for specific types of events. While these hyper-realistic approaches are immersive in more traditional media, virtual reality (VR) experiences engage with a spectator in a fundamentally different way by making the experience more dynamic and interactive. Therefore, the representation of inner voices in VR requires investigation.

This paper studies both realistic (plausible) and entertaining (engaging) reproduction of inner voices in VR environments and how prioritising either changes the experience. The aim is to answer the question of how sounds like the perceived sound of one's own thoughts, auditory hallucinations, and other such sounds should be represented and, by extension, how they should be rendered in practice using spatial audio techniques. This is approached via a two-stage study comprising a questionnaire and a listening test. The questionnaire aims to understand people's experiences and expectations regarding inner voices. The purpose of the listening test is then to test a set of inner experience simulations, derived from the questionnaire results, on a group of test subjects.

2 Inner Voices in the Brain

The human auditory system can be examined on three layers: a physical layer captures sound vibrations, a neurological layer transmits and processes them, and a cognitive layer perceives and understands them. The auditory system is also connected to speech production through a number of physical and neurological paths. Physical feedback results from sound vibrations created through speech being received as sensory inputs through the ears [6]. This phenomenon defines some of the characteristics, such as the effects of bone conduction, that people associate with their own voice. The neurological connections are important for the semantic understanding of language and various other cognitive phenomena. One such connection is the inner simulation of motor actions. As the premotor and motor cortices interact with muscles, copies of these stimuli are sent as feedback into the brain areas that typically receive the relevant physical and somatosensory feedback [4]. This behaviour allows the brain to track its performance. Research has shown that these efference copies are sent regardless of whether physical movements are made [7]; therefore, speech is carried back to the auditory regions of the brain in both vocal and subvocal speech. This could be one factor contributing to the formation of inner speech.

The key difference between inner speech and auditory vocal hallucinations (AVH) is agency. Inner speech holds a sense of agency, which is believed to enable the brain to detect language inputs originating from within; this is thought to be connected to efference copies [8, 5]. In hallucinations, the sense of agency is lost, which causes confusion in the brain, since it can no longer accurately detect internally generated stimuli. AVHs are therefore difficult to discern from reality [9].

While neurology can provide a context for how inner voices form, psychology helps to understand their perceptual features. One theory explaining human consciousness is the multi-component model of working memory [2]. The theory identifies three primary processes of consciousness: the central executive and its subsystems, the visuospatial scratchpad, and the phonological loop. The phonological loop itself comprises two components: a temporary store which can hold on to verbal content for a few seconds, and an articulatory rehearsal process which feeds and refreshes the store. The articulatory process is thought to be linked to vocal and subvocal speech.

The characteristics of inner voices have been approached from many directions. One influential theory is Vygotsky's theory on the development of language and thought [10], which argues that subvocal speech is an internalised version of voiced speech, often serving the same purpose as the egocentric speech customary to small children. Studies by J. B. Watson also found that children and adults employ similar techniques in problem-solving, with the distinction that children often go through their thinking process out loud, whereas adults think subvocally [11]. In a 2011 study, 380 people were interviewed on their use of inner speech [3]. The most common interactions were found to revolve around one's appearance and state of affairs (such as finances, stress, and the future), the planning of actions and future conversations, various problem-solving tasks, and self-regulation.

3 Questionnaire on Inner Auditory Experience

In the first phase of this work, a three-part questionnaire was created to collect statistics on people's inner
experiences. The questionnaire focused on three areas: inner speech, auditory hallucinations, and sound in dreams.

The topics of the questionnaire were chosen based on a number of expected characteristics: inner speech in VR should sound similar to the perceived qualities of one's own voice, with bone conduction present and little-to-no reverberation, and the voice should be perceived inside the head. Auditory hallucinations should be difficult to tonally discern from physical sources, except when aiming for entertainment, where the separation should be easier to make. These assumptions were derived from background research and the authors' introspection.

3.1 Questions

The goal of the first section was to acquire first-hand experiences of inner voices, including ones that had not been thought of when designing the questionnaire; it was therefore worded to be open-ended and not suggestive. The section focused on two types of inner voices: inner speech and imaginary, hallucination-like sounds. The ordering of the sections was chosen to minimise bias by allowing the participants to think freely before being asked more direct questions. The questions are listed below:

• If inner monologue is a part of your thinking, how would you describe your inner monologue?

• Have you ever experienced any form of an auditory hallucination? If so, how would you describe it?

• Can you recall any dreams you have recently had? Can you describe how different things sounded in your dream?

The second section comprised three questions aiming to more directly validate specific assumptions. Each question had a selection of predefined options assumed to be the most common ones, but open responses were also allowed. The questions are presented below:

• When you think about something you have heard, do you sense the sound... In the ambience of the space you were in? / In the ambience of the space you are currently in? / Without any ambience? / Other (specify).

The data collected in the first and second sections were difficult to use directly as inputs for concrete sound design parameters. In the third section, therefore, a selection of parameters was defined based on introspection while designing the survey. Here, values for each parameter were collected with sliders ranging from 0 to 10, each end corresponding to one of the edge cases for the parameter, as listed below:

• Loudness - quiet to loud.
• Depth - boomy to thin.
• Spatial characteristics - dry to reverberant.
• Tonality - tonal to whispery.
• Location - inside to outside.

The participants repeated this part for each of three contexts: their experience of their own inner speech; their expectation for the tone of simulated inner speech; and their expectation for the voices of imagined hallucination-like characters.

3.2 Results

Responses were gathered over two months from 70 people of diverse backgrounds, with ages ranging from 22 to 76 (median age 32). Some participants did not answer all of the questions and some answers did not contain useful information, so the number of valid responses is disclosed alongside the data. With the wide range of ages among the participants, some minor age-related bias may be present in the results, particularly in the third stage of the questionnaire, which discussed relatively recent technology.
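The 0-10 slider values collected in the third section are unitless; before such values can drive a renderer, each one has to be mapped onto a concrete parameter range. The sketch below illustrates one simple linear mapping; the parameter names and target ranges are assumptions made for illustration, not values taken from this study.

```python
# Illustrative sketch: rescaling 0-10 questionnaire slider values onto
# normalised sound-design parameter ranges. The parameter names and
# target ranges below are assumptions, not the study's actual values.

def slider_to_param(value, lo, hi):
    """Linearly map a 0-10 slider value onto the range [lo, hi]."""
    if not 0 <= value <= 10:
        raise ValueError("slider values are constrained to 0-10")
    return lo + (hi - lo) * value / 10.0

# Hypothetical targets for three of the questionnaire's edge-case pairs,
# e.g. "dry to reverberant" interpreted as a wet/dry mix in [0, 1].
PARAM_RANGES = {
    "loudness_db": (-20.0, 0.0),  # quiet .. loud (playback gain)
    "reverb_mix":  (0.0, 1.0),    # dry .. reverberant
    "distance_m":  (0.0, 2.0),    # inside the head .. outside
}

response = {"loudness_db": 5, "reverb_mix": 1, "distance_m": 2}
params = {k: slider_to_param(v, *PARAM_RANGES[k]) for k, v in response.items()}
```

A linear map is the simplest defensible choice here; perceptual quantities such as loudness or distance may of course warrant a logarithmic mapping instead.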
Fig. 3: The voice perceived in memories of past conversations (67 valid responses).

Fig. 4: The presence of acoustical characteristics and other ambience features in memories of past events (67 valid responses). "Then" stands for the time of the event, "now" refers to the time of completing the questionnaire, and "according to recollection" refers to only noticeable ambiences being recalled.

sation speaking in that person's voice. Some people did, however, report how context and the topic of conversation play a role. A breakdown of the responses is displayed in Fig. 3. According to some responses, for example, if a person had contributed to an idea originally introduced by someone else, the voice would be different. Some people also felt the voice of the interlocutor to be their own voice, or an impersonal generic voice.

Room acoustic phenomena, such as reverberation and other elements of ambience, were a divisive subject. While most participants recalled no ambience, the number of people who would recall the characteristics of spaces did not fall far behind. Some participants also felt that their recollection was stronger if the acoustics were distinctive or had significance in the memory. The responses are shown in Fig. 4.

The third section of the questionnaire collected parameters for the rendering of inner speech and other inner voices in VR. The results are displayed as violin plots in Fig. 5, whereby the representation shows both the distribution of the responses and the key statistics of the data. Inner speech was considered to be neither particularly loud nor quiet. The character of the inner speech was neither particularly boomy nor thin, which can be interpreted as lacking the effects of bone conduction. Inner speech should have very little reverberation or spatial cues, and the voice should originate from within the head. The voice should resemble a normal speaking voice with a clear fundamental, in favour of a more whisper-like tone. Nearly identical results were also reported for the expectations regarding the reproduction of inner speech in a VR experience.

The participants' expectations for the characteristics of imaginary characters followed similar patterns to those of inner speech. The spatial characteristics are defined contrarily, however, with an emphasis placed on reverberation and the auralisation of the characters. The pseudo-acoustic space in which the character is located is expected to be reverberant, and the character's voice should not localise inside the head.

4 Rendering Inner Voices in Virtual Reality

The questionnaire identified several disagreements with the initial assumptions. For some parameters, the results also varied significantly, making definitive conclusions difficult to draw. The primary goal of the second phase was to resolve the aforementioned uncertainties. The listening test approached the problem by deriving sound design parameters for an average participant based on the median values provided by the third section of the questionnaire. The integral questions were then tested with a number of alternate soundtracks to find which parameters best match the test subjects' experiences and expectations in terms of realism and entertainment.

Fig. 5: Experienced and expected parameters of two types of inner voices: one's inner speech and imaginary characters. (a) Reported characteristics of inner speech (70, 69, 69, 68, and 68 responses respectively). (b) Expected characteristics of inner speech in VR (65 responses for each category). (c) Expected characteristics of imaginary characters in VR (63, 62, 62, 63, and 63 responses respectively).
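The median-based derivation described above can be sketched in a few lines: each slider distribution collapses to its median, yielding one setting per parameter for an "average participant". The response values below are fabricated placeholders for illustration, not the study's data.

```python
# Sketch of deriving "average participant" sound-design settings by
# taking the median of each 0-10 slider distribution, as described in
# the text. The response lists are fabricated placeholder data.
from statistics import median

responses = {
    "loudness": [4, 5, 5, 6, 5, 7, 4],
    "reverb":   [0, 1, 0, 2, 1, 0, 3],  # dry .. reverberant
    "location": [1, 0, 2, 1, 0, 1, 1],  # inside .. outside the head
}

design_values = {name: median(vals) for name, vals in responses.items()}
# A low median on "reverb", for instance, motivates a nearly dry
# inner-speech render.
```

Medians are robust to the outliers that open-scale slider responses tend to produce, which makes them a natural choice over means here.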
The source material for the listening test was a VR short film produced by YLE and Aalto University as a part of the Human Optimised Extended Reality (HUMOR) project. The film is set in a prison, and the characters in the story include two (real) prisoners, an (imaginary) angel and demon, and the lead character himself, who is also a prisoner. The main character uses his vocal and subvocal speech to communicate with the real and imaginary characters.

4.1 Test Design

Three questions were left unanswered in the first phase of the study. Firstly, the subjects did not report experiencing nor expecting their inner speech to carry over the effects of bone conduction. Secondly, the participants did not have a clear consensus on whether imaginary characters should be experienced internally or externally. Finally, for both inner speech and imaginary characters, participants did not unambiguously agree on whether the sources should be reverberant.

To test these observations, four test scenes were created. The first two focused on inner speech, and the next two on imaginary characters. In each scene, a baseline render and two alternative renders were AB tested in three steps. The first scene evaluated the preferred amount of simulated bone conduction [12]; the second scene compared realistic and synthetic reverberation on the inner voice; the third scene compared internal, external, and close-proximity source positions of the imaginary characters; and the last scene tested realistic and synthetic reverberation of the imaginary characters' speech. In each step, the test subject was first asked to select the soundtrack they felt was the most realistic, and then the one they felt was the most entertaining. The order of comparisons of the soundtracks was randomised.

The test was conducted in a quiet conference room on an Oculus Rift S headset with Sony WH-1000XM3 headphones, using a test environment implemented in Cycling '74 Max. The subject had three buttons at their disposal: A and B buttons, which allowed them to switch on the fly between the two soundtracks currently under comparison, and a selector button that logged the preferred soundtrack. The test operator switched between scenes and test steps.

To promote intuitiveness and reduce the risk of over-analysing, the test subjects were not told what they should be listening for in each scene. This created
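The three-button procedure described above lends itself to a compact harness: randomise which render is presented as A and which as B, let the subject switch freely, and log only the final selection. The following is a hedged sketch of that logic, not the study's actual Max patch; all names are assumptions.

```python
# Minimal sketch of the AB comparison step described in the text: the
# A/B assignment of the two renders is randomised and the subject's
# final button press is resolved back to a soundtrack name and logged.
import random

def run_ab_step(baseline, alternative, choose, rng=random):
    """Present two soundtracks in random A/B order and log the pick.

    `choose` stands in for the subject's selector press; it receives
    the {"A": ..., "B": ...} assignment and must return "A" or "B".
    """
    pair = [baseline, alternative]
    rng.shuffle(pair)  # hide which render is the baseline
    assignment = {"A": pair[0], "B": pair[1]}
    picked = choose(assignment)
    if picked not in assignment:
        raise ValueError("selector must be 'A' or 'B'")
    return {"order": pair, "selected": assignment[picked]}

# Usage: a single step in which the subject happens to pick button A.
log = run_ab_step("dry_inner_voice", "reverberant_inner_voice",
                  choose=lambda tracks: "A")
```

In a real rig, `choose` would block on the hardware buttons and the A/B toggling would crossfade between synchronised playback streams; here it is collapsed to a callback for brevity.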
Fig. 7: Preferred reverberation of inner speech.

Fig. 8: Preferred location of imaginary characters.

than it was in reality, causing the room reverberation measured in-situ to sound unnatural.

The results for the third scene are presented in Fig. 8. This scene asked whether the voices of imaginary external characters should be rendered inside or outside the head. Most participants considered a render with the voices placed in their natural position in the room as the most realistic and the most entertaining. The voices placed inside the listener's head were not considered realistic; however, in the context of entertainment, the ratio between the internalised voices and the voices rendered very close became equal.

The final scene in the listening test examined the effect of reverberation in the voices of imaginary characters. The results are shown in Fig. 9. In terms of realism, choices were near-equal for all three options. Based on discussions with the test subjects, just as in the second scene, the room reverberation did not precisely match the perspective of the spectator, causing a sensory conflict. Imaginary characters, or hallucinations, were not familiar to many of the participants either. The expectations towards entertaining reproduction showed considerably more explicit results: in this context, the synthetic reverberation was voted to best match people's expectations.

5 Discussion

According to the initial assumptions of this study, the expected characteristics of inner speech should be close to the somatosensory response of speech, and hallucinations should be as realistic as possible. For the most part, these assumptions were confirmed by the questionnaire; the questions on the presence of bone conduction and the localisation of imaginary characters were left ambiguous, however. The listening test confirmed the initial assumption that hallucinations are in fact expected to be rendered outside the viewer's head. While it also seems
6 Summary

selection of commonly mentioned events, such as the sound in dreams fading away as one wakes up. Through a further listening test, it could be tested whether effects used in films, or effects inspired by actual inner experiences, improve immersion and the feeling of realism of VR media. A problem with inner speech representation in VR is that the voice actor heard by the spectator is not the voice of the viewer themselves. Since the experience of inner speech is closely tied to one's own voice identity, an unfamiliar voice is difficult to identify with. A research topic considered at an early stage of the study was the idea of leveraging recent deepfake technologies to render the inner speech of a character with the viewer's own voice, simulated by artificial intelligence.

7 Acknowledgements

The authors would like to thank Rébecca Kleinberger for her insights, and YLE for producing the VR film.

References

[1] Heavey, C. and Hurlburt, R., "The phenomena of inner experience," Consciousness and Cognition, 17(3), pp. 798–810, 2008, doi:10.1016/j.concog.2007.12.006.

[2] Baddeley, A. D. and Hitch, G., "Working Memory," in Psychology of Learning and Motivation, volume 8, pp. 47–89, Academic Press, 1974, doi:10.1016/S0079-7421(08)60452-1.

[3] Morin, A., Uttl, B., and Hamper, B., "Self-reported frequency, content, and functions of inner speech," Procedia - Social and Behavioral Sciences, 30, pp. 1714–1718, 2011, doi:10.1016/j.sbspro.2011.10.331.

[4] Wolpert, D. M., Ghahramani, Z., and Jordan, M. I., "An Internal Model for Sensorimotor Integration," Science, 269(5232), pp. 1880–1882, 1995, doi:10.1126/science.7569931.

[5] Raij, T. T., Valkonen-Korhonen, M., Holi, M., Therman, S., Lehtonen, J., and Hari, R., "Reality of auditory verbal hallucinations," Brain, 132(11), pp. 2994–3001, 2009, doi:10.1093/brain/awp186.

[6] von Békésy, G., "The structure of the middle ear and the hearing of one's own voice by bone conduction," The Journal of the Acoustical Society of America, 21(3), pp. 217–232, 1949, doi:10.1121/1.1906501.

[7] Tian, X. and Poeppel, D., "Mental imagery of speech: Linking motor and perceptual systems through internal simulation and estimation," Frontiers in Human Neuroscience, 6, p. 314, 2012, doi:10.3389/fnhum.2012.00314.

[8] Haggard, P. and Clark, S., "Intentional action: Conscious experience and neural prediction," Consciousness and Cognition, 12(4), pp. 695–707, 2003, doi:10.1016/s1053-8100(03)00052-7.

[9] Dudley, R., Aynsworth, C., Mosimann, U., Taylor, J., Smailes, D., Collerton, D., McCarthy-Jones, S., and Urwyler, P., "A comparison of visual hallucinations across disorders," Psychiatry Research, 272, pp. 86–92, 2019, doi:10.1016/j.psychres.2018.12.052.

[10] Vygotsky, L., Thought and Language, revised and expanded edition, MIT Press, Cambridge, MA, USA, 2012.

[11] Watson, J. B., Psychology: From the Standpoint of a Behaviorist, J. B. Lippincott, Philadelphia, PA, USA, 1919.

[12] Won, S. Y. and Berger, J., "Estimating transfer function from air to bone conduction using singing voice," in Proceedings of the 2005 International Computer Music Conference, International Computer Music Association, Barcelona, Spain, 2005.

[13] Politis, A., McCormack, L., and Pulkki, V., "Enhancement of Ambisonic binaural reproduction using directional audio coding with optimal adaptive mixing," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 379–383, 2017, doi:10.1109/WASPAA.2017.8170059.

[14] McCormack, L. and Politis, A., "SPARTA and COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods," in Proceedings of the AES International Conference on Immersive and Interactive Audio, pp. 1–12, 2019.