Representing Inner Voices in Virtual Reality Environments

Kuura Parkkola1 , Thomas McKenzie2 , Jukka Häkkinen3 , and Ville Pulkki1
1 AcousticsLab, Department of Information and Communications Engineering, Aalto University, Espoo, Finland
2 Acoustics and Audio Group, Reid School of Music, University of Edinburgh, United Kingdom
3 Visual Cognition Research Group, Department of Psychology and Logopedics, University of Helsinki, Finland

Correspondence should be addressed to Kuura Parkkola ([email protected])

The inner auditory experience comprises various sounds which, rather than originating from sources in their
environment, form as a result of internal processes within the brain of an observer. Examples of such sounds are, for
instance, verbal thoughts and auditory hallucinations. Traditional audiovisual media representations of inner voices
have tended to focus on impact and storytelling, rather than aiming to reflect a true-to-life experience. In virtual
reality (VR) environments, where plausibility is favoured over this hyper-real sound design, a question remains on
the best ways to recreate realistic, and on the other hand, entertaining inner and imagined voices via head-tracked
headphones and spatial audio tools. This paper first presents a questionnaire which has been completed by 70
participants on their own experience of inner voices. Next, the results of the questionnaire are used to inform a VR
experiment, whereby different methods to render inner voices are compared. This is conducted using a short film
created for this project. Results show that people mostly expect realism from the rendering of inner voices and
auditory hallucinations when the focus is on believability. People’s expectations for inner voice did not change
considerably in an entertainment context, whereas for hallucinations, exaggerated reverberation was preferred.

1 Introduction

The subjective auditory experience of an individual comprises numerous stimuli that originate not only from
external sources in their environment but also from internal processes within the nervous system and brain [1].
Such sounds include, for instance, hallucinations and inner monologue. Internally generated sounds are henceforth
referred to in this paper as inner voices. The psychology behind inner voices has been studied extensively in the
literature, and the development of verbal thought has been researched since the early 20th century by
psychologists such as L. Vygotsky and J. Piaget, whereas more recent studies have investigated the neurological
mechanisms behind subvocal speech as well as its characteristics [2, 1, 3]. Hallucinations are a well-researched
but not comprehensively understood topic. Over the previous decades, many potential mechanisms that provide a
sound basis for the phenomenon have been proposed [4, 5].

While the psychology and neurology of inner voices are well understood, their reproduction is not. Films, video
games, and various other media often present inner voices in particular ways that best fit the artistic vision of
the work. These representations, however, typically rely on well-established industry conventions rather than
scientific research. Realism is defined in this paper as authenticity to real life, whereas hyper-realism is a
deliberate, inauthentic exaggeration, in order to
Parkkola, McKenzie, Häkkinen, and Pulkki Inner Voices in VR media

increase engagement and serve the storytelling. For example, film dialogue is expected to sound clear and crisp,
and similar sound effects are chosen for specific types of events. While these hyper-realistic approaches are
immersive in more traditional media, virtual reality (VR) experiences engage with a spectator in a fundamentally
different way by making the experience more dynamic and interactive. Therefore, the representation of inner
voices in VR requires investigation.

This paper studies both realistic (plausible) and entertaining (engaging) reproduction of inner voices in VR
environments and how prioritising either changes the experience. The aim is to answer the question of how sounds
like the perceived sound of one’s own thoughts, auditory hallucinations, and other such sounds should be
represented, and by extension, how they should be rendered in practice using spatial audio techniques. This is
approached via a two-stage study comprising a questionnaire and a listening test. The questionnaire aims to
understand people’s experiences and expectations regarding inner voices. The purpose of the listening test is to
then test a set of inner experience simulations derived from the questionnaire results on a group of test
subjects.

2 Inner Voices in the Brain

The human auditory system can be examined on three layers. A physical layer captures sound vibrations, a
neurological layer transmits and processes them, and a cognitive layer perceives and understands them. The
auditory system is also connected to speech production through a number of physical and neurological paths.
Physical feedback results from sound vibrations created through speech being received as sensory inputs through
the ears [6]. This phenomenon defines some of the characteristics, such as the effects of bone conduction, that
people associate with their own voice. The neurological connections are important for the semantic understanding
of language and various other cognitive phenomena. One such connection is the inner simulation of motor actions.
As the premotor and motor cortices interact with muscles, copies of these stimuli are sent as feedback into the
brain areas that typically receive relevant physical and somatosensory feedback [4]. This behaviour allows the
brain to track its performance. Research has shown that these efference copies are sent regardless of whether
physical movements are made [7]; therefore, speech is carried back to the auditory regions of the brain during
both vocal and subvocal speech. This could be one factor contributing to the formation of inner speech.

The key difference between inner speech and auditory verbal hallucinations (AVH) is agency. Inner speech holds a
sense of agency which is believed to enable the brain to detect language inputs originating from within. This is
thought to be connected to efference copies [8, 5]. In hallucinations, the sense of agency is lost, which causes
confusion in the brain since it can no longer accurately detect internally generated stimuli. AVHs are difficult
to discern from reality [9].

While neurology can provide a context for how inner voices form, psychology helps to understand their perceptual
features. One theory explaining human consciousness is the multi-component model of working memory [2]. The
theory identifies three primary processes of consciousness: the central executive and its two subsystems, the
visuospatial scratchpad and the phonological loop. The phonological loop itself comprises two components: a
temporary store which can hold on to verbal content for a few seconds, and an articulatory rehearsal process
which feeds and refreshes the store. The articulatory process is thought to be linked to vocal and subvocal
speech.

The characteristics of inner voices have been approached from many directions. One influential theory is
Vygotsky’s theory on the development of language and thought [10], which argues that subvocal speech is an
internalised version of voiced speech, often serving the same purpose as the egocentric speech customary to
small children. Studies done by J.B. Watson also found that children and adults employ similar techniques in
problem-solving, with the distinction that children often go through their thinking process out loud, whereas
adults think subvocally [11]. In a 2011 study, 380 people were interviewed on their use of inner speech [3]. The
most common interactions were found to revolve around one’s appearance and state of affairs, such as finances,
stress and future, the planning of actions and future conversations, various problem-solving tasks, and
self-regulation.

3 Questionnaire on Inner Auditory Experience

In the first phase of this work, a three-part questionnaire was created to collect statistics on people’s inner

experiences. The questionnaire focused on three areas: inner speech, auditory hallucinations, and sound in
dreams.

The topics of the questionnaire were chosen based on a number of expected characteristics: inner speech in VR
should sound similar to the perceived qualities of one’s own voice, with bone conduction present and little-to-no
reverberation. The voice should be perceived inside the head. Auditory hallucinations should be difficult to
tonally discern from physical sources, except when aiming for entertainment, where the separation should be
easier to make. These assumptions were derived from background research and the authors’ introspection.

3.1 Questions

The goal of the first section was to acquire first-hand experiences of inner voices, including ones that had not
been thought of when designing the questionnaire, and it was therefore worded to be open-ended and not
suggestive. The section focused on two types of inner voices: inner speech and imaginary hallucination-like
sounds. The ordering of the sections was chosen to minimise bias by allowing the participants to think freely
before asking more direct questions. The questions are listed below:

• If inner monologue is a part of your thinking, how would you describe your inner monologue?
• Have you ever experienced any form of an auditory hallucination? If so, how would you describe it?
• Can you recall any dreams you have recently had, can you describe how different things sounded like in your
  dream?

The second section comprised three questions aiming to more directly validate specific assumptions. Each question
had a selection of predefined options assumed to be the most common ones, but open responses were also allowed.
The questions are presented below:

• Is your inner voice located... In the centre of your head? / In front of you? / Above you? / Other (specify).
• When you think about something someone else has said or might say, do you sense the voice as... Your own
  voice? / Their voice? / Other (specify).
• When you think about something you have heard, do you sense the sound... In the ambience of the space you were
  in? / In the ambience of the space you are currently in? / Without any ambience? / Other (specify).

The data collected in the first and second sections were difficult to use directly as inputs for concrete sound
design parameters. In the third section, therefore, a selection of parameters was defined based on introspection
while designing the survey. Here, values for each parameter were collected with sliders ranging from 0 to 10,
each end corresponding to one of the edge cases for the parameter, as listed below:

• Loudness - quiet to loud.
• Depth - boomy to thin.
• Spatial characteristics - dry to reverberant.
• Tonality - tonal to whispery.
• Location - inside to outside.

The participants repeated this part for each of three contexts: their experience of their own inner speech; their
expectation for the tone of simulated inner speech; and their expectation for the voices of imagined
hallucination-like characters.

3.2 Results

Responses were gathered over two months from 70 people of diverse backgrounds with ages ranging from 22 to 76
(median age of 32). Some participants did not answer all of the questions and some answers did not contain useful
information, so the number of valid responses is disclosed alongside the data. With the wide range of ages among
the participants, some minor age-related bias may be present in the results, particularly in the third stage of
the questionnaire, which discussed relatively recent technology.

The first section produced open responses which had to be quantified before they could be analysed further.
First, the responses were inspected for recurring features, which were used to define a set of labels. The
responses were then processed again, marking each response with the relevant labels. The labels and their
relative frequencies are displayed in a word cloud in Fig. 1.
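The two-pass labelling described above can be illustrated with a minimal sketch. It assumes that each open response has already been marked by hand with a set of labels; the function name `label_frequencies` and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def label_frequencies(labelled_responses):
    """Return (label, share of responses) pairs, most frequent label first."""
    if not labelled_responses:
        return []
    counts = Counter()
    for labels in labelled_responses:
        # A label is counted at most once per response.
        counts.update(set(labels))
    total = len(labelled_responses)
    return [(label, count / total) for label, count in counts.most_common()]


# Hypothetical labels assigned to four open responses (illustration only).
responses = [
    {"own voice", "dialogue"},
    {"own voice"},
    {"visual"},
    {"own voice", "problem-solving"},
]
freqs = label_frequencies(responses)
```

The resulting relative frequencies are the kind of weights a word cloud would use to scale each label.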

(a) Inner speech (65 responses)   (b) Auditory hallucinations (39 responses)   (c) Sounds in dreams (46 responses)

Fig. 1: Word cloud representations of the questionnaire results.

Many contributors characterised their inner speech as similar to their own voiced speech. Some of the subjects
noted, however, that their largely verbal thoughts also contained visual and unsymbolised elements. Only a few
people described the qualities of their inner speech, but the ones who did mentioned features such as lack of
speech defects, being formed of incomplete sentences, and causing physical movements of the mouth, like when
speaking out loud. Some specified the quality of the voice as relaxed and low-pitched. The most commonly reported
uses of subvocal speech were inner dialogues preparing for social interactions, self-control, and
self-reflection. Many subjects also reported using their inner monologue for problem-solving and explaining
complex concepts to themselves. This is in keeping with the results of [3].

Characteristics of auditory hallucinations were reported by far fewer people compared to inner speech. Many
responses described various neurophysiological phenomena, such as tinnitus, which have little relevance to inner
voices. Environmental sounds, such as door creaks, footsteps, and police sirens, were also commonly reported.
Intentional sounds, including, for example, speech, music, and calls to one’s own name, were less frequent but
not rare. Speech and music were reportedly experienced in lowered states of awareness and high-stress situations,
and environmental sounds were commonly paired with anxiety or as ’deja vu’-type phenomena.

Many participants reported not remembering their dreams or the sound in them. These sounds were mostly
characterised as similar to any sound heard while awake. In some accounts, the sound would be extremely focused
such that only the focal sound would be perceived with little to no background ambience. The most common sounds
recalled from dreams were spoken words and musical tones.

The next part of the questionnaire featured more targeted questions focusing on the location, agency, and
acoustic features of inner voices. Firstly, Fig. 2 presents a heatmap of the locations in the head where inner
speech was perceived to originate from. Most subjects considered their inner speech to come from the centre of
their head (47 people), though some also stated the back of the head (5 people); either above the head (4 people)
or inside the head, but towards its crown (4 people); near the eyes and mouth area (3 people); or all over in an
indefinite location (7 people).

Fig. 2: Heatmap of perceived inner voice location (70 responses).

When recalling past conversations, most people recalled the experience of the other person in the conversation
speaking in that person’s voice. Some people did, however, report that context and the topic of conversation play
a role. A breakdown of the responses is displayed in Fig. 3. According to some responses, for example, if a
person had contributed to an idea originally introduced by someone else, the voice would be different. Some
people also felt the voice of the interlocutor to be their own voice, or an impersonal generic voice.

Fig. 3: The voice perceived in memories of previously had conversations (67 valid responses).

Room acoustic phenomena, such as reverberation and elements of ambience, were a divisive subject. While most
participants recalled no ambience, the number of people who would recall the characteristics of spaces did not
fall far behind. Some participants also felt that their recollection was stronger if the acoustics were
distinctive or had significance in the memory. The responses are shown in Fig. 4.

Fig. 4: The presence of acoustical characteristics and other ambience features in memories of past events (67
valid responses). Then stands for the time of the event, now refers to the time of completing the questionnaire,
and according to recollection refers to only noticeable ambiences

The third section of the questionnaire collected parameters for the rendering of inner speech and other inner
voices in VR. The results are displayed as violin plots in Fig. 5, whereby the representation shows both the
distribution of the responses and the key statistics of the data. Inner speech was considered to be neither
particularly loud nor quiet. The character of the inner speech was neither particularly boomy nor thin, which can
be interpreted as lacking the effects of bone conduction. Inner speech should have very little reverberation or
spatial cues, and the voice should originate from within the head. The voice should resemble a normal speaking
voice with a clear fundamental rather than a more whisper-like tone. Nearly identical results were also reported
for the expectations regarding the reproduction of the inner speech in a VR experience.

The participants’ expectations for the characteristics of imaginary characters followed similar patterns to those
of inner speech. The spatial characteristics are defined contrarily, however, with an emphasis placed on
reverberation and the auralisation of the characters. The pseudo-acoustic space in which the character is located
is expected to be reverberant, and the character’s voice should not localise inside the head.

4 Rendering Inner Voices in Virtual Reality

The questionnaire identified several disagreements with initial assumptions. For some parameters, the results
also varied significantly, making definitive conclusions difficult to draw. The primary goal in the second phase
was to resolve the aforementioned uncertainties. The listening test approached the problem by deriving sound
design parameters for an average participant based on the median values provided by the third section of the
questionnaire. The integral questions were then tested with a number of alternate soundtracks to find which
parameters best match the test subjects’ experiences and expectations in terms of realism and entertainment.
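As a rough illustration of how median slider values could be turned into a parameter profile for an "average participant", the sketch below takes the per-parameter median over a set of responses. The slider names follow the five parameters listed in Section 3.1; the response values, the `describe` helper, and its one-point tolerance around the slider midpoint are assumptions made for this example and are not taken from the study.

```python
from statistics import median

# Slider end points from the questionnaire: 0 maps to the first word, 10 to the second.
PARAMETERS = {
    "loudness": ("quiet", "loud"),
    "depth": ("boomy", "thin"),
    "spatial": ("dry", "reverberant"),
    "tonality": ("tonal", "whispery"),
    "location": ("inside", "outside"),
}


def median_profile(responses):
    """Median slider value per parameter, ignoring unanswered (None) sliders."""
    profile = {}
    for name in PARAMETERS:
        values = [r[name] for r in responses if r.get(name) is not None]
        profile[name] = median(values) if values else None
    return profile


def describe(name, value, tolerance=1.0):
    """Map a median to a slider end point, or 'neutral' near the midpoint (assumed rule)."""
    if value is None:
        return "no data"
    if abs(value - 5) <= tolerance:
        return "neutral"
    low, high = PARAMETERS[name]
    return low if value < 5 else high


# Hypothetical slider responses (illustration only, not data from the study).
responses = [
    {"loudness": 5, "depth": 5, "spatial": 1, "tonality": 2, "location": 0},
    {"loudness": 4, "depth": 6, "spatial": 0, "tonality": 3, "location": 1},
    {"loudness": 6, "depth": None, "spatial": 2, "tonality": 1, "location": 0},
]
profile = median_profile(responses)
summary = {name: describe(name, value) for name, value in profile.items()}
```

The median is less sensitive to a few extreme slider positions than the mean, which suits bounded 0 to 10 scales with skewed responses.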

(a) Reported characteristics of inner speech (70, 69, 69, 68, and 68 responses respectively)

(b) Expected characteristics of inner speech in VR (65 responses for each category)

(c) Expected characteristics of imaginary characters in VR (63, 62, 62, 63, and 63
responses respectively)

Fig. 5: Experienced and expected parameters of two types of inner voices: one’s inner speech and imaginary
characters.

The source material for the listening test was a VR short film produced by YLE and Aalto University as a part of
the Human Optimised Extended Reality (HUMOR) project. The film is set in a prison, and the characters in the
story include two (real) prisoners, an (imaginary) angel and demon, and the lead character himself, who is also a
prisoner. The main character uses his vocal and subvocal speech to communicate with the real and imaginary
characters.

4.1 Test Design

Three questions were left unanswered in the first phase of the study. Firstly, the subjects did not report
experiencing or expecting their inner speech to carry over the effects of bone conduction. Secondly, the
participants did not have a clear consensus on whether imaginary characters should be experienced internally or
externally. Finally, for both inner speech and imaginary characters, participants did not unambiguously agree on
whether the sources should be reverberant.

To test these observations, four test scenes were created. The first two focused on inner speech, and the next
two on imaginary characters. In each scene, a baseline render and two alternative renders were AB tested in three
steps. The first scene evaluated the preferred amount of simulated bone conduction [12], the second scene
compared realistic and synthetic reverberation on the inner voice, the third scene compared internal, external
and close-proximity source positions of the imaginary characters, and the last scene tested realistic and
synthetic reverberation of the imaginary characters’ speech. In each step, the test subject was first asked to
select the soundtrack they felt was the most realistic, and then the one they felt was the most entertaining. The
order of comparisons of the soundtracks was randomised.

The test was conducted in a quiet conference room on an Oculus Rift S headset with Sony WH-1000XM3 headphones
using a test environment implemented in Cycling ’74 Max. The subject had three buttons at their disposal: A and B
buttons, which allowed them to switch on the fly between the two soundtracks currently under comparison, and a
selector button that logged the preferred soundtrack. The test operator switched between scenes and test steps.

To promote intuitiveness and reduce the risk of over-analysing, the test subjects were not told what they should
be listening for in each scene. This created

possible ambiguity, however, since an untrained ear might have difficulty in hearing differences between the
soundtracks. Since the focus of the test was on the subjective and partially subconscious experience of the test
subject, this ambiguity was considered a part of the experiment; thus, the test subjects were further instructed
to answer using intuition. To understand the reasoning behind a participant’s responses, the test operator
discussed the scenes with the participants after the test had finished.

The soundtrack variations for the listening test were compiled from three types of source material. On-location
Ambisonic recordings were captured as the video was shot. The same recording setup was used to separately capture
the lead character’s external speech. Mono recordings of dialogue were captured with lavalier microphones worn by
the non-imaginary characters. The inner speech of the lead character and the voices of the imaginary characters
were recorded in a studio environment onto mono tracks. The soundtracks were mixed in Steinberg Cubase. The
resulting mixes were in Ambisonic format, and rendered to headphones using the SPARTA VST plugin HO-DirAC
binaural decoder¹ [13, 14]. The Oculus fed head orientation data to Max, which allowed for dynamic compensation
of head rotations.
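The idea of compensating head rotations in an Ambisonic scene can be sketched for the simplest case. The example below is not the processing used in the study, which relied on a higher-order binaural decoder; it only counter-rotates a single first-order frame about the vertical axis. The ACN channel order (W, Y, Z, X), SN3D normalisation, yaw-only tracking, and the function names are all assumptions made for this illustration.

```python
import math


def encode_horizontal(azimuth_rad):
    """First-order encoding gains (ACN order W, Y, Z, X; SN3D) for a horizontal plane wave."""
    return (1.0, math.sin(azimuth_rad), 0.0, math.cos(azimuth_rad))


def compensate_yaw(frame, head_yaw_rad):
    """Counter-rotate one first-order frame so that sources stay fixed in the room.

    head_yaw_rad is the head yaw, positive when the listener turns to the left
    (counter-clockwise when seen from above).
    """
    w, y, z, x = frame
    theta = -head_yaw_rad  # the scene turns opposite to the head
    c, s = math.cos(theta), math.sin(theta)
    # W and Z are unaffected by a rotation about the vertical axis.
    return (w, x * s + y * c, z, x * c - y * s)


# A source straight ahead, heard after the listener turns 90 degrees to the left,
# should end up on the listener's right (azimuth of -90 degrees).
front = encode_horizontal(0.0)
turned = compensate_yaw(front, math.pi / 2)
expected = encode_horizontal(-math.pi / 2)
```

A full head-tracked renderer would also handle pitch and roll and higher-order components, typically by applying a rotation matrix per Ambisonic order before binaural decoding.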

¹ sparta-site/docs/plugins/hodirac-suite/#binaural

4.2 Results

The listening test was completed by 17 people with backgrounds mostly in acoustics and engineering, and ages
ranging from 19 to 32 (median age of 24). In most cases the responses were unambiguous, with one of the
soundtracks in a scene selected more times than the others. On some occasions, a participant chose a different
render in each of the three steps, thus giving each render one vote and leaving the distribution of the responses
unaffected. These ambiguous results are labelled here as inconclusive. The ratio between conclusive and
inconclusive results is used to judge uncertainty, since these responses display either indecisiveness or an
inability to hear differences in the audio material.

The first scene considered the amount of simulated bone conduction expected by the listener in the inner speech
of the lead character. The results are presented in Fig. 6. Many subjects reported that the differences between
conditions were difficult to notice. In these cases the test subjects were instructed to make the selection based
on their intuition. The similarity between the three options increases uncertainty in the results. The responses
show a slight skew towards the soundtrack having a small amount of simulated bone conduction when looking for
realism, and an expectation for light to notable bone conduction in the entertainment context.

Fig. 6: Preferred rendering of bone conduction in inner speech: (a) realism context, (b) entertainment context.

The second scene focused on the reverberation applied to the inner speech of the main character. The results
displayed in Fig. 7 show that people expect their inner speech to stand out acoustically from the actual room in
the case of realism; in the case of entertainment, the results are similar but less pronounced. It should be
noted that in the second scene, a few people mentioned difficulty discerning the synthetic reverberation from the
baseline. Some also pointed out that from the perspective of the spectator, the VR space seemed larger

than it was in reality, causing the room reverberation measured in situ to sound unnatural.

Fig. 7: Preferred reverberation of inner speech: (a) realism context, (b) entertainment context.

The results for the third scene are presented in Fig. 8. This asked whether the voices of imaginary external
characters should be rendered inside or outside the head. Most participants considered a render with the voices
placed in their natural position in the room as the most realistic and entertaining. The voices placed inside the
listener’s head were not considered realistic; in the context of entertainment, however, the ratio between the
internalised voices and the voices rendered very close became equal.

Fig. 8: Preferred location of imaginary characters: (a) realism context, (b) entertainment context.

The final scene in the listening test examined the effect of reverberation in the voices of imaginary characters.
The results are shown in Fig. 9. In terms of realism, choices were near-equal for all three options. Based on
discussions with the test subjects, just as in the second scene, the room reverberation did not precisely match
the perspective of the spectator, causing a sensory conflict. Imaginary characters, or hallucinations, were not
familiar to many of the participants either. The expectations towards entertaining reproduction showed
considerably more explicit results. In this context, the synthetic reverberation was voted to best match people’s
expectations.

5 Discussion

According to the initial assumptions of this study, the expected characteristics of inner speech should be close
to the somatosensory response of speech, and hallucinations as realistic as possible. For the most part, these
assumptions were confirmed by the questionnaire; the questions on the presence of bone conduction and the
localisation of imaginary characters were left ambiguous, however. The listening test confirmed the initial
assumption that hallucinations are in fact expected to be rendered outside the viewer’s head. While it also seems

that the effects of bone conduction are expected in VR simulations of inner speech, the results do not provide a
strong enough indication for definitive claims to be made, and further investigation is therefore required.

Fig. 9: Preferred reverberation of imaginary characters: (a) realism context, (b) entertainment context.

The listening test results show notable variation in both reverberation-related questions. Both questions
produced clear differences in the context of realism and entertainment. In the case of inner speech, dry
non-reverberant speech was deemed the most realistic and entertaining. Some participants mentioned after the test
that they had not chosen the room reverberation as realistic because, from the spectator’s perspective, the space
sounded smaller than it looked.

Auditory hallucinations are not an approachable topic for most people. An indication of this is the nearly 50%
reduction in the number of questionnaire responses to the hallucination questions compared to those of inner
speech. The same is also evident in the listening test, where participants disagreed on the acoustic
characteristics of realistic hallucinations. On the other hand, participants were able to give more conclusive
data on the entertainment characteristics of hallucinations. This suggests that fewer people personally
experience auditory hallucinations, but have a clear idea of how they should be perceived in a film setting
nonetheless. The clear peak on the synthetic reverberation suggests that voices of the imaginary characters
should be rendered in some space, but the space should be easily discernible from the actual room. Many of the
participants also confirmed this idea in the post-experiment interview.

6 Summary

This paper has investigated people’s perception of their inner voices, namely, inner speech and auditory
hallucinations, to find out how these sounds should be rendered using spatial audio techniques in virtual reality
(VR) in either a realistic or entertaining way.

A questionnaire collected experiences of inner speech, auditory hallucinations, and sound in dreams in the
participants’ own words, as well as the perceived location of the inner speech, and recollection of agency and
ambience via semi-open multiple-choice questions. Parameter data for the rendering of inner speech and auditory
hallucinations was collected for use in the second stage of the study.

The parameters were evaluated in a listening test using a short VR film, whereby participants viewed the film in
VR whilst switching between different possible rendering options, before rating which was deemed the most
realistic and entertaining. Results of the listening test suggest that people expect their inner speech in VR to
include some effects of bone conduction. They also expect imaginary characters to localise to their perceived
positions outside the head. The inner speech was expected to either have no reverberation or a large synthetic
one. The subjects did not agree on what type of reverberation was expected for imaginary characters to be
realistic. For the case of entertainment, the consensus was, however, that a synthetic reverberation that gives
the characters space but isolates them acoustically from the actual environment sounded better.

Possible future avenues for research include the use of special effects to improve the realism or entertainment
of VR video. The questionnaire gave inspiration for a
