
Multimodality of objects in attention

Abstract: …

Introduction
If his hand comes in contact with an orange on the table, the golden yellow of the fruit, its savor and
perfume will forthwith shoot through his mind

William James “The Principles of Psychology”

In everyday life, what we see is what we hear and what we hear is what we see. Indeed, when
that happens not to be the case, it is jarring. Recall the last time you watched a film with
misaligned audio. Every out-of-sync sentence, every explosion roaring late, felt like a
phenomenological contradiction: a jolt of clashing sensory expectations. The curious
psychological mind asks for the source of, and reason behind, these experiences. Is it a mere
habit of eye and ear being denied? A conscious association subverted? Or does the root lie
deeper? Perhaps the structure of certain sights, from the earliest moment of perception, is
made to fit with sounds. More broadly, how early or how late in the unfolding of perception
does a unification of the senses occur?

The most intuitive answer is that joining auditory bits of information and visual bits of
information happens after each is independently processed. As 18th-century
empiricists would have it: Consciousness is presented with sounds of people speaking
and sights of lips moving. By mere habituation, the two are integrated. If the sounds
become misaligned, your conscious expectation gets denied. Voilà—this is the source
of the unease. What makes research in this area interesting is that this intuitive picture
is false. Over the past half-century, research into sensory multimodality—how senses
interact and integrate—has shown that many multimodal interactions occur in the
earliest moments of cortical processing. They occur not at the moment of integrating
separate streams of information but at the moment of forming the streams themselves!

Before diving into the details of this debate, it’s worth considering the faulty intuitions
again. Their failure to explain multimodal interactions parallels a related debate in
singular modalities. For a long time in psychology, thinkers (Hume, Locke, Hobbes,
James, Russell) believed vision to be a simple association process at heart. Dots of
color on your retina were to be associated with each other by nothing but co-
occurrence. The fact that you perceived an orange in front of you as an object was a
result of thousands of previous co-activations of orange sensations in a circular area.
Things in vision, and foremost meaning, came to be by the power of co-occurrence or
habituation. As Locke put it:

When we set before our Eyes a round Globe... the Judgment presently, by an
habitual custom, alters the Appearances into their Causes: So that from that,
which truly is variety of shadow or colour, collecting the Figure, it makes it pass
for a mark of Figure

This too was proven false. Many sensations become grouped and associated together well
before any process of judgment can take place. Early neural structures detect objecthood,
figure-ground contrast, and motion. Moreover, many stimuli become grouped into a "figure" or
category of objects even before habituation could take place. Faces are one example: we are
sensitive to them from our first encounters with the world, before any habit could form, and
their processing follows a dedicated perceptual pathway of its own.

The same theoretical shift occurred in research on multimodal perception. We started off
believing that perception of unified sounds and sights is nothing but co-occurrence of two
separate stimuli. As O'Callaghan describes:

We are thinking of each sense as explanatorily independent from the others. This
assumption of independence is evident at physiological, functional, and
experiential levels of explanation. ... it is possible to get a complete account of
perceiving with that sense modality that is explanatorily independent from the
others. … the senses constitute independent domains of philosophical and
scientific inquiry.

As I've mentioned, this was wrong. The reader need not trust my retelling of the neural
evidence and can verify this falsity firsthand. The McGurk effect, which can be
experienced by simply watching a video, is one compelling argument. The illusion
makes one perceive a different sound depending on the lip movements presented. No
amount of self-convincing or factual knowledge stops this multimodal integration.
Perception seems multimodal before any judgment or habituation occurs.

Nonetheless, we can credit empiricist philosophers where credit is due: associations will
likely never disappear from models of cognition. Rather, what we have learned is that they
precede consciousness. If one enjoys notions of higher- and lower-order cognition, many
unimodal associations reside in specific unconscious processing stages. In the case of
multimodal perception, there is no consensus on where such associations happen.
This article sheds light on the question by presenting the results of two experiments. We used
two paradigms to assess the claim that associations give grounds to multimodal objects in
attention. Below, we discuss past experiments and their shortcomings before turning to our
innovative approach.

2. Literature Review

There are two parts of the considered claim that should be laid out clearly. First, the
thesis postulates that multimodal interactions occur in attention itself, within perception,
and not in some more specific subprocess. The McGurk effect thus does not suffice as proof:
speech perception is a specific cognitive skill, with its own phylogeny, that uses many
mechanisms not necessarily shared more globally. Second, the thesis should be
distinguished from a weaker thesis postulating mere multimodal interactions in attention.
Multimodal interactions are a broader class which encompasses any cross-modal
influence, for example, transferring attentional load between modalities. We call an
interaction an integration only if it cannot be modelled linearly as a sum of the two
modalities. The perfect example is the already mentioned McGurk effect: one's
perception settles on a different sound than either of the two modalities provides. If
McGurk were just a simple interaction, subjects should perceive, with some distribution,
either the syllable that is heard or the one that is seen. Overall, while we do discuss
interactions between modalities as supporting our thesis, they do not suffice as proof.
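To make the criterion concrete, one way to write the additive null model (our notation,
offered only as a sketch, not a formula taken from any cited study) is

R_{AV} = \alpha R_A + \beta R_V + \varepsilon ,

where R_A and R_V are the responses evoked by the auditory and the visual stimulus presented
alone, and \alpha and \beta are free weights. An effect counts as an integration only when the
observed audiovisual response R_{AV} cannot be captured by any such weighted sum of the
unisensory responses; anything that can is, in our terminology, merely an interaction.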

Besides the McGurk illusion, there exists a second illusion showing integration between
modalities: the sound-flash illusion. When an auditory "beep" is played while a visual
flash is shown on a screen, the number of beeps heard influences the number of flashes
seen. Strikingly, even "when a single visual flash is accompanied by multiple auditory
beeps, the single flash is incorrectly perceived as multiple flashes". Moreover, the effect
persisted even when subjects were informed about the number of flashes displayed.
This phenomenon constitutes a multimodal integration, as a simple linear model
would instead predict a distribution of responses depending on which modality is
trusted more. However, besides providing evidence that this integration is outside
conscious control, the experiment does not clarify where integration occurs. The sound-
flash illusion could be an effect of integration in attention or perhaps even in pre-
attentive processing. Similarly, one can list a whole class of integrations that cannot
be consciously modulated: the ventriloquism effect, sound-induced flash brightening, and so
on. These illusions occur without conscious control, but by themselves they do not clearly
specify their origin.

The lack of information on the cognitive process responsible for integration does not
preclude knowledge about the neural location where it happens. Most broadly, there are
several so-called heteromodal regions that respond to inputs from multiple modalities.
These include the superior colliculus, the superior temporal sulcus (STS), lateral
intraparietal area, and both the visual and auditory cortices. For the latter two, a recent
study has shown that voxel decoding allows detection of auditory objects in the visual
cortex and vice versa. However, these methods (for separate reasons) cannot directly
support the presence of integration. Out of all these regions, the STS stands out, as it
has been repeatedly shown to be active during multimodal integrations. Yet, as debates
continue over where particular cognitive processes (like attention) occur in the brain,
none of these studies can uncover the source of multimodal objects.

Given these ambiguities, why would we suspect that integration happens in attention?
Here we can turn to the weaker class of multimodal interactions, of which plenty have been
observed in attention. Studies show that auditory cues improve performance
when detecting visual blinks on a screen. Though debated, some research indicates that
synchronized auditory and visual cues (forming a "supramodal binding feature")
enhance attentional efficiency. In visual search paradigms, an auditorily induced
"freezing effect" can occur: when a beep coincides with a searched-for target, gaze
duration increases and search accuracy improves. In linguistic cognition, seeing lip movements
improves attention to speech. Similarly, in non-linguistic cognition, paired visual stimuli
enhance sensitivity to variations in auditory volume.

Similarly confirmatory conclusions about multimodal interaction in attentional processing
can be drawn from a study measuring the object-specific preview benefit (OSPB). The OSPB
refers to the phenomenon whereby people respond faster to questions about a presented
stimulus when the object carrying that stimulus has not been disturbed. Subjects are briefly
shown two images, one on each of two moving boxes, and are then shown an image a second
time, either one that occurred before or a new one. The task is to confirm or deny that the
second image was shown at first. If the second showing is localized on the same moving box
that carried the image before, subjects respond faster. If the experiment establishes an
association between a sound and an image, the same OSPB is present when the second sound
(whose previous presence the participant confirms or denies) is played from a location
congruent with the box.
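To make the contrast of conditions easier to follow, here is a minimal, purely illustrative
Python sketch of one OSPB trial's logic; the names, items, and structure are our own
simplifying assumptions, not the design or code of the cited study.

import random
from dataclasses import dataclass

# Illustrative OSPB trial: two boxes each preview an item, the boxes move,
# then a probe item appears on one box. The object-specific preview benefit
# is a faster "old" response when the probe reappears on the box that
# originally carried it (same-object) than on the other box (different-object).

@dataclass
class Trial:
    previews: dict      # box id -> previewed item
    probe_item: str
    probe_box: int

def make_trial(items=("A", "B", "X")):
    previews = {0: items[0], 1: items[1]}      # "A" on box 0, "B" on box 1
    probe_item = random.choice(items)          # old item (A/B) or new item (X)
    probe_box = random.choice((0, 1))          # which box carries the probe
    return Trial(previews, probe_item, probe_box)

def condition(trial: Trial) -> str:
    """Label the trial for the same- vs. different-object RT contrast."""
    if trial.probe_item not in trial.previews.values():
        return "new-item"
    if trial.previews[trial.probe_box] == trial.probe_item:
        return "same-object"
    return "different-object"

if __name__ == "__main__":
    for _ in range(5):
        t = make_trial()
        print(t, "->", condition(t))

The empirical signature of the OSPB is then a reaction-time advantage for same-object over
different-object trials, with both conditions matched on the identity of the probe.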

The last piece of experimental evidence suggesting multimodal interactions in attentional
processing that we will consider comes from X. There, the authors asked whether audition
and vision share the same attentional resources by setting up conditions within the multiple
object tracking (MOT) task. MOT is an experimental paradigm created by Pylyshyn with the aim
of understanding how objects function in attention. The task consists of tracking, for a
short period of time (usually around 8 s), several targets (around 4) among a set of
randomly moving, identical-looking distractors. Such versions of MOT are typically
used to assess interactions between different stimulus properties in visual attention. In
X's study, MOT performance was shown to suffer when auditory attention was loaded with a
separate task. This directly suggests that a single attentional process could be
responsible for the two modalities.
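For readers unfamiliar with the paradigm, the bare trial logic can be sketched as follows.
This is a schematic Python illustration under our own assumed parameters (8 objects, 4
targets, an 8 s movement phase), not the stimulus code used in any of the studies discussed.

import random
from dataclasses import dataclass

# Schematic MOT trial: identical dots move randomly for the trial duration;
# a subset is cued as targets at the start, and accuracy is the proportion
# of cued dots among the dots selected at the end.

@dataclass
class Dot:
    x: float
    y: float
    vx: float
    vy: float

def run_trial(n_objects=8, n_targets=4, duration_s=8.0, dt=1 / 60):
    dots = [Dot(random.random(), random.random(),
                random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1))
            for _ in range(n_objects)]
    targets = set(range(n_targets))            # indices cued at trial start

    t = 0.0
    while t < duration_s:                      # movement phase
        for d in dots:
            d.x = (d.x + d.vx * dt) % 1.0      # wrap around the unit display
            d.y = (d.y + d.vy * dt) % 1.0
        t += dt

    # Stand-in for the participant's final selection; here a random guess,
    # so the printed accuracy approximates chance level (0.5 with 4 of 8).
    response = set(random.sample(range(n_objects), n_targets))
    return len(response & targets) / n_targets

if __name__ == "__main__":
    runs = [run_trial() for _ in range(200)]
    print(f"mean accuracy of a guessing observer: {sum(runs) / len(runs):.2f}")

In the real task the final selection comes from the participant; comparing accuracy across
conditions (for example, with and without a concurrent auditory load) is what carries the
theoretical weight.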

This allows us to turn to the two studies we take as predecessors. First, in an attempt to
explore interactions between visual and auditory attention, Y conducted a study in which
auditory cues in MOT provided information about the behavior of the moving objects.
The findings, however, were not confirmatory: the cues yielded no improvement.
The issue, as we see it, lies in the restricted information provided by the auditory cues,
which were triggered only when a moving object bounced off a stationary circle on
the screen. In real-world conditions, auditory cues provide far more spatial evidence
about object locations. As tracking accuracy in such paradigms depends entirely on
location information, a more ecologically valid context might represent human attentional
processing more accurately.

The second study we consider was done by Z, who attempted to create an environment
reflective of real-world conditions by providing constant auditory signals about object
locations and using 3D glasses. While improvements were seen with one target, the additional
auditory information gave no benefit for two or more targets. As with the previous study, we
believe the auditory information was unrepresentative of how attention functions in the real
world. In Z's paradigm, auditory information was independent of the subject's head
movements. As we move our heads, the signal received at the ears changes, and this change is
a key component of sound source localization. Since MOT results depend on the accuracy of
location tracking, distorted auditory information likely impedes natural attentional
processing.
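To illustrate the point about head movements, here is a toy Python calculation (a sketch
based on the classic spherical-head approximation, not anything implemented in Z's study or
in ours): turning the head changes the sound's azimuth relative to the ears, and with it the
interaural time difference that a listener can exploit for localization.

import math

HEAD_RADIUS_M = 0.0875     # assumed average head radius
SPEED_OF_SOUND = 343.0     # m/s in air

def itd_seconds(source_azimuth_rad: float, head_yaw_rad: float) -> float:
    """Interaural time difference for a distant, frontal source
    (Woodworth spherical-head approximation)."""
    theta = source_azimuth_rad - head_yaw_rad                 # azimuth relative to the head
    theta = math.atan2(math.sin(theta), math.cos(theta))      # wrap to (-pi, pi]
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))

if __name__ == "__main__":
    source = math.radians(30)                                 # source 30 degrees to the right
    for yaw_deg in (0, 15, 30):
        itd_us = itd_seconds(source, math.radians(yaw_deg)) * 1e6
        print(f"head yaw {yaw_deg:>2} deg -> ITD {itd_us:5.0f} microseconds")

In a presentation that does not update with head orientation, this lawful change is absent,
which is precisely the distortion we point to above.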

Overall, the literature provides significant evidence that multimodal integration occurs
in early perceptual processing. Moreover, several results show multimodal interactions
in attention. The two studies done to date, though yielding negative results, have, in our
opinion, failed to recreate environments reflecting natural cognition. Given the
importance of ecological validity in interpreting lab studies, we have implemented VR
technology to improve on past work.

3. Methodology

Performance in tasks
Neural stuff
Studies done to date
Issues
Situations where this is not the case are jarring. Recall the last time you watched a film
with misaligned audio. It is as if the two modalities were clashing against each other, never
to settle on coherent objects. Given such phenomena, curiosity asks where this "clash"
originates. Is it just a conscious expectation that can be corrected? After all, a viewer
subjected to such a cinephile's nightmare might adjust after a while. Or could it precipitate
from attention? Perhaps the sensory experience loses coherency only after one selects objects
to attend to? Or could it lie even deeper? Might it be the sheer functioning of the senses?
Maybe the sensations coming from the retina are encoded in a structure that demands coherency
with sensations from the eardrums?

Even as recently as the early 1990s, many directed their answers to the above set of
puzzles into the realm of conscious processing. Interactions between modalities were
thought about much as the earliest psychology thought about interactions within
modalities. To connect sounds to images was to do after-processing: to recall, perhaps,
statistical dependencies between the two ("hey, isn't it close to certain that
mouth movements correlate with voice?"), just as early empiricists believed that
visual processing is statistical processing of unordered colour patches. The analogy
continues, as for both intra- and inter-modal cases the "mere after-processing" idea has
proven tough to uphold. We have learned of very early (in terms of the order of cognitive
processing) interactions within visual stimuli (e.g., Gestalt effects). We have also
discovered early interactions between vision and audition, as in the McGurk effect, where the
sight of mouth movements influences the heard sound.

Thus, we can reject the idea of mere conscious consideration as the source of clashes
between modalities. However, being more precise about the depth of the interactions
between modalities is still an open problem. The present study pertains to this exact
issue. The hypothesis that proudly stands in the title is that even in the earliest
attentive processing, where mechanisms localized in X Y track objects around the
organism, the objects are multimodal. That is, auditory and visual stimuli can be
joined into a single object rather than two "co-occurring" separate entities.

Literature Review
For our hypothesis to be convincing, it must be mentioned that neither object nor stimulus
is a well-defined notion. Indeed, even in the much-constrained area of attentional
processing in perception, the notion of an object varies from paradigm to paradigm. For our
task we can accept a definition reliant on spatiotemporal localisation … CITE

Having somewhat pushed this worry away, we can now turn to why one would suspect
that multimodal integration occurs during early perceptual processing. As we
have mentioned, one can rather convincingly reject an extreme alternative on which all
multimodal integration occurs in higher-order cognition. The mentioned McGurk
illusion is a prime example worth considering. The effect of a differing
percept based on the integration of the two stimuli persists even when one is
aware of the effect. The reader may check for themselves.

Link to mcgurk
Yet, admittedly, linguistic cognition is an outlier process from many perspectives.
Mention the effect that you hear better when you see? T. A second illusion, yet non-
linguistic at heart, is the sound-flash effect. Here the subject perceives two visual flashes
when presented with one visual flash accompanied by two auditory stimuli. The effect
similarly persists even when consciously known. There are multiple such phenomena,
and over the last 20 years even these two have proven to constitute a solid case against a
purely empiricist view of attentive processes.
As was mentioned in the beginning, any further case for deeper interactions between
modalities in attention is far less clear cut.

Here should be the support for the idea:

list effects that support the idea that there is multisensory integration
then list effects that contradict the idea that there is multisensory integration
then describe the MOT task and the OSPB task

To approach this idea we employed two very well-researched paradigms: multiple object
tracking and the object-specific preview benefit.

Multiple object tracking is a paradigm established by Pylyshyn to study how attention
selects objects. The experiment is a task of following a few target objects (usually
circles) out of a set of identical copies. At the beginning of each trial the objects to be
followed are indicated (often by momentarily changing colour). Then the objects go through a
phase of movement, at the end of which the participant is asked to correctly point out the
objects they were supposed to track. This paradigm has allowed researchers to verify a
multitude of effects that influence how objects are represented in attention: by measuring
the percentage of correctly selected objects under differing conditions, we learn the
influence of motion, set size, speed of movement, and so on.

To date, two studies have attempted to assess the effect of auditory information in this
paradigm. X modified the paradigm to include auditory information as a helping signal
when objects bounced off a big circle in the middle of the screen. It had previously been
shown that visual cues marking the moment of a bounce (like a change in the colour of the
circle being bounced off) help with tracking. X confirmed this effect and showed a similar
benefit for auditory cues indicating when objects bounce off. However, the benefit of
audiovisual cues was shown to be no stronger than the benefit of the visual cues alone. A
second attempt to verify the possible effect of multimodal information was a 3D MOT task in
which audio was localized congruently with the locations of the objects. The two changes to
the original paradigm were the inclusion of 3D glasses and of speakers playing the sounds of
the moving balls. The reported results indicated that for one target, tracking performance
was better in the audiovisual condition; yet with multiple targets no beneficial effect was
found.

The two studies pertaining to this issue leave ample space for improvement. When
localizing objects using auditory information, a big source of information is the change in
the audio produced by head movements. When hearing a sound to the right, you can tilt your
head up, down, and to the sides, and by the change in volume (and timbre) confirm the
location of the object. Neither of the two versions of the MOT task allowed this most
natural way of using auditory information to be employed.

1. Method
2. Results
3. Discussion

Other

studies have found that reaction times for cross-modal stimulus detection are significantly
faster than the prediction of redundancy models (Murray et al., 2005; but see Otto &
Mamassian, 2012; Pannunzi et al., 2015, for alternative explanations).
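For reference, the race-model (redundancy) bound that such detection studies test is
standardly written as an inequality on the cumulative response-time distributions (standard
formulation, not quoted from the source above):

P(\mathrm{RT}_{AV} \le t) \le P(\mathrm{RT}_{A} \le t) + P(\mathrm{RT}_{V} \le t) ,

and audiovisual RT distributions that violate this bound at some t are taken as evidence of
coactivation rather than a race between independent unisensory channels.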
In this Element, the expression multisensory interaction will be used as an all-inclusive
term referring to any cross-modal influence. This term encompasses multisensory effects that
are based on general-purpose brain mechanisms (e.g., alerting processes and statistical
redundancy effects) as well as effects that are based on specific cross-modal convergence
mechanisms. The term multisensory integration, instead, will be reserved for the latter case;
that is, when the process relies on machinery that is specifically multisensory. This
includes, for example, multisensory convergence mechanisms and/or effects whose outcome is
not reducible to statistical redundancy. In such cases, multisensory events elicit
behavioural or neural responses that cannot be predicted by linear models based on combining
unisensory responses (e.g., Colonius & Diederich, 2004; Laurienti, Kraft, Maldjian, Burdette,
& Wallace, 2004; Maddox et al., 2015; Molholm et al., 2002; Murray et al., 2005; Senkowski,
Talsma, Herrmann, & Woldorff, 2005; Talsma, Doty, & Woldorff, 2006).

Fairhall & Macaluso (2009) devised an fMRI

experiment where subjects watched two pairs of speaking lips (presented side

5 This YouTube link contains a demonstration of the ‘multisensory’ cocktail party


problem: https://

youtu.be/mN–nV61gDo

15Elements in Perception
https://doi.org/10.1017/9781108578738 Published online by Cambridge University
Press

by side) while hearing one voice at the same time. BOLD responses (a measure

of brain activity) in the superior temporal sulcus (STS: a well-known multi-

sensory brain area) were stronger when visual spatial attention was covertly

directed to the lips that corresponded to the voice, as compared to when

attention was directed away from the congruent lips.

There are many features we can direct our attention to. One feature that is shared by
both auditory and visual events is their location in space. Attentional processing can
occur in a bottom–up (exogenous) manner, for instance, when a salient event pops out
from its background (Van der Burg et al., 2008). In this case, an object is selected even
though the observer was not planning to select it. In other cases, attentional processing
operates in a top–down (endogenous) manner in which the observer voluntarily controls
what is attended and what is not (Koelewijn et al., 2010). Simultaneous and co-
localised AV cues enhance saliency and attract attention when individual stimuli are
less effective. Both spatial attention and multisensory integration can take place in
higher hetero-modal brain areas (e.g., superior colliculus, thalamus, superior temporal
sulcus, and intraparietal areas) but also in early primary sensory areas (e.g., primary
visual and auditory cortices) in parallel fashion (Koelewijn et al., 2010; Macaluso and
Driver, 2005; Wuerger et al., 2012).

The human occipito-temporal region hMT+/V5 is well known for processing visual motion
direction. Here, we demonstrate that hMT+/V5 also represents the direction of auditory motion
in a format partially aligned with the one used to code visual motion. We show that auditory
and visual motion directions can be reliably decoded in individually localized hMT+/V5 and
that motion directions in one modality can be predicted from the activity patterns elicited
by the other modality.

(from: Shared Representation of Visual and Auditory Motion Directions in the Human
Middle-Temporal Cortex)

Human early visual cortex was traditionally thought to process simple visual features such as
orientation, contrast, and spatial frequency via feedforward input from the lateral
geniculate nucleus (e.g., [1]). However, the role of nonretinal influence on early visual
cortex is so far insufficiently investigated despite much evidence that feedback connections
greatly outnumber feedforward connections [2-5]. Here, we explored in five fMRI experiments
how information originating from audition and imagery affects the brain activity patterns in
early visual cortex in the absence of any feedforward visual stimulation. We show that
category-specific information from both complex natural sounds and imagery can be read out
from early visual cortex activity in blindfolded participants. The coding of nonretinal
information in the activity patterns of early visual cortex is common across actual auditory
perception and imagery and may be mediated by higher-level multisensory areas. Furthermore,
this coding is robust to mild manipulations of attention and working memory but affected by
orthogonal, cognitively demanding visuospatial processing. Crucially, the information fed
down to early visual cortex is category specific and generalizes to sound exemplars of the
same category, providing evidence for abstract information feedback rather than precise
pictorial feedback. Our results suggest that early visual cortex receives nonretinal input
from other brain areas when it is generated by auditory perception and/or imagery, and this
input carries common abstract information. Our findings are compatible with feedback of
predictive information to the earliest visual input level (e.g., [6]), in line with
predictive coding models [7-10].

1. Introduction

Idea for the introduction


Start with the story of misaligned audio,
and say that it might sound about right to say that it's just a conscious expectation formed
on independent stimuli that makes it feel weird.
then do a surprise
NO, it's actually not; there is something way deeper than mere association happening.
Give a strong statement: interactions between modalities happen as profoundly as
interactions within modalities; even in the earliest moments when your brain is making
sense of colour dots on your retina, sounds do make a difference

Then go for the metaphor of learning about what happens in a single modality
we thought (Hume, structuralism in perception) that vision makes sense of colour dots
by connecting nonsensical colour dots together
100 or more years later we know that thinking in terms of colour dots was a residue of
using metaphors like painters creating images from thousands of strokes. Early visual
processing is far more ordered into objects than we had thought. This, however, does
not mean that the idea of associating things to create meaning vanished from
psychology; it's just that the building blocks with which we do the associating changed
similarly it goes with our thinking about processing between modalities. We thought
just back in the 80's that the eyes are independent of the ears: both get some objects,
and the interactions between them are mere associations. Exactly as for a single
modality, we have learned that this is not the case. I mentioned the early neural stuff.
But for the neuro-sceptics among us, tens of illusions have been discovered that are
easily done on oneself with no fMRI involved; list the few illusions and maybe a result
or two
Again, this does not mean that the only real rule of psychology has somehow become
false; association is how we create meaning, the question is what we associate. Colour
dots, contoured 2D objects, 3D objects, 3D objects with extra properties stored?
As meaning through association is the central tenet of psychology, we have come up
with several ways of testing its building blocks. One is by studying the blocks in
attentive processing

Then describe MOT

Describe the reasons for the idea

Go on to literature review
