
Audio Engineering Society

Convention Paper 10375

Presented at the 148th Convention, 2020 June 2-5, Online

This convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author's advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Content matching for sound generating objects within a visual scene using a computer vision approach

Dan Turner¹, Chris Pike², and Damian Murphy¹

¹University of York
²BBC Research and Development

Correspondence should be addressed to Dan Turner ([email protected])

ABSTRACT

The increase in and demand for immersive audio content production and consumption, particularly in VR, is driving the need for tools to facilitate creation. Immersive productions place additional demands on sound design teams, specifically around the increased complexity of scenes, the increased number of sound producing objects, and the need to spatialise sound in 360◦. This paper presents an initial feasibility study for a methodology utilising visual object detection in order to detect, track, and match content for sound generating objects, in this case based on a simple 2D visual scene. Results show that, while successful for a single moving object, there are limitations within the current computer vision system used which cause complications for scenes with multiple objects. Results also show that the recommendation of candidate sound effect files is heavily dependent on the accuracy of the visual object detection system and the labelling of the audio repository used.

1 Introduction

With the rise in popularity of spatial audio content in recent years, the term "immersion" or "immersive" has been used to describe a plethora of content related to new audio experiences. However, it is not uncommon for this term to be used vaguely and interchangeably with others, for example, naturalness, envelopment, presence, or realism [1], and differing definitions can cause confusion [2].

It is helpful to have a clear definition of what is meant by 'immersion' and therefore what would constitute immersive content and, in turn, immersive sound design. The following definition of immersion is used in this research, as defined by Agrawal et al. [2]:

Immersion is a phenomenon experienced by an individual when they are in a state of deep mental involvement in which their cognitive processes (with or without sensory stimulation) cause a shift in their attentional state such that one may experience disassociation from the awareness of the physical world.

Agrawal et al. concluded there are two prominent perspectives on immersion, the first being an individual's psychological state and the second being an objective property of a technological system [2].

The latter is rejected by Agrawal et al. due to its exclusion of the individual and their experience, which, being highly subjective, is seen as of paramount importance.

Immersive content can, therefore, be described as content which is designed to elicit a state of immersion in line with the given definition. While not a prerequisite, many modern day immersive experiences use technologies such as 360◦ audio and video to provide the subjective sense of being surrounded and, in the case of audiovisual content, provide an experience of multi-sensory stimulation. The production of such content has become increasingly popular in recent years, with companies such as the BBC [3], Facebook [4], and Google [5] releasing tools and content for such experiences.

In this paper we propose a methodology, based on an initial feasibility study, that aims to streamline immersive sound design workflows through the application of a computer vision approach that facilitates the detection, spatial tracking, and content matching of appropriate audio assets to sound generating objects within a visual scene. The current study is undertaken using a simple 2D scene as a proof of concept, with a view to expanding to 360◦ spatial audio/video in future work.

2 Background

2.1 Sound Design for Immersive Content

Sound design is a well established area of practice and is traditionally associated with the design and production of sound for the purposes of film, TV, radio programmes, and video games. Such sounds often fall broadly into the categories of sound effects, dialogue, and music [6]. In the context of this paper we are specifically interested in sound design practice for new immersive film, TV, and radio content.

The aesthetics and practice around these content forms are still new or as yet unknown, but it is clear that workflows are changing from established practices such that sound designers will need to adapt. At present, however, little formal research has been conducted on the subject. Immersive audio production often places additional demands on sound design teams, with little or no additional time allocated to complete the task. By their very nature, 360◦ scenes are often more complex and contain many more sound cues or sound generating objects, and require a greater level of detail in order to be plausible and 'fill' the additional visual and hence auditory space. Users have a greater level of interactivity with the scene, allowing them to control how they view and focus on individual aspects within it, and often the audience is no longer limited by the framing of a single shot.

Due to this extended visual field, sound spatialisation plays an important role in creating immersive content by facilitating the positioning of audio sources around the complete 360◦ space. The aim is to create a state of immersion by increasing the sense of realism and the ability to interact with the virtual environment [7]. It also allows for increased engagement (or, more specifically, the lack of disruption caused to a user's engagement) by ensuring consistency within the environment, such that visual and auditory information is considered spatially congruent [7]. Furthermore, sound spatialisation can also take advantage of listener expectation to reinforce the sense of immersion [8] with respect to sound location (such as aircraft emanating from above the listener).

Spatialisation of sound can, however, be a time consuming task and one which has seen the development of a multitude of tools for use within Digital Audio Workstations (DAWs) [9, 10, 4]. Automating aspects of the process could allow sound designers to have a greater focus on those tasks requiring creative decision making, as opposed to those where the decision making process is simple but the task itself is more procedural, iterative, or labour intensive.

A common component of many immersive experiences (though admittedly not all) is the presence of visual information, normally in the form of a video. Using this readily available data correctly could provide a wealth of information about the audio scene being constructed, and this could then be utilised to inform context-aware automated or semi-automated audio outcomes. The utilisation of computer vision techniques could lend itself to this application of machine-assisted sound design, especially within the context of automatic content matching and object tracking.

2.2 Visually driven sound design


Computer vision is an established area of machine learning that focuses on making sense of the information contained within digital images and videos. Two of the most common tasks involved in many computer vision applications are object detection (localisation of objects within a given image) and object classification (estimating which of a given set of classes the object is most likely to belong to). Algorithms for such tasks will often have to deal with evaluating multiple objects within a given image.

Within the field of sound design there are some examples of how visual features can be matched to audio files in a database or used to synthesise sounds from this visual information [11, 12].

Owens et al. [11] trained a recurrent neural network (RNN) to map visual features to audio features, which were then transformed into a waveform either by matching them to already existing audio files in a database or by, what the authors describe as, parametrically inverting the features. The sounds synthesised were of people hitting and scratching different surfaces and objects with a drum stick. While this is still somewhat distant from complex soundscape creation, it outlines a general approach that could be used to produce other plausible sound objects. Performance of the model was measured via a psychophysical study using a two-alternative forced choice test in which participants were required to distinguish real and machine-generated sounds. Results were mixed, with parametric generation performing well for materials which were considered more noisy, for example leaves and dirt, but performing poorly for harder surfaces such as metal and wood. It was also found that matching the mapped audio features to existing audio files was ineffective for textured sounds such as splashing water.

The online sound installation Imaginary Soundscapes [12] creates 'pseudo' soundscapes by extracting Convolutional Neural Network (CNN) features from an image and matching these with corresponding features for sound files from a database of environmental sounds. These features were extracted using a CNN architecture based on SoundNet [13].

While the authors of [12] document no formal testing, it appears that the system is capable of extracting relevant sound files based on the location of specific objects within the scene. However, at times the choices made based on the scene's visual content can be slightly ill-suited, such as audio which appears to be a recording of a train station (including crowd noise and announcements over a loudspeaker system) matched to an image of a church interior. If the aim of the system is to produce more accurate/plausible soundscapes, this could be accomplished by adding a human user in the loop to rate the suitability of the audio content presented for the given scene. Alternatively, the system could provide options for the human user to choose between, which would be more akin to the functionality of a production tool.

A user's ability to interact with the installation in [12] is very limited. They are able to provide an image or select a location but have no control over the resulting soundscape. The aim of this work is to use computer vision as a collaborative agent, much in the same way other machine learning techniques have been utilised to act as collaborators for music composition [14] and sound synthesis [15].

3 Methods

3.1 Google's object detection API

It should be noted that the scope of this feasibility study has been limited to the use of existing visual recognition tools, with the method adopted in this paper being built around Google's object detection API [16]. The API is written in TensorFlow and is used to detect, locate, and classify content from a simple 2D video frame image. The tutorial for the API is easily adapted to run detection on the frames of a video, following the instructions detailed in [17]. The description which follows of how this is applied is illustrated in Fig. 1, which shows the flow of information and the different components that make up the wider system.

The model used for this paper is the Single Shot Detection (SSD) meta-architecture with the Inception V2 feature extractor, chosen because it gave a good compromise between speed, accuracy, and memory usage. The model was run on an Intel Core i5-600 CPU @ 3.20GHz, with 8GB RAM, and an Intel HD 530 integrated graphics processor. It should be noted that the specification of the computing platform being used will greatly influence the time it takes to run the detection. Using the aforementioned specification, it took approximately 75 seconds to run the process of detection and information extraction on a 7.97s video filmed at 30 frames per second (fps). This roughly equates to 0.32s per frame. This can be reduced to approximately 65s (0.27s per frame) if the visualisation of the detection output is bypassed.


The object detection system provides a variety of data related to each frame, including the number of detections, the classes detected, detection scores (confidence), and object bounding box coordinates. The system first collates this data for each frame so it can be used to create the object dictionary. This contains a unique ID number for each detected object, the class number of the object detected, and the coordinates relating to the bounding box position of each object. Following the collection of this data, it is then used in several processes outlined in the following sections.
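For illustration only, a minimal Python sketch of this per-frame collation, assuming a hypothetical detect_objects() wrapper around the TensorFlow model and OpenCV for frame reading; it is not the implementation used in the study.

```python
import cv2  # assumed available for reading video frames


def detect_objects(frame):
    """Hypothetical wrapper around the TensorFlow object detection model
    (SSD + Inception V2): returns bounding boxes, class IDs, and confidence
    scores for a single frame. A real implementation would run the trained
    inference graph as in the API tutorial."""
    raise NotImplementedError


def collect_frame_data(video_path, score_threshold=0.5):
    """Collate per-frame detection output (boxes, classes, scores) so it can
    later be grouped into the object dictionary (see Sec. 3.2)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes, classes, scores = detect_objects(frame)
        detections = [
            {"class": int(c), "score": float(s), "box": [float(v) for v in b]}
            for b, c, s in zip(boxes, classes, scores)
            if s >= score_threshold  # discard low-confidence detections
        ]
        frames.append(detections)
    cap.release()
    return frames
```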

3.2 Continuity between frames

For the proposed system to be successful it must be able to correctly group the data for each detected object across successive frames, or create a new object ID if a detected object is considered as being new to the scene. The original API is designed for detection on a single image, and this is done iteratively over each frame of video input, but it has no internal reference or 'memory' for anything taking place during previous iterations or frames. Each frame is considered as a standalone individual image rather than as a constituent part of a wider sequence in which each frame will (usually) have some temporal and/or spatial relationship with those preceding it. This results in a higher chance of misclassification between frames.

Fig. 1: Flow chart illustrating the order of operations and flow of data within the proposed methodology.

The use of single image detectors within many video object detection systems is an acknowledged problem. For instance, they are unable to take advantage of available temporal information, such as objects in adjacent frames being in similar locations [18], which can lead to lower confidence levels and misclassification between frames. There is a wealth of current research within the computer vision community aimed at increasing the accuracy of video object detection systems by exploiting the available temporal information, see, e.g., [18, 19]. In this implementation, to account for between-frame misclassifications, and to accurately group object data across multiple frames to their associated object IDs, a basic continuity check has been implemented. An adaptation of the Intersection over Union (IoU) evaluation metric was used, which, as described in [20], is a common method of comparing the similarity between arbitrary shapes by calculating a normalised measure focused on the areas of the shapes. This measure is a ratio of the area of intersection (Fig. 2a) over the area of union (Fig. 2b).


Traditionally this is a metric used when training object detection systems and is calculated using ground truth boxes (hand labelled bounding boxes for the testing set specifying the location of the objects) and prediction boxes (boxes generated by the object detection system indicating where it predicts the objects are located). Accuracy is deemed sufficient if the IoU value exceeds a user specified amount (usually 0.5), with the IoU taking values between 0 and 1. This metric is appropriate as it is expected that an object in the current frame will be in a similar location to its position in the previous frame. If the IoU value is above the set threshold the object is defined as being the same as that identified in the previous frame, otherwise it is treated as a new object and is added to the object dictionary.

It should be noted this method has limitations, which are discussed in Sec. 4.2.2.

Fig. 2: IoU can be calculated by dividing the area of intersection, panel (a) (the area covered by the overlap of the two boxes), by the area of union, panel (b) (the total area covered by the two boxes). Within this work it is used as a continuity check on objects within the visual scene, taking advantage of the similar locations an object will occupy within the current and previous frames.
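For illustration, a minimal sketch of the IoU calculation and threshold check described above, assuming boxes in the normalised (ymin, xmin, ymax, xmax) form output by the detection system; it is not the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (ymin, xmin, ymax, xmax)."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def continuity_check(prev_box, curr_box, threshold=0.5):
    """Treat the current detection as the same object if IoU exceeds the threshold."""
    return iou(prev_box, curr_box) > threshold
```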
3.3 Sound Effects Suggestions

Once the object dictionary has been compiled it is used to generate a list of suggested sound effects from the chosen repository of audio files, which in this case is the BBC sound effects archive [21]. In this instance each unique object class detected is compared to the metadata tags from the archive, which is an open source repository made up of 16,011 labelled audio files. The archive is available to download as WAV files and is subject to terms of use under the RemArc Licence, which permits use for personal, educational, and research purposes. It was chosen because it provides a large database of labelled audio files containing a variety of different acoustic scenes and events, with tagging and metadata stored in an associated .CSV file. Table 1 shows examples of the tagging and metadata format common to each audio file in the database. Tagging consists of the description of each sound effect (as taken from the original CD) and the category (e.g. Engines: Petrol, Engines: Diesel) to which it belongs. The metadata stored is the length of the audio file in seconds, the name of the original CD containing the effect, and the track number. There are some inconsistencies within the tagging conventions, with not all audio files having an associated category and/or CD origin name. Any inconsistencies within a database's tagging convention may impact its effectiveness when used as data for training and evaluating machine listening systems [22].
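As an illustration of this lookup, a minimal sketch assuming a local copy of the archive metadata as a CSV with a 'description' column (as in Table 1); the file name, column name, and case-insensitive matching are assumptions rather than details taken from the paper.

```python
import pandas as pd  # assumed available


def candidate_sound_effects(class_label, csv_path="bbc_sound_effects.csv"):
    """Return archive entries whose description contains the detected class
    label (e.g. 'person'), mirroring the string comparison described above.
    The CSV path and 'description' column name are assumptions about the
    local copy of the archive metadata."""
    metadata = pd.read_csv(csv_path)
    mask = metadata["description"].str.contains(class_label, case=False, na=False)
    return metadata[mask]


# Example usage (hypothetical): suggestions = candidate_sound_effects("person")
```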

3.4 Object Tracking

Alongside checking for continuity, object location data can also be used to calculate the trajectory of each object over the course of the video, which can then be utilised to position and pan audio content. Object trajectory is calculated by computing the centre point of an object's bounding boxes, as shown in Fig. 3. The data can then be transmitted to DAWs such as Cockos Reaper [23] via OSC [24] in order to populate automation data for the desired parameter. In the case of stereo panning, the horizontal portion of the trajectory data needs to be normalised to between 0 (hard left) and 1 (hard right). Due to the temporal resolution available in Reaper's automation lanes, the resolution of the location data was reduced by a factor of two, resulting in 15 discrete points per second for a 30fps video.
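A minimal sketch of this trajectory-to-pan step, assuming normalised (ymin, xmin, ymax, xmax) boxes for one tracked object, the python-osc package, and an example host, port, and Reaper OSC address pattern ("/track/1/pan"); it is illustrative rather than the authors' implementation.

```python
from pythonosc.udp_client import SimpleUDPClient  # assumed: python-osc package


def centre_x(box):
    """Horizontal centre of a (ymin, xmin, ymax, xmax) box; boxes are assumed
    normalised to [0, 1], so this already maps to a pan value. Pixel
    coordinates would need dividing by the frame width first."""
    return (box[1] + box[3]) / 2.0


def send_pan_automation(boxes_per_frame, host="127.0.0.1", port=8000):
    """Convert per-frame bounding boxes for one object into stereo pan values
    (0 = hard left, 1 = hard right) and send them to Reaper over OSC.
    Every second frame is kept, giving 15 points per second at 30 fps,
    i.e. a 2 / 30 = 0.0667 s timestep. The OSC address pattern depends on
    Reaper's OSC configuration; "/track/1/pan" is only an example."""
    client = SimpleUDPClient(host, port)
    for frame_idx, box in enumerate(boxes_per_frame):
        if frame_idx % 2:  # reduce resolution by a factor of two
            continue
        pan = min(max(centre_x(box), 0.0), 1.0)  # clamp to valid pan range
        client.send_message("/track/1/pan", pan)
```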


3.5 Test Material Specification

Two test videos were created to allow for direct and controlled evaluation of simple scenes containing a single object and multiple objects. A photographic image containing animals was also sourced from the internet in order to assess capabilities relating to candidate audio file recommendation for non-human objects. All videos were recorded on an iPhone SE at 1080p and 30 frames per second, at a distance of 5m, and have the following conditions:

• Video 1 − Single person walking from left to right of scene.

• Video 2 − Two people walking ∼1.5m apart from left to right of scene.

Example images from the two video examples are shown in Fig. 4 and Fig. 5.

Description | Duration (s) | Category | CD Number | CD Name | Track #
Two-stroke petrol engine driving small elevator, start, run, stop. | 194 | Engines: Petrol | EC117D | Diesel and Petrol Engines | 4
Single-cylinder Petter engine, start, run, stop. (1 1/2 h.p.) | 194 | Engines: Diesel | EC117D | Diesel and Petrol Engines | 1
Single hen | 63 | — | EC31A | Chickens | 1
Motorcycle Scrambling: General atmosphere, pre-1965 machines, 250-500cc | 194 | — | EC5M4 | Motorcycle Scrambling and General Atmosphere | 1

Table 1: Examples of the metadata format associated with the BBC sound effects archive. Available metadata fields consist of a description, duration in seconds, category, CD number, CD name, and track number. As shown, there is inconsistency within the archive, as not all audio files contain information within the category and CD name fields.

Fig. 3: Single frame taken from a test video; the preceding trajectory of the detected object has been plotted and shows a good match.

4 Results

4.1 Candidate Sound Effects Recommendations

The system takes approximately 4s to compile a list of candidate sound effect recommendations for Video 2, returning a total of 36 recommendations (a selection of which are shown in Table 2), of which 6 were considered usable for the given scene. Those deemed unsuitable were for reasons such as a different environment/activity to the one in the example video (e.g. a person exiting a car and a person in an ice skating rink). The current implementation takes the class label as a string of characters and compares this to the tags in the metadata. If an exact match is found it will determine the associated audio file as being a candidate sound effect. A limitation of this method is the reliance on exact matching of tags between the database and the class labels of the detection system. It is therefore unable to recommend audio files which may be suitable but whose tags use different (yet still related) terms, such as 'man', 'woman', or 'human' if detecting the class 'person'. Tagging within the BBC archive is inconsistent (admittedly due to the repository consisting of many decades' worth of archived audio files), meaning many potentially suitable sound effects go undetected using the current string comparison method.


An alternative method which may alleviate this issue would be to train another machine learning algorithm to detect and recognise synonyms, a task often associated with lexical substitution [25], which requires systems to predict alternatives for a target word while maintaining its meaning within the context of a sentence. Within our use case this could be used to suggest candidate audio files whose tags may not exactly match the detected class but are deemed, by the system, to have similar meaning.
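A minimal sketch of such a synonym-aware match, assuming a pre-trained word-embedding table (e.g. word2vec or GloVe vectors loaded into a Python dictionary) and an illustrative similarity threshold; this is not part of the system described in the paper.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def related_tags(class_label, tag_words, embeddings, threshold=0.6):
    """Return tag words whose embedding is close to the detected class label.
    `embeddings` is assumed to map lower-case words to vectors loaded from a
    pre-trained word2vec/GloVe model; the 0.6 threshold is illustrative."""
    query = embeddings.get(class_label.lower())
    if query is None:
        return []
    related = []
    for word in tag_words:
        vec = embeddings.get(word.lower())
        if vec is not None and cosine(query, vec) >= threshold:
            related.append(word)
    return related
```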
Fig. 4: Image from a single video frame extracted from example Video 1 and used as input for the object detection system to generate candidate audio file recommendations. The detected object's location is indicated by the green bounding box and is assigned the class label of 'person'.

Fig. 5: Image from a single video frame of Video 2, used to derive panning information for two moving objects within a 2D visual scene. The example video is of two people crossing the field of view from left to right approximately 1.5m apart.

Limitations also exist relating to the type of detection system used. Google's API is for object detection and is limited to detecting specific objects. It therefore does not allow for prediction of activities taking place within the scene, such as the walking in Videos 1 and 2. As such, the system did not retrieve the 1,484 sound effects containing the term 'footsteps', which may have been suitable as candidate sound effects. It also lacks the functionality of scene recognition systems to predict more generic scene elements such as location, e.g. living room, beach, or city centre, which may help to inform recommendations for audio files relating to environmental/atmosphere sounds.

4.2 Spatial positioning and trajectory tracking

4.2.1 Single Object

Fig. 3 shows a single frame taken from Video 1 where the trajectory of the detected object has been plotted. The trajectory appears to accurately track the object travelling across the field of view and takes into account the variations in the centre point of the bounding box that occur when walking, as well as variation in the object's speed, in this case indicated by the non-uniform distribution in spatial proximity of the data points.

Fig. 6a shows horizontal panning data plotted over time in frames, derived from the object's positional data. This highlights the precision with which the system takes into account the object's variation in speed as it crosses the field of view, reflected in the gradient of the plotted data. It should be noted that it is the distance moved by the centre of the object's associated bounding box between each frame that is being tracked, rather than the object itself. For objects whose movement causes bounding boxes of varying sizes (such as a human walking with their arms swinging) this may produce variable results. Once the object exits the field of view the panning value defaults to 0, which may present problems for objects whose audio needs to remain active for a set time after being no longer visible. However, this issue is a feature of the current 2D-only implementation. The field of view in 360◦ audio/visual content is often dictated by the direction a user is facing, therefore allowing objects outside the field of view to still be tracked, as the video content extends beyond the field of view. There will, however, be additional, alternative limitations and situations that will need to be considered when using 360◦ audio/visual content.


Candidate Audio File Recommendations

Walking, 1 person in mud
Footsteps, one person walking in mud
Cars: 1.6 GL (Manual) 1982 model Ford Cortina. Interior, door opens, person exits, door closes
Ice Skating, one person circling close, others in the distance on indoor rink
Footsteps, one person walking in water

Table 2: Selection of candidate audio file recommendations generated from Fig. 4. Each file was defined by the system as being a potential candidate if the metadata field 'description' contained an exact match for the detected object's class name, in this case 'person'.

Fig. 6b shows the horizontal trajectory data translated into panning information within a Reaper stereo track automation lane. Upon visual inspection, the reduction in data does not seem to have had an adverse effect on trajectory trends. The linear interpolation generated by Reaper has little impact on the overall trend due to the size of the time steps, but may have a perceptual impact for larger timesteps. The timestep is dependent on the video's fps and is the length of time between each discrete data point of panning data (in this case the timestep is 0.0667s). A reduction in fps results in an increased timestep duration, which may introduce greater spatial mismatches between the visual object and the associated auditory material. The results from previous literature vary greatly with respect to the angular offset required in order to create a perceptually noticeable misalignment. Using reaction time (RT) measurements as an indirect method of measurement, a significant difference was found in [26] from 5◦ to 10◦ onwards for the Simon effect (the observation that responses in two-alternative forced-choice tests, where space is a parameter not relevant to the task, are faster if the stimulus presentation and response side match, and slower if the stimulus is presented in the visual hemisphere opposite the response side), and it was concluded that for speech signals even small audio-visual offsets can subconsciously influence the spatial integration of sources. Future work will ascertain the minimum resolution of panning data required to maintain congruence between the object's visual position and the position of the associated audio content.

Fig. 6: Horizontal panning data plotted over time as derived from example Video 1. (a) Original data output from the system; the y axis has been flipped to match Reaper's, and the data has been normalised to between 0 and 1 to match the values used by Reaper. (b) Stereo panning data derived by using every second data point to account for the resolution available in Reaper's automation lanes.

4.2.2 Multiple Objects

While the system handles the tracking of single objects well, multiple objects introduce extra complexities. When presented with a scene containing multiple objects, the API will store and output the data for the detected objects according to the associated confidence scores, beginning with the highest score. This results in the data for specific objects being output in a different order for each frame, depending on how the confidence scores change throughout the video. This then affects how the data interacts with the object dictionary compiler. The system compiles the object dictionary, and therefore how the trajectory data is grouped, according to the results of the continuity check. This relies on the detected object's data being output in the same order each time. When presented with the objects in a different order, the check uses positional data from a different object, and if the distance between the objects is great enough (which it usually will be), the continuity check fails and the objects in the current frame are defined as new. This causes the panning data from what should be a single object to be spread across multiple entries within the object dictionary. The change in confidence score over the length of the video resulted in a total of 32 objects being added to the dictionary. Due to the object detector in this implementation being based on a single image detector, it is not straightforward to override the ordering method in order to create a more consistent output order on a frame-by-frame basis. This presents a problem for projects that require not only accurate location and classification of objects, but also the ability to track them through frames and for them to be recognised as pre-existing or new objects. It may be possible to address this problem using an alternative object detection system.
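One possible order-independent remedy, not explored in this study, is to associate each detection with the previously seen object of highest IoU rather than relying on the output order; a minimal sketch, reusing the iou() helper from the Sec. 3.2 sketch, with an assumed threshold of 0.5.

```python
def match_detections(prev_objects, detections, iou_threshold=0.5):
    """Order-independent association: assign each detection to the existing
    object with the highest IoU, creating a new ID when no overlap is found.
    prev_objects: {object_id: last_box}; detections: list of boxes."""
    assignments = {}
    next_id = max(prev_objects, default=-1) + 1
    for box in detections:
        best_id, best_iou = None, 0.0
        for obj_id, prev_box in prev_objects.items():
            score = iou(prev_box, box)  # iou() as sketched in Sec. 3.2
            if score > best_iou:
                best_id, best_iou = obj_id, score
        if best_iou >= iou_threshold and best_id not in assignments:
            assignments[best_id] = box
        else:
            assignments[next_id] = box  # treat as a new object
            next_id += 1
    return assignments
```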


5 Conclusions

In this paper, a methodology for detecting, tracking, and matching content for sound generating objects within a simple visual scene has been presented. Work to date allows for successful interaction between a large labelled audio repository and the visual content of a simple 2D scene as taken from a video. Suggested positional data for dynamic audio content can be attached to a single visual object within a scene. However, at present, the system is unable to support multiple objects due to limitations of the system being used. Erroneous results are also produced from the candidate sound effects search, dependent on the accuracy or interpretation of the labels attached to the database of audio files.

6 Future Work

Future work will look into current methodologies for tracking objects throughout a scene while ensuring unique identification is maintained, such as the use of Kalman filters [27]. It would also be of interest to investigate the numerical relationship between an object's position within a 2D visual scene and the panning value a sound designer may assign, as it may not be the case that an object at the extremes of the visual field will be assigned a value at the extremes of the available panning values. This research also raises questions about the impact this kind of technology could have on the current working practices of sound designers working with immersive content, and user testing would be carried out once a suitably functioning prototype has been developed. Finally, future work will utilise a GPU in order to reduce the time taken to run detection and extract the relevant data.

7 Acknowledgements

This project is supported by an EPSRC iCASE PhD Studentship in partnership with BBC R&D.

References

[1] Francombe, J., Brookes, T., and Mason, R., "Evaluation of spatial audio reproduction methods (Part 1): Elicitation of perceptual differences," Journal of the Audio Engineering Society, 65(3), pp. 198–211, 2017.

[2] Agrawal, S., Simon, A., Bech, S., Bærenstein, K., and Forchhammer, S., "Defining Immersion: Literature Review and Implications for Research on Immersive Audiovisual Experiences," AES 147th Convention, pp. 1–11, 2019.

[3] BBC Research and Development, "Immersive and Interactive Content - BBC R&D," 2019.

[4] Facebook, "Spatial Workstation," 2019-12-17.

[5] Google, "Google VR," 2019.

[6] Sonnenschein, D., Sound Design: The Expressive Power of Music, Voice, and Sound Effects in Cinema, Michael Wiese Productions, 2001.

[7] Salselas, I. and Penha, R., "The role of sound in inducing storytelling in immersive environments," in Proceedings of the 14th International Audio Mostly Conference: A Journey in Sound - AM'19, pp. 191–198, ACM Press, New York, NY, USA, 2019.

[8] Chueng, P. and Marsden, P., "Designing auditory spaces: the role of expectation," Proceedings of the 10th International Conference on Human Computer Interaction, pp. 616–620, 2003.


[9] S3A Project Team, "VISR Production Suite," 2019-12-17.

[10] Stitt, Peter, "SSA Plug-ins," 2019-12-17.

[11] Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., and Freeman, W. T., "Visually Indicated Sounds," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2405–2413, IEEE, 2016.

[12] Kajihara, Y., Dozono, S., and Tokui, N., "Imaginary Soundscape: Cross-Modal Approach to Generate Pseudo Sound Environments," Workshop on Machine Learning for Creativity and Design (NIPS 2017), pp. 1–3, 2017.

[13] Aytar, Y., Vondrick, C., and Torralba, A., "SoundNet: Learning Sound Representations from Unlabeled Video," NIPS, 2016.

[14] Fiebrink, R. A., "Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance," 2011.

[15] Miranda, E., Sound Design: An Artificial Intelligence Approach, PhD thesis, University of Edinburgh, 1994.

[16] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K., "Speed/accuracy trade-offs for modern convolutional object detectors," Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 3296–3305, 2017.

[17] Vladimirov, L., "Detect Objects Using Your Webcam," n/a.

[18] Zhu, M. and Liu, M., "Mobile Video Object Detection with Temporally-Aware Feature Maps," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 5686–5695, 2018.

[19] Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., and Ouyang, W., "T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos," IEEE Transactions on Circuits and Systems for Video Technology, 28(10), pp. 2896–2907, 2018.

[20] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S., "Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression," 2019.

[21] BBC, "BBC Sound Effects Archive Resource • Research & Education Space," 2019.

[22] Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., and Plumbley, M. D., "Detection and Classification of Acoustic Scenes and Events," IEEE Transactions on Multimedia, 17(10), pp. 1733–1746, 2015.

[23] Cockos, "REAPER | Audio Production Without Limits," 2019.

[24] Wright, M., "Open Sound Control 1.0 Specification," 2002.

[25] Melamud, O., Levy, O., and Dagan, I., "A Simple Word Embedding Model for Lexical Substitution," in Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 1–7, Association for Computational Linguistics, Stroudsburg, PA, USA, 2015, doi:10.3115/v1/W15-1501.

[26] Stenzel, H., Francombe, J., and Jackson, P. J., "Limits of perceived audio-visual spatial coherence as defined by reaction time measurements," Frontiers in Neuroscience, 13, pp. 1–17, 2019, doi:10.3389/fnins.2019.00451.

[27] Saho, K., "Kalman Filter for Moving Object Tracking: Performance Analysis and Filter Design," in Kalman Filters - Theory for Advanced Applications, p. 13, 2017.
