Content Matching For Sound Generating Objects Within A Visual Scene Using A Computer Vision Approach
ABSTRACT
The increase in and demand for immersive audio content production and consumption, particularly in VR, is driving
the need for tools to facilitate creation. Immersive productions place additional demands on sound design teams,
specifically around the increased complexity of scenes, increased number of sound producing objects, and the need
to spatialise sound in 360◦ . This paper presents an initial feasibility study for a methodology utilising visual object
detection in order to detect, track, and match content for sound generating objects, in this case based on a simple
2D visual scene. Results show that, while successful for a single moving object, there are limitations within the
current computer vision system used which cause complications for scenes with multiple objects. Results also
show that the recommendation of candidate sound effect files is heavily dependent on the accuracy of the visual
object detection system and the labelling of the audio repository used.
by Agrawal due to its exclusion of the individual and their experience, which, being highly subjective, is seen as being of paramount importance.

Immersive content can, therefore, be described as content which is designed to elicit a state of immersion in line with the given definition. While not a prerequisite, many modern day immersive experiences use technologies such as 360° audio and video to provide the subjective sense of being surrounded and, in the case of audiovisual content, provide an experience of multi-sensory stimulation. The production of such content has become increasingly popular in recent years, with companies such as BBC [3], Facebook [4], and Google [5] releasing tools and content for such experiences.

In this paper we propose a methodology, based on an initial feasibility study, that aims to streamline immersive sound design workflows through the application of a computer vision approach that facilitates the detection, spatial tracking, and content matching of appropriate audio assets to sound generating objects within a visual scene. The current study is undertaken using a simple 2D scene as a proof of concept, with a view to expanding to 360° spatial audio/video in future work.

2 Background

2.1 Sound Design for Immersive Content

Sound design is a well established area of practice and is traditionally associated with the design and production of sound for the purposes of film, TV, radio programmes, and video games. Such sounds often fall broadly into the categories of sound effects, dialogue, and music [6]. In the context of this paper we are specifically interested in sound design practice for new immersive film, TV, and radio content.

The aesthetics and practice around these content forms are still new or as yet unknown, but it is clear that workflows are changing from established practices such that sound designers will need to adapt. At present, however, little formal research has been conducted on the subject. Immersive audio production often places additional demands on sound design teams, with little or no additional time allocated to complete the task. By their very nature, 360° scenes are often more complex, contain many more sound cues or sound generating objects, and require a greater level of detail in order to be plausible and 'fill' the additional visual and hence auditory space. Users have a greater level of interactivity with the scene, allowing them to control how they view and focus on individual aspects within it, and often the audience is no longer limited by the framing of a single shot.

Due to this extended visual field, sound spatialisation plays an important role in creating immersive content by facilitating the positioning of audio sources around the complete 360° space. The aim is to create a state of immersion by increasing the sense of realism and the ability to interact with the virtual environment [7]. It also allows for increased engagement (or, more specifically, the lack of disruption caused to a user's engagement) by ensuring consistency within the environment such that visual and auditory information is considered spatially congruent [7]. Furthermore, sound spatialisation can also take advantage of listener expectation to reinforce the sense of immersion [8] with respect to sound location (such as aircraft emanating from above the listener).

Spatialisation of sound can, however, be a time consuming task and one which has seen the development of a multitude of tools for use within Digital Audio Workstations (DAWs) [9, 10, 4]. Automating aspects of the process could allow sound designers to have a greater focus on those tasks requiring creative decision making, as opposed to those where the decision making process is simple but the task itself is more procedural, iterative, or labour intensive.

A common component of many immersive experiences (though admittedly not all) is the presence of visual information, normally in the form of a video. Using this readily available data correctly could provide a wealth of information about the audio scene being constructed, and this could then be utilised to inform context-aware automated or semi-automated audio outcomes. Computer vision techniques could lend themselves to this application of machine-assisted sound design, especially within the context of automatic content matching and object tracking.

2.2 Visually driven sound design

Computer vision is an established area of machine learning that focuses on making sense of the information contained within digital images and videos. Two of the most common tasks involved in many computer
vision applications are object detection (localisation of objects within a given image) and object classification (estimating which of a given set of classes the object is most likely to belong to). Algorithms for such tasks will often have to deal with evaluating multiple objects within a given image.

Within the field of sound design there are some examples of how visual features can be matched to audio files in a database or used to synthesise sounds from this visual information [11, 12].

Owens et al. [11] trained a recurrent neural network (RNN) to map visual features to audio features, which were then transformed into a waveform by either matching them to already existing audio files in a database or by, what the authors describe as, parametrically inverting the features. The sounds synthesised were of people hitting and scratching different surfaces and objects with a drum stick. While this is still somewhat distant from complex soundscape creation, it outlines a general approach that could be used in order to produce other plausible sound objects. Performance of the model was measured via a psychophysical study using a two-alternative forced choice test in which participants were required to distinguish real and machine-generated sounds. Results were mixed, with parametric generation performing well for materials which were considered more noisy, for example leaves and dirt, but performing poorly for harder surfaces such as metal and wood. It was also found that matching the mapped audio features to existing audio files was ineffective for textured sounds such as splashing water.

The online sound installation Imaginary Soundscapes [12] creates 'pseudo' soundscapes by extracting Convolutional Neural Network (CNN) features from an image and matching these with corresponding features for sound files from a database of environmental sounds. These features were extracted using a CNN architecture based on SoundNet [13].

While the authors of [12] document no formal testing, it appears that the system is capable of extracting relevant sound files based on the location of specific objects within the scene. However, at times the choices made based on the scene's visual content can be slightly ill-suited, such as audio which appears to be a recording of a train station (including crowd noise and announcements over a loudspeaker system) matched to an image of a church interior. If the aim of the system is to produce more accurate/plausible soundscapes, this could be accomplished by adding a human user in the loop to rate the suitability of the audio content presented for the given scene. Alternatively, the system could also provide options for the human user to choose between, which would be more akin to the functionality of a production tool.

A user's ability to interact with the installation in [12] is very limited. They are able to provide an image or select a location but have no control over the resulting soundscape. The aim of this work is to use computer vision as a collaborative agent, much in the same way other machine learning techniques have been utilised to act as collaborators for music composition [14] and sound synthesis [15].

3 Methods

3.1 Google's object detection API

It should be noted that the scope of this feasibility study has been limited to the use of existing visual recognition tools, with the method adopted in this paper being built around Google's object detection API [16]. The API is written in TensorFlow and is used to detect, locate, and classify content from a simple 2D video frame image. The tutorial for the API is easily adapted to run detection on the frames of a video, following the instructions detailed in [17]. The way this is applied is described below and illustrated in Fig. 1, which shows the flow of information and the different components that make up the wider system.

The model used for this paper is the Single Shot Detection (SSD) meta-architecture with the Inception V2 feature extractor, chosen because it gave a good compromise between speed, accuracy, and memory usage. The model was run on an Intel Core i5-600 CPU @ 3.20GHz, with 8GB RAM, and an Intel HD 530 integrated graphics processor. It should be noted that the specifications of the computing platform being used will greatly influence the time it takes to run the detection. Using the aforementioned specification it took approximately 75 seconds to run the process of detection and information extraction on a 7.97s video filmed at 30 frames per second (fps). This roughly equates to 0.32s per frame. This can be reduced to approximately 65s (0.27s per frame) if the visualisation of the detection output is bypassed.
The object detection system provides a variety of data related to each frame, including the number of detections, the locations of the detection bounding boxes, the predicted object classes, and the associated confidence scores.
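To make this frame-by-frame process concrete, the sketch below shows one way detection could be run over a video with OpenCV and a TensorFlow SavedModel export of an SSD detector, collecting the per-frame boxes, classes, and confidence scores described above. The model path, video filename, and 0.5 score threshold are assumptions for illustration, and the output keys follow the TensorFlow Object Detection API's standard naming; this is not the study's actual code.

```python
# Minimal sketch: run an object detector over video frames (assumed paths and threshold).
import cv2
import tensorflow as tf

detect_fn = tf.saved_model.load("ssd_inception_v2/saved_model")  # assumed export path
video = cv2.VideoCapture("video1.mp4")                           # assumed filename
fps = video.get(cv2.CAP_PROP_FPS)

frame_index = 0
per_frame_detections = []
while True:
    ok, frame = video.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = tf.convert_to_tensor(rgb)[tf.newaxis, ...]    # shape [1, H, W, 3]
    output = detect_fn(input_tensor)

    # Keep only detections above an illustrative confidence threshold of 0.5.
    scores = output["detection_scores"][0].numpy()
    keep = scores >= 0.5
    per_frame_detections.append({
        "time_s": frame_index / fps,
        "boxes": output["detection_boxes"][0].numpy()[keep],     # normalised [ymin, xmin, ymax, xmax]
        "classes": output["detection_classes"][0].numpy()[keep].astype(int),
        "scores": scores[keep],
    })
    frame_index += 1

video.release()
```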
Ground truth boxes (boxes for the testing set specifying the location of the objects) and prediction boxes (boxes generated by the object detection system indicating where it predicts the objects are located) are compared using the Intersection over Union (IoU) metric [20]. Accuracy is deemed sufficient if the IoU value exceeds a user specified threshold (usually 0.5), with values ranging between 0 and 1. This metric is appropriate as it is expected that an object in the current frame will be in a similar location to its position in the previous frame. If the IoU value is above the set threshold the object is defined as being the same as that identified in the previous frame; otherwise it is treated as a new object and is added to the object dictionary.

It should be noted this method has limitations, which are discussed in Sec. 4.2.2.
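To make the continuity check concrete, the sketch below computes IoU for two axis-aligned boxes and applies the 0.5 threshold described above. The box format ([ymin, xmin, ymax, xmax]) and the helper names are illustrative assumptions rather than the study's actual code.

```python
# Illustrative IoU-based continuity check (assumed box format: [ymin, xmin, ymax, xmax]).
def iou(box_a, box_b):
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])
    intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def continuity_check(previous_box, current_box, threshold=0.5):
    """Return True if the detection is treated as the same object as in the previous frame."""
    return iou(previous_box, current_box) >= threshold

# Example: a slightly shifted box in the next frame is accepted as the same object.
same_object = continuity_check([0.20, 0.30, 0.60, 0.50], [0.22, 0.31, 0.62, 0.52])
```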
Candidate audio file recommendations are drawn from the BBC sound effects archive [21]. In this instance each unique object class detected is compared to the metadata tags from the BBC's sound effect archive [21], which is an open source repository made up of 16,011 labelled audio files. The archive is available to download as WAV files and is subject to terms of use under the RemArc Licence, which permits use for personal, educational, and research purposes. It was chosen because it provides a large database of labelled audio files containing a variety of different acoustic scenes and events, with tagging and metadata stored in an associated .CSV file. Table 1 shows examples of the tagging and metadata format common to each audio file in the database. Tagging consists of the description of each sound effect (as taken from the original CD) and the category (e.g. Engines: Petrol, Engines: Diesel) to which it belongs. Metadata stored includes the length of the audio file in seconds, the name of the original CD containing the effect, and the track number. There are some inconsistencies within the tagging conventions, with not all audio files having an associated category and/or CD origin name. Any inconsistencies within a database's tagging convention may impact its effectiveness when used as data for training and evaluating machine listening systems [22].
Table 1: Examples of the metadata format associated with the BBC's sound effect archive. Available metadata fields consist of a description, duration in seconds, category, CD number, CD name, and track number. As shown, there is inconsistency within the archive, as not all audio files contain information in the category and CD name fields.
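As a simple illustration of how candidate files could be retrieved from such a .CSV file, the sketch below searches the description field for a case-insensitive match of a detected class label, in the spirit of the exact matching described for Table 2. The filename 'bbc_sfx_metadata.csv' and the column names 'description' and 'category' are assumptions about the export format, not the study's implementation.

```python
# Illustrative candidate lookup against the archive's metadata CSV (assumed filename/columns).
import pandas as pd

metadata = pd.read_csv("bbc_sfx_metadata.csv")  # assumed export of the archive metadata

def candidate_sound_effects(class_label, table=metadata):
    """Return rows whose free-text description contains the detected class label."""
    mask = table["description"].str.contains(class_label, case=False, na=False)
    return table.loc[mask, ["description", "category"]]

candidates = candidate_sound_effects("person")  # e.g. the class label detected in Fig. 4
print(len(candidates), "candidate files found")
```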
4 Results
Fig. 4: Image from a single video frame extracted from example Video 1 and used as input for the object detection system to generate candidate audio file recommendations. The detected object's location is indicated by the green bounding box and is assigned the class label of 'person'.

Fig. 5: Image from a single video frame of Video 2, used to derive panning information for two moving objects within a 2D visual scene. The example video is of two people crossing the field of view from left to right approximately 1.5m apart.
The current search cannot recognise synonyms, a capability often associated with lexical substitution tasks [25] that require systems to predict alternatives for a target word while maintaining its meaning within a sentence's context. Within our use case this can be used in order to suggest candidate audio files whose tags may not exactly match the detected class but are deemed, by the system, to have similar meaning.
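One way such near-matches could be scored, loosely following the word-embedding approach to lexical substitution in [25], is to rank archive tags by cosine similarity to the detected class label. The sketch below assumes pre-trained word vectors loaded via gensim (the model name shown is an assumption) and is illustrative only; it is not part of the system described here.

```python
# Illustrative synonym-aware tag ranking using pre-trained word embeddings (assumed model).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # assumed choice of pre-trained embeddings

def rank_tags_by_similarity(class_label, tags, top_n=5):
    """Rank candidate tags by cosine similarity to the detected class label."""
    scored = []
    for tag in tags:
        if tag in vectors and class_label in vectors:
            scored.append((tag, float(vectors.similarity(class_label, tag))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# e.g. 'footsteps' or 'pedestrian' could surface for the detected class 'person'.
print(rank_tags_by_similarity("person", ["footsteps", "pedestrian", "engine", "dog"]))
```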
Limitations also exist relating to the type of detection system used. Google's API is for object detection and is limited to detecting specific objects. It therefore does not allow for the prediction of activities taking place within the scene, such as the walking in Videos 1 and 2. As such, the system did not retrieve the 1,484 sound effects containing the term 'footsteps', which may have been suitable as candidate sound effects. It also lacks the functionality of scene recognition systems to predict more generic scene elements such as location, e.g. living room, beach, city centre, which may help to inform recommendations for audio files relating to environmental/atmosphere sounds.

4.2 Spatial positioning and trajectory tracking

4.2.1 Single Object

Fig. 3 shows a single frame taken from Video 1 where the trajectory of the detected object has been plotted. The trajectory appears to accurately track the object travelling across the field of view and takes into account the variations in the centre point of the bounding box that occur when walking, as well as variation in the object's speed, in this case indicated by the non-uniform distribution in spatial proximity of the data points. Fig. 6a shows horizontal panning data plotted over time in frames, derived from the object's positional data. This highlights the precision at which the system takes into account the object's variation in speed as it crosses the field of view, reflected in the gradient of the plotted data. It should be noted that it is the distance moved by the centre of the object's associated bounding box between each frame that is being tracked, rather than the object itself. For objects whose movement causes bounding boxes of varying sizes (such as a human walking with their arms swinging) this may produce variable results. Once the object exits the field of view the panning value defaults to 0, which may present problems for objects whose audio needs to remain active for a set time after no longer being visible. However, this issue is a feature of the current 2D-only implementation. The field of view in 360° audio/visual content is often dictated by the direction a user is facing, allowing objects outside the field of view to still be tracked as the video content extends beyond the field of view. There will, however, be additional, alternative limitations and situations that will need to be considered when using 360° audio/visual content.

Fig. 6b shows the horizontal trajectory data translated into panning information within a Reaper stereo track automation lane.
Table 2: Selection of candidate audio file recommendations generated from Fig. 4. Each file was defined by the system as being a potential candidate if the metadata field 'description' contained an exact match for the detected object's class name, in this case 'person'.
Upon visual inspection, the reduction in data does not seem to have had an adverse effect on trajectory trends. The linear interpolation generated by Reaper has little impact on the overall trend due to the size of the time steps, but may have a perceptual impact for larger timesteps. The timestep is dependent on the video's fps and is the length of time between each discrete data point of panning data (in this case the timestep is 0.0667s). A reduction in fps results in an increased timestep duration, which may introduce greater spatial mismatches between the visual object and the associated auditory material. Results from previous literature vary greatly with respect to the angular offset required in order to create a perceptually noticeable misalignment. Using reaction time (RT) measurements as an indirect method of measurement, a significant difference was found in [26] from 5° to 10° onwards for the Simon effect (the observation that responses in two-alternative forced-choice tests, where space is a parameter not relevant to the task, are faster if the stimulus presentation and response side match; responses are slower if the stimulus is presented in the visual hemisphere opposite to the response side), and it was concluded that, for speech signals, even small audio-visual offsets can subconsciously influence the spatial integration of sources. Future work will ascertain the minimum resolution of panning data required to maintain congruence between the object's visual position and the position of the associated audio content.
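The conversion from tracked positions to automation data can be summarised as follows: the horizontal centre of each bounding box is normalised to a 0-1 pan value, each frame is stamped with a time of frame_index / fps, and every second point is kept to match the automation-lane resolution noted for Fig. 6b. The sketch below is a minimal illustration under those assumptions; the function and variable names are not the study's code.

```python
# Illustrative conversion of per-frame bounding-box centres into pan automation points.
def pan_automation(centre_x_per_frame, fps=30.0, keep_every=2):
    """centre_x_per_frame: normalised horizontal box centres (0 = left edge, 1 = right edge)."""
    timestep = keep_every / fps            # e.g. 2 / 30 fps = 0.0667 s, as stated in the text
    points = []
    for i, centre_x in enumerate(centre_x_per_frame):
        if i % keep_every:                 # keep every second data point
            continue
        points.append((i / fps, centre_x)) # (time in seconds, pan value between 0 and 1)
    return timestep, points

# Example: an object drifting from left of centre towards the right edge over five frames.
step, points = pan_automation([0.40, 0.45, 0.52, 0.60, 0.70])
```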
4.2.2 Multiple Objects

The object detection system outputs data for the detected objects according to the associated confidence scores, beginning with the highest score. This results in the data for specific objects being output in a different order for each frame, depending on how the confidence scores change throughout the video. This then affects how the data interacts with the object dictionary compiler. The system compiles the object dictionary, and therefore how the trajectory data is grouped, according to the results of the continuity check. This relies on the detected objects' data being output in the same order each time. When presented with the objects in a different order, the check uses positional data from a different object, and if the distance between the objects is great enough (which it usually will be), the continuity check fails and the objects in the current frame are defined as new. This causes the panning data from what should be a single object to be spread across multiple entries within the object dictionary. The change in confidence scores over the length of the video resulted in a total of 32 objects being added to the dictionary. Due to the object detector in this implementation being based on a single-image detector, it is not straightforward to override the ordering method in order to create a more consistent output order on a frame-by-frame basis. This presents a problem for projects that require not only accurate location and classification of objects, but also the ability to track them through frames and for them to be recognised as pre-existing or new objects. It may be possible to address this problem using an alternative object detection system.
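One way the order dependence could be removed, sketched below, is to associate each new detection with the existing dictionary entry that gives the highest IoU, and only create a new entry when no overlap exceeds the continuity threshold. This greedy association (reusing the iou helper sketched earlier) is an assumption about a possible fix, not something evaluated in this study.

```python
# Illustrative order-independent association of detections to tracked objects by best IoU.
def associate_detections(object_dictionary, detections, threshold=0.5):
    """object_dictionary: {object_id: last_box}; detections: list of boxes for this frame."""
    next_id = max(object_dictionary, default=-1) + 1
    assignments = {}
    unmatched_ids = set(object_dictionary)
    for box in detections:
        # Pick the existing object with the highest IoU against this detection.
        best_id, best_iou = None, 0.0
        for object_id in unmatched_ids:
            overlap = iou(object_dictionary[object_id], box)  # iou() as sketched earlier
            if overlap > best_iou:
                best_id, best_iou = object_id, overlap
        if best_id is not None and best_iou >= threshold:
            assignments[best_id] = box
            unmatched_ids.discard(best_id)
        else:
            assignments[next_id] = box                        # treat as a new object
            next_id += 1
    object_dictionary.update(assignments)
    return assignments
```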
(a) Original data output from the system. Note that the y-axis has been flipped to match Reaper's and the data has been normalised to between 0 and 1 to match the values used by Reaper.

(b) Stereo panning data derived by using every second data point to account for the resolution available in Reaper's automation lanes.

Fig. 6: Horizontal panning data plotted over time as derived from example Video 1.

5 Summary

This feasibility study has shown that candidate sound effect recommendations can be generated for detected objects using the tags of a labelled audio repository and the visual content of a simple 2D scene as taken from a video. Suggested positional data for dynamic audio content can be attached to a single visual object within a scene. However, at present, the system is unable to support multiple objects due to limitations of the object detection system being used. Erroneous results are also produced from the candidate sound effects search, dependent on the accuracy or interpretation of the labels attached to the database of audio files.

6 Future Work

Future work will look into current methodologies for tracking objects throughout a scene while ensuring unique identification is maintained, such as the use of Kalman Filters [27]. It would also be of interest to investigate the numerical relationship between an object's position within a 2D visual scene and the panning value a

7 Acknowledgements

This project is supported by an EPSRC iCASE PhD Studentship in partnership with BBC R&D.

References

[1] Francombe, J., Brookes, T., and Mason, R., "Evaluation of spatial audio reproduction methods (Part 1): Elicitation of perceptual differences," Journal of the Audio Engineering Society, 65(3), pp. 198–211, 2017.

[2] Agrawal, S., Simon, A., Bech, S., Bærenstein, K., and Forchhammer, S., "Defining Immersion: Literature Review and Implications for Research on Immersive Audiovisual Experiences," AES 147th Convention, pp. 1–11, 2019.

[3] BBC Research and Development, "Immersive and Interactive Content - BBC R&D," 2019.

[4] Facebook, "Spatial Workstation," 2019-12-17.

[5] Google, "Google VR," 2019.

[6] Sonnenschein, D., Sound Design: The Expressive Power of Music, Voice, and Sound Effects in Cinema, Michael Wiese Productions, 2001.

[7] Salselas, I. and Penha, R., "The role of sound in inducing storytelling in immersive environments," in Proceedings of the 14th International Audio Mostly Conference: A Journey in Sound (AM'19), pp. 191–198, ACM Press, New York, NY, USA, 2019.

[8] Chueng, P. and Marsden, P., "Designing auditory spaces: the role of expectation," in Proceedings of the 10th International Conference on Human Computer Interaction, pp. 616–620, 2003.
[9] S3A Project Team, "VISR Production Suite," 2019-12-17.

[10] Stitt, P., "SSA Plug-ins," 2019-12-17.

[11] Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., and Freeman, W. T., "Visually Indicated Sounds," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2405–2413, IEEE, 2016.

[12] Kajihara, Y., Dozono, S., and Tokui, N., "Imaginary Soundscape: Cross-Modal Approach to Generate Pseudo Sound Environments," Workshop on Machine Learning for Creativity and Design (NIPS 2017), pp. 1–3, 2017.

[13] Aytar, Y., Vondrick, C., and Torralba, A., "SoundNet: Learning Sound Representations from Unlabeled Video," in Advances in Neural Information Processing Systems (NIPS), 2016.

[14] Fiebrink, R. A., "Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance," PhD thesis, Princeton University, 2011.

[15] Miranda, E., Sound Design: An Artificial Intelligence Approach, PhD thesis, University of Edinburgh, 1994.

[16] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K., "Speed/accuracy trade-offs for modern convolutional object detectors," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 3296–3305, 2017.

[17] Vladimirov, L., "Detect Objects Using Your Webcam," n/a.

[18] Zhu, M. and Liu, M., "Mobile Video Object Detection with Temporally-Aware Feature Maps," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695, 2018.

[19] Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., and Ouyang, W., "T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos," IEEE Transactions on Circuits and Systems for Video Technology, 28(10), pp. 2896–2907, 2018.

[20] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S., "Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression," 2019.

[21] BBC, "BBC Sound Effects Archive Resource," Research & Education Space, 2019.

[22] Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., and Plumbley, M. D., "Detection and Classification of Acoustic Scenes and Events," IEEE Transactions on Multimedia, 17(10), pp. 1733–1746, 2015.

[23] Cockos, "REAPER | Audio Production Without Limits," 2019.

[24] Wright, M., "Open Sound Control 1.0 Specification," 2002.

[25] Melamud, O., Levy, O., and Dagan, I., "A Simple Word Embedding Model for Lexical Substitution," in Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 1–7, Association for Computational Linguistics, Stroudsburg, PA, USA, 2015, doi:10.3115/v1/W15-1501.

[26] Stenzel, H., Francombe, J., and Jackson, P. J., "Limits of perceived audio-visual spatial coherence as defined by reaction time measurements," Frontiers in Neuroscience, 13, pp. 1–17, 2019, doi:10.3389/fnins.2019.00451.

[27] Saho, K., "Kalman Filter for Moving Object Tracking: Performance Analysis and Filter Design," in Kalman Filters - Theory for Advanced Applications, p. 13, 2017.