
Visually Indicated Sounds

Andrew Owens (MIT)    Phillip Isola (U.C. Berkeley, MIT)    Josh McDermott (MIT)
Antonio Torralba (MIT)    Edward H. Adelson (MIT)    William T. Freeman (MIT, Google Research)

arXiv:1512.08512v2 [cs.CV] 30 Apr 2016
[Figure 1 panels: frames from two input videos (top) and the corresponding predicted sound tracks (bottom); each audio panel plots amplitude^(1/2) against time over seven seconds.]

Figure 1: We train a model to synthesize plausible impact sounds from silent videos, a task that requires implicit knowledge of material
properties and physical interactions. In each video, someone probes the scene with a drumstick, hitting and scratching different objects.
We show frames from two videos and below them the predicted audio tracks. The locations of these sampled frames are indicated by the
dotted lines on the audio track. The predicted audio tracks show seven seconds of sound, corresponding to multiple hits in the videos.

Abstract

Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.

1. Introduction

From the clink of a ceramic mug placed onto a saucer, to the squelch of a shoe pressed into mud, our days are filled with visual experiences accompanied by predictable sounds. On many occasions, these sounds are not just statistically associated with the content of the images – the way, for example, that the sounds of unseen seagulls are associated with a view of a beach – but instead are directly caused by the physical interaction being depicted: you see what is making the sound.

We call these events visually indicated sounds, and we propose the task of predicting sound from videos as a way to study physical interactions within a visual scene (Figure 1). To accurately predict a video's held-out soundtrack, an algorithm has to know about the physical properties of what it is seeing and the actions that are being performed. This task implicitly requires material recognition, but unlike traditional work on this problem [4, 38], we never explicitly tell the algorithm about materials. Instead, it learns about them by identifying statistical regularities in the raw audio-visual signal.

We take inspiration from the way infants explore the physical properties of a scene by poking and prodding at the objects in front of them [36, 3], a process that may help them learn an intuitive theory of physics [3]. Recent work suggests that the sounds objects make in response to these interactions may play a role in this process [39, 43].

We introduce a dataset that mimics this exploration process, containing hundreds of videos of people hitting, scratching, and prodding objects with a drumstick. To synthesize sound from these videos, we present an algorithm that uses a recurrent neural network to map videos to audio features. It then converts these audio features to a waveform, either by matching them to exemplars in a database and transferring their corresponding sounds, or by parametrically inverting the features. We evaluate the quality of our predicted sounds using a psychophysical study, and we also analyze what our method learned about actions and materials through the task of learning to predict sound.
[Figure 2 panels: example frames for the material categories Carpet, Ceramic, Cloth, Dirt, Glass, Grass, Gravel, Leaf, Metal, Paper, Plastic, Plastic bag, Rock, Water, and Wood; panels illustrating scattering, deformation, and splash reactions; and histograms of the action labels (hit, scratch, other) and reaction labels.]

Figure 2: Greatest Hits: Volume 1 dataset. What do these materials sound like when they are struck? We collected 977 videos in which
people explore a scene by hitting and scratching materials with a drumstick, comprising 46,577 total actions. Human annotators labeled
the actions with material category labels, the location of impact, an action type label (hit vs. scratch), and a reaction label (shown on right).
These labels were used only for analyzing what our sound prediction model learned, not for training it. We show images from a selection
of videos from our dataset for a subset of the material categories (here we show examples where it is easy to see the material in question).

2. Related work

Our work closely relates to research in sound and material perception, and to representation learning.

Foley  The idea of adding sound effects to silent movies goes back at least to the 1920s, when Jack Foley and collaborators discovered that they could create convincing sound effects by crumpling paper, snapping lettuce, and shaking cellophane in their studio¹, a method now known as Foley. Our algorithm performs a kind of automatic Foley, synthesizing plausible sound effects without a human in the loop.

Sound and materials  In the classic mathematical work of [26], Kac showed that the shape of a drum could be partially recovered from the sound it makes. Material properties, such as stiffness and density [37, 31, 14], can likewise be determined from impact sounds. Recent work has used these principles to estimate material properties by measuring tiny vibrations in rods and cloth [8], and similar methods have been used to recover sound from high-speed video of a vibrating membrane [9]. Rather than using a camera as an instrument for measuring vibrations, we infer a plausible sound for an action by recognizing what kind of sound this action would normally make in the visually observed scene.

Impact sounds have been used in other work to recognize objects and materials. Arnab et al. [2] recently presented a semantic segmentation model that incorporates audio from impact sounds, and showed that audio information could help recognize objects and materials that were ambiguous from visual cues alone. Other work recognizes objects using audio produced by robotic interaction [41, 29].

Sound synthesis  Our technical approach resembles speech synthesis methods that use neural networks to predict sound features from pre-tokenized text features and then generate a waveform from those features [30]. There are also methods, such as the FoleyAutomatic system, for synthesizing impact sounds from physical simulations [45]. Work in psychology has studied low-dimensional representations for impact sounds [7], and recent work in neuroimaging has shown that silent videos of impact events activate the auditory cortex [19].

Learning visual representations from natural signals  Previous work has explored the idea of learning visual representations by predicting one aspect of a raw sensory signal from another. For example, [11, 22] learned image representations by predicting the spatial relationship between image patches, and [1, 23] by predicting the egocentric motion between video frames. Several methods have also used temporal proximity as a supervisory signal [33, 17, 47, 46]. Unlike in these approaches, we learn to predict one sensory modality (sound) from another (vision). There has also been work that trains neural networks from multiple modalities. For example, [34] learned a joint model of audio and video. However, while they study speech using an autoencoder, we focus on material interaction, and we use a recurrent neural network to predict sound features from video.

A central goal of other methods has been to use a proxy signal (e.g., temporal proximity) to learn a generically useful representation of the world. In our case, we predict a signal – sound – known to be a useful representation for many tasks [14, 37], and we show that the output (i.e., the predicted sound itself, rather than some internal representation in the model) is predictive of material and action classes.

¹To our delight, Foley artists really do knock two coconuts together to fake the sound of horses galloping [6].
3. The Greatest Hits dataset

In order to study visually indicated sounds, we collected a dataset containing videos of humans (the authors) probing environments with a drumstick – hitting, scratching, and poking different objects in the scene (Figure 2). We chose to use a drumstick so that we would have a consistent way of generating the sounds. Moreover, since the drumstick does not occlude much of a scene, we can also observe what happens to the object after it is struck. This motion, which we call a reaction, can be important for inferring material properties – a soft cushion, for example, will deform more than a firm one, and the sound it produces will vary with it. Similarly, individual pieces of gravel will scatter when they are hit, and their sound varies with this motion (Figure 2, right).

Unlike traditional object- or scene-centric datasets, such as ImageNet [10] or Places [48], where the focus of the image is a full scene, our dataset contains close-up views of a small number of objects. These images reflect the viewpoint of an observer who is focused on the interaction taking place (similar to an egocentric viewpoint). They contain enough detail to see fine-grained texture and the reaction that occurs after the interaction. In some cases, only part of an object is visible, and neither its identity nor other high-level aspects of the scene are easily discernible. Our dataset is also related to robotic manipulation datasets [41, 35, 15]. While one advantage of using a robot is that its actions are highly consistent, having a human collect the data allows us to rapidly (and inexpensively) capture a large number of physical interactions in real-world scenes.

We captured 977 videos from indoor (64%) and outdoor scenes (36%). The outdoor scenes often contain materials that scatter and deform, such as grass and leaves, while the indoor scenes contain a variety of hard and soft materials, such as metal, plastic, cloth, and plastic bags. Each video, on average, contains 48 actions (approximately 69% hits and 31% scratches) and lasts 35 seconds. We recorded sound using a shotgun microphone attached to the top of the camera and used a wind cover for outdoor scenes. We used a separate audio recorder, without auto-gain, and we applied a denoising algorithm [20] to each recording.

Detecting impact onsets  We detected amplitude peaks in the denoised audio, which largely correspond to the onset of impact sounds. We thresholded the amplitude gradient to find an initial set of peaks, then merged nearby peaks with the mean-shift algorithm [13], treating the amplitude as a density and finding the nearest mode for each peak. Finally, we used non-maximal suppression to ensure that onsets were at least 0.25 seconds apart. This is a simple onset-detection method that most often corresponds to drumstick impacts when the impacts are short and contain a single peak². In many of our experiments, we use short video clips that are centered on these amplitude peaks.

²Scratches and hits usually satisfy this assumption, but splash sounds often do not – a problem that could be addressed with more sophisticated onset-detection methods [5].
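The onset detector described above is simple enough to sketch. The following is a minimal illustration of the same idea, assuming a mono waveform array: threshold the amplitude gradient, nudge candidates toward nearby amplitude-weighted modes in the spirit of mean shift, and apply non-maximal suppression with a 0.25-second spacing. The threshold and window values here are illustrative, not the ones used for the dataset.

```python
import numpy as np

def detect_onsets(wave, sr, grad_thresh=0.05, min_gap=0.25):
    """Rough impact-onset detector in the spirit of Section 3.
    Returns onset times in seconds (thresholds are illustrative)."""
    env = np.abs(wave)                          # amplitude envelope
    grad = np.diff(env, prepend=env[0])         # amplitude gradient
    cand = np.where(grad > grad_thresh)[0]      # threshold the gradient

    # Mean-shift-style merging: move each candidate toward the nearby
    # amplitude-weighted mode, then collapse duplicates.
    win = int(0.05 * sr)
    merged = []
    for i in cand:
        for _ in range(10):                     # a few mean-shift iterations
            lo, hi = max(0, i - win), min(len(env), i + win)
            idx = np.arange(lo, hi)
            i_new = int(round(np.sum(idx * env[lo:hi]) / (np.sum(env[lo:hi]) + 1e-8)))
            if i_new == i:
                break
            i = i_new
        merged.append(i)

    # Non-maximal suppression: keep the strongest peak in each 0.25 s window.
    merged = sorted(set(merged), key=lambda i: -env[i])
    kept = []
    for i in merged:
        if all(abs(i - j) > min_gap * sr for j in kept):
            kept.append(i)
    return np.sort(np.array(kept)) / sr
```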
Semantic annotations  We also collected annotations for a sample of impacts (approximately 62%) using online workers from Amazon Mechanical Turk. These include material labels, action labels (hit vs. scratch), reaction labels, and the pixel location of each impact site. To reduce the number of erroneous labels, we manually removed annotations for material categories that we could not find in the scene. During material labeling, workers chose from finer-grained categories. We then merged similar, frequently confused categories (please see Section A2 for details). Note that these annotations are used only for analysis: we train our models on raw audio and video. Examples of several material and action classes are shown in Figure 2.

Figure 3: (a) Cochleagrams for selected classes (Cloth, Rock, Dirt, Wood, Scattering, and Deformation). We extracted audio centered on each impact sound in the dataset, computed our subband-envelope representation, and then estimated the mean for each class. (b) Confusion matrix derived by classifying sound features. Rows correspond to confusions made for a single category. The row ordering was determined automatically, by similarity in material confusions (see Section A1.2).

4. Sound representation

Following work in sound synthesis [42, 32], we compute our sound features by decomposing the waveform into subband envelopes – a simple representation obtained by filtering the waveform and applying a nonlinearity. We apply a bank of 40 band-pass filters spaced on an equivalent rectangular bandwidth (ERB) scale [16] (plus a low- and high-pass filter) and take the Hilbert envelope of the responses. We then downsample these envelopes to 90 Hz (approximately 3 samples per frame) and compress them. More specifically, we compute envelope s_n(t) from a waveform w(t) and a filter f_n by taking:

    s_n = D(|(w ∗ f_n) + jH(w ∗ f_n)|)^c,    (1)

where H is the Hilbert transform, D denotes downsampling, and the compression constant c = 0.3. In Section A1.2, we study how performance varies with the number of frequency channels.

The resulting representation is known as a cochleagram. In Figure 3(a), we visualize the mean cochleagram for a selection of material and reaction classes. This reveals, for example, that cloth sounds tend to have more low-frequency energy than those of rock.
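As an illustration of Eq. (1), the sketch below computes subband envelopes with off-the-shelf SciPy routines. The Butterworth band-pass filters and the omitted low-/high-pass channels are simplifications of the ERB filterbank described above, so the features it produces will differ from the paper's.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample

def erb_space(f_lo, f_hi, n):
    """Center frequencies spaced on the ERB scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(f_lo), erb(f_hi), n))

def subband_envelopes(wave, sr, n_bands=40, env_sr=90, c=0.3):
    """Sketch of the cochleagram features of Eq. (1): band-pass filter,
    take the Hilbert envelope, downsample to ~90 Hz, and compress.
    Butterworth filters stand in for the ERB filterbank, and the
    extra low-/high-pass channels are omitted for brevity."""
    centers = erb_space(30.0, sr / 2 * 0.9, n_bands)
    n_out = int(len(wave) / sr * env_sr)
    feats = []
    for fc in centers:
        lo, hi = fc * 0.8, min(fc * 1.25, sr / 2 * 0.99)
        sos = butter(2, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, wave)
        env = np.abs(hilbert(band))             # Hilbert envelope
        env = resample(env, n_out)              # downsample
        feats.append(np.maximum(env, 0) ** c)   # compression, c = 0.3
    return np.stack(feats, axis=1)              # shape: (time, n_bands)
```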
How well do impact sounds capture material properties in general? To measure this empirically, we trained a linear SVM to predict material class for the sounds in our database, using the subband envelopes as our feature vectors. We resampled our training set so that each class contained an equal number of impacts (260 per class). The resulting material classifier has 45.8% (chance = 5.9%) class-averaged accuracy (i.e., the mean of per-class accuracy values), and its confusion matrix is shown in Figure 3(b). These results suggest that impact sounds convey significant information about materials, and thus if an algorithm could learn to accurately predict these sounds from images, it would have implicit knowledge of material categories.
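This sanity check can be reproduced in outline with scikit-learn. The resampling and the class-averaged accuracy metric follow the description above, while the feature scaling and the SVM regularization constant are unstated details filled in with common defaults.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def material_svm(features, labels, per_class=260, seed=0):
    """Linear SVM on subband-envelope features, with the training set
    resampled to `per_class` impacts per class.  `features` is
    (n_impacts, n_dims); `labels` is an array of class ids."""
    rng = np.random.RandomState(seed)
    idx = np.concatenate([
        rng.choice(np.where(labels == c)[0], per_class, replace=True)
        for c in np.unique(labels)
    ])
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    clf.fit(features[idx], labels[idx])
    return clf

def class_averaged_accuracy(clf, features, labels):
    """Mean of per-class accuracies, the metric reported in the paper."""
    preds = clf.predict(features)
    accs = [np.mean(preds[labels == c] == c) for c in np.unique(labels)]
    return float(np.mean(accs))
```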
[Figure 4 pipeline: video frames → CNN → LSTM → cochleagram sound features → example-based synthesis → waveform.]

Figure 4: We train a neural network to map video sequences to sound features. These sound features are subsequently converted into a waveform using either parametric or example-based synthesis. We represent the images using a convolutional network, and the time series using a recurrent neural network. We show a subsequence of images corresponding to one impact.

5. Predicting visually indicated sounds

We formulate our task as a regression problem – one where the goal is to map a sequence of video frames to a sequence of audio features. We solve this problem using a recurrent neural network that takes color and motion information as input and predicts the subband envelopes of an audio waveform. Finally, we generate a waveform from these sound features. Our neural network and synthesis procedure are shown in Figure 4.

5.1. Regressing sound features

Given a sequence of input images I_1, I_2, ..., I_N, we would like to estimate a corresponding sequence of sound features s_1, s_2, ..., s_T, where s_t ∈ R^42. These sound features correspond to blocks of the cochleagram shown in Figure 4. We solve this regression problem using a recurrent neural network (RNN) that takes image features computed with a convolutional neural network (CNN) as input.

Image representation  We found it helpful to represent motion information explicitly in our model using a two-stream approach [12, 40]. While two-stream models often use optical flow, it is challenging to obtain accurate flow estimates due to the presence of fast, non-rigid motion. Instead, we compute spacetime images for each frame – images whose three channels are grayscale versions of the previous, current, and next frames. This image representation is closely related to 3D video CNNs [24, 27], as derivatives across channels correspond to temporal derivatives.

For each frame t, we construct an input feature vector x_t by concatenating CNN features for the spacetime image at frame t and the color image from the first frame³:

    x_t = [φ(F_t), φ(I_1)],    (2)

where φ are CNN features obtained from layer fc7 of the AlexNet architecture [28] (its penultimate layer), and F_t is the spacetime image at time t. In our experiments (Section 6), we either initialized the CNN from scratch and trained it jointly with the RNN, or we initialized the CNN with weights from a network trained for ImageNet classification. When we used pretraining, we precomputed the features from the convolutional layers and fine-tuned only the fully connected layers.

³We use only the first color image to reduce the computational cost.
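A rough sketch of the input representation of Eq. (2) is shown below. It builds the three-channel spacetime image from grayscale neighboring frames and concatenates fc7 activations of the spacetime image and the first color frame. torchvision's AlexNet stands in for the Caffe model used in the paper, and the preprocessing details (resizing, normalization) are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
import torch
import torchvision

def spacetime_image(prev_f, cur_f, next_f):
    """Three-channel 'spacetime image': grayscale previous, current, and
    next frames stacked as channels (inputs are HxWx3 RGB arrays)."""
    gray = lambda f: 0.299 * f[..., 0] + 0.587 * f[..., 1] + 0.114 * f[..., 2]
    return np.stack([gray(prev_f), gray(cur_f), gray(next_f)], axis=-1)

# Stand-in fc7 extractor: classifier[:6] of torchvision's AlexNet ends at the
# ReLU after the second fully connected layer (a 4096-d activation).
alexnet = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
_mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def fc7(img_hw3):
    x = torch.from_numpy(np.asarray(img_hw3, dtype=np.float32) / 255.0)
    x = x.permute(2, 0, 1).unsqueeze(0)                       # 1x3xHxW
    x = torch.nn.functional.interpolate(x, size=(224, 224), mode="bilinear")
    x = (x - _mean) / _std
    with torch.no_grad():
        h = alexnet.avgpool(alexnet.features(x)).flatten(1)
        return alexnet.classifier[:6](h).squeeze(0).numpy()

def input_feature(frames, t):
    """x_t = [phi(F_t), phi(I_1)]: fc7 of the spacetime image at frame t
    concatenated with fc7 of the first color frame."""
    F_t = spacetime_image(frames[max(t - 1, 0)], frames[t],
                          frames[min(t + 1, len(frames) - 1)])
    return np.concatenate([fc7(F_t), fc7(frames[0])])
```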
Sound prediction model  We use a recurrent neural network (RNN) with long short-term memory units (LSTM) [18] that takes CNN features as input. To compensate for the difference between the video and audio sampling rates, we replicate each CNN feature vector k times, where k = ⌊T/N⌋ (we use k = 3). This results in a sequence of CNN features x_1, x_2, ..., x_T that is the same length as the sequence of audio features. At each timestep of the RNN, we use the current image feature vector x_t to update the vector of hidden variables h_t⁴. We then compute sound features by an affine transformation of the hidden variables:

    s_t = W h_t + b,    h_t = L(x_t, h_{t−1}),    (3)

where L is a function that updates the hidden state [18]. During training, we minimize the difference between the predicted and ground-truth predictions at each timestep:

    E({s_t}) = Σ_{t=1}^{T} ρ(‖s_t − s̃_t‖₂),    (4)

where s̃_t and s_t are the true and predicted sound features at time t, and ρ(r) = log(ε + r²) is a robust loss that bounds the error at each timestep (we use ε = 1/25²). We also increase robustness of the loss by predicting the square root of the subband envelopes, rather than the envelope values themselves. To make the learning problem easier, we use PCA to project the 42-dimensional feature vector at each timestep down to a 10-dimensional space, and we predict this lower-dimensional vector. When we evaluate the network, we invert the PCA transformation to obtain sound features. We train the RNN and CNN jointly using stochastic gradient descent with Caffe [25, 12]. We found it helpful for convergence to remove dropout [44] and to clip large gradients. When training from scratch, we augmented the data by applying cropping and mirroring transformations to the videos. We also use multiple LSTM layers (the number depends on the task; please see Section A1.1).

⁴To simplify the presentation, we have omitted the LSTM's hidden cell state, which is also updated at each timestep.
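The prediction model of Eqs. (3) and (4) can be sketched in a few lines of PyTorch. The class and argument names are hypothetical, the layer sizes and the reading of ε as 1/25² are assumptions, and the PCA projection of the targets and the gradient clipping mentioned above are left out for brevity.

```python
import torch
import torch.nn as nn

class SoundRNN(nn.Module):
    """Sketch of the sound-prediction RNN: stacked LSTMs over CNN features,
    followed by an affine map to the (PCA-projected) sound features."""
    def __init__(self, feat_dim=8192, hidden=256, layers=2, out_dim=10, k=3):
        super().__init__()
        self.k = k                                    # audio/video rate ratio
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            batch_first=True)
        self.out = nn.Linear(hidden, out_dim)         # s_t = W h_t + b

    def forward(self, x):                             # x: (B, N, feat_dim)
        x = x.repeat_interleave(self.k, dim=1)        # replicate each frame k times
        h, _ = self.lstm(x)
        return self.out(h)                            # (B, T, out_dim)

def robust_loss(pred, target, eps=(1.0 / 25) ** 2):
    """Eq. (4): rho(r) = log(eps + r^2) applied to the per-timestep L2 error."""
    r = torch.linalg.vector_norm(pred - target, dim=-1)
    return torch.log(eps + r ** 2).sum(dim=-1).mean()
```

In use, the targets would be the square roots of the subband envelopes projected onto their top principal components, as described above.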
5.2. Generating a waveform

We consider two methods for generating a waveform from the predicted sound features. The first is the simple parametric synthesis approach of [42, 32], which iteratively imposes the subband envelopes on a sample of white noise (we used just one iteration). This method is useful for examining what information is captured by the audio features, since it represents a fairly direct conversion from features to sound. However, for the task of generating plausible sounds to a human ear, we find it more effective to impose a strong natural sound prior during conversion from features to waveform. Therefore, we also consider an example-based synthesis method that snaps a window of sound features to the closest exemplar in the training set. We form a query vector by concatenating the predicted sound features s_1, ..., s_T (or a subsequence of them), searching for its nearest neighbor in the training set as measured by L1 distance, and transferring the corresponding waveform.
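Example-based synthesis reduces to an L1 nearest-neighbor lookup over concatenated sound features, as in the sketch below (array shapes and names are hypothetical).

```python
import numpy as np

def example_based_synthesis(pred_feats, train_feats, train_waves):
    """Concatenate the predicted sound features into a query, find the
    L1-nearest training exemplar, and transfer its waveform.
    `pred_feats` is (T, D); `train_feats` is (n_examples, T, D);
    `train_waves` is a list of the corresponding audio clips."""
    query = pred_feats.reshape(-1)                    # (T*D,)
    db = train_feats.reshape(len(train_feats), -1)    # (n_examples, T*D)
    dists = np.abs(db - query).sum(axis=1)            # L1 distance
    best = int(np.argmin(dists))
    return train_waves[best], best
```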
6. Experiments

We applied our sound-prediction model to several tasks, and evaluated it with a combination of human studies and automated metrics.

6.1. Sound prediction tasks

In order to study the problem of detection – that is, the task of determining when and whether an action that produces a sound has occurred – separately from the task of sound prediction, we consider two kinds of videos. First, we focus on the prediction problem and consider only videos centered on audio amplitude peaks, which often correspond to impact onsets (Section 3). We train our model to predict sound for 15-frame sequences (0.5 sec.) around each peak. For the second task, which we call the detection problem, we train our model on longer sequences (approximately 2 sec. long) sampled from the training videos with a 0.5-second stride, and we subsequently evaluate this model on full-length videos. Since it can be difficult to discern the precise timing of an impact, we allow the predicted features to undergo small shifts before they are compared to the ground truth. We also introduce a two-frame lag in the RNN output, which allows the model to observe future frames before outputting sound features. Finally, before querying sound features, we apply a coloring procedure to account for statistical differences between the predicted and real sound features (e.g., under-prediction of amplitude), using the silent videos in the test set to estimate the empirical mean and covariance of the network's predictions. For these implementation details, please see Section A1.1. For both tasks, we split the full-length videos into training and test sets (75% training and 25% testing).

Models  For the prediction task, we compared our model to image-based nearest neighbor search. We computed fc7 features from a CNN pretrained on ImageNet [28] for the center frame of each sequence, which by construction is the frame where the impact sound occurs. We then searched the training set for the best match and transferred its corresponding sound. We considered variations where the CNN features were computed on an RGB image, on (three-frame) spacetime images, and on the concatenation of both features. To understand the influence of different design decisions, we also considered several variations of our model. We included models with and without ImageNet pretraining; with and without spacetime images; and with example-based versus parametric waveform generation. Finally, we included a model where the RNN connections were broken (the hidden state was set to zero between timesteps).

For the RNN models that do example-based waveform generation (Section 5.2), we used the centered impacts in the training set as the exemplar database. For the prediction task, we performed the query using the sound features for the entire 15-frame sequence. For the detection task, this is not possible, since the videos may contain multiple, overlapping impacts. Instead, we detected amplitude peaks of the parametrically inverted waveform, and matched the sound features in small (8-frame) windows around each peak (starting the window one frame before the peak).

Figure 5(a): Model evaluation. Panels (b) and (c) in the original figure show confusion matrices for predicted sounds and for CNN features, respectively.

                            Psychophysical study    Loudness           Centroid
  Algorithm                 Labeled real            Err.     r         Err.     r
  Full system               40.01% ± 1.66           0.21     0.44      3.85     0.47
  - Trained from scratch    36.46% ± 1.68           0.24     0.36      4.73     0.33
  - No spacetime            37.88% ± 1.67           0.22     0.37      4.30     0.37
  - Parametric synthesis    34.66% ± 1.62           0.21     0.44      3.85     0.47
  - No RNN                  29.96% ± 1.55           1.24     0.04      7.92     0.28
  Image match               32.98% ± 1.59           0.37     0.16      8.39     0.18
  Spacetime match           31.92% ± 1.56           0.41     0.14      7.19     0.21
  Image + spacetime         33.77% ± 1.58           0.37     0.18      7.74     0.20
  Random impact sound       19.77% ± 1.34           0.44     0.00      9.32     0.02

Figure 5: (a) We measured the rate at which subjects chose an algorithm's synthesized sound over the actual sound. Our full system, which was pretrained from ImageNet and used example-based synthesis to generate a waveform, significantly outperformed models based on image matching. For the neural network models, we computed the auditory metrics for the sound features that were predicted by the network, rather than those of the inverted sounds or transferred exemplars. (b) What sounds like what, according to our algorithm? We applied a classifier trained on real sounds to the sounds produced by our algorithm, resulting in a confusion matrix (c.f. Fig. 3(b), which shows a confusion matrix for real sounds). It obtained 22.7% class-averaged accuracy. (c) Confusions made by a classifier trained on fc7 features (30.2% class-averaged accuracy). For both confusion matrices, we used the variation of our model that was trained from scratch (see Fig. A1(b) for the sound confusions made with pretraining).
Figure 6(a):
  Algorithm          Labeled real
  Full sys. + mat.   41.82% ± 1.46
  Full sys.          39.64% ± 1.46
  fc7 NN + mat.      38.20% ± 1.47
  fc7 NN             32.83% ± 1.41
  Random + mat.      35.36% ± 1.42
  Random             20.64% ± 1.22
  Real sound match   46.90% ± 1.49

Figure 6(b):
  Features               Avg. Acc.
  Audio-supervised CNN   30.4%
  ImageNet CNN           42.0%
  Sound                  45.8%
  ImageNet + sound       48.2%
  ImageNet crop          52.9%
  Crop + sound           59.4%

Figure 6: (a) We ran variations of the full system and an image-matching method (RGB + spacetime). For each model, we include an oracle model (labeled with "+ mat") that draws its sound examples from videos with the same material label. (b) Class-averaged material recognition accuracy obtained by training an SVM with different image and sound features.

6.2. Evaluating the sound predictions

We assessed the quality of the sounds using psychophysical experiments and measurements of acoustic properties.

Psychophysical study  To test whether the sounds produced by our model varied appropriately with different actions and materials, we conducted a psychophysical study on Amazon Mechanical Turk. We used a two-alternative forced choice test in which participants were asked to distinguish real and fake sounds. We showed them two videos of an impact event – one playing the recorded sound, the other playing a synthesized sound. We then asked them to choose the one that played the real sound. The sound-prediction algorithm was chosen randomly on a per-video basis. We randomly sampled 15 impact-centered sequences from each full-length video, showing each participant at most one impact from each one. At the start of the experiment, we revealed the correct answer to five practice videos.

We measured the rate at which participants mistook our model's result for the ground-truth sound (Figure 5(a)), finding that our full system – with RGB and spacetime input, RNN connections, ImageNet pretraining, and example-based waveform generation – significantly outperformed the image-matching methods. It also outperformed a baseline that sampled a random (centered) sound from the training set (p < 0.001 with a two-sided t-test). We found that the version of our model that was trained from scratch outperformed the best image-matching method (p = 0.02). Finally, for this task, we did not find the difference between our full and RGB-only models to be significant (p = 0.08).
100 100 100
labeled as real

100
Material Action Reaction
Ours
Impact detection We also used our methods to pro-
80
80 80 80
duce sounds for long, uncentered videos, a problem set-

Mean accuracy
Mean accuracy

Image+spacetime match

Mean accuracy
60
60 60 60 ting that allows us to evaluate their ability to detect impact
events. We provide qualitative examples in Figure 8 and
% synthesized

40
40 40 40
on our webpage (vis.csail.mit.edu). To quantitatively evalu-
20
20 20 20
ate its detection accuracy, we used the parametric synthesis
00 0 0 method to produce a waveform, applied a large gain to that

rm
ra it

st on
sp atic
h

fo h
pl per

ot r
gl all

di f

pl cer loth
w el
gr tile

ro t

g
s

w tic
rt

ca od

m ck
gr ass

pa al

c er
tic ic
a

-m tte
sc h

de s
as

tc
ba
as am
av

et
le

rp

at
waveform, and then detected amplitude peaks (Section 3).
yw

a
as

i
o

id ca

l
dr

rig s
We then compared the timing of these peaks to those of the
Figure 7: Semantic analysis of psychophysical study. We show ground truth, considering an impact to be detected if a pre-
the rate that our algorithm fooled human participants for each dicted spike occurred within 0.1 seconds of it. Using the
material, action, and reaction class. The error bars show 95% predicted amplitude as a measure of confidence, we com-
confidence intervals. Our approach significantly outperforms the puted average precision. We compared our model to an
highest-performing image-matching method (RGB + spacetime). RGB-only model, finding that the spacetime images signif-
icantly improve the result, with APs of 43.6% and 21.6%
result, it performed poorly on automated metrics and failed respectively. Both models were pretrained with ImageNet.
to find good matches. The performance of our model with
parametric waveform generation varied widely between cat- 6.3. Learning about material and action by
egories. It did well on materials such as leaf and dirt that predicting sounds
are suited to the relatively noisy sounds that the method pro-
duces but poorly on hard materials such as wood and metal By learning to predict sounds, did the network also learn
(e.g., a confusion rate of 62% ± 6% for dirt and 18% ± 5% something about material and physical interactions? To as-
for metal). On the other hand, the example-based approach sess this, we tested whether the network’s output sounds
was not effective at matching textural sounds, such as those were informative about material and action class. We ap-
produced by splashing water (Fig. 7). plied the same SVM that was trained to predict mate-
rial/action class on real sound features (Sec. 4) to the
Auditory metrics We measured quantitative properties of sounds predicted by the model. Under this evaluation
sounds for the prediction task. We chose metrics that were regime, it is not enough for the network’s sounds to merely
not sensitive to precise timing. First, we measured the loud- be distinguishable by class: they must be close enough to
ness of the sound, which we took to be the maximum energy real sounds so as to be classified correctly by an SVM that
(L2 norm) of the compressed subband envelopes over all has never seen a predicted sound. To avoid the influence
timesteps. Second, we compared the sounds’ spectral cen- of pretraining, we used a network that was trained from
troids, which we measured by taking the center of mass of scratch. We note that this evaluation method is different
the frequency channels for a one-frame (approx. 0.03 sec.) from that of recent unsupervised learning models [11, 1, 47]
window around the center of the impact. We found that that train a classifier on the network’s feature activations,
on both metrics, the network was more accurate than the rather than on a ground-truth version of the output.
image-matching methods, both in terms of mean squared Using this idea, we classified the material category from
error and correlation coefficients (Figure 5(a)). predicted sound features. The classifier had class-averaged
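Both auditory metrics are straightforward to compute from the subband-envelope features; the sketch below follows the definitions above, with the window size around the impact center left as an adjustable assumption and the band index used as the frequency coordinate.

```python
import numpy as np

def loudness(env):
    """Loudness: maximum energy (L2 norm) of the compressed subband
    envelopes over all timesteps.  `env` has shape (T, n_bands)."""
    return float(np.linalg.norm(env, axis=1).max())

def spectral_centroid(env, center_t, half_window=1):
    """Spectral centroid: center of mass over frequency channels in a
    small window around the impact center (window size is illustrative)."""
    win = env[max(center_t - half_window, 0): center_t + half_window + 1]
    spectrum = win.mean(axis=0)
    bands = np.arange(len(spectrum))
    return float((bands * spectrum).sum() / (spectrum.sum() + 1e-8))
```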
Oracle results  How helpful is material category information? We conducted a second study that controlled for material-recognition accuracy. Using the subset of the data with material annotations, we created a model that chose a random sound from the same class as the input video. We also created a number of oracle models that used these material labels (Table 6(a)). For the best-performing image-matching model (RGB + spacetime), we restricted the pool of matches to be those with the same label as the input (and similarly for the example-based synthesis method). We also considered a model that matched the ground-truth sound to the training set and transferred the best match. We found that, while knowing the material was helpful for each method, it was not sufficient, as the oracle models did not outperform our model. In particular, our model significantly outperformed the random-sampling oracle (p < 10⁻⁴).

Impact detection  We also used our methods to produce sounds for long, uncentered videos, a problem setting that allows us to evaluate their ability to detect impact events. We provide qualitative examples in Figure 8 and on our webpage (vis.csail.mit.edu). To quantitatively evaluate its detection accuracy, we used the parametric synthesis method to produce a waveform, applied a large gain to that waveform, and then detected amplitude peaks (Section 3). We then compared the timing of these peaks to those of the ground truth, considering an impact to be detected if a predicted spike occurred within 0.1 seconds of it. Using the predicted amplitude as a measure of confidence, we computed average precision. We compared our model to an RGB-only model, finding that the spacetime images significantly improve the result, with APs of 43.6% and 21.6% respectively. Both models were pretrained with ImageNet.
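The detection metric can be sketched as a standard average-precision computation in which a predicted peak matches an unused ground-truth onset within 0.1 seconds. The greedy matching order and the trapezoidal integration below are implementation choices, not details taken from the paper.

```python
import numpy as np

def detection_ap(pred_times, pred_amps, true_times, tol=0.1):
    """Average precision for impact detection: a prediction is a true
    positive if it falls within `tol` seconds of an unmatched onset,
    with predicted amplitude used as the confidence score."""
    order = np.argsort(-np.asarray(pred_amps))
    true_times = np.asarray(true_times)
    matched = np.zeros(len(true_times), dtype=bool)
    tp = np.zeros(len(order)); fp = np.zeros(len(order))
    for rank, i in enumerate(order):
        diffs = np.abs(true_times - pred_times[i]) if len(true_times) else np.array([])
        j = int(np.argmin(diffs)) if len(diffs) else -1
        if j >= 0 and diffs[j] <= tol and not matched[j]:
            matched[j] = True; tp[rank] = 1
        else:
            fp[rank] = 1
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(true_times), 1)
    precision = tp / np.maximum(tp + fp, 1e-8)
    # Area under the precision-recall curve.
    return float(np.trapz(precision, recall))
```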
6.3. Learning about material and action by predicting sounds

By learning to predict sounds, did the network also learn something about material and physical interactions? To assess this, we tested whether the network's output sounds were informative about material and action class. We applied the same SVM that was trained to predict material/action class on real sound features (Sec. 4) to the sounds predicted by the model. Under this evaluation regime, it is not enough for the network's sounds to merely be distinguishable by class: they must be close enough to real sounds so as to be classified correctly by an SVM that has never seen a predicted sound. To avoid the influence of pretraining, we used a network that was trained from scratch. We note that this evaluation method is different from that of recent unsupervised learning models [11, 1, 47] that train a classifier on the network's feature activations, rather than on a ground-truth version of the output.

Using this idea, we classified the material category from predicted sound features. The classifier had class-averaged accuracy of 22.7%, and its confusion matrix is shown in Fig. 5(b). This accuracy indicates that our model learned an output representation that was informative about material, even though it was only trained to predict sound. We applied a similar methodology to classify action categories from predicted sounds, obtaining 68.6% class-averaged accuracy (chance = 50%), and 53.5% for classifying reaction categories (chance = 20%). We found that material and reaction recognition accuracy improved with ImageNet pretraining (to 28.8% and to 55.2%, respectively), but that there was a slight decrease for action classification (to 66.5%).

[Figure 8 panels: for each example, a frame from the input video alongside the real and synthesized cochleagrams (frequency subband vs. time in seconds).]
Figure 8: Automatic sound prediction results. We show cochleagrams for a representative selection of video sequences, with a sample
frame from each sequence on the left. The frame is sampled from the location indicated by the black triangle on the x-axis of each
cochleagram. Notice that the algorithm’s synthesized cochleagrams match the general structure of the ground truth cochleagrams. Dark
lines in the cochleagrams indicate hits, which the algorithm often detects. The algorithm captures aspects of both the temporal and spectral
structure of sounds. It correctly predicts staccato taps in the rock example and longer waveforms for rustling ivy. Furthermore, it tends to
predict lower pitched thuds for a soft couch and higher pitched clicks when the drumstick hits a hard wooden railing (although the spectral
differences may appear small in these visualizations, we evaluate this with objective metrics in Section 6). A common failure mode is
that the algorithm misses a hit (railing example) or hallucinates false hits (cushion example). This frequently happens when the drumstick
moves erratically. Please see our video for qualitative results.

We also tested whether the predicted sound features convey information about the hardness of a surface. We grouped the material classes into superordinate hard and soft classes, and trained a classifier on real sound features (sampling 1300 examples per class), finding that it obtained 66.8% class-averaged accuracy (chance = 50%). Here we have defined soft materials to be {leaf, grass, cloth, plastic bag, carpet} and hard materials to be {gravel, rock, tile, wood, ceramic, plastic, drywall, glass, metal}.

We also considered the problem of directly predicting material class from visual features. In Table 6(b), we trained a classifier using fc7 features – both those of the model trained from scratch, and of a model trained on ImageNet [28]. We concatenated color and spacetime image features, since we found that this improved performance. We also considered an oracle model that cropped a high-resolution (256 × 256) patch from the impact location using human annotations, and concatenated its features with those of the full image (we used color images). To avoid occlusions from the arm or drumstick, we cropped the patch from the final frame of the video. We found that performing these crops significantly increased the accuracy, suggesting that localizing the impact is important for classification. We also tried concatenating vision and sound features (similar to [2]), finding that this significantly improved the accuracy.

The kinds of mistakes that the visual classifier (video → material) made were often different from those of the sound classifier (sound → material). For instance, the visual classifier was able to distinguish classes that have a very different appearance, such as paper and cloth. These classes both make low-pitched sounds (e.g., cardboard and cushions), and were sometimes confused by the sound classifier. On the other hand, the visual classifier was more likely to confuse materials from outdoor scenes, such as rocks and leaves – materials that sound very different but which frequently co-occur in a scene. When we analyzed our model by classifying its sound predictions (video → sound → material), the resulting confusion matrix (Fig. 5(b)) contains both kinds of error: there are visual analysis errors when it misidentifies the material that was struck, and sound synthesis errors when it produces a sound that was not a convincing replica of the real sound.

7. Discussion

In this work, we proposed the problem of synthesizing visually indicated sounds – a problem that requires an algorithm to learn about material properties and physical interactions. We introduced a dataset for studying this task, which contains videos of a person probing materials in the world with a drumstick, and an algorithm based on recurrent neural networks. We evaluated the quality of our approach with psychophysical experiments and automated metrics, showing that the performance of our algorithm was significantly better than baselines.

We see our work as opening two possible directions for future research. The first is producing realistic sounds from videos, treating sound production as an end in itself. The second direction is to use sound and material interactions as steps toward physical scene understanding.

Acknowledgments. This work was supported by NSF grants 6924450 and 6926677, by Shell, and by a Microsoft Ph.D. Fellowship to A.O. We thank Carl Vondrick and Rui Li for the helpful discussions, and the workers at Middlesex Fells, Arnold Arboretum, and Mt. Auburn Cemetery for not asking too many questions while we were collecting the Greatest Hits dataset.
References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
[2] A. Arnab, M. Sapienza, S. Golodetz, J. Valentin, O. Miksik, S. Izadi, and P. H. S. Torr. Joint object-material category segmentation from audio-visual cues. In BMVC, 2015.
[3] R. Baillargeon. The acquisition of physical knowledge in infancy: A summary in eight lessons. Blackwell Handbook of Childhood Cognitive Development, 1:46–83, 2002.
[4] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
[5] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5):1035–1047, 2005.
[6] T. Bonebright. Were those coconuts or horse hoofs? Visual context effects on identification and perceived veracity of everyday sounds. In International Conference on Auditory Display, 2012.
[7] S. Cavaco and M. S. Lewicki. Statistical modeling of intrinsic structures in impacts sounds. The Journal of the Acoustical Society of America, 121(6):3558–3568, 2007.
[8] A. Davis, K. L. Bouman, M. Rubinstein, F. Durand, and W. T. Freeman. Visual vibrometry: Estimating material properties from small motion in video. In CVPR, 2015.
[9] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, and W. T. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (TOG), 2014.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[12] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[13] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
[14] W. W. Gaver. What in the world do we hear?: An ecological approach to auditory event perception. Ecological Psychology, 1993.
[15] M. Gemici and A. Saxena. Learning haptic representation for manipulating deformable food objects. In IROS, 2014.
[16] B. R. Glasberg and B. C. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1):103–138, 1990.
[17] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518, 2015.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] P.-J. Hsieh, J. T. Colas, and N. Kanwisher. Spatial pattern of BOLD fMRI activation reveals cross-modal information in auditory cortex. Journal of Neurophysiology, 2012.
[20] Y. Hu and P. C. Loizou. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Transactions on Speech and Audio Processing, 12(1):59–67, 2004.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson. Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811, 2015.
[23] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, December 2015.
[24] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 2013.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[26] M. Kac. Can one hear the shape of a drum? The American Mathematical Monthly, 1966.
[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[29] E. Krotkov. Robotic perception of material. In IJCAI, 1995.
[30] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 2015.
[31] R. A. Lutfi. Human sound source identification. In Auditory Perception of Sound Sources, pages 13–42. Springer, 2008.
[32] J. H. McDermott and E. P. Simoncelli. Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5):926–940, 2011.
[33] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.
[34] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
[35] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. arXiv preprint arXiv:1509.06825, 2015.
[36] L. Schulz. The origins of inquiry: Inductive inference and exploration in early childhood. Trends in Cognitive Sciences, 16(7):382–389, 2012.
[37] A. A. Shabana. Theory of Vibration: An Introduction. Springer Science & Business Media, 1995.
[38] L. Sharan, C. Liu, R. Rosenholtz, and E. H. Adelson. Recognizing materials using perceptually inspired features. International Journal of Computer Vision, 103(3):348–371, 2013.
[39] M. H. Siegel, R. Magid, J. B. Tenenbaum, and L. E. Schulz. Black boxes: Hypothesis testing via indirect perceptual evidence. Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014.
[40] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
[41] J. Sinapov, M. Wiemer, and A. Stoytchev. Interactive learning of the acoustic properties of household objects. In ICRA, 2009.
[42] M. Slaney. Pattern playback in the 90s. In NIPS, pages 827–834, 1994.
[43] L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.
[44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[45] K. Van Den Doel, P. G. Kry, and D. K. Pai. FoleyAutomatic: Physically-based sound effects for interactive simulation and animation. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 537–544. ACM, 2001.
[46] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
[47] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[48] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
[49] L. Zitnick. 80,000 MS COCO images in 5 minutes. https://www.youtube.com/watch?v=ZUIEOUoCLBo.
A1. Model implementation

We provide more details about our model and sound representation.

A1.1. Detection model

We describe the variation of our model that performs the detection task (Section 6.1) in more detail.

Timing  We allow the sound features to undergo small time shifts in order to account for misalignments for the detection task. During each iteration of backpropagation, we shift the sequence so as to minimize the loss in Equation 4. We resample the feature predictions to create a new sequence ŝ_1, ŝ_2, ..., ŝ_T such that ŝ_t = s_{t+L_t} for some small shift L_t (we use a maximum shift of 8 samples, approximately 0.09 seconds). During each iteration, we infer this shift by finding the optimal labeling of a Hidden Markov Model:

    Σ_{t=1}^{T} w_t ρ(‖ŝ_t − s̃_t‖) + V(L_t, L_{t+1}),    (5)

where V is a smoothness term for neighboring shifts. For this, we use a Potts model weighted by ½(‖s̃_t‖ + ‖s̃_{t+1}‖) to discourage the model from shifting the sound near high-amplitude regions. We also include a weight variable w_t = 1 + αδ(τ ≤ ‖s̃_t‖) to decrease the importance of silent portions of the video (we use α = 3 and τ = 2.2). During each iteration of backpropagation, we align the two sequences, then propagate the gradients of the loss to the shifted sequence.

To give the RNN more temporal context for its predictions, we also delay its predictions, so that at frame f, it predicts the sound features for frame f − 2.
predicts the sound features for frame f − 2. that was most similar to the previously chosen class. When
Transforming features for neighbor search For the de- measuring the similarity between two classes, we computed
tection task, the statistics of the synthesized sound features Euclidean distance between rows of a (soft) confusion ma-
can differ significantly from those of the ground truth – for trix – one whose rows correspond to the mean probability
example, we found the amplitude of peaks in the predicted assigned by the classifier to each target class (averaged over
waveforms to be smaller than those of real sounds. We cor- all test examples).
rect for these differences during example-based synthesis
A1.3. Network structure
(Section 5.2) by applying a coloring transformation before
the nearest-neighbor search. More specifically, we obtain We used AlexNet [28] for our CNN architecture. For
a whitening transformation for the predicted sound features the pretrained models, we precomputed the pool5 fea-
by running the neural network on the test videos and esti- tures and fine-tuned the model’s two fully-connected lay-
mating the empirical mean and covariance at the detected ers. For the model that was trained from scratch, we ap-
amplitude peaks, discarding peaks whose amplitude is be- plied batch normalization [21] to each training mini-batch.
low a threshold. We then estimate a similar transformation For the centered videos, we used two LSTM layers with a
for ground-truth amplitude peaks in the training set, and we 256-dimensional hidden state (and three for the detection
use these transformations to color (i.e. transform the mean model). When using multiple LSTM layers, we compen-
Figure A2: A “walk” through the dataset using AlexNet fc7 nearest-neighbor matches. Starting from the left, we matched an image with
the database and placed its best match to its right. We repeat this 5 times, with 20 random initializations. We used only images taken at a
contact point (the middle frames from the “centered” videos). To avoid loops, we removed videos when any of their images were matched.
The location of the hit, material, and action often vary during the walk. In some sequences, the arm is the dominant feature that is matched
between scenes.

A2. Dataset details

In Figure A2, we show a "walk" through the dataset using fc7 features, similar to [49]. Our data was collected using two wooden (hickory) drumsticks, and an SLR camera with a 29.97 Hz framerate. We used a ZOOM H1 external audio recorder, and a Rode VideoMic Pro microphone. Online workers labeled the impacts by visually examining silent videos, without sound. We gave them finer-grained categories, then merged similar categories that were frequently labeled inconsistently by workers. Specifically, we merged cardboard and paper; concrete and rock; cloth and cushion (often the former physically covers the latter); and rubber and plastic. To measure overall consistency between workers, we labeled a subset of the impacts with 3 or more workers, finding that their material labels agreed with the majority 87.6% of the time on the fine-grained categories. Common inconsistencies include confusing dirt with leaf (confused 5% of the time); grass with dirt and leaf (8% each); and cloth with (the fine-grained category) cushion (9% of the time).