Visually Indicated Sounds
Figure 1: We train a model to synthesize plausible impact sounds from silent videos, a task that requires implicit knowledge of material
properties and physical interactions. In each video, someone probes the scene with a drumstick, hitting and scratching different objects.
We show frames from two videos and below them the predicted audio tracks. The locations of these sampled frames are indicated by the
dotted lines on the audio track. The predicted audio tracks show seven seconds of sound, corresponding to multiple hits in the videos.
[Figure 2 plots: example frames and annotation statistics over material categories (carpet, ceramic, cloth, dirt, glass, grass, gravel, leaf, metal, paper, ...) and reaction types (scattering, deformation, ...).]
Figure 2: Greatest Hits: Volume 1 dataset. What do these materials sound like when they are struck? We collected 977 videos in which
people explore a scene by hitting and scratching materials with a drumstick, comprising 46,577 total actions. Human annotators labeled
the actions with material category labels, the location of impact, an action type label (hit vs. scratch), and a reaction label (shown on right).
These labels were used only for analyzing what our sound prediction model learned, not for training it. We show images from a selection
of videos from our dataset for a subset of the material categories (here we show examples where it is easy to see the material in question).
and transferring their corresponding sounds, or by parametrically inverting the features. We evaluate the quality of our predicted sounds using a psychophysical study, and we also analyze what our method learned about actions and materials through the task of learning to predict sound.

2. Related work

Our work closely relates to research in sound and material perception, and to representation learning.

Foley The idea of adding sound effects to silent movies goes back at least to the 1920s, when Jack Foley and collaborators discovered that they could create convincing sound effects by crumpling paper, snapping lettuce, and shaking cellophane in their studio¹, a method now known as Foley. Our algorithm performs a kind of automatic Foley, synthesizing plausible sound effects without a human in the loop.

¹ To our delight, Foley artists really do knock two coconuts together to fake the sound of horses galloping [6].

Sound and materials In the classic mathematical work of [26], Kac showed that the shape of a drum could be partially recovered from the sound it makes. Material properties, such as stiffness and density [37, 31, 14], can likewise be determined from impact sounds. Recent work has used these principles to estimate material properties by measuring tiny vibrations in rods and cloth [8], and similar methods have been used to recover sound from high-speed video of a vibrating membrane [9]. Rather than using a camera as an instrument for measuring vibrations, we infer a plausible sound for an action by recognizing what kind of sound this action would normally make in the visually observed scene. Impact sounds have been used in other work to recognize objects and materials. Arnab et al. [2] recently presented a semantic segmentation model that incorporates audio from impact sounds, and showed that audio information could help recognize objects and materials that were ambiguous from visual cues alone. Other work recognizes objects using audio produced by robotic interaction [41, 29].

Sound synthesis Our technical approach resembles speech synthesis methods that use neural networks to predict sound features from pre-tokenized text features and then generate a waveform from those features [30]. There are also methods, such as the FoleyAutomatic system, for synthesizing impact sounds from physical simulations [45]. Work in psychology has studied low-dimensional representations for impact sounds [7], and recent work in neuroimaging has shown that silent videos of impact events activate the auditory cortex [19].

Learning visual representations from natural signals Previous work has explored the idea of learning visual representations by predicting one aspect of a raw sensory signal from another. For example, [11, 22] learned image representations by predicting the spatial relationship between image patches, and [1, 23] by predicting the egocentric motion between video frames. Several methods have also used temporal proximity as a supervisory signal [33, 17, 47, 46]. Unlike in these approaches, we learn to predict one sensory modality (sound) from another (vision). There has also been work that trains neural networks from multiple modalities. For example, [34] learned a joint model of audio and video. However, while they study speech using an autoencoder, we focus on material interaction, and we use a recurrent neural network to predict sound features from video.

A central goal of other methods has been to use a proxy signal (e.g., temporal proximity) to learn a generically useful representation of the world. In our case, we predict a signal – sound – known to be a useful representation for many tasks [14, 37], and we show that the output (i.e. the predicted sound itself, rather than some internal representation in the model) is predictive of material and action classes.
3. The Greatest Hits dataset

[…]

4. Sound representation

[…]

s_n = D(|(w ∗ f_n) + jH(w ∗ f_n)|)^c,   (1)

where H is the Hilbert transform, D denotes downsampling, and the compression constant c = 0.3. In Section A1.2, we study how performance varies with the number of frequency channels.

The resulting representation is known as a cochleagram. In Figure 3(a), we visualize the mean cochleagram for a selection of material and reaction classes. This reveals, for example, that cloth sounds tend to have more low-frequency energy than those of rock.

How well do impact sounds capture material properties in general? To measure this empirically, we trained a linear SVM to predict material class for the sounds in our database, using the subband envelopes as our feature vectors. We resampled our training set so that each class contained an equal number of impacts (260 per class). The resulting material classifier has 45.8% (chance = 5.9%) class-averaged accuracy (i.e., the mean of per-class accuracy values), and its confusion matrix is shown in Figure 3(b). These results suggest that impact sounds convey significant information about materials, and thus if an algorithm could learn to accurately predict these sounds from images, it would have implicit knowledge of material categories.

Figure 4: We train a neural network to map video sequences to sound features. These sound features are subsequently converted into a waveform using either parametric or example-based synthesis. We represent the images using a convolutional network, and the time series using a recurrent neural network. We show a subsequence of images corresponding to one impact.
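To make the sound representation concrete, the sketch below computes subband envelopes in the spirit of Eq. (1) with NumPy/SciPy. The paper's exact filter bank (ERB-spaced band-pass filters, per [16]) and its envelope sampling rate are not reproduced here; the log-spaced Butterworth bands, the 90 Hz envelope rate, and the frequency limits are stand-in assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample

def subband_envelopes(wave, sr, n_bands=42, env_sr=90, c=0.3,
                      f_lo=50.0, f_hi=7000.0):
    """Sketch of Eq. (1): band-pass filter the waveform, take the Hilbert
    envelope, downsample, and apply power-law compression. The filter
    design and rates here are illustrative stand-ins."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)      # log-spaced band edges (Hz)
    n_out = int(len(wave) * env_sr / sr)               # envelope samples after D(.)
    envs = np.zeros((n_bands, n_out))
    for n in range(n_bands):
        b, a = butter(2, [edges[n], edges[n + 1]], btype="bandpass", fs=sr)
        band = filtfilt(b, a, wave)                    # w * f_n
        env = np.abs(hilbert(band))                    # |(w * f_n) + j H(w * f_n)|
        env = np.maximum(resample(env, n_out), 0.0)    # downsample, clip ringing
        envs[n] = env ** c                             # compression, c = 0.3
    return envs                                        # (n_bands, T) cochleagram
```

Stacking the 42 band envelopes over time gives cochleagram blocks of the kind the network regresses in Section 5.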
5. Predicting visually indicated sounds

We formulate our task as a regression problem – one where the goal is to map a sequence of video frames to a sequence of audio features. We solve this problem using a recurrent neural network that takes color and motion information as input and predicts the subband envelopes of an audio waveform. Finally, we generate a waveform from these sound features. Our neural network and synthesis procedure are shown in Figure 4.

5.1. Regressing sound features

Given a sequence of input images I_1, I_2, ..., I_N, we would like to estimate a corresponding sequence of sound features s_1, s_2, ..., s_T, where s_t ∈ R^42. These sound features correspond to blocks of the cochleagram shown in Figure 4. We solve this regression problem using a recurrent neural network (RNN) that takes image features computed with a convolutional neural network (CNN) as input.

Image representation We found it helpful to represent motion information explicitly in our model using a two-stream approach [12, 40]. While two-stream models often use optical flow, it is challenging to obtain accurate flow estimates due to the presence of fast, non-rigid motion. Instead, we compute spacetime images for each frame – images whose three channels are grayscale versions of the previous, current, and next frames. This image representation is closely related to 3D video CNNs [24, 27], as derivatives across channels correspond to temporal derivatives.

For each frame t, we construct an input feature vector x_t by concatenating CNN features for the spacetime image at frame t and the color image from the first frame³:

x_t = [φ(F_t), φ(I_1)],   (2)

where φ are CNN features obtained from layer fc7 of the AlexNet architecture [28] (its penultimate layer), and F_t is the spacetime image at time t. In our experiments (Section 6), we either initialized the CNN from scratch and trained it jointly with the RNN, or we initialized the CNN with weights from a network trained for ImageNet classification. When we used pretraining, we precomputed the features from the convolutional layers and fine-tuned only the fully connected layers.

³ We use only the first color image to reduce the computational cost.

Sound prediction model We use a recurrent neural network (RNN) with long short-term memory units (LSTM) [18] that takes CNN features as input. To compensate for the difference between the video and audio sampling rates, we replicate each CNN feature vector k times, where k = ⌊T/N⌋ (we use k = 3). This results in a sequence of CNN features x_1, x_2, ..., x_T that is the same length as the sequence of audio features. At each timestep of the RNN, we use the current image feature vector x_t to update the vector of hidden variables h_t⁴.

⁴ To simplify the presentation, we have omitted the LSTM's hidden cell state, which is also updated at each timestep.
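As a rough illustration of the input construction, the sketch below builds spacetime images and forms the per-frame features of Eq. (2), then replicates them to the audio feature rate. It assumes the fc7 features φ(·) are computed separately (4096-dimensional for AlexNet); clamping the first and last frames at the sequence boundary is our assumption, not a detail stated in the text.

```python
import numpy as np

def spacetime_images(frames):
    """Three-channel 'spacetime' images: grayscale previous, current, and
    next frames (Section 5.1). `frames` has shape (N, H, W, 3); boundary
    frames are clamped, a detail left unspecified in the text."""
    gray = frames.astype(np.float32).mean(axis=3)           # (N, H, W)
    prev = np.concatenate([gray[:1], gray[:-1]], axis=0)    # frame t-1
    nxt = np.concatenate([gray[1:], gray[-1:]], axis=0)     # frame t+1
    return np.stack([prev, gray, nxt], axis=3)              # (N, H, W, 3)

def rnn_inputs(phi_spacetime, phi_color_first, k=3):
    """x_t = [phi(F_t), phi(I_1)] (Eq. 2), replicated k = floor(T/N) times
    so that image features match the audio feature rate."""
    n = len(phi_spacetime)
    x = np.concatenate([phi_spacetime,
                        np.repeat(phi_color_first[None, :], n, axis=0)], axis=1)
    return np.repeat(x, k, axis=0)                           # (k*N, 2*4096) for fc7
```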
We then compute sound features by an affine transformation of the hidden variables:

s_t = W h_t + b,   h_t = L(x_t, h_{t−1}),   (3)

where L is a function that updates the hidden state [18]. During training, we minimize the difference between the predicted and ground-truth sound features at each timestep:

E({s_t}) = ∑_{t=1}^{T} ρ(‖s_t − s̃_t‖₂),   (4)

where s̃_t and s_t are the true and predicted sound features at time t, and ρ(r) = log(ε + r²) is a robust loss that bounds the error at each timestep (we use ε = 1/25²). We also increase robustness of the loss by predicting the square root of the subband envelopes, rather than the envelope values themselves. To make the learning problem easier, we use PCA to project the 42-dimensional feature vector at each timestep down to a 10-dimensional space, and we predict this lower-dimensional vector. When we evaluate the network, we invert the PCA transformation to obtain sound features. We train the RNN and CNN jointly using stochastic gradient descent with Caffe [25, 12]. We found it helpful for convergence to remove dropout [44] and to clip large gradients. When training from scratch, we augmented the data by applying cropping and mirroring transformations to the videos. We also use multiple LSTM layers (the number depends on the task; please see Section A1.1).
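The paper trains this model in Caffe; purely to make Eqs. (3) and (4) concrete, here is a PyTorch-flavored sketch of the recurrent regressor and the robust loss. The hidden size and layer count follow Section A1.3, and the 10-dimensional PCA output follows the text; the exact value of ε is our reading of the (garbled) source, and everything else is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SoundRegressor(nn.Module):
    """LSTM that maps per-timestep CNN features x_t to sound features
    s_t = W h_t + b (Eq. 3). Sizes are illustrative."""
    def __init__(self, in_dim=2 * 4096, hidden=256, out_dim=10, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)    # affine output: W h_t + b

    def forward(self, x):                        # x: (batch, T, in_dim)
        h, _ = self.lstm(x)
        return self.out(h)                       # (batch, T, out_dim)

def robust_loss(pred, target, eps=(1.0 / 25) ** 2):
    """Eq. (4): sum_t log(eps + ||s_t - s~_t||^2), which bounds the
    contribution of any single timestep to the loss."""
    sq_err = ((pred - target) ** 2).sum(dim=-1)  # squared L2 error per timestep
    return torch.log(eps + sq_err).sum(dim=-1).mean()
```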
5.2. Generating a waveform

We consider two methods for generating a waveform from the predicted sound features. The first is the simple parametric synthesis approach of [42, 32], which iteratively imposes the subband envelopes on a sample of white noise (we used just one iteration). This method is useful for examining what information is captured by the audio features, since it represents a fairly direct conversion from features to sound. However, for the task of generating plausible sounds to a human ear, we find it more effective to impose a strong natural sound prior during conversion from features to waveform. Therefore, we also consider an example-based synthesis method that snaps a window of sound features to the closest exemplar in the training set. We form a query vector by concatenating the predicted sound features s_1, ..., s_T (or a subsequence of them), searching for its nearest neighbor in the training set as measured by L1 distance, and transferring the corresponding waveform.
6. Experiments

We applied our sound-prediction model to several tasks, and evaluated it with a combination of human studies and automated metrics.

6.1. Sound prediction tasks

In order to study the problem of detection – that is, the task of determining when and whether an action that produces a sound has occurred – separately from the task of sound prediction, we consider two kinds of videos. First, we focus on the prediction problem and consider only videos centered on audio amplitude peaks, which often correspond to impact onsets (Section 3). We train our model to predict sound for 15-frame sequences (0.5 sec.) around each peak. For the second task, which we call the detection problem, we train our model on longer sequences (approximately 2 sec. long) sampled from the training videos with a 0.5-second stride, and we subsequently evaluate this model on full-length videos. Since it can be difficult to discern the precise timing of an impact, we allow the predicted features to undergo small shifts before they are compared to the ground truth. We also introduce a two-frame lag in the RNN output, which allows the model to observe future frames before outputting sound features. Finally, before querying sound features, we apply a coloring procedure to account for statistical differences between the predicted and real sound features (e.g., under-prediction of amplitude), using the silent videos in the test set to estimate the empirical mean and covariance of the network's predictions. For these implementation details, please see Section A1.1. For both tasks, we split the full-length videos into training and test sets (75% training and 25% testing).

Models For the prediction task, we compared our model to image-based nearest neighbor search. We computed fc7 features from a CNN pretrained on ImageNet [28] for the center frame of each sequence, which by construction is the frame where the impact sound occurs. We then searched the training set for the best match and transferred its corresponding sound. We considered variations where the CNN features were computed on an RGB image, on (three-frame) spacetime images, and on the concatenation of both features. To understand the influence of different design decisions, we also considered several variations of our model. We included models with and without ImageNet pretraining; with and without spacetime images; and with example-based versus parametric waveform generation. Finally, we included a model where the RNN connections were broken (the hidden state was set to zero between timesteps).

For the RNN models that do example-based waveform generation (Section 5.2), we used the centered impacts in the training set as the exemplar database. For the prediction task, we performed the query using the sound features for the entire 15-frame sequence. For the detection task, this is not possible, since the videos may contain multiple, overlapping impacts. Instead, we detected amplitude peaks of the parametrically inverted waveform, and matched the sound features in small (8-frame) windows around each peak.
Figure 7: Semantic analysis of psychophysical study. We show the rate that our algorithm fooled human participants for each material, action, and reaction class. The error bars show 95% confidence intervals. Our approach significantly outperforms the highest-performing image-matching method (RGB + spacetime).

[…] result, it performed poorly on automated metrics and failed to find good matches. The performance of our model with parametric waveform generation varied widely between categories. It did well on materials such as leaf and dirt that are suited to the relatively noisy sounds that the method produces, but poorly on hard materials such as wood and metal (e.g., a confusion rate of 62% ± 6% for dirt and 18% ± 5% for metal). On the other hand, the example-based approach was not effective at matching textural sounds, such as those produced by splashing water (Fig. 7).

Auditory metrics We measured quantitative properties of sounds for the prediction task. We chose metrics that were not sensitive to precise timing. First, we measured the loudness of the sound, which we took to be the maximum energy (L2 norm) of the compressed subband envelopes over all timesteps. Second, we compared the sounds' spectral centroids, which we measured by taking the center of mass of the frequency channels for a one-frame (approx. 0.03 sec.) window around the center of the impact. We found that on both metrics, the network was more accurate than the image-matching methods, both in terms of mean squared error and correlation coefficients (Figure 5(a)).
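For reference, both auditory metrics can be computed directly from the (n_bands, T) compressed cochleagram of Section 4; the sketch below does so, with the exact window handling around the impact center being our assumption.

```python
import numpy as np

def loudness(cochleagram):
    """Maximum energy (L2 norm) of the compressed subband envelopes
    over all timesteps; cochleagram has shape (n_bands, T)."""
    return float(np.linalg.norm(cochleagram, axis=0).max())

def spectral_centroid(cochleagram, center, half_width=1):
    """Center of mass of the frequency channels in a small window
    (roughly one frame) around the impact center."""
    window = cochleagram[:, center - half_width:center + half_width + 1].sum(axis=1)
    bands = np.arange(len(window))
    return float((bands * window).sum() / (window.sum() + 1e-8))
```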
Oracle results How helpful is material category information? We conducted a second study that controlled for material-recognition accuracy. Using the subset of the data with material annotations, we created a model that chose a random sound from the same class as the input video. We also created a number of oracle models that used these material labels (Table 6(a)). For the best-performing image-matching model (RGB + spacetime), we restricted the pool of matches to be those with the same label as the input (and similarly for the example-based synthesis method). We also considered a model that matched the ground-truth sound to the training set and transferred the best match. We found that, while knowing the material was helpful for each method, it was not sufficient, as the oracle models did not outperform our model. In particular, our model significantly outperformed the random-sampling oracle (p < 10⁻⁴).

Impact detection We also used our methods to produce sounds for long, uncentered videos, a problem setting that allows us to evaluate their ability to detect impact events. We provide qualitative examples in Figure 8 and on our webpage (vis.csail.mit.edu). To quantitatively evaluate its detection accuracy, we used the parametric synthesis method to produce a waveform, applied a large gain to that waveform, and then detected amplitude peaks (Section 3). We then compared the timing of these peaks to those of the ground truth, considering an impact to be detected if a predicted spike occurred within 0.1 seconds of it. Using the predicted amplitude as a measure of confidence, we computed average precision. We compared our model to an RGB-only model, finding that the spacetime images significantly improve the result, with APs of 43.6% and 21.6% respectively. Both models were pretrained with ImageNet.
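The detection evaluation can be sketched as follows: predicted peaks are matched to ground-truth impacts within ±0.1 s, with the predicted amplitude used as the confidence score for average precision. The greedy one-to-one matching rule is our reading of the protocol, which the paper does not spell out.

```python
import numpy as np

def detection_average_precision(pred_times, pred_amps, gt_times, tol=0.1):
    """A predicted peak counts as a true positive if it lies within `tol`
    seconds of an unmatched ground-truth impact; AP is computed from the
    amplitude-ranked list of predictions."""
    pred_times = np.asarray(pred_times, dtype=float)
    gt_times = np.asarray(gt_times, dtype=float)
    order = np.argsort(-np.asarray(pred_amps, dtype=float))   # most confident first
    matched = np.zeros(len(gt_times), dtype=bool)
    is_tp = np.zeros(len(order), dtype=bool)
    for rank, i in enumerate(order):
        diffs = np.abs(gt_times - pred_times[i])
        diffs[matched] = np.inf                                # each impact used once
        if len(diffs) and diffs.min() <= tol:
            matched[np.argmin(diffs)] = True
            is_tp[rank] = True
    precision = np.cumsum(is_tp) / np.arange(1, len(is_tp) + 1)
    return float(precision[is_tp].sum() / max(len(gt_times), 1))
```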
6.3. Learning about material and action by predicting sounds

By learning to predict sounds, did the network also learn something about material and physical interactions? To assess this, we tested whether the network's output sounds were informative about material and action class. We applied the same SVM that was trained to predict material/action class on real sound features (Sec. 4) to the sounds predicted by the model. Under this evaluation regime, it is not enough for the network's sounds to merely be distinguishable by class: they must be close enough to real sounds so as to be classified correctly by an SVM that has never seen a predicted sound. To avoid the influence of pretraining, we used a network that was trained from scratch. We note that this evaluation method is different from that of recent unsupervised learning models [11, 1, 47] that train a classifier on the network's feature activations, rather than on a ground-truth version of the output.

Using this idea, we classified the material category from predicted sound features. The classifier had class-averaged accuracy of 22.7%, and its confusion matrix is shown in Fig. 5(b). This accuracy indicates that our model learned an output representation that was informative about material, even though it was only trained to predict sound. We applied a similar methodology to classify action categories from predicted sounds, obtaining 68.6% class-averaged accuracy (chance = 50%), and 53.5% for classifying reaction categories (chance = 20%). We found that material and reaction recognition accuracy improved with ImageNet pretraining (to 28.8% and to 55.2%, respectively), but that there was a slight decrease for action classification (to 66.5%).

We also tested whether the predicted sound features convey information about the hardness of a surface. We grouped the material classes into superordinate hard and soft classes, and trained a classifier on real sound features (sampling 1300 examples per class), finding that it obtained 66.8% class-averaged accuracy (chance = 50%).
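This evaluation can be reproduced schematically with scikit-learn: fit a linear SVM on real sound features only and score it on features predicted from video, using class-averaged (balanced) accuracy. LinearSVC and its parameters are stand-ins; the paper says only that a linear SVM was used.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

def eval_predicted_sounds(real_feats, real_labels, pred_feats, pred_labels):
    """Train on real cochleagram features; the classifier never sees a
    predicted sound. Class-averaged accuracy = mean of per-class accuracies."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(real_feats, real_labels)
    return balanced_accuracy_score(pred_labels, clf.predict(pred_feats))
```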
[Figure 8 panels: frames from input videos alongside real and synthesized cochleagrams (frequency subband vs. time in seconds).]
Figure 8: Automatic sound prediction results. We show cochleagrams for a representative selection of video sequences, with a sample
frame from each sequence on the left. The frame is sampled from the location indicated by the black triangle on the x-axis of each
cochleagram. Notice that the algorithm’s synthesized cochleagrams match the general structure of the ground truth cochleagrams. Dark
lines in the cochleagrams indicate hits, which the algorithm often detects. The algorithm captures aspects of both the temporal and spectral
structure of sounds. It correctly predicts staccato taps in the rock example and longer waveforms for rustling ivy. Furthermore, it tends to
predict lower pitched thuds for a soft couch and higher pitched clicks when the drumstick hits a hard wooden railing (although the spectral
differences may appear small in these visualizations, we evaluate this with objective metrics in Section 6). A common failure mode is
that the algorithm misses a hit (railing example) or hallucinates false hits (cushion example). This frequently happens when the drumstick
moves erratically. Please see our video for qualitative results.
Here we have defined soft materials to be {leaf, grass, cloth, plastic bag, carpet} and hard materials to be {gravel, rock, tile, wood, ceramic, plastic, drywall, glass, metal}.

We also considered the problem of directly predicting material class from visual features. In Table 6(b), we trained a classifier using fc7 features – both those of the model trained from scratch, and of a model trained on ImageNet [28]. We concatenated color and spacetime image features, since we found that this improved performance. We also considered an oracle model that cropped a high-resolution (256 × 256) patch from the impact location using human annotations, and concatenated its features with those of the full image (we used color images). To avoid occlusions from the arm or drumstick, we cropped the patch from the final frame of the video. We found that performing these crops significantly increased the accuracy, suggesting that localizing the impact is important for classification. We also tried concatenating vision and sound features (similar to [2]), finding that this significantly improved the accuracy.

The kinds of mistakes that the visual classifier (video → material) made were often different from those of the sound classifier (sound → material). For instance, the visual classifier was able to distinguish classes that have a very different appearance, such as paper and cloth. These classes both make low-pitched sounds (e.g., cardboard and cushions), and were sometimes confused by the sound classifier. On the other hand, the visual classifier was more likely to confuse materials from outdoor scenes, such as rocks and leaves – materials that sound very different but which frequently co-occur in a scene. When we analyzed our model by classifying its sound predictions (video → sound → material), the resulting confusion matrix (Fig. 5(b)) contains both kinds of error: there are visual analysis errors when it misidentifies the material that was struck, and sound synthesis errors when it produces a sound that was not a convincing replica of the real sound.

7. Discussion

In this work, we proposed the problem of synthesizing visually indicated sounds – a problem that requires an algorithm to learn about material properties and physical interactions. We introduced a dataset for studying this task, which contains videos of a person probing materials in the world with a drumstick, and an algorithm based on recurrent neural networks. We evaluated the quality of our approach with psychophysical experiments and automated metrics, showing that the performance of our algorithm was significantly better than baselines.

We see our work as opening two possible directions for future research. The first is producing realistic sounds from videos, treating sound production as an end in itself. The second direction is to use sound and material interactions as steps toward physical scene understanding.

Acknowledgments. This work was supported by NSF grants 6924450 and 6926677, by Shell, and by a Microsoft Ph.D. Fellowship to A.O. We thank Carl Vondrick and Rui Li for the helpful discussions, and the workers at Middlesex Fells, Arnold Arboretum, and Mt. Auburn Cemetery for not asking too many questions while we were collecting the Greatest Hits dataset.
References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
[2] A. Arnab, M. Sapienza, S. Golodetz, J. Valentin, O. Miksik, S. Izadi, and P. H. S. Torr. Joint object-material category segmentation from audio-visual cues. In BMVC, 2015.
[3] R. Baillargeon. The acquisition of physical knowledge in infancy: A summary in eight lessons. Blackwell handbook of childhood cognitive development, 1:46–83, 2002.
[4] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
[5] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. Speech and Audio Processing, IEEE Transactions on, 13(5):1035–1047, 2005.
[6] T. Bonebright. Were those coconuts or horse hoofs? Visual context effects on identification and perceived veracity of everyday sounds. In International Conference on Auditory Display, 2012.
[7] S. Cavaco and M. S. Lewicki. Statistical modeling of intrinsic structures in impacts sounds. The Journal of the Acoustical Society of America, 121(6):3558–3568, 2007.
[8] A. Davis, K. L. Bouman, M. Rubinstein, F. Durand, and W. T. Freeman. Visual vibrometry: Estimating material properties from small motion in video. In CVPR, 2015.
[9] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, and W. T. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (TOG), 2014.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[12] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[13] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.
[14] W. W. Gaver. What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology, 1993.
[15] M. Gemici and A. Saxena. Learning haptic representation for manipulating deformable food objects. In IROS, 2014.
[16] B. R. Glasberg and B. C. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing research, 47(1):103–138, 1990.
[17] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518, 2015.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[19] P.-J. Hsieh, J. T. Colas, and N. Kanwisher. Spatial pattern of BOLD fMRI activation reveals cross-modal information in auditory cortex. Journal of neurophysiology, 2012.
[20] Y. Hu and P. C. Loizou. Speech enhancement based on wavelet thresholding the multitaper spectrum. Speech and Audio Processing, IEEE Transactions on, 12(1):59–67, 2004.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[22] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson. Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811, 2015.
[23] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, December 2015.
[24] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 2013.
[25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[26] M. Kac. Can one hear the shape of a drum? The American Mathematical Monthly, 1966.
[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[29] E. Krotkov. Robotic perception of material. In IJCAI, 1995.
[30] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 2015.
[31] R. A. Lutfi. Human sound source identification. In Auditory perception of sound sources, pages 13–42. Springer, 2008.
[32] J. H. McDermott and E. P. Simoncelli. Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71(5):926–940, 2011.
[33] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.
[34] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
[35] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. arXiv preprint arXiv:1509.06825, 2015.
[36] L. Schulz. The origins of inquiry: Inductive inference and exploration in early childhood. Trends in cognitive sciences, 16(7):382–389, 2012.
[37] A. A. Shabana. Theory of vibration: An introduction. Springer Science & Business Media, 1995.
[38] L. Sharan, C. Liu, R. Rosenholtz, and E. H. Adelson. Recognizing materials using perceptually inspired features. International journal of computer vision, 103(3):348–371, 2013.
[39] M. H. Siegel, R. Magid, J. B. Tenenbaum, and L. E. Schulz. Black boxes: Hypothesis testing via indirect perceptual evidence. Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014.
[40] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
[41] J. Sinapov, M. Wiemer, and A. Stoytchev. Interactive learning of the acoustic properties of household objects. In ICRA, 2009.
[42] M. Slaney. Pattern playback in the 90s. In NIPS, pages 827–834, 1994.
[43] L. Smith and M. Gasser. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005.
[44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[45] K. Van Den Doel, P. G. Kry, and D. K. Pai. FoleyAutomatic: Physically-based sound effects for interactive simulation and animation. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 537–544. ACM, 2001.
[46] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
[47] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
Figure A1: (a) Class-averaged accuracy for recognizing materials, with an SVM trained on real sounds. We varied the number of band-pass filters and adjusted their frequency spacing accordingly (we did not vary the temporal sampling rate). (b) Confusion matrix obtained by classifying the sounds predicted by our pretrained model, using a classifier trained on real sound features (c.f. the same model without pretraining in Figure 5(b)).

https://www.youtube.com/watch?v=ZUIEOUoCLBo

A1.1. Detection model

We describe the variation of our model that performs the detection task (Section 6.1) in more detail.

Timing We allow the sound features to undergo small time shifts in order to account for misalignments for the detection task. During each iteration of backpropagation, we shift the sequence so as to minimize the loss in Equation 4. We resample the feature predictions to create a new sequence ŝ_1, ŝ_2, ..., ŝ_T such that ŝ_t = s_{t+L_t} for some small shift L_t (we use a maximum shift of 8 samples, approximately 0.09 seconds). During each iteration, we infer this shift by finding the optimal labeling of a Hidden Markov Model:

∑_{t=1}^{T} w_t ρ(‖ŝ_t − s̃_t‖) + V(L_t, L_{t+1}),   (5)

where V is a smoothness term for neighboring shifts. For this, we use a Potts model weighted by ½(‖s̃_t‖ + ‖s̃_{t+1}‖) to discourage the model from shifting the sound near high-amplitude regions. We also include a weight variable w_t = 1 + αδ(τ ≤ ‖s̃_t‖) to decrease the importance of silent portions of the video (we use α = 3 and τ = 2.2). During each iteration of backpropagation, we align the two sequences, then propagate the gradients of the loss to the shifted sequence.

To give the RNN more temporal context for its predictions, we also delay its predictions, so that at frame f, it predicts the sound features for frame f − 2.
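The shift inference in Eq. (5) amounts to a standard Viterbi pass over per-timestep shift labels. The sketch below uses the robust ρ from Eq. (4) and simplifies the amplitude-dependent weights w_t and the weighted Potts term to a constant penalty, so it illustrates the mechanics rather than the exact objective.

```python
import numpy as np

def infer_shifts(pred, target, max_shift=8, smooth=1.0, eps=(1.0 / 25) ** 2):
    """Viterbi labeling of per-timestep shifts L_t (Eq. 5). pred/target have
    shape (T, D); returns an integer shift in [-max_shift, max_shift] per t.
    The constant Potts weight `smooth` stands in for the amplitude-weighted term."""
    T = len(pred)
    shifts = np.arange(-max_shift, max_shift + 1)
    # Unary cost: rho(||s_hat_t - s~_t||) with s_hat_t = s_{t + L_t}
    unary = np.empty((T, len(shifts)))
    for k, L in enumerate(shifts):
        idx = np.clip(np.arange(T) + L, 0, T - 1)
        unary[:, k] = np.log(eps + np.linalg.norm(pred[idx] - target, axis=1) ** 2)
    # Viterbi recursion with a Potts pairwise term V(L_t, L_{t+1})
    cost = unary[0].copy()
    back = np.zeros((T, len(shifts)), dtype=int)
    for t in range(1, T):
        trans = cost[None, :] + smooth * (shifts[:, None] != shifts[None, :])
        back[t] = np.argmin(trans, axis=1)
        cost = unary[t] + np.min(trans, axis=1)
    labels = np.zeros(T, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return shifts[labels]
```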
Transforming features for neighbor search For the detection task, the statistics of the synthesized sound features can differ significantly from those of the ground truth – for example, we found the amplitude of peaks in the predicted waveforms to be smaller than those of real sounds. We correct for these differences during example-based synthesis (Section 5.2) by applying a coloring transformation before the nearest-neighbor search. More specifically, we obtain a whitening transformation for the predicted sound features by running the neural network on the test videos and estimating the empirical mean and covariance at the detected amplitude peaks, discarding peaks whose amplitude is below a threshold. We then estimate a similar transformation for ground-truth amplitude peaks in the training set, and we use these transformations to color (i.e., transform the mean and covariance of) the predicted features into the space of real features before computing their L1 nearest neighbors. To avoid the influence of multiple, overlapping impacts on the nearest neighbor search, we use a search window that starts at the beginning of the amplitude spike.
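The coloring step can be sketched as a standard whiten-then-recolor transform: whiten predicted-feature peaks with their own empirical statistics, then map them into the mean and covariance of real peaks. Using eigendecomposition-based matrix square roots is an implementation choice of this sketch, not something the paper prescribes.

```python
import numpy as np

def coloring_transform(pred_peaks, real_peaks, reg=1e-6):
    """Return a function that maps predicted sound features into the space
    of real ones by matching first- and second-order statistics."""
    def sqrt_cov(X):
        cov = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    mu_p, mu_r = pred_peaks.mean(axis=0), real_peaks.mean(axis=0)
    W = np.linalg.inv(sqrt_cov(pred_peaks))      # whitening for predicted peaks
    C = sqrt_cov(real_peaks)                     # coloring toward real statistics
    def apply(x):                                # x: (..., D) predicted features
        return (x - mu_p) @ W.T @ C.T + mu_r
    return apply
```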
Evaluating the RNN for long videos When evaluating our model on long videos, we run the RNN on 10-second subsequences that overlap by 30%, transitioning between consecutive predictions at the time that has the least sum-of-squares difference between the overlapping predictions.

A1.2. Sound representation

We measured performance on the task of assigning material labels to ground-truth sounds after varying the number of frequency channels in the subband envelope representation. The result is shown in Figure A1. To obtain the ordering of material classes used in visualizations of the confusion matrices (Figure 3), we iteratively chose the material category that was most similar to the previously chosen class. When measuring the similarity between two classes, we computed Euclidean distance between rows of a (soft) confusion matrix – one whose rows correspond to the mean probability assigned by the classifier to each target class (averaged over all test examples).

A1.3. Network structure

We used AlexNet [28] for our CNN architecture. For the pretrained models, we precomputed the pool5 features and fine-tuned the model's two fully-connected layers. For the model that was trained from scratch, we applied batch normalization [21] to each training mini-batch. For the centered videos, we used two LSTM layers with a 256-dimensional hidden state (and three for the detection model). When using multiple LSTM layers, we compensate for the difference in video and audio sampling rates by upsampling the input to the last LSTM layer (rather than upsampling the CNN features), replicating each input k times (where again k = 3).
Figure A2: A “walk” through the dataset using AlexNet fc7 nearest-neighbor matches. Starting from the left, we matched an image with
the database and placed its best match to its right. We repeat this 5 times, with 20 random initializations. We used only images taken at a
contact point (the middle frames from the “centered” videos). To avoid loops, we removed videos when any of their images were matched.
The location of the hit, material, and action often vary during the walk. In some sequences, the arm is the dominant feature that is matched
between scenes.
[…] with leaf (confused 5% of the time); grass with dirt and leaf (8% each); and cloth with (the fine-grained category) cushion (9% of the time).