

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

Kim Sung-Bin1   Arda Senocak2   Hyunwoo Ha1   Andrew Owens3   Tae-Hyun Oh1,4,5
1 Dept. of Electrical Engineering and 4 Grad. School of Artificial Intelligence, POSTECH
2 Dept. of Electrical Engineering, KAIST    3 University of Michigan
5 Institute for Convergence Research and Education in Advanced Technology, Yonsei University
https://sound2scene.github.io/

Figure 1. Sound-to-image generation. We propose a model that synthesizes images of natural scenes from sound. Our model is trained solely from paired audio-visual data, without labels or language supervision. Our model's predictions can be controlled by applying simple manipulations to the input waveforms (left), such as by mixing two sounds together or by adjusting the volume. We can also control our model's outputs in latent space, such as by interpolating in directions specified by sound (right).

Abstract

How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges posed by the large gaps that often exist between sight and sound. We design a model that schedules the learning procedure of each component so as to associate the audio and visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to the visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space.

1. Introduction

Humans have the remarkable ability to associate sounds with visual scenes, such as how chirping birds and rustling branches bring to mind a lush forest, and flowing water conjures the image of a river. These cross-modal associations convey important information, such as the distance and size of sound sources, and the presence of out-of-sight objects.

An emerging line of work has sought to create multi-modal learning systems with these cross-modal prediction capabilities by synthesizing visual imagery from sound [15, 20, 26, 36, 37, 63, 69]. However, these existing methods come with significant limitations, such as being limited to simple datasets in which images and sounds are closely correlated [63, 69], relying on vision-and-language supervision [36], and being capable only of manipulating the style of existing images [37] rather than synthesizing new ones.

Addressing these limitations requires handling several challenges. First, there is a significant modality gap between sight and sound, as sound often lacks information that is important for image synthesis, e.g., the shape, color, or spatial location of on-screen objects. Second, the correlation between modalities is often incongruent, e.g., highly contingent or out of sync in time. Cows, for example, only rarely moo, so associating images of cows with "moo" sounds requires capturing training examples with the rare moments when on-screen cows vocalize.

In this work, we propose Sound2Scene, a sound-to-image generative model and training procedure that addresses these limitations and can be trained solely from unlabeled videos. First, given an image encoder pre-trained in a self-supervised way, we train a conditional generative adversarial network [11] to generate images from the visual features of the image encoder. We then train an audio encoder to translate an input sound to its corresponding visual feature by aligning the audio to the visual space. Afterwards, we can generate diverse images from sound by translating from audio to visual embeddings and synthesizing an image. Since our model must be capable of learning from challenging in-the-wild videos, we use sound source localization to select moments in time that have strong cross-modal associations.

We evaluate our model on VEGAS [73] and VGGSound [14], as shown in Fig. 1. Our model can synthesize a wide variety of scenes from sound in high quality, outperforming prior work. It also provides an intuitive way to control the image generation process by applying manipulations at both the input and latent-space levels, such as by mixing multiple audios together or adjusting the loudness.

Our main contributions are summarized as follows:
• We propose a new sound-to-image generation method that can generate visually rich images from in-the-wild audio in a self-supervised way.
• We generate high-quality images from unrestricted, diverse categories of input sounds for the first time.
• We demonstrate that the samples generated by our model can be controlled by intuitive manipulations in the waveform space as well as in the latent space.
• We show the effectiveness of training sound-to-image generation with highly correlated audio-visual pairs.

2. Related Work

Audio-visual generation. The audio-visual cross-modal generation field has been explored in two directions: vision-to-sound and sound-to-vision generation. The vision-to-sound task has been actively researched for instruments/music [15, 26, 65] and for open-domain generic audio generation [16, 33, 46, 73]. In the opposite direction, early work on sound-to-image investigated only restricted and specialized audio domains, such as instruments [12, 15, 26, 42], birds [63], or speech [43]. Later, Wan et al. [69] and Fanzeres et al. [20] attempted to alleviate the restrictions on the data domain and generate images by conditioning on sounds from nine categories of SoundNet [8] and five categories of VEGAS [73], respectively. Although we share the goal of generating images from unrestricted sounds, our approach handles much more diverse audio-visual generation problems. For example, it is capable of generating images from sounds that come from a wide variety of categories in the VGGSound [14] and VEGAS datasets. Unlike the low-quality results of the previous methods, our model generates visually plausible images that are related to the given audio.

Audio-driven image manipulation. Instead of directly generating images from audio, a recent line of work has proposed to edit existing images using sound-based input. Lee et al. [36] used a text-based image manipulation model [51] and extended its embedding space to the audio-visual modality together with the text modality. Similarly, Li et al. [37] used conditional generative adversarial networks (GANs) [24] to edit the visual style of an image to match a sound, and showed that the manipulations could be controlled by adjusting a sound's volume or by mixing together multiple sounds. Our work differs from these in two ways. First, our model is capable of generating images conditioned on sound, rather than only editing them. Second, unlike Lee et al., we do not require a text-based visual-language embedding space. Instead, our model is trained entirely on unlabeled audio-visual pairs.

Cross-modal generation. Learning to translate one modality into another, i.e., cross-modal generation, is an interesting yet open research problem. Various tasks have been tackled in diverse domains, such as text-to-image/video [19, 30, 51, 55, 56, 64, 68], speech-to-face/gesture [23, 43], scene graph/layout-to-image [34, 72], and image/audio-to-caption [4, 35, 39]. To bridge the heterogeneous modalities in cross-modal generation, several works [43, 53] leverage existing pre-trained models or extend the pre-trained CLIP [54] embedding space, anchored to the text-visual modality, to suit their purpose [36, 51, 55, 71]. In this spirit, we tackle the task of generating images from sound by leveraging freely acquired audio-visual signals from video.

Audio-visual learning. The natural co-occurrence of audio-visual cues is often leveraged as a self-supervision signal to learn the associations between the two modalities and to help each modality learn better representations. Representations learned in this way are exploited for diverse applications, including cross-modal retrieval [7, 48], video recognition [17, 41], and sound source localization [13, 50, 60-62]. One line of work for constructing an audio-visual embedding space jointly trains two different neural networks, one per modality, by judging whether a frame and an audio clip correspond to each other [6, 45]. Recent works use clustering [5, 31] or contrastive learning [3, 40, 41] to better learn the joint audio-visual embedding space.

Figure 2. Sound2Scene framework. The frame selection method selects the highly correlated frame-audio segment from a video for training.
Then, we train Sound2Scene to produce an audio feature that aligns with the visual feature extracted from the pre-trained image encoder. In
the inference stage, the extracted audio feature from input audio is fed to the image generator to produce an image.

While the above-mentioned approaches learn the audio-visual representations jointly from scratch, another line of work creates a joint embedding space by exploiting existing knowledge from expert models. The knowledge can be transferred from audio to visual representations [47], from visual representations to audio [8, 21], or distilled from both audio and visual representations into a video representation [17]. Our work takes the latter direction, assuming that a visual expert model exists. We use the image feature extractor to distill rich visual information from large-scale internet videos into the audio modality.

3. Method

The goal of our work is to learn to translate sounds into visual scenes. Most existing methods [15, 20, 26, 69] train GANs to directly generate images from the raw sound or sound features. However, the aforementioned challenges and the large variability of visual scenes make directly predicting images from sound difficult.

In contrast to prior approaches, we sidestep these challenges by breaking the task down into sub-problems. Our proposed Sound2Scene pipeline is illustrated in Fig. 2. It is composed of three parts: an audio encoder, an image encoder, and an image generator. First, we pre-train a powerful image encoder and a generator conditioned on the encoder's features, separately, using only a large image dataset. Since there is a natural correspondence between sound and visual information, we exploit this alignment and transfer the discriminative and expressive visual information from the image encoder into the audio representation. In this way, we construct a joint audio-visual embedding space that is trained in a self-supervised manner using only in-the-wild videos. Later, the audio representation from this aligned embedding space is fed into the image generator to produce images corresponding to the input sound.

3.1. Learning the Sound2Scene Model

Using the audio-visual data pairs D = {V_i, A_i}_{i=1}^N, where V_i is a video frame and A_i is audio, our objective is to learn an audio encoder that extracts informative audio features z_A aligned with anchored visual features z_V. Specifically, given the unlabeled data pairs D, the audio encoder f_A(·), and the image encoder f_V(·), we extract audio features z_A = f_A(A) and visual features z_V = f_V(V), where z_V, z_A ∈ R^2048. Since we exploit a well pre-trained image encoder f_V(·), the visual feature z_V serves as the self-supervision signal for the audio encoder to predict the informative feature z_A, in the manner of feature-based knowledge distillation [25, 29]. These aligned features across modalities construct the shared audio-visual embedding space on which the image generator G(·) is separately trained to be compatible.

To align the embedding spaces defined by the heterogeneous modalities, a metric learning approach can be used: representations are aligned if they are close to each other under some distance metric. A simple way to align the features z_A and z_V is to minimize the L2 distance, ∥z_V − z_A∥_2. However, we find that using only the L2 loss teaches the relationship between the two modalities within a pair without considering the other, unpaired samples. This results in unstable training and leads to poor image quality. Therefore, we use InfoNCE [44], a specific type of contrastive loss that has been successfully applied to audio-visual representation learning [2, 13, 17, 36, 59, 70]:

\texttt{InfoNCE}(\mathbf{a}_j, \{\mathbf{b}_k\}_{k=1}^{N}) = -\log \frac{\exp(-d(\mathbf{a}_j, \mathbf{b}_j))}{\sum_{k=1}^{N} \exp(-d(\mathbf{a}_j, \mathbf{b}_k))},   (1)

where a and b denote arbitrary vectors of the same dimension and d(a, b) = ∥a − b∥_2. With this loss, we maximize the feature similarity between an image and its true audio segment (positive) while minimizing the similarity with randomly selected unrelated audios (negatives).
Given the j-th visual and audio feature pair, we first define an audio feature-centric loss L_j^A = InfoNCE(ẑ_j^A, {ẑ^V}), where ẑ^A and ẑ^V are unit-norm representations. To make the objective symmetric, we also compute a visual feature-centric loss L_j^V = InfoNCE(ẑ_j^V, {ẑ^A}). Our final learning objective is then to minimize the sum of both loss terms over all audio-visual pairs in a mini-batch of size B:

L_{total} = \frac{1}{2B}\sum_{j=1}^{B}\left(L_j^A + L_j^V\right).   (2)

After training the audio encoder with Eq. (2), our model produces visually enriched audio features that are aligned with the visual features. Thus, at inference we can directly feed the learned audio feature z_A, together with a noise vector z_N ~ N(0, I), to the frozen image generator as G(z_N, z_A) to generate a visual scene.

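To make the alignment objective concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(2), assuming one positive pair per batch index, unit-normalized 2048-D features, and negative L2 distance as the similarity, as described above. Function names are ours; this is an illustration under those assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, candidates):
    """Eq. (1): -log softmax of negative L2 distances, where the j-th
    candidate is the positive for the j-th anchor."""
    dists = torch.cdist(anchors, candidates, p=2)   # (B, B) pairwise L2 distances
    logits = -dists                                 # similarity = negative distance
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)         # mean over the batch

def sound2scene_alignment_loss(z_a, z_v):
    """Eq. (2): symmetric audio- and visual-centric InfoNCE on unit-norm features."""
    z_a = F.normalize(z_a, dim=-1)                  # \hat{z}^A
    z_v = F.normalize(z_v, dim=-1)                  # \hat{z}^V
    loss_a = info_nce(z_a, z_v)                     # audio feature-centric term L^A
    loss_v = info_nce(z_v, z_a)                     # visual feature-centric term L^V
    return 0.5 * (loss_a + loss_v)                  # (1/2B) sum over the mini-batch

if __name__ == "__main__":
    # Toy check with random 2048-D features for a batch of 8 pairs.
    z_a, z_v = torch.randn(8, 2048), torch.randn(8, 2048)
    print(sound2scene_alignment_loss(z_a, z_v))
```
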
3.2. Architecture

All of the following modules are trained separately, according to the steps described above.

Image encoder f_V(·). We use ResNet-50 [27]. To cope with general visual content, we train the image encoder in a self-supervised way [10] on ImageNet [18] without labels.

Image generator G(·). We use the BigGAN [9] architecture to handle high-quality generation and a large variability of scene content. To make BigGAN a conditional generator, we follow the modification of the input conditioning structure of ICGAN [11]. We train the generator to produce photo-realistic 128 × 128 resolution images from the conditional visual embeddings z_V obtained from the image encoder. The generator is also trained on ImageNet without labels, in a self-supervised way. While training the image generator, the image encoder is pre-trained and kept fixed.

Audio encoder f_A(·). We use a ResNet-18 that takes the audio spectrogram as input. After the last convolutional layer, adaptive average pooling aggregates the temporal-frequency information into a single vector. The pooled feature is fed into a single linear layer to obtain an audio embedding z_A. The audio network is trained on either VGGSound or VEGAS with the loss in Eq. (2), according to the target benchmark.

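The audio encoder described above can be sketched as follows. This is our own minimal PyTorch illustration; the use of torchvision's ResNet-18, the single-channel first convolution, and the module names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AudioEncoder(nn.Module):
    """ResNet-18 over a log-spectrogram -> 2048-D audio embedding z_A."""
    def __init__(self, embed_dim: int = 2048):
        super().__init__()
        backbone = resnet18(weights=None)
        # Single-channel spectrogram input instead of 3-channel RGB (assumption).
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep everything up to and including the global adaptive average pool.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # Single linear layer projecting the pooled 512-D feature into the
        # 2048-D space of the ResNet-50 image encoder.
        self.proj = nn.Linear(512, embed_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, time, freq) log-spectrogram
        pooled = self.features(spec).flatten(1)   # (B, 512)
        return self.proj(pooled)                  # (B, 2048) == z_A

if __name__ == "__main__":
    # Example: two 10-second log-spectrograms (1004 x 257 bins as in Sec. 4.1).
    z_a = AudioEncoder()(torch.randn(2, 1, 1004, 257))
    print(z_a.shape)  # torch.Size([2, 2048])
```
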
3.3. Audio-Visual Pair Selection Module

Accurately learning the relationship between images and sounds requires highly correlated data pairs from the two modalities, and knowing which frame/segment of a video is informative for audio-visual correspondence is not an easy task. One straightforward way to collect training pairs D is to extract the mid-frame of each video together with the corresponding audio segment [13, 36]. However, the mid-frame is not guaranteed to contain informative, corresponding audio-visual signals [62]. To this end, we leverage a pre-trained sound source localization model [62] and extract highly correlated audio and visual pairs. The backbone networks of [62] give us audio and visual features at fine-grained temporal steps, q_A and q_V, respectively. Correlation scores are computed as C_av[t] = q_V[t] · q_A[t] at each time step t. After computing the correlation scores, C_av is sorted and the top-k(C_av[t]) moments are taken. With this correlated pair selection method, we annotate the top-1 moment frame for each video in the training splits and use it for training. Fig. 3 compares selected frames with mid-frames: even though the frames are selected automatically, they accurately contain the distinctive objects corresponding to the audio.

Figure 3. Examples of the selected top-1 frame vs. the mid-frame (categories shown: Volcano Explosion, Fire Truck Siren, Dog Barking).
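The selection step reduces to a dot product per time step followed by a top-k. A minimal sketch, assuming per-time-step features q_V and q_A of matching dimension (the 512-D size below is an assumption):

```python
import torch

def select_top_frames(q_v: torch.Tensor, q_a: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Rank the time steps of a video by audio-visual correlation.

    q_v: (T, D) per-time-step visual features from the localization backbone
    q_a: (T, D) per-time-step audio features from the localization backbone
    Returns the indices of the k most correlated time steps.
    """
    c_av = (q_v * q_a).sum(dim=-1)          # C_av[t] = q_V[t] . q_A[t], shape (T,)
    return torch.topk(c_av, k=k).indices    # top-k(C_av[t])

if __name__ == "__main__":
    # Pick the single most correlated moment out of 25 time steps.
    q_v, q_a = torch.randn(25, 512), torch.randn(25, 512)
    best_t = select_top_frames(q_v, q_a, k=1)
    print(int(best_t))  # index of the frame/audio segment kept for training
```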

4. Experiments

We validate our proposed sound-to-image generation method with experiments on VGGSound [14] and VEGAS [73]. First, we visualize samples of generated images for diverse categories of sounds. Then, we quantitatively examine the generation quality, diversity, and correspondence between the audio and the generated images. Note that we do not use any class information during training or inference.

4.1. Experiment Setup

Datasets. We train and test our method on VGGSound [14] and VEGAS [73]. VGGSound is an audio-visual dataset containing around 200K videos. We select 50 classes from this dataset and follow the provided train and test splits. VEGAS contains about 2.8K videos with 10 classes. To balance the data statistics, we select 800 videos for training and 50 videos for testing per class. The test splits of both datasets are used for the following qualitative and quantitative analyses.

Evaluation metrics. We use objective and subjective metrics to evaluate our method quantitatively.
• CLIP [54] retrieval: Inspired by the CLIP R-Precision metric [49], we evaluate the generated images by measuring image-to-text retrieval performance with recall at K (R@K). We feed the generated images and the texts formed from the audio category names to CLIP, measure the similarities between the image and text features, and rank the candidate text descriptions for each query image (a sketch of this protocol is given after the implementation details below).
• Fréchet Inception Distance (FID) [28] and Inception Score (IS) [58]: FID measures the Fréchet distance between the features of real and synthesized images obtained from a pre-trained Inception-V3 [66]. The same model is also used to compute the inception score (IS), which is based on the KL-divergence between the conditional class distribution and the marginal class distribution.

• Human evaluations: We recruit 70 participants to analyze the performance of our method from a human-perception perspective. We first compare our method with the image-only model [11] and then evaluate whether our model generates proper images for the input sound. More details can be found in Sec. 4.3.

Implementation details. The input to the audio encoder is a 1004×257-dimensional log-spectrogram converted from 10 seconds of audio. The frame extracted from the video is resized to 224×224 and fed to the image encoder. We train our model on a single GeForce RTX 3090 for 50 epochs with early stopping. We use the Adam optimizer and set the batch size to 64, the learning rate to 10^-3, and the weight decay to 10^-5.

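The CLIP-based retrieval metric described in Sec. 4.1 can be sketched as follows, using the public OpenAI CLIP package. The model variant and the prompt template are assumptions; this is our illustration of the protocol, not the authors' evaluation code.

```python
import clip   # https://github.com/openai/CLIP
import torch

@torch.no_grad()
def clip_recall_at_k(images, class_names, true_idx, k=1, device="cuda"):
    """Recall@K of image-to-text retrieval over the audio category names.

    images:      generated images already preprocessed for CLIP, (B, 3, 224, 224)
    class_names: list of category names (one text candidate per class)
    true_idx:    (B,) ground-truth class index of the conditioning audio
    """
    model, _ = clip.load("ViT-B/32", device=device)      # model variant is an assumption
    text_tokens = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)

    image_feat = model.encode_image(images.to(device)).float()
    text_feat = model.encode_text(text_tokens).float()
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    sims = image_feat @ text_feat.t()                    # (B, num_classes)
    topk = sims.topk(k, dim=-1).indices                  # ranked text candidates
    hits = (topk == true_idx.to(device).unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()                    # R@K
```
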
ation model identical to ICGAN [11] (A). Our model (B)
4.2. Qualitative Results

Sound2Scene generates visually plausible images compatible with a single input waveform, as shown in Figs. 1 and 4. It is not limited to a handful of categories but handles diverse ones, from animals and vehicles to sceneries. We highlight that the proposed model can even distinguish subtle differences between similar sound categories, such as "engine" (Fig. 4, col. 1-2) and "water" (Fig. 4, col. 3) related sounds, and produces accurate and distinct images. See the supplement for more results.

Figure 4. Qualitative results obtained by feeding a single waveform from the VGGSound test set. Sound2Scene generates diverse images for a wide variety of categories of generic input sounds.

4.3. Quantitative Analysis

Comparison with other methods. We compare our model with the prior works whose code is publicly available, S2I [20]2 and Pedersoli et al. [52]3. Note that Pedersoli et al. do not target sound-to-image generation but use a VQVAE-based model [67] for sound-to-depth or sound-to-segmentation generation. Though our model can handle more diverse categories of in-the-wild audio, for a fair comparison we follow the training setup of S2I and train both our model and Pedersoli et al. with the five categories of VEGAS. As shown in Fig. 5, our model outperforms all the other methods. Additionally, it generates visually plausible images, while the previous methods fail to generate recognizable images. We postulate that learning visually enriched audio embeddings, combined with a powerful image generator, leads to the superior results.4

2 https://github.com/leofanzeres/s2i
3 https://github.com/ubc-vision/audio_manifold
4 More qualitative comparisons can be found in the supplementary material.

Figure 5. Comparison to the baseline [52] and the existing sound-to-image method [20]. Our method outperforms the others both qualitatively and quantitatively on the VEGAS dataset.

                             VEGAS (5 classes)
Method                       R@1      FID (↓)    IS (↑)
(A) Pedersoli et al. [52]    23.10    118.68     1.19
(B) S2I [20]                 39.19    114.84     1.45
(C) Ours                     77.58    34.68      4.01

Comparison with strong baselines. We further compare our proposed method with closely related baselines in Table 1 (a). First, we compare with an image-to-image generation model identical to ICGAN [11] (A). Our model (B) shares the same image generator as (A) but differs in the encoder type and the input modality. As shown, (B) outperforms or gives comparable results to (A) on all metrics. We presume that the noisy nature of the video datasets causes (A) to fail to extract a good visual feature for image generation, while audio is relatively robust and less sensitive to these limitations, resulting in plausible images. We also compare our model (B) with a retrieval system (C) that can be regarded as a strong baseline. Given an input audio embedding, the retrieval system finds the closest image in a database. (C) shares the same audio encoder as (B), but the image generator G is replaced with the same memory-sized database of images from the training data. (D) is an upper bound in which the extracted video frames are directly used for the evaluations. The performance gap between (C) and (D) is dramatically lower than that between (B) and (D), which indicates that our audio encoder properly maps the input audio into the joint embedding space. (C) outperforms (B) on R@1 for both datasets, while (B) performs comparably to (C) on R@5. Though the image generator leaves room for improvement, these results show that our method reaches the proximity of this strong baseline.

Table 1. Quantitative evaluations. (a) We compare our method with different baselines (different settings for the encoder and the generator) on CLIP retrieval (R@K), FID, and IS. (b) For the user study, we first compare our method with ICGAN by measuring the recall probability between images generated by ICGAN and by our method from the same audio-visual pair; second, we validate our method's output for the given audio. Abbr. V: image encoder, A: audio encoder, G: image generator, R: retrieval system.

(a) Comparison to baselines
                                                   VGGSound (50 classes)               VEGAS
Method            Encoder (V/A)  Generator (G/R)   R@1     R@5     FID (↓)   IS (↑)    R@1     R@5
(A) ICGAN [11]    V              G                 30.06   62.59   16.11     12.61     46.60   82.48
(B) Ours          A              G                 40.71   77.36   17.97     19.46     57.44   84.08
(C) Retrieval     A              R                 51.28   80.37   -         -         67.20   85.00
(D) Upper bound   -              -                 57.82   85.79   -         -         73.60   88.20

(b) User study (results discussed in Sec. 4.3).

Table 2. Ablation studies of our proposed method. We compare different configurations by changing the loss function, the frame selection method (F), and the duration of the input audio. Results are on VGGSound (50 classes).

      Loss      F    Duration    R@1     R@5     FID (↓)   IS (↑)
(A)   L2        ✓    10 sec.     18.21   46.69   24.05     9.97
(B)   Lnce      ✓    10 sec.     31.63   66.04   27.05     12.92
(C)   Ltotal         10 sec.     37.20   73.13   21.20     17.51
(D)   Ltotal    ✓    1 sec.      35.85   72.02   19.05     17.87
(E)   Ltotal    ✓    5 sec.      38.24   75.76   20.43     18.81
(F)   Ltotal    ✓    10 sec.     40.71   77.36   17.97     19.46

User study. We summarize the user study in Table 1 (b) with two experiments: (i) comparison to ICGAN and (ii) validation that proper images are generated for the given audio. Each experiment has 20 questions. In (i), the participants are given an audio clip and five images. Among the five, two are generated by our method and ICGAN, respectively, and the rest are randomly generated from either method. Participants choose all the images that illustrate the given sound, and we compare the recall probability of ICGAN and of our method. In (ii), the participants are given an audio clip and four images. All four images are generated by our method, but only one is from the given sound. Participants choose the single image that best illustrates the given sound. As in (i), our model is preferred. Moreover, (ii) shows that the precision of our method is 83.8%, which supports that our model generates images highly correlated with the given sounds.

Ablation studies. We conduct a series of experiments to verify our design choices in Table 2. We compare the performance of different distillation losses: a simple L2 loss between the image and audio features, and an InfoNCE loss [44] with a cosine similarity measurement, Lnce, rather than the L2 distance used in Eq. (2). As the results of (A), (B), and (F) reveal, our loss choice (F) leads to more diverse and higher-quality results. We also observe that the frame selection method, discussed in Sec. 3.3, brings an extra performance improvement; see the difference between (C) and (F). Finally, we test the effect of audio duration. We train models with 1, 5, and 10 seconds of sound with all other experimental settings fixed. Comparing (D), (E), and (F), we observe that feeding longer sounds consistently improves performance. We presume that longer audio captures more descriptive audio semantics, while shorter clips are vulnerable to missing them.

5. Controllability of Sound2Scene

Our model learns the natural correspondence between audio and visual signals via the aligned audio-visual embedding space. Thus, it is natural to ask whether manipulations of the input result in corresponding changes in the generated images. We observe that, even without an explicit objective, our model allows controllable outputs through simple manipulations of the inputs in the waveform space or in the learned latent space. This opens up the interesting experiments we explore below.

5.1. Waveform Manipulation for Image Generation

Changing the volume. Humans can roughly predict the distance or the size of an instance from the volume of its sound. To check whether our model also understands volume differences, we reduce and increase the volume of a reference audio. Each audio with a different volume is fed into our model with the same noise vector. As shown in Fig. 6, the instances in the synthesized images get larger as the volume increases. Interestingly, the volume changes for "Water Flowing" illustrate different flows of water, while "Rail Transport" shows a train approaching in the scene. These results highlight that our model captures not only class-specific information but also the relation between the volume of the audio and visual changes. We assume that the supervision from the visual modality enables our model to capture such strong and expressive audio-visual relationships.

Figure 6. Generated images obtained by changing the volume of the input audio in the waveform space (examples: Volcano Explosion, Water Flowing, Airplane Flyby, Rail Transport). As the volume increases, the objects of the sound source become larger or more dynamic.

Mixing waveforms. We investigate whether our model can capture the existence of multiple sounds in the generated images. To this end, we mix two waveforms into a single one and feed it to our model. As shown in Fig. 1 and Fig. 7, our model synthesizes images that reflect both audio semantics. For example, a railroad or a bird pops up across the snowy scene when mixing with the "Skiing" sound, and the train and the bird appear in a misty scene when mixing with the "Hail" sound. Also, as in Fig. 1, mixing the "Dog Barking" and "Water Flowing" sounds yields a scene with a dog playing in the water. Sensing multiple separate sounds in a single mixed audio input [32], i.e., audio source separation [22], and generating their visual appearance in the proper context is not trivial. However, our results show that the proposed model can handle this to a certain extent.

Figure 7. Generated images obtained by mixing two different audios in the waveform space (examples: Train Whistling or Bird Singing mixed with Skiing or Hail).

Mixing waveforms and changing the volume. Here, we
As shown in Fig. 9, this simple approach can produce an
manipulate the input waveform by combining the multiple
image by putting the sound context into the scene, such as
waveforms and changing their volumes at the same time. In
a marching sound bringing parade-looking people or a loud
Fig. 8, we mix the “Wind” sound with each of the “Bird” and
elk sound making an elk appears in a snowy scene.
“Dog” sound with volume changes. As the “Wind” sound
We further use this approach to compare our method with
gets larger while the “Bird” sound decreases, the bird gets
the recent sound-guided image manipulation approach5 [36]
smaller and is finally covered with the bushes. In the same
in Fig. 10. Note that this task is not targeted explicitly by
experiment setting, a close-up shot of the dog indoors starts
our model but appears as a natural outcome of our design.
zooming out and gets a wide shot in the outdoor environment.
These results show that our model can capture subtle changes 5 https: / / github . com / kuai - lab / sound - guided -
in the audio and reflect them to generate images. semantic-image-manipulation
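The waveform-space manipulations above amount to simple amplitude scaling and summation before the mixture is converted to a spectrogram. A minimal sketch; the sample rate, gain values, and clamping to [-1, 1] are our assumptions:

```python
import torch

def change_volume(wav: torch.Tensor, gain: float) -> torch.Tensor:
    """Scale a waveform's amplitude; gain > 1 makes the source louder."""
    return (wav * gain).clamp(-1.0, 1.0)

def mix_waveforms(wav_a: torch.Tensor, wav_b: torch.Tensor,
                  gain_a: float = 1.0, gain_b: float = 1.0) -> torch.Tensor:
    """Sum two equal-length waveforms, optionally re-weighting each source."""
    mixed = gain_a * wav_a + gain_b * wav_b
    return mixed.clamp(-1.0, 1.0)

if __name__ == "__main__":
    sr = 16000                             # sample rate is an assumption
    bird = torch.randn(10 * sr) * 0.1      # stand-ins for real 10-second clips
    wind = torch.randn(10 * sr) * 0.1
    # Emphasize the "wind" source and attenuate the "bird" source, then
    # convert the mixture to a log-spectrogram and feed it to the model.
    mixture = mix_waveforms(bird, wind, gain_a=0.3, gain_b=1.5)
    print(mixture.shape)
```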

5.2. Latent Manipulation for Image Generation

Since we construct an aligned audio-visual embedding space, our model can also take an image and an audio clip together as input and generate images. We introduce two different approaches (Fig. 1) for audio-visual conditioned image generation, both of which operate on the features of the inputs in the latent space.

Image and audio conditioned image generation. Given an image and an audio clip, we extract a visual feature z_V and an audio feature z_A. We then interpolate the two features in the latent space to obtain a novel feature z_new = λ z_V + (1 − λ) z_A, where λ differs across the examples. This feature is fed to the image generator to generate an image. As shown in Fig. 9, this simple approach can produce an image that places the sound context into the scene, such as a marching sound bringing in parade-looking people or a loud elk sound making an elk appear in a snowy scene.

Figure 9. Generated images conditioned on image and audio (examples: Elk Bugling, Marching). We interpolate between a given visual feature and an audio feature in the latent space; the interpolated feature is then fed to the image generator to produce a novel image.

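A minimal sketch of this interpolation. Here generator, f_v, and f_a stand for the frozen BigGAN-based generator and the two encoders; the attribute generator.noise_dim and the call signature G(z_N, z) mirror the notation in Sec. 3.1 but are assumptions about the interface, not the released API.

```python
import torch

@torch.no_grad()
def generate_from_image_and_audio(generator, f_v, f_a, image, spec, lam=0.5):
    """Interpolate visual and audio features in the shared latent space
    and decode the result with the frozen image generator.

    lam close to 1 keeps the image content; lam close to 0 follows the sound.
    """
    z_v = f_v(image.unsqueeze(0))              # (1, 2048) visual feature
    z_a = f_a(spec.unsqueeze(0))               # (1, 2048) audio feature
    z_new = lam * z_v + (1.0 - lam) * z_a      # z_new = lam * z_V + (1 - lam) * z_A
    z_n = torch.randn(1, generator.noise_dim)  # z_N ~ N(0, I); attribute name assumed
    return generator(z_n, z_new)               # G(z_N, z_new) -> 128x128 image
```
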
We further use this approach to compare our method with the recent sound-guided image manipulation approach [36]5 in Fig. 10. Note that this task is not explicitly targeted by our model but appears as a natural outcome of our design.

5 https://github.com/kuai-lab/sound-guided-semantic-image-manipulation

Figure 10. Qualitative comparison of our method and Lee et al. [36] (examples: Explosion, Cheering, Hail, Tractor applied to given images). Lee et al. target sound-guided image manipulation, whereas we target sound-to-image generation. Lee et al. fail to insert an object while maintaining the contents of the given image; our method, by contrast, successfully inserts the sounding objects into the scene by generating a new image. Note that the two works target different tasks.

While Lee et al. [36] preserve the overall content of the given image, they fail to insert an object corresponding to the sound. In contrast, our method creates an image (nearly the same as the given one) by conditioning on both modalities, for example, inserting an explosion or a tractor into the scene, or making the ocean view look cloudy due to the hail.

Image editing with paired sound. We approach sound-guided image editing from a different perspective, by manipulating the inputs in the latent space. Using GAN inversion [1, 57], we extract a visual feature z_V_inv and a corresponding noise vector z_N_inv for the given image. Additionally, we change the volume of the corresponding audio and extract two different audio features, z_A_1 and z_A_2. We move the visual feature along the direction of the difference between the two audio features and obtain a novel feature z_new = z_V_inv + λ(z_A_1 − z_A_2), i.e., we manipulate the visual feature under audio guidance. Using this new feature with the image generator, G(z_N_inv, z_new), the original image is edited. As shown in Fig. 11, by simply changing the volume, i.e., moving through the latent space, we can change the flow of the waterfall or make the ocean wave stronger or calmer.

Figure 11. Image editing by volume changes in the latent space. We extract an image feature and a noise vector by GAN inversion, and two audio features with different volumes. Then, we move the image feature in the direction of the audio feature difference.

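A minimal sketch of this editing step. The gan_inversion callable is assumed to return (z_V_inv, z_N_inv) for a given image, and spec_loud / spec_quiet are spectrograms of the same clip at two volumes; this illustrates the equation above under those assumptions rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def edit_image_with_volume_change(generator, gan_inversion, f_a, image,
                                  spec_loud, spec_quiet, lam=1.0):
    """Sound-guided latent editing: move an inverted image feature along the
    direction defined by two volumes of the same audio."""
    z_v_inv, z_n_inv = gan_inversion(image.unsqueeze(0))  # GAN inversion [1, 57]
    z_a_1 = f_a(spec_loud.unsqueeze(0))                   # audio feature, louder volume
    z_a_2 = f_a(spec_quiet.unsqueeze(0))                  # audio feature, quieter volume
    direction = z_a_1 - z_a_2                             # volume-change direction
    z_new = z_v_inv + lam * direction                     # z_new = z_V_inv + lam * (z_A_1 - z_A_2)
    return generator(z_n_inv, z_new)                      # edited image G(z_N_inv, z_new)
```
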
ity gap between audio and visual signals, such that audio
lacks visual information, and audio-visual pairs are not al-
6. Discussion ways correspondent. Existing approaches have limitations
Generalization. We show due to these difficulties. We show that our proposed method
generalization of the model to overcomes these challenges in that it can successfully en-
some extent in two settings: rich the audio features with visual knowledge, selects audio-
1) Generating images from un- visually correlated pairs for learning, and generates rich
seen categories that are seman- images with various characteristics. Furthermore, we demon-
tically similar to the training strate our model allows controllability in inputs to get more
Figure 12. Generalization to creative results, unlike the prior arts. We would like to note
set as sound often carries over- unseen classes.
lapping information (Fig. 12). that our proposed learning approach and the audio-visual
2) Compositionality (dog barking+water flowing in Fig. 1). pair selection method are independent of the specific design
However, our method may not be generalized for every un- choice of the model. We hope that our work encourages
seen category as similar limitation is also common in other further research on multi-modal image generation.

6437
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In IEEE International Conference on Computer Vision (ICCV), 2019.
[2] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In European Conference on Computer Vision (ECCV), 2020.
[3] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[4] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[5] Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[6] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.
[7] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In European Conference on Computer Vision (ECCV), 2018.
[8] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2018.
[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[11] Arantxa Casanova, Marlène Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero Soriano. Instance-conditioned GAN. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[12] Moitreya Chatterjee and Anoop Cherian. Sound2Sight: Generating visual dynamics from sound and context. In European Conference on Computer Vision (ECCV), 2020.
[13] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[14] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[15] Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu. Deep cross-modal audio-visual generation. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, 2017.
[16] Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan. Generating visually aligned sound from videos. IEEE Transactions on Image Processing (TIP), 2020.
[17] Yanbei Chen, Yongqin Xian, A. Koepke, Ying Shan, and Zeynep Akata. Distilling audio-visual knowledge by compositional contrastive learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[20] Leonardo A. Fanzeres and Climent Nadeu. Sound-to-imagination: Unsupervised crossmodal translation using deep dense network architecture. arXiv preprint arXiv:2106.01266, 2021.
[21] Chuang Gan, Hang Zhao, Peihao Chen, David Cox, and Antonio Torralba. Self-supervised moving vehicle tracking with stereo sound. In IEEE International Conference on Computer Vision (ICCV), 2019.
[22] Ruohan Gao and Kristen Grauman. Co-separating sounds of visual objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[23] Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. Learning individual styles of conversational gesture. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[24] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[25] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision (IJCV), 2021.
[26] Wangli Hao, Zhaoxiang Zhang, and He Guan. CMCGAN: A uniform framework for cross-modal visual-audio mutual generation. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[29] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[30] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[31] Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[32] Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, and Dejing Dou. Discriminative sounding objects localization via self-supervised audiovisual matching. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[33] Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. In British Machine Vision Conference (BMVC), 2021.
[34] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[35] Minkyu Kim, Kim Sung-Bin, and Tae-Hyun Oh. Prefix tuning for automated audio captioning. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
[36] Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chanyoung Kim, Jinkyu Kim, and Sangpil Kim. Sound-guided semantic image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[37] Tingle Li, Yichen Liu, Andrew Owens, and Hang Zhao. Learning visual styles from audio-visual associations. In European Conference on Computer Vision (ECCV), 2022.
[38] Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. Collaging class-specific GANs for semantic image synthesis. In IEEE International Conference on Computer Vision (ICCV), 2021.
[39] Ron Mokady, Amir Hertz, and Amit H. Bermano. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
[40] Pedro Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[41] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[42] Medhini Narasimhan, Shiry Ginosar, Andrew Owens, Alexei A. Efros, and Trevor Darrell. Strumming to the beat: Audio-conditioned contrastive video textures. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2022.
[43] Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, and Wojciech Matusik. Speech2Face: Learning the face behind a voice. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[44] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[45] Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision (ECCV), 2018.
[46] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[47] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision (ECCV), 2016.
[48] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Learning sight from sound: Ambient sound provides supervision for visual learning. International Journal of Computer Vision (IJCV), 2018.
[49] Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (Round 1), 2021.
[50] Sooyoung Park, Arda Senocak, and Joon Son Chung. MarginNCE: Robust sound localization with a negative margin. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
[51] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[52] Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, and Kwang Moo Yi. Estimating visual information from audio through manifold learning. arXiv preprint arXiv:2208.02337, 2022.
[53] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[54] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
[55] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[56] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning (ICML), 2021.
[57] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A StyleGAN encoder for image-to-image translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[58] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[59] Arda Senocak, Junsik Kim, Tae-Hyun Oh, Hyeonggon Ryu, Dingzeyu Li, and In So Kweon. Event-specific audio-visual fusion layers: A simple and new perspective on video understanding. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2023.
[60] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[61] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound sources in visual scenes: Analysis and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
[62] Arda Senocak, Hyeonggon Ryu, Junsik Kim, and In So Kweon. Less can be more: Sound source localization with a classification model. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2022.
[63] Joo Yong Shim, Joongheon Kim, and Jong-Kook Kim. S2I-Bird: Sound-to-image generation of bird species using generative adversarial networks. In International Conference on Pattern Recognition (ICPR), 2021.
[64] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[65] Kun Su, Xiulong Liu, and Eli Shlizerman. Audeo: Audio generation for a silent performance video. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[66] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[67] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[68] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
[69] Chia-Hung Wan, Shun-Po Chuang, and Hung-Yi Lee. Towards audio to scene image synthesis using generative adversarial network. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
[70] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. Wav2CLIP: Learning robust audio representations from CLIP. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021.
[71] Kim Youwang, Kim Ji-Yeon, and Tae-Hyun Oh. CLIP-Actor: Text-driven recommendation and stylization for animating human meshes. In European Conference on Computer Vision (ECCV), 2022.
[72] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image generation from layout. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[73] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L. Berg. Visual to sound: Generating natural sound for videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
