Speech2Face - Learning The Face Behind A Voice
Abstract
How much can we infer about a person’s looks from the
way they speak? In this paper, we study the task of recon-
structing a facial image of a person from a short audio
recording of that person speaking. We design and train a
deep neural network to perform this task using millions of
natural Internet/YouTube videos of people speaking. Dur-
ing training, our model learns voice-face correlations that
allow it to produce images that capture various physical
attributes of the speakers such as age, gender and ethnicity.
This is done in a self-supervised manner, by utilizing the
natural co-occurrence of faces and speech in Internet videos,
without the need to model attributes explicitly. We evaluate
and numerically quantify how—and in what manner—our
Speech2Face reconstructions, obtained directly from audio,
resemble the true face images of the speakers.
1. Introduction
When we listen to a person speaking without seeing his/her face, on the phone, or on the radio, we often build a mental model for the way the person looks [25, 45]. There is a strong connection between speech and appearance, part of which is a direct result of the mechanics of speech production: age, gender (which affects the pitch of our voice), the shape of the mouth, facial bone structure, thin or full lips—all can affect the sound we generate. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations—such properties of speech are often shared among nationalities and cultures, which can in turn translate to common physical features [12].

Our goal in this work is to study to what extent we can infer how a person looks from the way they talk. Specifically, from a short input audio segment of a person speaking, our method directly reconstructs an image of the person's face in a canonical form (i.e., frontal-facing, neutral expression). Fig. 1 shows sample results of our method. Obviously, there is no one-to-one matching between faces and voices. Thus, our goal is not to predict a recognizable image of the exact face, but rather to capture dominant facial traits of the person that are correlated with the input speech.

Figure 1. Top: We consider the task of reconstructing an image of a person's face from a short audio segment of speech. Bottom: Several results produced by our Speech2Face model, which takes only an audio waveform as input; the true faces are shown just for reference. Note that our goal is not to reconstruct an accurate image of the person, but rather to recover characteristic physical features that are correlated with the input speech. All our results, including the input audio, are available in the supplementary material (SM).

We design a neural network model that takes the complex spectrogram of a short speech segment as input and predicts a feature vector representing the face. More specifically, face information is represented by a 4096-D feature that is extracted from the penultimate layer (i.e., one layer prior to the classification layer) of a pre-trained face recognition network [40]. We decode the predicted face feature into a canonical image of the person's face using a separately-trained reconstruction model [10]. To train our model, we use the AVSpeech dataset [14], comprised of millions of video segments from YouTube with more than 100,000 different people speaking. Our method is trained in a self-supervised manner, i.e., it simply uses the natural co-occurrence of speech and faces in videos, not requiring additional information, e.g., human annotations.

∗ The three authors contributed equally to this work.
Correspondence: [email protected]
Supplementary material (SM): https://speech2face.github.io
Figure 2. Speech2Face model and training pipeline. The input to our network is a complex spectrogram computed from the short audio
segment of a person speaking. The output is a 4096-D face feature that is then decoded into a canonical image of the face using a pre-trained
face decoder network [10]. The module we train is marked by the orange-tinted box. We train the network to regress to the true face feature
computed by feeding an image of the person (representative frame from the video) into a face recognition network [40] and extracting the
feature from its penultimate layer. Our model is trained on millions of speech–face embedding pairs from the AVSpeech dataset [14].
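To make the training setup in Fig. 2 concrete, the sketch below shows the overall regression loop in TensorFlow-style code: a frozen face recognition network provides the target 4096-D feature for the speaker's video frame, and only the voice encoder is optimized to predict that feature from the spectrogram. This is an illustrative sketch, not the authors' released code; the tiny stand-in networks and all names are placeholders, and only the plain L1 objective is shown here (the full loss is given in Eq. 1 of Section 4).

```python
import tensorflow as tf

# Tiny stand-ins for the real networks (hypothetical names and shapes): the
# frozen face recognition model maps a 224x224x3 frame to a 4096-D feature,
# and the trainable voice encoder maps a 598x257x2 spectrogram to 4096-D.
face_recognition_fc7 = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4096)])
face_recognition_fc7.trainable = False      # pre-trained & fixed in the paper

voice_encoder = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4096)])           # the module that is actually trained

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.5, epsilon=1e-4)

def train_step(spectrogram, face_frame):
    v_f = face_recognition_fc7(face_frame)          # target face feature
    with tf.GradientTape() as tape:
        v_s = voice_encoder(spectrogram)            # feature predicted from speech
        loss = tf.reduce_mean(tf.abs(v_f - v_s))    # plain L1 baseline (cf. Eq. 1)
    grads = tape.gradient(loss, voice_encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, voice_encoder.trainable_variables))
    return loss

# One step on random tensors standing in for a batch of speech-face pairs.
print(train_step(tf.random.normal([8, 598, 257, 2]),
                 tf.random.normal([8, 224, 224, 3])).numpy())
```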
We are certainly not the first to attempt to infer information about people from their voices. For example, predicting age and gender from speech has been widely explored [53, 18, 16, 7, 50]. Indeed, one can consider an alternative approach to attaching a face image to an input voice by first predicting some attributes from the person's voice (e.g., their age, gender, etc. [53]), and then either fetching an image from a database that best fits the predicted set of attributes, or using the attributes to generate an image [52]. However, this approach has several limitations. First, predicting attributes from an input signal relies on the existence of robust and accurate classifiers and often requires ground-truth labels for supervision. For example, predicting age, gender or ethnicity from speech requires building classifiers specifically trained to capture those properties. More importantly, this approach limits the predicted face to resemble only a predefined set of attributes.

We aim at studying a more general, open question: what kind of facial information can be extracted from speech? Our approach of predicting full visual appearance (e.g., a face image) directly from speech allows us to explore this question without being restricted to predefined facial traits. Specifically, we show that our reconstructed face images can be used as a proxy to convey the visual properties of the person, including age, gender and ethnicity. Beyond these dominant features, our reconstructions reveal non-negligible correlations between craniofacial features [31] (e.g., nose structure) and voice. This is achieved with no prior information or the existence of accurate classifiers for these types of fine geometric features. In addition, we believe that predicting face images directly from voice may support useful applications, such as attaching a representative face to phone/video calls based on the speaker's voice.

To our knowledge, our work is the first to explore a generic (speaker-independent) model for reconstructing face images directly from speech. We test our model on various speakers and numerically evaluate different aspects of our reconstructions, including: how well a true face image can be retrieved based solely on an audio query; and how well our reconstructed face images agree with the true face images (unknown to the method) in terms of age, gender, ethnicity, and various craniofacial measures and ratios.

2. Ethical Considerations

Although this is a purely academic investigation, we feel that it is important to explicitly discuss in the paper a set of ethical considerations due to the potential sensitivity of facial information.

Privacy. As mentioned, our method cannot recover the true identity of a person from their voice (i.e., an exact image of their face). This is because our model is trained to capture visual features (related to age, gender, etc.) that are common to many individuals, and only in cases where there is strong enough evidence to connect those visual features with vocal/speech attributes in the data (see "voice-face correlations" below). As such, the model will only produce average-looking faces, with characteristic visual features that are correlated with the input speech. It will not produce images of specific individuals.

Voice-face correlations and dataset bias. Our model is designed to reveal statistical correlations that exist between facial features and voices of speakers in the training data. The training data we use is a collection of educational videos from YouTube [14], and does not represent equally the entire world population. Therefore, the model—as is the case with any machine learning model—is affected by this uneven distribution of data.

More specifically, if a set of speakers has vocal-visual traits that are relatively uncommon in the data, then the quality of our reconstructions for such cases may degrade. For example, if a certain language does not appear in the training data, our reconstructions will not capture well the facial attributes that may be correlated with that language.

Note that some of the features in our predicted faces may not even be physically connected to speech, for example hair color or style. However, if many speakers in the training set who speak in a similar way (e.g., in the same language)
also share some common visual traits (e.g., a common hair color or style), then those visual traits may show up in the predictions.

For the above reasons, we recommend that any further investigation or practical use of this technology be carefully tested to ensure that the training data is representative of the intended user population. If that is not the case, more representative data should be broadly collected.

Categories. In our experimental section, we mention inferred demographic categories such as "White" and "Asian". These are categories defined and used by a commercial face attribute classifier [15], and were only used for evaluation in this paper. Our model is not supplied with and does not make use of this information at any stage.

3. Related Work

Audio-visual cross-modal learning. The natural co-occurrence of audio and visual signals often provides a rich supervision signal, without explicit labeling, also known as self-supervision [11] or natural supervision [24]. Arandjelović and Zisserman [4] leveraged this to learn generic audio-visual representations by training a deep network to classify whether a given video frame and a short audio clip correspond to each other. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos. Similarly, Castrejon et al. [8] designed a shared audio-visual representation that is agnostic of the modality. Such learned audio-visual representations have been used for cross-modal retrieval [38, 39, 46], sound source localization [42, 5, 37], and sound source separation [54, 14]. Our work utilizes the natural co-occurrence of faces and voices in Internet videos. We use a pre-trained face recognition network to transfer facial information to the voice modality.

Speech-face association learning. The associations between faces and voices have been studied extensively in many scientific disciplines. In the domain of computer vision, different cross-modal matching methods have been proposed: a binary or multi-way classification task [34, 33, 44]; metric learning [27, 21]; and the multi-task classification loss [50]. Cross-modal signals extracted from faces and voices have been used to disambiguate voiced and unvoiced consonants [36, 9]; to identify active speakers of a video from non-speakers therein [20, 17]; to separate mixed speech signals of multiple speakers [14]; to predict lip motions from speech [36, 3]; or to learn the correlation between speech and emotion [2]. Our goal is to learn the correlations between facial traits and speech, by directly reconstructing a face image from a short audio segment.

Visual reconstruction from audio. Various methods have been recently proposed to reconstruct visual information from different types of audio signals. In a more graphics-oriented application, automatic generation of facial or body animations from music or speech has been gaining interest [48, 26, 47, 43]. However, such methods typically parametrize the reconstructed subject a priori, and its texture is manually created or mined from a collection of textures. In the context of pixel-level generative methods, Sadoughi and Busso [41] reconstruct lip motions from speech, and Wiles et al. [51] control the pose and expression of a given face using audio (or another face). While not directly related to audio, Yan et al. [52] and Liu and Tuzel [30] synthesize a face image from given facial attributes as input. Our model reconstructs a face image directly from speech, with no additional information. Finally, Duarte et al. [13] synthesize face images from speech using a GAN model, but their goal is to recover the true face of the speaker, including expression and pose. In contrast, our goal is to recover general facial traits, i.e., average-looking faces in canonical pose and expression but capturing dominant visual attributes across many speakers.

4. Speech2Face (S2F) Model

The large variability in facial expressions, head poses, occlusions, and lighting conditions in natural face images makes the design and training of a Speech2Face model non-trivial. For example, a straightforward approach of regressing from input speech to image pixels does not work; such a model has to learn to factor out many irrelevant variations in the data and to implicitly extract a meaningful internal representation of faces—a challenging task by itself.

To sidestep these challenges, we train our model to regress to a low-dimensional intermediate representation of the face. More specifically, we utilize the VGG-Face model, a face recognition model pre-trained on a large-scale face dataset [40], and extract a 4096-D face feature from the penultimate layer (fc7) of the network. These face features were shown to contain enough information to reconstruct the corresponding face images while being robust to many of the aforementioned variations [10].

Our Speech2Face pipeline, illustrated in Fig. 2, consists of two main components: 1) a voice encoder, which takes a complex spectrogram of speech as input, and predicts a low-dimensional face feature that would correspond to the associated face; and 2) a face decoder, which takes as input the face feature and produces an image of the face in a canonical form (frontal-facing and with neutral expression). During training, the face decoder is fixed, and we train only the voice encoder that predicts the face feature. The voice encoder is a model we designed and trained, while we used a face decoder model proposed by Cole et al. [10]. We now describe both models in detail.

Voice encoder network. Our voice encoder module is a convolutional neural network that turns the spectrogram of a short input speech segment into a pseudo face feature, which is subsequently fed into the face decoder to reconstruct the face image (Fig. 2). The architecture of the voice encoder is summarized in Table 1.
Figure 3. Qualitative results on the AVSpeech test set. For every example (triplet of images) we show: (left) the original image, i.e.,
a representative frame from the video cropped around the speaker’s face; (middle) the frontalized, lighting-normalized face decoder
reconstruction from the VGG-Face feature extracted from the original image; (right) our Speech2Face reconstruction, computed by decoding
the predicted VGG-Face feature from the audio. In this figure, we highlight successful results of our method. Some failure cases are shown
in Fig. 12, and more results (including the input audio for all the examples) can be found in the SM.
Layer          Channels   Stride   Kernel size
Input          2          –        –
CONV/RELU/BN   64         1        4×4
CONV/RELU/BN   64         1        4×4
CONV/RELU/BN   128        1        4×4
MAXPOOL        –          2×1      2×1
CONV/RELU/BN   128        1        4×4
MAXPOOL        –          2×1      2×1
CONV/RELU/BN   128        1        4×4
MAXPOOL        –          2×1      2×1
CONV/RELU/BN   256        1        4×4
MAXPOOL        –          2×1      2×1
CONV/RELU/BN   512        1        4×4
CONV/RELU/BN   512        2        4×4
CONV/RELU/BN   512        2        4×4
AVGPOOL        –          1        ∞×1
FC/RELU        4096       1        1×1
FC             4096       1        1×1

Table 1. Voice encoder architecture. The input spectrogram dimensions are 598 × 257 (time × frequency) for a 6-second audio segment (which can be arbitrarily long), with the two input channels in the table corresponding to the spectrogram's real and imaginary components.
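For concreteness, the layer sequence of Table 1 can be written down roughly as the following Keras model. This is a sketch based on our reading of the table: it assumes "same" convolution padding, that each CONV entry is a convolution followed by ReLU and batch normalization (as stated in the text), and a particular handling of the frequency axis after the global temporal pooling; none of these details are fully specified by the table.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, strides=1):
    # Each CONV entry in Table 1: convolution + ReLU + batch normalization.
    x = layers.Conv2D(filters, kernel_size=4, strides=strides, padding="same")(x)
    x = layers.ReLU()(x)
    return layers.BatchNormalization()(x)

def build_voice_encoder(time_steps=598, freq_bins=257):
    # Input: complex spectrogram, real and imaginary parts as two channels.
    inp = tf.keras.Input(shape=(time_steps, freq_bins, 2))
    x = conv_block(inp, 64)
    x = conv_block(x, 64)
    x = conv_block(x, 128)
    for filters in (128, 128, 256, 512):
        # Max-pooling only along the temporal axis (2x1), per Table 1.
        x = layers.MaxPool2D(pool_size=(2, 1), strides=(2, 1))(x)
        x = conv_block(x, filters)
    x = conv_block(x, 512, strides=2)
    x = conv_block(x, 512, strides=2)
    # Global ("infinite") average pooling along the temporal dimension only.
    x = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)   # FC + ReLU
    out = layers.Dense(4096)(x)                    # final 4096-D face feature
    return tf.keras.Model(inp, out, name="voice_encoder")

encoder = build_voice_encoder()
encoder.summary()
```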
(a) Confusion matrices for the attributes (b) AVSpeech dataset statistics
Figure 4. Facial attribute evaluation. (a) Confusion matrices (with row-wise normalization) comparing the classification results on our Speech2Face image reconstructions (S2F) and those obtained from the original images for gender, age, and ethnicity; a stronger diagonal tendency indicates better performance. Ethnicity performance in (a) appears to be biased due to the uneven distribution of the training set, shown in (b).
The blocks of a convolution layer, ReLU, and batch normalization [23] alternate with max-pooling layers, which pool only along the temporal dimension of the spectrogram, leaving the frequency information carried over. This is intended to preserve more of the vocal characteristics, since they are better contained in the frequency content, whereas linguistic information usually spans longer time durations [22]. At the end of these blocks, we apply average pooling along the temporal dimension. This allows us to efficiently aggregate information over time and makes the model applicable to input speech of varying duration. The pooled features are then fed into two fully-connected layers to produce a 4096-D face feature.

Face decoder network. The goal of the face decoder is to reconstruct the image of a face from a low-dimensional face feature. We opt to factor out any irrelevant variations (pose, lighting, etc.), while preserving the facial attributes. To do so, we use the face decoder model of Cole et al. [10] to reconstruct a canonical face image. We train this model using the same face features extracted from the VGG-Face model as input to the face decoder. This model is trained separately and kept fixed during the voice encoder training.

Training. Our voice encoder is trained in a self-supervised manner, using the natural co-occurrence of a speaker's speech and facial images in videos. To this end, we use the AVSpeech dataset [14], a large-scale "in-the-wild" audio-visual dataset of people speaking. A single frame containing the speaker's face is extracted from each video clip and fed to the VGG-Face model to extract the 4096-D feature vector, v_f. This serves as the supervision signal for our voice encoder—the feature, v_s, of our voice encoder is trained to predict v_f.

A natural choice for the loss function would be the L1 distance between the features, ||v_f − v_s||_1. However, we found that the training undergoes slow and unstable progression with this loss alone. To stabilize the training, we introduce additional loss terms, motivated by Castrejon et al. [8]. Specifically, we additionally penalize the difference in the activation of the last layer of the face encoder, f_VGG: R^4096 → R^2622, i.e., fc8 of VGG-Face, and that of the first layer of the face decoder, f_dec: R^4096 → R^1000, which are pre-trained and fixed while training the voice encoder. We feed both our predictions and the ground-truth face features to these layers to calculate the losses. The final loss is:

\[
L_{\mathrm{total}} = \left\| f_{\mathrm{dec}}(v_f) - f_{\mathrm{dec}}(v_s) \right\|_1
 + \lambda_1 \left\| \frac{v_f}{\|v_f\|} - \frac{v_s}{\|v_s\|} \right\|_2^2
 + \lambda_2\, L_{\mathrm{distill}}\!\left(f_{\mathrm{VGG}}(v_f),\, f_{\mathrm{VGG}}(v_s)\right), \tag{1}
\]

where \(\lambda_1 = 0.025\) and \(\lambda_2 = 200\). \(\lambda_1\) and \(\lambda_2\) are tuned such that the gradient magnitude of each term with respect to \(v_s\) is within a similar scale at an early iteration (we measured at the 1000th iteration). The knowledge distillation loss \(L_{\mathrm{distill}}(a, b) = -\sum_i p^{(i)}(a) \log p^{(i)}(b)\), where \(p^{(i)}(a) = \frac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}\), is used as an alternative to the cross-entropy loss; it encourages the output of one network to approximate the output of another [19]. \(T = 2\) is used as recommended by the authors, which makes the activation smoother. We found that enforcing similarity over these additional layers stabilized and sped up the training process, in addition to a slight improvement in the resulting quality.
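A direct transcription of Eq. 1 might look as follows. This is a hedged sketch rather than the original training code; `f_dec_first_layer` and `f_vgg_fc8` stand for the fixed first layer of the face decoder and the fixed fc8 layer of VGG-Face, and the randomly initialized Dense layers at the bottom are only there to make the example self-contained.

```python
import tensorflow as tf

def distillation_loss(a, b, temperature=2.0):
    # L_distill(a, b) = -sum_i p_i(a) * log p_i(b), with temperature-softened softmax.
    p_a = tf.nn.softmax(a / temperature, axis=-1)
    log_p_b = tf.nn.log_softmax(b / temperature, axis=-1)
    return -tf.reduce_sum(p_a * log_p_b, axis=-1)

def speech2face_loss(v_f, v_s, f_dec_first_layer, f_vgg_fc8, lam1=0.025, lam2=200.0):
    # Term 1: L1 distance between the first face-decoder activations of the
    # true and predicted features.
    l1 = tf.reduce_sum(tf.abs(f_dec_first_layer(v_f) - f_dec_first_layer(v_s)), axis=-1)
    # Term 2: squared L2 distance between the unit-normalized features.
    diff = tf.math.l2_normalize(v_f, axis=-1) - tf.math.l2_normalize(v_s, axis=-1)
    l2 = tf.reduce_sum(tf.square(diff), axis=-1)
    # Term 3: knowledge distillation on the fc8 (classification) activations.
    l3 = distillation_loss(f_vgg_fc8(v_f), f_vgg_fc8(v_s))
    return tf.reduce_mean(l1 + lam1 * l2 + lam2 * l3)

# Self-contained toy usage: random features and Dense layers standing in for
# the fixed decoder (4096 -> 1000) and fc8 (4096 -> 2622) heads.
f_dec_first_layer = tf.keras.layers.Dense(1000)
f_vgg_fc8 = tf.keras.layers.Dense(2622)
v_f, v_s = tf.random.normal([8, 4096]), tf.random.normal([8, 4096])
print(speech2face_loss(v_f, v_s, f_dec_first_layer, f_vgg_fc8).numpy())
```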
Implementation details. We use up to 6 seconds of audio taken from the beginning of each video clip in AVSpeech. If the video clip is shorter than 6 seconds, we repeat the audio such that it becomes at least 6 seconds long. The audio waveform is resampled at 16 kHz and only a single channel is used. Spectrograms are computed similarly to Ephrat et al. [14] by taking the STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. Each complex spectrogram S subsequently goes through power-law compression, resulting in sgn(S)|S|^0.3 for the real and imaginary components independently, where sgn(·) denotes the signum. We run the CNN-based face detector from Dlib [28], crop the face regions from the frames, and resize them to 224 × 224 pixels. The VGG-Face features are computed from the resized face images. The computed spectrogram and VGG-Face feature of each segment are collected and used for training. The resulting training and test sets include 1.7 and 0.15 million spectra–face feature pairs, respectively. Our network is implemented in TensorFlow and optimized with ADAM [29] using β1 = 0.5, ε = 10^−4, a learning rate of 0.001 with an exponential decay rate of 0.95 every 10,000 iterations, and a batch size of 8, for 3 epochs.
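The audio preprocessing described above can be approximated as in the sketch below, assuming a mono 16 kHz waveform in a NumPy array. The STFT parameters (25 ms Hann window, 10 ms hop, 512 FFT bands) follow the text; the exact padding and windowing conventions of the original implementation are not specified, so this is only an approximation.

```python
import numpy as np
import tensorflow as tf

def audio_to_input_spectrogram(waveform_16khz, duration_s=6.0, sr=16000):
    # Repeat short clips so the segment is at least `duration_s` seconds long,
    # then crop to exactly that length.
    target_len = int(duration_s * sr)
    reps = int(np.ceil(target_len / len(waveform_16khz)))
    wav = np.tile(waveform_16khz, reps)[:target_len].astype(np.float32)

    # STFT: 25 ms Hann window, 10 ms hop, 512 FFT bins -> 257 frequency bands.
    stft = tf.signal.stft(wav, frame_length=400, frame_step=160, fft_length=512,
                          window_fn=tf.signal.hann_window)

    # Power-law compression sgn(S)|S|^0.3, applied to real and imaginary parts
    # independently, then stacked as two input channels.
    def compress(x):
        return tf.sign(x) * tf.pow(tf.abs(x), 0.3)
    real = compress(tf.math.real(stft))
    imag = compress(tf.math.imag(stft))
    return tf.stack([real, imag], axis=-1)   # shape ~ (598, 257, 2) for 6 s

spec = audio_to_input_spectrogram(np.random.randn(3 * 16000))  # 3 s example clip
print(spec.shape)
```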
5. Results

We test our model both qualitatively and quantitatively on the AVSpeech dataset [14] and the VoxCeleb dataset [35]. Our goal is to gain insights and to quantify how—and in which manner—our Speech2Face reconstructions resemble the true face images.

(a) Landmarks marked on reconstructions from image (F2F)
(b) Landmarks marked on our corresponding reconstructions from speech (S2F)
(c) Pearson correlation coefficient:

Face measurement           Correlation   p-value
Upper lip height           0.16          p < 0.001
Lateral upper lip heights  0.26          p < 0.001
Jaw width                  0.11          p < 0.001
Nose height                0.14          p < 0.001
Nose width                 0.35          p < 0.001
Labio oral region          0.17          p < 0.001
Mandibular idx             0.20          p < 0.001
Intercanthal idx           0.21          p < 0.001
Nasal index                0.38          p < 0.001
Vermilion height idx       0.29          p < 0.001
Mouth face width idx       0.20          p < 0.001
Nose area                  0.28          p < 0.001
Random baseline            0.02          –

Figure 5. Craniofacial features. We measure the correlation between craniofacial features extracted from (a) face decoder reconstructions from the original image (F2F), and (b) features extracted from our corresponding Speech2Face reconstructions (S2F); the features are computed from detected facial landmarks, as described in [31]. The table reports the Pearson correlation coefficient and statistical significance computed over 1,000 test images for each feature. The random baseline is computed for "Nasal index" by comparing random pairs of F2F reconstructions (a) and S2F reconstructions (b).
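As an illustration of the protocol in the caption of Fig. 5, the per-measurement statistics reduce to a standard Pearson correlation test over paired F2F and S2F measurements; the landmark detection and the craniofacial measurements of [31] themselves are not shown, and the arrays below are hypothetical.

```python
from scipy.stats import pearsonr
import numpy as np

def craniofacial_correlation(f2f_measurements, s2f_measurements):
    """Pearson correlation between one craniofacial measurement computed on
    face-decoder reconstructions from images (F2F) and on the corresponding
    Speech2Face reconstructions (S2F), over the same test subjects."""
    r, p_value = pearsonr(f2f_measurements, s2f_measurements)
    return r, p_value

# Toy example with random (uncorrelated) measurements for 1,000 test images,
# mimicking the "Random baseline" row of Fig. 5(c).
rng = np.random.default_rng(0)
print(craniofacial_correlation(rng.normal(size=1000), rng.normal(size=1000)))
```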
Qualitative results on the AVSpeech test set are shown in Fig. 3. For each example, we show the true image of the speaker for reference (unknown to our model), the face reconstructed from the face feature (computed from the true image) by the face decoder (Sec. 4), and the face reconstructed from a 6-second audio segment of the person's speech, which is our Speech2Face result. While looking somewhat like average faces, our Speech2Face reconstructions capture rich physical information about the speaker, such as their age, gender, and ethnicity. The predicted images also capture additional properties like the shape of the face or head (e.g., elongated vs. round), which we often find consistent with the true appearance of the speaker; see the last two rows in Fig. 3 for instance.

5.1. Facial Features Evaluation

We quantify how well different facial attributes are being captured in our Speech2Face reconstructions and test different aspects of our model.

Demographic attributes. We use Face++ [15], a leading commercial service for computing facial attributes. Specifically, we evaluate and compare age, gender, and ethnicity, by running the Face++ classifiers on the original images and on our Speech2Face reconstructions. The Face++ classifiers return either "male" or "female" for gender, a continuous number for age, and one of the four values, "Asian", "black", "India", or "white", for ethnicity.1

Fig. 4(a) shows confusion matrices for each of the attributes, comparing the attributes inferred from the original images with those inferred from our Speech2Face reconstructions (S2F). See the supplementary material for similar evaluations of our face-decoder reconstructions from the images (F2F). As can be seen, for age and gender the classification results are highly correlated. For gender, there is an agreement of 94% in male/female labels between the true images and our reconstructions from speech. For ethnicity, there is a good correlation on "white" and "Asian", but we observe less agreement on "India" and "black". We

1 We directly refer to the Face++ labels, which are not our terminology.
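A sketch of this evaluation protocol, assuming the attribute labels returned by the external classifier have already been collected for the original images and for the corresponding S2F reconstructions (labels and values below are hypothetical):

```python
import numpy as np

def row_normalized_confusion(orig_labels, s2f_labels, classes):
    """Confusion matrix with row-wise normalization, as in Fig. 4(a):
    rows = attribute inferred from the original image,
    columns = attribute inferred from the S2F reconstruction."""
    idx = {c: i for i, c in enumerate(classes)}
    m = np.zeros((len(classes), len(classes)))
    for o, s in zip(orig_labels, s2f_labels):
        m[idx[o], idx[s]] += 1
    return m / np.maximum(m.sum(axis=1, keepdims=True), 1)

# Toy example for the gender attribute.
orig = ["male", "female", "male", "male", "female"]
s2f  = ["male", "female", "male", "female", "female"]
print(row_normalized_confusion(orig, s2f, classes=["male", "female"]))
print("agreement:", np.mean([o == s for o, s in zip(orig, s2f)]))
```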
Length      cos (deg)      L2            L1
3 seconds   48.43 ± 6.01   0.19 ± 0.03   9.81 ± 1.74
6 seconds   45.75 ± 5.09   0.18 ± 0.02   9.42 ± 1.54

Table 2. Feature similarity. We measure the similarity between our features predicted from speech and the corresponding face features computed on the true images of the speakers. We report average cosine, L2 and L1 distances over 5000 random samples from the AVSpeech test set, using 3- and 6-second audio segments.

Duration   Metric   R@1     R@2     R@5     R@10
3 sec      L2       5.86    10.02   18.98   28.92
3 sec      L1       6.22    9.92    18.94   28.70
3 sec      cos      8.54    13.64   24.80   38.54
6 sec      L2       8.28    13.66   24.66   35.84
6 sec      L1       8.34    13.70   24.66   36.22
6 sec      cos      10.92   17.00   30.60   45.82
Random     –        1.00    2.00    5.00    10.00

Table 3. S2F→Face retrieval performance. We measure retrieval performance by recall at K (R@K, in %), which indicates the chance of retrieving the true image of a speaker within the top-K results. We used a database of 5,000 images for this experiment; see Fig. 7 for qualitative results. The higher the better. Random chance is presented as a baseline.

[Figure 11 panels: an Asian male speaking in English (left) & Chinese (right); an Asian girl speaking in English]
Figure 11. The effect of language. We notice mixed performance in terms of the ability of the model to handle languages and accents. (a) A sample case of language-dependent face reconstructions. (b) A sample case that successfully factors out the language.
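The retrieval metric of Table 3 can be reproduced along the following lines: given one feature predicted from speech per query and a database of true face features, rank the database by the chosen distance and check whether the speaker's own image appears among the top K. This is a sketch with illustrative variable names and toy data, not the exact evaluation code.

```python
import numpy as np

def recall_at_k(query_feats, db_feats, true_indices, k=5, metric="cos"):
    """Fraction of queries whose true face is ranked within the top-k database
    entries (R@K). `query_feats` are features predicted from speech, `db_feats`
    the face features of the database images, and `true_indices[i]` is the
    database index of query i's true face."""
    if metric == "cos":
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
        dist = 1.0 - q @ d.T
    elif metric == "L2":
        dist = np.linalg.norm(query_feats[:, None] - db_feats[None], axis=-1)
    else:  # L1
        dist = np.abs(query_feats[:, None] - db_feats[None]).sum(axis=-1)
    topk = np.argsort(dist, axis=1)[:, :k]
    hits = [true_indices[i] in topk[i] for i in range(len(query_feats))]
    return float(np.mean(hits))

# Toy usage: 100 noisy 4096-D queries against a 500-entry face-feature database.
rng = np.random.default_rng(0)
db = rng.normal(size=(500, 4096)).astype(np.float32)
queries = db[:100] + 0.1 * rng.normal(size=(100, 4096)).astype(np.float32)
print(recall_at_k(queries, db, true_indices=np.arange(100), k=5))
```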