Child Face Generation With Deep Conditional Generative Adversarial Networks
2. Background
There exists some distribution $C(\theta)$ such that a random variable $A \sim C(\theta)$ is an image of a child, drawn with the probability that a child with parameters $\theta$ has the facial phenotype corresponding to that image. Using the universality of the uniform, we can construct a mapping from random noise $Z$ to an arbitrary distribution. Unfortunately, $\theta$ includes many variables that are not easily observable, presumably including some environmental interactions. However, if we have some useful information, we can construct $f_1(Z, f_2(\theta))$, which we can then fit using a training method of our choice and a dataset of child images (in addition to $\theta$ features).
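As a concrete illustration of the universality of the uniform (our example, not from the original text), inverse transform sampling pushes uniform noise through an inverse CDF to produce samples from a chosen target distribution. Here a closed-form exponential stands in for the intractable $C(\theta)$:

import numpy as np

# Inverse transform sampling: if U ~ Uniform(0, 1) and F is the CDF of the
# target distribution, then F^{-1}(U) is distributed according to the target.
rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)       # raw uniform noise, playing the role of Z

# Stand-in target: Exponential(rate), whose inverse CDF is -ln(1 - u) / rate.
# The child-image distribution C(theta) has no such closed form.
rate = 2.0
samples = -np.log(1.0 - u) / rate

print(samples.mean())              # approximately 1 / rate = 0.5

When no closed-form inverse CDF exists, as for images, a trained neural network generator plays this role.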
Intuitively, a convenient and useful choice for $\theta$ is the images of the parents of a given child. We employ the TSKinFace dataset, which consists of 1015 triples of 64x64 pixel images representing the father, the mother, and the child. The dataset is roughly balanced between female and male children. However, the limitations of the dataset pose some other challenges. First, the small size of the dataset means that overfitting is a constant concern. This concern is aggravated by the relatively large amount of data used as input to this model. Second, the dataset is biased towards East Asian families, which means that the resulting model may lack the ability to generalize beyond this group of people.

Figure 1. Model architecture for the generator network of childGAN. Separate convolutional parts of the network process the two parent images, while a fully connected layer processes the noise. These inputs are combined at the end of the network to produce the output.
3. Related Work
The GAN model was first introduced by Goodfellow et al. (2014). GANs have demonstrated success in many image-based domains, including image generation (Denton et al., 2015), representation learning (Radford et al., 2015), text-to-image synthesis (Reed et al., 2016), and image upsampling (Ledig et al., 2016).
Figure 2. Model architecture for the discriminator network of childGAN. As in the generator, separate convolutional components process the parent images.

The key innovation is the adversarial loss, in which the loss of the generator is based on the ability of a discriminator to correctly classify an image as coming from the generator. The weights of the generative model can be trained directly by backpropagating through the (fixed) weights of the discriminator, since the output of the generator and the input of the discriminator align.
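A minimal PyTorch sketch of this mechanism (our illustration; the toy layer sizes are invented): the discriminator's weights are frozen, yet gradients still flow through it to the generator.

import torch
import torch.nn as nn

# Toy stand-ins for the real networks (hypothetical sizes, for illustration).
netG = nn.Sequential(nn.Linear(100, 64), nn.ELU(), nn.Linear(64, 784), nn.Tanh())
netD = nn.Sequential(nn.Linear(784, 64), nn.ELU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(16, 100)                  # batch of noise vectors
fake = netG(z)                            # generator output

for p in netD.parameters():               # fix the discriminator's weights
    p.requires_grad_(False)

# Generator loss: how well the fakes fool the (fixed) discriminator.
loss_g = nn.functional.binary_cross_entropy(netD(fake), torch.ones(16, 1))
loss_g.backward()                         # gradients reach netG through netD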
Conditional Generative Adversarial Networks quickly followed GANs. The first application of this model conditioned only on a single number from 0-9, in order to generate MNIST images. However, one can condition on any type of data that can be easily fed into a neural network (Mirza & Osindero, 2014).
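The sketch below shows one common form of such conditioning (ours; the layer sizes are assumptions): a one-hot class label is concatenated with the noise vector before the generator's first layer.

import torch
import torch.nn as nn

noise_dim, n_classes = 100, 10

# Conditional generator: the class label is appended to the noise input.
netG = nn.Sequential(
    nn.Linear(noise_dim + n_classes, 128), nn.ELU(),
    nn.Linear(128, 784), nn.Tanh())       # 784 = flattened 28x28 MNIST image

z = torch.randn(16, noise_dim)
labels = torch.randint(0, n_classes, (16,))
one_hot = nn.functional.one_hot(labels, n_classes).float()

fake_digits = netG(torch.cat([z, one_hot], dim=1))  # images conditioned on labels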
The most natural parallel to the problem of child face generation is image-to-image translation, a domain which generalizes the problem of taking an input image and producing an output image. While such a problem can be approached with a standard CNN and deconvolutions, the CGAN produces more realistic results (Isola et al., 2016). Further developments in this domain include the introduction of models that include an inverse mapping to translate back (Zhu et al., 2017). Our only adjustment to the image-to-image translation model is the use of two images instead of one as the input to our generator.

In the domain of kinship verification, we find a variety of approaches. The paper we draw our dataset from employs an RSBM (relative symmetric bilinear model) and feature extraction (Qin et al., 2015). However, others have found success with this problem using convolutional neural networks (Zhang et al., 2015).
4. Model

As discussed previously, we are interested in modeling and sampling from the distribution over children given their parent images. Because we process each input separately initially, we can express our generator as

$$G(Z, f_2(p_1), f_3(p_2)),$$

where $Z$ is random noise, $p_1$ and $p_2$ are the parent images, and $f_2$ and $f_3$ are the convolutional components that process them. The generator's loss is driven by the term $1 - D(G(\cdot))$: thus, the generator network is trained to produce images that closely conform to the distribution, as evaluated by the discriminator. The original GAN paper characterizes the relationship between the two neural networks as that of a two-player minimax game with the following value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x, f_4(p_1), f_5(p_2))] + \mathbb{E}_{Z \sim p_z(z)}[\log(1 - D(G(Z, f_2(p_1), f_3(p_2)), f_4(p_1), f_5(p_2)))]$$

where in this case log-loss is used, and $f_4$ and $f_5$ are the discriminator's convolutional components for the parent images.
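To connect this objective to the training procedure in Section 5, the alternating optimization of $V(D, G)$ can be written as two binary cross-entropy losses (a standard reformulation following Goodfellow et al. (2014), not spelled out in the original text). Abbreviating the conditioning features as $c_D = (f_4(p_1), f_5(p_2))$ and $c_G = (f_2(p_1), f_3(p_2))$:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x, c_D)] - \mathbb{E}_{Z \sim p_z}[\log(1 - D(G(Z, c_G), c_D))]$$

$$\mathcal{L}_G = -\mathbb{E}_{Z \sim p_z}[\log D(G(Z, c_G), c_D)]$$

Here $\mathcal{L}_D$ treats real children as label 1 and generated children as label 0, while $\mathcal{L}_G$ uses the non-saturating form, which shares the fixed points of the minimax objective but provides stronger gradients early in training.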
The parameters of this model are the weights in the neural networks. This presents a challenge because the number of weights is large relative to the amount of data that we have.

However, the advantages of using a neural network-based model include the ability to represent nearly arbitrary functions and to train with highly efficient backpropagation. This allows us to make very few assumptions about parameters in our model. Nonetheless, we bake a few assumptions about the structure of the data and the problem into our model. For example, we assume we can process the parents' images separately because the features that affect the looks of their child can be encoded at a higher level of abstraction before they interact. We also use convolutional layers in the network components that take images, which reflects an assumption of positional invariance for the features in the images. Intuitively these are reasonable assumptions, and they help us limit the number of parameters we have to train.
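The sketch below illustrates this separate-processing structure (our reconstruction; the layer sizes are assumptions, and the paper's actual architecture is the one shown in Figure 1): two convolutional branches encode the parents, a fully connected layer processes the noise, and the three results are combined to produce the child image.

import torch
import torch.nn as nn

class ChildGenerator(nn.Module):
    # Two conv branches for the parents, an FC branch for noise (cf. Figure 1).
    def __init__(self, noise_dim=100):
        super().__init__()
        # Same architecture, separate weights, for each 64x64 parent image.
        def parent_branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ELU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ELU())  # 32 -> 16
        self.father, self.mother = parent_branch(), parent_branch()
        self.noise_fc = nn.Linear(noise_dim, 64 * 16 * 16)
        # Combined features are upsampled back to a 64x64 child image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 * 3, 64, 4, stride=2, padding=1), nn.ELU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, z, p1, p2):
        h = torch.cat([self.father(p1), self.mother(p2),
                       self.noise_fc(z).view(-1, 64, 16, 16)], dim=1)
        return self.decoder(h)

netG = ChildGenerator()
child = netG(torch.randn(4, 100), torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64))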
5. Training

Because our model is composed of neural networks, we use backpropagation to fit reasonable values for the parameters. When compared to other generative methods, GANs have the advantage of not requiring an inference step. Computationally, this makes it possible to estimate parameters for much larger models. GANs learn approximate parameters, and have a notoriously unstable training process, which makes appropriate training procedures very important.

As mentioned previously, we are limited by a relatively small dataset of 1015 examples of parent-parent-child trio image sets. Because of this, we used transfer learning as a way of pseudo-augmenting our data. For example, much of what the generator must learn at a higher level is simply making human-like faces; regardless of the parent images, the output needs to look like a child. This fundamental feature of human faces and higher-level abstraction is not limited to our problem. We assume that this encoding occurs in the first two convolutional layers. Therefore, we can instead pretrain a different DCGAN that takes a random z matrix as input and outputs a child image. The advantage of splitting the problem up in this manner is that we can use any image dataset of children. We therefore used the Large Age-Gap (LAG) database, an image database with pictures taken of people at a variety of ages (Bianco, 2017). We only take the child images, resulting in 9846 photos.

In summary, the following is the basic outline of our overall training process for the DCGAN (a sketch of the weight transfer in step 2 follows the list):

1. Train a DCGAN that receives a random array z as input and generates child images.

2. Initialize the weights of the first two convolutional layers (shown in Figure 1) of a different DCGAN, intended to generate children from parent images, to be the weights of the first two layers of the pre-trained DCGAN from the previous step, which takes a random array as its input.

3. Keeping the weights in these layers fixed, train the DCGAN conditioned on parent images to generate potential child images.
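A sketch of the weight transfer and freezing in steps 2 and 3 (ours; pretrained_g and child_g are hypothetical generator modules whose first two convolutional layers have matching shapes):

import torch.nn as nn

def transfer_and_freeze(pretrained_g: nn.Module, child_g: nn.Module):
    # Copy the first two conv layers' weights from the pretrained DCGAN
    # generator into the conditional generator, then freeze them.
    conv_types = (nn.Conv2d, nn.ConvTranspose2d)
    src = [m for m in pretrained_g.modules() if isinstance(m, conv_types)][:2]
    dst = [m for m in child_g.modules() if isinstance(m, conv_types)][:2]
    for s, d in zip(src, dst):
        d.load_state_dict(s.state_dict())    # initialize from pretrained weights
        for p in d.parameters():
            p.requires_grad_(False)          # keep fixed while fine-tuning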
In general, we alternate the training of the discriminator and the generator. The objective functions are as described in Section 4. We use binary cross-entropy loss as the loss function.

Our architecture is based on the DCGAN model (Radford et al., 2015), but we made several key changes for our problem. For example, our model is a conditional GAN that takes in two parent images, so we had to adjust the model accordingly. Also, although previous work used ReLUs for activation, we found that ELUs allowed the learning to be much quicker. We also found that too much batch normalization causes the discriminator to 'overpower' the generator early in the training process, so some of the batch normalization layers were omitted. We used Adam optimizers for all generators and discriminators, with a learning rate of 0.0002 and betas = [0.5, 0.999].

For our final model, we trained for 12 hours, which equated to 4000 epochs.
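A condensed sketch of one alternating step under these settings (our reconstruction; netG and netD are conditional modules in the style of the ChildGenerator sketch above, and dataloader is assumed to yield (p1, p2, child) batches):

import torch
import torch.nn as nn

bce = nn.BCELoss()
opt_g = torch.optim.Adam(netG.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(netD.parameters(), lr=0.0002, betas=(0.5, 0.999))

for p1, p2, child in dataloader:           # parent-parent-child triples
    n = child.size(0)
    real_lbl, fake_lbl = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: real children vs. generated children.
    opt_d.zero_grad()
    fake_child = netG(torch.randn(n, 100), p1, p2)
    loss_d = (bce(netD(child, p1, p2), real_lbl)
              + bce(netD(fake_child.detach(), p1, p2), fake_lbl))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to fool the updated discriminator.
    opt_g.zero_grad()
    loss_g = bce(netD(fake_child, p1, p2), real_lbl)
    loss_g.backward()
    opt_g.step()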
Figure 4. Results for training: probability of generator fooling the discriminator over time
The GAN demonstrates an ability to sample from highly multi-modal distributions. The most obvious example of this is that the GAN successfully produces both male and female children. However, the issue of image clarity can also be viewed as a similar problem. While two very clear images, when viewed by a human, may both be judged to be likely images of real children, the pixelwise average of these images is quite unlikely to get the same reaction. Therefore the clarity of images produced by the GAN reflects its ability to find modes, peaks in probability density, instead of averaging over them. In contrast, the supervised baseline model, driven by its RMSE, "hedges" its guesses for pixel values between multiple possibilities.

GANs are notorious for "mode collapse," a problem that occurs when the generator learns to produce a single value without any variation. We were able to avoid this problem by using several random re-initializations. This ability to produce a highly diverse set of images is important for this domain. Potential applications include synthesizing hypothetical images of lost children to help their biological parents find them, or information retrieval (searching for online images of children of given parents). In these cases, having several diverse candidate images is valuable.

The other interesting result from our experiments is that the GAN was able to produce even remotely reasonable results given the small size of the dataset. This may reflect the fact that generated data augments the real data when training the discriminator; in effect, this doubles the size of the data. Furthermore, our use of transfer learning gave a sort of pseudo-augmentation to the dataset. Our success with this technique could potentially be replicated in other domains with limited data from which to train generative models. By introducing data from spaces which overlap the target space, the generator can quickly learn valuable information applicable to the target space.

Further work in this direction could have real-world impact. The most useful contribution in this area would be the development of larger and more diverse datasets. The limits of the dataset limit the strength and generality of the model.

Further exploration is also needed into techniques for generative models in low-data environments. For example, the incorporation of data from similar spaces proved a promising addition to our model, and large volumes of unlabeled data could also contribute to success on this and similar problems.

References

Deep learning for computer vision: Generative models and adversarial training (UPC 2016). URL https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-mode

Bianco, Simone. Large age-gap face verification by feature injection in deep networks. Pattern Recognition Letters, 90:36–42, 2017. doi: 10.1016/j.patrec.2017.03.006.

Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494, 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.