ICML - 2017 - ACGAN - Conditional Image Synthesis With Auxiliary Classifier GANs
Figure 1. 128 × 128 resolution samples from 5 classes taken from an AC-GAN trained on the ImageNet dataset. The classes shown have been selected to highlight the success of the model and are not representative. Samples from all ImageNet classes are linked later in the text.
• Demonstrate an image synthesis model for all 1000 ImageNet classes at a 128 × 128 spatial resolution (or any spatial resolution - see Section 3).
• Measure how much an image synthesis model actually uses its output resolution (Section 4.1).
• Measure perceptual variability and ‘collapsing’ behavior in a GAN with a fast, easy-to-compute metric (Section 4.2).
• Highlight that a high number of classes is what makes ImageNet synthesis difficult for GANs and provide an explicit solution (Section 4.6).
• Demonstrate experimentally that GANs that perform well perceptually are not those that memorize a small number of examples (Section 4.3).
• Achieve state of the art on the Inception score metric when trained on CIFAR-10 without using any of the techniques from (Salimans et al., 2016) (Section 4.4).

2. Background

A generative adversarial network (GAN) consists of two neural networks trained in opposition to one another. The generator G takes as input a random noise vector z and outputs an image $X_{\text{fake}} = G(z)$. The discriminator D receives as input either a training image or a synthesized image from the generator and outputs a probability distribution $P(S \mid X) = D(X)$ over possible image sources. The discriminator is trained to maximize the log-likelihood it assigns to the correct source:

$$L = \mathbb{E}[\log P(S = \text{real} \mid X_{\text{real}})] + \mathbb{E}[\log P(S = \text{fake} \mid X_{\text{fake}})] \quad (1)$$

The generator is trained to minimize the second term in Equation 1.

The basic GAN framework can be augmented using side information. One strategy is to supply both the generator and discriminator with class labels in order to produce class conditional samples (Mirza & Osindero, 2014). Class conditional synthesis can significantly improve the quality of generated samples (van den Oord et al., 2016b). Richer side information such as image captions and bounding box localizations may improve sample quality further (Reed et al., 2016a;b).

Instead of feeding side information to the discriminator, one can task the discriminator with reconstructing side information. This is done by modifying the discriminator to contain an auxiliary decoder network¹ that outputs the class label for the training data (Odena, 2016; Salimans et al., 2016) or a subset of the latent variables from which the samples are generated (Chen et al., 2016). Forcing a model to perform additional tasks is known to improve performance on the original task (e.g. Sutskever et al., 2014; Szegedy et al., 2014; Ramsundar et al., 2016). In addition, an auxiliary decoder could leverage pre-trained discriminators (e.g. image classifiers) for further improving the synthesized images (Nguyen et al., 2016). Motivated by these considerations, we introduce a model that combines both strategies for leveraging side information. That is, the model proposed below is class conditional, but with an auxiliary decoder that is tasked with reconstructing class labels.

¹ Alternatively, one can force the discriminator to work with the joint distribution (X, z) and train a separate inference network that computes q(z|X) (Dumoulin et al., 2016; Donahue et al., [...]
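To make the objective referred to as Equation 1 in the Background section concrete, the following is a minimal NumPy sketch of its two log-likelihood terms. It assumes the discriminator's source output has been reduced to a single number per image, the probability that the image is real; the function and argument names are hypothetical and are not taken from the paper.

```python
import numpy as np

def source_log_likelihood_terms(p_real_given_real, p_real_given_fake, eps=1e-12):
    """Return the two terms of the source log-likelihood.

    p_real_given_real: D's probability of "real" on a batch of training images.
    p_real_given_fake: D's probability of "real" on a batch of generated images.
    eps guards against log(0); it is an implementation convenience, not part
    of the objective itself.
    """
    p_rr = np.asarray(p_real_given_real, dtype=float)
    p_rf = np.asarray(p_real_given_fake, dtype=float)
    # First term: E[log P(S = real | X_real)]
    term_real = float(np.mean(np.log(p_rr + eps)))
    # Second term: E[log P(S = fake | X_fake)], with P(fake) = 1 - P(real)
    term_fake = float(np.mean(np.log(1.0 - p_rf + eps)))
    return term_real, term_fake

# The discriminator ascends term_real + term_fake; the generator descends term_fake.
t_real, t_fake = source_log_likelihood_terms([0.9, 0.8], [0.2, 0.1])
discriminator_objective = t_real + t_fake
```

A discriminator that cannot tell the sources apart outputs 0.5 everywhere, which gives log 0.5 for each term.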
Figure 2. Generating high resolution images improves discriminability. Top: Training data and synthesized images from the zebra
class resized to a lower spatial resolution (indicated above) and subsequently artificially resized to the original resolution (128 × 128 for
the red and black lines; 64 × 64 for the blue line). Inception accuracy is shown below the corresponding images. Bottom Left: Summary
of accuracies across varying spatial resolutions for training data and image samples from 64 × 64 and 128 × 128 models. Error bar
measures standard deviation across 10 subsets of images. Dashed lines highlight the accuracy at the output spatial resolution of the
model. The training data (clipped) achieves accuracies of 24%, 54%, 81% and 81% at resolutions of 32, 64, 128, and 256 respectively.
Bottom Right: Comparison of accuracy scores at 128 × 128 and 32 × 32 spatial resolutions (x and y axis, respectively). Each point
represents an ImageNet class. 84.4% of the classes are below the line of equality. The green dot corresponds to the zebra class. We
also artificially resized 128 × 128 and 64 × 64 images to 256 × 256 as a sanity check to demonstrate that simply increasing the number
of pixels will not increase discriminability.
[...] resolution AC-GAN (red) and a 64 × 64 resolution AC-GAN (blue) in Figure 2 (bottom, left). The black curve (clipped) provides an upper-bound on the discriminability of real images.

The goal of this analysis is to show that synthesizing higher resolution images leads to increased discriminability. The 128 × 128 model achieves an accuracy of 10.1% ± 2.0% versus 7.0% ± 2.0% with samples resized to 64 × 64 and 5.0% ± 2.0% with samples resized to 32 × 32. In other words, downsizing the outputs of the AC-GAN to 32 × 32 and 64 × 64 decreases visual discriminability by 50% and 38% respectively. Furthermore, 84.4% of the ImageNet classes have higher accuracy at 128 × 128 than at 32 × 32 (Figure 2, bottom right).

We performed the same analysis on an AC-GAN trained to 64 × 64 spatial resolution. This model achieved less discriminability than a 128 × 128 AC-GAN model. Accuracies from the 64 × 64 model plateau at a 64 × 64 spatial resolution, consistent with previous results. Finally, the 64 × 64 resolution model achieves less discriminability at 64 × 64 spatial resolution than the 128 × 128 model.

To the best of our knowledge, this work is the first that attempts to measure the extent to which an image synthesis model is ‘making use of its given output resolution’, and in fact is the first work to consider the issue at all. We consider this an important contribution, on par with proposing a model that synthesizes images from all 1000 ImageNet classes. We note that the proposed method can be applied to any image synthesis model for which a measure of ‘sample quality’ can be constructed. In fact, this method (broadly defined) can be applied to any type of synthesis model, as long as there is an easily computable notion of sample quality and some method for ‘reducing resolution’. In particular, we expect that a similar procedure can be carried out for audio synthesis.
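The resolution analysis above (shrink the samples, resize them back to their original size, then measure classifier accuracy) can be sketched as follows. This is a minimal illustration under stated assumptions: block averaging and pixel repetition stand in for whatever resize operation the experiments used, and `classify` is a hypothetical callable standing in for a pre-trained Inception network.

```python
import numpy as np

def reduce_resolution(images, target):
    """Downsample square images to target x target, then upsample back.

    images: float array of shape (N, H, W, C) with H == W and H divisible
    by target. Block averaging and nearest-neighbor repetition are assumed
    stand-ins for the resize used in the original experiments.
    """
    n, h, w, c = images.shape
    if h != w or h % target != 0:
        raise ValueError("expected square images whose size is a multiple of target")
    f = h // target
    # Average each f x f block to obtain the low-resolution image.
    low = images.reshape(n, target, f, target, f, c).mean(axis=(2, 4))
    # Repeat each pixel f times along height and width to restore H x W.
    return low.repeat(f, axis=1).repeat(f, axis=2)

def accuracy_by_resolution(images, labels, classify, resolutions):
    """Classifier accuracy on samples whose information content is reduced.

    classify: hypothetical callable mapping (N, H, W, C) images to an array
    of N predicted integer labels.
    """
    labels = np.asarray(labels)
    return {
        r: float(np.mean(classify(reduce_resolution(images, r)) == labels))
        for r in resolutions
    }
```

Because the reduced images are returned at the original size, the same classifier can be applied at every resolution, so any change in accuracy reflects lost detail rather than a change in input shape.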
4.2. Measuring the Diversity of Generated Images

[...] outputs one image. Indeed, a well-known failure mode of GANs is that the generator will collapse and output a single prototype that maximally fools the discriminator (Goodfellow et al. [...]). [...] one image per class. The Inception accuracy cannot measure whether a model has collapsed. A model that simply memorized one example from each ImageNet class would [...] complementary metric to explicitly evaluate the intra-class perceptual diversity of samples generated by the AC-GAN.

Figure 3. Examples of different MS-SSIM scores. The top and bottom rows contain AC-GAN samples and training data, respectively. Classes shown, left to right: hot dog, promontory, green apple, artichoke. MS-SSIM for the AC-GAN samples (top row): 0.11, 0.29, 0.41, 0.90. MS-SSIM for the training data (bottom row): 0.05, 0.15, 0.08, 0.04.
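The intra-class diversity measurement can be illustrated with a short sketch that averages a similarity score over randomly chosen pairs of images from one class. Two assumptions are made loudly here: `global_ssim` is a simplified single-scale, whole-image SSIM used only as a stand-in (it is not MS-SSIM), and the number of pairs is a hypothetical parameter. Any real MS-SSIM implementation can be passed in as `similarity`.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed over the whole image (stand-in, NOT MS-SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mean_pairwise_similarity(images, similarity=global_ssim, num_pairs=100, seed=0):
    """Mean similarity over random pairs of distinct images from one class.

    Higher values mean the images look more alike, i.e. lower diversity.
    num_pairs and seed are illustrative defaults, not values from the text.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    if n < 2:
        raise ValueError("need at least two images to form a pair")
    scores = []
    for _ in range(num_pairs):
        i, j = rng.choice(n, size=2, replace=False)  # two distinct indices
        scores.append(similarity(images[i], images[j]))
    return float(np.mean(scores))
```

A generator that has collapsed to one image per class yields a score close to 1 for that class, whereas a diverse set of images yields a much lower score.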
There are two points related to the MS-SSIM metric and our use of it that merit extra attention. The first point is that we are ‘abusing’ the metric: it was originally intended [...]

Figure 4. Comparison of the mean MS-SSIM scores between pairs of images within a given class for ImageNet training data and samples from the GAN (blue line is equality). The horizontal red line marks the maximum MS-SSIM value (for training data) across all ImageNet classes. Each point is an individual class. The mean score across the training data and the samples was 0.05 and 0.18 respectively. The mean standard deviation of scores across the training data and the samples was 0.06 and 0.08 respectively. Scores below the red line (84.7% of classes) arise from classes where GAN training largely succeeded.

Figure 5. Intra-class MS-SSIM for selected ImageNet classes throughout a training run. Classes that successfully train (black lines) tend to have decreasing mean MS-SSIM scores. Classes for which the generator ‘collapses’ (red line) will have increasing mean MS-SSIM scores.

We have presented quantitative metrics demonstrating that AC-GAN samples may be diverse and discriminable but we have yet to examine how these metrics interact. Figure 6 shows the joint distribution of Inception accuracies and MS-SSIM scores across all classes. Inception accuracy and MS-SSIM are anti-correlated (r² = −0.16). In fact, 74% of the classes with low diversity (MS-SSIM ≥ 0.25) have Inception accuracies ≤ 1%. Conversely, 78% of classes with high diversity (MS-SSIM < 0.25) have Inception accuracies that exceed 1%. In comparison, the Inception-v3 model achieves 78.8% accuracy on average across all 1000 classes (Szegedy et al., 2015). For a fraction of the classes, AC-GAN samples reach this level of accuracy. This indicates opportunity for future image synthesis research.

These results suggest that GANs that drop modes are most likely to produce low quality images. This stands in contrast to a popular hypothesis about GANs, which is that they achieve high sample quality at the expense of variability. We hope that these findings can help structure further investigation into the reasons for differing sample quality between GANs and other image synthesis models.

4.4. Comparison to Previous Results

Previous quantitative results for image synthesis models trained on ImageNet are reported in terms of log-likelihood (van den Oord et al., 2016a;b). Log-likelihood is a coarse and potentially inaccurate measure of sample quality (Theis et al., 2015). Instead we compare with previous state-of-the-art results on CIFAR-10 using a lower spatial resolution (32 × 32). Following the procedure in (Salimans et al., 2016), we compute the Inception score³ for 50000 samples from an AC-GAN with resolution (32 × 32), split into 10 groups at random. We also compute the Inception score for 25000 extra samples, split into 5 groups at random. We select the best model based on the first score and report the second score. Performing a grid search across 27 hyperparameter configurations, we are able to achieve a score of 8.25 ± 0.07 compared to state of the art 8.09 ± 0.07 (Salimans et al., 2016). Moreover, we accomplish this without employing any of the new techniques introduced in that work (i.e. virtual batch normalization, minibatch discrimination, and label smoothing).

This provides additional evidence that AC-GANs are effective even without the benefit of class splitting. See Figure 7 for a qualitative comparison of samples from an AC-GAN and samples from the model in (Salimans et al., 2016).

4.5. Searching for Signatures of Overfitting

One possibility that must be investigated is that the AC-GAN has overfit on the training data. As a first check that the network does not memorize the training data, we identify the nearest neighbors of image samples in the training data measured by L1 distance in pixel space (Figure 8). The nearest neighbors from the training data do not resemble the corresponding samples. This provides evidence that the AC-GAN is not merely memorizing the training data.

³ The Inception score is given by $\exp\left(\mathbb{E}_x\left[D_{\mathrm{KL}}\left(p(y|x) \,\|\, p(y)\right)\right]\right)$ where x is a particular image, p(y|x) is the conditional output distribution over the classes in a pre-trained Inception network (Szegedy et al., 2014) given x, and p(y) is the marginal distribution over the classes.
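The Inception score formula in footnote 3 can be computed directly from a matrix of class probabilities. The sketch below assumes the p(y|x) rows have already been produced by a pre-trained classifier; the function names, the `eps` guard, and the use of the sample mean as the marginal p(y) are implementation assumptions. The second function mirrors the practice of splitting samples into random groups and reporting the mean and standard deviation across groups.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(E_x[KL(p(y|x) || p(y))]) for an (N, K) array of class probabilities.

    Each row of probs is p(y|x) for one image; p(y) is estimated as the
    mean of the rows. eps only guards against log(0).
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def inception_score_over_groups(probs, num_groups, seed=0):
    """Mean and standard deviation of the score over random groups of samples."""
    probs = np.asarray(probs, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(probs))
    scores = [inception_score(probs[g]) for g in np.array_split(order, num_groups)]
    return float(np.mean(scores)), float(np.std(scores))
```

Two limiting cases help check the formula: if every image receives the same predictive distribution the score is 1, and if predictions are confident and spread evenly over K classes the score approaches K.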
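The overfitting check in Section 4.5 (finding, for each generated sample, the training image closest in L1 distance in pixel space) can be sketched as a brute-force search. This is a minimal illustration; the function name and the array layout are assumptions, and a full-scale run over ImageNet would likely need batching or an approximate index.

```python
import numpy as np

def nearest_neighbors_l1(samples, train):
    """For each sample, return the index of and distance to the nearest
    training image under L1 distance in pixel space.

    samples: array of shape (M, H, W, C); train: array of shape (N, H, W, C).
    """
    s = np.asarray(samples, dtype=np.float64).reshape(len(samples), -1)
    t = np.asarray(train, dtype=np.float64).reshape(len(train), -1)
    indices = np.empty(len(s), dtype=np.int64)
    distances = np.empty(len(s), dtype=np.float64)
    # Loop over samples so that memory stays proportional to the training set.
    for k, row in enumerate(s):
        d = np.abs(t - row).sum(axis=1)  # L1 distance to every training image
        indices[k] = d.argmin()
        distances[k] = d[indices[k]]
    return indices, distances
```

A distance near zero for many samples would suggest memorization; the returned indices also let the nearest training images be displayed next to the samples for visual inspection.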
4.6. Measuring the Effect of Class Splits on Image Sample Quality

Class conditional image synthesis affords the opportunity to divide up a dataset based on image label. In our final model we divide 1000 ImageNet classes across 100 AC-GAN models. In this section we describe experiments that highlight the benefit of cutting down the diversity of classes for training an AC-GAN. We employed an ordering of the labels and divided it into contiguous groups of 10. This ordering can be seen in the following section, where we display samples from all 1000 classes. Two aspects of the split merit discussion: the number of classes per split and the intra-split diversity. We find that training a fixed model on more classes harms the model’s ability to produce compelling samples (Figure 10). Performance on larger splits can be improved by giving the model more parameters. However, using a small split is not sufficient to achieve good performance. We were unable to train a GAN (Goodfellow et al., 2014) to converge reliably even for a split size of 1. This raises the question of whether it is easier to train a model on a diverse set of classes than on a similar set of classes: We were unable to find conclusive evidence that the selection of classes in a split significantly affects sample quality.

Figure 10. Mean pairwise MS-SSIM values for 10 ImageNet classes plotted against the number of ImageNet classes used during training. We fix everything except the number of classes trained on, using values from 10 to 100. We only report the MS-SSIM values for the first 10 classes to keep the scores comparable. MS-SSIM quickly goes above 0.25 (the red line) as the class count increases. These scores were computed using 9 random restarts per class count, using the same number of training steps for each model. Since we have observed that generators do not recover from the collapse phase, the use of a fixed number of training steps seems justified in this case.

We don’t have a hypothesis about what causes this sensitivity to class count that is well-supported experimentally. We can only note that, since the failure case that occurs when the class count is increased is ‘generator collapse’, it seems plausible that general methods for addressing ‘generator collapse’ could also address this sensitivity.

4.7. Samples from all 1000 ImageNet Classes

We also generate 10 samples from each of the 1000 ImageNet classes, hosted here. As far as we are aware, no other image synthesis work has included a similar analysis.

5. Discussion

This work introduced the AC-GAN architecture and demonstrated that AC-GANs can generate globally coherent ImageNet samples. We provided a new quantitative metric for image discriminability as a function of spatial resolution. Using this metric we demonstrated that our samples are more discriminable than those from a model that generates lower resolution images and performs a naive resize operation. We also analyzed the diversity of our samples with respect to the training data and provided some evidence that the image samples from the majority of classes are comparable in diversity to ImageNet data.

Several directions exist for building upon this work. Much work needs to be done to improve the visual discriminability of the 128 × 128 resolution model. Although some synthesized image classes exhibit high Inception accuracies, the average Inception accuracy of the model (10.1% ± 2.0%) is still far below real training data at 81%. One immediate opportunity for addressing this is to augment the discriminator with a pre-trained model to perform additional supervised tasks (e.g. image segmentation, (Ronneberger et al., 2015)).

Improving the reliability of GAN training is an ongoing research topic. Only 84.7% of the ImageNet classes exhibited diversity comparable to real training data. Training stability was vastly aided by dividing up 1000 ImageNet classes across 100 AC-GAN models. Building a single model that could generate samples from all 1000 classes would be an important step forward.

Image synthesis models provide a unique opportunity for performing semi-supervised learning: these models build a rich prior over natural image statistics that can be leveraged by classifiers to improve predictions on datasets for which few labels exist. The AC-GAN model can perform semi-supervised learning by ignoring the component of the loss arising from class labels when a label is unavailable for a given training image. Interestingly, prior work suggests that achieving good sample quality might be independent of success in semi-supervised learning (Salimans et al., 2016).
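The semi-supervised variant mentioned in the Discussion, in which the class-label component of the loss is ignored for images without a label, can be sketched as a masked log-likelihood. The sentinel value `UNLABELED`, the function name, and the choice to average over labeled examples only are hypothetical illustration choices rather than details from the paper.

```python
import numpy as np

UNLABELED = -1  # hypothetical sentinel marking a training image with no label

def masked_class_log_likelihood(class_probs, labels, eps=1e-12):
    """Mean log-probability of the correct class over labeled examples only.

    class_probs: (N, K) array of the auxiliary classifier's output probabilities.
    labels: length-N integer array; entries equal to UNLABELED contribute nothing.
    Returns 0.0 when the batch contains no labeled example.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    labels = np.asarray(labels)
    rows = np.flatnonzero(labels != UNLABELED)  # indices of labeled examples
    if rows.size == 0:
        return 0.0
    picked = class_probs[rows, labels[rows]]  # probability of each true class
    return float(np.mean(np.log(picked + eps)))
```

The source (real versus generated) part of the objective would still be computed on every image; only the class term is masked, so unlabeled images continue to shape the discriminator.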
A. Hyperparameters
We summarize hyperparameters used for the ImageNet model in Table 1 and for the CIFAR-10 model in Table 2.
Table 1. ImageNet hyperparameters. A Soft-Sigmoid refers to an operation over K +1 output units where we apply a Softmax activation
to K of the units and a Sigmoid activation to the remaining unit. We also use activation noise in the discriminator as suggested in
(Salimans et al., 2016).
Table 2. CIFAR-10 hyperparameters. When a list is given for a hyperparameter it means that we performed a grid search using the
values in the list. For each set of hyperparameters, a single AC-GAN was trained on the whole CIFAR-10 dataset. For each AC-GAN
that was trained, we split up the samples into groups so that we could give some sense of the variance of the Inception Score. To the best
of our knowledge, this is identical to the analysis performed in (Salimans et al., 2016).