
Sketch to Image Translation using GANs

Lisa Fan, Jason Krone, Sam Woolf


Tufts University, MA, United States
{Lisa.Fan, Jason.Krone, Samuel.Woolf}@tufts.edu

Abstract

In this work we explore the effect of using a discriminator with 2N output classes (real and fake scores for each target class), as well as different types of loss functions, on the quality of images generated using an image-to-image cGAN (Conditional Generative Adversarial Neural Network). Specifically, we experiment with two different loss functions: the first is a fairly standard cross entropy loss (we call this the 2N loss) and the second attempts to take advantage of extra information provided by our 2N classification scheme (we call this the penalty loss). We find that GANs trained using our 2N loss and penalty loss produce images that are of similar if not better quality than the standard GAN loss.

1. Introduction

Image generation is a valuable tool in the computer vision field as well as the world outside it. For example, image generation can be used in unsupervised contexts to generate training images for sparse categories. Outside of the field, image generation has many artistic use cases. Existing work, such as iGAN [9], has shown the success of using Generative Adversarial Networks (GANs) to create artwork. In this paper we further explore the use of GANs to create artwork by applying existing image-to-image translation techniques to generate photos from sketches.

The traditional implementations of GANs utilize discriminators that solely output a measurement of the real/fake quality, or realness, of an input image. These metrics seem to work in most contexts. However, in a few contexts, the input images have the potential to be coupled with additional information, including aspects such as image class or category. This paper explores the augmentation of traditional GAN discriminators in order to produce outputs that can utilize this extra information. In theory, when a discriminator can make inferences about the class of an image, one can implement a nuanced loss function that incorporates this extra information. Then, both the discriminator and generator can be more effectively updated based off of this additional information.

In previous works, discriminators have been supplemented to produce output vectors of length N+1 instead of the traditional realness output. This N+1 vector gives information on class as well as realness. Our paper presents an additional improvement that furthers this idea. We explore the concept of a discriminator outputting a vector of length 2N, where the first N entries correspond to the image classes of real images, and the second N entries refer to the image classes of generated images. As we show, this additional information allows the discriminator and generator to be more efficiently updated, which in turn leads to better results.

2. Background & Related Work

As we developed a novel strategy for generating images using cGANs, we were heavily influenced by existing work. Specifically, the concept and loss function for a multi-class GAN discriminator is based off of ideas presented in "Improved Techniques for Training GANs" [6]. Our core network architecture is based upon a network presented in "Image-to-Image Translation with Conditional Adversarial Networks" [2].

In the original GAN model proposed by Ian Goodfellow, the GAN discriminator has a single probabilistic output and attempts to decipher whether an input image is real or generated. Then, a simple loss based off of this sole value is used to update both the generator and discriminator [1]. As first proposed in "Improved Techniques" [6], the discriminator can be restructured so that it gives us more information than just a simple probability. The authors suggest creating a discriminator that has an output of N+1 values, where N is the number of classes in the training data set. In this output, the first N values correspond to the classes, and the (N+1)th value corresponds to any generated image. By creating a discriminator that outputs class information instead of a probabilistic realness, one is able to calculate a better informed loss, and thus better refine both the generator and discriminator.
In this paper, we take the concept one step further, attempting to create a more expressive discriminator by increasing the number of categories in the output of the discriminator. Our discriminator now has an output of 2N classes: N classes for real images, and N classes for generated images. Our hypothesis is that by utilizing this extra information, our network will calculate a more nuanced loss, and then be able to more efficiently improve both the generator and the discriminator, leading to a more effective GAN.

The "Image-to-Image" paper [2] presents a novel way to train a conditional GAN as a solution for image-to-image translation. Specifically, the network uses a generator to first encode an image to a high-level representation, and subsequently decode the representation into a generated image. By training the cGAN on input-output image pairs, one can train the generator to create images that are, in theory, indistinguishable from the given output population. Our novel approach relies heavily on the U-Net architecture of the generator and the convolutional layer architecture of the discriminator proposed in this paper.

3. Approach

3.1. Architecture

We use a pre-existing GAN implementation provided by the authors of "Image-to-Image" [2] as the basis for our model. The generator has two components: an encoder component, which takes the given sketch s and downsamples it to create a lower dimensional representation φ(s), and a decoder component, which takes a vector containing φ(s) and produces an image. The generator contains skip connections between the ith layer of the decoder and layer 8 − i of the encoder. The architecture for the generator is as follows:

• Encoder: C64-C128-C256-C512-C512-C512-C512-C512

• Decoder: CD512-CD512-CD512-C512-C512-C256-C128-C64

where C stands for a convolution and CD stands for a deconvolution. All of the convolutions use 4x4 spatial filters applied with stride 2. Convolutions in the encoder downsample by a factor of 2, and convolutions in the decoder upsample by a factor of 2. Leaky ReLU activation functions with a leak of 0.2 are used between layers in the encoder, and standard ReLU activations are used between layers in the decoder.
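To make the layer notation concrete, the following is a minimal sketch of one encoder stage, one decoder stage, and a U-Net-style skip connection (assuming PyTorch; normalization and dropout are omitted, only two of the eight stages are shown, and this is an illustration rather than the released "Image-to-Image" code).

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # Encoder stage: 4x4 convolution, stride 2 (downsample by 2), LeakyReLU with leak 0.2.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2),
    )

def up_block(in_ch, out_ch):
    # Decoder stage: 4x4 transposed convolution, stride 2 (upsample by 2), standard ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.ReLU(),
    )

# Toy forward pass for a single skip connection: the decoder consumes the concatenation
# of the previous decoder activation and the mirrored encoder activation.
enc1 = down_block(3, 64)       # C64
enc2 = down_block(64, 128)     # C128
dec1 = up_block(128, 64)
dec2 = up_block(64 + 64, 3)    # 64 channels from dec1 plus 64 from the skip connection

s = torch.randn(1, 3, 256, 256)          # input sketch s
e1 = enc1(s)                             # 1 x 64 x 128 x 128
e2 = enc2(e1)                            # 1 x 128 x 64 x 64
d1 = dec1(e2)                            # 1 x 64 x 128 x 128
out = dec2(torch.cat([d1, e1], dim=1))   # skip connection with e1, back to 1 x 3 x 256 x 256
```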
All of the convolutions in the discriminator use 4x4 spatial filters with a stride of 2, except for the final layer, which uses a stride of 1. Leaky ReLU activations with a leak of 0.2 are used in between the convolutional layers. Both the generator and discriminator are trained using the Adam update rule [3] with a learning rate of 0.0002 and momentum of 0.5. The discriminator produces a 30x30 output that corresponds to the realness of different patches of the input image. This out-of-the-box implementation was used as a baseline network to compare against our own models.

We modify the "Image-to-Image" [2] discriminator described above by adding a fully-connected layer to the end of the network, which outputs a 2N-dimensional vector of logits. By including N fake classes in our output, rather than a single fake class as described in "Improved Techniques" [6], we increase the discriminator's power to learn lower level features that differentiate between fake images of different objects. In contrast, having only a single class that represents fake images of all object categories forces the discriminator to look for high level features shared by all generated images that indicate an image is fake. Our discriminator has the following architecture:

C128-C256-C512-C1-FC125

The 2N-dimensional output vector has the form:

output = [l_1R, ..., l_NR, l_1F, ..., l_NF]

where R denotes a real object class, F denotes a fake object class, and N is the number of classes in our dataset. In this formulation, a class represents a real or fake photo of a particular type of object.

These logits can be turned into class probabilities using a softmax:

p_model(y = j | x) = exp(l_j) / Σ_{i=1}^{2N} exp(l_i)

We use these class probabilities to calculate our 2N loss and penalty loss, which we describe in the following sections.
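As an illustration of how these 2N logits are consumed, the snippet below applies the softmax above and splits the resulting distribution into its real and fake blocks (a sketch assuming PyTorch, with N = 125 as in our Sketchy subset; the logits are a random stand-in for the discriminator output).

```python
import torch
import torch.nn.functional as F

N = 125                            # number of object classes
logits = torch.randn(8, 2 * N)     # hypothetical discriminator output for a batch of 8 images

probs = F.softmax(logits, dim=1)   # p_model(y = j | x) over all 2N classes
real_probs = probs[:, :N]          # entries 1..N: real object classes
fake_probs = probs[:, N:]          # entries N+1..2N: fake (generated) object classes

# Sanity check: each row is a single 2N-way distribution that sums to one.
assert torch.allclose(probs.sum(dim=1), torch.ones(8), atol=1e-5)
```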
3.2. 2N Cross Entropy Loss

We will first discuss the 2N cross entropy loss function, which is the simpler of the two loss functions used in our experiments. This 2N cross entropy loss function is inspired by the supervised component of the N+1 loss function outlined in the introduction and proposed in "Improved Techniques" [6]. The discriminator loss L_D contains two terms. The first term is a cross entropy loss for a real image x and sketch s pair taken from our training data distribution p_data, with ground truth class y and target class y. The second term is a cross entropy loss for the image G(s) generated from a sketch s with ground truth class y and target class y. The loss L_D is described by the following equation:

L_D = −( E_{x,s,y ∼ p_data(x,s,y)} [ log p_model(y | x, s, y ≤ N) ]
      + E_{s,y ∼ p_data(s,y)} [ log p_model(y | G(s), s, N < y ≤ 2N) ] )    (1)
Similarly, the generator loss L_G contains two terms. The first term is a cross entropy loss for the image G(s) generated from a sketch s with ground truth class y and target class y − N. The target class is y − N in this case because the generator wants the image G(s) to be classified as a real image of the object depicted in sketch s, and y − N is the index of that class in the output vector. The second term in this loss is the L1 distance between the generated image G(s) and the ground truth image x, weighted by a hyperparameter λ. This L1 term encourages the generator to produce images that are close to the ground truth photo. The loss L_G is given by the equation:

L_G = −E_{s,y ∼ p_data(s,y)} [ log p_model(y − N | G(s), s, N < y ≤ 2N) ] + λ L_L1(G)    (2)
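For concreteness, the sketch below shows one way equations (1) and (2) could be computed with a standard cross entropy routine (assuming PyTorch; the 0-based class indexing, tensor names, and the λ value are illustrative assumptions, not our exact training code).

```python
import torch
import torch.nn.functional as F

N = 125
lam = 100.0                                # weight λ on the L1 term (placeholder value)

def d_loss_2n(real_logits, fake_logits, y):
    # Equation (1): real images are targeted at their real class y,
    # generated images at the corresponding fake class y + N.
    return F.cross_entropy(real_logits, y) + F.cross_entropy(fake_logits, y + N)

def g_loss_2n(fake_logits, y, generated, target_photo):
    # Equation (2): the generator wants its output classified as the real class
    # (index y here, i.e. y - N relative to the fake block), plus an L1 term
    # pulling G(s) toward the ground truth photo.
    return F.cross_entropy(fake_logits, y) + lam * F.l1_loss(generated, target_photo)

# Toy usage with random stand-ins for discriminator outputs and images.
y = torch.randint(0, N, (8,))
d_loss = d_loss_2n(torch.randn(8, 2 * N), torch.randn(8, 2 * N), y)
g_loss = g_loss_2n(torch.randn(8, 2 * N), y, torch.rand(8, 3, 256, 256), torch.rand(8, 3, 256, 256))
```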
3.3. Penalty Loss

The 2N cross entropy loss makes use of our 2N-dimensional output; however, it does not take into account much of the additional information provided by the 2N representation. For instance, it does not differentiate a misclassification of the object category from a misclassification of realness. The penalty loss aims to make use of this additional information by weighting the cross entropy terms used in the 2N losses by constant penalty values, which vary depending on the type of misclassification. For a class prediction ŷ with target class y, our penalty function pen(y, ŷ) is as follows:

pen(y, ŷ) = a  if obj(y) = obj(ŷ) and is-fake(y) ≠ is-fake(ŷ)
            b  if obj(y) ≠ obj(ŷ) and is-fake(y) = is-fake(ŷ)
            c  if obj(y) ≠ obj(ŷ) and is-fake(y) ≠ is-fake(ŷ)

where obj() returns the type of object represented by the given class, is-fake() determines whether the given class represents a fake image, and a, b, c are hyperparameters that can be chosen in cross validation. Using this penalty function we define our discriminator loss L_D to be:

L_D = −( E_{x,s,y ∼ p_data(x,s,y)} [ log p_model(y | x, s, y ≤ N) ] · pen(y, ŷ)
      + E_{s,y ∼ p_data(s,y)} [ log p_model(y | G(s), s, N < y ≤ 2N) ] · pen(y, ŷ) )    (3)

Our generator also weights the cross entropy term by the output of the penalty function and is given by the equation below. Note that we pass y − N into the penalty function as the target class for the generated image G(s) because the generator wants the image to be classified as a real image of the object depicted in sketch s.

L_G = −E_{s,y ∼ p_data(s,y)} [ log p_model(y − N | G(s), s, N < y ≤ 2N) ] · pen(y − N, ŷ) + λ L_L1(G)    (4)
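The sketch below illustrates one possible realization of this penalty weighting for equations (3) and (4): the per-example cross entropy is scaled by pen(target, prediction), with the prediction ŷ taken as the argmax over the 2N logits (assuming PyTorch; the a, b, c values are placeholders, and keeping a weight of 1 for fully correct predictions is an assumption the equations leave unspecified).

```python
import torch
import torch.nn.functional as F

N = 125
a, b, c = 1.0, 2.0, 3.0        # penalty hyperparameters, to be chosen via cross validation

def pen(y, y_hat):
    # Real classes occupy indices 0..N-1 and fake classes N..2N-1, so obj() is the index
    # modulo N and is-fake() is whether the index falls in the second block.
    same_obj = (y % N) == (y_hat % N)
    same_realness = (y < N) == (y_hat < N)
    w = torch.ones_like(y, dtype=torch.float)     # weight 1 when both object and realness match
    w[same_obj & ~same_realness] = a              # right object, wrong realness
    w[~same_obj & same_realness] = b              # wrong object, right realness
    w[~same_obj & ~same_realness] = c             # wrong object and wrong realness
    return w

def penalty_cross_entropy(logits, targets):
    per_example = F.cross_entropy(logits, targets, reduction="none")
    predictions = logits.argmax(dim=1)
    return (per_example * pen(targets, predictions)).mean()

# Toy usage: the discriminator term for generated images, whose targets live in N..2N-1.
logits = torch.randn(8, 2 * N)
y = torch.randint(0, N, (8,))
loss_fake_term = penalty_cross_entropy(logits, y + N)
```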
Figure 1: Examples of sketch-photo pairs. The bottom row displays examples of photos cropped using the segmentation mask.

4. Experiment

4.1. Dataset

We used the Sketchy Database (http://sketchy.eye.gatech.edu/), a large-scale collection of sketch-photo pairs created by Georgia Tech to perform image retrieval using deep learning. This database contains 12,500 images from a subset of 125 categories from Imagenet. The creators asked participants on Amazon Mechanical Turk to sketch the target object in the images, so that each image ended up with about 5 hand-drawn sketches, for a total of 75,471 sketches in the final dataset. We eliminated 10,918 sketches that the creators had marked as ambiguous, erroneous, having an incorrect pose, or including environment details. Our final training size was 43,020 sketch-photo pairs.

4.2. Image Segmentation

During preliminary testing of our cGAN sketch-to-photo network, we noticed a consistent issue with our output images. As our image output population is comprised entirely of photographs, the images often have cluttered backgrounds. We surmised that our generator is often learning to emulate the background instead of focusing on the requested object. In the class of airplane, this background emulation is not a problem, as most photo backgrounds there are blue and uniform. The background becomes a greater issue in classes such as eyeglasses, where the image is cluttered with faces, hair, and other distracting elements. We hypothesized that by cropping our image set to only include the key object, we would see a much higher quality in the generated images.

In order to create a segmentation mask for our dataset, we adapted the findings proposed in "Fully Convolutional Networks for Semantic Segmentation" [4], using models created in "Deep Residual Learning for Instrument Segmentation in Robotic Surgery" [5].
This allowed us to utilize a model trained on the PASCAL VOC Image Segmentation Dataset. This model used 25 classes for segmentation, which limited the quantity of images we were able to effectively crop. By implementing the segmentation mask model on the Sketchy Database, we were able to produce a dataset with 15 object classes and over 9,000 coupled sketches and cropped images (see Figure 1). Due to time constraints, we only trained the baseline model with this segmented dataset.
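As an illustration of the cropping step, the sketch below crops a photo to the bounding box of a binary segmentation mask (assuming NumPy; this is not our actual preprocessing code, and blanking the remaining background to white is an assumption).

```python
import numpy as np

def crop_to_mask(photo, mask, pad=8):
    # photo: H x W x 3 array; mask: H x W array with nonzero entries on the target object.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return photo                                  # no detected object: keep the full photo
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, photo.shape[0] - 1)
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, photo.shape[1] - 1)
    cropped = photo[y0:y1 + 1, x0:x1 + 1].copy()
    cropped[mask[y0:y1 + 1, x0:x1 + 1] == 0] = 255    # blank out background inside the crop
    return cropped

# Toy usage with a synthetic photo and mask.
photo = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)
mask[80:180, 60:200] = 1
crop = crop_to_mask(photo, mask)
```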
Figure 2: Images generated during training of the penalty loss model. Successful generations on the left, unsuccessful generations on the right.

4.3. Class Conditional Generator

In addition to our two proposed loss functions, we conducted an experiment in which we gave the target class as a conditional to the generator. We found that the crude sketches in our dataset often shared key features across classes. For example, sketches with a striped pattern often generated a "zebra-like" image with white and black stripes regardless of the rest of the sketch. By explicitly giving class information to the generator, we hoped to produce images that were more closely related to the class the sketch was based on.

We appended a one-hot encoding of the class to the vector produced at the end of encoding the input image. This modified vector was then decoded by the generator as in the above architecture to produce an image. Due to time constraints, we only trained the class conditional generator with the baseline loss functions.
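A minimal sketch of this conditioning step is shown below (assuming PyTorch; the batch and bottleneck sizes are placeholders, and the actual bottleneck is a spatial feature map rather than the flat vector used here).

```python
import torch
import torch.nn.functional as F

N = 125                                  # number of classes
bottleneck = torch.randn(8, 512)         # φ(s): encoded sketches for a batch of 8
labels = torch.randint(0, N, (8,))       # target class for each sketch

one_hot = F.one_hot(labels, num_classes=N).float()      # 8 x 125
conditioned = torch.cat([bottleneck, one_hot], dim=1)   # 8 x (512 + 125)
# `conditioned` is what the decoder consumes in place of the plain bottleneck vector.
```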
4.4. Evaluation Methods

The state of the art for evaluating Generative Adversarial Networks is still being developed. We explore three different quantitative evaluation techniques to evaluate our models.

First, our implementation of this 2N output discriminator has the handy feature that it can double as a classifier. This is in contrast to a typical GAN discriminator that only outputs a realness classification. Utilizing this idea, we evaluated the accuracy of our trained discriminator as a classifier on real photos.

Next, we used the Inception score method introduced in [6], which applies a pre-trained Inception Network [7] to each generated image and computes the conditional class distribution. The distribution is expected to have low entropy for a single image, since we expect realistic images to be classified confidently by the network. The method also expects the marginal distribution across all generated images to have high entropy, since the generated images ought to be varied from one another. These distributions are compared using KL-divergence, so that a higher Inception score indicates more realistic images. Previous work has found that scores for real images range from around 11.0 to 26.0, while scores reported for generated images range from around 8.0 to 9.0 [6, 8].
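The following sketch shows the Inception score computation described above, starting from a matrix of per-image class probabilities (assuming NumPy; it illustrates the formula from [6] rather than our evaluation script, and it omits the usual averaging over several splits of the generated images).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: M x K matrix, one row of Inception class probabilities p(y|x) per generated image.
    marginal = probs.mean(axis=0, keepdims=True)                   # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))    # pointwise KL(p(y|x) || p(y)) terms
    return float(np.exp(kl.sum(axis=1).mean()))                    # exp of the mean KL divergence

# Toy usage with random rows normalized into probability distributions.
raw = np.random.rand(500, 1000)
probs = raw / raw.sum(axis=1, keepdims=True)
score = inception_score(probs)
```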
Finally, while the Inception score method is suitable for quantifying the realness of images generated by non-conditional GANs, it does not evaluate how competent a conditional generator is in generating class conditional images. To do so, we apply a pre-trained Inception Network to our generated images, and see whether the network is able to predict the image's conditional class. We calculate accuracy from the Top 1 and Top 5 classes predicted by the Inception Network.
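Top-1 and Top-5 accuracy can be computed from the same probability matrix; the sketch below is a minimal illustration with synthetic inputs (assuming NumPy).

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    # An image counts as correct if its conditional class is among the k highest-scoring predictions.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

probs = np.random.rand(500, 1000)               # Inception class scores per generated image
labels = np.random.randint(0, 1000, size=500)   # the class each image was conditioned on
top1 = top_k_accuracy(probs, labels, k=1)
top5 = top_k_accuracy(probs, labels, k=5)
```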

4.5. Results

Subjectively observing our generated images, we saw that our models generated a variety of images. See Figure 2 for some examples of good and bad generated images. We found that categories that had little background noise (like airplane), showed the target object in a consistent shape or pose (like mushroom), or had consistent features such as color or texture (like strawberry) often generated better images than noisy, inconsistent categories (like musical instruments and animals). We also found that training on the segmented images successfully generated many images with similar shape to the target object.
To see examples of images generated by all of the models, see Figure 3.

To evaluate the discriminator as a standalone classifier, we classified 10,809 validation photos with the discriminators trained using the 2N loss and the penalty loss. The 2N loss discriminator was run for 50,000 iterations, and we used two penalty loss discriminators that trained for 50,000 iterations and 134,000 iterations respectively. See Table 1 for results. We were unable to compare these results to the baseline model because its loss function does not produce an output with class information. The results show that the accuracy decreases dramatically after training the penalty loss model for longer. We believe this is due to the real photos being misclassified as fake. Since we wish to test the discriminator as a classifier only on real photos, future work will classify images based on the first N elements of the output, which are the elements representing real class scores.

    Model                        Accuracy
    2N Loss (50k steps)          26.98%
    Penalty Loss (50k steps)     29.09%
    Penalty Loss (134k steps)    10.26%

Table 1: Accuracy when classifying validation photos using the standalone discriminator.
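The planned change is small: instead of taking the argmax over all 2N outputs, classification of real photos would be restricted to the first N (real-class) entries. The sketch below contrasts the two prediction rules (assuming PyTorch; the logits are synthetic).

```python
import torch

N = 125
logits = torch.randn(8, 2 * N)                  # stand-in for discriminator outputs on real photos

pred_full = logits.argmax(dim=1)                # may land in the fake block (indices N..2N-1)
pred_real_only = logits[:, :N].argmax(dim=1)    # restricted to the N real-class scores
```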
We computed the Inception score on 10,809 validation photos, the images generated from those sketch-photo pairs, 1,560 segmented validation photos, and the images generated from those sketch-segmented photo pairs, using the five models explained above, each trained for 50,000 iterations. See Table 2 for results. While the Inception scores for our models are lower than scores previously reported by other papers due to the fewer training iterations, we see that the models using the 2N loss and penalty loss slightly outperform the baseline model.

    Model                          Mean     Std Dev
    Ground Truth Photos            74.81    1.40
    Baseline Model                 5.26     0.16
    Class Conditional Generator    4.25     0.07
    2N Loss Model                  6.11     0.11
    Penalty Model                  6.20     0.10
    Segmented Photos               13.33    0.90
    Trained on Segmented Photos    5.96     0.25

Table 2: Inception scores for the various models.

Results from classifying our generated images using an Inception network can be seen in Table 3. While the low accuracy shows that there is still much room for improvement, our models are still being classified at a rate better than random, since Imagenet has 1000 categories. We see higher accuracies for the class conditional generator model due to the generator explicitly receiving class information. We also see higher accuracies for the model trained on segmented photos, since that model excels in generating images that have shapes similar to the target object.

    Model                          Top 1     Top 5
    Ground Truth Photos            71.90%    79.04%
    Baseline Model                 0.83%     2.36%
    Class Conditional Generator    1.05%     3.13%
    2N Loss Model                  0.48%     1.90%
    Penalty Model                  0.85%     2.44%
    Segmented Photos               40.58%    60.51%
    Trained on Segmented Photos    1.99%     4.42%

Table 3: Top 1 and Top 5 accuracies for classifying generated images using the Inception network.

5. Conclusion

The results in this paper suggest that using a 2N class discriminator for cGANs has great promise, as these networks show results that are competitive with previously proposed methods. More work needs to be done to fully understand the potential of this approach. Additionally, this paper demonstrates the possibility of generating photographic images from an input of hand-drawn sketches. We feel that both our understanding of the model and the quality of the generated photos would benefit from three clear next steps: implementing a conditional version of the N+1 class discriminator proposed in "Improved Techniques for Training GANs" [6] to use as a baseline, training our models for longer (about 200 epochs), and learning penalty values for the penalty loss via cross validation. We hope that this work sparks interest in both using GANs to augment sketches and experimenting with a 2N class discriminator.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[2] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[3] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[4] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1605.06211v1, 2016.
[5] D. Pakhomov, V. Premachandran, M. Allan, M. Azizian, and N. Navab. Deep residual learning for instrument segmentation in robotic surgery. arXiv preprint arXiv:1703.08580, 2017.
[6] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[8] D. Warde-Farley and Y. Bengio. Improving generative adversarial networks with denoising feature matching. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[9] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Figure 3: Examples of inputs and outputs of the various models. Each row corresponds to a sketch. The columns, from left
to right, correspond to: 1. Input Sketches; 2. Target Photos; 3. Segmented Target Photos; 4. Baseline Model; 5. Class
Conditional Generator; 6. 2N Loss Model; 7. Penalty Loss Model; 8. Trained on Segmented Photos.
