CycleGAN: Learning to Translate Images (Without Paired Training Data)
by Sarah Wolf
Image-to-image translation is the task of transforming an image from one domain (e.g.,
images of zebras), to another (e.g., images of horses). Ideally, other features of the
image — anything not directly related to either domain, such as the background —
should stay recognizably the same. As we might imagine, a good image-to-image
translation system could have an almost unlimited number of applications. Changing art
styles, going from sketch to photo, or changing the season of the landscape in a photo
are just a few examples.
Examples of paired and unpaired data. *Image taken from the paper.
While there has been a great deal of research into this task, most of it has utilized
supervised training, where we have access to (x, y) pairs of corresponding images from
the two domains we want to learn to translate between. CycleGAN was introduced in the
now well-known 2017 paper out of Berkeley, Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks. It was interesting because it did not
require paired training data — while an x and y set of images are still required, they do
not need to directly correspond to each other. In other words, if you wanted to translate
between sketches and photos, you still need to train on a bunch of sketches and a bunch
of photos, but the sketches would not need to be of the exact photos in your dataset.
Since paired data is harder to find in most domains, and not even possible in some, the
unsupervised training capabilities of CycleGAN are quite useful.
A CycleGAN consists of two generators and two discriminators. We call one generator G,
and have it convert images from the X domain to the Y domain. The other generator is
called F, and converts images from Y to X.
Both G and F are generators that take an image from one domain and translate it to another. G maps from X to
Y, whereas F goes in the opposite direction, mapping Y to X.
Each generator has a corresponding discriminator, which attempts to tell apart its
synthesized images from real ones.
One discriminator provides adversarial training for G, and the other does the same for F.
If you are familiar with GANs, the adversarial loss should come as no surprise. Both
generators are attempting to “fool” their corresponding discriminator into being less
able to distinguish their generated images from the real versions. We use the least
squares loss (found by Mao et al. to be more effective than the typical log-likelihood loss)
to capture this.
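As a concrete sketch of the least squares adversarial loss, here is a minimal PyTorch version (the function names are my own, not from the paper; d_real and d_fake are the discriminator's outputs on real and generated images):

```python
import torch

def lsgan_discriminator_loss(d_real, d_fake):
    # Least squares GAN loss for a discriminator: push scores on
    # real images toward 1 and scores on generated images toward 0.
    return torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)

def lsgan_generator_loss(d_fake):
    # The generator tries to make the discriminator score its
    # outputs as real, i.e. close to 1.
    return torch.mean((d_fake - 1) ** 2)
```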
However, the adversarial loss alone is not sufficient to produce good images, as it leaves
the model under-constrained. It enforces that the generated output be of the appropriate
domain, but does not enforce that the input and output are recognizably the same. For
example, a generator that output an image y that was an excellent example of that
domain, but looked nothing like x, would do well by the standard of the adversarial loss,
despite not giving us what we really want.
The cycle consistency loss addresses this issue. It relies on the expectation that if you
convert an image to the other domain and back again, by successively feeding it through
both generators, you should get back something similar to what you put in. It enforces
that F(G(x)) ≈ x and G(F(y)) ≈ y.
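In code, this term is just an L1 distance between each image and its round-trip reconstruction. A minimal sketch, assuming real_x and real_y are batches of images and G and F are the two generators:

```python
import torch

def cycle_consistency_loss(G, F, real_x, real_y):
    # Round-trip each batch through both generators and penalize
    # the L1 distance to the original images.
    reconstructed_x = F(G(real_x))  # X -> Y -> X
    reconstructed_y = G(F(real_y))  # Y -> X -> Y
    return (torch.mean(torch.abs(reconstructed_x - real_x)) +
            torch.mean(torch.abs(reconstructed_y - real_y)))
```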
We can create the full objective function by putting these loss terms together, and
weighting the cycle consistency loss by a hyperparameter λ. The paper suggests setting λ = 10.
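Written out, the full objective from the paper takes the following form (with the least squares variant substituted into the two GAN terms in practice):

```latex
\mathcal{L}_{\text{cyc}}(G, F) =
    \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big]

\mathcal{L}(G, F, D_X, D_Y) =
    \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X)
  + \lambda \, \mathcal{L}_{\text{cyc}}(G, F)
```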
Generator Architecture
Each CycleGAN generator has three sections: an encoder, a transformer, and a decoder.
The input image is fed directly into the encoder, which shrinks the representation size
while increasing the number of channels. The encoder is composed of three convolution
layers. The resulting activation is then passed to the transformer, a series of six residual
blocks. It is then expanded again by the decoder, which uses two transpose convolutions
to enlarge the representation size, and one output layer to produce the final image in
RGB.
You can see the details in the figure below. Please note that each layer is followed by an
instance normalization and a ReLU layer, but these have been omitted from the diagram for simplicity.
An architecture for a CycleGAN generator. As you can see above, the representation size shrinks in the
encoder phase, stays constant in the transformer phase, and expands again in the decoder phase. The
representation size that each layer outputs is listed below it, in terms of the input image size, k. On each layer
is listed the number of filters, the size of those filters, and the stride. Each layer is followed by an instance
normalization and ReLU activation.
Since the generators are fully convolutional, they can handle arbitrarily large inputs once trained.
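To make the structure concrete, here is a minimal PyTorch sketch of such a generator. The exact filter counts, paddings, and the Tanh output layer are assumptions based on common CycleGAN implementations rather than a verbatim copy of the figure; consult the paper's appendix for the official configuration:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection around the block

class Generator(nn.Module):
    def __init__(self):
        super().__init__()

        def conv(in_c, out_c, kernel, stride, pad):
            return [nn.Conv2d(in_c, out_c, kernel, stride, pad),
                    nn.InstanceNorm2d(out_c),
                    nn.ReLU(inplace=True)]

        def upconv(in_c, out_c):
            return [nn.ConvTranspose2d(in_c, out_c, 3, stride=2,
                                       padding=1, output_padding=1),
                    nn.InstanceNorm2d(out_c),
                    nn.ReLU(inplace=True)]

        layers = []
        # Encoder: three convolutions shrink the representation
        # while increasing the number of channels.
        layers += conv(3, 64, 7, 1, 3)
        layers += conv(64, 128, 3, 2, 1)
        layers += conv(128, 256, 3, 2, 1)
        # Transformer: six residual blocks at constant size.
        layers += [ResidualBlock(256) for _ in range(6)]
        # Decoder: two transpose convolutions expand the
        # representation back to the input size.
        layers += upconv(256, 128)
        layers += upconv(128, 64)
        # Output layer produces the final 3-channel RGB image.
        layers += [nn.Conv2d(64, 3, 7, 1, 3), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```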
Discriminator Architecture
The discriminators are PatchGANs: fully convolutional neural networks that look at a
“patch” of the input image and output the probability of that patch being “real”. This is
more computationally efficient than looking at the entire input image, and it is
also more effective: it lets the discriminator focus on surface-level features
like texture, which are often exactly what changes in an image translation task.
If you’ve read about other image-to-image translation systems, you may already be
familiar with PatchGAN. By the time of the CycleGAN paper, a version of PatchGAN had
already been successfully used in paired image-to-image translation by Isola et al. in
Image-to-Image Translation with Conditional Adversarial Nets.
An example architecture for a PatchGAN discriminator. The PatchGAN is a fully convolutional network that takes
in an image and produces a matrix of probabilities, each referring to the probability of the corresponding
“patch” of the image being “real” (as opposed to generated). The representation size that each layer outputs is
listed below it, in terms of the input image size, k. On each layer is listed the number of filters, the size of those
filters, and the stride.
As you can see in the example architecture above, the PatchGAN halves the
representation size and doubles the number of channels until the desired output size is
reached. In this case, it was most effective to have the PatchGAN evaluate 70x70
patches of the input.
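Here is a minimal PyTorch sketch of a 70x70 PatchGAN along these lines. The 4x4 kernels and the LeakyReLU activations follow common CycleGAN implementations rather than the figure above, so treat them as assumptions:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()

        def block(in_c, out_c, stride, norm=True):
            # Each stride-2 block halves the spatial size and
            # doubles the channel count.
            layers = [nn.Conv2d(in_c, out_c, kernel_size=4,
                                stride=stride, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(out_c))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(3, 64, stride=2, norm=False),
            *block(64, 128, stride=2),
            *block(128, 256, stride=2),
            *block(256, 512, stride=1),
            # Final convolution outputs one score per patch rather
            # than a single scalar for the whole image.
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.model(x)  # grid of per-patch realness scores
```

Each value in the output grid is trained against the least squares loss from earlier, so the discriminator judges many overlapping patches of the image at once.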
The training approach was fairly typical for an image-to-image translation task. The
Adam optimizer, a common variant of gradient descent, was used to make training more
stable and efficient. The learning rate was set to 0.0002 for the first half of training, and
then linearly reduced to zero over the remaining iterations. The batch size was set to 1,
which is why we refer to instance normalization, rather than batch normalization, in the
architecture diagrams above.
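Putting those details together, a hypothetical training setup might look like the following sketch. It reuses the Generator and PatchDiscriminator classes sketched above; the epoch count and Adam betas are assumptions borrowed from common CycleGAN implementations, not stated in this post:

```python
import itertools
import torch

G, F = Generator(), Generator()                    # X -> Y and Y -> X
D_X, D_Y = PatchDiscriminator(), PatchDiscriminator()

total_epochs = 200  # assumed; not stated in this post

# One optimizer for both generators, one for both discriminators.
g_optimizer = torch.optim.Adam(
    itertools.chain(G.parameters(), F.parameters()),
    lr=2e-4, betas=(0.5, 0.999))
d_optimizer = torch.optim.Adam(
    itertools.chain(D_X.parameters(), D_Y.parameters()),
    lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch):
    # Hold the learning rate constant for the first half of
    # training, then decay it linearly toward zero.
    half = total_epochs // 2
    return 1.0 if epoch < half else 1.0 - (epoch - half) / half

g_scheduler = torch.optim.lr_scheduler.LambdaLR(g_optimizer, lr_lambda)
d_scheduler = torch.optim.lr_scheduler.LambdaLR(d_optimizer, lr_lambda)
```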
CycleGAN can be used for collection style transfer, where the entire works of an artist are used to train the
model. *Image taken from the paper.
A very unimpressive attempt at a cat-to-dog image translation. Don’t try to use a CycleGAN for this. *Image
taken from the paper.
Translations on the training data often look substantially better than those done on test
data.
Conclusion
Thanks for reading! I hope this was a useful overview. If you would like to see more
implementation details, there are some great public implementations out there you can
refer to. Please leave a comment if you have questions, corrections, or suggestions for
improving this post.