Understanding and Visualizing Generative Adversarial Networks
1. Introduction
data, especially for output data whose format is similar to, or the same as, that of the
input data. Given training data in pairs, the program figures out the
most suitable parameters in the network, so that the discriminator (D) has the
smallest possible chance of distinguishing the generated data (G) from the original data.
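This adversarial objective, introduced by Goodfellow et al. (2014), can be sketched as a single training step. The following is a minimal PyTorch sketch, not the exact PIX2PIXHD training code; the function and variable names, the optimizers, and the use of binary cross-entropy are illustrative assumptions:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(G, D, x, y_real, opt_G, opt_D):
    """One adversarial step. x: input data, y_real: its paired target."""
    y_fake = G(x)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    opt_D.zero_grad()
    d_real = D(y_real)
    d_fake = D(y_fake.detach())          # detach: no gradient flows into G here
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # Generator step: push D(fake) toward 1, i.e. try to fool D.
    opt_G.zero_grad()
    d_fake = D(y_fake)
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

Repeating this step over the training pairs drives G and D against each other, which is what "adversarial" refers to.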
Then, Wang et al. (2017) built a refined network called PIX2PIXHD (Figure 1) for generating and evaluating 2D image data. The input image is translated into three 2D matrices, based on its width, height, and RGB channels. The matrices then pass through 5 groups of convolution layers, each containing one convolution layer, one batch normalization layer, and one ReLU layer; then 9 groups of residual network layers, each consisting of two sets of ReflectionPad2d-Conv2d-InstanceNorm2d-ReLU layers; and finally 5 groups of deconvolution layers, each containing one deconvolution layer, one batch normalization layer, and one ReLU (or Tanh) layer.
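The three kinds of blocks described above can be sketched in PyTorch as follows. This is a minimal sketch: the kernel sizes, strides, and channel counts are illustrative assumptions, not the exact PIX2PIXHD hyperparameters:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Downsampling group: Conv2d -> BatchNorm2d -> ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    # Residual group: two ReflectionPad2d-Conv2d-InstanceNorm2d-ReLU
    # sets, plus a skip connection; size and channels are unchanged.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(c, c, kernel_size=3),
            nn.InstanceNorm2d(c), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(c, c, kernel_size=3),
            nn.InstanceNorm2d(c), nn.ReLU(inplace=True))

    def forward(self, x):
        return x + self.body(x)

def deconv_block(c_in, c_out, last=False):
    # Upsampling group: ConvTranspose2d -> BatchNorm2d -> ReLU,
    # with Tanh instead of ReLU in the final group.
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm2d(c_out),
        nn.Tanh() if last else nn.ReLU(inplace=True))
```

Chaining 5 conv blocks, 9 residual blocks, and 5 deconv blocks in this style yields a generator whose output has the same height, width, and channel count as its input.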
Figure 2. Apartment floor plan drawing (left); labelled image (middle); labelling rule (right).
After training with 100 image pairs, we gave new floor plan drawings to
the network and asked the program to generate predicted labelled images (Figure 3). As figure 3 shows, the network generated a highly similar labelled image, which means it performed well in recognizing architectural drawings.
Figure 3. Apartment floor plan drawing (left); generated labelled image (middle); original
labelled image (right).
2. Working principles
In order to reveal how the PIX2PIXHD network learns image pairs, this chapter
analyses all three parts of the network and explains why they work well in processing image data.
Since the batch normalization layer and the ReLU layer do not build connections between pixels, the convolution layer acts as the main calculation rule, extracting and mixing the features of an image.
As figure 4 shows, a convolution kernel is a 3 × 3 (or larger) matrix. When
we input a 5 × 5 matrix, the kernel slides to each corresponding position, multiplies and sums up 9 numbers, and finally outputs a new 3 × 3 matrix. Generally speaking, a convolution kernel is a feature extractor, turning a matrix into a
smaller but refined new matrix. A convolution layer usually contains hundreds
of kernels, to make sure all features are captured in the layer. The numbers in the kernels are the parameters, which the program figures out by machine learning.
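The sliding multiply-and-sum described above can be written out directly. A minimal NumPy sketch of a "valid" (no-padding) convolution, in which a 3 × 3 kernel turns a 5 × 5 input into a 3 × 3 output; the function name and the identity-kernel example are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding: at each position,
    multiply the overlapping numbers element-wise and sum them."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1          # output height
    ow = image.shape[1] - kw + 1          # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # a 5 x 5 input matrix
kernel = np.zeros((3, 3))
kernel[1, 1] = 1.0                                 # identity kernel
result = conv2d_valid(image, kernel)               # 3 x 3 output
# With the identity kernel, result equals the central 3 x 3 of the image.
```

Replacing the identity kernel with, say, a horizontal-gradient kernel would instead extract edge features, which is exactly the "feature extractor" role described above.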
to the original size, while using the features to generate data similar to the
second image in each image pair. Considering the length of this article, the
reversed matrix operation will not be elaborated.
Next comes the ResNet layer, which does not change the image size or the
number of channels, but further shifts the combination of features. Last, as
figure 7 shows, the deconvolution layer enlarges the image and decreases the
number of channels to match the original image.
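The enlarging behaviour of the deconvolution layer can be illustrated with PyTorch's ConvTranspose2d: where a stride-2 convolution halves the height and width, a stride-2 transposed convolution doubles them. The channel counts and sizes below are illustrative assumptions, not PIX2PIXHD's exact values:

```python
import torch
import torch.nn as nn

# Transposed ("de-") convolution: doubles height and width at stride 2,
# and here also reduces a deep 64-channel feature map to 3 RGB channels.
up = nn.ConvTranspose2d(in_channels=64, out_channels=3,
                        kernel_size=3, stride=2,
                        padding=1, output_padding=1)

feat = torch.randn(1, 64, 32, 32)   # deep feature map: 64 channels, 32 x 32
img = up(feat)
print(img.shape)                    # torch.Size([1, 3, 64, 64])
```

Stacking several such layers walks the representation back from many small, abstract feature channels to one full-size 3-channel image.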
4. Conclusion
Based on Generative Adversarial Networks, PIX2PIXHD is a powerful machine learning tool for recognizing and generating architectural drawings. The features in the drawings become more concise as the network goes deeper, and
clearer as the number of training epochs increases. This may be an inspiring parallel to the learning process of human beings, noting that we learn from concrete
entities to abstract concepts, and from fuzzy cognition to accurate judgement.
In the future, therefore, Generative Adversarial Networks can not only be used for
generating images, but may also have the potential for self-designing art or architectural works.
Acknowledgements
I would like to express my gratitude to Prof. Weixin Huang from Tsinghua University, who supervised
me in this research, and Yuming Lin, Lijing Yang, Chenglin Wu, Zhijia Chen, and Xia Su for
providing labelled image data and advice.
References
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio,
Y. (2014). Generative adversarial nets. In Advances in neural information processing sys-
tems (pp. 2672-2680).
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2017). High-resolution
image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585.