ACKNOWLEDGEMENTS
First and foremost, I would like to praise the Almighty God for giving me the
strength to complete this thesis, which has been a protracted and challenging
journey.
I want to thank my advisor, Dr. Bisrat Derbesa, for his assistance, direction,
advice, and support during the thesis process. He has encouraged me to work
on impactful and challenging research topics and has taught me how to find
good research problems.
Furthermore, I would like to thank my family, friends, all the MSc Computer
Engineering students, my Imo group friends, and colleagues who supported me
throughout the thesis with encouragement, shared ideas, and critical feedback.
This thesis would not have been feasible without their assistance.
Finally, I give my special appreciation to Meseret Haileyesus (CEO of CCFWE)
and Yidinakach Gemeda for the companionship, support, and trust they gave
me during the most challenging times.
ABSTRACT
Contents

DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
ACRONYMS

1 Introduction
1.1 Motivation of the Paper
1.2 Problem Statement
1.2.1 Research Questions
1.3 Objectives
1.3.1 General Objective
1.3.2 Specific Objective
1.4 Methodology
1.5 Contributions
1.6 Scope and Limitations
1.7 Thesis Organization
2.3.2 Conditional GAN
2.3.3 Training GANs
2.4 Multi-modal Unsupervised Image-to-Image Translation
2.4.1 Assumptions
2.4.2 Model
2.4.3 Overview of Networks and Loss
2.5 Image-to-Image Translation

3 Literature Review
3.1 Image-to-image translation

4 Methodology
4.1 Introduction
4.2 Our Proposed Approach
4.3 Implementation Details
4.3.1 Auto-Encoder
4.3.2 Discriminator
4.4 Evaluation Metrics

REFERENCES
List of Figures
5.1 The graph shows the overall training loss of the generator and discriminator for both models in Experiment 1.
5.2 Images produced by the MUNIT model during the first half of training.
5.3 Images produced by our model during the first half of training.
5.4 The graph shows the overall training loss of the generator and discriminator for both models in Experiment 2.
5.5 The graph shows the overall training loss of the generator and discriminator for both models in Experiment 3.
5.6 The graph shows the overall training loss of the generator and discriminator for both models in Experiment 4.
List of Tables
Symbols
≈ Approximately equal to
∼ Equivalence relations
E Expected Value
∼ distributed as
| conditional or given
R Real Value
∈ Set membership
⊆ Subset
∑ Summation
θ Theta
ϕ Phi
ACRONYMS
2D   2-Dimensional image
HD   High Definition
IN   Instance Normalization
MCGE Monte Carlo gradient estimator
SN   Spectral Normalization
Chapter 1
Introduction
1.3 Objectives
1.3.1 General objective
The main objective of this research is to resolve the training instability
that occurs when the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework is trained on images of low resolution quality.
1.4 Methodology
The first task of this study was a literature review. Throughout the
investigation, several areas of the image-translation literature based on
Generative Adversarial Networks (GANs) were reviewed to discover gaps in
previous work and to develop the problem statement. Once the issue was
formulated, a technique to remedy it was developed and put into effect.
Following the implementation of the suggested technique, edge-to-shoe
image datasets were gathered and used to train the proposed system.
Finally, the training loss of the proposed method was compared with that of the original MUNIT.
1.5 Contributions
Here, we present a generative model framework based on deep learning for
learning image-to-image translation between unpaired domains with low image
resolution quality. We introduce spectral normalization into this
unpaired-domain image-to-image translation setting. The main contributions of spectral
normalization are as follows:
Chapter 2
Theoretical Background of Research
In the current chapter, we present and review the core techniques that form
the backbone of our thesis. We begin with an overview of generative modeling
and the classification of generative models. We then briefly cover two of the
most extensively used strategies in the deep generative model domain for
image-to-image translation challenges: variational auto-encoders and
generative adversarial networks. Furthermore, we review MUNIT. In conclusion,
we discuss the theory of image-to-image translation and the basic terms
applied throughout the thesis.
limit of the data log-likelihood, in contrast to the GAN approach of finding a
Nash equilibrium between the generator and discriminator networks [21].
D_KL refers to the (non-negative) KL divergence, and θ and ϕ are the model
parameters. We can set a variational lower bound on the log-likelihood [21]:
\[
\log p_\theta(x_i) \geq \mathcal{L}(x_i, \theta, \phi) \tag{2.4}
\]
\[
\mathcal{L}(x_i, \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z \mid x_i)}\left[\log p_\theta(x_i \mid z)\right] - D_{KL}\left[\, q_\phi(z \mid x_i) \,\|\, p_\theta(z) \,\right] \tag{2.5}
\]
VAEs [33] provide better training stability than GANs [34] and better sampling
than autoregressive models [35]. Nevertheless, some practical and theoretical
issues with VAEs remain unaddressed. The fundamental disadvantage of
variational approaches is their tendency to make an unacceptable trade-off
between sample quality and reconstruction quality, caused by an insufficient
or excessively simplified approximate posterior distribution. The research
investigations in [36, 37] improved the variational posterior to reduce the
ambiguity of the produced samples. Tomczak et al. [38] introduced a novel
prior to learn more powerful hidden representations. Furthermore, [39]
asserted that the inherent over-regularization caused by the KL divergence
term in the VAE objective frequently results in a gap between
L(x_i, θ, ϕ) and the true log-likelihood.
this game is to achieve a Nash equilibrium between both participants [21]. The
remaining sections will go over unconditional GANs (2.3.1), conditional GANs
(2.3.2), and how to train GANs (2.3.3).
\[
\min_G \max_D \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{2.6}
\]
\[
\min_G \max_D \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right] \tag{2.7}
\]
(a) The structure of unconditional GANs (b) The structure of conditional GANs
2.4.2 Model
Assuming two collections of unpaired pictures from domains X_1 and X_2, our
objective is to build two mappings M_{x_1 \to x_2}: X_1 \to X_2 and
M_{x_2 \to x_1}: X_2 \to X_1 that translate images across the two domains [50].
The model has an Encoder E_i and a Decoder G_i for each domain X_i (i = 1, 2).
The translation should be natural and clear, with a specific focus on
preserving the essential key content attributes. It makes no strong
assumptions about the two domains other than the existence of some shared
key content in both domains.
\[
\mathcal{L}_{recon}^{x_2} = \mathbb{E}_{x_2 \sim p(x_2)}\left[\, \left\| G_2(E_2^c(x_2), E_2^s(x_2)) - x_2 \right\|_1 \right] \tag{2.9}
\]
\[
\mathcal{L}_{recon}^{s_1} = \mathbb{E}_{c_2 \sim p(c_2),\, s_1 \sim p(s_1)}\left[\, \left\| E_1^s(G_1(c_2, s_1)) - s_1 \right\|_1 \right] \tag{2.11}
\]
\[
\mathcal{L}_{recon}^{c_2} = \mathbb{E}_{c_2 \sim p(c_2),\, s_1 \sim p(s_1)}\left[\, \left\| E_1^c(G_1(c_2, s_1)) - c_2 \right\|_1 \right] \tag{2.12}
\]
\[
\mathcal{L}_{recon}^{s_2} = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim p(s_2)}\left[\, \left\| E_2^s(G_2(c_1, s_2)) - s_2 \right\|_1 \right] \tag{2.13}
\]
\[
\mathcal{L}_{GAN}^{x_2}(G_2, D_2) = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim p(s_2)}\left[\log\left(1 - D_2(G_2(c_1, s_2))\right)\right] + \mathbb{E}_{x_2 \sim p(x_2)}\left[\log D_2(x_2)\right] \tag{2.14}
\]
\[
\mathcal{L}_{GAN}^{x_1}(G_1, D_1) = \mathbb{E}_{c_2 \sim p(c_2),\, s_1 \sim p(s_1)}\left[\log\left(1 - D_1(G_1(c_2, s_1))\right)\right] + \mathbb{E}_{x_1 \sim p(x_1)}\left[\log D_1(x_1)\right] \tag{2.15}
\]
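As a rough illustration of how a reconstruction term such as Equation 2.9 maps onto code, the following minimal PyTorch sketch assumes hypothetical callables E2_content, E2_style, and G2 standing in for the domain-2 encoders and decoder; it is not the MUNIT reference implementation.

```python
import torch

def image_recon_loss(x2, G2, E2_content, E2_style):
    """L1 image reconstruction loss of Eq. (2.9): encode x2 into content and
    style codes, decode them back, and penalize the pixel-wise difference."""
    c2 = E2_content(x2)                 # content code of the real image
    s2 = E2_style(x2)                   # style code of the real image
    x2_recon = G2(c2, s2)               # reconstructed image
    return torch.mean(torch.abs(x2_recon - x2))   # || . ||_1, averaged over pixels
```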
Figure 2.7: Two examples of uni-modal and multi-modal image translation [15].
Chapter 3
Literature Review
This chapter covers the work most closely related to our thesis. We start by
reviewing image translation techniques built with generative adversarial
networks: those trained with paired domain data are called supervised image
translation, and those trained with unpaired domains are known as unsupervised
image translation. The supervised translation techniques divide into two
kinds, directional and bidirectional translation. Finally, we cover
unsupervised image translation approaches.
Cycle-GAN uses two generators and two discriminators to learn the mapping
between the two separate sets of image domains. The model uses two objective
function terms: an adversarial loss and a cycle-consistency loss. Cycle-GAN
approaches have been successful at image translation with simple style or
appearance changes, such as a horse to a zebra or a zebra to a horse, but
commonly fail when the translation requires geometric deformation.
An auto-encoder has an encoder and a decoder. The encoder turns the input
pictures into a latent representation. The decoder (generator) receives this
compressed vector to produce high-quality images. The Unsupervised
Image-to-Image Translation Network (UNIT) [19] was created using such
auto-encoders. It rests on the shared latent space assumption: corresponding
images in different domains can be mapped to the same latent code in a shared
latent space. Its framework combines Coupled Generative Adversarial Networks
(COGAN) and VAEs. Two GANs are trained to convert pictures by encoding them
into a fully shared latent space and decoding their latent codes as pictures
in the domain of interest. However, there are two main drawbacks to this
framework: the translated output images lack diversity (they are unimodal),
and additional problems are caused by the training instability between the
generator and discriminator networks.
The Multi-modal Unsupervised Image-to-Image Translation (MUNIT) [15] approach
is a disentangled-representation method that addresses the UNIT diversity
problem. In the MUNIT framework, the latent space of images is disentangled
into two parts: a content code and a style code. The MUNIT assumption states
that pictures from the various domains share a common content space but not an
identical style space. The framework builds two auto-encoders, and the latent
code of each auto-encoder is split into a content code and a style code.
MUNIT [15] suffers from loss of content and photo-realism during bidirectional
reconstruction. The Matting Laplacian [53] technique generates an alpha matte
that extracts the foreground of a picture as white with the background as
black, or vice versa, depending on the demands. We compute the local affine
transform factor using this matting Laplacian matrix. Moniz et al. [16] added
Chapter 4
Methodology
4.1 Introduction
One of the most essential and frequently faced challenges in computer graphics
and vision is image translation. With so much interest in images for neural
networks in the graphics world today, it is reasonable to wonder whether
artificial intelligence can learn image translation, especially unsupervised
image translation: can a computer learn to translate an edge to a shoe if the
translator has only seen a collection of edge images and shoe images, without
pairings between the two domains?
The interesting unpaired-domain translation problem has attracted a variety of
computer vision and graphics researchers in recent years, including the Domain
Transfer Network (DTN) [54], MUNIT [15], Cycle GAN [1], UNIT [19], and
others [16]. However, most successes in unpaired image translation have come
from changing or transferring style elements in images with a resolution of
256x256 pixels or above, rather than in images of low resolution quality. The
Cycle GAN-based horse-to-zebra photograph translation is a symbolic example,
with the network only able to make slight style adjustments to the horse/zebra
visuals.
For image translation, we suggest a deep generative model network. Our goal is
that training becomes more stable and that the image-generating network learns
meaningful information about what the translated image should look like from
an unpaired domain containing images of low resolution quality. The network is
trained on two sets of images, for example edges and shoes, each represented
by a collection of dataset images. There is no image pairing between the two
domains to facilitate image translation, nor is there any pairwise
correspondence between individual images. Once trained, the
network takes an image from the source domain and turns it into an image in
the target domain.
There is no image-level correspondence between the source and destination
domains for the translation (i.e., unsupervised image translation). One of the
challenges to alleviate is how to correct the training stability of the
network that relates the images to facilitate their translation. For that
purpose, we perform picture translation in Multi-modal Unsupervised
Image-to-image Translation (MUNIT), which uses a common latent space shared
between the two domains. Figure 2.5 shows the latent space code obtained by an
auto-encoder trained before image translation.
More crucially, a suitable image translation from edge to shoe should
translate a given edge image to a shoe that is recognizably derived from that
specific input edge. As a result, some of the edge attributes should be kept
during an edge-to-shoe translation, while others can be modified. This
requirement is the main challenge: it is uncertain which characteristics
should be preserved or altered; it depends upon the provided image domains,
and our network must learn it without supervision. To that end, spectral
normalization is applied to the MUNIT network to address this challenge.
• When the MUNIT framework is trained with low image resolution quality, such
as 128x128, the output images are blurred, and the holistic structure
So the slope of the secant line through (x_1, f(x_1)) and (x_2, f(x_2)) is
bounded: |f(x_1) − f(x_2)| ≤ K |x_1 − x_2|. The absolute value of the slope of
the function is always less than K, the Lipschitz constant. A Lipschitz
function therefore has bounded first derivatives.
Let us look at Figure 4.2 to illustrate Lipschitz continuity geometrically.
Suppose F(x) = sin(x); since F is Lipschitz continuous, we can draw a cone
centered at every point on its graph such that the graph always lies entirely
outside this cone.
Figure 4.2: There exists a double cone (yellow) whose origin can be shifted
along the graph so that the entire graph always remains outside the double
cone, illustrating the Lipschitz continuity of the sin function.
Figure 4.3: Lipschitz continuity ensures that the ReLU and Leaky ReLU
functions always remain entirely outside the yellow cone.
Spectral norm
Mathematically, it is defined as:
\[
\|A\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|}
\]
where \|A\| denotes the spectral norm. It measures how much the matrix A can
stretch a vector x.
Let us consider the Singular Value Decomposition (SVD) of a matrix: the
factorization of that matrix into a product of three matrices. Every m x n
rectangular matrix factors into U, S, and V: the factor U is an orthogonal
matrix, the factor S in the middle is a diagonal matrix, and the factor V^T on
the right is also an orthogonal matrix.
\[
A_{m \times n} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix},
\qquad
A_{m \times n} = U_{m \times m}\, S_{m \times n}\, V^{T}_{n \times n}
\]
\[
U =
\begin{bmatrix}
u_{11} & \cdots & u_{1m} \\
\vdots & \ddots & \vdots \\
u_{m1} & \cdots & u_{mm}
\end{bmatrix},
\qquad
V =
\begin{bmatrix}
v_{11} & \cdots & v_{1n} \\
\vdots & \ddots & \vdots \\
v_{n1} & \cdots & v_{nn}
\end{bmatrix},
\qquad
S =
\begin{bmatrix}
\delta_1 & & & \\
& \ddots & & \\
& & \delta_r & \\
& & & 0
\end{bmatrix}
\]
The above equation depicts a discriminator neural network composed of many
functions. Take one layer of the neural network: each layer L outputs y = Ax
followed by an activation function a, which is either ReLU or Leaky ReLU.
Let us write down the Lipschitz norm of a function g, which can be determined
from its derivative and is formally denoted by:
The Lipschitz constant of the ReLU and the Leaky ReLU activation functions is
always one. As a result, such activation functions appear in the
discriminator's deep neural network f in Equation 4.3.
∥ f ∥ Lip = 1
If we normalize each W^l using Equation 4.5 and use the fact that
σ(\bar{W}_{SN}(W)) = 1,
Starting from an eigenvector u_0 of A with eigenvalue λ, repeated
multiplication by A gives
\[
u_1 = A u_0 = \lambda u_0, \qquad u_2 = A u_1 = \lambda^2 u_0, \qquad \ldots, \qquad u_k = A^k u_0 = \lambda^k u_0 .
\]
For a general starting vector v_0 = c_1 u_1 + c_2 u_2 + \cdots + c_n u_n
expressed in the basis of eigenvectors,
\[
v_k = A v_{k-1} = A^k v_0 = c_1 A^k u_1 + c_2 A^k u_2 + \cdots + c_n A^k u_n = c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \cdots + c_n \lambda_n^k u_n .
\]
Let b_0 be the initial guess (chosen uniformly at random); then c_1 ≠ 0 with
probability one.
Applying the following loop, starting with b, the vector b will eventually
converge to a multiple of the dominant eigenvector. To summarize, we begin
with a random vector b and estimate the dominant eigenvector by repeatedly
multiplying by A:
\[
b_{k+1} = \frac{A b_k}{\| A b_k \|}
\]
The exact same procedure may be implemented with ũ and ṽ, which are repeatedly
calculated in the following order (where A can now be taken to be the weight
matrix W).
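The following is a minimal sketch of this power iteration for estimating the spectral norm, assuming a plain 2-D PyTorch weight matrix; a practical implementation (as in Miyato et al. [55]) would reshape convolution kernels to 2-D and persist ũ between training steps.

```python
import torch

def spectral_norm_estimate(W, num_iters=5, eps=1e-12):
    """Estimate the largest singular value sigma(W) of a 2-D weight matrix by
    alternately applying W and W^T to a random vector and normalizing."""
    out_dim, in_dim = W.shape
    u = torch.randn(out_dim)          # random initial guess, so c1 != 0 almost surely
    for _ in range(num_iters):
        v = W.t() @ u
        v = v / (v.norm() + eps)      # v~  <-  W^T u~ / ||W^T u~||
        u = W @ v
        u = u / (u.norm() + eps)      # u~  <-  W v~  / ||W v~||
    return torch.dot(u, W @ v)        # estimate of the first singular value

# Normalizing a weight by this estimate makes the layer approximately 1-Lipschitz.
W = torch.randn(64, 128)
W_sn = W / spectral_norm_estimate(W)
```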
4.3.1 Auto-Encoder
The auto-encoder architecture consists of an encoder and a decoder, as shown
in more detail in Figure 4.4. We now discuss each component in depth.
(A) Encoder
An encoder's purpose in each domain is to encode images into a latent code.
The latent code consists of both a content code and a style code. Separate
sub-networks are required to generate these codes: the content encoder
produces the content code, and the style encoder generates the style code.
• Content Encoder
The content encoder is composed of two main sections. The first is
down-sampling: several strided convolutional layers reduce the dimensionality
of the input image so that it can be processed more efficiently. In the second
section, multiple residual blocks process the result further. Instance
Normalization (IN) follows all convolutional layers. (A code sketch of the
encoders follows this list.)
• Style Encoder
The style encoder consists of down-sampling through multiple strided
convolutional layers that reduce the input, followed by Global Average Pooling
(GAP) and Fully Connected (FC) layers. MUNIT and our approach do not use IN
layers in the style encoder, because IN eliminates the original feature mean
and variance, which carry crucial style information.
(B) Decoder
In the MUNIT architecture, the auto-encoder contains another component, the
decoder, which we now discuss in detail. The decoder's primary objective is to
reconstruct the input picture from its content and style codes. During image
construction, a collection of residual blocks processes the content code and
eventually creates the reconstructed image through several up-sampling and
convolutional layers. Inspired by past studies that use affine transformation
parameters in normalization layers to represent styles, we utilize residual
blocks with Adaptive Instance Normalization (AdaIN) layers whose parameters
are produced by an MLP from the style code.
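As a rough sketch of the encoder side just described (a content encoder with IN and residual blocks, and a style encoder with GAP and an FC layer): the channel widths, layer counts, and kernel sizes here are illustrative assumptions, not the exact MUNIT configuration, and the decoder would mirror this with AdaIN residual blocks and up-sampling convolutions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions with Instance Normalization."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class ContentEncoder(nn.Module):
    """Strided (down-sampling) convolutions followed by residual blocks,
    with Instance Normalization after every convolution."""
    def __init__(self, ch=64, n_res=4):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, ch, 7, 1, 3), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.InstanceNorm2d(ch * 4), nn.ReLU(inplace=True),
            *[ResBlock(ch * 4) for _ in range(n_res)],
        )

    def forward(self, x):
        return self.model(x)

class StyleEncoder(nn.Module):
    """Strided convolutions, global average pooling, and a fully connected
    layer; no Instance Normalization, so mean/variance style cues survive."""
    def __init__(self, ch=64, style_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, ch, 7, 1, 3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(ch * 4, style_dim)

    def forward(self, x):
        h = self.conv(x).mean(dim=(2, 3))   # global average pooling
        return self.fc(h)

# Usage on a 128x128 input:
x = torch.randn(1, 3, 128, 128)
content = ContentEncoder()(x)   # shape: (1, 256, 32, 32)
style = StyleEncoder()(x)       # shape: (1, 8)
```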
4.3.2 Discriminator
Discriminators are binary classifiers used to distinguish between real and
generated (fake) images. The goal is to determine whether an image is real or
reconstructed.
The Least Squares GAN (LSGAN) model was created to overcome the vanishing
gradient issue triggered by the original GAN model's minimax and
non-saturating losses. It applies a least squares loss instead [21]. The
objective of the LSGAN function is to minimize the difference between the real
sample distribution and the generated sample distribution [21]. There are two
advantages of using LSGAN instead of the original GAN: first, LSGAN can create
samples of excellent quality; second, LSGAN enables the learning process to
produce a sufficient amount of variety.
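A minimal sketch of the least squares objectives, assuming the common 0/1 target coding for fake/real discriminator outputs; the exact targets and any weighting are assumptions rather than the thesis's configuration.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """Least squares discriminator loss: push outputs on real images toward 1
    and outputs on generated images toward 0."""
    return torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least squares generator loss: push discriminator outputs on generated
    images toward 1, avoiding the saturating log loss."""
    return torch.mean((d_fake - 1) ** 2)
```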
We use a multi-scale discriminator with three scales. Training with a
multi-scale discriminator helps the generator network create a more globally
consistent image with finer details, so the overall translation effect appears
more natural. In subsection 2.4.3, the two discriminators D_1 and D_2 are
mentioned, which help the network improve image reconstruction. The MUNIT
discriminator architecture consists of the following layers, in the stated
sequence:
N.B.: The MUNIT model's working principle is the GAN. A GAN is formed of two
primary neural networks: a generator that seeks to produce a picture from a
latent space code, and a discriminator, a classifier that distinguishes
between real and generated images, as we saw in greater depth in section 2.3
of chapter 2. In the case of the auto-encoder, the encoder generates the
latent space of content and style codes, whereas the decoder creates an image
from the latent code built by the encoder. In other words, the decoder
functions as the GAN generator.
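As a hedged illustration of how spectral normalization could be attached to the discriminator's convolutional layers in PyTorch, using torch.nn.utils.spectral_norm; the channel widths and number of layers below are assumptions, not the thesis's exact discriminator configuration.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, kernel=4, stride=2, pad=1):
    """A convolution wrapped with spectral normalization, so its largest
    singular value is rescaled toward 1 on every forward pass."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, kernel, stride, pad))

# One discriminator scale built from spectrally normalized convolutions
# followed by Leaky ReLU activations (slope 0.2), ending in a 1-channel
# patch-level output for the least squares loss.
disc_scale = nn.Sequential(
    sn_conv(3, 64), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(64, 128), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(128, 256), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(256, 512), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 1),
)
```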
The latent reconstruction loss contains two terms: the style loss described in
Equations 2.11 and 2.13, and the content loss stated in Equations 2.10 and
2.12.
\[
\mathcal{L}_{GAN}^{x_1}(G_1) = \mathbb{E}_{c_2 \sim p(c_2),\, s_1 \sim p(s_1)}\left[\log\left(1 - D_1(G_1(c_2, s_1))\right)\right] \tag{4.8}
\]
Therefore, the sum of the three losses, namely the image reconstruction loss,
the latent reconstruction loss, and the translated-image (adversarial) loss,
is known as the generator loss.
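Putting these pieces together, a minimal sketch of the generator objective as a weighted sum; the individual loss tensors are assumed to have been computed as in Equations 2.9 to 2.13 and 4.8, and the default weights mirror the λ_x, λ_c, λ_s values listed in Section 5.3.

```python
def generator_total_loss(img_recon, content_recon, style_recon, gan_g,
                         lambda_x=10.0, lambda_c=1.0, lambda_s=1.0):
    """Weighted sum of the generator-side terms: adversarial loss, image
    reconstruction loss, and latent (content + style) reconstruction loss."""
    return gan_g + lambda_x * img_recon + lambda_c * content_recon + lambda_s * style_recon
```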
(B) Discriminator
The discriminator loss is the sum of the two discriminators' losses, as
described in subsection 2.4.3.
\[
\mathcal{L}_{GAN}^{x_2}(D_2) = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim p(s_2)}\left[\log\left(1 - D_2(G_2(c_1, s_2))\right)\right] + \mathbb{E}_{x_2 \sim p(x_2)}\left[\log D_2(x_2)\right] \tag{4.9}
\]
\[
\mathcal{L}_{GAN}^{x_1}(D_1) = \mathbb{E}_{c_2 \sim p(c_2),\, s_1 \sim p(s_1)}\left[\log\left(1 - D_1(G_1(c_2, s_1))\right)\right] + \mathbb{E}_{x_1 \sim p(x_1)}\left[\log D_1(x_1)\right] \tag{4.10}
\]
5.3 Hyper-parameters
In all four experiments, the input pictures have 128x128 image resolution
quality. We utilize the Adam optimizer with parameters β_1 = 0.5, β_2 = 0.999,
and an initial learning rate of 0.0001 [15]. In each experiment, the learning
rate is reduced by half after a fixed number of epochs (detailed per
experiment below). We use a batch size of 1 in all experiments. The image,
content, and style reconstruction weights are set to λ_x = 10, λ_c = 1, and
λ_s = 1, respectively [15]. The generator network uses the ReLU activation
function, and the discriminator uses Leaky ReLU with a slope of 0.2 [15]. We
chose a style code of size 8 for the 128x128 pixel images.
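A minimal sketch of this optimizer setup in PyTorch, using placeholder modules for the generator and discriminator; the epoch at which the halving occurs is a per-experiment assumption (Experiment 1 switches after two epochs).

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the generator and discriminator networks.
gen = nn.Conv2d(3, 3, 3, padding=1)
dis = nn.Conv2d(3, 1, 3, padding=1)

# Adam with beta1 = 0.5, beta2 = 0.999 and an initial learning rate of 1e-4.
opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_dis = torch.optim.Adam(dis.parameters(), lr=1e-4, betas=(0.5, 0.999))

# Halve the learning rate once, at the epoch where the second training segment begins.
decay_epoch = 2  # illustrative value
sched_gen = torch.optim.lr_scheduler.MultiStepLR(opt_gen, milestones=[decay_epoch], gamma=0.5)
sched_dis = torch.optim.lr_scheduler.MultiStepLR(opt_dis, milestones=[decay_epoch], gamma=0.5)
```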
our model's generator loss with the MUNIT generator loss, and our model's
discriminator loss with the MUNIT discriminator loss, using the relative
change. The relative change expresses the change as a percentage of the loss
value of the earlier method, i.e.,
\[
\text{Relative change} = \frac{\text{MUNIT training loss} - \text{Our model training loss}}{\text{MUNIT training loss}} \times 100\%
\]
After calculating the relative change for each interval sample of the training
loss of the two neural networks (the generator and the discriminator), we
calculated the mean: the sum of the relative changes over all interval samples
divided by the total number of interval samples in the experiment.
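A small sketch of this computation, with made-up loss values used purely for illustration:

```python
def relative_change(munit_loss, our_loss):
    """Relative change of one interval sample, as a percentage of the MUNIT
    (earlier method) loss value."""
    return (munit_loss - our_loss) / munit_loss * 100.0

def mean_relative_change(munit_losses, our_losses):
    """Average the per-sample relative changes over all interval samples."""
    changes = [relative_change(m, o) for m, o in zip(munit_losses, our_losses)]
    return sum(changes) / len(changes)

# Illustrative (not measured) interval samples of generator training loss:
print(mean_relative_change([2.0, 1.8, 1.6], [1.9, 1.7, 1.5]))
```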
Figure 5.1: The graph shows the overall generator and discriminator training
loss for the two models (i.e., MUNIT and our proposed method). It is plotted
using the results of Experiment 1.
In the Figure 5.1 graph, the x-axis represents the interval samples of each
model's training, and the y-axis represents each model's training loss value
at each interval sample. At every interval sample of 1000, both models (i.e.,
MUNIT and our proposed method) generate an image along with a generator and
discriminator training loss. The following three graphs (i.e., Figure 5.4 from
Experiment 5.4.2, Figure 5.5 from Experiment 5.4.3, and Figure 5.6 from
Experiment 5.4.4) use the same x-axis and y-axis as Figure 5.1. Therefore,
each graph shows, for the two models (i.e., MUNIT and our proposed method),
the generator and discriminator training loss for each interval sample of
1000.
The training loss impact of the two models can be seen in two segments of the
graph in Figure 5.1 above. For both models, the initial learning rate
hyper-parameter mentioned in section 5.3 is used for up to two epochs (from 0
to 10,000 interval samples). Our suggested technique has a lower training loss
than the MUNIT model and produces more photo-realistic images than the MUNIT
model.
Figure 5.2: The images in Figure 5.2(a) above were produced by the MUNIT model
during the first half of training. We use Experiment 1 to demonstrate; the
additional experiments 2, 3, and 4 are not included because the results are
similar.
Figure 5.3: The images in Figure 5.3(a) above were produced by our model
during the first half of training. We use Experiment 1 to demonstrate; the
additional experiments 2, 3, and 4 are not included because the results are
similar.
In the second segment of the graph, which ranges from 2 to 5 epochs (10,000 to
25,000 interval samples), the initial learning rate for both models is reduced
by half. In this segment, our model shows a slightly lower training loss.
Additionally, as seen in Figure 5.1, the graph results of our model and of
MUNIT overlap.
Figure 5.4: The graph shows the overall generator and discriminator training
loss for the two models (i.e., MUNIT and our proposed method). It is plotted
using the results of Experiment 2.
The training loss impact of the two models can be seen in two segments of the
graph in Figure 5.4 above. For both models, the initial learning rate
hyper-parameter mentioned in section 5.3 is used for up to five epochs (from 0
to 50,000 interval samples). Our suggested technique has a lower training loss
than the MUNIT model and produces more photo-realistic images than the MUNIT
model. In the second segment of the graph, which ranges from 5 to 10 epochs
(25,000 to 50,000 interval samples), the initial learning rate for both models
is reduced by half. In this segment, our model shows a slightly lower training
loss. Additionally, as seen in Figure 5.4, the graph results of our model and
of MUNIT overlap.
epochs in this experiment. For every interval sample of 1000, we collected 100
samples from the 100,000 batches (i.e., the model trains on 20,000 images with
batch size 1; since a single epoch is one pass of all the data through the
network, it takes 20,000 batches to make up a full epoch, so there are 100,000
batches across all five epochs). When we compared the training loss of our
proposed method to the MUNIT model, the overall generator training loss was
decreased by 4.745% on average, and the discriminator loss was reduced by
2.787% on average. However, our model requires more training time than the
MUNIT model, showing an increase of 12%.
The training loss impact of the two models can be seen in two segments of the
graph in Figure 5.5 above. For both models, the initial learning rate
hyper-parameter mentioned in section 5.3 is used for up to two epochs (from 0
to 50,000 interval samples). Our suggested technique has a lower training loss
than the MUNIT model and produces more photo-realistic images than the MUNIT
model. In the second segment of the graph, which ranges from 2 to 5 epochs
(50,000 to 100,000 interval samples), the initial learning rate for both
models is reduced by half. In this segment, our model shows a slightly lower
training loss. Additionally, as seen in Figure 5.5, the graph results of our
model and of MUNIT overlap.
Figure 5.5: The graph shows the overall generator and discriminator training
loss for the two models (i.e., MUNIT and our proposed method). It is plotted
using the results of Experiment 3.
Figure 5.6: The graph shows the overall generator and discriminator training
loss for the two models (i.e., MUNIT and our proposed method). It is plotted
using the results of Experiment 4.
5.5 Discussion
In this section, we discuss all experiments to address the two research
questions raised in chapter 1, subsection 1.2.1. We obtained the following
answers to those questions based on the experimental results:
Q1: What does the effect of Spectral normalization apply on Multi-modal Un-
images.
6.1 Conclusions
This thesis proposed a method for performing unsupervised image-to-image
translation for low picture resolution while avoiding the training instability
observed in the Multi-modal Unsupervised Image-to-Image Translation (MUNIT)
method. In the MUNIT architecture, the generator's loss reduces slowly, which
means the generator only gradually finds a way to fool the discriminator while
its generations are still immature. In addition, the discriminator prevents
the generator network from learning new information. Our proposed method also
applies to systems with lower GPU memory compared to the originally proposed
MUNIT model. This was achieved by using spectral normalization. The suggested
approach has reduced training loss compared to the MUNIT approach on
lower-resolution images. Therefore, our proposed method reduces the training
instability observed for the MUNIT method on lower-resolution images.
Finally, we are interested in investigating image-to-image translation with
low-resolution quality for gaining meaningful information and reducing the
generator loss of information during the translation task. Giving an artist such
[16] M. Oza, H. Vaghela, and S. Bagul, “Semi-supervised image-to-image
translation,” in 2019 International Conference of Artificial Intelligence and
Information Technology (ICAIIT), pp. 16–20, 2019.
[17] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, and
K. Murphy, “Xgan: Unsupervised image-to-image translation for many-to-
many mappings,” in Domain Adaptation for Visual Understanding, pp. 33–
49, Springer, 2020.
[18] T. Lindvall, Lectures on the coupling method. Courier Corporation, 2002.
[19] M. Liu, T. M. Breuel, and J. Kautz, “Unsupervised image-to-image trans-
lation networks,” CoRR, vol. abs/1703.00848, 2017.
[20] D. Foster, Generative deep learning: teaching machines to paint, write,
compose, and play. O’Reilly Media, 2019.
[21] Y. Pang, J. Lin, T. Qin, and Z. Chen, “Image-to-image translation: Methods
and applications,” IEEE Transactions on Multimedia, vol. 24, pp. 3859–
3881, 2021.
[22] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the
IEEE international conference on computer vision, pp. 1395–1403, 2015.
[23] A. Oussidi and A. Elhassouny, “Deep generative models: Survey,” in
2018 International Conference on Intelligent Systems and Computer Vision
(ISCV), pp. 1–8, IEEE, 2018.
[24] H.-M. Chu, C.-K. Yeh, and Y.-C. F. Wang, “Deep generative models for
weakly-supervised multi-label classification,” in Proceedings of the Euro-
pean Conference on Computer Vision (ECCV), September 2018.
[25] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson,
and M. N. Do, “Semantic image inpainting with deep generative models,”
in Proceedings of the IEEE conference on computer vision and pattern recog-
nition, pp. 5485–5493, 2017.
[26] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models for
distribution-preserving lossy compression,” Advances in neural informa-
tion processing systems, vol. 31, 2018.
[27] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative
adversarial networks: introduction and outlook,” IEEE/CAA Journal of
Automatica Sinica, vol. 4, no. 4, pp. 588–598, 2017.
[28] L. Jiang, H. Zhang, and Z. Cai, “A novel bayes model: Hidden naive bayes,”
IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10,
pp. 1361–1371, 2009.
[29] L. Rabiner, “A tutorial on hidden markov models and selected applications
in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–
286, 1989.
[30] A. Alotaibi, “Deep generative adversarial networks for image-to-image
translation: A review,” Symmetry, vol. 12, no. 10, p. 1705, 2020.
[31] R. Salakhutdinov and H. Larochelle, “Efficient learning of deep boltzmann
machines,” in Proceedings of the thirteenth international conference on ar-
tificial intelligence and statistics, pp. 693–700, JMLR Workshop and Con-
ference Proceedings, 2010.
[32] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947,
2009.
[33] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Ad-
vances in neural information processing systems, vol. 27, 2014.
[35] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “Made: Masked
autoencoder for distribution estimation,” in International conference on
machine learning, pp. 881–889, PMLR, 2015.
[36] D. Rezende and S. Mohamed, “Variational inference with normalizing
flows,” in International conference on machine learning, pp. 1530–1538,
PMLR, 2015.
[37] E. Nalisnick, L. Hertel, and P. Smyth, “Approximate inference for deep
latent gaussian mixtures,” in NIPS Workshop on Bayesian Deep Learning,
vol. 2, p. 131, 2016.
[38] J. Tomczak and M. Welling, “Vae with a vampprior,” in International Confer-
ence on Artificial Intelligence and Statistics, pp. 1214–1223, PMLR, 2018.
[39] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein auto-
encoders,” arXiv preprint arXiv:1711.01558, 2017.
[40] S. K. Pal and S. Mitra, “Multilayer perceptron, fuzzy sets, classification,”
1992.
[41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hub-
bard, and L. D. Jackel, “Backpropagation applied to handwritten zip code
recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
[42] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
learning with deep convolutional generative adversarial networks,” arXiv
preprint arXiv:1511.06434, 2015.
[43] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv
preprint arXiv:1411.1784, 2014.
[44] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver-
sarial networks,” in International conference on machine learning, pp. 214–
223, PMLR, 2017.
[45] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
“Improved training of wasserstein gans,” Advances in neural information
processing systems, vol. 30, 2017.
[46] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least
squares generative adversarial networks,” in Proceedings of the IEEE in-
ternational conference on computer vision, pp. 2794–2802, 2017.
[47] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial
network,” arXiv preprint arXiv:1609.03126, 2016.
[48] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium
generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
[49] A. Jolicoeur-Martineau, “The relativistic discriminator: a key element
missing from standard gan,” arXiv preprint arXiv:1807.00734, 2018.
[50] K. Yin, Z. Chen, H. Huang, D. Cohen-Or, and H. Zhang, “Logan: Un-
paired shape transform in latent overcomplete space,” ACM Transactions
on Graphics (TOG), vol. 38, no. 6, pp. 1–13, 2019.
[51] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and
E. Shechtman, “Toward multimodal image-to-image translation,” Ad-
vances in neural information processing systems, vol. 30, 2017.
[52] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse
image-to-image translation via disentangled representations,” in Proceed-
ings of the European conference on computer vision (ECCV), pp. 35–51,
2018.
[53] A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natu-
ral image matting,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 30, no. 2, pp. 228–242, 2008.
[54] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image
generation,” arXiv preprint arXiv:1611.02200, 2016.
[55] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization
for generative adversarial networks,” arXiv preprint arXiv:1802.05957,
2018.