Generative Adversarial Networks For Data Modeling
Purdue University
Preamble
When you create a probabilistic model for your data, you acquire the power to
generate new samples of the data from the model. Depending on how good a job
you did of modeling the data, the new samples you generate from the model may
look deceptively similar to those in your data without being exactly the same as
any one of them.
It may also happen that you are really NOT interested in fitting a parametric
model to your data, but you are interested in generating new samples from the
data nevertheless. In such cases, it is possible you could get away with just
constructing a multi-dimensional histogram from the data and using a generator
of some sort that would spit out new samples according to that histogram.
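To make that concrete, here is a minimal two-dimensional sketch of such a histogram-based generator; the stand-in dataset and all of the names are made up for illustration:

import numpy as np

# Build a 2D histogram from the data, draw bin indices with probabilities
# proportional to the bin counts, and jitter each sample within its bin.
rng = np.random.default_rng(0)
data = rng.standard_normal((10000, 2))                  # stand-in for real data
counts, xedges, yedges = np.histogram2d(data[:, 0], data[:, 1], bins=20)
probs = (counts / counts.sum()).ravel()                 # histogram as a distribution

flat = rng.choice(probs.size, size=5, p=probs)          # pick bins by probability
ix, iy = np.unravel_index(flat, counts.shape)
new_x = rng.uniform(xedges[ix], xedges[ix + 1])         # uniform jitter inside
new_y = rng.uniform(yedges[iy], yedges[iy + 1])         # the chosen bin
new_samples = np.column_stack([new_x, new_y])
print(new_samples)                                      # five new 2D samples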
Preamble
Regardless of whether you have an analytic model for the data or just a
good-quality histogram, generating new samples is not easy. It has been the
subject of much research by probability theorists and statisticians over the last
several decades. The best techniques fall under the label Markov-Chain
Monte-Carlo (MCMC) sampling, and the most commonly used algorithm for MCMC
sampling is the Metropolis-Hastings algorithm.
The basic intuition in these algorithms is based on conducting a random walk
through the space in which the model is defined and subjecting each successive
randomly generated sample to an acceptance test that is based on the model
probability distribution. As you generate a candidate for the next sample at your
current point on the walk, you subject the acceptance of the candidate to the
ratio of the probabilities at the candidate point and the current point. In this
manner, you bias the acceptance of a candidate sample in such a way that you
end up with more samples in those portions of the model space where the
probabilities are relatively high. The generation of the new samples is according
to what is known as a proposal distribution. Since the acceptance of each sample
is predicated on just the previous sample that was already accepted, we obviously
have a Markov Chain. Hence the name MCMC for such algorithms.
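To make the random-walk logic concrete, here is a minimal sketch of Metropolis-Hastings sampling with a symmetric Gaussian proposal. The target density and all of the names here are made up for illustration; this is not the code in the Perl module mentioned on the next slide:

import numpy as np

# Sample from an unnormalized target density with the Metropolis-Hastings
# algorithm, using a symmetric Gaussian random-walk proposal.
def target_density(x):
    return np.exp(-0.5 * np.sum(x**2))            # an unnormalized 2D Gaussian

def metropolis_hastings(n_samples, dim=2, step_size=0.5, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    x = np.zeros(dim)                             # starting point of the walk
    for _ in range(n_samples):
        candidate = x + step_size * rng.standard_normal(dim)   # proposal
        # Accept with probability min(1, p(candidate)/p(current)):
        if rng.random() < target_density(candidate) / target_density(x):
            x = candidate
        samples.append(x.copy())                  # a rejection repeats the old sample
    return np.array(samples)

samples = metropolis_hastings(10000)
print(samples.mean(axis=0))                       # close to the origin

Because the proposal here is symmetric, the acceptance test reduces to the simple probability ratio described above; an asymmetric proposal would require the full Metropolis-Hastings correction involving the proposal densities.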
Preamble (contd.)
The following link is to a Perl module I created several years ago for helping
generate positive and negative training samples for a machine learning algorithm
using the Metropolis-Hastings algorithm for sample selection:
https://metacpan.org/pod/Algorithm::RandomPointGenerator
The machine learning program in this case was for classifying land-cover data
obtained from wide-area satellite imagery as described in
https://engineering.purdue.edu/RVL/Publications/CVIU_2016_Chang_Comandur_Park_Kak.pdf
Preamble (contd.)
The deep learning approach to data modeling that this lecture focuses on began
with the Generative Adversarial Networks (GANs) introduced by Goodfellow et al.
in the following 2014 publication, in which a Generator network learns to produce
new samples and a Discriminator network learns to tell them apart from the
training data:
https://arxiv.org/pdf/1406.2661.pdf
If p_data represents the probability distribution that describes the training data and
p_g represents the probability distribution that the Generator network has learned
so far, the goal of deep learning for probabilistic data modeling would be to
estimate the best values for the parameters θ_d and θ_g so that some measure of
the distance between the distributions p_data and p_g is minimized.
What’s interesting is that the deep learning framework that was actually
implemented by Goodfellow et al. did not directly minimize a distance between
p_data and p_g. Nevertheless, the authors were able to argue that if the
Discriminator was trained to an optimum level, it was guaranteed to yield a
solution for p_g that would be a minimum Jensen-Shannon-divergence
approximation to p_data.
Preamble (contd.)
For the reason stated at the bottom of the previous slide, I’ll start this lecture
with a brief survey of the more popular distances and divergences between two
given distributions.
For any such distance to be useful in a deep learning context, you would want to
treat it as a loss for the backpropagation needed for updating the parameters θ_d
and θ_g that I defined previously. That places an important constraint on what
kinds of distances can actually be used in a deep learning algorithm: the distance
must be differentiable so that we can calculate the gradients of the loss with
respect to the network parameters.
Over the last couple of years, the Wasserstein distance has emerged as a strong
candidate for such a differentiable distance function. And that has led to a
Generative Adversarial Network named WassersteinGAN that was presented by
Arjovsky, Chintala, and Bottou in the following 2017 publication:
https://arxiv.org/pdf/1701.07875.pdf
Preamble (contd.)
As I mentioned on the previous slide, I’ll start this lecture with a review of the
distance functions for probability distributions. That will get us ready to talk
about my implementation of DCGAN and WassersteinGAN in version 2.0.3 of the
DLStudio module:
https://engineering.purdue.edu/kak/distDLS/
If you are already familiar with the module and for whatever reason you just need
to “pip install” the latest version of the code, here is a link to its PyPi repository:
https://pypi.org/project/DLStudio/
The DCGAN that I mentioned above was first presented by Radford, Metz, and
Chintala in the following 2016 publication:
https://arxiv.org/pdf/1511.06434.pdf
Total Variation (TV) Distance
What that says is that we check every subset A of the domain R^n and
find the total difference between the probability mass over that subset
for the f and g densities. The largest value for this difference over all
such subsets is the TV distance between the two.
d_TV(P, Q) = (1/2) L_1(P, Q)        (5)
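For the discrete case, Eq. (5) is trivial to compute directly. Here is a small numeric check with two made-up three-outcome distributions:

import numpy as np

# Total Variation distance between two discrete distributions, per Eq. (5);
# the distributions P and Q are made up for illustration.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
d_tv = 0.5 * np.sum(np.abs(P - Q))     # half the L1 distance between P and Q
print(d_tv)                            # about 0.3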
Kullback-Leibler Divergence
Popularly known as KL-Divergence.
In this case, let’s start directly with the discrete case of a random
variable X as stated in the first two bullets on Slide 14. The
KL-Divergence between a true distribution P and its approximating
distribution Q is given by
d_KL(P, Q) = Σ_{i=1}^{N} P(x_i) log( P(x_i) / Q(x_i) )        (6)
d_KL(P, Q) is obviously the expectation of the log-ratios log( P(x_i) / Q(x_i) )
with respect to the P distribution. For the ratios to be defined you must
have Q(x_i) > 0 when P(x_i) > 0. Q(x_i) is allowed to be zero when P(x_i)
is zero since x log x → 0 as x → 0+.
KL-Divergence (contd.)
Since, in general, log x can return negative and positive values as x
increases from 0 to +∞, and since a negative value for KL-divergence
makes no sense, how can we be sure that the value of dKL (P, Q) is
always non-negative?
To see that the formula for dKL (P, Q) always returns a non-negative
value, we first subject that formula to the following rewrites:
d_KL(P, Q) = Σ_{i=1}^{N} P(x_i) log( P(x_i) / Q(x_i) )

           = − Σ_{i=1}^{N} P(x_i) log( Q(x_i) / P(x_i) )

           = − Σ_{i=1}^{N} P(x_i) log( [ P(x_i) + Q(x_i) − P(x_i) ] / P(x_i) )

           = − Σ_{i=1}^{N} P(x_i) log( 1 + [ Q(x_i) − P(x_i) ] / P(x_i) )

           = − Σ_{i=1}^{N} P(x_i) log( 1 + a )        (7)
KL-Divergence (contd.)
In the last equation on the previous slide, a = [ Q(x_i) − P(x_i) ] / P(x_i). The factor a
is lower bounded by −1, which happens when P(x_i) takes on the largest
possible value of 1 and Q(x_i) takes on the smallest possible value of 0.
Since log(1 + a) ≤ a for all a > −1, we have −log(1 + a) ≥ −a, and therefore:

d_KL(P, Q) ≥ − Σ_{i=1}^{N} P(x_i) · [ Q(x_i) − P(x_i) ] / P(x_i)

           = − Σ_{i=1}^{N} [ Q(x_i) − P(x_i) ]

           = 0        (8)

where the last step uses the fact that the probabilities in each of P and Q sum to 1.
KL-Divergence (contd.)
KL-Divergence CANNOT be a metric distance, not least because
what it calculates is asymmetric with respect to its two arguments.
Given its limitations (requiring Q(x) > 0 whenever P(x) > 0, and not
being a metric distance), students frequently want to know why
KL-Divergence is as “famous” as it is in the estimation-theoretic
literature. One reason for that is its interpretation as relative entropy:

d_KL(P, Q) = H_P(Q) − H(P)        (9)

where H_P(Q) is the cross-entropy of Q with respect to P and H(P) is
the entropy of P.
KL-Divergence (contd.)
Perhaps the most important role that KL-Divergence plays in the
ongoing discussion related to Adversarial Networks is that it is a
stepping stone to learning the Jensen-Shannon divergence (and the
closely related Jensen-Shannon distance) that is presented starting
with the next slide.
If you want to experiment with KL-Divergence numerically, scipy computes it
directly. With both arguments supplied, scipy.stats.entropy returns the
KL-Divergence of Eq. (6), using natural logarithms:

import scipy.stats
scipy.stats.entropy(P, Q)      # returns d_KL(P, Q) when Q is supplied
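As a quick sanity check, here is a made-up numeric example that evaluates Eq. (6) directly and compares the result with scipy’s answer:

import numpy as np
import scipy.stats

# Evaluate Eq. (6) directly and compare with scipy.stats.entropy.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
d_kl_manual = np.sum(P * np.log(P / Q))     # direct evaluation of Eq. (6)
d_kl_scipy = scipy.stats.entropy(P, Q)      # same value, natural logarithms
print(d_kl_manual, d_kl_scipy)              # both about 0.275

Swapping P and Q changes the answer, which is the asymmetry discussed earlier in this section.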
Jensen-Shannon Divergence and Distance
Both the divergence d_JS(P, Q) and the distance dist_JS(P, Q) are
symmetric with respect to the arguments P and Q. Additionally, they
do away with the “Q(x) > 0 when P(x) > 0” requirement of
KL-Divergence.
The value of d_JS(P, Q) is always a real number in the closed interval
[0, 1] when the logarithms are taken to base 2. When the value is 0, the
two distributions P and Q are identical. And when the value is 1, the
two distributions are as different as they can be.
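Here is a minimal sketch for computing both quantities, assuming the standard definition of the divergence in terms of KL-Divergence and the mixture M = (P + Q)/2, with base-2 logarithms so that the value stays in [0, 1]:

import numpy as np
import scipy.stats

# d_JS(P, Q) = (1/2) d_KL(P, M) + (1/2) d_KL(Q, M)  with  M = (P + Q)/2
def js_divergence(P, Q):
    M = 0.5 * (P + Q)
    return 0.5 * scipy.stats.entropy(P, M, base=2) + \
           0.5 * scipy.stats.entropy(Q, M, base=2)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
print(js_divergence(P, Q))             # about 0.096, the divergence d_JS
print(np.sqrt(js_divergence(P, Q)))    # about 0.31, the distance dist_JS

The distance dist_JS is the square root of the divergence; scipy.spatial.distance.jensenshannon computes that square root directly.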
Earth Mover’s Distance
Given two N -bin histograms f and g for the two images, you would
not be too far off the mark if the first idea that pops up in your head
would be to carry out a bin-by-bin comparison using a distance like:
d_Lr(f, g) = ( Σ_{i=1}^{N} | f_i − g_i |^r )^{1/r}        (14)
EMD is based on associating a cost with moving pixels from one bin
to another in a hypothetical attempt that tries to make the two
histograms look as similar as possible, constructing an overall cost
with all such pixel transfers, and then minimizing the overall cost.
In the transportation-problem formulation of EMD, you can think of one
histogram as a set of M providers, with the i-th provider holding a supply h_i,
and the other histogram as a set of N consumers, with the j-th consumer
having a demand g_j. And you also have a cost estimate c_ij that is the cost of
transporting a unit of the resource from the i-th provider to the j-th consumer.
Denoting by f_ij the amount of the resource transported from provider i to
consumer j, the flows must satisfy the following constraints:
f_ij ≥ 0        i = 1, ..., M,   j = 1, ..., N        (17)

Σ_{j=1}^{N} f_ij ≤ h_i        i = 1, ..., M        (18)

Σ_{i=1}^{M} f_ij ≤ g_j        j = 1, ..., N        (19)

Σ_{i=1}^{M} Σ_{j=1}^{N} f_ij = min{ Σ_{i=1}^{M} h_i ,  Σ_{j=1}^{N} g_j }        (20)
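To make the optimization concrete, here is a minimal sketch that solves this flow problem with scipy’s linear-programming routine and then, following the usual definition of EMD, divides the minimized total cost by the total flow. The function name emd_lp and the example histograms are made up:

import numpy as np
from scipy.optimize import linprog

def emd_lp(h, g, C):
    # Minimize sum_ij c_ij * f_ij subject to Eqs. (17)-(20), with the
    # flows f_ij flattened row-major into a vector.
    M, N = len(h), len(g)
    A_row = np.zeros((M, M * N))               # Eq. (18): sum_j f_ij <= h_i
    for i in range(M):
        A_row[i, i * N:(i + 1) * N] = 1.0
    A_col = np.zeros((N, M * N))               # Eq. (19): sum_i f_ij <= g_j
    for j in range(N):
        A_col[j, j::N] = 1.0
    total_flow = min(h.sum(), g.sum())         # Eq. (20)
    res = linprog(C.reshape(-1),
                  A_ub=np.vstack([A_row, A_col]),
                  b_ub=np.concatenate([h, g]),
                  A_eq=np.ones((1, M * N)), b_eq=[total_flow],
                  bounds=(0, None), method="highs")   # Eq. (17): f_ij >= 0
    return res.fun / total_flow                # normalize cost by total flow

h = np.array([0.5, 0.3, 0.2])                  # supplies of the M providers
g = np.array([0.2, 0.3, 0.5])                  # demands of the N consumers
C = np.abs(np.subtract.outer(np.arange(3.), np.arange(3.)))   # costs |i - j|
print(emd_lp(h, g, C))                         # about 0.6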
With that as an intro to EMD, the issue that should come up next
is whether it is possible to create a loss function directly from
EMD for adversarial learning. I’ll address this question later when I
get into the differentiability of the different distance functions.
Wasserstein Distance
Using d_W(P, Q) to denote the Wasserstein distance between the
distributions P and Q, here is its definition:

d_W(P, Q) = inf_{γ(X,Y) ∈ Γ(P,Q)} E_{(x,y) ∼ γ} [ ‖x − y‖ ]        (22)
The infimum required on the right side of Eq. (22) says that from the
set Γ(P, Q) of all joint distributions defined in the second bullet on
the previous slide, we need to zero in on the joint distribution γ(X , Y )
that minimizes the mean value of the normed difference kx − y k
with the sample pair (x, y ) drawn from the joint distribution.
For a comprehensive mathematical treatment of the Wasserstein distance and
optimal transport, see Cédric Villani’s monograph available at:
https://cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf
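Although the infimum over all couplings looks intractable, for one-dimensional distributions it has a closed-form solution in terms of the two CDFs, and scipy computes it directly. The support points and weights below are made up; note that the answer matches the earlier EMD example, as you would expect, since for two normalized histograms the EMD is exactly this distance:

import numpy as np
from scipy.stats import wasserstein_distance

# Wasserstein distance between two three-bin 1-D histograms.
support = np.array([0.0, 1.0, 2.0])
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
print(wasserstein_distance(support, support, P, Q))   # about 0.6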
A Random Experiment for Studying Differentiability
Therefore, the value of the Expectation operator in Eq. (22) will also
be equal to θ. In other words, for the random experiment under
consideration:
d_W(P, Q) = θ        (25)
The last bullet on the previous slide implies that x must span both the
lines X and Y for this integration. However, the sets X and Y are
disjoint except when the Generator parameter θ equals zero.
Carrying out this integration shows that, for θ ≠ 0, the Jensen-Shannon
divergence between P and Q works out to:

d_JS(P, Q) = log 2        (28)
The Total Variation (TV) distance for the continuous case was
defined in Eq. (1).
So we can write:
d_TV(P, Q) = 0        for θ = 0
d_TV(P, Q) = 1        for θ ≠ 0        (30)
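The practical significance of Eqs. (25), (28), and (30) is easy to see in code. Of the three results, only the Wasserstein distance varies smoothly with the Generator parameter θ, so only it can supply a useful gradient. The tiny illustration below hard-codes the values from those equations rather than computing them from the distributions:

import math

def distances(theta):
    d_w = abs(theta)                            # Eq. (25): varies smoothly with theta
    d_js = 0.0 if theta == 0 else math.log(2)   # Eq. (28): a step function
    d_tv = 0.0 if theta == 0 else 1.0           # Eq. (30): a step function
    return d_w, d_js, d_tv

for theta in [0.0, 0.01, 0.5, 1.0]:
    print(theta, distances(theta))   # only d_w changes with theta; the other two
                                     # provide zero gradient almost everywhere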
PurdueShapes5GAN Dataset for Adversarial Learning
Shown on the next slide are enlarged views of two of the images on
the previous slide. In addition to the sharp shape boundaries, you can
also see small holes inside the shapes.
The holes that you see inside the shapes were caused by intentionally
suppressing bilinear interpolation as the shapes were randomly
reoriented.
So the challenge for the data modeler is to not only reproduce the
shapes while preserving the sharp edges, but to also incorporate the
tiny holes inside the shapes, and to do so with probabilities that
reflect the training data.
DCGAN Implementation in DLStudio
The reason I need to take you back to this paper is that the basic
training logic in DCGAN is the same as that proposed in the above-cited
publication by Goodfellow et al.
For the training required for the Generator, only the second term
inside the square brackets in Eq. (33) matters. We proceed as follows,
with a minimal sketch of these steps in code below:
We note that the logarithm is a monotonically increasing function and
that the output D(G(z)) in the second term always lies between 0 and 1.
Therefore, the needed minimization translates into maximizing D(G(z))
with respect to a target value of 1.
With 1 as the target, we again form the nn.BCELoss, this time associated
with D(G(z)). We call backward() on this loss, making sure that we
have turned off requires_grad on the Discriminator parameters since here
we are updating only the Generator parameters.
A subsequent call to the optimizer’s step() then updates the
weights in the Generator network.
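Here is a minimal sketch of just this Generator-update step. The two networks below are tiny stand-ins for the actual DLStudio networks, and the Adam settings are the ones suggested in the DCGAN paper:

import torch
import torch.nn as nn

# Tiny stand-in networks, just to make the update step runnable end-to-end.
netG = nn.Sequential(nn.ConvTranspose2d(100, 3, kernel_size=4), nn.Tanh())
netD = nn.Sequential(nn.Conv2d(3, 1, kernel_size=4), nn.Flatten(), nn.Sigmoid())
optimizerG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

for p in netD.parameters():            # freeze the Discriminator parameters,
    p.requires_grad = False            # as described in the third step above

noise = torch.randn(8, 100, 1, 1)      # a batch of latent vectors z
target = torch.ones(8)                 # target of 1 for D(G(z))
netG.zero_grad()
output = netD(netG(noise)).view(-1)    # D(G(z)), always between 0 and 1
loss = criterion(output, target)       # BCE with target 1 equals -log D(G(z))
loss.backward()                        # gradients flow back into the Generator
optimizerG.step()                      # update only the Generator weights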
DCGAN Implementation in DLStudio
Figure: At the end of 30 epochs of training, shown at left is a batch of real images and, at right, the images produced by
the Generator from noise vectors
DCGAN Implementation in DLStudio
The following animated GIF shows how the Generator’s output evolves
over 30 epochs using the same set of noise vectors.
https://engineering.purdue.edu/DeepLearn/pdf-kak/DG1_generation_animation.gif
Making Small Changes to DCGAN Architecture
class GeneratorDG2(nn.Module):
    """
    The Generator for DG2 is exactly the same as for DG1. So please see the
    comment block for that Generator.
    """
    def __init__(self):
        super(AdversarialNetworks.DataModeling.GeneratorDG2, self).__init__()
        self.latent_to_image = nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0, bias=False)
        self.upsampler2 = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler3 = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler4 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler5 = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(512)
        self.bn2 = nn.BatchNorm2d(256)
        self.bn3 = nn.BatchNorm2d(128)
        self.bn4 = nn.BatchNorm2d(64)
        self.tanh = nn.Tanh()
    def forward(self, x):
        x = self.latent_to_image(x)                  # 100x1x1 latent -> 512x4x4
        x = torch.nn.functional.relu(self.bn1(x))
        x = self.upsampler2(x)                       # -> 256x8x8
        x = torch.nn.functional.relu(self.bn2(x))
        x = self.upsampler3(x)                       # -> 128x16x16
        x = torch.nn.functional.relu(self.bn3(x))
        x = self.upsampler4(x)                       # -> 64x32x32
        x = torch.nn.functional.relu(self.bn4(x))
        x = self.upsampler5(x)                       # -> 3x64x64
        x = self.tanh(x)                             # pixel values in [-1, 1]
        return x
Making Small Changes to DCGAN Architecture
Figure: At the end of 30 epochs of training, shown at left is a batch of real images and, at right, the images produced by
the Generator from noise vectors
Making Small Changes to DCGAN Architecture
The following animated GIF shows how the Generator’s output evolves
over 30 epochs using the same set of noise vectors for the case of a
DCGAN with relatively minor alterations.
https://engineering.purdue.edu/DeepLearn/pdf-kak/DG2_generation_animation.gif
Wasserstein GAN Implementation in DLStudio
The calculation of the Wasserstein distance using Eq. (34) also calls
for significant averaging of the output of the Critic in order for the
maximization to yield the desired distance. This can be taken care of
by having the Critic go through multiple iterations of the update of
its parameters for each iteration for the Generator, as in the sketch below.
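Here is a minimal sketch of that update schedule, following the defaults in the WassersteinGAN paper (five Critic updates per Generator update, RMSprop, and weight clipping at 0.01); the two networks are tiny stand-ins for the DLStudio ones:

import torch
import torch.nn as nn

# Note that the Critic ends without a Sigmoid: it outputs a score, not a probability.
netG = nn.Sequential(nn.ConvTranspose2d(100, 3, kernel_size=4), nn.Tanh())
netC = nn.Sequential(nn.Conv2d(3, 1, kernel_size=4), nn.Flatten())
optC = torch.optim.RMSprop(netC.parameters(), lr=5e-5)
optG = torch.optim.RMSprop(netG.parameters(), lr=5e-5)
n_critic, clip = 5, 0.01                       # defaults from the WGAN paper

def real_batch():                              # placeholder for the dataloader
    return torch.randn(8, 3, 4, 4)

for _ in range(n_critic):                      # several Critic updates ...
    netC.zero_grad()
    fake = netG(torch.randn(8, 100, 1, 1)).detach()
    # Maximize E[C(x)] - E[C(G(z))] by minimizing its negative:
    loss_c = netC(fake).mean() - netC(real_batch()).mean()
    loss_c.backward()
    optC.step()
    for p in netC.parameters():                # clip the weights to keep the
        p.data.clamp_(-clip, clip)             # Critic (roughly) 1-Lipschitz

netG.zero_grad()                               # ... for each Generator update
loss_g = -netC(netG(torch.randn(8, 100, 1, 1))).mean()    # maximize E[C(G(z))]
loss_g.backward()
optG.step()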
Wasserstein GAN Implementation in DLStudio
class GeneratorCG1(nn.Module):
    """
    The Generator code remains the same as for the DCGAN shown earlier.
    """
    def __init__(self):
        super(AdversarialNetworks.DataModeling.GeneratorCG1, self).__init__()
        self.latent_to_image = nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0, bias=False)
        self.upsampler2 = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler3 = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler4 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False)
        self.upsampler5 = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(512)
        self.bn2 = nn.BatchNorm2d(256)
        self.bn3 = nn.BatchNorm2d(128)
        self.bn4 = nn.BatchNorm2d(64)
        self.tanh = nn.Tanh()
    def forward(self, x):
        x = self.latent_to_image(x)                  # 100x1x1 latent -> 512x4x4
        x = torch.nn.functional.relu(self.bn1(x))
        x = self.upsampler2(x)                       # -> 256x8x8
        x = torch.nn.functional.relu(self.bn2(x))
        x = self.upsampler3(x)                       # -> 128x16x16
        x = torch.nn.functional.relu(self.bn3(x))
        x = self.upsampler4(x)                       # -> 64x32x32
        x = torch.nn.functional.relu(self.bn4(x))
        x = self.upsampler5(x)                       # -> 3x64x64
        x = self.tanh(x)                             # pixel values in [-1, 1]
        return x
######################################## CG1 Definition END ############################################
Wasserstein GAN Implementation in DLStudio
Figure: At the end of 500 epochs of training, shown at left is a batch of real images and, at right, the images produced by
the Generator from noise vectors
Wasserstein GAN Implementation in DLStudio
The following animated GIF shows how the Generator’s output evolves
during the training of the Wasserstein GAN, using the same set of
noise vectors.
https://engineering.purdue.edu/DeepLearn/pdf-kak/WGAN_generation_animation.gif