GAN Project Report
Abstract
This report explores the possibility of adapting image-based anomaly detection
to text-based anomaly detection. Two main approaches are proposed: anomaly
detection as a classification task, and unsupervised anomaly detection using
text patches. Both approaches use generative adversarial networks to perform
anomaly detection, and the results presented show that both can be fruitful.
1 Introduction
Anomaly detection is the task of identifying out-of-ordinary, unusual or unexpected data points.
Text-based anomaly detection could be interpreted as the identification of text containing malicious
intent, such as offensive language, hate speech, cyber-bullying, or sexual predatory behavior, or of text
expressing suicidal or depressive thoughts. Such textual data is difficult to obtain, as it is mostly
found in online chat-rooms, forums and social networking platforms. Social media data is increasingly
filled with offensive language, hate speech and cyber-bullying, which endangers the cyber-safety of both
children and adults online. Automated anomaly detection in this context would lighten the workload of
website and chat-room moderators, whose job is to maintain safe communication and interaction online.
In the context of detecting depressive or suicidal behavior, an automated anomaly detection system
could make a huge difference in people's lives, as early detection of users at risk would make proper
help and outreach possible, as Jamil et al.'s work suggests [1].
The biggest challenges with such data are the lack of labelled text (normal and anomalous labels), the
lack of negative examples due to highly unbalanced data sets (usually less than 10% of the samples are
anomalous), and the messy, unstructured, unfiltered nature of the text itself, which makes it difficult
to analyze.
By using generative models such as GANs we are able to learn the distribution of the normal data well.
The question is then how to use this knowledge to identify anomalies. In [2] the authors exploit the
fact that an anomalous image is one that was not learned by the generative model. To decide whether a
given query image is anomalous, one attempts to generate a similar image and computes its distance to
the query image. If this distance is "small", it can be reasoned that the query is not anomalous. If,
however, the generative model is unable to produce an image close to the query image, the query is
anomalous.
We extend this notion of similarity between generated and query images to the space of text. In the
process we face two issues: calculating the distance between two text samples, and backpropagating
gradients from the discriminative model to the generative model. We solve the former by making use
of word embeddings and calculating the distance between them. The latter issue arises because if
we generate sequences of words one token at a time, our discriminator is only able to evaluate complete
sentences. We solve this issue by grouping words into text-patches. This enables our generative model
to produce sequences of words that can be fed into the discriminator directly.
2 Related Work
In 2014 Goodfellow et al. [3] proposed the idea of generative adversarial nets, which consist of
two models: a generative model and a discriminative model. The generative model generates inputs for
the discriminative model, and the discriminative model estimates the probability that a given input came
from the real data rather than from the generative model. The generative model's goal is to maximize
the probability of the discriminative model making a mistake. See Figure 1.
Collectively the two models play the following min-max game:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
where p_d is the distribution from which real data is drawn, and p_z(z) is the prior noise distribution
(usually uniform) from which the generative model samples z.
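As a concrete illustration, the following is a minimal sketch of one alternating step of this game,
assuming a TensorFlow/Keras setup with hypothetical generator and discriminator models; it is not the
architecture used later in this report.

    import tensorflow as tf

    def gan_train_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim):
        # Sample z from the (uniform) noise prior p_z.
        z = tf.random.uniform((tf.shape(real_batch)[0], latent_dim), -1.0, 1.0)
        with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
            fake_batch = generator(z, training=True)
            d_real = discriminator(real_batch, training=True)   # D(x)
            d_fake = discriminator(fake_batch, training=True)   # D(G(z))
            eps = 1e-8
            # Discriminator ascends V(D, G): log D(x) + log(1 - D(G(z))).
            d_loss = -tf.reduce_mean(tf.math.log(d_real + eps) +
                                     tf.math.log(1.0 - d_fake + eps))
            # Generator descends log(1 - D(G(z))), i.e. tries to fool D.
            g_loss = tf.reduce_mean(tf.math.log(1.0 - d_fake + eps))
        d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                  discriminator.trainable_variables))
        g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                  generator.trainable_variables))
        return d_loss, g_loss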
In [4], the authors tackle the problem of generating sequences of discrete tokens with GANs. Since the
discriminative model can only judge complete sequences while the generative model produces tokens that
form partial sequences, it is not obvious how to reconcile the intermediate sequence scores with the
score for the complete sequence. The proposed SeqGAN approach models the data generator as a stochastic
policy in reinforcement learning (RL): "The RL reward signal comes from the GAN discriminator judged on
a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo
search."
Basing their research on SeqGAN, the authors of [5] proposed FakeGAN, an augmentation of the GAN model
that is effective at detecting deceptive reviews. The problem the paper tackles is differentiating
truthful from deceptive movie reviews, where deceptive reviews are those generated by bots or paid
agents and truthful reviews are those posted by real people. Note that being truthful or deceptive says
nothing about the sentiment (positive or negative) of the review itself. To solve this problem, instead
of a single discriminator, the authors use two discriminator models D and D′, with a single generative
model G. The D model must differentiate between truthful and deceptive reviews, while D′ must
differentiate between data produced by the generative model G and samples from the deceptive-reviews
distribution. Unlike most GAN proposals, which aim to improve the generative model, this paper tries to
improve the discriminator(s) instead. Similar to SeqGAN, the authors use an RL approach, where the
discriminator uses Monte Carlo search to produce a reward signal that is passed on as a gradient update
to the generative model, with the generator itself modeled as a stochastic policy with an action-value
function A_{G_\alpha, D, D'}(s, a), where G_\alpha is the generator, s is the sequence of tokens
produced so far, and a is the next token.
The authors extend this to a function of t so that a partial sentence can be completed using Monte
Carlo search:

\{S^1_{1:L}, S^2_{1:L}, \dots, S^N_{1:L}\} = MC^{G'_\gamma}(S_{1:t}, N)

where S^i_{t+1:L} is "sampled via roll-out policy G'_\gamma based on the current state S_{1:t}".
The authors state that the FakeGAN discriminator converges in practice and avoids mode collapse, yet
they do not provide a formal proof. Model training alternates between g steps of generator training
and d steps of discriminator updates.
The objective function for the generator is:

J(\alpha) = \sum_{S_1 \in \mathcal{X}} G_\alpha(S_1 \mid S_0) \, A_{G_\alpha, D, D'}(a = S_1, s = S_0)
Given the gradient \nabla_\alpha J(\alpha) of the above, the generator's parameters are updated as
follows:

\alpha \leftarrow \alpha + \lambda \nabla_\alpha J(\alpha)

where X_D denotes the deceptive reviews and X_T the truthful reviews. For the architecture, the authors
used an RNN for the generator (an LSTM with a softmax output layer) and a CNN for the discriminator.
FakeGAN achieved an accuracy of 89.1%.
In [6], the authors propose to use the Earth-Mover (EM) distance, or Wasserstein-1, as the measure
between the real and model distributions. For two distributions P_r and P_g it is defined as:

W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]

where \Pi(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g.
It is shown that the EM distance, when used as a loss metric, improves training stability and removes
mode collapse. Since the EM distance is hard to compute directly in a GAN, the authors propose an
approximation called Wasserstein-GAN (WGAN), which they successfully use to train their model.
Furthermore, the paper is the first to propose a loss measure that correlates well with the visual
quality of the generated samples. Unlike other measures such as the JS divergence, for which the loss
fluctuates regardless of sample quality, the WGAN loss improves together with sample quality. This is
because the EM distance is continuous and differentiable, which means that WGAN can keep learning where
other models suffer from vanishing gradients.
In [7] the authors tackle the task of image inpainting; this forms the foundation for the AnoGAN
algorithm described below. Unlike previous approaches that use local or non-local information to
recover the image, the authors instead use a GAN:
      Given a trained generative model, we search for the closest encoding of the cor-
      rupted image in the latent image manifold using our context and prior losses. This
      encoding is then passed through the generative model to infer the missing content.
Back-propagation is used to find the encoding in the latent (noise) space that is closest to the
requested image. Both the generative model G and the discriminative model D are trained on normal
(uncorrupted) image data. The generator can then take a given point z in the latent space p_z and
produce a point G(z) that is similar to a sample from the data distribution p_{data}. The authors aim
to find the optimal ẑ that is closest to the presented corrupted image. Once found, G(ẑ) is computed
to produce an image, which is then blended with the corrupted image to fill in the missing information.
The authors use two losses. The first is the contextual loss:

\mathcal{L}_{contextual}(z) = \lVert W \odot (G(z) - y) \rVert_1

where W is an importance weighting term that assigns greater value to those uncorrupted pixels which
have more corrupted pixels around them, y is the corrupted image, and M is the mask that defines where
the information is to be filled in. The second is the prior loss:

\mathcal{L}_{prior}(z) = \lambda \log(1 - D(G(z)))

where λ regulates the contribution of this loss. The prior loss penalizes unrealistic images produced
by the generator.
Given these losses, the authors iteratively map the corrupted image y to a point in the latent space
until they arrive at the closest ẑ, updating the encoding by back-propagating the total loss:

\hat{z} = \arg\min_z \left(\mathcal{L}_{contextual}(z) + \mathcal{L}_{prior}(z)\right)
The authors get good results on image inpainting, doing better than state-of-the-art models. Curiously,
the model sometimes has trouble finding ẑ, which results in an incorrect image being generated. This
difference is exactly what the AnoGAN model below uses to compute the anomaly score.
In [2], the authors proposed AnoGAN, an anomaly detection approach for medical images that extends the
approach in [7] by interpreting the distance of generated images from the query image as a measure of
anomaly. The algorithm proceeds in two phases: training and anomaly detection. Before continuing, the
authors split the images into c × c image patches, which reduces the dimensionality of the model; only
random samples of these patches are used for both training and anomaly detection. For training, the
usual GAN training procedure is applied to normal data only: given a randomly sampled image patch x,
the discriminative model is trained on x and G(z) for some z ∈ Z, where Z is uniformly distributed
noise. For anomaly detection, the task is to determine whether a given query image patch is anomalous.
The procedure is to pick a random z ∈ Z and propagate it to two outputs: G(z) and f(G(z)), where f(.)
is the output of some intermediate layer of the discriminative model that encodes features. Two losses
are then calculated. The first is the residual loss, which measures the dissimilarity between the query
image x and the generated image G(z):
\mathcal{L}_R(z_\gamma) = \sum |x - G(z_\gamma)|
Second, the discrimination loss, which measures the dissimilarity in features extracted by the
discriminator:
\mathcal{L}_D(z_\gamma) = \sum |f(x) - f(G(z_\gamma))|
Here λ is the relative weight given to each type of loss. The coefficients of z are then updated via
back-propagation while everything else remains fixed, effectively producing a new z′, and the process
is repeated for Γ iterations.
Finally, an anomaly score is computed as follows:
A(x) = (1 - \lambda) R(x) + \lambda D(x)
where R(x) = \mathcal{L}_R(z_\Gamma) and D(x) = \mathcal{L}_D(z_\Gamma) are the two losses evaluated at
the last iteration of the anomaly detection procedure. A low anomaly score is interpreted as the data
being non-anomalous, while a large anomaly score means the data is anomalous. The authors provide no
way to normalize the scale of anomaly scores.
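To make the procedure concrete, here is a minimal sketch of the anomaly-scoring step, assuming
TensorFlow/Keras models G (the generator) and f (the intermediate discriminator layer exposed as a
model); the variable names are hypothetical and this is not code from [2].

    import tensorflow as tf

    def anogan_score(x, G, f, lam=0.1, steps=500, latent_dim=100, lr=0.1):
        # Start from a random point z in the latent space Z.
        z = tf.Variable(tf.random.uniform((1, latent_dim), -1.0, 1.0))
        opt = tf.keras.optimizers.Adam(lr)
        for _ in range(steps):  # Gamma iterations of latent-space search
            with tf.GradientTape() as tape:
                g_z = G(z, training=False)
                residual = tf.reduce_sum(tf.abs(x - g_z))        # L_R(z)
                discrim = tf.reduce_sum(tf.abs(f(x) - f(g_z)))   # L_D(z)
                loss = (1.0 - lam) * residual + lam * discrim
            # Only z is updated; G and f stay fixed.
            opt.apply_gradients([(tape.gradient(loss, z), z)])
        # Anomaly score A(x) = (1 - lambda) R(x) + lambda D(x) at the last iteration.
        return float(loss)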
3 Method
Our work originates from Schlegl et al.'s [2] work on image-based anomaly detection (AnoGAN). The
authors proposed to train a generative adversarial model on healthy-anatomy image patches (normal
data), then use anomaly scores to detect anomalous image patches. Our proposed work adapts this
approach to text-based anomaly detection.
The use of generative adversarial networks in text-based anomaly detection is in an early phase of
development; to our knowledge FakeGAN [5] was the first approach. While image-based anomaly detection
was shown to be successful in Schlegl et al.'s work [2], our main hypothesis is that text-based anomaly
detection is possible using the same general approach, with additional adaptation towards textual
input.
Our plan was to start out with two different approaches, evaluate which one has more potential, and
then continue experimenting with the more successful one. We describe these two anomaly detection
adaptations from image to text data in Sections 3.2 and 3.3.
3.2 Anomaly detection as a classification task
The backbone of this approach is to formulate the anomaly detection task as a two-class classification
problem of discriminating between normal and anomalous data. During training the generative adversarial
model only sees one class, and it learns to treat that class as the "real" data. Anything the generator
produces is classified as "fake" by the discriminator until the generator becomes good enough at
learning the distribution of this one class, the "real" data. Once the generator learns the
distribution of the "real" data, the discriminator has had to become good at accurately recognizing
"real" data. After training, the discriminator can be used to classify between "real" and "not-real"
data, which can be thought of as normal and not-normal, i.e. anomalous, data.
Our approach aims to learn what normal data looks like by feeding only normal data into the GAN model,
as in Schlegl et al.'s work [2], which is our main motivation for this approach. Training on the class
with the larger sample size gives the generator a better chance to learn the distribution of normal
data, while the discriminator learns to recognize normal data. Our hypothesis is that the discriminator
will learn what normal data looks like and be able to classify it, and that when presented with
anomalous data it will recognize it as not normal, thus classifying it as anomalous.
However, we are making one big assumption. The discriminator learns to distinguish between real and
generated data, while we are trying to classify normal versus anomalous data. We assume that
classifying real versus generated data behaves the same way as classifying normal versus anomalous
data. If this assumption fails, it can be corrected in future work with an approach similar to
FakeGAN [5].
The final system uses only the discriminator to classify both normal and anomalous unseen text
sequences. We chose to train only on normal data, as that is the class that is always available, with
or without annotated labels, and it is the approach used by Schlegl et al.'s AnoGAN [2]. Training on
anomalous data would make it considerably more difficult for the generative model to learn the real
distribution of the data: even after joining multiple anomalous data sources, there are barely over
1,000 samples to train on, which tends not to be enough for training generative models.
3.3 AnoGAN-based approach using text patches
We propose to replace the image patches of [2] with text-patches and adapt AnoGAN to work with discrete
data for anomaly detection. We train our GAN architecture using non-anomalous data only. Given a
randomly sampled query text-patch from a given text, we use the generator to generate the closest
possible match, and use word-vector distance to compute the anomaly score. If any text-patch is
anomalous, we consider the whole text anomalous.
The main problem with applying GAN models to text generation is that they are not well suited to
generating sequences of discrete tokens: the generative model outputs discrete tokens one at a time,
while the discriminative model judges only complete sequences. This makes it difficult to pass gradient
updates from the discriminative model to the generative model. Unlike other GAN approaches
[4][5][6][8], which use RL techniques to predict the discriminator output, we eliminate the issue
entirely by making the generative model produce a fixed-length sequence of outputs. Our final system
consists of two parts: a generative and a discriminative model. The generative model produces a set of
word embeddings, called text-patches (see Section 4.2.1), which are passed to the discriminative model.
This way the discriminative model can immediately act on every output produced by the generator. Our
hypothesis is that using text-patches will allow our generative model to learn the distribution
inherent in the text data, and that our anomaly detection procedure will be able to use this to
distinguish between depressive and non-depressive samples.
While the original AnoGAN paper made extensive use of convolutional layers for image anomaly detection,
we chose to implement our system with LSTMs due to the sequential nature of textual data.
4 Experiments
4.1 Anomaly detection as text classification
Our implementation of training the generative adversarial model is based on Shibuya's deep learning
applications [9], which in turn are based on a Udacity tutorial on training generative adversarial
models. FastText pre-trained word embeddings [10] were used to represent text in vector form. We used
the model containing two million word vectors trained on Common Crawl. This model uses 300-dimensional
word embeddings; it is publicly available and can be downloaded from FastText's official website.
Text pre-processing was done in two cleaning stages. In the first stage, all excessive white space,
numeric values, HTML tags, punctuation and hyperlinks were stripped, and the text was converted to
lower case. In the second stage we used a tweet-processing library specially designed for cleaning
tweet data by removing mentions, reserved words, emoticons and hash-tag signs. After the cleaning
stage, a quick pass through all texts finds the longest sequence of text; in our data set of tweets
this is a pre-processed tweet of 32 words. Using this number we feed all text tokens into a sequence
padding algorithm and apply post-padding to all text sequences shorter than the longest one. This step
creates text sequences of equal length, which can be fed into the word embedding model to create a
matrix of n (number of observations) × 32 (maximum text sequence length) × 300 (word embedding
dimension).
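The following is a minimal sketch of this two-stage cleaning and padding pipeline, using plain regular
expressions for both stages (the actual tweet-cleaning library is not named in this report) and a
hypothetical embed_word lookup into the FastText vectors.

    import re
    import numpy as np

    def clean(text):
        # Stage 1: strip hyperlinks, HTML tags, numbers and punctuation; lower-case.
        text = re.sub(r"https?://\S+", " ", text)
        text = re.sub(r"<[^>]+>", " ", text)
        # Stage 2 (tweet-specific): drop mentions and the '#' of hash-tags.
        text = re.sub(r"@\w+", " ", text)
        text = text.replace("#", " ")
        text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
        return re.sub(r"\s+", " ", text).strip()

    def to_matrix(texts, embed_word, max_len=32, dim=300):
        # Post-pad every token sequence to max_len, then embed each word,
        # yielding an (n, 32, 300) array as described above.
        out = np.zeros((len(texts), max_len, dim), dtype=np.float32)
        for i, text in enumerate(texts):
            for j, tok in enumerate(clean(text).split()[:max_len]):
                out[i, j] = embed_word(tok)  # hypothetical FastText lookup
        return out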
6
Table 1: Performance measures on classifying using the discriminator when trained on normal data
Experiments consisted of trying different model architectures for both the generator and the
discriminator, keeping the hyperparameters constant and checking which architecture produces better F1
scores on distinguishing normal from anomalous data. We started with the discriminator, trying
different combinations of dense and RNN layers, only dense layers, and only RNN layers. We observed
that using RNNs in the discriminator solely for classification caused over-fitting, as the RNN seemed
to memorize entire sequences of data: after training, the discriminator would classify anything it had
not seen as anomalous and anything close to what it had seen as normal, so for an unseen test set all
observations were classified as anomalous. We therefore dropped RNNs from the discriminator in favor of
a very simple MLP architecture: a dense layer, a max pooling layer, and a single-unit output layer with
a sigmoid activation to predict the label.
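The following is a minimal Keras sketch of such a discriminator, under the assumption that the dense
layer is applied per time step over the 32 × 300 input and that max pooling is global over time; the
hidden layer size here is illustrative, not a reported value.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, GlobalMaxPooling1D, Input

    discriminator = Sequential([
        Input(shape=(32, 300)),          # padded sequence of word embeddings
        Dense(128, activation="relu"),   # dense layer applied per time step
        GlobalMaxPooling1D(),            # max pooling over the 32 positions
        Dense(1, activation="sigmoid"),  # single-unit output: P(real)
    ])
    discriminator.compile(optimizer="adam", loss="binary_crossentropy")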
For the generator we experimented with combinations of dense layers and RNN layers such as LSTM,
SimpleRNN and GRU layers. We built five different models, labeled 1 to 5 in Table 1. All five models
have the same discriminator, the one described above, but each model has a different generator.
Throughout experimentation we noticed that the generator has to be the "stronger" model, while the
discriminator should be a simple model that can classify between two classes. Model 1 has a
time-distributed dense layer with 300 hidden units and a Leaky ReLU with parameter 0.01, followed by an
LSTM layer with 300 hidden units. Model 2 has the same architecture as model 1, except that it uses a
SimpleRNN layer instead of the LSTM layer. Model 3 is similar to model 2, but its time-distributed
dense layer has only 100 hidden units. Model 4 has a time-distributed dense layer with 100 hidden units
using the same Leaky ReLU activation, followed by a Gated Recurrent Unit (GRU) layer with 300 hidden
units. Model 5 is similar to model 4, except that it uses 300 hidden units instead of 100 for the
time-distributed dense layer. All RNNs had to have 300 hidden units to match the 300-dimensional word
embeddings, while the input to the generator is a sequence of word embeddings of size 32 (maximum,
padded, text sequence length) × 300. We trained our generative adversarial networks for 5000 epochs. We
used a learning rate of 0.0001 for both the generator and the discriminator, the default value set by
previous implementations. We experimented with latent space sizes of 32 and 100 for our generator and
found that 100 worked better. We used a training batch size of 64 and an evaluation batch size of 16,
the default values from the previous implementation.
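A minimal Keras sketch of the Model 5 generator might look as follows; the latent input here is assumed
to be a 32-step sequence of 100-dimensional noise vectors, which is our reading of the description
above rather than a reported detail.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, GRU, Input, LeakyReLU, TimeDistributed

    generator = Sequential([
        Input(shape=(32, 100)),            # 32-step latent noise sequence
        TimeDistributed(Dense(300)),       # time-distributed dense, 300 units
        LeakyReLU(0.01),                   # Leaky ReLU with parameter 0.01
        GRU(300, return_sequences=True),   # GRU emits a 32 x 300 sequence,
    ])                                     # matching the word-embedding shape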
We evaluated these models based on overall F1 score, but most importantly we tried to make sure that
the overall model behaves properly. Models 1 and 4 have a very high precision value, over 90%. This is
not because the model is genuinely good at predicting anomalous data, but because it almost always
predicts anomalous: it thus has high precision, while its recall shows that it does not do well
overall. In our opinion models 1 and 4 should not be declared the best models, even though their F1
scores beat the score obtained by Schlegl et al.'s discriminator model in AnoGAN [2]. Model 5, using a
GRU layer, comes close to matching the performance of the AnoGAN discriminator with an F1 score of
0.6242; its recall is the same as the AnoGAN discriminator's, while we lose two percentage points on
precision. We therefore declare model 5 our best model, based on its good F1 score and on not sharing
the fault of models 1 and 4 of always predicting one class. This model's architecture can be seen in
Figure 2.
Our main experiments for this section came from augmenting our data set with more depressive data
collected from a mental health forum called Time To Change. The nature of the data is perfectly suited
for anomaly detection, as it contains the personal stories of people struggling with depression. In
order to make sure that the models learn exactly what we intend them to learn, we wanted the normal
class to be the total opposite of depressive stories. For the normal data we chose the positive reviews
from the IMDB movie review data set, thus maximizing the semantic separation between the normal and
anomalous classes.

Figure 2: GAN Model 5 - GRU based model architecture from Table 1
4.2.2 Datasets
We use two datasets for training and testing: the IMDB dataset [11], which contains both positive and
negative movie reviews, and depressive data collected from the Time To Change mental health forum.
We chose to use only the positive reviews from the IMDB dataset and produced a train-test split. The
data was then cleaned of any HTML tags, and the depressive data was similarly cleaned. After cleaning,
the data was converted to word vectors using the FastText model trained on Common Crawl and
Wikipedia [12].
To prepare our data for training, we simply scan over the sentences and produce patches of K words
each. Each patch has dimension K × M, where M is the embedding dimension; for FastText, M = 300. For a
sentence i of length W_i words, we can extract P_i patches, where:

P_i = \left\lfloor \frac{W_i}{K} \right\rfloor
Any extra words remaining in the sentence are dropped. The number of patches in our data sets is
usually around 50-150 per sentence. We proceed by creating as many patches as will fit into each
sentence and then collect all the patches into a single sequence of size T × K × M, where
T = \sum_i P_i is the total number of patches over all samples. This forms the input to our training
procedure.
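A minimal sketch of this patch extraction, assuming each sentence is already embedded as a W_i × M
array; the function names are ours.

    import numpy as np

    def extract_patches(sentence, K):
        # sentence: (W_i, M) array of word embeddings; yields P_i = floor(W_i / K)
        # patches of shape (K, M), dropping any leftover words.
        P = len(sentence) // K
        return [sentence[p * K:(p + 1) * K] for p in range(P)]

    def training_input(sentences, K):
        # Concatenate the patches of all sentences into a (T, K, M) array.
        return np.array([patch for s in sentences for patch in extract_patches(s, K)])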
For anomaly detection we require our data to be sets of patches randomly sampled from each sentence.
The reason we do not use all the data for anomaly detection is mainly performance: computing the
anomaly score takes a long time (see below). To prepare data for the anomaly detection phase, we start
by converting our data set of sentences into a data set of text-patches. For a given sentence i, we
produce P_i patches and then randomly sample B of them. Samples that do not contain a sufficient number
of patches, i.e. P_i < B, are skipped.
At the end of the data preparation step, our anomaly detection procedure receives a vector of size
N × B × K × M . Here, again, the N is the number of samples, B is the number of patches per
sample, K is the size of each patch in word vectors and M is the size of the embedding of each
word-vector (with M = 300 for FastText).
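Continuing the sketch above, the anomaly detection input could be assembled as follows;
extract_patches is the hypothetical helper from the previous listing.

    def anomaly_input(sentences, K, B, rng=np.random.default_rng()):
        # Build an (N, B, K, M) array: B randomly sampled patches per sentence,
        # skipping sentences with fewer than B patches (P_i < B).
        batches = []
        for s in sentences:
            patches = extract_patches(s, K)
            if len(patches) >= B:
                idx = rng.choice(len(patches), size=B, replace=False)
                batches.append(np.array([patches[i] for i in idx]))
        return np.array(batches)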
Recall that we split the IMDB dataset into training and testing, while we keep the depressive dataset
as is. We used the training portion of the IMDB dataset as normal data for training the GAN. We
used the testing portion as normal data for testing the GAN, while the depressive data was used as
anomalous data for testing the GAN. For easy reference, let us denote normal data by a + and anomalous
data by a −; then we can describe our data as follows:
• T+ - Training portion of the IMDB dataset, used for training the GAN.
• A+ - Test portion of the IMDB dataset used for anomaly detection.
• D− - The depressive dataset.
Note that the + and − have nothing to do with positive or negative connotation in the data; they simply
indicate whether the data is normal or anomalous from the point of view of the discriminative model of
the GAN. We next proceed to training our model and detecting anomalies.
4.2.3 Training
For training we make use of T+ data only, which is the training split of the IMDB dataset. Let G
be the generative model, let D be the discriminative. Then, given N samples of size Pi × K × M ,
where i = 1 . . . N , we proceed to train our GAN as follows:
Table 2: Different configurations of anomaly detection experiment.

The anomaly detection phase then proceeds as follows:
1. Form an anomaly detector A from our previously trained D and G models (see Figure 5).
2. Sample some noise z ∈ Z from a uniform distribution.
3. Produce d_x = f(x) by passing our query through the discriminator but stopping at the middle layer f (see Figure 4, left).
4. Compute the loss A(x) − A(d_x).
5. Backpropagate the gradients to our anomaly detector's trainable dense layer and repeat (a sketch of this loop follows below).
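The following is a minimal sketch of steps 1-5, reusing the G, D and mid-layer f models from the
earlier sketches. The trainable dense layer mapping the input to the latent space, and the assumption
that f's output has the same shape as the input patches (so that both A(x) and A(d_x) are valid), are
our reading of the steps above, not reported architecture details.

    import tensorflow as tf
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Dense, Input

    def build_anomaly_detector(G, D, patch_shape=(32, 300), latent_dim=100):
        # Step 1: freeze the trained G and D; the only trainable weights are a
        # dense layer mapping the input to a latent point, decoded by G and
        # scored by D.
        G.trainable = False
        D.trainable = False
        inp = Input(shape=patch_shape)
        z = Dense(latent_dim)(inp)       # the trainable dense layer
        return Model(inp, D(G(z)))

    def detect(A, f, x, steps=100):
        # x: one query text-patch with a leading batch dimension.
        d_x = f(x)                       # step 3: features from the mid-layer f
        opt = tf.keras.optimizers.Adam(1e-3)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                loss = tf.reduce_mean(A(x) - A(d_x))             # step 4
            grads = tape.gradient(loss, A.trainable_variables)
            opt.apply_gradients(zip(grads, A.trainable_variables))  # step 5
        return float(loss)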
Figure 4: Architecture of the Generative G and Discriminative D Models. The Discriminative Model
contains a middle layer that is used during the anomaly detection phase.
Figure 5: Architecture of the Anomaly Detector. Here model_17 is the generative model G and
model_16 is the discriminative model D.
Table 3: Performance results of anomaly detection using text patches

The anomaly scores of the normal test data are used to create an anomaly score threshold value based on
the normal text samples' text-patch anomaly scores. This threshold is the mean of the normal text-patch
anomaly scores plus 2 standard deviations of the normal text-patch anomaly scores. Any anomaly score
above this threshold can be considered an anomalous text patch, and the text sample that contains such
a patch is classified as anomalous.
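A minimal numpy sketch of this thresholding rule; normal_scores and sample_scores are hypothetical
arrays of patch anomaly scores.

    import numpy as np

    def classify(normal_scores, sample_scores):
        # Threshold = mean + 2 standard deviations of the normal patch scores.
        threshold = np.mean(normal_scores) + 2 * np.std(normal_scores)
        # A sample is anomalous if any of its patches exceeds the threshold.
        return bool(np.any(sample_scores > threshold))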
5 Conclusion
This research work investigated the possibility of adapting current research on image-based anomaly
detection to text-based anomaly detection. Two main approaches were proposed, namely anomaly detection
as a classification task and unsupervised anomaly detection using text patches, both of them using
generative adversarial networks. The best model for the first approach achieves an F1 score of 0.6242,
while the second approach outperforms the first with an F1 score of 0.6592. Both approaches could use
further improvement, as both fall short of matching the state-of-the-art performance of AnoGAN, with
its F1 score of 0.7980, but both newly proposed approaches were nonetheless shown to work. Future work
could involve expanding the AnoGAN architecture to use larger text patches and deeper model
architectures. Capsule Networks might be able to better capture the normal data distribution.
Additionally, it would be highly beneficial to try a WGAN-style loss function, which correlates well
with the quality of the generated samples.
References
[1] Zunaira Jamil, Diana Inkpen, Prasadith Buddhitha, and Kenton White. Monitoring tweets for
depression to detect at-risk users. In Proceedings of the Fourth Workshop on Computational
Linguistics and Clinical Psychology—From Linguistic Signal to Clinical Reality, pages 32–40,
2017.
[2] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg
Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker
discovery. In International Conference on Information Processing in Medical Imaging, pages
146–157. Springer, 2017.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[4] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial
nets with policy gradient. In AAAI, pages 2852–2858, 2017.
[5] Hojjat Aghakhani, Aravind Machiry, Shirin Nilizadeh, Christopher Kruegel, and Giovanni
Vigna. Detecting deceptive reviews using generative adversarial networks. 2018 IEEE Security
and Privacy Workshops (SPW), May 2018.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint
arXiv:1701.07875, 2017.
[7] Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson,
and Minh N Do. Semantic image inpainting with deep generative models. In CVPR, volume 2,
page 4, 2017.
[8] Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe
Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-
mover’s distance. In Advances in Neural Information Processing Systems, pages 4667–4678,
2018.
[9] Naoki Shibuya. Using GAN for generating hand-written digit images, 2017.
[10] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin.
Advances in pre-training distributed word representations. In Proceedings of the International
Conference on Language Resources and Evaluation (LREC 2018), 2018.
[11] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher
Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies, pages
142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[12] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for
efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[13] Lucas Deecke, Robert Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. Anomaly
detection with generative adversarial networks, 2018.