S: What are the dumbest questions ever asked on Quora?
G: what is the stupidest question on quora?
R: What is the most stupid question asked on Quora?

S: How can I lose fat without doing any aerobic physical activity
G: how can i lose weight without exercise?
R: How can I lose weight in a month without doing exercise?

S: How did Donald Trump won the 2016 USA presidential election?
G: how did donald trump win the 2016 presidential
R: How did Donald Trump become president?

Table 1: Examples of our generated paraphrases on the QUORA sampled test set, where S, G, R represent the Source, Generated and Reference sentences, respectively.

S: Three dimensional rendering of a kitchen area with various appliances.
G: a series of photographs of a kitchen
R: A series of photographs of a tiny model kitchen

S: a young boy in a soccer uniform kicking a ball
G: a young boy kicking a soccer ball
R: A young boy kicking a soccer ball on a green field.

S: The dog is wearing a Santa Claus hat.
G: a dog poses with santa hat
R: A dog poses while wearing a santa hat.

S: the people are sampling wine at a wine tasting.
G: a group of people wine tasting.
R: Group of people tasting wine next to some barrels.

Table 2: Examples of our generated paraphrases on the MSCOCO sampled test set, where S, G, R represent the Source, Generated and Reference sentences, respectively.
sentence Y = (y_1, ..., y_m) | ∃ y_m ∉ S, with m words, that conveys semantics similar to S, where preferably m < n, but not necessarily.

3 Method

In this section, we present our framework for paraphrase generation. It follows the popular encode-decode paradigm, but with two stacked layers of encoders. The first encoding layer is a transformer encoder, while the second encoding layer is a GRU-RNN encoder. The paraphrase of a given sentence is generated by a GRU-RNN decoder.

3.1 Stacked Encoders

3.1.1 Encoder – TRANSFORMER
We use the transformer encoder as a form of pre-training module for our input sentence. The goal is to learn a richer representation of the input vector that better handles long-term dependencies and captures syntactic and semantic properties before a fixed-state representation is obtained for decoding into the desired output sentence. The transformer consists of 6 stacked identical layers driven mainly by self-attention, as implemented by Vaswani et al. (2017, 2018).

3.1.2 Encoder – GRU-RNN
Our architecture uses a single-layer uni-directional GRU-RNN whose input is the output of the transformer. The GRU-RNN encoder (Chung et al., 2014; Cho et al., 2014) produces a fixed-state vector representation of the transformed input sequence using the following equations:

    z   = σ(x_t U^z + s_{t−1} W^z)               (1)
    r   = σ(x_t U^r + s_{t−1} W^r)               (2)
    h   = tanh(x_t U^h + (s_{t−1} ⊙ r) W^h)      (3)
    s_t = (1 − z) ⊙ h + z ⊙ s_{t−1}              (4)

where r and z are the reset and update gates respectively, W and U are the network's parameters, s_t is the hidden state vector at timestep t, x_t is the input vector, and ⊙ denotes the Hadamard product.

3.2 Decoder – GRU-RNN
The fixed-state vector representation produced by the GRU-RNN encoder is used as the initial state for the decoder. At each time step, the decoder receives the previously generated word y_{t−1} and the hidden state s_{t−1} from time step t−1. The output word y_t at each time step is a softmax probability of the vector in equation 3 over the set of vocabulary words V.
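To make equations (1)-(4) concrete, here is a minimal NumPy sketch of a single GRU step. The parameter names U_z, W_z, etc. mirror the symbols above, biases are omitted exactly as in the equations, and the function is purely illustrative; the authors' implementation relies on tensor2tensor/TensorFlow GRU cells rather than this code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, p):
    """One GRU encoder step, mirroring equations (1)-(4).

    x_t    : input vector at timestep t, shape (d,)
    s_prev : previous hidden state s_{t-1}, shape (h,)
    p      : dict of parameters U_z, U_r, U_h (d x h) and W_z, W_r, W_h (h x h)
    """
    z = sigmoid(x_t @ p["U_z"] + s_prev @ p["W_z"])        # update gate, eq. (1)
    r = sigmoid(x_t @ p["U_r"] + s_prev @ p["W_r"])        # reset gate, eq. (2)
    h = np.tanh(x_t @ p["U_h"] + (s_prev * r) @ p["W_h"])  # candidate state, eq. (3)
    return (1.0 - z) * h + z * s_prev                      # new hidden state s_t, eq. (4)
```

Running gru_step over the transformer-encoded sequence, carrying s_t forward at each step, yields the fixed-state vector that initializes the decoder.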
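The overall pipeline of Section 3 can also be summarized in code. The following is a hypothetical PyTorch sketch of the transformer-encoder, GRU-encoder, GRU-decoder stack; the hidden size of 300 and the 6 transformer layers follow the paper, while the number of attention heads, the use of teacher forcing in forward(), and all module names are our own assumptions rather than the authors' tensor2tensor implementation.

```python
import torch
import torch.nn as nn

class TranSeqSketch(nn.Module):
    """Illustrative stack: transformer encoder -> GRU encoder -> GRU decoder."""

    def __init__(self, vocab_size, d_model=300, n_layers=6, n_heads=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)  # 6 identical layers
        self.gru_encoder = nn.GRU(d_model, d_model, batch_first=True)  # single-layer, uni-directional
        self.gru_decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.project = nn.Linear(d_model, vocab_size)  # scored with a softmax over the vocabulary V

    def forward(self, src_ids, tgt_ids):
        # 1) Transformer encoder enriches the source representation.
        enriched = self.transformer(self.embed(src_ids))            # (B, n, d)
        # 2) GRU encoder compresses it into a fixed-state vector.
        _, state = self.gru_encoder(enriched)                       # (1, B, d)
        # 3) GRU decoder, initialised with that state, consumes the
        #    previously generated words (teacher forcing at train time).
        dec_out, _ = self.gru_decoder(self.embed(tgt_ids), state)   # (B, m, d)
        return self.project(dec_out)                                # logits over V
```

At inference time the decoder would instead be unrolled one token at a time, greedily or with beam search, as described in the implementation details below.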
50K training examples
MODEL                                  BLEU    METEOR  R-L     EACS    GMS
VAE-SVG-EQ (Gupta et al., 2018)        17.4    22.2    -       -       -
RbM-SL (Li et al., 2018)               35.81   28.12   -       -       -
TRANS (ours)                           35.56   33.89   27.53   79.72   62.91
SEQ (ours)                             34.88   32.10   29.91   78.66   61.45
TRANSEQ (ours)                         37.06   33.73   30.89   80.81   63.63
TRANSEQ + beam (size=6) (ours)         37.12   33.68   30.72   81.03   63.50

100K training examples
MODEL                                  BLEU    METEOR  R-L     EACS    GMS
VAE-SVG-EQ (Gupta et al., 2018)        22.90   25.50   -       -       -
RbM-SL (Li et al., 2018)               43.54   32.84   -       -       -
TRANS (ours)                           37.46   36.04   29.73   80.61   64.81
SEQ (ours)                             36.98   34.71   32.06   79.65   63.49
TRANSEQ (ours)                         38.75   35.84   33.23   81.50   65.52
TRANSEQ + beam (size=6) (ours)         38.77   35.86   33.07   81.64   65.42

150K training examples
MODEL                                  BLEU    METEOR  R-L     EACS    GMS
VAE-SVG-EQ (Gupta et al., 2018)        38.30   33.60   -       -       -
TRANS (ours)                           39.00   38.68   32.05   81.90   65.27
SEQ (ours)                             38.50   36.89   34.35   80.95   64.13
TRANSEQ (ours)                         40.36   38.49   35.84   82.84   65.99
TRANSEQ + beam (size=6) (ours)         39.82   38.48   35.40   82.48   65.54

Table 3: Performance of our model against various models on the QUORA dataset with 50k, 100k and 150k training examples. R-L refers to the ROUGE-L F1 score with 95% confidence interval.
MODEL                                  BLEU    METEOR  R-L     EACS    GMS
Residual LSTM (Prakash et al., 2016)   37.0    27.0    -       -       -
VAE-SVG-EQ (Gupta et al., 2018)        41.7    31.0    -       -       -
TRANS (ours)                           41.8    38.5    33.4    79.6    70.3
SEQ (ours)                             40.7    36.9    35.8    78.9    70.0
TRANSEQ (ours)                         43.4    38.3    37.4    80.5    71.1
TRANSEQ + beam (size=10) (ours)        44.5    40.0    38.4    81.9    71.3

Table 4: Performance of our model against various models on the MSCOCO dataset. R-L refers to the ROUGE-L F1 score with 95% confidence interval.
the tensor2tensor library (Vaswani et al., 2018; https://github.com/tensorflow/tensor2tensor), but set the hidden size to 300. We set dropout to 0.0 and 0.7 for the MSCOCO and QUORA datasets respectively. We used a large dropout for QUORA because the model tends to over-fit to its training set. Both the GRU-RNN encoder and decoder contain 300 hidden units.

We pre-process our datasets ourselves and do not use the pre-processed/tokenized versions of the datasets from the tensor2tensor library. Our target vocabulary is a set of approximately 15,000 words. It contains the words in our target training and test sets that occur at least twice. Using this subset of vocabulary words, as opposed to the over 320,000 vocabulary words contained in GloVe, improves both the training time and the performance of the model.

We train and evaluate our model after each epoch with a fixed learning rate of 0.0005, and stop training when the validation loss does not decrease for 5 epochs. The model learns to minimize the seq2seq loss implemented in the TensorFlow API (https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss) with AdamOptimizer. We use greedy decoding during training and validation and set the maximum number of iterations to 5 times the target sentence length. For testing/inference we use beam-search decoding.

4.3 Datasets

We evaluate our model on two standard datasets for paraphrase generation – QUORA (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) and MSCOCO (Lin et al., 2014) – as described in Gupta et al. (2018), and we use similar settings. The QUORA dataset contains over 120k examples with an 80k/40k split between the training and test sets. As seen in Tables 1 and 2, while the QUORA dataset contains question pairs, MSCOCO contains free-form texts which are human annotations of images. Subjective observation of the MSCOCO dataset reveals that most of its paraphrase pairs contain more novel words as well as more syntactic manipulation than the QUORA pairs, making it a more interesting paraphrase generation corpus. We split the QUORA dataset into 50k, 100k and 150k training samples and 4k testing samples in order to align with the baseline models for comparative purposes.

4.4 Evaluation

For quantitative analysis of our model, we use popular automatic metrics such as BLEU, ROUGE and METEOR. Since BLEU and ROUGE both measure n-gram word overlap, differing mainly in the brevity penalty, we report just the ROUGE-L value. We also use two additional recent metrics – GMS and EACS (Sharma et al., 2017; https://github.com/Maluuba/nlg-eval) – that measure the similarity between the reference and generated paraphrases based on the cosine similarity of their embeddings at the word and sentence levels respectively.
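As a small illustration of the vocabulary construction described above (target-side words occurring at least twice, roughly 15,000 entries), the sketch below shows one way to build such a vocabulary; the special tokens and whitespace tokenization are our own assumptions and are not specified in the paper.

```python
from collections import Counter

def build_target_vocab(target_sentences, min_count=2):
    """Keep target-side words occurring at least `min_count` times,
    plus the special tokens a seq2seq model typically needs."""
    counts = Counter(w for sent in target_sentences for w in sent.lower().split())
    vocab = ["<pad>", "<s>", "</s>", "<unk>"]  # assumed special tokens
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}
```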
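For readers unfamiliar with the embedding-based metrics, the snippet below is an illustrative re-implementation of EACS (embedding average cosine similarity) on top of pre-trained word vectors such as GloVe; the reported scores are computed with the nlg-eval toolkit, and GMS (greedy word-level matching) is not shown here.

```python
import numpy as np

def sentence_embedding(tokens, word_vectors, dim=300):
    """Average the word embeddings of a sentence; out-of-vocabulary words are skipped."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def eacs(reference, generated, word_vectors):
    """Embedding Average Cosine Similarity between a reference and a generated paraphrase."""
    r = sentence_embedding(reference.lower().split(), word_vectors)
    g = sentence_embedding(generated.lower().split(), word_vectors)
    denom = np.linalg.norm(r) * np.linalg.norm(g)
    return float(r @ g / denom) if denom > 0.0 else 0.0
```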
4.5 Result Analysis

Tables 3 and 4 report the scores of our model on both datasets. Our model pushes the benchmark on all evaluation metrics compared against the current published top models evaluated on the same datasets. Since several words can connote similar meanings, it is more logical to evaluate with metrics that match embedding vectors capable of measuring this similarity. Hence we also report GMS and EACS scores as a basis of comparison for future work in this direction.

Besides the quantitative values, Tables 1 and 2 show that our paraphrases are well formed, abstractive (e.g. dumbest – stupidest, dog is wearing – dog poses), capable of syntactic manipulation (e.g. in a soccer uniform kicking a ball – kicking a soccer ball), and capable of compression. Some of our paraphrased sentences are even more concise than the reference, yet remain very meaningful.

5 Related Work

Our baseline models – VAE-SVG-EQ (Gupta et al., 2018) and RbM-SL (Li et al., 2018) – are both deep learning models. While the former uses a variational autoencoder and is capable of generating multiple paraphrases of a given sentence, the latter uses deep reinforcement learning. In tune with part of our approach, i.e. seq2seq, there exist ample models with interesting variants – residual LSTM (Prakash et al., 2016), bi-directional GRU with attention and special decoding tweaks (Cao et al., 2017), and attention from the perspective of semantic parsing (Su and Yan, 2017).

MT has been widely used to generate paraphrases (Quirk et al., 2004; Zhao et al., 2008) due to the availability of large corpora, while much earlier works explored the use of manually drafted rules (Hassan et al., 2007; Kozlowski et al., 2003).

Similar to our model architecture, Chen et al. (2018) combined transformers and RNN-based encoders for MT. Zhao et al. (2018) recently used the transformer model for paraphrasing on different datasets. We experimented with using solely a transformer but got better results with TRANSEQ. To the best of our knowledge, our work is the first to cross-breed the transformer and seq2seq for the task of paraphrase generation.

Acknowledgments

[…] reported in this paper was conducted at the University of Lethbridge and supported by Alberta Innovates and Alberta Education.

References

Marianna Apidianaki, Guillaume Wisniewski, Anne Cocos, and Chris Callison-Burch. 2018. Automated paraphrase lattice creation for HyTER machine translation evaluation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 480–485.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 17–24. Association for Computational Linguistics.

Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint copying and restricted generation for paraphrase. In Thirty-First AAAI Conference on Artificial Intelligence.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86.
[…] answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 875–886.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Samer Hassan, Andras Csomai, Carmen Banea, Ravi Sinha, and Rada Mihalcea. 2007. UNT: SubFinder: Combining knowledge sources for automatic lexical substitution. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 410–413.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.

Raymond Kozlowski, Kathleen F. McCoy, and K. Vijay-Shanker. 2003. Generation of single-sentence paraphrases from predicate/argument structure using lexico-grammatical resources. In Proceedings of the Second International Workshop on Paraphrasing - Volume 16, pages 1–8. Association for Computational Linguistics.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 881–893.

Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual LSTM networks. arXiv preprint arXiv:1610.03098.

Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 142–149.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. arXiv preprint arXiv:1706.09799.

Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang, and Daniel Gildea. 2018. Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 569–574.

Yu Su and Xifeng Yan. 2017. Cross-domain semantic parsing via paraphrasing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1235–1246.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, and Bambang Parmanto. 2018. Integrating transformer and paraphrase rules for sentence simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3164–3173, Brussels, Belgium. Association for Computational Linguistics.

Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu, and Sheng Li. 2008. Combining multiple resources to improve SMT-based paraphrasing model. In Proceedings of ACL-08: HLT, pages 1021–1029.