MutaGAN
Johns Hopkins Applied Physics Laboratory, 11100 Johns Hopkins Rd., Laurel, MD 20723, USA
†Present address: Craig Howser, Nyla Technology Solutions; [email protected].
Abstract
The ability to predict the evolution of a pathogen would significantly improve the ability to control, prevent, and treat disease. Machine learning, however, has yet to be used to predict the evolutionary progeny of a virus. To address this gap, we developed a novel machine learning framework, named MutaGAN, that uses generative adversarial networks with a sequence-to-sequence recurrent neural network generator to accurately predict genetic mutations and the evolution of future biological populations. MutaGAN was trained using a generalized time-reversible phylogenetic model of protein evolution with maximum likelihood tree estimation. MutaGAN was applied to influenza virus sequences because influenza evolves quickly and a large amount of data is publicly available from the National Center for Biotechnology Information's Influenza Virus Resource. MutaGAN generated 'child' sequences from a given 'parent' protein sequence with a median Levenshtein distance of 4.00 amino acids. Additionally, the generator produced sequences containing at least one known mutation identified within the global influenza virus population for 72.8 per cent of parent sequences. These results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting, with implications for broad utility in evolutionary prediction for any protein population.
Keywords: generative adversarial networks; sequence generation; influenza virus; deep learning; evolution.
…genomic sequence sets (Bengio et al. 2003; Bengio, Courville, and Vincent 2013; Mikolov, Yih, and Zweig 2013; Mikolov et al. 2013a,b; Levy and Goldberg 2014). Ng created a word-embedding process for DNA, called dna2vec, which creates vector representations for short substrings of DNA sequences (Ng 2017). The extension of deep neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and stacked autoencoders onto biological sequence data has proven useful for DNA sequence classification (Rizzo et al. 2015; Zhou and Troyanskaya 2015; Quang and Xie 2016) as well as prediction of RNA binding sites (Alipanahi et al. 2015), protein–protein interactions (Sun et al. 2017), and DNA–protein binding (Zeng et al. 2016). Furthermore, deep learning methods have been extended to the problem of protein folding (Spencer, Eickholt, and Cheng 2015) …

…sequence (Sutskever, Vinyals, and Le 2014). There are two parts to this model: an encoder and a decoder, shown in Fig. 1. The encoder E, with encoding dimension f, takes a sequence as input and converts it into a vector of real numbers. This vector is then used as the initial state of the decoder, which constructs the goal sequence. seq2seq models have shown success in translation tasks (Bahdanau, Cho, and Bengio 2014; Luong et al. 2015) and text summarization tasks (Nallapati et al. 2016). For this reason, we viewed the problem of modeling protein evolution from parent to child as a translation problem. The seq2seq model in MutaGAN uses a bidirectional encoder (Schuster and Paliwal 1997), simultaneously evaluating the input sequence forward and backward to produce the optimally encoded vector.
Figure 1. A seq2seq model using a bidirectional LSTM encoder, a unidirectional LSTM decoder, and embedding layers.
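To make this encoder–decoder wiring concrete, the following is a minimal Keras sketch of such a seq2seq model. It is illustrative rather than the authors' code; the layer sizes follow the dimensions reported later in the Model training section, and all identifiers are assumptions.

```python
# Minimal seq2seq sketch: bidirectional LSTM encoder whose final states
# initialize a unidirectional LSTM decoder (illustrative sizes).
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Dense, Concatenate)
from tensorflow.keras.models import Model

VOCAB, EMB, ENC_UNITS, DEC_UNITS = 4500, 250, 128, 256

# Encoder: embed the parent sequence and read it in both directions.
enc_in = Input(shape=(None,), name="parent_tokens")
enc_emb = Embedding(VOCAB, EMB, mask_zero=True)(enc_in)
_, fh, fc, bh, bc = Bidirectional(LSTM(ENC_UNITS, return_state=True))(enc_emb)
state_h = Concatenate()([fh, bh])  # 256-d hidden state
state_c = Concatenate()([fc, bc])  # 256-d cell state; together f = 512

# Decoder: initialized with the encoder states, emits the child sequence.
dec_in = Input(shape=(None,), name="child_tokens_shifted")
dec_emb = Embedding(VOCAB, EMB, mask_zero=True)(dec_in)
dec_out = LSTM(DEC_UNITS, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = Dense(VOCAB, activation="softmax")(dec_out)  # per-position distribution

seq2seq = Model([enc_in, dec_in], probs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```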
Figure 2. The MutaGAN framework's architecture. The generator of MutaGAN is a seq2seq translation deep neural network using LSTMs and embedding layers. The encoding layer uses a bidirectional LSTM. The output of the encoder is combined with a vector of random noise drawn from a normal distribution N(0, 1). The output of the decoder LSTM feeds into a softmax dense layer. An argmax function is then applied to select a single amino acid at each position, rather than a probability distribution. The discriminator uses an encoder with a slightly different structure from the encoder in the generator but with the same weights. This is because an argmax function is not differentiable. Therefore, the first layer of the encoder in the discriminator is a linear dense layer with the same output size as the term embedding layer in the generator, allowing it to take as input the output of the dense layer of the decoder in the generator. The weights of this dense layer are the same as those of the embedding layer, meaning that it produces a linear combination of the embeddings from the embedding layer of the encoder. The discriminator takes in two sequences and determines whether or not they are a real parent–child pair. The pairs that are not real parent–child pairs are either a parent and a generated sequence or two real sequences that are not a parent–child pair.
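The weight tying described in this caption can be sketched as follows. This is an assumed implementation, not the authors' code: because argmax is not differentiable, the discriminator consumes the decoder's softmax output directly through a dense layer whose kernel is the generator's embedding matrix.

```python
# Sketch of the tied input layer described in the caption (assumed names).
# The generator's decoder emits a softmax distribution over the vocabulary at
# each position; mapping it through the shared embedding matrix yields a
# linear combination of embeddings, keeping the whole path differentiable.
import tensorflow as tf

def tied_embedding_dense(soft_tokens, embedding_layer):
    """soft_tokens: (batch, length, vocab) softmax output of the decoder.
    Returns (batch, length, emb_dim): the expected embedding under the
    per-position distribution."""
    emb_matrix = embedding_layer.embeddings  # (vocab, emb_dim), shared weights
    return tf.matmul(soft_tokens, emb_matrix)
```

For a real sequence encoded one-hot, this matrix product reduces exactly to the embedding lookup, so real and generated inputs pass through an equivalent first layer.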
…a major concern has been the introduction of new, more infectious subtypes from animals, e.g. avian or porcine species. This so-called 'species jumping' has caused great concern due to the potential for a global spread, similar to the 1918 Influenza Pandemic (Webster 1999; Palese and Shaw 2007). A recent example is the new H1N1 strain introduced from pigs in Mexico in 2009 (World Health Organization (WHO) 2010; Hensley and Yewdell 2009; Michaelis, Doerr, and Cinatl 2009).

Current vaccine sequence selection is an inexact process based on recent field surveillance data to produce the seed stock for the next year. This approach has resulted in frequent mismatches between vaccine antigens and circulating virus. Another delay in vaccine production is caused by testing candidate vaccine …

…from training data focused on identifying a single, most likely clade emerging (Neher, Russell, and Shraiman 2014; Neher et al. 2016). This method is not capable of projecting mutations at multiple locations. While understanding clade emergence is important, this approach does not provide insight into the likelihoods of individual site mutations and, importantly, cannot be applied to intrahost evolution. To provide a clearer picture of the different possible evolutionary trajectories, we developed a novel machine learning framework that uses broad historical training data to predict all likely mutations in an influenza virus genome segment sequence.

Contributions
In this paper, we present MutaGAN, a novel deep learning frame…
…the weights for that in the two encoders were automatically shared once they were loaded into the parent encoder. The embedding layer in the model had a dimension of 250 and allowed for 4,500 tokens; unknown tokens are given the value '[UNK]'. Generating sequences from inputs that contained unknown tokens was not a problem, as the overlapping sliding window provided coverage. The bidirectional LSTM encoder had 128 units, making the encoding dimension 𝑓 = 512 (both cell and hidden states for the forward and backward LSTMs), and the LSTM decoder had 256 units. The discriminator consisted of the concatenated hidden and cell states of the LSTM encoder and the modified LSTM encoder, as described earlier. This was fed into fully connected layers of three sets of dropout (20 per cent), batch normalization, and a dense layer. The first two dense layers had 128 and …

Dataset creation
After rooting the final ML tree to isolate 'influenza A virus A/Hong Kong/1/68(H3N2)', marginal ancestral sequence reconstruction was performed with RAxML using the General Time Reversible model of nucleotide substitution with the gamma model of rate heterogeneity. Parent–child relationships were generated using the Bio.Phylo package in BioPython (Cock et al. 2009) and were limited to single steps between phylogenetic tree levels such that each parent had exactly two children. One parent–child pair was generated for each of the 13,678 edges within the final binary tree. Because phylogenetic tree generation requires removal of duplicate nucleotide sequences prior to evolutionary modeling, there was a concern of providing an information bias of evolution towards the ancestral sequences (i.e. internal nodes) and away …
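A sketch of how such single-step parent–child pairs can be extracted with Bio.Phylo is shown below. The file name and the `seqs` mapping from node names to reconstructed sequences are assumptions for illustration, not the authors' code.

```python
# Sketch: collecting single-step parent-child pairs from a rooted tree with
# Bio.Phylo. Assumes `seqs` maps node names to the leaf and RAxML-reconstructed
# ancestral amino acid sequences.
from Bio import Phylo

tree = Phylo.read("h3n2_ml_tree.nwk", "newick")  # hypothetical file name

pairs = []
for parent in tree.find_clades():
    for child in parent.clades:  # direct descendants only: one tree level
        pairs.append((seqs[parent.name], seqs[child.name]))

# In a strictly binary tree, each internal node contributes two pairs,
# one per outgoing edge, giving one parent-child pair per edge overall.
```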
…1,807 (55.6 per cent) of sequences where the parent and child were not the same. There were 561 unique parent sequences, and 287 (51.2 per cent) had only one child sequence.

Generator evaluation
The most important metric for assessing the quality of the generated sequences is whether they were able to produce the mutations observed in the data. However, missing an observed mutation does not necessarily mean that the generator did not correctly predict possible mutations, only that it did not correctly predict all observed mutations. It is possible that predicted mutations would likely have occurred but simply were not included in our subset.

We identified mutations in our sequences by measuring the Levenshtein distance between parent and child sequences. By using the Levenshtein distance, we were able to account for insertions and deletions as well as mutations (Levenshtein 1966), and by using the diff-match-patch package from Google, we were able to identify where changes were made between two sequences (Fraser et al. 2018). The diff_cleanupSemantic function in the diff-match-patch package was used to identify where changes were made between parents, real children, and generated children. A list of all child mutations was created for each parent and compared to each parent's generated children. A mutation was counted as correct if a change occurred in the generated child that was identical in amino acid and location to one observed in the parent's real children. A partially correct mutation was defined as an amino acid change in the generated child that was identical in location to any mutation in a real child of that parent but differed in amino acid type. False mutations were defined as mutations that were predicted but non-existent in the real children, and missed mutations were defined as mutations observed in at least one of the real children of the parent but not in the generated child.

In addition to the Levenshtein distance, we use four metrics for evaluating the performance of the generator: known mutation location rate, amino acid mutation frequency, true-positive rate, and weighted true-positive rate.

…least one known mutation. This mutation must be a known mutation for that parent sequence, not any random parent. We refer to this as the known mutation generation rate. This can be extended to mutations that are in the correct location but an incorrect change; this is the known mutation location rate. It is calculated as follows: the 𝑖th parent sequence has known child sequences 𝐶𝑖 = {𝑐0, 𝑐1, …} and a set of known mutations ℳ𝑖 = {𝑚0, 𝑚1, …}. For a given input parent sequence used to generate 𝑘 potential sequences, we can create a set of potential mutations ℳ′𝑖 = {𝑚′0, 𝑚′1, …}, and the rate is

$$\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\mathcal{M}_i \cap \mathcal{M}'_i \neq \emptyset\right] \qquad (1)$$

where 𝑁 is the total number of parent sequences.

Amino acid mutation frequency
Frequency of amino acid mutations was calculated to evaluate the similarity of the mutation profiles between MutaGAN and the ground truth data (Equation 2). For each mutation within a given set of mutations, 𝑎(𝑝) is the amino acid of the parent, 𝑎(𝑐) is the amino acid of the child, and 𝑥𝑎(𝑝)𝑎(𝑐) is the count of all mutations observed from one amino acid to another. The amino acid mutation frequency is the value 𝑥𝑎(𝑝)𝑎(𝑐) divided by the total number of recorded mutations. This provides information on the frequency of mutating one amino acid to another:

$$f_{a(p)a(c)} = \frac{x_{a(p)a(c)}}{\sum_{a(p),a(c)} x_{a(p)a(c)}} \qquad (2)$$

True-positive rate
There are two ways of evaluating whether the generated sequences contain the known mutations. The first is to consider all the mutations for a known parent and determine whether the generated sequences contain those mutations. This can be measured similarly to the standard true-positive rate formula, $\mathrm{TPR} = TP/(TP + FN)$, and thus we will refer to it as the standard true-positive rate. …
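The mutation-extraction and tallying steps above can be sketched in a few lines. The diff_main and diff_cleanupSemantic calls are the diff-match-patch package's actual API; the substitution-pairing logic and all other names are illustrative assumptions, not the authors' code.

```python
# Sketch: locating amino acid changes between parent and child with
# diff-match-patch, then tallying the Equation 2 mutation-frequency profile.
from collections import Counter
from diff_match_patch import diff_match_patch

def mutations(parent, child):
    """Return ('sub'|'del'|'ins', parent_position, detail) for each change."""
    dmp = diff_match_patch()
    diffs = dmp.diff_main(parent, child)
    dmp.diff_cleanupSemantic(diffs)   # merge noisy single-character edits
    muts, pos = [], 0                 # pos indexes into the parent sequence
    for op, text in diffs:
        if op == 0:                   # unchanged run
            pos += len(text)
        elif op == -1:                # present in parent, absent in child
            muts.append(("del", pos, text))
            pos += len(text)
        else:                         # op == +1: present in child only
            last = muts[-1] if muts else None
            if last and last[0] == "del" and last[1] + len(last[2]) == pos:
                # a deletion immediately followed by an insertion is a substitution
                muts[-1] = ("sub", last[1], (last[2], text))
            else:
                muts.append(("ins", pos, text))
    return muts

def mutation_profile(subs):
    """subs: iterable of (parent_aa, child_aa) pairs; returns Equation 2 values."""
    counts = Counter(subs)            # x_{a(p)a(c)}
    total = sum(counts.values())      # total number of recorded mutations
    return {pair: x / total for pair, x in counts.items()}

print(mutations("ACFKLM", "ACFHLM"))  # [('sub', 3, ('K', 'H'))]: K -> H at position 4 (1-indexed)
```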
This is useful in determining how well the variety of mutations is reflected in the generated sequences. However, it penalizes sequences with multiple known mutations unless those generated sequences contain all known mutations, which is not something we want: we want these mutations distributed across sequences in proportion to how they would appear in the sequencing data. Take, for example, a parent sequence ACFKLM that has two children, ACFHLM and ACFHIM. If MutaGAN generates 100 sequences, and fifty of them are exact matches to Child 1 and fifty of them are exact matches to Child 2, the true-positive rate would be 50.0 per cent because only half of the known mutations appear across all generated sequences. If all generated sequences were ACFHIM, the true-positive rate would be 100.0 per cent because all known mutations appear in all children. However, we want our gener…

…2015) on four GeForce GTX 1080 Ti graphical processing units. Additionally, metrics were calculated using the functions in the scikit-learn package, version 0.24.1 (Pedregosa et al. 2011), the diff-match-patch package (Fraser et al. 2018), and the natural language toolkit package (Bird, Klein, and Loper 2009).

Model training
The maximum number of words included in the embedding layer was 4,500, which was selected by rounding up the number of unique 3-mers found in our dataset. Additionally, we selected an embedding size of 250. The encoder portion was a bidirectional LSTM with 128 nodes, resulting in a state vector of 512 being passed to the decoder, which was a unidirectional LSTM. Generator pretraining was performed on the training dataset and tested …
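The 3-mer vocabulary above implies an overlapping sliding-window tokenization of each amino acid sequence. A minimal sketch follows; the stride of 1 and all identifiers are assumptions for illustration.

```python
# Sketch: overlapping 3-mer tokenization with a capped vocabulary and an
# '[UNK]' fallback, as described above (stride of 1 assumed).
def kmers(seq, k=3):
    """ACFKLM -> ['ACF', 'CFK', 'FKL', 'KLM']"""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

UNK = "[UNK]"
vocab = {UNK: 0}                                 # capped at 4,500 tokens

def encode(seq, max_tokens=4500):
    ids = []
    for tok in kmers(seq):
        if tok not in vocab and len(vocab) < max_tokens:
            vocab[tok] = len(vocab)              # grow vocabulary while room remains
        ids.append(vocab.get(tok, vocab[UNK]))   # out-of-vocabulary -> [UNK]
    return ids

print(encode("ACFKLM"))  # e.g. [1, 2, 3, 4] on a fresh vocabulary
```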
Table 1. The results and metrics of the MutaGAN model and the baseline model on the validation data.

Metric                              MutaGAN on the             Baseline model on the
                                    validation dataset         validation dataset
Known mutation generation rate      72.8%                      44.1%
Known mutation location rate        77.8%                      77.1%
Median Levenshtein distance
  Generated vs child                4.00 (𝜇 = 4.83, 𝜎 = 4.12)  1.00 (𝜇 = 1.62, 𝜎 = 1.10)
  Generated vs parent               4.00 (𝜇 = 5.10, 𝜎 = …)     1.00 (𝜇 = 1.69, 𝜎 = …)

…of {−2, −1, 0, 1, 2} to perturb that location to provide some additional variability. For the distribution of the amino acid change, 𝑝aa, we used the distribution of amino acid changes in the set of known mutations in the training data. Therefore, the process of generating a new sequence using the baseline method is as follows:

1. 𝑛 ∼ 𝑝num
2. 𝑙 ∼ 𝑝loc until there are 𝑛 unique locations 𝑙
3. 𝑦̂𝑖 ∼ 𝑝aa(𝑥𝑖) for the amino acid 𝑥𝑖 at location 𝑙𝑖, for 𝑖 = 1, …, 𝑛.

As with MutaGAN, 100 sequences were generated for each parent in the validation dataset. The results for the baseline model …
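A minimal sketch of this baseline sampler follows, as an illustration rather than the authors' code. It assumes 𝑝num, 𝑝loc, and 𝑝aa have been estimated from the training data's known mutations and are passed in as plain probability dictionaries.

```python
# Sketch of the baseline sampler described above. p_num, p_loc, and p_aa are
# assumed empirical distributions estimated from the training mutations.
import numpy as np

rng = np.random.default_rng()

def sample(dist):
    """dist: dict mapping outcome -> probability; draws one outcome."""
    outcomes = list(dist)
    return outcomes[rng.choice(len(outcomes), p=np.array(list(dist.values())))]

def baseline_child(parent, p_num, p_loc, p_aa):
    n = sample(p_num)                      # 1. n ~ p_num: number of mutations
    locs = set()
    while len(locs) < n:                   # 2. l ~ p_loc until n unique locations
        l = sample(p_loc) + rng.choice([-2, -1, 0, 1, 2])  # perturb the location
        if 0 <= l < len(parent):
            locs.add(l)
    child = list(parent)
    for l in locs:                         # 3. y_i ~ p_aa(x_i) at each location
        child[l] = sample(p_aa[parent[l]]) # replacement conditioned on parent residue
    return "".join(child)
```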
…acids. To assess whether MutaGAN's generated amino acid mutation profile more closely resembles the training set or the validation set, the average difference in amino acid mutation frequency was calculated for the 'Generated vs Training' and 'Generated vs Validation' delta mutation profiles (Fig. 4B). This measure of distance was calculated to be 3.2 × 10⁻³ and 2.0 × 10⁻³, respectively. Importantly, the MutaGAN-generated mutation profile shows changes in mutation frequency for specific amino acids that more closely resemble the validation set than the training set (Fig. 4A). In particular, there are higher proportions of threonine (T)→lysine (K), threonine (T)→isoleucine (I), arginine (R)→lysine (K), and glycine (G)→aspartic acid (D) mutations and lower proportions of alanine (A)→valine (V), glycine (G)→arginine (R), and alanine (A)→threonine (T) within the ground truth validation data as compared to the ground truth training data. For these same amino acids, MutaGAN's mutational profile shows the same trends.

The results in Table 1 indicate that the MutaGAN model and the baseline model were able to comparably generate mutations at sites known for mutations in the validation data, with the known mutation location rate of the models being 77.8 per cent and 77.1 per cent, respectively. However, the known mutation generation rate for MutaGAN was 72.8 per cent vs the baseline rate of 44.1 per cent. Additionally, the true-positive rate (TPR) scores are significantly higher for MutaGAN. While the median Levenshtein distance and the average difference in amino acid mutation frequency were lower for the baseline model, this is expected. It also indicates that MutaGAN, while capturing a similar profile of mutation locations, is capable of making mutations outside the historical distributions that match the validation data.

The most prominent amino acid mutations made by MutaGAN that were not seen frequently in either the training or validation data are glutamine (Q)→arginine (R), asparagine (N)→aspartate (D), asparagine (N)→serine (S), and histidine (H)→asparagine (N) (Fig. 4B). Interestingly, arginine is the second most favorable amino acid mutation from glutamine, behind glutamate (Barnes and Gray 2003). Aspartate and serine are the most favorable amino acid mutations from asparagine alongside histidine, and asparagine is the second most favorable mutation from histidine, behind tyrosine. Another frequently incorrect MutaGAN mutation of note is serine (S)→proline (P). Serine, when present on a protein's surface, often forms hydrogen bonds with the protein's backbone and effectively mimics proline (Barnes and Gray 2003). In accordance, the four locations at which MutaGAN incorrectly mutated the HA protein from a serine to a proline were amino acid positions 143, 198, 199, and 227 within the HA1 chain, all of which are located on the protein's surface.

As an artifact of the phylogenetic analysis, a small but noticeable portion of child sequences within the training, test, and validation datasets contained the ambiguous amino acid symbol 'X' at some location within their sequences. The appearance of 'X' in a child sequence created the appearance that a parent amino acid could mutate to ambiguity. However, MutaGAN never mutated a parent amino acid to ambiguity (Fig. 4A). This is likely because, of all the amino acids in the training dataset, only 3.35 × 10⁻³ per cent were 'X', meaning there were too few examples of 'X' for the model to learn it.

The overall mutation location profile of historical H3N2 influenza virus HA proteins was well reproduced by MutaGAN (Fig. 5). Supplementary Figure S3 shows the same plot on the test data for comparison purposes. The most highly variable regions identified in the training and validation datasets (HA1 amino acid indices 120–160 and 185–228) were also the regions most mutated by MutaGAN. Regions of lesser, but still significant, variability were also identified by MutaGAN in accordance with the historical H3N2 data observed in the training and validation sets of Fig. 5, such as HA1 residue regions (i.e. amino acid indices) 45–59, 259–262, and 273–278. Regions of historical conservation were accurately preserved by MutaGAN, most notably
the HA1 residue region of 11–24 and the HA2 residue region of 1–16. Of the top ten most frequently mutating positions in the training dataset, MutaGAN had only one within its own top ten (Position 121). Between the validation set and the MutaGAN-generated set, there were no overlaps in the top ten most frequently mutated positions. However, for many of the most frequently mutating amino acid locations within the training and validation sets, the location was one or two positions away from a commonly mutated position in the MutaGAN-generated sequences. For instance, HA1 residues 142, 160, and 193 were in the top ten most frequently mutated positions in the training and validation sets, and HA1 residues 145, 159, and 192 were in the top ten most frequently mutated positions by MutaGAN. This phenomenon is worth noting because of the closeness, but the biological significance is not readily …

…increases the influenza virus's likelihood of evading host antibodies during infection. The majority of the MutaGAN-generated mutations also occur in these same regions but across a notably larger number of residues on the protein surface. This finding alludes to the ability of this framework to illuminate localized function across varying regions of the overall protein structure, but further simulations must be performed to investigate the functional effects of the MutaGAN-generated mutations.

We can also examine whether the model is simply repeating the same mutations or whether it is selecting mutations based on the input sequence. We can determine this by comparing the counts of mutations for generated sequences, shown in the bottom graph of Fig. 5, which uses sequences from 2018 to 2019, and Supplementary Fig. S3, which uses data similar to the training …
Figure 5. Amino acid mutation profiles with respect to HA protein locations. For the training, validation, and generated child sequences, total counts of mutations observed across the entire length of the HA protein segment are displayed in (A), indicating the signal peptide, HA1 (head), and HA2 (stalk) regions of the full HA protein. The most highly variable regions are highlighted in salmon (the third and fourth highlighted regions). Regions of lesser, but still significant, variability are highlighted in yellow (the second, fifth, and sixth highlighted regions). Particularly conserved regions are highlighted in blue (the first and last highlighted regions). In (B), a diagram of the H3 HA structure (PDB: 4GMS) is colored by this mutation frequency, from the positions with the fewest mutations in yellow to the positions with the most mutations in brown. Positions with zero observed mutations across each dataset are colored gray. Residues are displayed as spheres for positions with mutation frequencies above 30 per cent of the maximum position for each of the three datasets. These 30 per cent threshold lines are also plotted in (A).
Discussion

Accurate mutation forecasting from protein sequences
The MutaGAN framework presented here is the first method to utilize a GAN to accurately reproduce and optimize full-length proteins above 300 amino acids in length with no structural information provided to the model beyond the amino acid sequence. The accurate reproduction of the mutation profiles of the HA protein, with specificity in both the types of amino acids changed (Fig. 4) and the locations most likely for persistence (Fig. 5), demonstrates the potential of this method to be used as a tool for forecasting the genetic drift or shift that occurs during an influenza virus outbreak. Our findings indicate that the sequence augmentation strategies deployed by MutaGAN optimize its input toward …
…88.4 per cent prediction of false positives highlight shortcomings of the current trained model. We hypothesized that the model could be improved solely by improving the dataset. There is a high likelihood that the inclusion of swine and avian influenza sequences in the phylogenetic model inhibited MutaGAN's ability to fit itself to patterns of HA protein evolution specific to human infection. In addition to a more comprehensive curation of outliers, a larger population of HA protein sequences could be utilized to provide additional diversity to the model. The small number of database records (6,840) used to generate the phylogenetic model was unlikely to capture the full breadth and depth of the true evolutionary landscape of the human H3N2 influenza virus HA protein. By leveraging a larger database of influenza virus surveillance, such as the Global Initiative on Shar…

…populations. As a result, we see the potential for this research to play a significant role in public health, particularly in disease mitigation and prevention. With the improvements outlined earlier, if MutaGAN were implemented to simulate how currently circulating pathogens could evolve over time, targeted measures of quarantine and treatment could be deployed more effectively. MutaGAN's ability to produce full-length protein sequences while simultaneously learning the nuances of evolution lends itself well to extension to other protein types, creating potential for impact within the domain of multiple diseases.

Data availability
Influenza virus HA sequences were downloaded from the NCBI …
References

Bedford, T., Rambaut, A., and Pascual, M. (2012) 'Canalization of the Evolutionary Trajectory of the Human Influenza Virus', BMC Biology, 10: 1–12.
Bengio, Y. et al. (2003) 'A Neural Probabilistic Language Model', Journal of Machine Learning Research, 3: 1137–55.
Bengio, Y., Courville, A., and Vincent, P. (2013) 'Representation Learning: A Review and New Perspectives', IEEE Transactions on Pattern Analysis and Machine Intelligence, 35: 1798–828.
Bepler, T., and Berger, B. (2019) 'Learning Protein Sequence Embeddings Using Information from Structure', arXiv preprint arXiv:1902.08661.
Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
Katoh, K., and Standley, D. M. (2013) 'MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability', Molecular Biology and Evolution, 30: 772–80.
Kawaoka, Y., Krauss, S., and Webster, R. G. (1989) 'Avian-to-Human Transmission of the PB1 Gene of Influenza A Viruses in the 1957 and 1968 Pandemics', Journal of Virology, 63: 4603–8.
Keneshloo, Y. et al. (2019) 'Deep Reinforcement Learning for Sequence-to-Sequence Models', IEEE Transactions on Neural Networks and Learning Systems, 31: 2469–89.
Killoran, N. et al. (2017) 'Generating and Designing DNA with Deep Generative Models', arXiv preprint arXiv:1712.06148.
Kingma, D. P., and Ba, J. (2014) 'Adam: A Method for Stochastic Optimization', arXiv preprint arXiv:1412.6980.
Kosik, I., and Yewdell, J. W. (2019) 'Influenza Hemagglutinin and Neuraminidase …'
Mikolov, T., Yih, W., and Zweig, G. (2013) 'Linguistic Regularities in Continuous Space Word Representations', in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–51.
Mirza, M., and Osindero, S. (2014) 'Conditional Generative Adversarial Nets', arXiv preprint arXiv:1411.1784.
Morris, D. H. et al. (2018) 'Predictive Modeling of Influenza Shows the Promise of Applied Evolutionary Biology', Trends in Microbiology, 26: 102–18.
Mustonen, V., and Lässig, M. (2009) 'From Fitness Landscapes to Seascapes: Non-Equilibrium Dynamics of Selection and Adaptation', Trends in Genetics, 25: 111–9.
Nallapati, R. et al. (2016) 'Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond', arXiv preprint …
Spencer, M., Eickholt, J., and Cheng, J. (2015) 'A Deep Learning Network Approach to Ab Initio Protein Secondary Structure Prediction', IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12: 103–12.
Stamatakis, A. (2014) 'RAxML Version 8: A Tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies', Bioinformatics, 30: 1312–3.
Sun, T. et al. (2017) 'Sequence-Based Prediction of Protein–Protein Interaction Using a Deep-Learning Algorithm', BMC Bioinformatics, 18: 1–8.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014) 'Sequence to Sequence Learning with Neural Networks', Advances in Neural Information Processing Systems, 27: 3104–12.
Tenforde, M. W. et al. (2021) 'Effect of Antigenic Drift on Influenza Vaccine Effectiveness in the United States—2019–2020', Clinical Infectious Diseases, 73: e4244–50.
Zeng, H. et al. (2016) 'Convolutional Neural Network Architectures for Predicting DNA–Protein Binding', Bioinformatics, 32: i121–7.
Zhang, H. et al. (2019) 'Self-Attention Generative Adversarial Networks', in International Conference on Machine Learning, pp. 7354–63.
Zhang, Z. et al. (2018) 'Bidirectional Generative Adversarial Networks for Neural Machine Translation', in Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 190–9.
Zhou, J., and Troyanskaya, O. G. (2015) 'Predicting Effects of Noncoding Variants with Deep Learning–based Sequence Model', Nature Methods, 12: 931–4.