0% found this document useful (0 votes)
29 views15 pages

Muta GAN

Uploaded by

vijay chandar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views15 pages

Muta GAN

Uploaded by

vijay chandar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

MutaGAN: A sequence-to-sequence GAN framework to

predict mutations of evolving protein populations


Daniel S. Berman,*,** Craig Howser,† Thomas Mehoke,‡,†† Amanda W. Ernlund, and Jared D. Evans§

Johns Hopkins Applied Physics Laboratory, 11100 Johns Hopkins Rd., Laurel, MD 20723, USA

Present address: Craig Howser, Nyla Technology Solutions; [email protected].

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024



Present address: Thomas Mehoke, NexLeaf Analytics; [email protected].
§
Present address: Jared D. Evans, PhD, Department of Pathology and Microbiology, University of Nebraska Medical Center; email: [email protected];
phone: 402-552-2846.
**
https://orcid.org/0000-0002-9176-7535
††
https://orcid.org/0000-0001-6607-8925
*Corresponding author: E-mail: [email protected]

Abstract
The ability to predict the evolution of a pathogen would significantly improve the ability to control, prevent, and treat disease. Machine
learning, however, is yet to be used to predict the evolutionary progeny of a virus. To address this gap, we developed a novel machine
learning framework, named MutaGAN, using generative adversarial networks with sequence-to-sequence, recurrent neural networks
generator to accurately predict genetic mutations and evolution of future biological populations. MutaGAN was trained using a gen-
eralized time-reversible phylogenetic model of protein evolution with maximum likelihood tree estimation. MutaGAN was applied to
influenza virus sequences because influenza evolves quickly and there is a large amount of publicly available data from the National
Center for Biotechnology Information’s Influenza Virus Resource. MutaGAN generated ‘child’ sequences from a given ‘parent’ protein
sequence with a median Levenshtein distance of 4.00 amino acids. Additionally, the generator was able to generate sequences that
contained at least one known mutation identified within the global influenza virus population for 72.8 per cent of parent sequences.
These results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in
evolutionary prediction for any protein population.

Keywords: generative adversarial networks; sequence generation; Influenza virus; deep learning; evolution.

Introduction the evolution of a biological organism would significantly improve


our ability to prevent and treat disease. The knowledge of how
Biological evolution mainly manifests itself through seemingly
an organism will evolve would allow us to develop more precise
random mutations that occur during genome replication. When
interventions and preventive measures in advance and to pre-
this change improves organismal fitness, the probability the muta-
vent outbreaks or combat invasive species. Deep learning has
tion is passed on to future generations is increased. Virus replica-
led to performance breakthroughs in a number of applications
tion is inherently error-prone, and only mutations that maintain
but is yet to contribute to predicting mutations and evolution
the ability to infect hosts and evade the host immune system
of biological populations. We viewed this problem of predict-
are inherited by subsequent generations. Because these mutations
ing mutations as analogous to some natural language process-
occur seemingly randomly in the genetic sequence that codes for
ing (NLP) tasks, like translation and text generation, for which
these proteins, it is difficult to predict which strains will emerge
deep learning has proven successful, making it a great model
and become predominant.
candidate.
Although it is not currently possible to capture all variables
that give rise to traits across a population, modeling the appear-
ance and persistence of different mutations over time can serve Deep learning and biological sequences
as a proxy for understanding environmental pressures (Frank and New methods in data science have been applied to biolog-
Slatkin 1990; Kussell and Leibler 2005; Wolf, Vazirani, and Arkin ical sequences for purposes of unsupervised characterization
2005; Mustonen and Lassig ̈ 2009). Subsequently, if an accurate and supervised classification tasks. Deep learning is a natu-
model can be created, changes that occur in future populations ral candidate for these efforts due to an exceptional ability to
can be predicted (Bedford, Rambaut, and Pascual 2012; Neher, abstract higher-order structures from high-resolution and com-
Russell, and Shraiman 2014; Neher et al. 2016). Tools to predict plex datasets. Previous work has applied NLP techniques to

© The Author(s) 2023. Published by Oxford University Press.


This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
2 Virus Evolution

genomic sequence sets (Bengio et al. 2003; Bengio, Courville, and sequence (Sutskever, Vinyals, and Le 2014). There are two parts
Vincent 2013; Mikolov, Yih, and Zweig 2013; Mikolov et al. 2013a,b; to this model: an encoder and a decoder, shown in Fig. 1. The
Levy and Goldberg 2014). Ng created a word embeddings process encoder E, with encoding dimension f, takes as input a sequence
for DNA, called dna2vec, which creates vector representations for and converts it into a vector of real numbers. This vector is then
short substrings of DNA sequences (Ng 2017). The extension of used as the initial state of the decoder, which constructs the
deep neural network architectures such as convolutional neural goal sequence. seq2seq models have shown success in translation
networks (CNNs), recurrent neural networks (RNNs), and stacked tasks (Bahdanau, Cho, and Bengio 2014; Luong et al. 2015) and
autoencoders onto biological sequence data has proven useful text summarization tasks (Nallapati et al. 2016). For this reason,
for DNA sequence classification (Rizzo et al. 2015; Zhou and we viewed the problem of modeling protein evolution from parent
Troyanskaya 2015; Quang and Xie 2016) as well as prediction of to child as a translation problem. The seq2seq model in MutaGAN
RNA binding sites (Alipanahi et al. 2015), protein–protein inter- uses a bidirectional encoder (Schuster and Paliwal 1997), simulta-
actions (Sun et al. 2017), and DNA–protein binding (Zeng et al. neously evaluating the input sequence forward and backward to
2016). Furthermore, deep learning methods have been extended produce the optimally encoded vector.
to the problem of protein-folding (Spencer, Eickholt, and Cheng

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


2014; Asgari and Mofrad 2015) to predict molecular characteris- GANs
tics like secondary structure (Wang et al. 2016), backbone angle A GAN consists of two neural networks, a generator G and a dis-
and solvent accessibility surface areas (Heffernan et al. 2015), and criminator D, that compete in a zero-sum game of the generator
other details about proteins (Bepler and Berger 2019). trying to fool the discriminator and the discriminator trying to dis-
In 2014, Goodfellow et al. developed a technique for training tinguish real examples from generated examples. The traditional
generative models called generative adversarial networks (GANs) methodology for training a GAN alternates between training the
(Goodfellow et al. 2020), followed by the development of condi- discriminator and freezing the weights in the discriminator and
tional GANs in the same year (Mirza and Osindero 2014). GANs training the GAN to generate sequences that the discriminator
have seen the greatest success in image generation (Reed et al. thinks are real. Typically, a GAN is trained to turn random noise
2016; Isola et al. 2017; Ledig et al. 2017; Ma et al. 2017) and have into an output matching a known distribution. However, by con-
also been used to generate text (Yu et al. 2017; Zhang et al. 2018; ditioning the output of a GAN on a partially structured input,
Keneshloo et al. 2019; Tuan and Lee 2019). GANs have also been in addition to random noise, we can implement a conditional
extended to bioengineering applications, where they were imple- GAN (Mirza and Osindero 2014). The conditional GAN used in this
mented in conjunction with CNNs to optimize DNA for microarray paper is shown in Fig. 2. In the context of this work, our partially
probe design (Killoran et al. 2017), protein sequences for discovery structured input was the parent protein sequence, which, once
of novel enzymes (Repecka et al. 2021), as well as implemented encoded, was combined with a random vector of noise. Because
with an RNN for gene sequence optimization for antimicrobial mutations are inherently stochastic, we identified a conditional
peptide production (Gupta and Zou 2018), all from random noise. GAN framework as the ideal model candidate for the use of a
Additionally, a CNN-based GAN was used to predict the most prob- seq2seq model to generate numerous mutations given a single
able folding of protein sequences given amino acid sequence and parent protein.
pairwise distances between α-carbons on the protein backbone
(Anand and Huang 2018). In all of these cases, sequence length Influenza
ranged between 50 and 300 amino acids. However, none of these Influenza virus is an important human pathogen, causing signifi-
used an RNN conditional GAN to model the natural evolution of a cant annual morbidity and economic burden globally. In the USA,
biological sequence. influenza contributes to over 30,000 deaths each year (Thomp-
son et al. 2003, 2004; O’Brien et al. 2004). Human influenza A
Sequence-to-sequence model viruses are named based on the geographic location where the
The specific deep learning architecture used to enable high- virus was isolated, the date of the isolate, and the identity of
performance encoded representations of sequences is known the two major surface proteins, hemagglutinin (HA) and neu-
as a sequence-to-sequence (seq2seq) model (Hochreiter and raminidase (NA) (WHO 1980). There are eighteen distinct antigenic
Schmidhuber 1997). A seq2seq model is a type of neural machine subtypes of HA (H1–18) and eleven distinct antigenic subtypes of
translation algorithm that uses at least two RNNs, like long short- NA (N1–11), with only H1N1 and H3N2 currently circulating in the
term memory (LSTMs) (Sutskever, Vinyals, and Le 2014), that human population (Kosik and Yewdell 2019; CDC 2022). While few
take as input a sequence with the goal of constructing a new influenza subtypes are circulating within the human population,

Figure 1. A seq2seq model using two bidirectional LSTM encoder and a unidirectional LSTM decoder and embedding layers.
D. S. Berman et al. 3

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024

Figure 2. The MutaGAN framework’s architecture. The generator of the MutaGAN is a seq2seq translation deep neural network using LSTMs and
embedding layers. The encoding layer uses a bidirectional LSTM. The output of the encoder is combined with a vector of random noise from a normal
distribution N(0,1). The output of the decoder LSTM feeds into a softmax dense layer. An argmax function is then applied to select a single amino acid
at each position, rather than a probability distribution. The discriminator uses an encoder with a slightly different structure from the encoder in the
generator, but it uses the same weights. This is because an argmax function is not differentiable. Therefore, the first layer of the encoder in the
discriminator is a linear dense layer with the same output size as the term embedding layer in the generator. This allows it to take as input, the output
of the dense layer of the decoder in the generator. The weights of this dense layer are the same as those of the embedding layer, meaning that it
produces a linear combination of the embeddings from the embedding layer of encoder. The discriminator takes in two sequences and determines
whether the input sequences are a real parent–child pair or if they are not. The sequences that are not real parent–child pairs are a parent and
generated sequences and two real sequences that are not parent–child pairs.
4 Virus Evolution

a major concern has been the introduction of new, more infec- from training data focused on identifying a single, most likely
tious subtypes from animals, e.g. avian or porcine species. This clade emerging (Neher, Russell, and Shraiman 2014; Neher et al.
so-called ‘species jumping’ has caused great concern due to the 2016). This method is not capable of projecting mutations at multi-
potential for a global spread, similar to the 1918 Influenza Pan- ple locations. While understanding clade emergence is important,
demic (Webster 1999; Palese and Shaw 2007). A recent example this approach does not provide insight into the likelihoods of
highlighting this is the introduction of a new H1N1 strain in 2009 individual site mutations and, importantly, cannot be applied to
that was introduced from pigs in Mexico (World Health Organiza- intrahost evolution. To provide a clearer picture of the different
tion (WHO) 2010; Hensley and Yewdell 2009; Michaelis, Doerr, and possible evolutionary trajectories, we developed a novel machine
Cinatl 2009). learning framework that uses broad historical training data to pre-
Current vaccine sequence selection is an inexact process based dict all likely mutations in an influenza virus genome segment
on recent field surveillance data to produce the seed stock for sequence.
the next year. This approach has resulted in frequent mismatch
between vaccine antigens and circulating virus. Another delay Contributions
in vaccine production is caused by testing candidate vaccine In this paper, we present MutaGAN, a novel deep learning frame-

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


sera against circulating strains. This antigenic characterization of work that utilizes GANs and seq2seq models to learn a general-
influenza virus through serological interrogation with antibody- ized time-reversible evolutionary model. We do this by building
containing sera is crucial for virus titration, identifying new anti- a model capable of generating mutations for a given input par-
genic variants, vaccine strain selection, and epidemiologic studies. ent sequence, replicating key aspects of the phylogenetic tree.
However, they are slow and costly and only address a small subset We then demonstrate its capability of accurately modeling the
of potential variant viruses. Influenza virus presents significant mutations observed in phylogenetic data of the H3N2 influenza
challenges due to high virus polymerase error rate and reas- A virus HA protein. This process is the first deep learning model
sortment of the segmented genome that result in a constantly that attempts to model and predict the evolution of a protein with
changing protein landscape with antigenic drift and antigenic minimal human input and no human supervision.
shift, respectively (Kawaoka, Krauss, and Webster 1989; Hensley
et al. 2009; Medina and García-Sastre 2011; Imai et al. 2012; de
Vries et al. 2013; Harding and Heaton 2018).
Material and methods
The small changes that occur from antigenic drift usually pro- MutaGAN
duce viruses that are closely related genotypically and antigeni- The core of our model was a seq2seq translation deep neural
cally, which can be illustrated on a phylogenetic tree. These close network, which formed the generator in the GAN. The seq2seq
relations and similar antigenic properties often enable immune encoder E takes as input a sequence of length N, with a dictionary
cross-protection (Tenforde et al. 2021). However, it has been shown size d, and converts it into a vector of length m, 𝐸 ∶ ℝ𝑁x𝑑 → ℝ1x𝑚 ,
that a single change in a particularly important antigenic loca- analogous to the embedding layer. The embedding layer of this
tion on the HA can also lead to the immune system unable to network was created using a biological language of 3-mers of
recognize the virus (Laver et al. 1979; Yewdell, Webster, and amino acids with a sliding window with step size of 1. For example,
Gerhard 1979; Webster and Laver 1980; DeDiego et al. 2016; Li the sequence of amino acids MKTIIALSY is transformed into MKT
et al. 2016a; Lee et al. 2019). Antigenic drift results in influenza KTI TIL ILA…. The output of the decoder LSTM is fed into a soft-
viruses existing as populations with a major genotype and multi- max dense layer (Fig. 2). To achieve our goal of a model that can
ple quasi-species (Kuroda et al. 2010; Lauring and Andino 2010). generate different sequences for a given parent, the element-wise
This mixed diversity can lead to vaccine mismatch and reduced random noise vector sampled from a standard normal distribution
protection (Tricco et al. 2013). Furthermore, antigenic drift over was combined with the output of the encoder.
time leads to accumulation of changes and results in new anti- The structure of the encoder in the discriminator is slightly dif-
genic characteristics that can evade the immune system or trans- ferent from that of the encoder in the generator, but it uses the
mit between species. Current vaccine sequence selection is an same weights. An embedding layer requires the input to be in the
inexact process based on recent field surveillance data to pro- form of a single integer, representing a discrete input. However,
duce the seed stock for the next year (Perofsky and Nelson 2020). the output of the generator at each time step is a probability vector
This approach has resulted in frequent mismatch between vac- with dimension ℝ𝑑x1 . This cannot be transformed into an integer
cine antigens and circulating virus. Further exacerbating vaccine with the argmax function because the argmax function is not dif-
failure is the process employed to evaluate immune response ferentiable, meaning that it does not allow for backpropagation to
elicited by vaccination. Specifically, vaccinated animal sera are train the generator. Therefore, a modified encoder takes as input
tested against the selected virus stock only, which limits the a vector with dimensions ℝ𝑑x1 , with the first layer of the modi-
understanding of the breadth of immune response to other virus fied encoder being a linear dense layer with an output of 𝑚 and no
variants. bias term. The weights of this dense layer are the same as those of
While deep learning has successfully been applied to genomics the embedding layer, meaning that it produces a linear combina-
and biological sequence–related tasks, to date, there is no testable tion of the embeddings from the embedding layer of the encoder
evolutionary forecasting model that predicts with high confidence for parent sequences in the discriminator and the encoder in the
which virus genotypes will emerge and circulate annually. There generator. The generated sequences were fed into this encoder as
has been work studying the mutation of viruses and viral escape softmax outputs of the generator, and the real sequences were fed
using deep learning (Neher, Russell, and Shraiman 2014; Hie et al. in as one-hot encoded sequences.
2021) or using statistical methods to model fitness (Bush et al. The architecture of the discriminator was built using code
̈
1999; Luksza and Lassig 2014, Morris et al. 2018). However, these available on https://github.com/DanBAPL/MutaGAN, with the
do not identify direct parent to child relationships of all sequences final layer of the discriminator being a sigmoid function. This
nor predict future progeny sequences of all parent sequences. includes the loading of the pretrained autoencoder weights.
Furthermore, other models developed to predict virus evolution Because the two encoders used the same bidirectional LSTM,
D. S. Berman et al. 5

the weights for that in the two encoders were automatically Dataset creation
shared once they were loaded into the parent encoder. After rooting the final ML tree to isolate ‘influenza A virus A/Hong
The embedding layer in the model had a dimension of 250 and Kong/1/68(H3N2)’, Marginal ancestral sequence reconstruction
allowed for 4,500 tokens. Unknown tokens are given the value was performed with RAxML using the General Time Reversible
‘[UNK]’. We did not have a problem in generating sequences that model of nucleotide substitution with the gamma model of rate
had unknown tokens on input, as the sliding window with an over- heterogeneity. Parent–child relationships were generated using
lap provided cover. The bidirectional LSTM encoder had 128 units, the Bio.Phylo package in BioPython (Cock et al. 2009) and were
making the encoding dimension 𝑓 = 512 (both cell and hidden limited to single steps between phylogenetic tree levels such that
states for forward and backward LSTMs), and the LSTM decoder each parent had exactly two children. One parent–child pair was
had 256 units. The discriminator consisted of concatenated hid- generated for each of the 13,678 edges within the final binary
den and cell states of the LSTM encoder and modified LSTM tree. Because phylogenetic tree generation requires removal of
encoder, as described earlier. This was fed into fully connected duplicate nucleotide sequences prior to evolutionary modeling,
layers of three sets of dropout (20 per cent), batch normaliza- there was a concern of providing an information bias of evolution
tion, and a dense layer. The first two dense layers had 128 and towards the ancestral sequences (i.e. internal nodes) and away

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


64 dimensions and used a leaky ReLU activation function with from sequences acquired through genomic surveillance (i.e. leaf
𝛼 = 0.1. The final dense layer was a linear activation function with nodes). To mitigate this bias, leaf nodes that had a nucleotide
Dimension 1. sequence matching to multiple records within the IVR database
were inserted back into the dataset as duplicate parent–child
Dataset pairs. As an example, if a leaf node’s sequence was observed
For this work, the influenza virus was chosen as an ideal test four times in IVR, there would be four identical parent–child
case for this deep learning framework because it is a signifi- pairs inserted into the dataset. Upon completion, the number of
cant human pathogen that changes rapidly, with new strains parent–child pairs was increased to 17,218 within the formatted
emerging annually, and global surveillance efforts have generated dataset. Each nucleotide sequence was translated to amino acids
large amounts of publicly available genomic data (Wohlbold and for representation learning of the HA protein by the MutaGAN
Krammer 2014). The surface proteins HA and NA of influenza virus framework.
enable virus entry into cells and are the primary immune epi- The training and test datasets were formed by splitting a list of
topes that elicit antibodies, making them of particular interest for the compiled unique parent sequences in a random 90/10 split.
vaccine development (Wohlbold and Krammer 2014). The result was 1,451 unique parent sequences in the training
dataset and 156 unique parents in the test dataset. There were a
total of 15,699 parent–child pairs in the training dataset and 1,519
Database curation
parent–child pairs in the test dataset. A total of 150 sequence pairs
Influenza virus HA sequences were downloaded from the National
(0.96 per cent) were removed from the training dataset, and 11
Center for Biotechnology Information’s (NCBI) Influenza Virus
sequence pairs (0.72 per cent) were removed from the test dataset
Resource (IVR) (Bao et al. 2008). Utilizing Bash text parsing meth-
in which the amino acid Levenshtein distance (see the Generator
ods including awk and sed, the dataset was curated to gene
evaluation section for description) was ten or greater to prevent
sequences from the influenza A type and H3N2 subtype that were
parent–child pairs that were excessively unrelated, either from
obtained from human hosts between 1968 and 2017 and valida-
sampling bias or mistaken sequence inclusion. This removal had
tion data included strains from 2018–19. Only sequences dated
the effect of isolating an outlier group identified within our phy-
between 1 January 2018 and 31 December 2019 were used for
logenetic tree that appeared as a result of a small number of
the validation dataset. Duplicate records were removed using
sequences not being removed during the pre-phylogenetic filter-
the ‘isolate_name’ and ‘isolation_date’ metadata attributes as a
ing process. Because the phylogenetic tree was created using the
unique identifier. When a duplicate identifier was encountered,
virus gene sequences and synonymous mutations do not lead to
the first record within the IVR database was kept and the remain-
amino acid mutations in proteins, the corresponding parent–child
ing records were discarded. Additionally, only isolates that had
protein pairs could be identical. Of the 1,451 unique parents in
full-length HA segments present in the dataset were kept. Of
the training dataset, 103 parents (7.10 per cent) only had child
note, during curation, twenty-two isolates from swine or avian
sequences that were identical to the parents, while 567 (39.08 per
hosts remained in this dataset (Supplementary Table S1). When
cent) had only one unique child. For a measure of parent–child
completed, the curated sequence dataset contained 6,840 unique
diversity, the training set contained 5,048 parent–child pairs where
records of H3N2 influenza virus unique sequences.
the child’s sequence differed from its parent. Matching parent–
child pairs were removed from the training dataset. In the test
Phylogenetic tree generation set, all instances of matching parents and children were removed,
For input into the seq2seq GAN framework, phylogenetic recon- leaving 433 parent–child pairs with 141 unique parent sequences.
struction was performed using the nucleic acid sequences of the The test dataset only contained pairs in which the parent and child
6,840 HA sequences. All DNA sequences were aligned using Mul- sequences were different.
tiple Alignment using Fast Fourier Transform (v.7.471) (Katoh and The validation dataset was built using the same phylogenetic
Standley 2013) and trimmed to the coding region. The final max- tree construction process as the training and test datasets, but
imum likelihood (ML) tree was made using RAxML (v.8.1.1) (Sta- using only data from 2018 and 2019. As a result, it is a temporally
matakis 2014) with a generalized time reversible model, gamma distinct dataset from the training and test dataset, separated by a
model of rate heterogeneity, and ML estimate of the alpha param- full year. This dataset had 3,260 parent–child pairs. Of these 3,260
eter. The final tree, as shown in Fig. 3, was visualized using FigTree pairs, ten (0.3 per cent) were removed for having a Levenshtein
(v.1.4.4) (Rambaut 2017) with some custom post-processing. distance greater than 10. The remaining 3,250 pairs contained
6 Virus Evolution

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


Figure 3. Topology of RAxML tree used to build parent–child pairs. The topology of the maximum likelihood tree created from 6,840 H3N2 sequences is
shown in (A). The ancestral sequences of each internal node in this tree were used to form the 13,768 parent–child pairs used to train the seq2seq
generator of the GAN framework. An outlying group, containing twenty-two sequences, was identified as coming from swine or avian hosts, and those
sequences are indicated in blue and with *. One of these twenty-two sequences was from the group of 155 parent-child pairs with a Levenshtein
distance >10 and is indicated in yellow and with **. The region surrounding that outlying group (gray box) is expanded in the inset (B) and further
expanded in the insert (C), where it can be seen that the majority of the parent–child pairs removed for high Levenshtein distance in non-human
hosts come directly off the backbone of the phylogenetic tree leading to the outlying group in blue. This trend continues back to the root of the tree.

1,807 (55.6 per cent) of sequences where the parent and child were least one known mutation. This mutation must be a known
not the same. There were 561 unique parent sequences, and 287 mutation for that parent sequence, not any random parent. We
(51.2 per cent) had only one child sequence. refer to this as the known mutation generation rate. This can be
extended to mutations that are in the correct location but incor-
Generator evaluation rect change. This is the known mutation location rate. This is
The most important metric for assessing quality of the generated calculated as follows: for the 𝑖th parent sequence with known
sequences is whether they were able to produce the mutations child sequences, 𝐶𝑖 = {𝑐0 , 𝑐1 , …} and set of known mutations ℳ𝑖 =
observed in the data. However, missing an observed mutation does {𝑚0 , 𝑚1 , …}. For a given input parent sequence used to generate
not necessarily mean that the generator did not correctly predict 𝑘 potential sequences, we can create a set of potential mutations
possible mutations and only that it did not correctly predict all ℳ′𝑖 = {𝑚′0 , 𝑚′1 , …}
observed mutations. It is possible that predicted mutations would
have likely occurred but just were not included in our subset.
We identified mutations in our sequences by measuring the
Levenshtein distance between parent and child sequences. By where 𝑁 is the total number of parent sequences.
using the Levenshtein distance, we were able to account for inser-
tions and deletions as well as mutations (Levenshtein 1966), and Amino acid mutation frequency
by using the diff-match package from Google, we were able to Frequency of amino acid mutations was calculated to evaluate
identify where changes were made between two sequences (Fraser the similarity of the mutation profiles between MutaGAN and the
et al. 2018). The diff_cleanupSemantic function in the diff-match- ground truth data (Equation 2). For each mutation within a given
patch package was used to identify where changes were made set of mutations, 𝑎(𝑝) is the amino acid of the parent, 𝑎(𝑐) is the
between parents, real children, and generated children. A list of amino acid of the child, and 𝑥𝑎(𝑝) 𝑎(𝑐) is the count of all mutations
all child mutations was created for each parent and compared to observed from one amino acid to another. The amino acid muta-
each parent’s generated children. A mutation was counted as cor- tion frequency is the value 𝑥𝑎(𝑝) 𝑎(𝑐) divided by the total number of
rect if change occurred in the generated child that was identical in recorded mutations. This provides information on the frequency
amino acid and location as observed in the parent’s real children. of mutating one amino acid to another:
A partially correct mutation was defined as an amino acid change
in the generated child that was identical in location to any muta-
tion in a real child of that parent but differed in amino acid type.
False mutations were defined as mutations that were predicted True-positive rate
but non-existent in the real children, and missed mutations were There are two ways of evaluating whether the generated
defined as mutations observed in at least one of the real children sequences contain the known mutations. The first is to consider
of the parent but not in the generated child. all the mutations for a known parent and determine whether the
In addition to the Levenshtein distance, we use four metrics generated sequences contain those mutations. This can be mea-
for evaluating the performance of the generator: known mutation sured similar to the standard true-positive rate formula, and thus,
location rate, amino acid mutation frequency, true-positive rate, we will refer to it as the standard true-positive rate,
and weighted true-positive rate.

Known mutation location rate


The first and our primary evaluation metric is the percent of
parent sequences for which MutaGAN was able to generate at
D. S. Berman et al. 7

This is useful in determining how well the variety of mutations 2015) on four GeForce GTX 1080 Ti graphical processing units.
are reflected in the generated sequences. However, it does penalize Additionally, metrics were calculated using the functions in the
sequences with multiple known mutations unless those gener- package scikit-learn version 0.24.1(Pedregosa et al. 2011), the diff-
ated sequences contain all known mutations, which is not some- match-patch package (Fraser et al. 2018), and the natural language
thing we want, as we want these mutations distributed across toolkit package (Bird, Klein, and Loper 2009).
sequences to be proportional to how they would appear in the
sequencing data. Take, for example, a parent sequence ACFKLM Model training
has two children, ACFHLM and ACFHIM. If MutaGAN generates 100 The maximum number of words included in the embedding layer
sequences, fifty of them are exact matches to Child 1 and fifty of was 4,500, which was selected by rounding up the number of
them are exact matches to Child 2, the true-positive rate would be unique 3-mers found in our dataset. Additionally, we selected an
50.0 per cent because only half of known mutations appear across embedding size of 250. The encoder portion was a bidirectional
all generated sequences. If all generated sequences were ACFHIM, LSTM with 128 nodes, resulting in a state vector of 512 being
the true-positive rate would be 100.0 per cent because all known passed to the decoder, which was a unidirectional LSTM. Gener-
mutations appear in all children. However, we want our gener- ator pretraining was performed on the training dataset and tested

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


ated sequences to be more similar to the former scenario than on the test dataset. This was performed using the Adam opti-
the latter. mizer (Kingma and Ba 2014) with a learning rate of 0.01 until the
Therefore, we will also use a variation on the standard true- model reached a stable state. It was set to train for 72 epochs, but
positive rate in which the number of times a mutation was made converged far before that.
or missed is ignored, paying attention only to whether it hap- GAN training occurred in two stages based on batch size, for
pened across a given parent. We refer to this as the sequence a total of 350 epochs. In both stages, we selected the Adam opti-
true-positive rate because it focuses on the presence of mutations mization algorithm and the learning rate for the generator was
across a given parent sequence. The sequence true-positive rate 1 × 10− 3 and the discriminator was 3 × 10−5 . The learning rate for
is calculated the same way as the standard true-positive rate is the discriminator was chosen to avoid mode collapse. The first 200
calculated, with one additional step: a unique list of all the muta- epochs of the model were trained on a batch size of 32 with the
tions is made for sequences generated for each parent, rather than discriminator for five epochs and the generator training for five
using the mutations made for each generated sequence. epochs. The last 150 epochs of the model were trained on a batch
size of 45 with the discriminator and the generator training for five
Weighted true-positive rate epochs each.
To account for different levels of similarity between any two Typically, the discriminator is only meant to help the genera-
amino acids when evaluating mutational errors, Sneath’s index tor create realistic data, but the MutaGAN discriminator has the
(Sneath 1966), a percentage representation of the number of added goal of making sure that the generated sequence is a pos-
dissimilar comparisons of amino acids along 134 categories of sible child of the parent. Therefore, we modeled our approach
activity and structure, was incorporated into a calculation for after Reed et al. (Reed et al. 2016) and created three types of
weighted accuracy. For this analysis, we removed the prediction sequence pairs to train the discriminator. The first pair type is
of the ambiguous amino acid designation (X). Additionally, we set real parents and real children, as determined from the phyloge-
the lower limit on the allowable similarity to 0.85 to prevent over netic model. The second is real parents and generated children.
rewarding when calculating weighted averages. Only eighteen dif- The third is real parents and real non-children. The purpose of
ferent amino acid pairings with Sneath’s Index have similarities the third pair is to ensure that the model learns to differentiate
≥0.85 and were included, while all other comparisons were set between related and unrelated sequences in the context of evolu-
to 0. This means there are only eighteen different types of mis- tion. Ten thousand training records of the third type were gener-
takes for which partial credit can be awarded. To calculate the ated by randomly pairing unrelated parent and child sequences
weighted accuracy, each mutation found in the set of generated with a Levenshtein distance >15. A lower bound of fifteen was
children was weighted using thresholded Sneath’s index, 𝑆, and selected because we wanted to avoid unrelated parent child pairs
averaged across the entire table of predicted mutations, 𝐴, where that could be too close, as those could be parent–child pairs that
the columns are the predicted amino acid and the rows are the were not sequenced and therefore not observed. As a result, we
expected amino acid, as calculated in Equation 3, could model sequences that were similar and real, but not directly
related.
To optimize performance of the model, our framework devi-
ated from previously published methods in a number of ways.
where ⊗ is an element-wise multiplication of two matrices with The MutaGAN seq2seq model was pretrained prior to input into
the same dimensions. the GAN using teacher forcing (Williams and Zipser 1989), so the
generator’s decoder also contained a similar embedding layer with
Experiment 4,500 words and an embedding size of 250. The loss function was
In this section, we present the experiment we designed to train the standard sparse categorical cross entropy loss function.
and test MutaGAN. The initial version of the GAN used a binary cross-entropy loss
function (Equation 4),
Setup
The phylogenetic tree reconstruction took place on a 16 processor
64 GB RAM compute node running Ubuntu. RAxML tree opti-
mization and ancestral reconstruction took roughly 14 days to
complete. The models were built, and training and testing was However, early iterations of our model using this loss function
implemented in Python version 3.8.8 using the libraries Tensorflow were characterized by mode collapse, where the generator pro-
version 2.6.0 (Abadi et al. 2016) and Keras version 2.6.0 (Chollet duces an unvarying child sequence given a single parent sequence.
8 Virus Evolution

Table 1. The results and metrics of the MutaGAN model and the of {−2, −1, 0, 1, 2} to perturb that location to provide some addi-
baseline model on the validation data. tional variability. For the distribution of the amino acid change,
𝑝aa , we used the distribution of amino acid changes in the set
MutGAN perfor- Baseline model
mance on the on the validation
of known mutations in the training data. Therefore, the process
Metric validation dataset dataset of generating a new sequence using the baseline method is as
follows:
Known mutation 72.8% 44.1%
generation rate
Known mutation location 77.8% 77.1% 1. 𝑛 ∼ 𝑝num
rate 2. 𝑙 ∼ 𝑝loc until there are 𝑛 unique 𝑙𝑠
Median Levenshtein 3. 𝑦𝑖̂ ∼ 𝑝𝑎𝑎 (𝑥𝑖 ) for the amino acid 𝑥𝑖 at location 𝑙𝑖 for 𝑖 = 1, … , 𝑛.
distance
Generated vs child 4.00 (𝜇 = 4.83, 1.00 (𝜇 = 1.62,
𝜎 = 4.12) 𝜎 = 1.10)1.00 As with MutaGAN, 100 sequences were generated for each par-
Generated vs parent 4.00 (𝜇 = 5.10, (𝜇 = 1.69, ent in the validation dataset. The results for the baseline model

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


𝜎 = 4.14) 𝜎 = 1.14) are shown in Table 1.
Average difference in
amino acid mutation
frequency
Results
Generated vs training 3.2 × 10−3 9.1 × 10−4
Generated vs validation 2.0 × 10−3 1.1 × 10−3 After the model was trained, we generated 100 child sequences
True-positive rate for each of the 562 unique parent sequences in the validation
Standard (weighted) 23.6% (60.5%) 0.1% (32.3%) set. After removing overlap between parents from the training
Sequence level (weighted) 25.3% (67.0%) 3.1% (79.2%) and validation set, there were 536 unique parents. The process
of generating the 53,600 sequences took approximately 30 min.
We then discarded 1,032 sequences with >sixty amino acid dif-
To resolve this problem, the loss function was switched from ferences from their parent sequences, a non-biological artifact
binary cross-entropy to Wasserstein loss that we are treating as noise. A change of this magnitude corre-
sponds to 10 per cent of the overall protein structure and is highly
improbable to have occurred by chance within a single evolu-
tionary step on the timescale with which the phylogenetic model
was created. There were two parent sequences we were unable
where 𝑦𝑛 is the ground truth value, either 1 or −1, and 𝑦𝑛̂ is the to generate viable children for, corresponding to a 0.36 per cent
predicted value, and the final layer of the discriminator into a failure rate. MutaGAN’s performance is summarized in Table 1.
linear activation function (Arjovsky, Chintala, and Bottou 2017). The median Levenshtein distance between parent and observed
The loss of the generator is the sum of the Wasserstein loss child amino acid sequences within the validation dataset was
and the sparse categorical cross-entropy of generating the child 1.00 (𝜎 = 1.06). The median Levenshtein distance between the
sequence. generated sequence and the closest child sequence of the input
In a variation from Reed et al. (2016), we used sequences gener- parent in the validation dataset is 4.00 (𝜇 = 4.83, 𝜎 = 4.12), com-
ated by the initial GAN as additional negative examples in training pared to 4.00 (𝜇 = 5.10, 𝜎 = 4.14) between the generated and parent
the discriminator of the final model to prevent the model from sequence (Supplementary Fig. S2). The generated sequences were
drifting too far off course as a form of experience replay, similar marginally closer to the child sequences, but still very close, to the
to an approach used in deep reinforcement learning (Lin 1992). parent sequences. This indicates that the model is augmenting its
The initial GAN created a high proportion of generated children input to account for the learned model of protein evolution.
with a Levenshtein distance >300 (Supplementary Fig. S1). Using The known mutation generation rate was 72.8 per cent, as 390
this model to generate 10,000 children from randomly selected of the 536 parent sequences had at least one observed muta-
parents, with replacement, and removing pairs where the Lev- tion augmented onto it within its generated child sequences.
enshtein distance was <15, we were left with 8,550 parent–child Of the parent sequences that MutaGAN did not correctly iden-
pairs. These sequences were used for experience replay. The dis- tify a mutation as observed in the ground truth, there were
tribution of the Levenshtein distances of the sequences for both twenty-seven(5 per cent) sequences for which MutaGAN pro-
the fake parent–child pairs and the experience–replay pairs is duced a mutation in the correct location but with the incor-
shown in Supplementary Fig. S1 in a stacked histogram, with rect amino acid. Therefore, the known mutation location rate is
the real parent and real non-child sequences in blue and the 77.8 per cent.
real parent and generated sequences from the failed model in Because amino acids range in biochemical and physical sim-
orange. ilarities, it is important to look closer at the actual mutations
As a baseline comparison model, we used a Monte Carlo sim- that are made or missed, especially because many of the extra
ulation with the training data acting as our source of the histor- mutations correspond to common biological mutations between
ical statistics to create a fixed probability model. For this Monte functionally similar amino acids. Mutation profiles by amino acid
Carlo simulation, we create three distributions from the training are provided in Fig. 4. A side-by-side comparison of the mutational
data: number of mutations, location of mutation, and amino acid profiles is made across the training, validation, and MutaGAN-
change. For the distribution of the number of mutations, 𝑝num , we generated amino acid sequences with respect to the parent input
used the number of mutations in the training data and sampled sequences. MutaGAN’s amino acid mutational profile is strik-
from that. For the distribution of the location of the mutation, 𝑝loc , ingly similar to that of both the training and validation datasets,
we sampled from the distribution of the locations of known muta- indicating that the model has learned a measure of biological
tions in the training data and then randomly sampled from a set significance in the biophysical and chemical properties of amino
D. S. Berman et al. 9

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


Figure 4. Amino acid mutation profiles with respect to amino acid types. For the training, validation, and generated child sequences, total counts for
each amino acid mutation from parent to child are displayed in (A). Amino acid ordering was determined using R’s hclust function on the training
data and kept consistent throughout both (A) and (B). Differences in amino acid mutation frequency between the training, validation, and generated
datasets were calculated and are visualized in (B) using Equation 2

acids. To assess if MutaGAN’s generated amino acid mutation pro- (H)→asparagine (N) (Fig. 4B). Interestingly, Arginine is the second
file more closely resembles the training set over the validation most favorable amino acid mutation from glutamine behind glu-
set or vice versa, the average difference in amino acid muta- tamate (Barnes and Gray 2003). Aspartate and serine are the most
tion frequency was calculated for the ‘Generated vs Training’ and favorable amino acid mutations from asparagine alongside histi-
‘Generated vs Validation’ delta mutation profiles (Fig. 4B). This dine, and asparagine is the second most favorable mutation from
measure of distance was calculated to be 3.2 × 10−3 and 2.0 × 10−3 , histidine behind tyrosine. Another frequently incorrect MutaGAN
respectively. Importantly, the MutaGAN generated mutation pro- mutation of note is serine (S)→proline (P). Serine, when present
file shows changes in mutation frequency for specific amino acids on a protein’s surface, often forms hydrogen bonds with the pro-
that more closely resembles the validation set when compared tein’s backbone and effectively mimics proline (Barnes and Gray
to the training set (Fig. 4A). In particular, it is observed that 2003). In accordance, the four locations that MutaGAN incorrectly
there are higher proportions of threonine (T)→lysine (K), threonine mutated the HA protein from a serine to a proline were at amino
(T)→isoleucine (I), arginine (R)→lysine (K), and glycine (G)→aspar- acid positions 143, 198, 199, and 227 within the HA1 chain, all of
tic acid (D) mutations and lower proportions of alanine (A)→valine which are located on protein’s surface.
(V), glycine (G)→arginine (R), and alanine (A)→threonine (T) within As an artifact of the phylogenetic analysis, a small but notice-
the ground truth validation data as compared to the ground truth able portion of child sequences within the training, test, and
training data. For these same amino acids, MutaGAN’s mutational validation datasets contained the ambiguous amino acid symbol
profile shows the same trends. ‘X’ at some location within its sequence. The appearance of ‘X’ in
The results in Table 1 indicate that the MutaGAN model and a child sequence created the appearance that a parent amino acid
the baseline model were able to comparably generate mutations could mutate to ambiguity. However, MutaGAN never mutated
at sites known for mutations in the validation data, with the a parent amino acid to ambiguity (Fig. 4A). This is likely due to
known mutation generation rate of models being 77.8 per cent and the fact that of all the amino acids in the training dataset, only
77.1 per cent, respectively. However, the known mutation genera- 3.35 × 10−3 per cent were ‘X’, meaning that there were too few
tion rate for MutaGAN was 72.8 per cent vs the baseline rate of examples of ‘X’ for it to learn it.
44.1 per cent. Additionally, the true positive rate (TPR) scores are The overall mutation location profile of historical H3N2
significantly higher for MutaGAN. While the median Levenshtein influenza virus HA proteins was well reproduced by Muta-
distance and the average difference in amino acid mutation fre- GAN (Fig. 5). Supplementary Figure S3 shows the same plot on
quency were lower for the baseline model, this is expected. Addi- the test data for comparison purposes. The most highly variable
tionally, this indicates that MutaGAN, while capable of capturing regions identified in the training and validation datasets (HA1
a similar profile of mutation locations, is capable of making muta- amino acid indices 120–160 and 185–228) were also the most
tions outside the historical distributions that match the validation mutated regions by MutaGAN. Regions of lesser, but still signifi-
data. cant, variability were also identified by MutaGAN in accordance
The most prominent amino acid mutations that were made with the historical H3N2 data observed in the training and vali-
by MutaGAN that were not seen frequently in either the train- dation sets of Fig. 5 such as HA1 residue regions (i.e. amino acid
ing or validation data are glutamine (Q)→arginine (R), asparagine indices) of 45–59, 259–262, and 273–278. Regions of historical con-
(N)→aspartate (D), asparagine (N)→serine (S), and histidine servation were accurately preserved by MutaGAN, most notably
10 Virus Evolution

the HA1 residue region of 11–24 and the HA2 residue region 1–16. increases the influenza virus’s likelihood of evading host anti-
Of the top ten most frequently mutating positions in the training bodies during infection. The majority of the MutaGAN-generated
dataset, MutaGAN’s only had one within its own top ten (Position mutations also occur in these same regions but across a notably
121). Between the validation set and MutaGAN-generated set, larger number of residues on the protein surface. This finding
there were no overlaps in the top ten most frequently mutated alludes to the ability of this framework to illuminate localized
positions. However, for many of the most frequently mutating function across varying regions of the overall protein structure,
amino acid locations within the training and validation sets, the but further simulations must be performed to investigate the
location was one or two positions away from a commonly mutated functional effects of the MutaGAN-generated mutations.
position in the MutaGAN-generated sequences. For instance, HA1 We can also examine whether the model is simply repeating
residues 142, 160, and 193 were in the top ten most frequently the same mutations or whether it is selecting mutations based
mutated positions in the training and validation sets. HA1 residues on the input sequence. We can determine this by comparing the
145, 159, and 192 were in the top ten most frequently mutated counts of mutations for generated sequences, shown in the bot-
positions by MutaGAN. This phenomenon is worth noting because tom graph of Fig. 5, which uses sequences from 2018 to 2019 and
of the closeness, but the biological significance is not readily Supplementary Fig. S3, which uses data similar to the training

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


apparent without a deeper analysis of the HA protein structure. dataset from 1965 to 2017. These figures appear substantially dif-
In looking at the structure of the HA protein more closely (Fig. 5B), ferent, especially in all seven areas of interest. This indicates that
it is clear that the concentration of the most frequently mutated the model is not simply generating the same mutations regardless
position for both the training and validation data sets occurs out- of the training data, but generating new mutations based on the
side of the protein structure, principally on the outer surface of the input.
HA1 domain toward the host-recognition regions. It is well under- While recognizing that the phylogenetic tree does not cap-
stood that the frequent mutation of amino acids at these locations ture the entire breadth of mutations that occurred during the

Figure 5. Amino acid mutation profiles with respect to HA protein locations. For the training, validation, and generated child sequences, total counts
of mutations observed across the entire length of the HA protein segment are displayed in (A), indicating the signal peptide, HA1 (head), and HA2
(stalk) regions of the full HA protein. The most highly variable regions are highlighted in salmon, the third and fourth highlighted regions. Regions of
lesser, but still significant variability, are highlighted in yellow, the second, fifth and sixth highlighted regions. Particularly conserved regions are
highlighted in blue, the first and last highlighted regions. In (B), a diagram of the H3 HA structure (PDB: 4GMS) is colored by this mutation frequency,
with the positions with the fewest mutations in yellow to the positions with the most mutations in brown. Positions with zero observed mutations
across each dataset are colored gray. Residues are displayed as spheres for positions with mutation frequencies above 30 per cent of the maximum
position for each of the three datasets. These 30 per cent threshold lines are also plotted in (A).
D. S. Berman et al. 11

Discussion
Accurate mutation forecasting from protein
sequences
The MutaGAN framework presented here is the first method to
utilize a GAN to accurately reproduce and optimize full-length
proteins above 300 amino acids in length with no structural infor-
mation provided to the model beyond amino acid sequence. The
accurate reproduction of the mutation profiles of the HA protein
with specificity of both the types of amino acids changed (Fig. 4)
and locations most likely for persistence (Fig. 5) demonstrates
the potential of this method to be used as a tool for forecasting
the genetic drift or shift that occurs during the outbreak of the
influenza virus. Our findings indicate that the sequence augmen-
tation strategies deployed by MutaGAN optimize its input toward

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


the most successful patterns observed during the evolutionary
history of a protein. Because this framework is agnostic to the type
of phylogenetic tree and protein type used to generate parent–
Figure 6. Histograms showing the distribution of correct mutations as
both total counts (A) and percentage of total recorded mutations (B). child pairs, the extension of these methods to new proteins and
organisms (e.g. the NA protein for influenza and the dengue virus)
is ripe for exploration.
entire evolution of the influenza virus, MutaGAN’s performance
The ability of MutaGAN to learn and optimize mutations for
was evaluated with respect to this tree as our closest proxy of its
persistence within a population lends itself well to protein engi-
ability to mimic the virus’s evolutionary landscape. The number
neering applications. As demonstrated in Figs 4 and 5, the unique
of observed mutations reproduced for each parent is visualized
nuances of change within a protein population are capable of
in Fig. 6 and shows that most generated children contained at least
being captured without any additional expert knowledge being
one observed mutation, while a smaller number contained more.
provided to the model beyond a list of parent–child pairs for train-
As a measure of the standard true-positive rate, each generated
ing tailored to a specific protein. The observation that MutaGAN
child’s sequence contained 23.6 per cent of observed mutations
inserts mutations that are biologically relevant, even when not
between its parent and real children (the standard true-positive
observed in the ground truth data, poses the question of whether
rate comparing the generated and real child sequences). Using
these mutations produce energetically favorable protein confor-
Sneath’s index (Sneath 1966) to get a deeper assessment of the
mations with increased fitness within the evolutionary landscape.
closeness of MutaGAN’s predictions, we find that it has a weighted
Future work could pair computational protein modeling with
true-positive rate of 60.5 per cent, calculated using Equation 3,
this framework for a deeper analysis of the MutaGAN-generated
to compare the generated and real child sequences. This large
sequences for improved forecasting of population-level mutation
increase from an unweighted standard true-positive rate of
propagation. With direct ties to the public health domain, by
23.6 per cent to 60.5 per cent indicates that a majority of the
measuring the conformational protein favorability of MutaGAN-
mutations found within the real child sequences are similar in
generated sequences and analyzing their similarity to currently
biochemical and physical properties to the amino acids MutaGAN
circulating pathogenic sequences, public health officials could
used in those locations. When we use the sequence true-positive
assess the threat of potential mutations against vaccine evasion
rate, we find that the unweighted sequence true-positive rate
and improve the design of future treatments or vaccines.
is 25.3 per cent and the weighted sequence true-positive rate is
In addition to identification of persistent mutations in a pop-
67.0 per cent.
ulation, MutaGAN has potential to predict novel mutations. Pre-
For 300 parent sequences (53.6 per cent), our model generated
vious models utilizing evolutionary information are able to iden-
the same child sequence in each of the 100 iterations, regardless
tify historical sequences that may give rise to future clades
of the noise, using the distribution N(0,1) (Supplementary Fig. S4).
(Neher, Russell, and Shraiman 2014) or may provide insight into
While this behavior might appear to be mode collapse, in the
mutations that potentially affect fitness and therefore persist in
ground truth validation data, the phylogenetic tree had only one
future populations (Obermeyer et al. 2022). These models can only
child for 397 (71 per cent) of the parent sequences as a direct result
provide insight into historical sequences. In contrast, MutaGAN
of using the amino acid rather than nucleotide sequences and
can utilize evolutionary information to make accurate predictions
masking synonymous mutations. Of these 560 parent sequences
regarding persistence of mutation as well as identify potential
whose generated children exhibited mode collapse–like behavior,
new mutations of descendant sequences. Furthermore, MutaGAN
287 (51.25 per cent) also had only one child in the ground truth
predicts future generations at the sequence level, thus provid-
phylogenetic tree data. Since these proportions of parents with
ing a mechanism to identify amino acid changes of descendant
only one child are similar in both the generated results and the
sequences that can be incorporated into fitness models to inform
validation dataset, and notably larger than the 39.1 per cent in the
forecasting of seasonal influenza variants.
training dataset, and there is a large overlap in sequences with
only one child in the generated results and validation dataset, it
appears that the mode collapse is only partially responsible. This Model reproducibility
is further supported by the standard and sequence level true- Although the mutation profiles are well-reproduced with respect
positive rates, which indicate a defaulting to one type of known to amino acid and location, MutaGAN’s performance, when eval-
mutation, rather than generating a variety of them. Future work, uated at the single-nucleotide resolution, has a significant room
such as operating directly on nucleotide sequences, could help for improvement. The 23.6 per cent and 25.3 per cent capture of
reduce the impact of mode collapse on the sequence generation. standard and sequence true-positive rates and 81.2 per cent and
12 Virus Evolution

88.4 per cent prediction of false positives highlight shortcomings populations. As a result, we see the potential for this research to
of the current trained model. We hypothesized that the model play a significant role in public health, particularly in disease mit-
could be improved solely by improving the dataset. There is a igation and prevention. With the improvements outlined earlier, if
high likelihood that the inclusion of swine and avian influenza MutaGAN was implemented to simulate how currently circulating
sequences into the phylogenetic model inhibited the MutaGAN’s pathogens could evolve over time, targeted measures of quaran-
ability to fit itself on patterns of HA protein evolution specific to tine and treatment could be more effectively deployed. MutaGAN’s
human infection. In addition to a more comprehensive curation ability to produce full-length protein sequences while simultane-
of outliers, a larger population of HA protein sequences could be ously learning the nuances of evolution lends itself well to the
utilized to provide additional diversity to the model. The small extension to other protein types, creating potential for impact
number of database records (6,840) used to generate the phy- within the domain of multiple diseases.
logenetic model was unlikely to capture the full breadth and
depth of the true evolutionary landscape of the human H3N2 Data availability
influenza virus HA protein. By leveraging a larger database of
Influenza virus HA sequences were downloaded from the NCBI
influenza virus surveillance, such as Global Initiative on Shar-

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


IVR. The parent–child pairs used for the training, test, and vali-
ing All Influenza Data’s EpiFlu (Shu and McCauley 2017), a more
dation datasets, extracted from the phylogenetic tree, are avail-
complete evolutionary model could be generated and provided
able at https://github.com/DanBAPL/MutaGAN. Additionally, we
to MutaGAN for training, testing, and validation. In addition to
have provided the parent–non-child pairs and the bad generated
the diversity of sequences provided to the phylogenetic tree, con-
sequences (and models), along with the tokenizer. Code for train-
structing this phylogenetic model using time-based Bayesian tree
ing the MutaGAN model, performing analysis, and generating
estimation methods could improve the ability of MutaGAN to
figures is available at https://github.com/DanBAPL/MutaGAN.
learn time-related aspects of evolution as well as enable a deeper
characterization of the model’s ability to forecast into the future.
This approach could also provide us the opportunity to compare Supplementary data
our model predictions to those of the experts’ predictions in a Supplementary data are available at Virus Evolution online.
given year’s influenza vaccine, as well as look into explainable
artificial intelligence techniques to understand why the model
Acknowledgements
made the mutations it did. Future studies will also explore non-
deterministic methods of ancestral sequence reconstruction uti- We are grateful to Jason Fayer, Ben Baugher, Kyle Klarup, and
lizing the simulated nucleotide probabilities per position rather Logan Hauenstein for their support, comments, corrections, and
than strictly including the most probable sequence of the internal feedback.
node for inclusion in the parent–child pair. This would allow us to
include in our training data evolutionary pathways that are not Funding
part of the ML pathway, but are still probable.
This work was supported by the Johns Hopkins University Applied
There are also plans to further improve upon the MutaGAN
Physics Laboratory (JHUAPL) Janney Program and the National
framework’s architecture. There are a number of recent advances
Institute of Allergy and Infectious Diseases Centers of Excellence
in NLP and sequence generation that can be leveraged to fur-
in Influenza Research and Surveillance [HHSN272201400007C].
ther improve this algorithm. These advancements include models
like attention (Wu et al. 2016; Vaswani et al. 2017; Zhang et
Conflict of interest: The authors declare no conflicts of interest.
al. 2019), bidirectional encoder representation from transform-
ers (Devlin et al. 2018), and reinforcement learning (Li et al.
2016b; Yu et al. 2017; Keneshloo et al. 2019; Tuan and Lee 2019). References
These models have shown significant improvement over mod- Abadi, M. et al. (2016) ‘Tensorflow: A system for large-scale machine
els relying on LSTMs alone for NLP tasks. When paired with the learning’, in 12th {USENIX} Symposium on Operating Systems Design
larger dataset provided by a larger influenza database and non- and Implementation ({OSDI} 16), pp. 265–83.
deterministic methods for ancestral sequence reconstruction, we Alipanahi, B. et al. (2015) ‘Predicting the Sequence Specificities of
believe that the fidelity of sequence reconstruction and optimiza- DNA- and RNA-Binding Proteins by Deep Learning’, Nature Biotech-
tion can be improved. Because operating directly on nucleotide nology, 33: 831–8.
data increases the length of the sequence data from amino acids Anand, N., and Huang, P. (2018) ‘Generative Modeling for Protein
by a factor of three and further inhibits the recurrent neural net- Structures’, Advances in Neural Information Processing Systems, 31:
work (RNN) identification of long-range structural relationships, 7504–15.
it was avoided in this study. However, with the implementa- Arjovsky, M., Chintala, S., and Bottou, L. (2017) ‘Wasserstein Gener-
tion of more robust encoder–decoder architecture, future research ative Adversarial Networks’, in International Conference on Machine
could evaluate the feasibility of MutaGAN to operate directly on Learning, pp. 214–23.
nucleotide sequences. Doing so would align MutaGAN with the Asgari, E., and Mofrad, M. R. K. (2015) ‘Continuous Distributed Rep-
industry standard in phylogenetic analysis and potentially enable resentation of Biological Sequences for Deep Proteomics and
improved learning of evolutionary landscapes through the added Genomics’, PLoS One, 10: e0141287.
information of synonymous mutations observed within protein Bahdanau, D., Cho, K., and Bengio, Y. (2014) ‘Neural Machine Trans-
lineages. lation by Jointly Learning to Align and Translate’, arXiv preprint
arXiv:1409.0473.
Bao, Y. et al. (2008) ‘The Influenza Virus Resource at the National Cen-
Conclusion ter for Biotechnology Information’, Journal of Virology, 82: 596–601.
Taken together, we have developed a first-of-its-kind deep learn- Barnes, M. R., and Gray, I. C. (2003) Bioinformatics for Geneticists. John
ing framework to predict genetic evolution in dynamic biological Wiley & Sons.
D. S. Berman et al. 13

Bedford, T., Rambaut, A., and Pascual, M. (2012) ‘Canalization of Katoh, K., and Standley, D. M. (2013) ‘MAFFT Multiple Sequence
the Evolutionary Trajectory of the Human Influenza Virus’, BMC Alignment Software Version 7: Improvements in Performance and
Biology, 10: 1–12. Usability’, Molecular Biology and Evolution, 30: 772–80.
Bengio, Y. et al. (2003) ‘A Neural Probabilistic Language Model’, Journal Kawaoka, Y., Krauss, S., and Webster, R. G. (1989) ‘Avian-to-Human
of Machine Learning Research, 3: 1137–55. Transmission of the PB1 Gene of Influenza A Viruses in the 1957
Bengio, Y., Courville, A., and Vincent, P. (2013) ‘Representation Learn- and 1968 Pandemics’, Journal of Virology, 63: 4603–8.
ing: A Review and New Perspectives’, IEEE Transactions on Pattern Keneshloo, Y. et al. (2019) ‘Deep Reinforcement Learning for
Analysis and Machine Intelligence, 35: 1798–828. Sequence-to-Sequence Models’, IEEE Transactions on Neural Net-
Bepler, T., and Berger, B. (2019) ‘Learning Protein Sequence works and Learning Systems, 31: 2469–89.
Embeddings Using Information from Structure’, arXiv preprint Killoran, N. et al. (2017) ‘Generating and Designing DNA with Deep
arXiv:1902.08661. Generative Models’, arXiv preprint arXiv:1712.06148.
Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing Kingma, D. P., and Ba, J. (2014) ‘Adam: A Method for Stochastic
with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Optimization’, arXiv preprint arXiv:1412.6980.
Media, Inc. Kosik, I., and Yewdell, J. W. (2019) ‘Influenza Hemagglutinin and Neu-

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


Bush, R. M. et al. (1999) ‘Predicting the Evolution of Human Influenza raminidase: Yin–Yang Proteins Coevolving to Thwart Immunity’,
A’, Science, 286: 1921–5. Viruses, 11: 346.
CDC (2022), Types of Influenza Virus. <https://www.cdc.gov/flu/about/ Kuroda, M. et al. (2010) ‘Characterization of Quasispecies of Pan-
viruses/types.htm> accessed 5 Jan 2022. demic 2009 Influenza A Virus (A/H1N1/2009) by de Novo Sequenc-
Chollet, F. et al. (2015), Keras. <https://keras.io>. ing Using A Next-generation DNA Sequencer’, PloS One, 5:
Cock, P. J. A. et al. (2009) ‘Biopython: Freely Available Python Tools e10256.
for Computational Molecular Biology and Bioinformatics’, Bioin- Kussell, E., and Leibler, S. (2005) ‘Phenotypic Diversity, Population
formatics, 25: 1422–3. Growth, and Information in Fluctuating Environments’, Science,
DeDiego, M. L. et al. (2016) ‘Directed Selection of Influenza Virus Pro- 309: 2075–8.
duces Antigenic Variants that Match Circulating Human Virus Lauring, A. S., and Andino, R. (2010) ‘Quasispecies Theory and the
Isolates and Escape from Vaccine-Mediated Immune Protection’, Behavior of RNA Viruses’, PLoS Pathogens, 6: e1001005.
Immunology, 148: 160–73. Laver, W. et al. (1979) ‘Antigenic Drift in Type A Influenza Virus:
Devlin, J. et al. (2018) ‘Bert: Pre-Training of Deep Bidirec- Sequence Differences in the Hemagglutinin of Hong Kong (H3N2)
tional Transformers for Language Understanding’, arXiv preprint Variants Selected with Monoclonal Hybridoma Antibodies’, Virol-
arXiv:1810.04805. ogy, 98: 226–37.
de Vries, R. P. et al. (2013) ‘Evolution of the Hemagglutinin Protein Ledig, C. et al. (2017) ‘Photo-realistic single image super-resolution
of the New Pandemic H1N1 Influenza Virus: Maintaining Opti- using a generative adversarial network’ in Proceedings of the IEEE
mal Receptor Binding by Compensatory Substitutions’, Journal of Conference on Computer Vision and Pattern Recognition, pp. 4681–90.
Virology, 87: 13868–77. Lee, J. M. et al. (2019) ‘Mapping Person-to-person Variation in Viral
Frank, S. A., and Slatkin, M. (1990) ‘Evolution in a Variable Environ- Mutations that Escape Polyclonal Serum Targeting Influenza
ment’, The American Naturalist, 136: 244–60. Hemagglutinin’, Elife, 8: e49324.
Fraser, N. et al. (2018), Google-diff-match-patch. <www.google-diff- Levenshtein, V. I. (1966) ‘Binary Codes Capable of Correcting Dele-
match-patchhttp://code.google.com/p/google-diff-match-patch> tions, Insertions, and Reversals’, Soviet Physics Doklady, 10:
accessed 3 Jan 2019. 707–10.
Goodfellow, I. et al. (2020) ‘Generative Adversarial Networks’, Commu- Levy, O., and Goldberg, Y. (2014) ‘Linguistic regularities in sparse
nications of the ACM, 63: 139–44. and explicit word representations’ in Proceedings of the Eighteenth
Gupta, A., and Zou, J. (2018) ‘Feedback GAN (FBGAN) for DNA: A Novel Conference on Computational Natural Language Learning, pp. 171–80.
Feedback-Loop Architecture for Optimizing Protein Functions’, Li, C. et al. (2016a) ‘Selection of Antigenically Advanced Variants of
arXiv preprint arXiv:1804.01694. Seasonal Influenza Viruses’, Nature Microbiology, 1: 1–10.
Harding, A. T., and Heaton, N. S. (2018) ‘Efforts to Improve the Li, J. et al. (2016b) ‘Deep Reinforcement Learning for Dialogue Gener-
Seasonal Influenza Vaccine’, Vaccines, 6: 19. ation’, arXiv preprint arXiv:1606.01541.
Heffernan, R. et al. (2015) ‘Improving Prediction of Secondary Struc- Lin, L.-J. (1992) ‘Self-Improving Reactive Agents Based on Reinforce-
ture, Local Backbone Angles and Solvent Accessible Surface Area ment Learning, Planning and Teaching’, Machine Learning, 8:
of Proteins by Iterative Deep Learning’, Scientific Reports, 5: 1–11. 293–321.
Hensley, S. E. et al. (2009) ‘Hemagglutinin Receptor Binding Avidity Luksza, M., and Lassig, ̈ M. (2014) ‘A Predictive Fitness Model for
Drives Influenza A Virus Antigenic Drift’, Science, 326: 734–6. Influenza’, Nature, 507: 57–61.
Hensley, S. E., and Yewdell, J. W. (2009) ‘Que Sera, Sera: Evolution of Luong, M.-T. et al. (2015) ‘Multi-task Sequence to Sequence Learning’,
the Swine H1N1 Influenza A Virus’, Expert Review of Anti-infective arXiv preprint arXiv:1511.06114.
Therapy, 7: 763–8. Ma, L. et al. (2017) ‘Pose Guided Person Image Generation’, Advances
Hie, B. et al. (2021) ‘Learning the Language of Viral Evolution and in Neural Information Processing Systems, 30.
Escape’, Science, 371: 284–8. Medina, R. A., and García-Sastre, A. (2011) ‘Influenza A Viruses: New
Hochreiter, S., and Schmidhuber, J. (1997) ‘Long Short-Term Memory’, Research Developments’, Nature Reviews. Microbiology, 9: 590–603.
Neural Computation, 9: 1735–80. Michaelis, M., Doerr, H. W., and Cinatl, J. (2009) ‘An Influenza A H1N1
Imai, M. et al. (2012) ‘Experimental Adaptation of an Influenza H5 Virus Revival—Pandemic H1N1/09 Virus’, Infection, 37: 381–9.
HA Confers Respiratory Droplet Transmission to a Reassortant H5 Mikolov, T. et al. (2013a) ‘Efficient Estimation of Word Representations
HA/H1N1 Virus in Ferrets’, Nature, 486: 420–8. in Vector Space’, arXiv preprint arXiv:1301.3781.
Isola, P. et al. (2017) ‘Image-to-image translation with conditional Mikolov, T. et al. (2013b) ‘Distributed Representations of Words and
adversarial networks’ in Proceedings of the IEEE Conference on Com- Phrases and Their Compositionality’, Advances in Neural Informa-
puter Vision and Pattern Recognition, pp. 1125–34. tion Processing Systems, 26.
14 Virus Evolution

Mikolov, T., Yih, W., and Zweig, G. (2013) ‘Linguistic regularities IEEE/ACM Transactions on Computational Biology and Bioinformatics,
in continuous space word representations’ in Proceedings of the 12: 103–12.
2013 Conference of the North American Chapter of the Association for Stamatakis, A. (2014) ‘RAxML Version 8: A Tool for Phylogenetic Anal-
Computational Linguistics: Human Language Technologies, pp. 746–51. ysis and Post-Analysis of Large Phylogenies’, Bioinformatics, 30:
Mirza, M., and Osindero, S. (2014) ‘Conditional Generative Adversarial 1312–3.
Nets’, arXiv preprint arXiv:1411.1784. Sun, T. et al. (2017) ‘Sequence-Based Prediction of Protein Protein
Morris, D. H. et al. (2018) ‘Predictive Modeling of Influenza Shows the Interaction Using a Deep-Learning Algorithm’, BMC Bioinformatics,
Promise of Applied Evolutionary Biology’, Trends in Microbiology, 26: 18: 1–8.
102–18. Sutskever, I., Vinyals, O., and Le, Q. V. (2014) ‘Sequence to Sequence
Mustonen, V., and Lassig, ̈ M. (2009) ‘From Fitness Landscapes to Learning with Neural Networks’, Advances in Neural Information
Seascapes: Non-Equilibrium Dynamics of Selection and Adapta- Processing Systems, 27: 3104–12.
tion’, Trends in Genetics, 25: 111–9. Tenforde, M. W. et al. (2021) ‘Effect of Antigenic Drift on Influenza
Nallapati, R. et al. (2016) ‘Abstractive Text Summarization Using Vaccine Effectiveness in the United States—2019–2020’, Clinical
Sequence-to-Sequence Rnns and Beyond’, arXiv preprint Infectious Diseases, 73: e4244–50.

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024


arXiv:1602.06023. Thompson, W. W. et al. (2003) ‘Mortality Associated with Influenza
Neher, R. A. et al. (2016) ‘Prediction, Dynamics, and Visualization of and Respiratory Syncytial Virus in the United States’, JAMA, 289:
Antigenic Phenotypes of Seasonal Influenza Viruses’, Proceedings 179–86.
of the National Academy of Sciences, 113: E1701–9. Thompson, W. W. et al. (2004) ‘Influenza-Associated Hospitalizations
Neher, R. A., Russell, C. A., and Shraiman, B. I. (2014) ‘Predicting in the United States’, Jama, 292: 1333–40.
Evolution from the Shape of Genealogical Trees’, Elife, 3: e03568. Tricco, A. C. et al. (2013) ‘Comparing Influenza Vaccine Efficacy
Ng, P. (2017) ‘Dna2vec: Consistent Vector Representations of Variable- against Mismatched and Matched Strains: A Systematic Review
Length K-mers’, arXiv preprint arXiv:1701.06279. and Meta-analysis’, BMC Medicine, 11: 1–19.
Obermeyer, F. et al. (2022) ‘Analysis of 6.4 Million SARS-CoV-2 Tuan, Y.-L., and Lee, H.-Y. (2019) ‘Improving Conditional Sequence
Genomes Identifies Mutations Associated with Fitness’, Science, Generative Adversarial Networks by Stepwise Evaluation’,
376: 1327–32. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27:
O’Brien, M. A. et al. (2004) ‘Incidence of Outpatient Visits and Hos- 788–98.
pitalizations Related to Influenza in Infants and Young Children’, Vaswani, A. et al. (2017) ‘Attention Is All You Need’, Advances in Neural
Pediatrics, 113: 585–93. Information Processing Systems, 30.
Palese, P., and Shaw, M. L. (2007) ‘Orthomyxoviridae: The Virus and Wang, S. et al. (2016) ‘Protein Secondary Structure Prediction Using
Their Replication’, in Fields Virology, pp. 1647–89. Williams, & Deep Convolutional Neural Fields’, Scientific Reports, 6: 1–11.
Wilkins: Lippincott. Webster, R. G. (1999) ‘1918 Spanish Influenza: The Secrets Remain
Pedregosa, F. et al. (2011) ‘Scikit-learn: Machine Learning in Python’, Elusive’, Proceedings of the National Academy of Sciences, 96: 1164–6.
Journal of Machine Learning Research, 12: 2825–30. Webster, R., and Laver, W. (1980) ‘Determination of the Number of
Perofsky, A. C., and Nelson, M. I. (2020) ‘Seasonal Influenza: The Nonoverlapping Antigenic Areas on Hong Kong (H3N2) Influenza
Challenges of Vaccine Strain Selection’, Elife, 9: e62955. Virus Hemagglutinin with Monoclonal Antibodies and the Selec-
Quang, D., and Xie, X. (2016) ‘DanQ: A Hybrid Convolutional and tion of Variants with Potential Epidemiological Significance’, Virol-
Recurrent Deep Neural Network for Quantifying the Function of ogy, 104: 139–48.
DNA Sequences’, Nucleic Acids Research, 44: e107. WHO. (1980) ‘A Revision of the System of Nomenclature for Influenza
Rambaut, A. (2017), FigTree-version 1.4. 3, a Graphical Viewer of Phylo- Viruses: A WHO Memorandum’, Bulletin of the World Health Organi-
genetic Trees Computer program distributed by the author <http://tree. zation, 58: 585–91.
bio.ed.ac.uk/software/figtree> accessed 3 Jan 2019. World Health Organization (WHO). (2010) Pandemic (H1N1)
Reed, S. et al. (2016) ‘Generative Adversarial Text to Image Synthesis’, 2009 - Update 109. <https://www.who.int/emergencies/disease-
in International conference on machine learning, pp. 1060–9. outbreak-news/item/2010_07_16-en> accessed 4 Mar 2020.
Repecka, D. et al. (2021) ‘Expanding Functional Protein Sequence Williams, R. J., and Zipser, D. (1989) ‘A Learning Algorithm for
Space Using Generative Adversarial Networks’, Nature Machine Continually Running Fully Recurrent Neural Networks’, Neural
Intelligence bioRxiv, 3: 324–33. Computation, 1: 270–80.
Rizzo, R. et al. (2016) ‘A Deep Learning Approach to DNA Sequence Wohlbold, T. J., and Krammer, F. (2014) ‘In the Shadow of Hemagglu-
Classification’, in Computational Intelligence Methods for Bioinformat- tinin: A Growing Interest in Influenza Viral Neuraminidase and Its
ics and Biostatistics: 12th International Meeting, CIBB 2015, September Role as a Vaccine Antigen’, Viruses, 6: 2465–94.
10-12, 2015, Revised Selected Papers 12, pp. 129–40. Naples, Italy: Wolf, D. M., Vazirani, V. V., and Arkin, A. P. (2005) ‘Diversity in Times
Springer International Publishing. of Adversity: Probabilistic Strategies in Microbial Survival Games’,
Schuster, M., and Paliwal, K. K. (1997) ‘Bidirectional Recurrent Neural Journal of Theoretical Biology, 234: 227–53.
Networks’, IEEE Transactions on Signal Processing, 45: 2673–81. Wu, Y. et al. (2016) ‘Google’s Neural Machine Translation System:
Shu, Y., and McCauley, J. (2017) ‘GISAID: Global Initiative on Shar- Bridging the Gap between Human and Machine Translation’, arXiv
ing All Influenza Data–from Vision to Reality’, Eurosurveillance, 22: preprint arXiv:1609.08144.
30494. Yewdell, J., Webster, R., and Gerhard, W. (1979) ‘Antigenic Variation in
Sneath, P. H. A. (1966) ‘Relations between Chemical Structure and Three Distinct Determinants of an Influenza Type A Haemagglu-
Biological Activity in Peptides’, Journal of Theoretical Biology, 12: tinin Molecule’, Nature, 279: 246–8.
157–95. Yu, L. et al. (2017) ‘Seqgan: Sequence Generative Adversarial Nets
Spencer, M., Eickholt, J., and Cheng, J. (2014) ‘A Deep Learning Network with Policy Gradient’, Proceedings of the AAAI Conference on Artificial
Approach to Ab Initio Protein Secondary Structure Prediction’, Intelligence, 31.
D. S. Berman et al. 15

Zeng, H. et al. (2016) ‘Convolutional Neural Network Architectures for Zhang, H. et al. (2019) ‘Self-attention Generative Adversarial Net-
Predicting DNA–Protein Binding’, Bioinformatics, 32: i121–7. works’, in International conference on machine learning, pp. 7354–63.
Zhang, Z. et al. (2018) ‘Bidirectional generative adversarial networks Zhou, J., and Troyanskaya, O. G. (2015) ‘Predicting Effects of Noncod-
for neural machine translation’ in Proceedings of the 22nd Conference ing Variants with Deep Learning–based Sequence Model’, Nature
on Computational Natural Language Learning, pp. 190–9. Methods, 12: 931–4.

Downloaded from https://academic.oup.com/ve/article/9/1/vead022/7110829 by guest on 17 February 2024

You might also like