

Article in IEEE Transactions on Neural Networks and Learning Systems · March 2023. DOI: 10.1109/TNNLS.2023.3250324



This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

Meta Learning With Graph Attention Networks for Low-Data Drug Discovery

Qiujie Lv, Guanxing Chen, Ziduo Yang, Weihe Zhong, and Calvin Yu-Chian Chen

Abstract—Finding candidate molecules with favorable pharmacological activity, low toxicity, and proper pharmacokinetic properties is an important task in drug discovery. Deep neural networks have made impressive progress in accelerating and improving drug discovery. However, these techniques rely on a large amount of labeled data to form accurate predictions of molecular properties. At each stage of the drug discovery pipeline, usually, only a few biological data of candidate molecules and derivatives are available, indicating that the application of deep neural networks for low-data drug discovery is still a formidable challenge. Here, we propose a meta learning architecture with graph attention network, Meta-GAT, to predict molecular properties in low-data drug discovery. The GAT captures the local effects of atomic groups at the atom level through the triple attentional mechanism and implicitly captures the interactions between different atomic groups at the molecular level. GAT is used to perceive the molecular chemical environment and connectivity, thereby effectively reducing sample complexity. Meta-GAT further develops a meta learning strategy based on bilevel optimization, which transfers meta knowledge from other attribute prediction tasks to low-data target tasks. In summary, our work demonstrates how meta learning can reduce the amount of data required to make meaningful predictions of molecules in low-data scenarios. Meta learning is likely to become the new learning paradigm in low-data drug discovery. The source code is publicly available at: https://github.com/lol88/Meta-GAT.

Index Terms—Drug discovery, few examples, graph attention network, meta learning, molecular property.

I. INTRODUCTION

DRUG discovery is a high-investment, long-period, and high-risk systems engineering endeavor [1]. When molecular biology studies have identified an effective target associated with a disease, the subsequent path of drug discovery becomes relatively clear [2]. With the help of various computer-aided virtual screening technologies and high-throughput omics technologies, researchers can integrate the relevant knowledge of computational chemistry, physics, and structural biology to effectively screen and design molecular compounds [3], [4], [5], [6], [7]. The key issue of drug discovery is the screening and optimization of candidate molecules, which must meet a series of criteria: the compound needs to have suitable potential for biological targets and exhibit good physicochemical properties; absorption, distribution, metabolism, excretion, and toxicity (ADMET); water solubility; and mutagenicity [8], [9]. However, there are usually only a few validated leads and derivatives that can be used for lead optimization [10], [11]. Also, due to possible toxicity, low activity, and low solubility, there are often only a few real biological data on candidate molecules and analog molecules. The accuracy of the physicochemical properties of candidate molecules directly affects the results of the drug development process. Therefore, researchers have paid more and more attention to accurately predicting the physicochemical properties of candidate molecules with low data.

In the past few years, deep learning technology has been implemented to accelerate and improve the drug discovery process [12], [13], [14], [15], and some key advances have been made in molecular property prediction [16], [17], [18], [19], [20], side effect prediction [21], [22], [23], and virtual screening [24], [25]. In particular, the graph neural network (GNN), which can learn the information contained in the nodes and edges directly from the chemical graph structure, has aroused the strong interest of bioinformatics scientists [26], [27], [28], [29]. The performance of a deep learning algorithm depends largely on the size of the training data, and a larger sample size usually produces a more accurate model. Given a large amount of labeled data, deep neural networks have enough capacity to learn complex representations of inputs [30]. However, this is obviously in contradiction with the insufficient data available in the initial stage of drug discovery. Due to the scarcity of labeled data, achieving satisfactory results for low-data drug discovery remains a challenge. The paradigm of artificial intelligence for drug discovery has changed: from large-scale sample learning to small-sample learning [31], [32], [33].

The human brain's understanding of objective things does not necessarily require large-sample training; in many cases it can learn from simple analogies [34], [35], [36]. DeepMind explores how the brain learns from limited experience, that is, "meta learning" or "learning to learn" [37]. Understanding the meta learning mode is one of the important routes toward general intelligence.

Manuscript received 24 November 2021; revised 21 May 2022, 16 October 2022, and 26 December 2022; accepted 24 February 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62176272 and in part by the China Medical University Hospital under Grant DMR-112-085. (Corresponding author: Calvin Yu-Chian Chen.)

Qiujie Lv, Guanxing Chen, Ziduo Yang, and Weihe Zhong are with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, China.

Calvin Yu-Chian Chen is with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, China, also with the Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan, and also with the Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2023.3250324.

Digital Object Identifier 10.1109/TNNLS.2023.3250324
2162-237X © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: SUN YAT-SEN UNIVERSITY. Downloaded on March 08,2023 at 02:46:12 UTC from IEEE Xplore. Restrictions apply.

Biswas et al. [38] developed UniRep for protein engineering to efficiently use resource-intensive high-fidelity assays without sacrificing throughput, with subsequent low-N supervision then identifying improvements to the activity of interest. Liu et al. [39] from the Chinese Academy of Sciences have established a complete and effective screening method for disease target markers based on few examples (or even one sample). Lin et al. [40] proposed a prototypical graph contrastive learning (PGCL) method for learning graph representations, which improved the results of molecular property prediction. Yu and Tran [25] proposed an XGBoost-based fitted Q iteration algorithm with fewer training data for finding the optimal structured treatment interruption (STI) strategies for HIV patients. These studies have made certain explorations and attempts in the fields of drug virtual screening and combination drug prediction based on few-example learning methods [41], [42]. The abovementioned work is a useful attempt to apply meta learning to few-sample learning problems, indicating that the meta learning method has the potential to be a useful tool in drug discovery and other bioinformatics research fields.

Meta learning uses meta knowledge to reduce the requirement for sample complexity, thus solving the core problem of minimizing the risk of unreliable experience. However, molecular structure is usually composed of interactions between atoms and complex electronic configurations. Even small changes in the molecular structure may lead to completely opposite molecular properties. To learn the complexity of molecular structure, the model should extract both the local environmental influence of neighboring atoms on the central atom and the rich nonlocal information contained between pairs of atoms that are topologically far apart. Therefore, meta learning for low-data drug discovery is highly dependent on the structure of the network and needs to be redesigned for widely varying tasks.

Meta learning has made some representative attempts to predict molecular properties. Altae-Tran et al. [43] introduced an iteratively refined long short-term memory (IterRefLSTM) architecture that generates dually evolved embeddings for one-shot learning. Adler et al. [44] proposed cross-domain Hebbian ensemble few-shot learning (CHEF), which achieves representation fusion by an ensemble of Hebbian learners acting on different layers of a deep neural network. The meta-molecular graph neural network (Meta-MGNN) leverages a pretrained GNN and introduces additional self-supervised tasks, such as bond reconstruction and atom-type prediction, to be jointly optimized with the molecular property prediction tasks [45]. Meta-MGNN and CHEF obtain meta knowledge through pretraining on a large-scale molecular corpus and additional self-supervised model parameters. IterRefLSTM trains a memory-augmented model, which restricts the model structure and can only be used in specific domain scenarios. How to represent molecular features effectively and how to capture common knowledge between different tasks remain great challenges in meta learning.

In this work, we propose a meta learning architecture based on graph attention network, Meta-GAT, to predict the biochemical properties of molecules in low-data drug discovery. The graph attention network captures the local effects of atomic groups at the atomic level through the triple attentional mechanism, so that the GAT can learn the influence of an atom group on the properties of the compound. At the molecular level, GAT treats the entire molecule as a supervirtual node that connects every atom in the molecule, implicitly capturing the interactions between different atomic groups. The gated recurrent unit (GRU) hierarchical model mainly focuses on abstracting or transferring limited molecular information into higher-level feature vectors or meta knowledge, improving the ability of the GAT to perceive the chemical environment and connectivity in molecules and thereby efficiently reducing sample complexity. This is very important for low-data drug discovery. Meta-GAT benefits from meta knowledge and further develops a meta learning strategy based on bilevel optimization, which transfers meta knowledge from other attribute prediction tasks to low-data target tasks, allowing the model to quickly adapt to molecular attribute prediction with few examples. Meta-GAT achieved accurate prediction of new molecular properties from few examples on multiple public benchmark datasets. These advantages indicate that Meta-GAT is likely to become a viable option for low-data drug discovery. In addition, the Meta-GAT code and data are open source at https://github.com/lol88/Meta-GAT, so that the results can be easily replicated.

Our contributions can be summarized as follows.

1) We create a chemical tool to predict multiple physiological properties of new molecules that are invisible to the model. This tool could push the boundaries of molecular representation for low-data drug discovery.
2) The proposed Meta-GAT captures the local effects of atomic groups at the atomic level through the triplet attentional mechanism and can also model global effects of molecules at the molecular level.
3) We propose a meta learning strategy to selectively update parameters within each task through a bilevel optimization, which is particularly helpful to capture the generic knowledge shared across different tasks.
4) Meta-GAT demonstrates how meta learning can reduce the amount of data required to make meaningful predictions of molecules in low-data drug discovery.

II. METHODS

In this section, we first briefly introduce the mathematical formalism of Meta-GAT and then introduce the meta learning strategy and the graph attention network structure. Finally, the parameters and details of model training are given. Fig. 1 shows the overall architecture of Meta-GAT for low-data drug discovery.

A. Problem Formulations

Consider several common drug discovery tasks T, such as predicting the toxicity and side effects of new molecules; x is the compound molecule to be measured, and the label y is the binary experimental label (positive/negative) of the molecular property.
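In this few-shot setting, each property prediction task is presented to the model as an episode: a small labeled support set for adaptation and a disjoint query set for evaluation. The following is a minimal sketch of how such an episode could be assembled; `sample_episode` and the toy `(smiles, label)` dataset are hypothetical illustrations, not code from the Meta-GAT repository.

```python
import random

def sample_episode(dataset, n_pos=1, n_neg=1, n_query=2, seed=0):
    """Assemble one few-shot episode: a support set with n_pos positive and
    n_neg negative molecules, and a disjoint query set.
    `dataset` is a list of (smiles, label) pairs with binary labels."""
    rng = random.Random(seed)
    pos = [m for m in dataset if m[1] == 1]
    neg = [m for m in dataset if m[1] == 0]
    support = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    # query molecules must not appear in the support set
    remaining = [m for m in dataset if m not in support]
    query = rng.sample(remaining, min(n_query, len(remaining)))
    return support, query

# toy task: six labeled molecules for one binary property
data = [("C", 0), ("CC", 1), ("CCO", 1), ("CCN", 0), ("c1ccccc1", 1), ("CO", 0)]
support, query = sample_episode(data, n_pos=1, n_neg=1, n_query=2)
```

Here `n_pos = n_neg = 1` mirrors a 1-shot setting; one such episode is drawn per task per iteration during meta training.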


Fig. 1. Meta learning framework for few-example molecular property prediction. The blue box and the orange box represent the data flow in the training phase and the test phase, respectively.

Suppose that all the potential laws considered by the model form the hypothesis space H, and h is the optimal hypothesis from x to y. The expected risk R(h) represents the prediction ability of the decision model over all samples. The empirical risk R(h_I) represents the predictive ability of the model on the training set, computed as the average value of the loss function, where I represents the number of samples in the training set. The empirical risk R(h_I) is used to estimate the expected risk R(h). In real-world applications, only a few examples are available for a property prediction task of a new molecule, that is, I → few. According to empirical risk minimization theory, if only a few training samples can be provided, the empirical risk R(h_I) is far from a good approximation of the expected risk R(h), and the obtained empirical risk minimizer is unreliable [46]. The learning challenge is to obtain a reliable empirical risk minimizer from a few examples, one whose R(h_I) approaches the optimal R(h), as shown in the following equation:

$$E\left[R\left(h_{I \rightarrow \mathrm{few}}\right) - R(h)\right] = 0. \quad (1)$$

Empirical risk minimization is closely related to sample complexity. Sample complexity refers to the number of training samples required to minimize the empirical risk R(h_I). According to Vapnik–Chervonenkis (VC) theory, when samples are insufficient, H needs less complexity, so that the few examples provided are sufficient for compensation. We use meta knowledge w to reduce the sample complexity of learning, thus solving the core problem of minimizing the risk of unreliable experience.

B. Meta Learning

Meta learning, also known as learning to learn, means learning a learning experience by systematically observing how the model performs on a wide range of learning tasks. This learning experience is called meta knowledge w. The goal of meta learning is to find the w shared across different tasks, so that the model can quickly generalize to new tasks that contain only a few examples with supervised information. The difference between meta learning and transfer learning is that transfer learning usually fits the distribution of one dataset, while meta learning fits the distribution of multiple similar tasks. Therefore, the training samples of meta learning are a series of tasks.

Model-agnostic meta-learning (MAML) [47] is used as the base meta learning algorithm for the Meta-GAT framework. Meta-GAT selectively updates parameters within each task through a bilevel optimization and transfers meta knowledge to new tasks with few labeled samples, as shown in Fig. 1. Bilevel optimization means that one optimization contains another optimization as a constraint. In the inner-level optimization, we hope to learn general meta knowledge w from the support sets of the training tasks, so that the losses of the different tasks are as small as possible. The inner-level optimization phase can be formalized as shown in (3). In the outer-level optimization, Meta-GAT calculates the gradient relative to the adapted parameters on the query set of each task and minimizes the total loss over all training tasks to optimize w, thereby reducing the expected loss of the training tasks, as shown in (2). Algorithm 1 gives the specific algorithm details:

$$w^{*} = \arg\min_{w} \sum_{i=1}^{M} \mathcal{L}^{\mathrm{meta}}_{f_{\theta}}\left(\theta^{*(i)}(w),\, D^{q}_{\mathrm{train}}\right) \quad (2)$$

$$\theta^{*(i)}(w) = \arg\min_{\theta} \mathcal{L}^{\mathrm{task}}_{f_{\theta}}\left(\theta,\, w,\, D^{s(i)}_{\mathrm{train}}\right) \quad (3)$$

where L^meta and L^task refer to the outer and inner objectives, respectively, and i indexes the ith training task.

Specifically, first, the training tasks T_train and test tasks T_test are extracted from a set of multiple drug discovery tasks T, where each task has a support set D^s and a query set D^q. Meta-GAT uses a large number of training tasks T_train to fit the distribution of the multiple similar tasks T. Second, Meta-GAT sequentially iterates over a batch of training tasks, learns task-specific parameters, and tries to minimize the loss using gradient descent.


Algorithm 1: Pseudocode of Meta-GAT for Low-Data Drug Discovery

Require: A set of tasks for predicting molecular properties T;
Ensure: GAT parameters θ, step sizes α, β;
1: Randomly initialize θ;
2: while not done do
3:   Sample a batch of tasks T_train ∼ T;
4:   for all T_train do
5:     Sample m examples D^s_train = {D_1, D_2, ..., D_m} ∈ D_train;
6:     Evaluate ∇_θ L_{T_train}(f_θ) by y^s_train = GAT(D^s_train, θ);
7:     Compute adapted parameters with gradient descent: θ'_i = θ − α∇_θ L_{T_train}(f_θ);
8:     Sample n examples D^q_train = {D_1, D_2, ..., D_n} ∈ D_train − D^s_train;
9:     Evaluate L'_{T_train} by y^q_train = GAT(D^q_train, θ');
10:  end for
11:  Update θ ← θ − β∇_θ Σ_{T_train ∼ p(T)} L'_{T_train};
12: end while
13: Sample a batch of tasks T_test ∼ T − T_train;
14: for all T_test do
15:   Sample k examples D^s_test = {D_1, D_2, ..., D_k} ∈ D_test;
16:   // Similar to the training phase
17:   Evaluate and compute adapted parameters with gradient descent;
18:   Update θ;
19:   Sample j examples D^q_test = {D_1, D_2, ..., D_j} ∈ D_test − D^s_test;
20:   y^q_test = GAT(D^q_test, θ);
21: end for

TABLE I: Coded Information for Atomic and Bond Features

The corresponding optimal parameters θ are obtained from each task's support set, as shown in (4). These parameters are not assigned to θ directly, but are cached:

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{T_{\mathrm{train}}}\left(f_{\theta}\right) \quad (4)$$

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{T_{\mathrm{train}} \sim p(T)} \mathcal{L}'_{T_{\mathrm{train}}}\left(f_{\theta_i'}\right). \quad (5)$$

Then, the outer-level optimization learns w such that it produces models f_θ [see (5)]. Each task's query set is used to obtain a gradient value on each task-specific parameter θ. The vector sum of the gradient values obtained from the query sets of the batch of tasks is used to update the parameters of the meta learner. The model continues iterating up to a preset number of times, and the best meta model is selected based on the query set:

$$\theta^{*} = \arg\min_{\theta} \mathbb{E}_{T_{\mathrm{test}} \in T}\, \mathcal{L}\left(\theta, w, D^{s}_{\mathrm{test}}\right). \quad (6)$$

Finally, in the testing phase, Meta-GAT, which has learned meta knowledge w, learns the specifics of the new test task through a few inner optimizations on the support set of the new task, as shown in (6). Note that the model parameters θ exist separately or within meta knowledge w. We evaluate the performance of the model by the accuracy of θ on the query set of the new task. In the process of learning new tasks, the model benefits from meta knowledge to reduce the requirement on sample complexity, realizing an optimization strategy that searches faster for the parameterization θ of the hypothesis h ∈ H in the hypothesis space H.

Meta-GAT essentially searches for a hypothesis that is better for all tasks of predicting the properties of drug molecules. Therefore, when updating parameters, it combines the losses of all tasks on their query sets to specify the gradient update. The parameter θ obtained in this way is already an approximately optimal hypothesis on the new task, and the optimal hypothesis can be reached with few inner iterations.

Our Meta-GAT uses meta knowledge w to guide the model to search for the parameter θ that approximates the optimal hypothesis h in the hypothesis space H, leading to the minimization of empirical risk. The meta knowledge w is obtained through limited analysis of new molecules and prior-knowledge analysis of many similar molecules. The meta knowledge w changes the search strategy by providing a better parameter initialization or a search direction. Meta-GAT rapidly migrates from a better hypothesis site to the new task through several inner optimizations on a few new molecular instances, which then increases the percentage of molecules correctly assigned as toxic or nontoxic.

C. Molecular Graph Representation

Molecules are coded into graphs with node features, edge features, and adjacency matrices for input into the graph network. We use a total of nine atomic features and four bond features to characterize the molecular graph's structure (see Table I). Atom features include hybridization, aromaticity, chirality, and so on, and bond features include type, conjugation, ring membership, and so on. Molecular structures usually involve atomic interactions and complex electronic structures, and bond features contain rich information about molecular scaffolds and conformational isomers. The encoded molecular graph can implicitly capture the local environment of the molecule and the key interactions between atoms and electrons, and provide insight into the edge characteristics of molecular bonds.
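As a concrete illustration of this encoding, the sketch below builds node features, edge features, and an adjacency matrix for a two-atom toy graph. The three-dimensional feature vectors are a drastic simplification of the nine atomic and four bond features used by Meta-GAT (a real pipeline would typically derive them with a cheminformatics toolkit such as RDKit), and the `encode` helper is hypothetical.

```python
# Toy two-atom fragment C=O: nodes carry a simplified feature vector
# [is_C, is_O, is_aromatic]; edges carry [is_single, is_double, in_ring].
atoms = ["C", "O"]
bonds = [(0, 1, "double")]

def encode(atoms, bonds):
    """Return (node features, edge features, adjacency matrix) for the toy graph."""
    node_feats = [[int(a == "C"), int(a == "O"), 0] for a in atoms]
    n = len(atoms)
    adj = [[0] * n for _ in range(n)]
    edge_feats = {}
    for i, j, order in bonds:
        adj[i][j] = adj[j][i] = 1  # undirected molecular graph
        f = [int(order == "single"), int(order == "double"), 0]
        edge_feats[(i, j)] = edge_feats[(j, i)] = f
    return node_feats, edge_feats, adj

nf, ef, A = encode(atoms, bonds)
```

The symmetric adjacency matrix and the per-direction edge features are exactly the tensors a graph network consumes alongside the node feature matrix.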


Fig. 2. Schematic of graph attention network architecture for meta learning.

D. Graph Attention Network

GNNs have made substantial progress in the field of chemical informatics. They have an extraordinary capacity to learn the intricate relationships between structures and properties [16], [48], [49], [50]. The attention mechanism has proved its outstanding performance in predicting molecular properties. Molecular structure involves the spatial positions of atoms and the types of chemical bonds. Topologically adjacent nodes in molecules have a greater chance of interacting with each other. In some cases, they can also form functional groups that determine the chemical properties of the molecule. In addition, pairs of atoms that are topologically far apart may also have significant interactions, such as intramolecular hydrogen bonds. Our graph attention network extracts insights on molecular structure and features from both local and global perspectives, as shown in Fig. 2. GAT captures the local effects of atomic groups at the atomic level through the attentional mechanism and can also model the global effects of molecules at the molecular level.

The molecule G = (v, e) can be defined as a graph composed of a set of atoms (nodes) v and a set of bonds (edges) e. Similar to previous studies, we encode chemical information, including nine atomic features and four bond features, into the molecular graph as the input of the graph attention network. For the local environment within the molecule, previous graph networks only aggregate the neighbor nodes' information, which may lead to insufficient extraction of edge (bond) information. Our GAT gradually aggregates the triplet embedding of the target node v_i, neighbor node v_j, and edge e_ij through the triple attention mechanism.

Specifically, GAT first performs a linear transformation and nonlinear activation on the neighbor node state vectors v_i, v_j and their edge hidden states e_ij to align these vectors to the same dimension, and concatenates them into triplet embedding vectors. Then, h_ij is normalized by the softmax function over all neighbor nodes to get the attention weights a_ij. Finally, the node hidden state and edge hidden state are elementwise multiplied with the neighbor node representation, and the information of the neighbors (including neighbor nodes and edges) is aggregated according to the attention weights to obtain the context state c_i of atom i. The formulas are shown below:

$$h_{ij} = \mathrm{LeakyReLU}\left(W \cdot \left[v_i, e_{ij}, v_j\right]\right) \quad (7)$$

$$a_{ij} = \mathrm{softmax}\left(h_{ij}\right) = \frac{\exp\left(h_{ij}\right)}{\sum_{j \in N(i)} \exp\left(h_{ij}\right)} \quad (8)$$

$$c_i = \sum_{j \in N(i)} a_{ij} \cdot W \cdot \left[e_{ij}, v_j\right] \quad (9)$$

where N(i) is the set of neighbor nodes of node i and W is a trainable weight matrix. Then, the GRU is used as a message transfer function to fuse messages within a farther radius and generate a new context state, as shown in Fig. 2 (bottom left). As the time step t increases, messages of nodes and edges in the range centered on node i, whose radius grows with t, are collected successively to generate new states h_i^t, which are computed by

$$h_i^t = \mathrm{GRU}\left(h_i^{t-1}, c_i^{t-1}\right). \quad (10)$$

In order to include more global information from the molecule, GAT aggregates the atomic-level representations through the readout function, which treats the entire molecule as a supervirtual node that connects every atom in the molecule. We use the bidirectional GRU (BiGRU) with attention to connect node features with historical information from two directions, so as to obtain a graph-level (molecular) representation M.

TABLE II: Detailed Description of the Benchmark Datasets

The update gate in the GRU recurrent network cell ensures that information is effectively transmitted to distant nodes, while the reset gate helps to filter out information that is not relevant to the learning task. Moreover, different attention weights can focus on the implicit interactions between distant atoms and extract more information related to the learning task. The readout function can be written as

$$M = \mathrm{BiGRU}\left(\left[\overleftarrow{\mathrm{att}}\left(\overleftarrow{h}_t\right),\ \overrightarrow{\mathrm{att}}\left(\overrightarrow{h}_t\right)\right]\right) \quad (11)$$

where att uses the same attention mechanism as before. The final molecular representation M is used by the classifier for molecular property prediction.

GAT learns the contextual representation of each atom by aggregating the triple information from the atom features, the neighboring atom features, and the features of the connecting bonds through the message passing and attention mechanisms. Then, the context representations of the atoms are gradually aggregated by the BiGRU based on the attention mechanism to generate a global state vector for the entire molecule. The final vector representation is a high-quality descriptor of molecular structure information, which reduces the difficulty for the meta learning model of learning the unsupervised information in the molecular graph.

E. Datasets

We report the experimental results of the Meta-GAT model on multiple public benchmark datasets. Table II shows the detailed information of the benchmark datasets, including categories, tasks, and the number of molecules. All datasets are available for download from the public project MoleculeNet [51].

F. Model Implementation and Evaluation Protocols

Meta-GAT performs linear transformation and nonlinear activation on both atomic features and neighbor features to unify vector lengths [see (7)]. Then, the triplet embedding vectors of atoms are aligned using a fully connected layer, and the attention weights are calculated using softmax [see (8)]. The weighted sum of the atoms' current state vectors is taken as the attention context vector of a single atom [see (9)], which is fed into the GRU along with the current state vector [see (10)]. This process is repeated twice to generate a new state vector for each atom. Finally, we assume that the molecular embedding is a virtual node embedding, so that the whole molecule can be embedded as if it were a single atom. Similar to the above process, we combine the context

Meta-GAT is implemented using the PyTorch framework and uses the Adam optimizer [52] with a 0.001 learning rate for gradient descent optimization. The learning rate for inner iterations is 0.1. Information is generated around atoms within a radius of 2. The output unit of the fully connected layer is 200. Both the GRU and BiGRU also have 200 hidden units. Gradient descent is performed five times in each iteration of the training and testing phases (α, β = 5). During training, 10 000 episodes are generated with K = Npos + Nneg (Npos, Nneg ∈ [1, 5, 10]) in the N-way K-shot setting. Npos and Nneg represent the number of positive and negative examples in the support set, respectively. We use CrossEntropyLoss as the loss function of the classification task. When training Meta-GAT on the task of molecular biochemical property prediction, the multiple tasks are divided into two disjoint task sets, training tasks and test tasks. The training/testing division method for each dataset is the same as in the comparison experiments. During the prediction phase, a batch of support sets with size Npos + Nneg and a batch of query sets with size K = 128 are randomly sampled from a task's dataset. For each test task, 20 independent runs were performed based on different random seeds, and the average value of the area under the receiver operating characteristic curve (ROC-AUC) is reported in the experimental section.

In addition, we also analyze the total training time, meta training time, meta testing time, number of multiply–accumulate operations (MACs), and model size to evaluate the computational complexity of the proposed method. Meta-GAT consists of two steps, the meta training phase and the meta testing phase. Total training time refers to the cost of stabilizing the performance of Meta-GAT on new tasks. Meta training time is the cost of one iteration in the meta training phase. Meta testing time refers to the cost of Meta-GAT learning the prediction task of a new molecular property with few samples in the meta testing stage. Within an iteration, both the support set and the query set participate in the model's forward calculation and perform one or more iterations of gradient descent. The cost of one iteration in the meta training stage, namely, the meta training time, is 2N ∗ α times the model's forward computation time, while the meta testing time is 2 ∗ β times the model's forward computation time. A GeForce RTX 2060 is used in this experiment, with N = 8 and α, β = 5. The average forward computation time of Meta-GAT on the Tox21 and Side Effect Resource (SIDER) datasets is 14.84 and 23.08 ms, respectively. Therefore, the meta training time is about 1187.2 and 1846.4 ms, the meta testing time is about 148.4 and 230.8 ms, and the total training time is about 7.3 and 6 h, respectively. The MACs of Meta-GAT are 3.17E9, and the model size is 4.8 M. The training time of the meta-learning-based GAT is longer than that of GAT, but the size of the obtained prediction model is the same as GAT.

We compare Meta-GAT with multiple baseline models, including random forest (RF) [53], GraphConv [54], Siamese [55], MAML [47], attention LSTM (attnLSTM) [43], IterRefLSTM [43], Meta-MGNN [45], edge-labeling GNN
state vectors of each atom from both directions into a BiGRU (EGNN) [56], PreGNN [57], prototypical networks (PN)
[see (11)]. This process is also repeated twice to obtain a graph [58], CHEF [44], attentive fingerprint (Attentive FP) [16],
representation at the molecular level. communicative message passing neural network (CMPNN)

Authorized licensed use limited to: SUN YAT-SEN UNIVERSITY. Downloaded on March 08,2023 at 02:46:12 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

LV et al.: META LEARNING WITH GRAPH ATTENTION NETWORKS FOR LOW-DATA DRUG DISCOVERY 7

[59], Weave [60], continuous and data-driven descriptors (CDDD) [49], DeepTox [61], molecule attention transformer (MAT) [62], molecular prediction model fine-tuning (MolPMoFiT) [63], N-Gram [64], molecular to context vector (Mol2Context-vec) [17], and triplet message network (TrimNet) [29]. In the reproducibility settings, Siamese, MAML, AttnLSTM, IterRefLSTM, Meta-MGNN, EGNN, PN, and CHEF are meta learning methods and use the same training settings as Meta-GAT. RF and Graph Conv are single-task models. DeepTox, Weave, Attentive FP, and CMPNN are multitask models. For each assay prediction task, we randomly select Npos + Nneg samples as the support set and 128 samples as the query set, repeat this process 20 times, and report the final average as the model result. MAT, PreGNN, CDDD, MolPMoFiT, and N-Gram are pretrained models that use self-supervised learning on a large-scale molecular corpus, resulting in better parameter initialization. Similarly, 128 samples were randomly collected for testing, and the procedure was repeated 20 times to avoid randomness in model testing.

TABLE III: Scores for consistency checks on the Tox21 dataset using Kappa and paired Wilcoxon tests.

Fig. 3. ROC-AUC scores of Meta-GAT and previous models in the Tox21 few examples prediction task. 1+/5− denotes one positive and five negative examples, respectively.

III. RESULTS AND DISCUSSION

A. Tox21

The "Toxicology in the 21st Century" (Tox21) dataset, collected for the 2014 Tox21 Data Challenge, is a public database containing 12 assays that measure the toxicity of biological targets. We treat each assay as a single task. The first nine assays were used for training: NR-AR, NR-AR-LBD, NR-AHR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-Gamma, SR-ARE, and SR-ATAD5. The last three assays were used for testing: SR-HSE, SR-MMP, and SR-P53.

Meta-GAT is compared with 12 other models, and the experimental results are shown in Fig. 3. When the numbers of positive and negative samples in the support set are both increased from 1 to 10, the improvement in model performance is not obvious. However, the change in the ratio of positive to negative samples has a great influence on model performance. Interestingly, when there are only a few examples, a balanced ratio of positive and negative samples in the support set may be more important than an increase in their number. To some extent, the ratio of positive to negative samples in the support set represents the distribution of the task. A balanced ratio of positive and negative samples may better guide the model to search for the parameters of the optimal hypothesis, making it easier for Meta-GAT to learn the meta knowledge of binary classification. Meta-GAT has shown impressive performance in toxicity assay tasks with few data.

We used Kappa and paired Wilcoxon tests to conduct consistency checks on the three test tasks of Tox21, and the results are shown in Table III. Kappa analysis evaluates the degree of consistency between the predicted results of Meta-GAT and the actual measured results. The paired Wilcoxon test is a nonparametric test of whether the distributions of the predictions (independent samples) produced by two models are equal; it is not limited by the data distribution, its test conditions are relatively loose, and it can be applied to overall unknown samples. A Wilcoxon p-value < 0.05 indicates that the distribution of the Meta-GAT predictions differs from that of the other models. The Kappa analysis shows that, within the allowable error range, the SR-HSE predictions are highly consistent with the real measurements, and SR-MMP and SR-P53 are extremely consistent. These statistical tests indicate that the predictions of Meta-GAT can replace real measurements within the allowable error range.

Fig. 4. Performance comparison of Meta-GAT with other representative molecular models.

In addition, Fig. 4 (left) shows the performance comparison of Meta-GAT with other representative molecular models. We observe that Meta-GAT still achieves state-of-the-art (SOTA) performance compared with fully supervised models. Self-supervised models (CDDD, N-Gram, MolPMoFiT, and Mol2Context-vec) pretrain on large unlabeled datasets and then fine-tune on the specific target datasets. Due to their powerful feature transfer capability, they outperform multitask models that only use the target dataset.
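The episode construction described above (a support set with Npos positive and Nneg negative molecules plus a 128-sample query set, resampled 20 times) can be sketched in plain Python. This is an illustrative sketch, not the authors' code; the dataset layout and function name are assumptions.

```python
import random

def sample_episode(dataset, n_pos=1, n_neg=5, query_size=128, seed=0):
    """Sample one few-shot episode from a binary-labeled task.

    `dataset` is a list of (molecule, label) pairs with labels in {0, 1}.
    The support set keeps a fixed positive/negative ratio (e.g., 1+/5-);
    the query set is drawn from the remaining samples.
    """
    rng = random.Random(seed)
    pos = [d for d in dataset if d[1] == 1]
    neg = [d for d in dataset if d[1] == 0]
    support = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    # Query samples must not overlap the support set.
    remaining = [d for d in dataset if d not in support]
    query = rng.sample(remaining, query_size)
    return support, query
```

Repeating this with 20 different seeds and averaging the resulting ROC-AUC values mirrors the evaluation protocol used in the experiments.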

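The Kappa consistency score used in the tables can be reproduced in a few lines. Below is a generic Cohen's kappa for binary labels, not the authors' implementation; in practice the paired Wilcoxon test would come from a statistics package such as scipy.stats.wilcoxon.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: observed agreement between
    predicted and measured outcomes, corrected for chance agreement."""
    n = len(y_true)
    # Observed agreement.
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected (chance) agreement from the marginal label frequencies.
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    pe = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    return (po - pe) / (1 - pe)
```

Values near 1 indicate that model predictions could substitute for measured assay outcomes within the allowable error range, which is how the Kappa rows of Tables III to V are read.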

TABLE IV: Scores for consistency checks on the SIDER dataset using Kappa and paired Wilcoxon tests.

Fig. 5. ROC-AUC scores of Meta-GAT and previous models in the SIDER few examples prediction task. 1+/5− denotes one positive and five negative examples, respectively.

Fig. 6. ROC-AUC scores of Meta-GAT and previous models in the MUV few examples prediction task. 1+/5− denotes one positive and five negative examples, respectively.

The graph attention network introduces an attention mechanism by assigning different weights to different nodes, and the corresponding graph convolution operation aggregates the weighted sum of each atom's local information, which forces the model to learn the most meaningful neighbors and parts of the local environment. Compared with other common GCN architectures, graph attention networks (GAT, Attentive FP, and TrimNet) tend to achieve better performance. Overall, Meta-GAT shows a powerful improvement over the existing baseline models, indicating that meta learning may be a better solution for low-data drug discovery.

B. SIDER

The SIDER is a public database containing 1427 marketed drugs and their adverse drug reactions [65]. According to the MedDRA classifications, drug side effects are grouped into 27 system organ classes. Among them, six indications were used for testing: "renal and urinary disorders" (RUD), "pregnancy, puerperium and perinatal conditions" (PPPC), "ear and labyrinth disorders" (ELD), "cardiac disorders" (CD), "nervous system disorders" (NSD), and "injury, poisoning and procedural complications" (IPPC). The remaining 21 indications were used for training.

The performance comparison of Meta-GAT with other few examples methods is shown in Fig. 5. The meta learning models still show a strong improvement on this set, demonstrating the potential of meta learning in few examples molecular property prediction tasks. As shown before, a balanced ratio of positive and negative samples helps to further improve performance. The graph attention mechanism introduced in Meta-GAT can focus on task-related information from the neighborhood, which helps to achieve accurate iteration with few examples. It can be observed that the GNNs based on meta learning (EGNN, Meta-MGNN, and Meta-GAT) obtain better performance than the other meta learning methods (Siamese, MAML, AttnLSTM, and IterRefLSTM). Fig. 4 (right) shows that Meta-GAT achieves SOTA performance compared with fully supervised representative models. In addition, the distribution of the Meta-GAT predictions was different from that of the other models (Wilcoxon p-value < 0.05) for the six indications in the SIDER dataset (see Table IV). PPPC, ELD, NSD, and IPPC are highly consistent with the real measurement results within the allowable error range, while RUD and CD are extremely consistent.

C. MUV

The maximum unbiased validation (MUV) dataset contains 17 binary classification tasks for more than 90 000 molecules and is specifically designed to be challenging for standard virtual screening [51], [66]. The positive examples are selected to be structurally distinct from one another. MUV is therefore a best-case scenario for baseline machine learning (since each data point is maximally informative) and a worst-case test for the low-data methods, since structural similarity cannot be effectively exploited to predict the behavior of new active molecules [43].

The first 12 assays were used for training. The five assays MUV-832, MUV-846, MUV-852, MUV-858, and MUV-859 were used for model testing. Fig. 6 reports the overall performance of all methods on the MUV dataset. Experimental results show that Meta-GAT outperforms the other baseline models. In terms of average improvement, for one-shot learning the average improvement is +0.72%; the value is 0.39% for five-shot learning. Both Meta-MGNN and PreGNN provide considerable performance, with average ROC-AUCs of 0.6451 and 0.6554 on the test task set, respectively, which are slightly worse than that of Meta-GAT (ROC-AUC = 0.6626). Note that Meta-MGNN and PreGNN require a large-scale molecular corpus and additional self-supervised model parameters. Furthermore, we observe that the IterRefLSTM and MAML baselines do not have stable performance across tasks. In other words, they may perform well on Tox21 or SIDER but perform poorly on the MUV task. In contrast, the performance of Meta-GAT on all three classification datasets is SOTA and stable. In addition, the Wilcoxon p-values < 0.05 in Table V indicate that the distribution of the Meta-GAT predictions differs from that of the other models on the five assays of the MUV dataset. The Kappa analysis in the last row of Table V shows that the predicted results and the real measurements are moderately consistent within the allowable error range.

TABLE V: Scores for consistency checks on the MUV dataset using Kappa and paired Wilcoxon tests.

Fig. 7. High agreement between the predicted results of Meta-GAT (Ours) and the GT assessed by Bland-Altman analysis. The x-axis and y-axis represent the average values and the bias between the GT and the values predicted by Meta-GAT, respectively. The blue dashed line indicates the mean bias. The red dashed lines indicate the 95% confidence intervals of the bias.

TABLE VI: Comparison of predictive performances (MAE) on the QM9 dataset quantum properties. Note that for MAE, a lower value indicates better performance.

Fig. 8. Comparison of distribution differences between Tox21 and SIDER using kernel density estimation.

D. QM9

Due to the huge computational cost of density functional theory approaches, there has been considerable interest in applying machine learning models to the task of molecular quantum property prediction. QM9 is a comprehensive dataset of quantum mechanical properties, comprising 12 calculated quantum properties for 134k stable small organic molecules composed of up to nine heavy atoms.

The three quantum properties LUMO, G, and Cv were used as test tasks, and the other nine quantum properties were used as training tasks. QM9 is a regression dataset, and the mean absolute error (MAE) is used to evaluate the performance of the regression models. As shown in Table VI, Meta-GAT outcompetes the other models on two out of three test tasks in the QM9 dataset. Two pretraining-based models (N-Gram and Mol2Context-vec) provided more promising predictions than the other meta learning models. Meta-GAT shows noticeable improvement in low-data drug discovery and is a promising meta learning method.

Moreover, Fig. 7 illustrates that the predictions of Meta-GAT agree closely with the ground truth (GT). Each subplot shows the results of a Bland-Altman analysis for one of the three test task sets in QM9. The x-axis and y-axis represent the average values and the bias between the GT and the values predicted by Meta-GAT, respectively. The blue dashed line indicates the mean bias, and the red dashed lines indicate the 95% confidence intervals of the bias. The results show that the mean bias of the three test task sets is −0.0516, 0.1817, and 0.1384, respectively. The percentages of the scatter points falling within the 95% confidence interval are greater than 98%. Therefore, the results show high agreement between the GT and the predictions of Meta-GAT. In other words, the predictions of Meta-GAT can replace the GT measured by experiment within the allowable error range.

E. Transfer Learning to SIDER From Tox21

The experiments thus far have demonstrated that Meta-GAT is able to learn an efficient learning process that transfers meta knowledge from a range of training tasks, allowing the model to rapidly adapt to closely related molecular property predictions with few examples. Transferability and task-relatedness issues need to be carefully evaluated in real-world use cases for drug discovery to determine whether transfer learning can be used. We train the model on the Tox21 dataset, fine-tune on ten samples taken from the test task on the SIDER dataset, and then evaluate on the remaining samples. There is a large domain shift between the two datasets: Tox21 measures the results of nuclear receptor assays, while SIDER records adverse effects from real patients. The problem is so challenging that even domain experts may not be able to judge it accurately.

TABLE VII: ROC-AUC scores of models trained on Tox21 and tested on SIDER.

We use only ten labeled data points to transfer to one or more SIDER test tasks. The evaluation results in Table VII show that neither the meta learning models nor the multitask models achieve generalization between unrelated tasks. The performance of these methods using knowledge transferred from Tox21 to SIDER is inferior to that of molecular models trained only on SIDER. Clearly, how to quantify the correlation between different tasks is important for transfer learning in drug discovery.

Kernel density estimation is used in probability theory to estimate unknown density functions. It makes no assumption about the data distribution and studies the characteristics of the distribution from the data sample itself. Fig. 8 shows the distribution differences of NR-AR-LBD, NR-AHR, SR-HSE, and SR-MMP in Tox21, and of eye disorders and product issues in SIDER, using kernel density estimation. There is a strong correlation among the four Tox21 assays, so the training tasks NR-AR-LBD and NR-AHR transfer well to the test tasks SR-HSE and SR-MMP. Due to the large distribution difference between Tox21 and SIDER, transfer may lead to negative transfer and overfitting in the case of data scarcity, thus failing to obtain meaningful molecular models. Identifying and addressing these possible limitations are research directions for our future work. Furthermore, kernel density estimation may serve as a method for assessing transferability in the field of drug discovery: it can measure the distributional differences between source and target tasks and thus reveal task relatedness. We hope our work can promote low-data drug discovery tasks.

F. Feature Visualization and Interpretation for Meta-GAT

The interpretability of the model is crucial, and reducing the gap between the visualization of the model and the chemical intuition of human understanding is conducive to the application of meta learning in drug discovery. When predicting a new molecular property task, our model uses a few examples in the new task's support set to perform several internal iterations and then evaluates the performance on the query set of the new task. This raises an obvious question: can learning from just a few molecules build a competitive classifier?

Fig. 9. Visualizations of molecular embeddings generated by Meta-GAT at each internal iteration during the test phase. (a)-(f) Number of iterations 0-5 using the support set.

Taking the test dataset SR-HSE in the Tox21 toxicity prediction task as an example, we explored the performance of the molecular embedding representation generated by the Meta-GAT model. Specifically, the high-dimensional feature generated by Meta-GAT is a 200-D embedding, similar to a fingerprint vector representation of a molecule. We reduce the high-dimensional vector to a 2-D embedding space by t-distributed stochastic neighbor embedding (T-SNE) [67] to observe the distribution of molecular representations of different categories. Fig. 9 shows the distribution visualization of molecular embeddings in the SR-HSE dataset; Fig. 9(a)-(f) correspond to the number of iterations (0-5) using the support set. The blue dots and red dots represent the molecules of positive and negative examples, respectively.

During model training, an initialization parameter that is approximately optimal for multiple toxicity prediction tasks has been found. Before using the support set iterations on the new SR-HSE toxicity prediction task [see Fig. 9(a)], the molecular representations had some degree of separation in the 2-D mapping space, but the model could not clearly separate positive and negative samples, and the blue and red dots were still mixed together. After one iteration using the support set [see Fig. 9(b)], the mixing of blue and red dots weakened and some aggregation appeared. This shows that, guided by meta knowledge and by feature analysis of the limited data in the new task, Meta-GAT iterates in the right direction. After one to two further iterations, the model clearly distinguishes the blue and red dots in Fig. 9(c) and (d): the blue dots mostly gather in the bottom-left corner of the space, and the red dots mostly in the top-right corner. The model has then reached its best performance on the new task. After 4-5 iterations on the same support set, the model may overfit. Therefore, we set an early stop to terminate the iterative process early.
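The inner-loop schedule just described (adapt on the support set, keep the best iteration count, and stop early once performance degrades) can be sketched as follows. The metric values are hypothetical, and in practice the 2-D projections themselves would come from a library such as sklearn.manifold.TSNE; this is not the authors' implementation.

```python
def adapt_with_early_stop(metric_per_step, patience=1):
    """Pick the inner-loop step count that maximizes a held-out metric,
    stopping once the metric has degraded for more than `patience`
    consecutive steps (the early stop discussed above).

    `metric_per_step` stands in for "evaluate the adapted model after
    each gradient step on the support set".
    """
    best_step, best_metric, bad = 0, float("-inf"), 0
    for step, metric in enumerate(metric_per_step):
        if metric > best_metric:
            best_step, best_metric, bad = step, metric, 0
        else:
            bad += 1
            if bad > patience:
                break  # performance keeps degrading: overfitting
    return best_step, best_metric
```

With the hypothetical trajectory [0.55, 0.68, 0.74, 0.76, 0.71, 0.66], the sketch keeps the third adaptation step, matching the qualitative behavior seen in Fig. 9(c) and (d).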

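The atom-similarity heatmaps analyzed next are built from Pearson correlations between per-atom feature vectors. A minimal, self-contained sketch follows; the three toy feature vectors in the test are illustrative, not real atom embeddings.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two atom feature vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def similarity_matrix(atom_vectors):
    """Pairwise atom-atom similarity matrix, the quantity drawn as a
    heatmap for each molecule."""
    return [[pearson(u, v) for v in atom_vectors] for u in atom_vectors]
```

Plotting such a matrix with atoms ordered by their SMILES indices reproduces the block structure discussed for Dipyrone below.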

In addition, we conducted two visualization experiments, on the atom similarity matrix and on the attention weights, to rationalize Meta-GAT. We obtained the similarity coefficient between atom pairs by calculating the Pearson correlation coefficient of their feature vectors and plotted the heatmaps of the atomic similarity matrices for six molecules, as shown in Fig. 10. Taking the molecular structure of Dipyrone as an example, the atoms in Dipyrone are clearly separated into three clusters: a benzene (atoms 0-5), an aminomethanesulfonic acid (atoms 6-13), and a pyrazolidone (atoms 14-20). The first impression of the visual pattern in the heatmap of the compound iodoantipyrine may show some degree of chaos, which is caused by the disorder of the atom numbers in its SMILES. Combining atoms 0-6, atom N13, and atom C14 of iodoantipyrine, its atoms are clearly divided into two clusters. The visual pattern of these heatmaps agrees strongly with our chemical intuition about these molecular structures.

Fig. 10. Heatmap of atomic similarity matrices for six molecules.

Fig. 11 shows the attention weights learned by Meta-GAT, used to highlight each atom of nine molecules from the Tox21 datasets. The Meta-GAT model may pay more attention to atomic groups that can cause toxicity, such as sulfonic acid or aniline groups. Sulfonic acid has potential hazards, including eye burns, skin burns, digestive tract burns if swallowed, and respiratory tract burns if inhaled. Aniline leakage may cause combustion and explosion hazards; it is very toxic to the blood and nerves and can poison through skin contact or respiratory absorption. These observations suggest that Meta-GAT has indeed successfully extracted relevant information by learning from a specific task, and that the attention weights at the atom level have chemical implications. For more intricate problems, attention weights may also be taken as hints for discovering new knowledge.

Fig. 11. Attention weights learned by Meta-GAT are used to highlight each atom in nine molecules in the toxicity prediction task on the Tox21 datasets.

IV. CONCLUSION

Drug discovery is the process of discovering the properties of new molecules and identifying useful molecules as new drugs after optimization. In the initial stage of optimizing candidate molecules, due to low solubility or possible toxicity, new molecules or analog molecules do not have many records of real physicochemical properties and biological activities. Therefore, a key problem of AI-assisted drug discovery is learning from few examples. Here, we propose a meta learning method based on a graph attention network, Meta-GAT, which uses the graph attention network to extract the interactions of atom pairs and the edge features of bonds in molecules. The meta learning algorithm trains a well-initialized parameter set through multiple prediction tasks and, on this basis, performs one or more steps of gradient adjustment to quickly adapt to a new task with only few data. Meta-GAT achieves SOTA performance on multiple public benchmark datasets, indicating that it can adapt to new tasks faster than other models. This algorithm is expected to mitigate the few-sample problem in drug discovery at a fundamental level. We have shown that Meta-GAT can provide a powerful impetus for low-data drug discovery. The development of meta learning is an important direction of AI-assisted drug discovery, and we believe this learning paradigm can be applied in the field of drug discovery in the future.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable suggestions.

REFERENCES

[1] H. Dowden and J. Munro, "Trends in clinical success rates and therapeutic focus," Nature Rev. Drug Discovery, vol. 18, no. 7, pp. 495-497, 2019.
[2] L. Wang et al., "Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field," J. Amer. Chem. Soc., vol. 137, no. 7, pp. 2695-2703, Feb. 2015.
[3] G. Sliwoski, S. Kothiwale, J. Meiler, and E. W. Lowe, "Computational methods in drug discovery," Pharmacological Rev., vol. 66, no. 1, pp. 334-395, 2014.
[4] Z. Yang, W. Zhong, L. Zhao, and C. Y.-C. Chen, "ML-DTI: Mutual learning mechanism for interpretable drug-target interaction prediction," J. Phys. Chem. Lett., vol. 12, no. 17, pp. 4247-4261, 2021.


[5] J.-Q. Chen, H.-Y. Chen, W.-J. Dai, Q.-J. Lv, and C. Y.-C. Chen, "Artificial intelligence approach to find lead compounds for treating tumors," J. Phys. Chem. Lett., vol. 10, no. 15, pp. 4382-4400, Aug. 2019.
[6] J.-Y. Li, H.-Y. Chen, W.-J. Dai, Q.-J. Lv, and C. Y.-C. Chen, "Artificial intelligence approach to investigate the longevity drug," J. Phys. Chem. Lett., vol. 10, no. 17, pp. 4947-4961, Sep. 2019.
[7] C. Y. Lee and Y.-P.-P. Chen, "New insights into drug repurposing for COVID-19 using deep learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 11, pp. 4770-4780, Nov. 2021.
[8] M. J. Waring et al., "An analysis of the attrition of drug candidates from four major pharmaceutical companies," Nature Rev. Drug Discovery, vol. 14, no. 7, pp. 475-486, Jul. 2015.
[9] J. Wenzel, H. Matter, and F. Schmidt, "Predictive multitask deep neural network models for ADME-Tox properties: Learning from large data sets," J. Chem. Inf. Model., vol. 59, no. 3, pp. 1253-1268, Mar. 2019.
[10] J. Ma, R. P. Sheridan, A. Liaw, G. E. Dahl, and V. Svetnik, "Deep neural nets as a method for quantitative structure-activity relationships," J. Chem. Inf. Model., vol. 55, no. 2, pp. 263-274, 2015.
[11] R. S. Simões, V. G. Maltarollo, P. R. Oliveira, and K. M. Honorio, "Transfer and multi-task learning in QSAR modeling: Advances and challenges," Frontiers Pharmacol., vol. 9, p. 74, Feb. 2018.
[12] C. Li et al., "Geometry-based molecular generation with deep constrained variational autoencoder," IEEE Trans. Neural Netw. Learn. Syst., early access, 2022, doi: 10.1109/TNNLS.2022.3147790.
[13] C. Ji, Y. Zheng, R. Wang, Y. Cai, and H. Wu, "Graph polish: A novel graph generation paradigm for molecular optimization," IEEE Trans. Neural Netw. Learn. Syst., early access, Sep. 14, 2021, doi: 10.1109/TNNLS.2021.3106392.
[14] P. Schneider et al., "Rethinking drug design in the artificial intelligence era," Nature Rev. Drug Discovery, vol. 19, no. 5, pp. 353-364, May 2020.
[15] X. Jing and J. Xu, "Fast and effective protein model refinement using deep graph neural networks," Nature Comput. Sci., vol. 1, no. 7, pp. 462-469, Jul. 2021.
[16] Z. Xiong et al., "Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism," J. Medicinal Chem., vol. 63, no. 16, pp. 8749-8760, Aug. 2019.
[17] Q. Lv, G. Chen, L. Zhao, W. Zhong, and C. Yu-Chian Chen, "Mol2Context-vec: Learning molecular representation from context awareness for drug discovery," Briefings Bioinf., vol. 22, no. 6, Nov. 2021, Art. no. bbab317.
[18] L. A. Bugnon, C. Yones, D. H. Milone, and G. Stegmayer, "Deep neural architectures for highly imbalanced data in bioinformatics," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 8, pp. 2857-2867, Aug. 2020.
[19] J. Song et al., "Local-global memory neural network for medication prediction," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 4, pp. 1723-1736, Apr. 2021.
[20] R. Huang, X. Tan, and Q. Xu, "Learning to learn variational quantum algorithm," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 28, 2022, doi: 10.1109/TNNLS.2022.3151127.
[21] Y. Yamanishi, E. Pauwels, and M. Kotera, "Drug side-effect prediction based on the integration of chemical and biological spaces," J. Chem. Inf. Model., vol. 52, no. 12, pp. 3284-3292, Dec. 2012.
[22] Á. Duffy et al., "Tissue-specific genetic features inform prediction of drug side effects in clinical trials," Sci. Adv., vol. 6, no. 37, Sep. 2020, Art. no. eabb6242.
[23] G. Yu, Y. Xing, J. Wang, C. Domeniconi, and X. Zhang, "Multiview multi-instance multilabel active learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 4311-4321, Sep. 2022.
[28] D. Duvenaud et al., "Convolutional networks on graphs for learning molecular fingerprints," in Proc. Adv. Neural Inf. Process. Syst., Annu. Conf. Neural Inf. Process. Syst. Montreal, QC, Canada: Curran Associates, Inc., Dec. 2015, pp. 2224-2232.
[29] P. Li et al., "TrimNet: Learning molecular representation from triplet messages for biomedicine," Briefings Bioinf., vol. 22, no. 4, Jul. 2021, Art. no. bbaa266.
[30] Q.-J. Lv et al., "A multi-task group bi-LSTM networks application on electrocardiogram classification," IEEE J. Transl. Eng. Health Med., vol. 8, pp. 1-11, 2020.
[31] C. Cai et al., "Transfer learning for drug discovery," J. Medicinal Chem., vol. 63, no. 16, pp. 8683-8694, 2020.
[32] S. Guo, L. Xu, C. Feng, H. Xiong, Z. Gao, and H. Zhang, "Multi-level semantic adaptation for few-shot segmentation on cardiac image sequences," Med. Image Anal., vol. 73, Oct. 2021, Art. no. 102170.
[33] M. Huisman, J. N. Van Rijn, and A. Plaat, "A survey of deep meta-learning," Artif. Intell. Rev., vol. 54, pp. 1-59, Aug. 2021.
[34] A. Banino et al., "Vector-based navigation using grid-like representations in artificial agents," Nature, vol. 557, no. 7705, pp. 429-433, May 2018.
[35] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, "Meta-learning in neural networks: A survey," 2020, arXiv:2004.05439.
[36] J. Vanschoren, "Meta-learning: A survey," 2018, arXiv:1810.03548.
[37] J. X. Wang et al., "Prefrontal cortex as a meta-reinforcement learning system," Nature Neurosci., vol. 21, no. 6, pp. 860-868, May 2018.
[38] S. Biswas, G. Khimulya, E. C. Alley, K. M. Esvelt, and G. M. Church, "Low-N protein engineering with data-efficient deep learning," Nature Methods, vol. 18, no. 4, pp. 389-396, Apr. 2021.
[39] R. Liu, X. Yu, X. Liu, D. Xu, K. Aihara, and L. Chen, "Identifying critical transitions of complex diseases based on a single sample," Bioinformatics, vol. 30, no. 11, pp. 1579-1586, Jun. 2014.
[40] S. Lin et al., "Prototypical graph contrastive learning," IEEE Trans. Neural Netw. Learn. Syst., early access, Jul. 27, 2022, doi: 10.1109/TNNLS.2022.3191086.
[41] Y. Sun et al., "Combining genomic and network characteristics for extended capability in predicting synergistic drugs for cancer," Nature Commun., vol. 6, no. 1, pp. 1-10, Sep. 2015.
[42] Q. Liu, H. Zhou, L. Liu, X. Chen, R. Zhu, and Z. Cao, "Multi-target QSAR modelling in the analysis and design of HIV-HCV co-inhibitors: An in-silico study," BMC Bioinf., vol. 12, no. 1, pp. 1-20, Dec. 2011.
[43] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, "Low data drug discovery with one-shot learning," ACS Central Sci., vol. 3, no. 4, pp. 283-293, 2017.
[44] T. Adler et al., "Cross-domain few-shot learning by representation fusion," 2020, arXiv:2010.06498.
[45] Z. Guo et al., "Few-shot graph learning for molecular property prediction," in Proc. Web Conf., J. Leskovec, M. Grobelnik, M. Najork, J. Tang, and L. Zia, Eds., Ljubljana, Slovenia, Apr. 2021, pp. 2559-2567.
[46] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Generalizing from a few examples: A survey on few-shot learning," ACM Comput. Surv., vol. 53, no. 3, pp. 1-34, 2020.
[47] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proc. 34th Int. Conf. Mach. Learn., vol. 70, Sydney, NSW, Australia, Aug. 2017, pp. 1126-1135.
[48] D. Jiang et al., "Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models," J. Cheminformatics, vol. 13, no. 1, pp. 1-23, Feb. 2021.
[49] R. Winter, F. Montanari, F. Noé, and D.-A. Clevert, "Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations," Chem. Sci., vol. 10, no. 6, pp. 1692-1701, Jul. 2019.
[24] A. Morro et al., “A stochastic spiking neural network for virtual [50] J. Cui, B. Yang, B. Sun, X. Hu, and J. Liu, “Scalable and parallel deep
screening,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, Bayesian optimization on attributed graphs,” IEEE Trans. Neural Netw.
pp. 1371–1375, Apr. 2018. Learn. Syst., vol. 33, no. 1, pp. 103–116, Jan. 2020.
[25] Y. Yu and H. Tran, “An XGBoost-based fitted Q iteration for [51] Z. Wu et al., “MoleculeNet: A benchmark for molecular machine
finding the optimal STI strategies for HIV patients,” IEEE Trans. learning,” Chem. Sci., vol. 9, no. 2, pp. 513–530, 2018.
Neural Netw. Learn. Syst., early access, Jun. 2, 2022, doi: [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
10.1109/TNNLS.2022.3176204. 2014, arXiv:1412.6980.
[26] K. V. Chuang, L. M. Gunsalus, and M. J. Keiser, “Learning molecular [53] F. Fabris, A. Doherty, D. Palmer, J. P. De Magalhães, and A. A. Freitas,
representations for medicinal chemistry: Miniperspective,” J. Medicinal “A new approach for interpreting random forest models and its appli-
Chem., vol. 63, no. 16, pp. 8705–8722, Aug. 2020. cation to the biology of ageing,” Bioinformatics, vol. 34, no. 14,
[27] M. Sun, S. Zhao, C. Gilvary, O. Elemento, J. Zhou, and F. Wang, pp. 2449–2456, Jul. 2018.
“Graph convolutional networks for computational drug develop- [54] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
ment and discovery,” Briefings Bioinf., vol. 21, no. 3, pp. 919–935, convolutional networks,” in Proc. 5th Int. Conf. Learn. Represent.
May 2020. (ICLR), Toulon, France, Apr. 2017, pp. 1–14.

Authorized licensed use limited to: SUN YAT-SEN UNIVERSITY. Downloaded on March 08,2023 at 02:46:12 UTC from IEEE Xplore. Restrictions apply.
Guanxing Chen is currently pursuing the Ph.D. degree with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China.

His research interests include explainable artificial intelligence, drug discovery, deep learning, biosynthesis, and vaccine design.

Ziduo Yang is currently pursuing the Ph.D. degree with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China.

His main research interests include explainable graph neural networks, computer vision, reinforcement learning, and chemoinformatics.
Weihe Zhong is currently pursuing the Ph.D. degree with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China.

His main research interests include graph neural networks, chemoinformatics, and drug discovery.
Calvin Yu-Chian Chen is currently the Director of the Artificial Intelligent Medical Center and a Professor with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China. He also serves as an Advisor at China Medical University Hospital, Taichung, China, and Asia University, Taichung, and as a Guest Professor at the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, and the University of Pittsburgh, Pittsburgh, PA, USA. He has published more than 300 SCI articles, with an H-index above 47. From 2020 to 2023, he was a highly cited candidate in the field of computer science and technology; from 2021 to 2023, he was selected among the world's top 100 000 scientists; and from 2018 to 2023, among the world's top 2% scientists. He has built several artificial intelligence medical systems for hospitals, covering pathological image processing, MRI image processing, and big data modeling. He also built the world's largest traditional Chinese medicine database (http://TCMBank.cn/). His laboratory's general research interests include developing structured machine learning techniques for computer vision tasks and investigating how to exploit human commonsense to develop advanced artificial intelligence systems.

Qiujie Lv is currently pursuing the Ph.D. degree with the Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong, China.

His research interests include graph neural networks, drug discovery, artificial intelligence, and bioinformatics.