Knowledge Graphs and Their Applications in Drug

The document discusses how knowledge graphs have emerged as promising tools for drug discovery due to the large amount of heterogeneous biological data available and the industry's shift toward systems biology approaches. It evaluates the utility of knowledge graphs, highlighting target identification and drug repurposing as areas showing promise. It also provides a case study on using knowledge graphs to identify potential drug candidates for COVID-19 and discusses challenges like bias that need to be addressed.

Expert Opinion on Drug Discovery

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/iedc20

Knowledge graphs and their applications in drug discovery

Finlay MacLean

To cite this article: Finlay MacLean (2021): Knowledge graphs and their applications in drug
discovery, Expert Opinion on Drug Discovery, DOI: 10.1080/17460441.2021.1910673

To link to this article: https://doi.org/10.1080/17460441.2021.1910673

Published online: 12 Apr 2021.


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=iedc20
EXPERT OPINION ON DRUG DISCOVERY
https://doi.org/10.1080/17460441.2021.1910673

REVIEW

Knowledge graphs and their applications in drug discovery


Finlay MacLean
Target Identification, BenevolentAI, United Kingdom of Great Britain and Northern Ireland

ABSTRACT

Introduction: Knowledge graphs have proven to be promising systems of information storage and retrieval. Due to the recent explosion of heterogeneous multimodal data sources generated in the biomedical domain, and an industry shift toward a systems biology approach, knowledge graphs have emerged as attractive methods of data storage and hypothesis generation.

Areas covered: In this review, the author summarizes the applications of knowledge graphs in drug discovery. They evaluate their utility; differentiating between academic exercises in graph theory, and useful tools to derive novel insights, highlighting target identification and drug repurposing as two areas showing particular promise. They provide a case study on COVID-19, summarizing the research that used knowledge graphs to identify repurposable drug candidates. They describe the dangers of degree and literature bias, and discuss mitigation strategies.

Expert opinion: Whilst knowledge graphs and graph-based machine learning have certainly shown promise, they remain relatively immature technologies. Many popular link prediction algorithms fail to address strong biases in biomedical data, and only highlight biological associations, failing to model causal relationships in complex dynamic biological systems. These problems need to be addressed before knowledge graphs reach their true potential in drug discovery.

ARTICLE HISTORY
Received 10 January 2021
Accepted 26 March 2021

KEYWORDS
Biomedical knowledge graphs; drug repurposing; drug repositioning; heterogeneous information networks; graph machine learning; network embeddings; knowledge graph embedding; network pharmacology; network medicine

CONTACT Finlay MacLean [email protected] BenevolentAI, 4-8 Maple St, Bloomsbury, London W1T 5HD, United Kingdom of Great Britain and Northern Ireland.
© 2021 Informa UK Limited, trading as Taylor & Francis Group

1. Introduction

Despite an explosion of data being generated, significant scientific and technological advancements, and industry initiatives on efficiency, the pharmaceutical industry has suffered a ‘drug drought’, in which investment in research and development (R&D) dramatically increased [1] without a marked increase in annually approved drugs [2]. Whilst the number of new chemical entities to reach the market has steadily increased over the last decade [2], this has been accompanied by a marked increase in R&D expenditure [1]. Business value is derived principally from R&D yielding a positive return on investment. The return on investment on R&D for the top 12 pharmaceutical companies fell from 10% in 2010 to 2% in 2019 [3], and the cost to develop a drug rose almost two-fold to USD 2 billion [3]. The pharmaceutical industry is increasingly looking toward new disruptive methods to reduce the failure rate, increase the speed of development, and ultimately reduce the cost of research.

1.1. The decades of data

In the past two decades, there has been an explosion of data generated in the biomedical domain. Motivated by the landmark Human Genome Project [4], which marked the beginning of the genetic revolution, subsequent projects such as the 1000 Genomes Project [5] and the UK Biobank [6] have provided comprehensive analyses of the human genome. Next generation sequencing (NGS) technologies have dramatically lowered the cost of sequencing, and both genome-wide and phenotype-wide association studies are now frequently used to connect underlying genetic variation to complex phenotypic traits. Public repositories such as the Sequence Read Archive [7] and European Nucleotide Archive [8] have provided researchers with access to large databases of DNA sequencing data. Combined with the use of expression quantitative trait locus studies, technologies such as Mendelian randomization and colocalization have helped to identify causal loci and genes. In transcriptomics, the Human Protein Atlas [9] has generated atlases of gene expression of tissues, the brain, diseases, blood and cells, whilst the Genotype-Tissue Expression (GTEx) [10] project quantified expression over common tissues. The Library of Integrated Network-Based Cellular Signatures (LINCS) project collected gene expression profiles in response to genetic and chemical perturbagens [11]. In oncology, projects such as The Cancer Genome Atlas [12] and Cancer Cell Line Encyclopedia [13] have provided detailed molecular characterizations of cancers, whilst RNA-interference and CRISPR-Cas9 technologies identified genetic dependencies in cancers [14]. In epigenomics, the Human Epigenome Atlas [15] has cataloged genome-wide epigenetic markers in all major tissues. Multiple databases have documented non-coding RNAs and their regulatory targets [16,17], and their association with diseases [18].

1.2. The era of systems biology

This immense accumulation of biomedical data led to a paradigm shift in the pharmaceutical industry, moving from phenotype-based discovery to a target-based approach. The previous phenotypic approach paid less attention to the mechanism of action of the drug, focusing more on the desired phenotypic outcome. Even in Phase I clinical trials, this approach was unable to determine the mechanism of action of a drug [19]. The collective zeitgeist of the industry moved toward a systems biology-focused approach, and with it the fundamental aim of drug discovery changed. It now aimed to first understand the complex biological systems within our cells and how their dysregulation leads to disease, and finally develop methods to selectively target these systems.

Systems biology describes the computational modeling of molecular systems, drawing from many disciplines including computer science and physics. One of the fundamental principles of biological life is that the accumulation of simple, locally-acting components leads to complex structures and systems. Systems biology follows this principle, describing a complex biological system as a network of simple biological components (analogous to those in electronic circuits) and simulating how the system changes in response to certain stimuli. These systems consist of intra- and inter-cellular interactions amongst molecules that govern biological functions, and whose dysregulation leads to disease. Their networks often contain multi-scale elements, ranging from molecular components to tissues; both physical and abstract entities, ranging from proteins to phenotypic outcomes. Networks also contain diverse types of interactions between entities, such as inhibitions, activations, associations and causal interactions.

Article highlights
• Knowledge graphs provide an elegant solution to the ‘data problem’ in the pharmaceutical industry, integrating and harmonizing the ever-growing number of multimodal data sources.
• Representing biological systems as knowledge graphs has allowed for the exploitation of graph theory and powerful graph machine learning methodologies, well suited to the target-based systems biology approach to drug discovery.
• The most common application of knowledge graphs in the pharmaceutical industry is in early stage drug discovery and repurposing, particularly in identification of pathogenic genes and drug targets.
• Biomedical knowledge graphs have yielded noteworthy repurposing candidates for COVID-19 directly leading to clinical validation and emergency use authorization.
• The predictive power in many graph machine learning techniques comes mainly from connectivity, and not network proximity, introducing a significant bias in link prediction tasks.
• This connectivity bias is further exacerbated when training on literature-derived knowledge graphs whose degree distribution diverges from that of the underlying biological system. Mitigation strategies are needed.
This box summarizes key points contained in the article.

1.3. Knowledge graphs

These technological and scientific advancements have undoubtedly deepened our understanding of human biology and disease. But why has this innovation not been reflected in an increase in profitable R&D in pharmaceutical companies? Some speculate that the so-called low-hanging fruit of drug discovery have been picked [19]. A tightening of drug safety regulations in response to the thalidomide tragedy has certainly raised the criteria for receiving approval for a drug [20]. One possible answer is simply how difficult it is to ingest, harmonize and use the multitude of data that has been generated. Despite significant initiatives to ‘digitally transform’ Novartis, their CEO, Vas Narasimhan, has remarked on the difficulty of cleaning and linking their heterogeneous data [21]. This should not be read as an individual failure of Novartis; rather, it reflects the difficulties faced by the entire industry. The difficulties in data integration are further reflected in the industry-academic initiatives to standardize scientific data such as the FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles [22].

Knowledge graphs (KGs) have proven to be attractive methods to store biomedical data due to their capacity to model complex data structures. Observing Figure 1, one can clearly see the parity between the biological pathway [23] and its approximation in a graph database. Formally, a KG can be described as a labeled multi-graph [24]. The graph consists of entities, commonly referred to as nodes or vertices, and relationships connecting two entities, commonly referred to as edges, facts, or links. The two nodes constituting an edge or relation are often respectively called the head and tail or source and target nodes. Across many diverse domains and industries, KGs have been used in question-answering, recommendation, and information retrieval systems. Arguably the most famous of commercial question-answering systems is Watson, developed by IBM to beat human experts at the quiz show Jeopardy [25]. In terms of recommendation systems, Pinterest have famously used a KG of user-likes-pin to recommend new pins to their user-base [26].

Figure 1. Representing biology as a heterogeneous information network.
A KG representation of the canonical insulin receptor signaling cascades using the Neo4j graph database [27]. Image was created by querying a KG to reconstruct the signaling pathway of Iams et al. [23].

In the field of drug discovery, one of the earliest notable attempts to integrate multiple structured biomedical databases was the work of Himmelstein et al., developing Hetionet to prioritize drugs for repurposing [28**] and genes associated with disease [29]. Other KGs include OpenBioLink, principally used to benchmark link prediction models [30], and the work of Womack et al. [31]. Whilst the integration of structured databases has proven its utility, others have derived biological relationships from literature. The Global Network of Biomedical Relationships [32*] screened 24 million research articles to create a disease-gene-chemical KG consisting of 2 million thematically-labeled edges. Biomedical KGs can contain a multitude of multimodal data spanning transcriptomics, proteomics, genomics, phenomics, drug pharmacology, chemistry, and ontological information. The schema of the Drug Repurposing Knowledge Graph [33] exemplifies the heterogeneity of data common in KGs for drug discovery. The majority of large-scale biomedical KGs are based on semantic web technologies, the largest of which is Bio2RDF [34]. One of the defining features of semantic KGs is their extensibility, as demonstrated by projects such as Chem2Bio2RDF, which combined Bio2RDF with a chemogenomic semantic graph [35].

There seems to be no agreed-upon definition of a KG. Some restrict the name to literature-derived graphs only. Others go even further, using KG to refer only to graphs that use semantic technologies to represent the data. In this article, we define a KG as any heterogeneous information network, regardless of the technology used and the provenance of the data it represents.

1.4. Graph machine learning on knowledge graphs

Marshall Nirenberg famously stated that science progresses best using simple assays to rapidly generate large data sets [19]. Whilst large-scale and genome-wide screens are certainly the gold standard of systematic drug discovery, their high costs often prohibit their use for all but the most common (and thus profitable) of diseases. Machine learning has demonstrated its potential as a complementary approach: rapidly and inexpensively generating data in unexplored areas of the biological and chemical space. In particular, graph-based machine learning (GML; also referred to as geometric machine learning) methods have shown promise in this field. Representing biological systems as KGs has allowed for the exploitation of graph theory and powerful network science algorithms, drawing new insights into this otherwise siloed data. GML has been applied to systematically screen compounds for new interactions, and to shed light upon unknown areas of the human interactome.

GML uses the topological structure of the network to classify properties of nodes, predict the existence of edges, and detect communities [36*]. It assumes that functionally or structurally similar nodes will be more highly connected within the graph. In a disease-gene association network, comorbidities will share a higher number of associated genes, implying their functional similarity. In contrast, two diseases that have no intersection of associated genes are unlikely to exhibit functional homogeneity. Traditional approaches such as Common Neighbor, Jaccard’s Index, Adamic/Adar Index and Katz compute similarities between nodes based on local neighborhoods [37], but fail to utilize the global neighborhood and topology of the graph. In recent years, embedding-based GML has become the norm. Embedding strategies encode the continuous neighborhood information of a node, graph substructure, or entire graph into a discrete low-dimensional latent vector [38]. To refer back to the previous exemplary disease-gene association network, a comorbidity of two diseases will encode both diseases with embeddings (vectors) that are mathematically similar in latent vector space, since they are proximal within the network. For a comprehensive survey of graph embeddings, we refer the reader to these reviews [34*] [38].

Embedding strategies have now developed to encode heterogeneous graphs, often referred to as knowledge graph embeddings (KGEs). Aside from representing nodes as latent vectors, a low-dimensional representation of each relationship type is also learned. A wonderful myriad of methods have now been employed to generate KG embeddings. A review of these methodologies is outside the remit of this manuscript. We refer the reader to the comprehensive review of Rossi et al. [24] and repository of KGE models [39].

2. Applications of knowledge graphs

KGs have emerged as an effective method of information representation in drug discovery. Modeling biological systems as graphs has facilitated the use of powerful network-based algorithms; encoding the continuous global or local neighborhood of nodes into discrete latent vectors. The vectors are then used in a downstream machine learning task. Supervised downstream tasks include link prediction (pairwise prediction between two nodes), and node classification (classification of one node). Embeddings may also be used in unsupervised tasks such as community detection (the detection of neighborhoods of nodes via clustering). These methods make the guilt-by-association assumption: that functionally or structurally similar biological entities are likely to have similar properties, have high network proximity, and have a small distance between node embeddings in vector space.

2.1. Drug repurposing: a link prediction task

The overwhelming majority of applications using KGs are framed as link prediction tasks. Our knowledge of biology is incomplete, and the resulting information networks are sparsely populated. Link prediction on incomplete networks aims to systematically complete these networks, in which predicted edges represent biological interactions or associations that are currently unknown, unexplored or yet to be validated. Figure 2 illustrates the training process of a link prediction KG embedding model.
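As a minimal sketch of the link prediction idea described above, the following Python snippet scores every missing drug-treats-disease edge in a toy graph and ranks the candidates. It assumes a DistMult-style trilinear scoring function, which is one simple member of the tensor factorization family and not necessarily the model used in any of the cited works. All entity names, the relation name, and the two-dimensional embeddings are hypothetical and hand-set for illustration; in a real setting they would be learned from the KG during training.

```python
# Hypothetical, hand-set 2-D embeddings (illustration only). In practice
# these vectors are learned from the KG rather than written by hand.
entity_emb = {
    "drug_A": [1.0, 0.0],
    "drug_B": [0.9, 0.1],
    "drug_C": [0.0, 1.0],
    "disease_D": [1.0, 0.0],
    "disease_E": [0.0, 1.0],
}
relation_emb = {"treats": [1.0, 1.0]}

# Edges already present in the toy graph.
known_edges = {
    ("drug_A", "treats", "disease_D"),
    ("drug_C", "treats", "disease_E"),
}


def distmult_score(head, relation, tail):
    """Trilinear (DistMult-style) score: sum over k of h_k * r_k * t_k."""
    h = entity_emb[head]
    r = relation_emb[relation]
    t = entity_emb[tail]
    return sum(hk * rk * tk for hk, rk, tk in zip(h, r, t))


def rank_missing_edges(relation, heads, tails):
    """Score every (head, relation, tail) absent from the graph, best first."""
    candidates = [
        (h, relation, t)
        for h in heads
        for t in tails
        if (h, relation, t) not in known_edges
    ]
    scored = [(distmult_score(*edge), edge) for edge in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


drugs = ["drug_A", "drug_B", "drug_C"]
diseases = ["disease_D", "disease_E"]
ranking = rank_missing_edges("treats", drugs, diseases)
top_score, top_edge = ranking[0]
```

In this toy setting the highest-ranked missing edge is (drug_B, treats, disease_D), because the embedding of drug_B lies close to that of drug_A, which is already known to treat disease_D. This is a small instance of the guilt-by-association assumption; a real pipeline would add training with negative sampling and a held-out evaluation.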
Figure 2. Prototypical link prediction training process.
Tensor factorization KG embedding model reproduced from Paliwal et al. [49]. Step (a) shows the original graph consisting of three entities (A, B, and C), and two relations (r1 and r2). Step (b) shows the latent vector representation (embeddings) of the nodes and relations. Step (c) shows a downstream scoring function of a triplet of source node, relation, and target node embeddings. Step (d) shows the predictions of edge existence for all edges in the original graph.

Of all of the applications of KG-based link prediction within the field of drug discovery, one of the most promising is drug repurposing (DR). The aim of DR is to identify new indications and conditions for existing drugs. Framed as a systems biology problem, the aim is to predict the likelihood of edges within a biological graph. With only 10% of new molecules reaching the clinic [40*], drug repurposing has proven to be a lucrative method of drug discovery, mitigating a proportion of the risk by focusing on drugs that already have established safety profiles. In recent years, almost one-third of the drugs that receive approvals are repurposed [41]. Many of the most successful repositions have been largely serendipitous: their unplanned side effects fortuitously provide benefit to other patient populations with other conditions [41]. Examples of repurposed drugs include the dihydrotestosterone inhibitor, finasteride. Finasteride was originally developed for treatment of prostate cancer and showed moderate efficacy [42]; however, after hair growth was noted on laboratory rats, the drug was repositioned for treatment of androgenetic alopecia [43]. Whilst finasteride serves as a poster child for how side effects can be beneficial, it also serves as a stark reminder that the majority of side effects are maleficial. Systemic inhibition of dihydrotestosterone, for example, has been associated with permanent sexual dysfunction and cognitive impairment [44].

Unlike its serendipitous counterpart, network-based DR has the potential to differentiate between beneficial and maleficial phenotypic outcomes, intelligently and systematically identifying new indications for existing drugs. Multiple approaches encompass network-based DR: on-target repurposing, off-target repurposing and target-agnostic repurposing. Each approach predicts a different relation between drug, gene and disease in a KG (see Figure 3).

Figure 3. Approaches to network-based drug repurposing and discovery.
Edges of multiple types can be predicted to indicate the therapeutic viability of a repurposable drug to treat a disease. Different edge types correspond to different approaches to drug repurposing. On-target repurposing describes the prediction of novel therapeutic genes, which are the known on-targets of drugs. In off-target repurposing, the off-targets of a drug are predicted, one or more of which will regulate a disease. In target-agnostic repurposing, the gene through which the drug acts is not explicitly provided.

2.1.1. Target-agnostic drug repurposing

To identify repurposable drug candidates for new indications, many methods predict drug-treats-disease edges in pharmacological KGs. One example was the work of Himmelstein et al., who applied a degree-normalized pathway model to highlight repurposable drugs for epilepsy. The model was applied to their Hetionet KG, consisting of genes, diseases, tissues, pathophysiologies and multimodal edges [26**]. Biomedical data in these graphs is sparse. To overcome the sparsity problem, Poleksic developed a compressed sensing technique, demonstrating superior performance over the original pathway-based implementation [45]. One of the drawbacks of pathway-based approaches is the high computational cost, limiting their use to relatively small KGs. Womack et al. demonstrated how node2vec, a popular random-walk method, was more performant and had significantly lower computational overheads [31]. KGEs have also been applied to this prediction task, including by Sosa et al., whose model exploited the confidence scores of edges in a literature-derived KG [46]. In target-agnostic drug repurposing, neither the drug target nor disease-associated genes are explicitly provided to the models, which thus obfuscate the mechanism of action of the drug.

2.1.2. On-target repurposing and target identification

In contrast to target-agnostic approaches, target-based drug repurposing approaches are attractive alternatives. Diseases can be described as phenotypic manifestations caused by genomic perturbations. These genomic perturbations cause further dysregulation of genes and pathways. The aim of target-based drug discovery is to develop a compound to either directly or indirectly target one of these disease-causing or disease-associated genes. Similarly, the aim of on-target drug repurposing is to identify a preexisting drug to target one of these disease-causing or disease-associated genes. Both methods require one or more targets through which to act. GML methods have been widely applied to prioritize pathogenic genes. Himmelstein et al. applied the previously cited Hetionet KG to predict disease-causing genes for multiple sclerosis [29]. Even embedding strategies trained on relatively small heterogeneous networks have demonstrated their superiority over baseline approaches, such as Xu et al., who trained a multipath random walk model on a network of gene-phenotype, protein–protein interactions and phenotypic similarities [47].

KGE approaches have recently been applied to target identification. Pitalla et al. used a relation-weighted RotatE model to predict drug targets for Parkinson’s disease [48]. Notably, the model outperformed OpenTargets, the leading initiative for target identification, which includes genetic, pharmacologic, pathway, and multi-omics data. Similarly, Paliwal et al. developed Rosalind, a tensor factorization-based KG embedding trained on BenevolentAI’s proprietary biomedical knowledge to predict therapeutic targets for rheumatoid arthritis [49]. Rosalind outperformed OpenTargets alongside other GML approaches. Top predicted genes were experimentally validated in an in vitro assay using patient-derived cells. Five genes were determined to be promising for further preclinical research.

The utility of network-based approaches to target identification is well demonstrated by the above (both academia and industry-driven) projects. The above methods are focused on identifying pathogenic protein-coding genes. Whilst protein drug targets remain the central focus of drug discovery, the role of non-coding RNA (ncRNA) in disease is becoming more apparent [50]. Researchers have started to exploit KGs and GML to predict ncRNA-disease associations. Ji et al. constructed a KG consisting of micro-RNA, circular-RNA, long non-coding RNA (lncRNA), proteins and diseases, and used a matrix factorization embedding model to predict miRNA-disease associations for three common cancers [51]. GML has also been applied to predict lncRNA-disease associations. In a similar project, Zhou et al. built a KG similar to that used by Ji et al., and trained a higher-order preserving matrix factorization model [52]. They validated their model by predicting disease-related lncRNAs for three excess death rate cancers.

2.1.3. Off-target repurposing and drug-target interaction

The one drug, one gene, one disease paradigm of drug discovery has passed. Many diseases are now understood to be multifactorial, caused by the combination of the effects of multiple genes. Most drugs are now estimated to bind to between 10 and 100 targets [48]. Polypharmacology is a promising paradigm in drug discovery, assuming drugs act through multiple genes associated with one or more pathomechanisms. Our knowledge of the pharmacogenomic space is sparsely populated [53], mainly limited to the disease-associated genes of interest, and genes common in safety panels (essential genes and those associated with undesired phenotypes). Genome-wide screens would be of great utility to understand the polypharmacology of existing drugs (off-target drug repurposing), and novel compounds (drug discovery). Due to the cost of experimental screens, many researchers have developed in silico methods to quickly and inexpensively screen for drug-target interactions (DTIs). Link prediction methods have been used to predict drug-binds-target edges in pharmacological KGs. A random walk-based approach, DTINet, was used to identify the novel inhibitory action of three approved drugs on cyclooxygenase proteins [54]. The pharmaceutical industry is often interested in determining chemicals with little to no available interaction data: the so-called cold-start problem. Lim et al. developed a collaborative filtering approach tailored specifically to chemicals with few interactions [55]. In addition to the necessary pharmacogenomic data, network-based approaches have used genomic, chemical, pharmacological [54,56], side effects [57], diseases, pharmacokinetics, and proteomics [58]. KG embeddings have also been applied, screening approved drugs for off-target interactions with COVID-19 associated genes [33].

Many computational methods have been developed to perform in silico screens. The main advantage of network-based approaches is that they do not require the 3D structure of the protein. Deep learning approaches such as DeepPurpose, using only the primary amino acid sequence of the protein and SMILES string of the molecule, have been shown to be competitive methods of DTI prediction [59]. If the 3D structure of the protein is known, molecular docking studies provide an unparalleled level of information on how a drug interacts with the binding pocket of a protein. Deep learning has also been successfully applied to molecular docking, as exemplified by the commercialization of Atomnet by Atomwise [60]. Until recently, these approaches were limited to proteins with a known 3D structure. A deep learning method, AlphaFold, that uses amino acid sequence as input, has recently demonstrated accuracy comparable to experimental techniques such as X-ray crystallography [61]. This may widen the screening possibilities of structure-based approaches, overshadowing network-based approaches to DTI prediction.

2.1.4. COVID-19: A case study in network-based repurposing

Unlike its serendipitous counterpart, network-based drug repurposing is still waiting to see its first compound reach the market.
True validation that this method is an effective tool in drug discovery will come only once drugs identified are approved for their new indication, and a systematic review of the methodology has been conducted. Arguably, the most mature and substantial efforts to identify repurposable drugs have been focused on finding a therapeutic to target the SARS-CoV-2 coronavirus or treat the associated COVID-19 disease. Network-based methods need a network on which to train, in this case capturing data pertaining to SARS-CoV-2. Reese et al. [62] integrated multiple structured databases into their biomedical KG. Next, they integrated datasets pertaining to COVID-19 (Zhou et al. [63], CORD-19 [64]). Ioannidis et al. [33] combined the preexisting KGs [26**], [30*] with additional databases. To predict the likelihood that an approved drug would treat COVID-19, the researchers trained a TransE embedding model on the KG, and then computed the distance scores between approved drugs and COVID-19 and similar coronaviruses, and drugs and COVID-19-associated genes. Hsieh et al. extended the KG of Ioannidis et al. [33], integrating a SARS-CoV-2-specific graph into the original graph via transfer learning [65]. To discover drugs that can functionally target SARS-CoV-2-associated host genes, protein and drug embeddings were used to predict therapeutic effectiveness.

A multitude of network medicine approaches have been applied to identify existing drugs to palliate or treat COVID-19. The majority of these studies focus on predicting drugs that prevent viral entry (targeting viral genes), viral replicative mechanisms, or suppression of the host inflammatory response (both targeting host genes). One of the most noteworthy efforts was produced by Gysi et al. [66], who integrated host-host, host-viral, and host-drug protein interaction networks, using an ensemble of predictive models to predict 81 potential candidates to treat SARS-CoV-2. Their method successfully predicted that SARS-CoV-2 could manifest in brain tissue and have neurological comorbidities, which have since been validated [67]. Other researchers have similarly used network proximity of drug targets to viral proteins in protein–protein interactomes [63]. BenevolentAI used a proprietary literature-derived KG to

many other applications remain as academic research projects. KGs and GML techniques have been widely used in academia and applied to further pharmacological and multi-omic prediction tasks, including prediction of protein–protein interaction [70–72], polypharmacy side effects [73], disease side effects [74], and drug–drug interactions [75]. An exhaustive summary of all of these is out of the remit of this review. We refer the reader to the review of Su et al. [76].

2.3. Node classification applications

Whilst most applications of KGs in drug discovery are framed as link prediction tasks, node classification has also demonstrated its utility. Node classification describes the process in which a model is trained on features derived from a subset of nodes with a labeled property, and subsequently predicts the likelihood that unlabeled nodes possess this property. To the best of our knowledge, there are few examples of the adoption of these methods by industry or public biological databases; however, they demonstrate the diverse range in which KGs can be applied to drug discovery.

2.3.1. Protein function

Understanding protein function is one of the earliest prerequisites in the drug discovery process. Proteins with similar sequences tend to exhibit similar functions [77]. Also, proteins with similar sequences tend to interact with similar proteins within protein–protein interaction (PPI) networks. Thus, protein nodes with high network proximity tend to share protein function. The seminal paper for the node2vec model [78], a semi-supervised random walk model, showed how node embeddings and a downstream multi-label classifier could be effectively used to predict protein function, using the BioGRID PPI network [79]. Whilst node2vec used only a homogeneous network of PPIs, others have extended their work, including other forms of both graphical and comple­
identify baricitinib, a drug used to treat rheumatoid arthritis, to mentary information. DeepGo uses both sequence and PPI
treat patients with bilateral COVID-19 pneumonia [68]. Since networks to generate features for each protein [80]. Nariai
then, more than 12 clinical trials have been conducted, including et al. demonstrated that the integration of PPI networks,
by the drug’s proprietor, Eli Lilly. The janus kinase inhibitor has gene expression, protein motif information, gene knockout
now been granted emergency use authorization (EUA) from the phenotype data and protein localization information yielded
FDA for the treatment of hospitalized patients with COVID-19 greater performance than homogeneous PPI networks [81].
[69]. The methodology employed by BenevolentAI is yet to be Proteins do not execute all of their functions at all times,
published, and drug authorization of baricitinib is only tempor­ and in all tissues in which they are expressed [82].
ary. True validation for the use of KGs in drug repurposing will Researchers developed OhmNet, using node2vec to generate
come only after (i) the FDA approval of drugs surfaced through embeddings based on tissue-specific PPIs, demonstrating
KG and GML methods, and (ii) such methods have been sub­ improved performance over methods employing tissue-
jected to the scientific method. Nevertheless, the numerous agnostic networks.
aforementioned studies certainly suggest that the inclusion of Node classification is far from the only computational
KG-based methods into drug discovery workflows could be of method for predicting protein function. Protein function is
great benefit. directly related to its 3D structure. A deep learning method
AlphaFold, that uses amino acid sequence as input, has
recently demonstrated accuracy comparable to experimental
2.2. Additional link prediction applications techniques such as X-ray crystallography [61]. Such advances
The utility of the method is reflected in its adoption in indus­ may facilitate the overshadowing of network methods by
try. Whilst the above applications of both drug repurposing deep learning methods which learn functions from 3D
and early drug discovery have been employed in industry, topology.
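The embedding-plus-classifier idea behind the protein function work described in this section can be illustrated with a minimal sketch. It is not the node2vec pipeline itself: as a simplifying assumption, random walk with restart (RWR) visit probabilities stand in for learned node2vec embeddings, and a nearest-centroid rule stands in for the trained multi-label classifier. The toy PPI network, the protein names a to h, and the function labels are all hypothetical.

```python
from math import sqrt

# Toy, hypothetical PPI network: two tightly connected modules (a-d and e-h)
# joined by a single d-e interaction. This is not real BioGRID data.
PPI = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c", "e"],
    "e": ["d", "f", "g", "h"],
    "f": ["e", "g", "h"],
    "g": ["e", "f", "h"],
    "h": ["e", "f", "g"],
}

# Hypothetical function labels, known for only a subset of proteins.
LABELS = {"a": "kinase", "b": "kinase", "g": "transporter", "h": "transporter"}


def rwr_profile(adj, start, restart=0.3, iters=100):
    """Random walk with restart from `start`.

    Returns the visit probabilities over the sorted node list, used here
    as a crude node embedding (an assumption; node2vec learns its
    embeddings with a skip-gram model instead).
    """
    nodes = sorted(adj)
    p = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        # Restart mass returns to the start node at every step.
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for u in nodes:
            # The remaining mass is split evenly among neighbours.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return [p[n] for n in nodes]


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def predict_functions(adj, labels):
    """Assign each unlabeled protein the function of the nearest centroid."""
    emb = {n: rwr_profile(adj, n) for n in adj}
    centroids = {}
    for func in set(labels.values()):
        members = [emb[n] for n, f in labels.items() if f == func]
        centroids[func] = [sum(col) / len(members) for col in zip(*members)]
    return {
        n: max(centroids, key=lambda f: cosine(emb[n], centroids[f]))
        for n in sorted(adj)
        if n not in labels
    }


predictions = predict_functions(PPI, LABELS)
print(predictions)
```

On this toy graph, the unlabeled proteins c and d are assigned the label of their own module (kinase) and e and f are assigned transporter, reflecting the guilt-by-association assumption that network proximity implies shared function. A realistic pipeline would replace both stand-ins with learned embeddings and a classifier evaluated on held-out labels.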

2.3.2. Essential genes

PPI networks underpin the majority of intracellular communication. Many researchers have utilized topological properties of these networks to identify key regulators. Hub genes, the genes most important within a submodule, are often selected on the basis of node degree (also called degree centrality). It is now widely reported that hub genes correspond to essential genes [83]. Essential genes have more recently been associated with other topological characteristics, such as high betweenness centrality [84], a measure of the centrality of a node between submodules. If one imagined a network resembling an hourglass, it is likely that nodes with high degree centrality would exist within the top and bottom 'glass' bulb subnetworks, whereas genes with high betweenness centrality are those close to the bottleneck between bulbs, acting as the linchpin in communication between submodules.

Based on the assumption that essential genes are topologically distinct from their non-essential counterparts, researchers demonstrated how PPI networks can be used to predict the essentiality of genes [85]. Node embeddings based on PPI networks were generated via a biased random walk. A binary classifier was trained on existing known essential genes, using node embeddings as features. To explore the biological functions of the essential genes, they clustered genes by their node embedding, and performed GO functional enrichment analysis on the genes in these clusters. They found notable correlation with known important processes such as RNA splicing, ribosome biogenesis and Golgi vesicle transport.

2.3.3. Antitumor activity

Many foods are known to be rich in compounds with antitumor activity [86]; however, the cancer suppressive or preventative potential of all compounds has not been assessed experimentally. Veselkov et al. computed node embeddings for compounds by using random walks on a human PPI network [87]. Compounds were represented within the protein embedding space by using the known targets of the respective compound as starting nodes for the random walk. Using known anticancer drugs as the training set, a support vector machine classifier was used to predict which food compounds were candidates for cancer prevention or treatment.

2.4. Neighborhood detection applications

Neighborhood or community detection describes the process of clustering the latent vectors of nodes. Following the guilt-by-association principle, clusters in the high dimensional latent space of the embeddings correspond to biologically relevant communities within the KGs.

2.4.1. Visualization as validation

It is commonplace in graph machine learning for researchers to use neighborhood detection to validate the biological relevance of their model. The majority of embedding approaches encode functionally or structurally similar nodes close to each other in the embedding space. By clustering the embeddings in high dimensional space, and mapping to 2D space via dimensionality reduction, researchers are able to visualize nodes of interest, and identify to what extent their model has successfully encoded the biological network. The model gene2vec [88] generated embeddings based on pan-genome gene co-expression and clustered them using t-SNE. Coloring nodes by gene expression, the clustered graph successfully identified tissue-specific gene clusters. Goh et al. created a bipartite graph of diseases and their genetic associations, and then performed functional clustering of the diseases to create the human disease network. Neoplastic diseases constituted the largest cluster, with notable clusters of comorbidities such as diabetes and obesity, hypertension, asthma, and atherosclerosis [89] (see Figure 4).

2.4.2. Cancer driver gene detection

Another notable application of neighborhood detection has been in the detection of cancer drivers. Cantini et al. used a consensus model of five popular community detection algorithms to identify communities in transcription factor and miRNA co-targeting networks, PPI and gene co-expression networks for both tumor and healthy tissue [90]. Next, they compared communities detected in the multiple tumor and healthy tissue networks to identify both genes and associated functions only prevalent in cancer tissue. These candidate cancer drivers included known oncogenes and potential new oncogenic drivers.

3. The inherent biases in link prediction

3.1. Degree bias and literature bias

One of the most challenging problems in link prediction in biological networks is mitigating the inherent bias caused by the degree distribution of a graph being unrepresentative of the underlying distribution of the network. In contrast to a bias-free link prediction model, which infers the likelihood of an edge by the proximity of nodes within the network, a model biased by degree infers the likelihood of an edge by the connectivity of nodes. A biased model would predict that a disease and gene are connected based solely on how many independent connections each node has in the graph, without considering the biological rationale: how well connected the disease and gene are to each other. The magnitude of this problem is relative to the disparity between the degree distribution of the training graph and the degree distribution of the true underlying network. Literature bias is often responsible for this over-representation of a handful of well-researched nodes, causing the degree distribution to not represent the true distribution of the underlying biology.

Most embedding-based models applied to link prediction tasks are based on the guilt-by-association assumption: that structurally or functionally similar nodes will be encoded near each other in the n-dimensional space of the embedding. These methods assume that the downstream classifier or score function (see Figure 2) uses network proximity as the driving force behind the model's predictive power. In reality, the overwhelming majority of predictive power comes from node degree-based features, with local topology accounting for only a small portion of performance [91**]. In

Figure 4. Visualizing biological networks.


Illustration of the human disease network constructed by Goh et al. [89]. Nodes represent Mendelian traits and disorders, and edges indicate that the two diseases share at least one genetic
association. Reproduced in part from [89] with permission of PNAS. Copyright (2007) National Academy of Sciences, U.S.A.

other words, nodes which are highly connected are statistically more likely to be connected to other nodes, and thus have a higher prior probability of connection. To clarify, in a drug-treats-disease prediction task, imatinib would be predicted with high probability to treat diabetes type 1, mainly due to their connectivity, whilst the probability that marizomib treats diffuse intrinsic pontine glioma would be low, due to the rarity of both disease and drug. This issue is especially problematic when the prediction task is focused on predicting edge existence for low-degree nodes, for example, drug repurposing for rare diseases.

Training a model that depends on degree bias is not inherently problematic. If one imagines a PPI graph that accurately represents the biological system, hub genes have more connections, and thus are more likely to be connected to any other protein at random. A simplistic model using degree-derived features would accurately predict the connectivity of this network. The problem arises, however, when the degree distribution diverges from that of the underlying biological network. In this example, nodes that are over-represented in the graph will receive high probabilities, despite a lack of biological evidence that the node is central in the network. Simplistic models that depend on node degree fail to overcome this disparity between network and graph. In contrast, models that use network proximity are considerably more robust, and predictions are less likely to vary when the training graph distribution differs from the biological network. As we often do not know the ground truth, one can simply test a model on an external graph with a different biological distribution (but which captures the same information). Researchers [26**] demonstrated how a degree-biased model achieved a near perfect score (AUROC of 0.979) when tested on the same graph distribution that it was trained on (a drug-treats-disease graph based on DrugBank [92]). However, when tested on a separate graph (a drug-treats-disease graph based on DrugCentral [93]), the model achieved a score close to random (AUROC of 0.541). In contrast, their model that utilized network proximity achieved similar scores across both graphs (AUROC of 0.974 and 0.855 on DrugBank and DrugCentral, respectively).

Literature-derived networks often have vastly different degree distributions to that of the underlying network. They rely on the extraction of biological relationships stated in academic manuscripts. The connectivity of nodes in these networks is therefore largely governed by the quantity of research that is conducted in that area, and not by the underlying biology. For example, more than three quarters of protein research focuses on the 10% of proteins that were known before the genome was mapped, even though many proteins have now been linked to disease [94]. There is no discernible biological difference between well-researched and under-researched genes, save their level of research interest and tool availability. Studies have shown that therapeutic opportunities [95] and essentiality [96] of well-researched genes are similar to those in the unstudied 'dark' genome. Systematic genome-wide screens effectively eliminate research bias, and provide a much more accurate proxy for the true distribution of biological networks. Comparisons between networks based on

systematic screens and literature highlight the disparity between literature-derived KGs and the biology they aim to represent [91**]. This example should serve as a warning to researchers or machine learning practitioners performing link prediction on graphs. One must stay vigilant to this bias, as it is often possible to achieve seemingly satisfactory results based solely on the degree of the network, without considering either the topology of the network or the similarity of nodes.

3.2. Mitigation strategies

These results highlight the need for effective mitigation strategies to remove the reliance on degree-based features, instead encouraging models to learn network topology to infer the existence of edges. Whilst far from commonplace in the literature, researchers have now addressed the problem of degree bias, mitigating the bias at distinct points in the model training process.

Multiple embedding strategies exist to encode node neighborhoods. Researchers have applied a degree penalty to prevent over-representation of high degree nodes using a random walk embedding and skip-gram model [97]. Others have used a Bayesian method, explicitly providing the prior probability alongside the adjacency matrix to the model, preventing the encoding of node degree in the embeddings [98,99]. Many graphs have only positive edges, and negative edges are created by either randomly sampling an edge from all possible non-edges (node pairs without a connection), or by corrupting a node in a positive edge, replacing a node at random. Importantly, it has been shown that, under uniform sampling, positive and negative samples do not have the same degree distribution. By sampling nodes from the global degree distribution, models are forced to differentiate edges by their network proximity, and not their network connectivity, ultimately leading to improved and less biased performance [100,101]. Whilst the above methods try to prevent information on node degree from being provided to the model, some researchers have removed the reliance on degree by doing precisely the opposite. By explicitly providing the prior probability alongside network-based features to the model during training, the model relies on the degree-based features and does not learn them. During testing, a uniformly connected network is assumed, and a uniform prior is provided in place of the biased prior, yielding bias-free predictions [26**] [98].

4. Expert opinion

KGs have shown great promise in drug discovery, providing an answer to the pharmaceutical industry's 'big data' problem. KGs have opened the doors to the application of graph theory to drug discovery, harnessing powerful network algorithms to systematically 'fill in' the unknown areas of the genome and draw novel insights into the genes and mechanisms that underpin disease. There has been significant research interest in KGs in both academia and industry, using them principally for target identification and drug repurposing. KGs are principally used for link prediction by employing KG embedding methods. Whilst these methods are continually evolving, they remain relatively immature. Hereinafter, we present the author's opinion on the current shortcomings of KGs, the areas in which they need to be improved, and evaluate their utility in a drug discovery project.

KGs are fundamentally question answering tools. Questions such as does drug X treat disease Y? and does gene X regulate disease Y? have demonstrably been answered. However, this does not reflect the granularity and variety of questions asked by researchers and scientists in the drug discovery process. We need to work toward a universal and extensible system that can answer questions such as given pathway X, which compounds agonize targets assayed in only functional assays with potency <1 mm? and given diseases with the shared pathogenic mechanism Y, which targets have failed clinical trials at Phase I or II and why? and for disease Z, which targets have ligands in different stages of the development process with publications and/or patents describing these compounds? Work on this topic is already underway [102]. When KGs can answer these questions, their value will increase immeasurably.

Pathology is fascinatingly complex. This complexity is often not well-represented in KGs. Many research projects use publicly accessible KGs which provide a reductive model of disease (with edges such as drug-binds-gene, gene-associates-disease and drug-treats-disease). These graphs fail to represent either the genetic heterogeneity or the transient nature of disease. Whilst a drug repurposing link prediction model may successfully predict CFTR-associates-chronic_pancreatitis, ivacaftor-binds-CFTR, and ivacaftor-treats-chronic_pancreatitis, these generalizations do not reflect the complexity of the disease, or the prerequisites for ivacaftor to be an effective treatment. In reality, we want to be able to use a KG to approximate the causal reasoning of a team of researchers: "chronic pancreatitis is caused by loss of function of the CFTR gene. Mutations in CFTR cause an imbalance of calcium homeostasis, leading to early protease activation, fibrosis, inflammation and abdominal pain. Ivacaftor is used to treat a subset of cystic fibrosis patients via potentiation and correction of mutant CFTR, which restores the calcium homeostasis in endothelial cells. Patients with similar loss-of-function mutations in the CFTR gene could be treated with Ivacaftor. Whilst CFTR remains the main pathomechanism of chronic pancreatitis, other possible treatments include immunosuppressants,

antifibrotics, protease inhibitors, and analgesics". A KG that can deliver this level of granularity would be a fundamental asset in any drug discovery company.

KGs are mainly used in conjunction with KG embedding models. These models are based on reasoning-by-association (also called guilt-by-association). This is distinctly different from a causal model of the underlying biological mechanism. The most informative paths of many network embedding-based models are not describing biological paths (e.g. drug-inhibits-gene-causes-disease), but instead are describing similarities between source and target nodes (e.g. drug-resembles-drug-treats-disease and disease-resembles-disease-treats-drug) [26**] [45]. Moreover, most relation inference models capture neither the directionality nor the trend of the edge [103]. Whilst researchers have developed models that produce inference paths between source and target nodes to approximate the biological path [39,104], such paths are often not well-correlated with the underlying causal biological path. Perhaps we should strive to move away from models that simply associate biological components, and more toward models that accurately describe the underlying biological system. This problem seems endemic in the wider field of artificial intelligence. Gary Marcus and Ernest Davis echoed the problem, stating "we need to stop building computer systems that merely get better and better at detecting statistical patterns . . . and start building computer systems that from the moment of their assembly innately grasp three basic concepts: time, space and causality" [105]. Whilst KG embeddings remain an overpopulated area of research, with researchers competing to eke out the smallest increase in model performance, causal network reasoning remains a largely unexplored field. There have been a handful of notable network-based causal reasoning approaches that have been successfully applied to drug discovery [106–114]. We hope to see more causal models, built upon biologically-representative KGs.

In areas such as target identification, link prediction methods have demonstrated their utility in academia- and industry-led projects. Applications such as drug–target interaction, drug–drug interaction, and protein–ncRNA interaction remain academic exercises in graph theory, often surpassed by powerful deep learning approaches with features based solely on the physicochemical structures of the interacting entities (the power of which is exemplified by AlphaFold). We believe GML is best suited to the prediction of abstract entities such as diseases. Modeling physicochemical interactions should be left to structure-based approaches. Whilst it was assumed that graph embedding methods inferred edge existence via network proximity, it has become evident that the overwhelming majority of their predictive power comes simply from the connectivity of the nodes, and not their local neighborhood. This issue becomes especially problematic when using literature-derived KGs, where link prediction models strive to approximate the biologically incorrect degree distribution of a literature-derived network and not that of the underlying biological system. Mitigation strategies, more appropriate evaluation metrics and less biased graphs are desperately needed to correct this problem. Network medicine is based on the assumption that we can accurately model the biological systems that govern disease, applying graph theory to describe them mathematically. Whilst we are increasingly creating more and more data pertaining to these systems, we currently cannot sufficiently model them. KGs are undoubtedly a useful framework on which to build such approaches. To be able to develop informative computational models, we must strive toward building KGs which describe the complex dynamic biological systems of the human body, how they are dysregulated in the disease state, and how therapeutics act upon the systems. Whilst the dog days of phenotypic-based drug discovery have not yet passed, the dawn of target-based discovery is certainly upon us. Biologically-representative KGs will be instrumental in the era of systems biology.

Acknowledgments

The author would like to express their gratitude to Delphine Rolando, Rachel Hodos and Dane Corneil. Their expertise in drug discovery, graph machine learning, and knowledge graphs was instrumental in writing this review. Lastly, we thank Daniel Miskell for his insight over the years.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Funding

This manuscript was supported by BenevolentAI.

Declaration of interest

F MacLean is a full-time employee of BenevolentAI. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

ORCID

Finlay MacLean http://orcid.org/0000-0003-2779-179X

References

Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.

1. "Total global pharmaceutical RD spending 2012–2026". [cited 2021 Jul 03]. Available from: https://www.statista.com/statistics/309466/global-r-and-d-expenditure-forpharmaceuticals
2. "2020 FDA drug approvals". [cited 2021 Jul 03]. Available from: https://www.nature.com/articles/d41573-021-00002-0
3. "Ten years on: measuring the return from pharmaceutical innovation 2019". [cited 2021 Jul 03]. Available from: https://www2.deloitte.com/us/en/pages/life-sciences-andhealth-care/articles/measuring-return-from-pharmaceutical-innovation.html
4. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science. 2003;300(5617):286–290.
5. 1000 Genomes Project Consortium, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061.
6. Canela-Xandri O, Rawlik K, Tenesa A. An atlas of genetic associations in UK biobank. Nat Genet. 2018;50(11):1593–1599.
7. Leinonen R, Sugawara H, Shumway M, et al. The sequence read archive. Nucleic Acids Res. 2010;39(suppl 1):D19–D21.
8. Leinonen R, Akhtar R, Birney E, et al. The european nucleotide archive. Nucleic Acids Res. 2010;39(suppl 1):D28–D31.

9. Ponten F, Jirstrom K, Uhlen M. The human protein atlas—a tool for 29. Himmelstein DS, Baranzini SE. Heterogeneous network edge pre­
pathology. J Pathol. 2008;216(4):387–393. diction: a data integration approach to prioritize disease-associated
10. GTEx Consortium. The genotype-tissue expression (gtex) pilot ana­ genes. PLoS Comput Biol. 2015;11(7):e1004259.
lysis: multitissue gene regulation in humans. Science. 2015;348 30. Breit A, Ott S, Agibetov A, et al. OpenBioLink: a benchmarking
(6235):648–660. framework for large-scale biomedical link prediction. arXiv
11. Stathias V, Turner J, Koleti A, et al. Lincs data portal 2.0: next Preprint arXiv:1912 04616. 2019.
generation access point for perturbation-response signatures. 31. Womack F, McClelland J, Koslicki D. Leveraging distributed biome­
Nucleic Acids Res. 2020;48(D1):D431–D439. dical knowledge sources to discover novel uses for known drugs.
12. Tomczak K, Czerwinska P, Wiznerowicz M. The cancer genome atlas bioRxiv. 2019;765305.
(tcga): an immeasurable source of knowledge. Contemp Oncol. 32. Percha B, Altman, RB. A global network of biomedical relationships
2015;19(1A):A68. derived from text. Bioinformatics. 2018;34(15):2614–2624.
13. Ghandi M, Huang FW, Jane-Valbuena J, et al. Next-generation • Article of interest - Despite being not without their shortcom­
characterization of the cancer cell line encyclopedia. Nature. ings, literature-derived knowledge graphs are popular meth­
2019;569(7757):503–508. ods of rapidly generating biological knowledge graphs. This
14. Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a cancer paper provides an effective method of knowledge graph gen­
dependency map. Cell. 2017;170(3):564–576. eration from publicly available data sources. Their derived
15. Chadwick LH. The NIH roadmap epigenomics program data biological relationships are pleasingly complex, compared to
resource. Epigenomics. 2012;4(3):317–324. other efforts. Their evaluation of literature-derived relation­
16. Kozomara A, Griffiths-Jones S. MiRBase: integrating microrna anno­ ships against structured databases highlights the disparity
tation and deep-sequencing data. Nucleic Acids Res. 2010;39(suppl between structured and unstructured data sources, and the
1):D152–D157. need for effective edge harmonization methods.
17. Volders P-J, Helsens K, Wang X, et al. Lncipedia: a database for 33. Ioannidis VN, Song X, Manchanda S, et al. Drkg-drug repurposing
annotated human lncrna transcript sequences and structures. knowledge graph for COVID-19. arXiv. 2020.
Nucleic Acids Res. 2013;41(D1):D246–D251. 34. Belleau F, Nolin M-A, Tourigny N, et al. Bio2rdf: towards a mashup
18. Cui T, Zhang L, Huang Y, et al. Mndr v2. 0: an updated resource of to build bioinformatics knowledge systems. J Biomed Inform.
ncrna–disease associations in mammals. Nucleic Acids Res. 2018;46 2008;41(5):706–716.
(D1):D371–D374. 35. Chen B, Dong X, Jiao D, et al. Chem2bio2rdf: a semantic framework
19. Earm K, Earm YE. Integrative approach in the era of failing drug for linking and data mining chemogenomic and systems chemical
discovery and development. Integr Med Res. 2014;3(4):211–216. biology data. BMC Bioinformatics. 2010;11(1):255.
20. Rago L, Santoso B. “Drug regulation: history, present and future,” ¨. 36. Yue, X, Wang, Z, Huang, J, et al. Graph embedding on biomedical
Drug Benefit Risks. 2008;2:65–77. networks: methods, applications and evaluations. Bioinformatics.
21. “Novartis CEO who wanted to bring tech into pharma now explains 2020;36(4):1241–1251.
why it’s so hard,”. [cited 2020 Sep 30]. Available from: https://www. • Article of interest - For any researcher wishing to learn the
forbes.com/sites/davidshaywitz/2019/01/16/novartis-ceo-who- fundamentals of graph embeddings and their applications,
wanted-to-bring-tech-into-pharma-now-explains-why-its-so-hard, this review is a must-read.
accessed: 2020-september-30. 37. Gao F, Musial K, Cooper C, et al. Link prediction methods and their
22. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The fair guiding accuracy for different social networks and network metrics. Sci
principles for scientific data management and stewardship. Sci Programm. 2015;2015:1–13.
Data. 2016;3(1):1–9. 38. Cai H, Zheng VW, Chang K-C-C. A comprehensive survey of graph
23. Iams WT, Lovly CM. Molecular pathways: clinical applications embedding: problems, techniques, and applications. IEEE Trans
and future direction of insulin-like growth factor-1 receptor Knowledge Data Eng. 2018;30(9):1616–1637.
pathway blockade. Clin Cancer Res. 2015;21(19): 39. Xia X, “Knowledge Graph Embedding Methodologies,”. [cited 2020
4270–4277. Jul 03]. Available from: https://github.com/xinguoxia/
24. Rossi A, Firmani D, Matinata A, et al. Knowledge graph embedding KGE#methodologies
for link prediction: a comparative analysis. arXiv Preprint arXiv:2002 40. Hodos, RA, Kidd, BA, Khader, S, et al. Computational approaches to
00819. 2020. drug repurposing and pharmacology. Wiley Interdiscip Rev Syst
25. Zou X. A survey on application of knowledge graph. JPhCS. Biol Med. 2016;8(3):186.
2020;1487(1):012016. • Article of interest - This review highlights drug repurposing as
26. Gao Y, Li Y-F, Lin Y, et al. Deep learning on knowledge graph for a promising application of knowledge graphs. Knowledge
recommender system: a survey. arXiv Preprint arXiv:2004 00387. graphs and associated graph machine learning approaches
2020. only constitute a few of the many computational approaches
27. “Neo4j graph database. [cited 2021 Sep 12]. Available from: https:// that have been used for drug repurposing. This manuscript
neo4j.com provides a comprehensive summary of most other approaches.
28. Himmelstein, DS, Lizee, A, Hessler, C, et al. Systematic integration of 41. Talevi A, Bellera CL. Challenges and opportunities with drug repur­
biomedical knowledge prioritizes drugs for repurposing. Elife. posing: finding strategies to find alternative uses of therapeutics.
2017;6:e26726. Expert Opin Drug Discov. 2020;15(4):397–401.
•• Article of high interest - This seminal paper represents one of 42. Wang L, Lei Y, Gao Y, et al. Association of finasteride with prostate
the earliest attempts to train a link prediction model on cancer: a systematic review and meta-analysis. Medicine
a biomedical knowledge graph, to answer biological ques­ (Baltimore). 2020;99(15):e19486.
tions (in this case drug repurposing). This research area has 43. Jain P, Jain SK, Jain M. Harnessing drug repurposing for exploration
matured significantly since this manuscript. The knowledge of new diseases: an insight to strategies and case studies. Curr Mol
graph they developed is rather small using today’s stan­ Med. 2020;20. DOI:10.2174/1566524020666200619125404
dards, and research interest has moved away from pathway- 44. Ganzer CA, Jacobs AR, Iqbal F. Persistent sexual, emotional, and
based models to embedding-based models, in part due to cognitive impairment post-finasteride: a survey of men reporting
their scalability. However, all researchers and practitioners symptoms. Am J Men’s Health. 2015;9(3):222–228.
working in this space would benefit from understanding the 45. Poleksic A. Overcoming sparseness of biomedical networks to
provenance of their work. The authors also highlight the identify drug repositioning candidates. bioRxiv. 2020.
prior probability of connection problem in their manuscript. 46. Sosa DN, Derry A, Guo M, et al. A literature-based knowledge graph
This, however, is covered in more detail in their more recent embedding method for identifying drug repurposing opportunities
work. in rare diseases. bioRxiv. 2019;727925.
12 F. MACLEAN

47. Xu B, Liu Y, Yu S, et al. A network embedding model for pathogenic 70. Kuchaiev O, Rasajski M, Higham DJ, et al. Geometric de-noising of
genes prediction by multi-path random walking on heterogeneous protein-protein interaction networks. PLoS Comput Biol. 2009;5(8):
network. BMC Med Genomics. 2019;12(10):188. e1000454.
48. Gaudelet T, Day B, Jamasb AR, et al. Utilising graph machine 71. Xiao Z, Deng Y. Graph embedding-based novel protein interaction
learning within drug discovery and development. arXiv Preprint prediction via higher-order graph convolutional network. PloS One.
arXiv:2012 05716. 2020. 2020;15(9):e0238915.
49. Paliwal S, De Giorgio A, Neil D, et al. Preclinical validation of 72. Yang F, Fan K, Song D, et al. Graph-based prediction of
therapeutic targets predicted by tensor factorization on heteroge­ protein-protein interactions with attributed signed graph
neous graphs. Sci Rep. 2020;10(1):1–19. embedding. BMC Bioinformatics. 2020;21(1):1–16.
50. Amaral PP, Dinger ME, Mattick JS. Non-coding rnas in homeostasis, 73. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side
disease and stress responses: an evolutionary perspective. Brief effects with graph convolutional networks. Bioinformatics.
Funct Genomics. 2013;12(3):254–278. 2018;34(13):i457–i466.
51. Ji B-Y, You Z-H, Cheng L, et al. Predicting mirna-disease association 74. Lim H, Poleksic A, Xie L. Exploring landscape of drug-target-
from heterogeneous information network with grarep embedding pathway-side effect associations. AMIA Summits Translat Sci
model. Sci Rep. 2020;10(1):1–12. Proceed. 2018:132–141.
52. Zhou J-R, You Z-H, Cheng L, et al. Prediction of lncrna–disease 75. Zhang W, Chen Y, Liu F, et al. Predicting potential drug-drug
associations via an embedding learning hope in heterogeneous interactions by integrating chemical, biological, phenotypic and
information networks. Mol Ther Nucleic Acids. 2020;23:277-285. network data. BMC Bioinformatics. 2017;18(1):18.
53. Zheng Y, Peng H, Zhang X, et al. Old drug repositioning and new 76. Su C, Tong J, Zhu Y, et al. Network embedding in biomedical data
drug discovery through similarity learning from drug-target joint science. Brief Bioinform. 2020;21(1):182–197.
feature spaces. BMC Bioinformatics. 2019;20(23):605. 77. Sangar V, Blankenberg DJ, Altman N, et al. Quantitative
54. Luo Y, Zhao X, Zhou J, et al. A network integration approach for sequence-function relationships in proteins based on gene
drug-target interaction prediction and computational drug reposi­ ontology. BMC Bioinformatics. 2007;8(1):294.
tioning from heterogeneous information. Nat Commun. 2017;8 78. Grover A, Leskovec J, “node2vec: scalable feature learning for
(1):1–13. networks,” in Proceedings of the 22nd ACM SIGKDD international
55. Lim H, Gray P, Xie L, et al. Improved genome-scale multi-target conference on Knowledge discovery and data mining, 2016, USA. pp.
virtual screening via a novel collaborative filtering approach to 855–864.
cold-start problem. Sci Rep. 2016;6(1):1–11. 79. Stark C, Breitkreutz B-J, Reguly T, et al. Biogrid: a general repository
56. Ba-Alawi W, Soufan O, Essack M, et al. Daspfind: new efficient for interaction datasets. Nucleic Acids Res. 2006;34(suppl 1):D535–
method to predict drug–target interactions. J Cheminform. 2016;8 D539.
(1):15. 80. Kulmanov M, Khan MA, Hoehndorf R. Deepgo: predicting protein
57. Mizutani S, Pauwels E, Stoven V, et al. Relating drug–protein inter­ functions from sequence and interactions using a deep ontology
action network with drug side effects. Bioinformatics. 2012;28(18): aware classifier. Bioinformatics. 2018;34(4):660–668.
i522–i528. 81. Nariai N, Kolaczyk ED, Kasif S. Probabilistic protein function predic­
58. Wan F, Hong L, Xiao A, et al. Neodti: neural integration of neighbor tion from heterogeneous genome-wide data. Plos One. 2007;2(3):
information from a heterogeneous network for discovering new e337.
drug–target interactions. Bioinformatics. 2019;35(1):104–111. 82. Makrodimitris S, Van Ham RC, Reinders MJ. Automatic gene func­
59. Huang K, Fu T, Xiao C, et al. Deeppurpose: a deep learning tion prediction in the 2020’s. Genes (Basel). 2020;11(11):1264.
based drug repurposing toolkit. arXiv Preprint arXiv:2004 83. Goymer P. Why do we need hubs? Nat Rev Genet. 2008;9(9):651.
08919. 2020. 84. Chen S-J, Liao D-L, Chen C-H, et al. Construction and analysis of
60. Wallach I, Dzamba M, Heifets A. Atomnet: a deep convolutional protein-protein interaction network of heroin use disorder. Sci Rep.
neural network for bioactivity prediction in structure-based drug 2019;9(1):1–9.
discovery. arXiv Preprint arXiv:1510 02855. 2015. 85. Dai W, Chang Q, Peng W, et al. Network embedding the protein–
61. Senior AW, Evans R, Jumper J, et al. Improved protein structure protein interaction network for human essential genes identifica­
prediction using potentials from deep learning. Nature. 2020;577 tion. Genes (Basel). 2020;11(2):153.
(7792):706–710. 86. Lefranc F, Tabanca N, Kiss R. Assessing the anticancer effects
62. Reese JT, Unni DR, Callahan TJ, et al. KG-COVID-19: a framework to associated with food products and/or nutraceuticals using in vitro
produce customized knowledge graphs for covid-19 response. and in vivo preclinical development-related pharmacological tests.
Patterns. 2020;2(1):100155. In: Seminars in cancer biology. Vol. 46. Elsevier; 2017. p. 14–32.
63. Zhou Y, Hou Y, Shen J, et al. Network-based drug repurposing for 87. Veselkov K, Gonzalez G, Aljifri S, et al. Hyperfoods: machine intel­
novel coronavirus 2019-ncov/sars-cov 2. Cell Discov. 2020;6 ligent mapping of cancer-beating molecules in foods. Sci Rep.
(1):1–18. 2019;9(1):1–12.
64. Wang LL, Lo K, Chandrasekhar Y, et al. Cord-19: the covid-19 open 88. Du J, Jia P, Dai Y, et al. Gene2vec: distributed representation of
research dataset. ArXiv. 2020. genes based on co-expression. BMC Genomics. 2019;20(1):7–15.
65. Hsieh K, Wang Y, Chen L, et al. Drug repurposing for covid-19 using 89. Goh K-I, Cusick ME, Valle D, et al., “The human disease network,”
graph neural network with genetic, mechanistic, and epidemiolo­ Proceedings of the National Academy of Sciences, vol. 104, no. 21,
gical validation. arXiv Preprint arXiv:2009 10931. 2020. pp. 8685–8690, 2007, USA.
66. Gysi DM, Valle ID, Zitnik M, et al. Network medicine framework for 90. Cantini L, Medico E, Fortunato S, et al. Detection of gene commu­
identifying drug repurposing opportunities for covid-19. arXiv nities in multi-networks reveals cancer drivers. Sci Rep. 2015;5
Preprint arXiv:2004 07229. 2020. (1):17386.
67. Gasmi A, Tippairote T, Mujawdiya PK, et al. Neurological involve­ 91. Zietz M, Himmelstein DS, Kloster K, et al. The probability of edge
ments of sars-cov2 infection. Mol Neurobiol. 202 existence due to node degree: a baseline for network-based
68. Stebbing J, Phelan A, Griffin I, et al. Covid-19: combining antiviral predictions. Manubot, Tech Rep. 2020.
and anti-inflammatory treatments. Lancet Infect Dis. 2020;20 •• Article of high interest - This paper provides the most compre­
(4):400–402. hensive analysis of the problems arising from i) the degree
69. “Baricitinib receives emergency use authorization from the FDA for imbalance in graphs with long-tailed distributions, and ii) the
the treatment of hospitalized patients with COVID-19,”. [cited 2021 disparity between literature-derived biological networks and
Jan 02]. Available from: https://investor.lilly.com/news-releases those derived from systematic screens. Of particular interest is
/news-release-details/baricitinib-receives-emergency-use- the authors’ closed form approximation of the prior probabil­
authorization-fda-treatment ity of connection. This allows researchers and industry
EXPERT OPINION ON DRUG DISCOVERY 13

professionals to differentiate between predictions based on 103. Lee B, Zhang S, Poleksic A, et al. Heterogeneous multi-layered
network connectivity and proximity at almost no computa­ network model for omics data integration and analysis. Front
tional cost. Genet. 2020;10:1381.
92. Wishart DS, Knox C, Guo AC, et al. DrugBank: a knowledgebase for 104. Lin XV, Socher R, Xiong C. Multi-hop knowledge graph reasoning
drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36 with reward shaping. arXiv Preprint arXiv:1808 10568. 2018.
(suppl1):D901–D906. 105. Bishop JM. Artificial intelligence is stupid and causal reasoning
93. Avram S, Bologa CG, Holmes J, et al. DrugCentral 2021 supports won’t fix it. arXiv Preprint arXiv:2008 07371. 2020.
drug discovery and repositioning. Nucleic Acids Res. 2021;49(D1): 106. Liu A, Trairatphisan P, Gjerga E, et al. From expression footprints to
D1160–D1169. causal pathways: contextualizing large signaling networks with
94. Edwards AM, Isserlin R, Bader GD, et al. Too many roads not taken. carnival. NPJ Syst Biol Appl. 2019;5(1):1–10.
Nature. 2011;470(7333):163–165. 107. Rivas-Barragan D, Mubeen S, Guim-Bernat F, et al. Drug2ways:
95. Oprea TI, Bologa CG, Brunak S, et al. Unexplored therapeutic reasoning over causal paths in biological networks for drug dis­
opportunities in the human genome. Nat Rev Drug Discov. covery. bioRxiv. 2020.
2018;17(5):317. 108. Vidal M, Cusick ME, Barabasi A-L. Interactome networks and human
96. Hutchison CA, Chuang R-Y, Noskov VN, et al. Design and synthesis disease. Cell. 2011;144(6):986–998.
of a minimal bacterial genome. Science. 2016;351(6280):6280. 109. Broido AD, Clauset A. Scale-free networks are rare. Nat Commun.
97. Feng R, Yang Y, Hu W, et al. Representation learning for scale-free 2019;10(1):1–10.
networks. arXiv Preprint arXiv:1711 10755. 2017. 110. Dorogovtsev S, Mendes J, Samukhin A. Generic scale of the” scale-
98. Kang B, Lijffijt J, Bie TD. Conditional network embeddings. arXiv free” growing networks. arXiv Preprint Cond-mat/0011115. 2000.
Preprint arXiv:1805 07544. 2018. 111. Rohani N, Eslahchi C. Drug-drug interaction predicting by neural
99. Buyl M, De Bie T. Debayes: a bayesian method for debiasing net­ network using integrated similarity. Sci Rep. 2019;9(1):1–11.
work embeddings. arXiv Preprint arXiv:2002 11442. 2020. 112. Wouters OJ, McKee M, Luyten J. Estimated research and develop­
100. Lerer A, Wu L, Shen J, et al. Pytorch-biggraph: a large-scale graph ment investment needed to bring a new medicine to market,
embedding system. arXiv Preprint arXiv:1903 12287. 2019. 2009–2018. Jama. 2020;323(9):844–853.
101. Zheng D, Song X, Ma C, et al. Dgl-ke: training knowledge graph 113. Mohs RC, Greig NH. Drug discovery and development: role of basic
embeddings at scale. arXiv Preprint arXiv:2004 08532. 2020. biological research. Alzheimers Dementia. 2017;3(4):651–657.
102. Hamilton WL, Bajaj P, Zitnik M, et al. Embedding logical queries on 114. Xue S, Lu J, Zhang G. Cross-domain network representations.
knowledge graphs. arXiv Preprint arXiv:1806 01445. 2018. Pattern Recogn. 2019;94:135–148.
