The Little Book of Big Changes in AI-Powered Drug Discovery: Ebook
AI Advancements Appendix
AI and the Drug Discovery Pipeline Resources
Research
Commercial Cases
Insilico Medicine
Celeris Therapeutics
Cyclica
Academic Cases
DeepCE
DOCKSTRING
EGNNs
Artificial Intelligence Has Changed The Game
AI has launched itself from the pages of science fiction and disrupted our industry.
At this point, if you aren’t using AI for drug discovery and repurposing, you’re setting yourself up to be left behind. As the pace of innovation continues to surge, hundreds of startups are centering their entire organization around AI and seeing astounding results. Pharma companies aren’t being shy either, with numerous big names quickly shifting to integrate AI into their practices.
[...] applying AI as a means to reduce costs, increase speed, and ensure your time and efforts are in pursuit of successful drug candidates and treatments.
There are currently more than 100 drugs in the AI drug discovery pipeline, and numerous drugs have been making their way to clinical trials in a matter of years instead of decades.
As AI increasingly becomes standard practice, access to high-quality data is proving to be a true differentiator. You’ve likely heard, if not already said, “garbage in, garbage out” when it comes to drug data. As the move to AI continues to advance, it will only become more vital to have reliable, clean, and robust datasets. Without them, you’ll find yourself spending more and more time getting datasets ready before you can even begin your research.
[...] that enables anyone to use it to its fullest potential.
[Infographic: startups using AI [1]; scientific journals published each year [5]; AI-guided approaches]
COMMERCIAL
WHAT THEY’RE UP TO
Insilico has made it their mission to accelerate drug discovery and drug
development by continuously inventing and deploying new artificial intelligence
technologies. Nearly a decade old, they now have several oncology candidates in
their pipeline, and are pursuing the development of both drugs and biomarkers
in areas including fibrosis, infectious diseases, immunology, and the process
of aging.
One of the most noteworthy advancements Insilico has developed involved applying a generative pipeline to complete hit discovery, optimization, synthesis, and validation on candidates against discoidin domain receptor 1 (DDR1), a kinase target implicated in fibrosis and other diseases.
This approach uses a two-step algorithm. The first step involves learning a mapping of the chemical space; the second step explores this mapping using their proprietary deep reinforcement learning platform GENTRL (Generative Tensorial Reinforcement Learning) to learn DDR1 and common kinase inhibitors. GENTRL utilized three distinct Kohonen-based self-organizing maps (SOMs) as reward functions for the reinforcement learning step: the trending SOM (scores compound novelty based on patent disclosure dates), the general kinase SOM (distinguishes kinase inhibitors), and the specific kinase SOM (isolates DDR1 inhibitors).
This approach identified four active compounds [10], two of which were active in cellular assays, and one lead candidate that demonstrated favorable pharmacokinetics in mice.
Insilico’s DDR1 research was able to save significant development time, completing the process in 46 days, which was 15-fold faster than traditional approaches.
Insilico has also advanced our understanding of aging. Applying several supervised machine learning approaches, including neural networks, Insilico built a panel of tissue-specific biomarkers of aging.
Insilico was also able to identify the genes most important for age prediction [11], achieving a Pearson correlation of 0.91 with the actual age values of the muscle tissue samples.
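To make the age-prediction metric concrete, here is a minimal sketch of how a Pearson correlation between actual and predicted ages could be computed. The ages below are made-up illustrative values, not Insilico's data, and `pearson_r` is a hypothetical helper name chosen for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (assumption): actual vs. model-predicted ages in years.
actual_ages = [25, 34, 47, 58, 66, 72]
predicted_ages = [29, 31, 50, 55, 70, 69]

r = pearson_r(actual_ages, predicted_ages)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 means the predicted ages rise and fall in step with the actual ages, which is what a strong biomarker panel should deliver.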
WHAT THEY’RE UP TO
Celeris Therapeutics focuses on undruggable pathogenic proteins that cause serious conditions such as Alzheimer’s and Parkinson’s disease. Their current pipeline includes programs in neurology and oncology. Celeris is also using graph neural networks to predict the properties of molecules.
HOW THEY’RE USING AI
In their Xanthos Match Maker platform, Celeris encodes molecular structures in a graph along with features such as the number of hydrogens, valence, and aromaticity, and then applies deep neural networks in which information about molecules and proteins is processed into an increasingly high-level form.
To make a molecular graph more performant, Celeris uses additional ML techniques to improve molecular fragment linking. Linking two fragments that bind in nearby subpockets has become an important technique in fragment-based drug discovery for optimizing the binding potency of fragment hits.
In late 2021, Celeris published work [12] describing the use of Variational Autoencoders (VAEs) to augment existing data with a bond-angle-torsion coordinate system, trained on the ZINC dataset, that demonstrated an improvement of 9.3 percentage points (79 to 88.3) over the previous model (DeLinker).
Applying these approaches and models, Celeris was able to identify novel West Nile Virus NS2B/NS3 protease inhibitors [13]. The viral NS2B/NS3 protease is critical to the viral replication process. Using these deep learning approaches, Celeris was able to identify novel, unexplored drug candidates whose inhibition scores are statistically close to those of experimentally confirmed inhibitors, presenting new candidates for treating West Nile Virus.
WHY WE’RE SO IMPRESSED
Drugs are currently limited in their ability to treat diseases caused by pathogenic proteins. By establishing reliable, AI-backed methods to leverage the body’s natural cell-based mechanisms to degrade these proteins, they were able to identify novel potential treatment options.
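The graph encoding described for Xanthos Match Maker can be pictured with a small sketch. This is not Celeris's code: the data structures, function names, and the hand-entered ethanol features are assumptions made for illustration, and the single unweighted message-passing step stands in for the learned layers a real graph neural network would use.

```python
# Toy molecular graph for ethanol (CH3-CH2-OH), heavy atoms only.
# Feature values are entered by hand for illustration (assumption).
atoms = [
    {"symbol": "C", "num_hs": 3, "valence": 4, "aromatic": 0},
    {"symbol": "C", "num_hs": 2, "valence": 4, "aromatic": 0},
    {"symbol": "O", "num_hs": 1, "valence": 2, "aromatic": 0},
]
bonds = [(0, 1), (1, 2)]  # undirected edges between atom indices


def node_features(atoms):
    """Turn each atom into a feature vector: [hydrogens, valence, aromatic]."""
    return [[float(a["num_hs"]), float(a["valence"]), float(a["aromatic"])]
            for a in atoms]


def adjacency(n_atoms, bonds):
    """Build a neighbor list from the bond list."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors


def message_pass(features, neighbors):
    """One unweighted step: each atom adds its bonded neighbors' features to its own."""
    updated = []
    for i, own in enumerate(features):
        combined = list(own)
        for j in neighbors[i]:
            combined = [a + b for a, b in zip(combined, features[j])]
        updated.append(combined)
    return updated


features = node_features(atoms)
neighbors = adjacency(len(atoms), bonds)
hidden = message_pass(features, neighbors)
print(hidden)
```

After one step, each atom's vector already reflects its immediate chemical neighborhood. Stacking more steps, with learned weights, is how the representation becomes "increasingly high-level."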
WHAT THEY’RE UP TO
DeepCE
In a collaborative project between Ohio State University, City University of New York, and Cornell University, researchers Thai-Hoang Pham, Yue Qiu, Jucheng Zeng, Lei Xie, and Ping Zhang developed DeepCE, a mechanism-driven neural network-based method.
WHAT IT DOES
DeepCE expands on phenotype-based compound screening by modelling chemical substructure-gene and gene-gene associations, predicting the differential gene expression profile perturbed by de novo chemicals. Essentially, DeepCE uses deep learning to predict how drugs will influence the amounts of RNA, and therefore the amounts of various proteins, produced by a cell, which in turn provides insights into how the drug may modulate the disease.
To validate the model’s effectiveness, the authors utilized DrugBank as a source for clinically relevant drug-target and disease relationships. The results indicated that integrating gene expression profiles generated with DeepCE can solve problems related to unreliable data in the standard (L1000) dataset, leading to better performance on downstream prediction tasks.
Going one step further, DeepCE was applied to the full DrugBank dataset, combined with gene expression data from patients with SARS-CoV-2, to predict highly relevant drug candidates. This specific application of DeepCE represents the first work of phenotype-based drug repurposing for COVID-19.
HOW THEY’RE USING AI
DeepCE uses a neural network-based model for gene expression profile prediction consisting of several components. A graph convolutional network is used to learn a vector representation for each chemical compound from its graph structure. A feed-forward neural network is used to learn vector representations for cell line and chemical dose size.
These vector representations are then put into the interaction component (two multihead attention modules, concatenated into a normalization layer followed by a feed-forward layer and another normalization layer) to learn high-level feature associations, including chemical substructure-gene and gene-gene feature associations. Finally, the prediction component (a two-layer feed-forward neural network with a rectified linear unit activation function) takes the interaction component’s outputs as inputs to simultaneously predict the gene expression values for all L1000 [17] genes.
WHY WE’RE SO IMPRESSED
DeepCE offers improved performance compared to existing methods and has the advantage of providing data augmentation, which makes it possible to tackle areas with minimal or unreliable data.
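The interaction and prediction components are easier to picture with a stripped-down sketch. What follows is a simplification under stated assumptions, not the DeepCE implementation: it uses a single attention head instead of two multihead modules, omits the normalization layers, and runs on tiny hand-picked vectors and weights. All names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def predict(x, w1, b1, w2, b2):
    """Two-layer feed-forward network with a ReLU in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical toy inputs: 2 substructure vectors attend over 3 gene vectors.
substructures = [[1.0, 0.0], [0.0, 1.0]]
genes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = attention(substructures, genes, genes)

# Mean-pool the attended vectors, then predict one value per gene.
pooled = [sum(col) / len(attended) for col in zip(*attended)]
w1, b1 = [[0.5, -0.5], [0.3, 0.8]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
expression = predict(pooled, w1, b1, w2, b2)
print(expression)
```

The attention weights are where substructure-gene associations would be learned: each substructure decides how much to "listen" to each gene before the final layers turn the result into expression values.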
DOCKSTRING
WHAT IT DOES
One challenge in drug discovery is being able to utilize the full spectrum of knowledge available. Often, approaches that would be beneficial require the researchers to have a deep level of understanding of the underlying biology. One example of this is molecular docking. It requires extensive domain knowledge to set up experiments and train machine learning models correctly. DOCKSTRING was created to help address this challenge.
HOW THEY’RE USING AI
As machine learning methods for drug discovery continue to be developed, benchmarks are required to compare performance against experimental data, giving an indication of what performance can be expected in the real world.
While other benchmarking methods exist, DOCKSTRING offers standardized and accessible benchmarking capabilities based on molecular docking. The three-component DOCKSTRING bundle includes code, datasets, and benchmarking tasks which allow ML practitioners without biological expertise to obtain meaningful docking scores.
The code is an open-source Python package, and the dataset is the first to include docking poses. It is also the most extensive dataset that offers a complete matrix of docking scores for all ligand-target pairs. This feature enables experiments in transfer learning and multi-objective optimization.
WHY WE’RE SO IMPRESSED
DOCKSTRING, like other recent tools, is aiming to lower the barrier to entry for drug discovery startups. It goes beyond structure-based modelling and brings more complex techniques for predicting binding affinity into more ligand design pipelines.
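To see why a complete matrix of docking scores matters for multi-objective work, consider a minimal sketch. It does not use the DOCKSTRING package or its data: the ligand and target names, the scores, and the helper functions are all hypothetical, and it assumes that a lower (more negative) docking score indicates stronger predicted binding.

```python
# Hypothetical complete score matrix: every ligand has a score for every target.
# Assumption: lower (more negative) scores mean stronger predicted binding.
scores = {
    "ligand_a": {"target_1": -9.1, "target_2": -6.0},
    "ligand_b": {"target_1": -8.4, "target_2": -8.9},
    "ligand_c": {"target_1": -7.2, "target_2": -5.1},
}


def is_complete(scores):
    """True when every ligand has been scored against the same set of targets."""
    target_sets = [set(row) for row in scores.values()]
    return all(t == target_sets[0] for t in target_sets)


def selectivity(row, on_target, off_target):
    """Lower is better: strong on-target binding, weak off-target binding."""
    return row[on_target] - row[off_target]


def rank_by_selectivity(scores, on_target, off_target):
    """Order ligand names from most to least selective for on_target."""
    return sorted(
        scores,
        key=lambda name: selectivity(scores[name], on_target, off_target),
    )


if not is_complete(scores):
    raise ValueError("selectivity needs a score for every ligand-target pair")

ranking = rank_by_selectivity(scores, "target_1", "target_2")
print(ranking)
```

Because no ligand-target pair is missing, an objective that combines two or more targets can be evaluated for every ligand without imputation, which is the property that makes this kind of experiment straightforward.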
EGNNs
Satorras, Hoogeboom, and Welling introduced the EGNN architecture for graphs that is translation, rotation, reflection, and permutation equivariant.
Trained and tested against the QM9 [19, 20] dataset (a standard in ML for chemical property prediction tasks), Equivariant Graph Neural Networks (EGNNs) produce highly competitive results in all property prediction tasks while remaining simple, not requiring the use of higher-order molecular representations, molecular angles, or spherical harmonics.
The EGNN-based model can predict all features from the QM9 dataset including equilibrium geometries, frontier orbital eigenvalues, dipole moments, harmonic frequencies, polarizabilities, and thermochemical energetics corresponding to atomization energies, enthalpies, and entropies at ambient temperature.
Graph Neural Networks (GNNs) can accelerate the drug discovery process by providing an ability to analyze molecules and their properties at a previously unattainable level, and EGNNs in particular represent a step forward in terms of simplicity and efficiency.
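Equivariance can sound abstract, so here is a small sketch of the idea. It follows the general shape of an EGNN update (messages built only from node features and squared distances, coordinates moved along relative position vectors), but it is a simplification: scalar node features, a fully connected graph, and fixed stand-in functions where the real architecture uses learned networks. All names and constants are assumptions for illustration.

```python
import math


def egnn_layer(h, x):
    """One simplified EGNN-style update on a fully connected graph.

    h: list of scalar node features; x: list of 3D coordinates (needs >= 2 nodes).
    """
    n = len(h)
    new_h, new_x = [], []
    for i in range(n):
        message_sum = 0.0
        shift = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [a - b for a, b in zip(x[i], x[j])]
            dist2 = sum(d * d for d in diff)
            # Stand-in for the learned edge function: depends only on
            # rotation- and translation-invariant quantities.
            m_ij = math.tanh(h[i] + h[j] - 0.1 * dist2)
            # Stand-in for the learned coordinate weight.
            weight = 0.5 * m_ij
            message_sum += m_ij
            shift = [s + d * weight for s, d in zip(shift, diff)]
        # Coordinates move along relative vectors, so they stay equivariant.
        new_x.append([xi + s / (n - 1) for xi, s in zip(x[i], shift)])
        # Stand-in for the learned node update.
        new_h.append(math.tanh(h[i] + message_sum))
    return new_h, new_x


def translate(points, t):
    return [[a + b for a, b in zip(p, t)] for p in points]


def rotate_z90(points):
    """Rotate every point by 90 degrees about the z-axis."""
    return [[-p[1], p[0], p[2]] for p in points]


h = [0.1, -0.3, 0.5]
x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]

h_out, x_out = egnn_layer(h, x)
h_rot, x_rot = egnn_layer(h, rotate_z90(x))

# Rotating the input and then applying the layer matches applying the
# layer and then rotating the output (up to floating-point error).
max_gap = max(abs(a - b)
              for p, q in zip(x_rot, rotate_z90(x_out))
              for a, b in zip(p, q))
print(f"largest rotation mismatch: {max_gap:.2e}")
```

Because the model respects these symmetries by construction, it does not have to learn from data that a rotated molecule is still the same molecule.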
If there’s one thing we know for sure, AI is at its best, and your research will be too, when you have the highest quality drug data that spans a vast range and depth of detail.
In order to ensure that our data is of the highest quality, we uphold strict criteria. Before we are satisfied with the quality of our datasets, we ask ourselves a series of questions.
Does it have quality coverage?
Coverage means knowing that our data sufficiently captures all relevant medical information.
Is it consistent?
All data must be input in a consistent manner. At DrugBank we have strict curation specifications that all of our data must meet before it is incorporated in our datasets. By standardizing this multi-step peer review process we ensure consistency and accuracy.
With the help of artificial intelligence, our team of medical and scientific experts gathers, authors, verifies, and organizes all of the latest, most relevant biomedical information into one machine-learning-ready platform. This platform is accessible through data downloads or software integrations and is constantly updated to include the latest findings.
Adverse Effects, Metabolism, Indications, Targets, Protein Relationships, Chemical Structure, Drug Categories, Pharmacology
We’re working to augment human intelligence so that the world’s medical information can be used to its fullest potential and ensure that everyone has access to the best possible medical outcomes.
Our datasets are ideal for all kinds of machine learning and drug discovery applications. As a result we get to work alongside many leading researchers and institutions.
Contact us to learn more about DrugBank, our comprehensive drug database, and potential applications.
drugbank.com/datasets
Resources
PAPERS
SURVEY PAPERS
Walters and Barzilay, 2021. Critical assessment of AI in drug discovery.
Coley, 2020. Defining and Exploring Chemical Spaces.
Chuang et al, 2020. Learning Molecular Representations for Medicinal Chemistry.
Walters and Barzilay, 2020. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction.
Cai et al, 2020. Transfer Learning for Drug Discovery.
REPRESENTATION AND TRANSFER LEARNING
Ahmad et al, 2021. ChemBERTa-2: Towards Chemical Foundation Models. [Code]
Satorras et al, 2021. E(n) Equivariant Graph Neural Networks. [Code]
Townshend et al, 2021. ATOM3D: Tasks On Molecules in Three Dimensions.
Chuang and Keiser, 2020. Attention-Based Learning on Molecular Ensembles.
Li and Fourches, 2020. Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. [Code]
GENERATIVE ALGORITHMS
Bengio et al, 2021. Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation. [Code]
Berenger and Tsuda, 2021. Molecular generation by Fast Assembly of (Deep)SMILES fragments. [Code]
Gao et al, 2021. Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design. [Code]
Takeuchi et al, 2021. R-group replacement database for medicinal chemistry.
Imrie et al, 2020. Deep Generative Models for 3D Linker Design. [Code]
Jin et al, 2020. Hierarchical Generation of Molecular Graphs using Structural Motifs. [Code]
Polishchuk, 2020. CReM: chemically reasonable mutations framework for structure generation. [Code]
HIT FINDING AND POTENCY PREDICTION
Bender et al, 2021. A practical guide to large-scale docking.
García-Ortegón et al, 2021. DOCKSTRING: easy molecular docking yields better benchmarks for ligand design. [Code] [Data]
Graff et al, 2021. Accelerating high-throughput virtual screening through molecular pool-based active learning. [Code]
Gentile et al, 2020. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. [Code]
ADME AND TOXICITY PREDICTION
Siramshetty et al, 2021. Validating ADME QSAR Models Using Marketed Drugs.
Göller et al, 2020. Bayer’s in silico ADMET platform: a journey of machine learning over the past two decades.
Ryu et al, 2020. DeepHIT: a deep learning framework for prediction of hERG-induced cardiotoxicity. [Code]
SYNTHETIC ACCESSIBILITY AND RETROSYNTHETIC PLANNING
Fortunato et al, 2020. Data augmentation and pretraining for template-based retrosynthetic prediction in computer-aided synthesis planning.
Koch et al, 2020. Reinforcement Learning for Bioretrosynthesis.
Somnath et al, 2020. Learning Graph Models for Retrosynthesis Prediction.
VISUALIZATION AND INTERPRETABILITY
Humer et al, 2021. ChemInformatics Model Explorer (CIME): Exploratory analysis of chemical model explanations. [Code]
Matveieva and Polishchuk, 2021. Benchmarks for interpretation of QSAR models. [Code]