

Tetko and Engkvist J Cheminform (2020) 12:74
https://doi.org/10.1186/s13321-020-00475-y
Journal of Cheminformatics

EDITORIAL Open Access

From Big Data to Artificial Intelligence: chemoinformatics meets new challenges
Igor V. Tetko1,2* and Ola Engkvist3

Abstract 
The increasing volume of biomedical data in chemistry and life sciences requires the development of new methods and approaches for their analysis. Artificial Intelligence and machine learning, especially neural networks, are increasingly used in the chemical industry, in particular with respect to Big Data. This editorial highlights the main results presented during the special session of the International Conference on Neural Networks organized by the “Big Data in Chemistry” project and offers perspectives on the future progress of the field.

The analysis and exploitation of Big Data was the cornerstone of the “Big Data in Chemistry” (BIGCHEM) project and of this special issue, which was prepared following the International Conference on Neural Networks (ICANN2019). In total, 17 articles, including 15 contributions co-authored by BIGCHEM PhD students and partners, were published in this issue. Its themes covered many different aspects of the use of Big Data in medicinal chemistry [1, 2] that were actively pursued and advanced during the project. The articles in the issue can be categorized into two main groups.

The first group deals with machine learning methods to improve the analysis of large datasets such as those of high-throughput screening (HTS) campaigns. Rodríguez-Pérez et al. [3] compared structure-based fingerprints and protein–ligand interaction fingerprints (IFPs) for the prediction of ligand binding modes for protein kinases. The authors showed that including target-relevant information via IFPs improved predictions of the binding modes by about 10% compared to the use of traditional atom environment fingerprints. Laufkötter et al. [4] demonstrated that augmenting chemical structure descriptors with bioactivity-based fingerprints derived from HTS data provides better performance and, importantly, also superior scaffold hopping capability. Analogously, QSAR-derived affinity fingerprints (QAFFP) [5, 6] outperformed classical Morgan fingerprints for scaffold hopping. While Morgan fingerprints, owing to their robustness and performance for small molecules (see the review of David et al. [7]), are frequently used as a gold standard in, e.g., virtual screening and target prediction, they might not be optimal for larger molecules, such as peptides. MinHashed Atom-Pair fingerprints with a diameter of up to four bonds (MAP4) [8] were introduced as a universal fingerprint providing good results for various targets. HTS data are frequently imbalanced, with only a few active compounds: COVER (conformational oversampling as data augmentation for molecules) generates multiple conformations of molecules in order to provide an efficient data balancing mechanism for the underrepresented class [9]. All these methodological studies are important for building better models for Big Data.

The second group of articles deals with novel machine-learning algorithms such as the use of generative models (GMs) for molecular de novo design in drug discovery. BIGCHEM was one of the originators in this area of research with its pioneering works on applying Recurrent Neural Networks (RNN) with reinforcement learning and variational autoencoders for molecular design [10, 11], as reviewed elsewhere [12]. LatentGAN represents one of the most advanced developments of GMs by combining an autoencoder and a generative adversarial neural network [13]. Overfitting can reduce the diversity of autoencoder-generated structures. Generative Examination Networks (GEN) use randomized SMILES and early stopping [14] to prevent this [15]. The effect of randomized SMILES in improving the quality of GMs is also confirmed by extensive benchmarking based on GDB-13 [16]. Another method to increase the diversity of generated structures was proposed by Blaschke et al. [17], who use memory-assisted reinforcement learning for this purpose. The methods developed in these studies are general and can be used to enhance other GMs such as scaffold decoration [18] or Mol-CycleGAN [19]. New types of deep learning algorithms based on Message Passing Neural Networks [20] and Transformer Convolutional Neural Networks (CNN) [21] were also introduced.

There is a significant difference between the two groups of articles. The studies in the first group mainly explore traditional machine learning methods, such as Random Forest, Support Vector Machines, etc., that are based on traditional molecular representations as a vector of descriptors [7]. Those studies could be performed using traditional toolkits with little or no programming effort. In contrast, the generative models were based on novel deep learning architectures such as CNN, RNN, Long Short-Term Memory, Transformers, etc. These methods are more innovative, and most of them were introduced, developed and/or implemented by the authors. All of the studies from this second group thus required significant programming skills and expertise in modern toolkits such as TensorFlow, Keras, PyTorch, etc., which are becoming a prerequisite for getting a position in industry in addition, of course, to excellent knowledge of the basic disciplines. Prospective PhD students who plan to build their careers in the field of chemoinformatics and Artificial Intelligence (AI) should not overlook these requirements.

One of the major differences of these new methods is their ability to infer statistical dependencies directly from chemical structures, which can be represented as text, e.g. SMILES [21, 22], chemical graphs [20], or 3D images [23]. The benchmarking studies [20, 22] show that such methods can achieve similar or better performance compared to traditional methods for classical tasks such as QSAR [24] while at the same time allowing for intuitive interpretation of models [21]. Moreover, they can be used to address very different tasks, such as the aforementioned generation of molecules with desired properties and/or the prediction of single-step (retro)synthesis [25, 26], or even complete retro-synthesis [27, 28], that could not be achieved with traditional methods. All these approaches are part of the emerging area of AI, which is going to drive the future of chemoinformatics.

AI is fast becoming a ubiquitous part of modern life, and is also increasingly employed in the pharmaceutical industry to automate key steps in drug design. Compared to the Big Data challenge of “how to best analyse the Big Data” [1, 2], future progress in this field is linked to the need for explainable, “chemistry aware” methods. Such methods should allow the elucidation of the molecular basis of compound activity, directly suggest new compounds with improved properties, or optimise routes for synthesis supported by chemical knowledge. These and other topics, such as interpretable deep learning, use of knowledge elicitation from human experts, machine learning and molecular dynamics, language and quantum chemistry based retro-synthesis prediction, scalable multi-objective synthesis route optimization, methods for scaffold hopping, uncertainty estimation of AI methods, etc., will be investigated within the “Advanced machine learning for Innovative Drug Discovery” project (AIDD, http://ai-dd.eu). This project will employ 16 fellows starting in January 2021, who will receive training and full support in theoretical and practical skills from their supervisors and via various network activities. While not a direct continuation of BIGCHEM, AIDD will definitely contribute to the further development of the successful methods that originated from the previous network.

Advances in this field critically depend on the availability of open source software, which is important for sustainable progress and the sharing of results. A distinctive feature of BIGCHEM was the voluntary decision of several of its partners to release the source code for their methodological developments, which dramatically boosted the respective research areas. For example, the publicly available source code [29] from the REINVENT [10] article has been forked 75 times since its publication, which definitely contributed to a rigorous validation and wide acceptance of the published results by the scientific community. The same principle will be extended in AIDD, where all partners have agreed to release the source code of their individual projects to improve their dissemination.

In summary, this special issue comprises a carefully selected collection of articles on Big Data, most of which were contributed by BIGCHEM partners or reported during ICANN2019 (http://e-nns.org/icann2019). Considering the great success of the project, which contributed about 70 publications that were cited nearly 1000 times in 2020 alone (https://scholar.google.com/citations?user=eLncF6MAAAAJ), as well as that of ICANN2019, which was attended by an all-time record of 500 participants and resulted in five volumes of proceedings [30], we believe that this special issue will be of great interest to the readers of the journal.

*Correspondence: [email protected]
1 Helmholtz Zentrum München-German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
Full list of author information is available at the end of the article

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Acknowledgements
This study was partially funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training Network European Industrial Doctorate grant agreement No. 676434, “Big Data in Chemistry” (http://bigchem.eu). The article reflects only the authors’ view and neither the European Commission nor the Research Executive Agency are responsible for any use that may be made of the information it contains. The authors thank BIGCHEM partners for their fruitful collaboration during the project. IVT is CEO and founder of BIGCHEM GmbH, which licenses the On-line Database and Chemical Modelling Environment (http://ochem.eu), that was used in several studies reported in this editorial. OE declares that he has no actual or potential conflicts of interests.

Author details
1 Helmholtz Zentrum München-German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany. 2 BIGCHEM GmbH, Valerystr. 49, 85716 Unterschleißheim, Germany. 3 Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden.

Received: 18 November 2020 Accepted: 18 November 2020

References
1. Tetko IV, Engkvist O, Koch U, Reymond JL, Chen H (2016) BIGCHEM: challenges and opportunities for Big Data analysis in chemistry. Mol Inform 35(11–12):615–621
2. Tetko IV, Engkvist O, Chen H (2016) Does “Big Data” exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem 8(15):1801–1806
3. Rodríguez-Pérez R, Miljković F, Bajorath J (2020) Assessing the information content of structural and protein–ligand interaction representations for the classification of kinase inhibitor binding modes via machine learning and active learning. J Cheminform 12(1):36
4. Laufkötter O, Sturm N, Bajorath J, Chen H, Engkvist O (2019) Combining structural and bioactivity-based fingerprints improves prediction performance and scaffold hopping capability. J Cheminform 11(1):54
5. Cortés-Ciriano I, Škuta C, Bender A, Svozil D (2020) QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 12(1):41
6. Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJP, Tetko IV, Bender A, Svozil D (2020) QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 12(1):39
7. David L, Thakkar A, Mercado R, Engkvist O (2020) Molecular representations in AI-driven drug discovery: a review and practical guide. J Cheminform 12(1):56
8. Capecchi A, Probst D, Reymond JL (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12(1):43
9. Hemmerich J, Asilar E, Ecker GF (2020) COVER: conformational oversampling as data augmentation for molecules. J Cheminform 12(1):18
10. Olivecrona M, Blaschke T, Engkvist O, Chen H (2017) Molecular de-novo design through deep reinforcement learning. J Cheminform 9(1):48
11. Blaschke T, Olivecrona M, Engkvist O, Bajorath J, Chen H (2018) Application of generative autoencoder in De Novo molecular design. Mol Inform 37(1–2):1700123
12. Engkvist O, Arús-Pous J, Bjerrum EJ, Chen H: Chapter 13 Molecular De Novo Design Through Deep Generative Models. Artificial Intelligence in Drug Discovery. The Royal Society of Chemistry; 2021. pp. 272–300.
13. Prykhodko O, Johansson SV, Kotsias P-C, Arús-Pous J, Bjerrum EJ, Engkvist O, Chen H (2019) A de novo molecular generation method using latent vector based generative adversarial network. J Cheminform 11(1):74
14. Tetko IV, Livingstone DJ, Luik AI (1995) Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inf Comput Sci 35(5):826–833
15. van Deursen R, Ertl P, Tetko IV, Godin G (2020) GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J Cheminform 12(1):22
16. Arús-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond J-L, Chen H, Engkvist O (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11(1):71
17. Blaschke T, Engkvist O, Bajorath J, Chen H (2020) Memory-assisted reinforcement learning for diverse molecular de novo design. J Cheminform 12(1):68
18. Arús-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond J-L, Chen H, Engkvist O (2020) SMILES-based deep generative scaffold decorator for de-novo drug design. J Cheminform 12(1):38
19. Maziarka Ł, Pocha A, Kaczmarczyk J, Rataj K, Danel T, Warchoł M (2020) Mol-CycleGAN: a generative model for molecular optimization. J Cheminform 12(1):2
20. Withnall M, Lindelöf E, Engkvist O, Chen H (2020) Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. J Cheminform 12(1):1
21. Karpov P, Godin G, Tetko IV (2020) Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform 12(1):17
22. Tetko IV, Karpov P, Bruno E, Kimber TB, Godin G: Augmentation is what you need! Artificial Neural Networks and Machine Learning—ICANN 2019: Workshop and Special Sessions: 17th–19th September 2019; Munich. Springer International Publishing. pp. 831–835.
23. Iqbal J, Vogt M, Bajorath J (2020) Activity landscape image analysis using convolutional neural networks. J Cheminform 12(1):34
24. Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A et al (2020) Correction: QSAR without borders. Chem Soc Rev 49(11):3716
25. Tetko IV, Karpov P, Van Deursen R, Godin G (2020) State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat Comm 11(1):1–11
26. Karpov P, Godin G, Tetko IV: A Transformer Model for Retrosynthesis. In: Artificial Neural Networks and Machine Learning—ICANN 2019: Workshop and Special Sessions: 17th–19th September 2019; Munich. Springer International Publishing. pp. 817–830.
27. Thakkar A, Bjerrum EJ, Engkvist O, Reymond J-L: Neural network guided tree-search policies for synthesis planning. Artificial Neural Networks and Machine Learning—ICANN 2019: Workshop and Special Sessions: 17th–19th September 2019; Munich. Springer International Publishing. pp. 721–724.
28. Genheden S, Thakkar A, Chadimová V, Reymond J-L, Engkvist O, Bjerrum E (2020) AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J Cheminform 12(1):70
29. REINVENT [https://github.com/MarcusOlivecrona/REINVENT]
30. Tetko IV, Theis F, Karpov P, Kůrková V (2019) Artificial Neural Networks and Machine Learning—ICANN 2019: 28th International Conference on Artificial Neural Networks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). volumes 11727–11731 LNCS

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
