Improving The Prediction of Drug-Target Interactions Using Machine - Documentation
Submitted by
Supervised by
Cairo, Egypt
Jun 2024
Table of Contents
Abstract
List of Figures
List of Tables
List of Abbreviations
CHAPTER 1: Business Case for Developing a Predictive Drug Discovery ML Model
1.1 Introduction:
1.2. Problem Statement
1.3. Scope:
3.3 Feature Selection
3.4 Model Training and Validation
1.4. Tentative Project Time-line:
1.5. Conclusion:
CHAPTER 2: Literature Review
2.1. Introduction
1.1 Background
2.2. Fundamentals of Drug-Target Interactions
2.3. Machine Learning Techniques in Predicting Drug-Target Interactions
2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction
2.5. Challenges and Limitations
2.6. Future Directions
2.7. Conclusion
CHAPTER 3: System Analysis and design
3.1. System Architecture
3.2. System requirements specification:
2.1 Functional requirements:
2.1.1. Requirement specifications:
2.2 User stories:
2.3 work backlog:
2.3.2 Project duration
2.4. Use-Case Diagram:
2.5 Actor Description:
CHAPTER 4: Conclusion and Future work
4.1 Introduction:
4.3 Code and Results:
3.1 Collecting bioactivity data from ChEMBL
3.1 Downloading the two files padel.zip and padel.sh
3.3 Training Model
3.4 Testing Model
References
Abstract
Predicting drug-target interactions (DTIs) is crucial for drug discovery, providing insights
into the efficacy and safety of potential therapeutic compounds. Recent advancements in
machine learning (ML) have significantly enhanced the accuracy and efficiency of DTI
predictions by analyzing complex biological data and identifying patterns that traditional
methods may overlook. This project employs a structured methodology comprising several
critical steps. Initially, we focus on data collection, assembling datasets enriched with
specific characteristics such as physicochemical properties essential for absorption,
specificity, and low toxicity. These datasets are then processed to generate mathematical
descriptors from molecular sequences, transforming them into matrices suitable for ML
algorithm processing. Following this, we perform feature selection to identify the most
relevant subset of variables, minimizing redundancy while preserving biologically
significant information. The refined dataset is subsequently used to train and validate ML
models, employing optimal algorithms and robust validation techniques such as cross-
validation. This systematic approach ensures the development of predictive models
capable of enhancing drug discovery by accurately forecasting drug interactions and
potential adverse effects.
List of Figures
Figure 1.1: Drug discovery stages from target identification and validation
Figure 1.2: Stages in the discovery of new drugs in the context of precision medicine
Figure 1.3: Machine learning methodology commonly used for drug discovery
Figure 3.1: Prediction of Drug-Target Interactions Using Machine Learning system components
List of Tables
List of Abbreviations
CHAPTER 1: Business Case for
Developing a Predictive Drug Discovery
ML Model
1.1 Introduction:
The process of discovering new drugs is fraught with challenges that impede the
development of effective and safe therapeutic compounds. Over recent decades, the
pharmaceutical industry has faced increasing difficulties, including escalating research and
development (R&D) costs and declining approval rates for new drugs. The number of new
drugs approved by the US Food and Drug Administration (FDA) per billion US dollars
(inflation-adjusted) spent on R&D has halved approximately every nine years. This trend,
known as Eroom's Law, reflects a significant decrease in R&D productivity. The pattern of
decline is consistent over various ten-year periods, even when accounting for different
assumptions about the delay between R&D spending and drug approval.
Furthermore, many potential drugs fail during clinical trials due to unforeseen toxicity or
lack of efficacy. Between 2013 and 2015, there were 218 reported failures between Phase II
(including Phase I/II) and submission, 174 of which had stated reasons for failure
(Harrison, 2016). These setbacks not only represent substantial financial losses but also
delay the availability of potentially life-saving treatments.
The complexity of human biology and the vastness of chemical space add to the
challenges. Human biological systems are intricate and unpredictable, making it difficult to
foresee how a drug will interact with them. Additionally, chemical space, which
encompasses all possible small molecules, is so vast that only a limited portion of it can
be explored, which poses a significant hurdle in identifying compounds with optimal
therapeutic properties.
Figure 1.1: Drug discovery stages from target identification and validation until filing
with the FDA (Food and Drug Administration); IND (Investigational New Drug); NDA (New Drug
Application) [4]
1.2. Problem Statement
The drug discovery process faces numerous challenges that significantly impact the
efficiency and success rate of developing new therapeutic compounds. Key issues include
escalating costs, declining approval rates, and unforeseen failures during clinical trials.
These challenges are compounded by the complexities of human biology and the
limitations of chemical space exploration.
Key Challenges
4. Limited Chemical Space:
○ The vastness of chemical space, which encompasses all possible small
molecules, makes it difficult to identify compounds with optimal therapeutic
properties. Navigating this space efficiently is crucial for discovering effective
drugs.
Figure 1.2 Stages in the discovery of new drugs in the context of precision medicine.
Figure 1.3 Machine Learning methodology commonly used for drug discovery.
1.3. Scope:
The ML methodology applied in drug discovery comprises the following steps:
1. Data Collection:
○ Gathering datasets with specific characteristics, including physicochemical
properties that aid in absorption, specificity, and low toxicity.
2. Mathematical Descriptors:
○ Generating mathematical descriptors from molecular sequences to convert
them into matrices suitable for ML algorithm processing.
3. Feature Selection:
○ Identifying the best subset of variables to reduce redundant data and retain
biologically relevant information.
4. Model Training and Validation:
○ Training ML models with the optimal subset of variables, selecting
appropriate algorithms, and validating the models using techniques such as
cross-validation.
The goal of data collection is to gather a comprehensive, high-quality dataset that
describes various compounds and their characteristics. The main considerations are:
Types of Data
The data collected typically includes information about the physicochemical properties of
various compounds, such as molecular weight, polarity, and charge. It must also include
characteristics that allow the compounds to be easily produced and handled in the
laboratory, excluding large proteins or extremely complex molecules. These properties
influence how a compound interacts with biological systems and can be important
predictors of its potential as a drug.
Format of Data
● SMILES (Simplified Molecular Input Line Entry System): a line notation for describing
chemical structures using short ASCII strings. SMILES encodes the nodes and edges of a
molecular graph. It dates from the late 1980s and was developed further by Daylight
Chemical Information Systems, yet it remains widely used.
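To make the idea of a molecular graph encoded as an ASCII string concrete, the following minimal sketch counts the heavy atoms (graph nodes) in a SMILES string. It is not a full SMILES parser: the regular expression and the function name count_atoms are illustrative assumptions, bonds and ring closures are ignored, and only bracket atoms, Cl, Br, and the one-letter organic subset are recognized.

```python
import re
from collections import Counter

# Simplified atom tokenizer (an assumption, not a complete SMILES grammar):
# bracket atoms first, then two-letter halogens, then one-letter atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def count_atoms(smiles: str) -> Counter:
    """Count heavy atoms in a SMILES string (hydrogens stay implicit)."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            # Bracket atom such as [NH4+] or [13C]: keep only the element symbol.
            symbol = re.search(r"[A-Z][a-z]?|[a-z]", token).group(0)
        else:
            symbol = token
        # Aromatic atoms are lower case in SMILES; report them as elements.
        counts[symbol.capitalize()] += 1
    return counts


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, molecular formula C9H8O4
aspirin_atoms = count_atoms(aspirin)
print(aspirin_atoms)
```

In practice a cheminformatics toolkit would be used for parsing; the sketch only shows that the string carries enough information to recover the atoms of the molecule.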
Sources of Data
● Scientific Literature: Research papers and articles that provide relevant compound
information.
● Existing Databases: Public repositories such as DrugBank, PubChem, ChEMBL, and
ZINC store a large amount of useful data for drug discovery.
● Laboratory Experiments: Data generated through experimental research.
Data Quality
High-quality data is crucial, meaning it must be accurate, reliable, and relevant to the
problem being studied. Data cleaning and preprocessing steps are often necessary to
ensure data quality.
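A minimal sketch of such a cleaning step is shown below. It assumes each record is a dictionary with the keys canonical_smiles and standard_value; these field names, the function name clean_records, and the sample values are illustrative assumptions, not measured data or the project's actual code.

```python
def clean_records(records):
    """Drop incomplete rows and keep only the first row for each compound."""
    seen = set()
    cleaned = []
    for row in records:
        smiles = row.get("canonical_smiles")
        value = row.get("standard_value")
        if not smiles or value in (None, ""):
            continue  # incomplete record: structure or measurement is missing
        if smiles in seen:
            continue  # duplicate compound: keep the first occurrence only
        seen.add(smiles)
        cleaned.append({"canonical_smiles": smiles,
                        "standard_value": float(value)})
    return cleaned


# Hypothetical raw records (made-up values for illustration).
raw = [
    {"canonical_smiles": "CCO", "standard_value": "7200"},
    {"canonical_smiles": "CCO", "standard_value": "6900"},
    {"canonical_smiles": None, "standard_value": "10"},
    {"canonical_smiles": "c1ccccc1", "standard_value": ""},
    {"canonical_smiles": "CC(=O)O", "standard_value": "350"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Keeping the first duplicate is only one possible policy; averaging repeated measurements is another reasonable choice.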
Data Quantity
The quantity of data is also important as machine learning models often require large
amounts of data to train effectively. Therefore, researchers aim to collect as much data as
possible.
3.2 Mathematical Descriptors
Sequencing Technologies
New sequencing technologies have greatly advanced the generation of sequence data for
DNA, RNA, proteins, small molecules, and more. These sequences serve as the starting
point in drug discovery.
Data Conversion
To make predictions, these sequences need to be converted into matrices that can be
processed by machine learning (ML) algorithms.
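As a simple illustration of this conversion, the sketch below turns a list of sequences into a count matrix with one row per sequence and one column per symbol. This composition-style encoding is an assumption chosen for brevity; it is far simpler than the descriptors a real pipeline would compute, and the names build_vocabulary and to_count_matrix are hypothetical.

```python
def build_vocabulary(sequences):
    """Assign a column index to every distinct symbol in the sequences."""
    symbols = sorted({ch for seq in sequences for ch in seq})
    return {ch: index for index, ch in enumerate(symbols)}


def to_count_matrix(sequences, vocabulary):
    """Return one row per sequence and one column per vocabulary symbol."""
    matrix = []
    for seq in sequences:
        row = [0] * len(vocabulary)
        for ch in seq:
            if ch in vocabulary:
                row[vocabulary[ch]] += 1  # count occurrences of the symbol
        matrix.append(row)
    return matrix


# Hypothetical short peptide sequences used only for illustration.
sequences = ["MKTAY", "MKKAA", "TTAYM"]
vocab = build_vocabulary(sequences)
matrix = to_count_matrix(sequences, vocab)
print(vocab)
print(matrix)
```

Every row has the same length regardless of the sequence length, which is what makes the result usable as input to an ML algorithm.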
Labeling
In drug discovery, supervised learning models are commonly used. The labels defined by
the researchers are therefore essential to the experimental process, as they provide the
targets needed to train the models.
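One possible labeling scheme is sketched below, assuming the measured quantity is an IC50 value in nanomolar. The conversion pIC50 = -log10(IC50 in molar) is standard; the class thresholds of 1,000 nM and 10,000 nM are a common convention adopted here as an assumption, not values stated in this document.

```python
import math


def pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)  # 1 nM equals 1e-9 M


def bioactivity_class(ic50_nm: float,
                      active_max: float = 1000.0,
                      inactive_min: float = 10000.0) -> str:
    """Assign a class label from IC50; both thresholds are assumptions."""
    if ic50_nm <= active_max:
        return "active"
    if ic50_nm >= inactive_min:
        return "inactive"
    return "intermediate"


for value in (500.0, 5000.0, 20000.0):
    print(value, round(pic50(value), 2), bioactivity_class(value))
```

The continuous pIC50 value suits regression models, while the class label suits classification models.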
Data Processing
Once mathematical descriptors are generated, a dataset is obtained which the ML model
can process. This dataset is typically divided into two subsets: a larger one for training the
model and a smaller one for testing it.
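The split can be sketched as follows. The 80/20 ratio, the fixed seed, and the function name train_test_split are assumptions made for illustration; the document does not state the proportions used in the project.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


rows = list(range(10))  # stand-in for ten descriptor rows
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, so that different models are compared on the same held-out rows.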
3.3 Feature Selection
Principal Component Analysis (PCA) is one of the oldest and most extensively used
approaches to reduce the dimensionality of large datasets. It transforms a large set of
variables into a smaller one that retains most of the information in the original set. PCA
works by finding orthogonal vectors in a dataset that account for the greatest amount of
variation, where each orthogonal vector is a linear combination of all the features in the
original dataset.
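The sketch below illustrates the idea for the first principal component only. It builds the sample covariance matrix and uses power iteration to find the direction of greatest variance; power iteration is one simple way to obtain that direction and is chosen here as an assumption for brevity, whereas library implementations typically rely on a singular value decomposition.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the unit vector along which the centered data varies most."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Start vector; assumed not to be orthogonal to the leading component.
    start = [float(j + 1) for j in range(d)]
    start_norm = math.sqrt(sum(x * x for x in start))
    vec = [x / start_norm for x in start]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along the current direction
        vec = [x / norm for x in nxt]
    return vec


# Points lying on the line y = 2x, so all variance is along (1, 2).
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(points)
print(pc1)
```

Each entry of the returned vector is the weight of one original feature, which reflects the statement that a principal component is a linear combination of all features.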
Feature Selection (FS) techniques obtain a subset of features from the original set
without modifying the content of the variables. This provides a justification that is
understandable at a biological level, which is why a large majority of researchers use
these techniques in their experimental designs.
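A minimal example of such a technique is a variance filter, sketched below: columns whose values barely change across compounds are dropped, and the surviving columns keep their original names and values. The choice of a variance filter, the threshold, and the feature names are assumptions for illustration; the document does not name the specific FS method used.

```python
def select_by_variance(matrix, feature_names, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    n = len(matrix)
    kept = []
    for j in range(len(feature_names)):
        column = [row[j] for row in matrix]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        if variance > threshold:
            kept.append(j)
    reduced = [[row[j] for j in kept] for row in matrix]
    return reduced, [feature_names[j] for j in kept]


# Hypothetical descriptor matrix: the first column is constant.
descriptors = [[1, 0, 5], [1, 1, 3], [1, 0, 4]]
names = ["bit_a", "bit_b", "weight"]
reduced, kept_names = select_by_variance(descriptors, names)
print(kept_names, reduced)
```

Unlike PCA, the retained columns are original variables, so each one can still be interpreted directly.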
3.4 Model Training and Validation
The choice of algorithms and their parameters is critical to ensure they are suitable for the
problem at hand, as well as for the quantity and type of data available. Careful selection
helps optimize model performance and accuracy.
Multiple experimental runs are conducted using the training data. In each run, the
training data is split into two subsets, a training set and a validation set. This division
allows the model's performance to be assessed on data it has not seen during fitting.
Cross-Validation (CV)
Parameter Tuning
The ultimate goal of the cross-validation (CV) process is to identify the optimal
combination of parameters for each algorithm. Fine-tuning these parameters is crucial to
maximize model performance.
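The sketch below shows how k-fold CV can drive parameter selection. A one-dimensional k-nearest-neighbour classifier stands in for the real model, and the data, the parameter grid, and all function names are assumptions chosen to keep the example self-contained; the rows are assumed to be already shuffled.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each contiguous fold."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > k else 0


def cross_validated_accuracy(xs, ys, k, n_folds=5):
    """Mean validation accuracy of the k-NN stand-in model over all folds."""
    scores = []
    for train, validation in kfold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                      for i in validation)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)


# Toy data: label is 1 for larger x; the label of 0.4 is flipped as noise.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
ys = [0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
results = {k: cross_validated_accuracy(xs, ys, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Each candidate parameter value is scored only on folds it was not trained on, and the value with the best mean score is carried forward to the final validation.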
Performance Measurement
Once the parameters are selected, the performance of each model is evaluated. The best
model is the one that achieves the highest performance metrics while maintaining the
lowest overall cost.
Final Validation
The final validation involves using the test set, which was initially set aside from the
original dataset, to validate the best model obtained from the CV process. This step
ensures that the model's performance on the test set is consistent with the CV results.
If the final validation results are consistent with the CV results and statistically
significant, this indicates that a robust predictive model has been developed. This process
is essential in machine learning to confirm the model's reliability and robustness. It helps
detect overfitting, a situation where the model performs exceptionally well on training
data but poorly on new, unseen data.
1.4. Tentative Project Time-line:
1.5. Conclusion:
In conclusion, the Prediction of Drug-Target Interactions Using Machine Learning
project aims to develop an accurate model of drug-target interaction based on a
dataset extracted from the ChEMBL database. The project objectives are specific,
measurable, achievable, relevant, and time-bound, ensuring that the project is
well-defined and manageable. The project timeline is tentative, allowing for
flexibility in case of unexpected challenges. If successful, this project has the
potential to contribute to more reliable prediction of drug-target interactions.
CHAPTER 2: Literature Review
2.1. Introduction
1.1 Background
Drug discovery involves identifying potential therapeutic compounds and their interactions
with biological targets. Predicting drug-target interactions (DTIs) is crucial for
understanding drug efficacy and safety. Bioinformatics employs computational tools and
data analysis to understand biological data, playing a pivotal role in drug discovery by
identifying potential drug targets and elucidating their mechanisms of action.
Machine learning (ML), a branch of artificial intelligence (AI), has transformed various
fields, including drug discovery. ML algorithms can analyze extensive datasets, uncovering
patterns and relationships that might not be evident through traditional methods. This
capability is particularly advantageous in predicting DTIs, where complex biological data
can be efficiently processed to identify potential interactions.
This literature review aims to synthesize current research on ML methods for predicting
DTIs, highlighting key advancements, challenges, and future directions. By examining
various ML approaches and their applications, this review seeks to provide a
comprehensive understanding of the field's current state and its potential impact on drug
discovery.
2.2. Fundamentals of Drug-Target Interactions
DTIs occur when drug molecules bind to biological targets such as proteins, enzymes, or
receptors, influencing their function. Understanding these interactions is critical for
developing effective therapeutics. The binding process can involve various types of
chemical bonds and molecular forces, which determine the specificity and strength of the
interaction.
Traditional methods for identifying DTIs include high-throughput screening, in vitro and in
vivo assays, and computational docking. High-throughput screening involves testing large
libraries of compounds against biological targets to identify potential interactions. In vitro
and in vivo assays provide detailed insights into the biological activity of compounds.
Computational docking predicts how small molecules interact with target proteins based
on their three-dimensional structures.
2.3. Machine Learning Techniques in Predicting Drug-Target Interactions
ML algorithms are broadly classified into supervised, unsupervised, and deep learning
techniques:
Supervised Learning: Algorithms like support vector machines (SVMs), random forests, and
neural networks are trained on labeled datasets to predict DTIs.
Deep Learning: Advanced models like convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) can automatically extract features from raw data, improving the
accuracy of DTI predictions.
3.2 Data Sources and Feature Engineering
ML models for DTI prediction utilize diverse data sources, including chemical properties,
genomic data, and proteomic data. Effective feature engineering techniques are crucial for
transforming raw data into meaningful inputs for ML models. Common approaches include
molecular descriptors, fingerprints, and embeddings.
Training ML models involves splitting data into training, validation, and test sets. Cross-
validation techniques ensure the models generalize well to unseen data. Performance
metrics such as accuracy, precision, recall, F1 score, and ROC-AUC are used to evaluate
model effectiveness.
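For binary interaction labels, the threshold-based metrics above follow directly from the confusion-matrix counts, as the sketch below shows. ROC-AUC is omitted because it requires ranked prediction scores rather than hard labels; the function name and the example labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 means an interaction, 0 means no interaction.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example accuracy is higher than precision and recall because true negatives dominate, which is why accuracy alone can be misleading for DTI data.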
2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction
Recent advancements have led to the development of sophisticated ML models for DTI
prediction. For example, Patel et al. (2020) discuss various ML methods that have shown
promise in drug discovery, highlighting their ability to predict DTIs accurately. Carracedo-
Reboredo et al. (2021) provide an overview of trends in ML approaches, emphasizing the
integration of different data types to enhance prediction accuracy. Additionally, Zhang et
al. (2019) review recent advances in ML-based DTI prediction, showcasing progress in
algorithm development and application.
Monteiro et al. (2020) present an end-to-end deep learning approach for DTI prediction,
demonstrating the power of deep learning in capturing complex biological relationships.
2.5. Challenges and Limitations
Class imbalance is a significant challenge in DTI datasets, where negative samples often
outnumber positive samples. Techniques such as the synthetic minority over-sampling
technique (SMOTE) and cost-sensitive learning help address this issue, improving model
performance on imbalanced datasets (Gupta et al., 2021). El-Behery et al. (2021)
developed an efficient ML model for predicting DTIs, specifically addressing data
imbalance in the context of COVID-19 drug discovery.
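A simple form of cost-sensitive learning is to weight each class inversely to its frequency, so that errors on the rare positive class cost more during training. The sketch below computes such weights; it illustrates inverse-frequency weighting only and is not an implementation of SMOTE. The function name and the 90/10 class ratio are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical imbalanced labels: 90 non-interactions and 10 interactions.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, a misclassified positive sample contributes nine times as much to a weighted loss as a misclassified negative sample.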
Data quality and availability are critical for training reliable ML models. Issues such as
noise, biases, and incomplete data can hinder model performance. Standardizing data
collection and improving data sharing practices can help address these challenges (Dara et
al., 2022).
5.3 Generalization and Robustness
Ensuring ML models generalize well across different datasets is crucial for their practical
application. Techniques such as domain adaptation and transfer learning can improve
model robustness, enabling them to perform well on diverse datasets (Patel et al., 2020).
2.6. Future Directions
Emerging techniques such as transfer learning and meta-learning hold promise for DTI
prediction. These approaches leverage knowledge from related tasks to improve model
performance on new tasks, addressing issues related to limited data availability (Bagherian
et al., 2021).
2.7. Conclusion
ML has significantly advanced the field of DTI prediction, offering powerful tools for drug
discovery. Key findings from the reviewed literature highlight the effectiveness of various
ML approaches and their integration with biological data.
The application of ML in DTI prediction holds substantial potential for accelerating drug
discovery, reducing costs, and improving the success rate of new therapeutics.
CHAPTER 3: System Analysis and design
3.1. System Architecture
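The system architecture of the Improving the Prediction of Drug-Target Interactions Using
Machine Learning project consists of four components: Data Collection, Mathematical
Descriptors, Feature Selection, and Model Training and Validation. These components are
described below.
● Data Collection: Gathering datasets with specific characteristics, including
physicochemical properties that aid in absorption, specificity, and low toxicity.
● Mathematical Descriptors: Generating mathematical descriptors from molecular
sequences to convert them into matrices suitable for ML algorithm processing.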
● Feature Selection: Identifying the best subset of variables to reduce redundant data
and retain biologically relevant information.
● Model Training and Validation: Training ML models with the optimal subset of
variables, selecting appropriate algorithms, and validating the model using
techniques like cross-validation.
Figure 3.1 Improving the Prediction of Drug-Target Interactions Using Machine Learning
system components
3.2. System requirements specification:
This section presents the requirements analysis, covering the functional and
non-functional requirements that fall within the project scope.
REQ-03 5 The system shall clean and prepare our data for ML
2.2 User stories:
Table 2.2.2: User stories table
Work items no | User story | Iteration | Estimated work duration
ST-01 / 01 | Collect dataset of target protein | Iteration 1 | 4pt (2 days)
ST-02 / 02 | Preprocess the collected data | Iteration 2 | 6pt (3 days)
ST-04 / 04 | Implement machine learning models | Iteration 4 | 18pt (9 days)
ST-05 / 05 | Implement cross validation | Iteration 5 | 4pt (2 days)
ST-06 / 06 | Document the project | Iteration 6 | 18pt (9 days)
[Use-case diagram. Use cases shown: Collect Data, Select Features, Predict Drug-Target Interactions]
Figure 2.4.5: Use-Case diagram
2.5 Actor Description:
Table 2.5.4: Actor description table
Train Model
Validate Model
CHAPTER 4: Conclusion and Future
work
4.1 Introduction:
We trained several models to maximize predictive accuracy on this dataset. This chapter
reports the results achieved, concludes the work, and establishes a checkpoint for the
next milestone. The topic is highly relevant to the field of drug discovery.
- The dataset used in this project was collected from ChEMBL, a manually curated
database of bioactive molecules with drug-like properties. It brings together
chemical, bioactivity, and genomic data to aid the translation of genomic
information into effective new drugs.
4.3 Code and Results:
3.1 Collecting bioactivity data from ChEMBL
The above code will extract a .CSV file with the bioactivity data of the target protein
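The export step can be sketched as follows, assuming the bioactivity records have already been retrieved as a list of dictionaries. The column names, the placeholder compound identifier, and the function name write_bioactivity_csv are illustrative assumptions; the sketch covers only the writing of the CSV file, not the retrieval from ChEMBL.

```python
import csv
import io

# Assumed column layout for the exported file.
FIELDS = ["molecule_chembl_id", "canonical_smiles", "standard_value"]


def write_bioactivity_csv(records, handle):
    """Write the selected columns of each record to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in records:
        writer.writerow(row)  # keys outside FIELDS are silently skipped


# Hypothetical record; the identifier is a placeholder, not a real compound.
records = [
    {"molecule_chembl_id": "CHEMBL_EXAMPLE_1", "canonical_smiles": "CCO",
     "standard_value": 7200, "unused_field": "ignored"},
]
buffer = io.StringIO()
write_bioactivity_csv(records, buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, the same function can be given a file opened with open(path, "w", newline="").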
3.1 Downloading the two files padel.zip and padel.sh
Sample of Data
3.3 Training Model
Running the code above produces the following output:
3.4 Testing Model
References
Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M., & Ahsan, M. J. (2022). Machine
learning in drug discovery: a review. Artificial Intelligence Review, 55(3), 1947-1999.
Patel, L., Shukla, T., Huang, X., Ussery, D. W., & Wang, S. (2020). Machine learning
methods in drug discovery. Molecules, 25(22), 5277.
Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., & Kumar, P. (2021). Artificial
intelligence to deep learning: machine intelligence approach for drug discovery. Molecular
diversity, 25, 1315-1360.
Réda, C., Kaufmann, E., & Delahaye-Duriez, A. (2020). Machine learning applications in
drug development. Computational and structural biotechnology journal, 18, 241-252.
Bagherian, M., Sabeti, E., Wang, K., Sartor, M. A., Nikolovska-Coleska, Z., & Najarian, K.
(2021). Machine learning approaches and databases for prediction of drug–target
interaction: a survey paper. Briefings in bioinformatics, 22(1), 247-269.
Zhang, W., Lin, W., Zhang, D., Wang, S., Shi, J., & Niu, Y. (2019). Recent advances in the
machine learning-based drug-target interaction prediction. Current drug metabolism,
20(3), 194-202.
Monteiro, N. R., Ribeiro, B., & Arrais, J. P. (2020). Drug-target interaction prediction: end-
to-end deep learning approach. IEEE/ACM transactions on computational biology and
bioinformatics, 18(6), 2364-2374.
El-Behery, H., Attia, A. F., El-Fishawy, N., & Torkey, H. (2021). Efficient machine learning
model for predicting drug-target interactions with case study for Covid-19. Computational
Biology and Chemistry, 93, 107536.
Zheng, Y., & Wu, Z. (2021). A machine learning-based biological drug–target interaction
prediction method for a tripartite heterogeneous network. ACS omega, 6(4), 3037-3045.
Hughes, J., Rees, S., Kalindjian, S., & Philpott, K. (2011). Principles of early drug
discovery. British Journal of Pharmacology, 162, 1239-1249.
https://doi.org/10.1111/j.1476-5381.2010.01127.x
Scannell, J., Blanckley, A., Boldon, H., et al. (2012). Diagnosing the decline in
pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11, 191-200.
https://doi.org/10.1038/nrd3681
Harrison, R. (2016). Phase II and phase III failures: 2013–2015. Nature Reviews Drug
Discovery, 15, 817-818. https://doi.org/10.1038/nrd.2016.184