Cairo University

Faculty of Graduate Studies for Statistical Research


Department of Information Systems & Technology

Improving the Prediction of Drug-Target Interactions Using Machine Learning


A Project Presented for Fulfillment
For Master Project in Computer Science

Submitted by

1. Fatma Omar Abdelmohsen Mohamed


2. Bassam Tarek Farouk
3. Rada Kamel Saleh

Supervised by

Dr. Tarek ElGahzaly

Cairo, Egypt
Jun 2024
Table of Contents

Abstract
List of Figures
List of Tables
List of Abbreviations
CHAPTER 1: Business Case for Developing a Predictive Drug Discovery ML Model
1.1 Introduction:
1.2. Problem Statement
1.3. Scope:
3.3 Feature Selection
3.4 Model Training and Validation
1.4. Tentative Project Time-line:
1.5. Conclusion:
CHAPTER 2: Literature Review
2.1. Introduction
1.1 Background
2.2. Fundamentals of Drug-Target Interactions
2.3. Machine Learning Techniques in Predicting Drug-Target Interactions
2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction
2.5. Challenges and Limitations
2.6. Future Directions
2.7. Conclusion
CHAPTER 3: System Analysis and design
3.1. System Architecture
3.2. System requirements specification:
2.1 Functional requirements:
2.1.1. Requirement specifications:
2.2 User stories:
2.3 work backlog:
2.3.2 Project duration
2.4. Use-Case Diagram:
2.5 Actor Description:
CHAPTER 4: Conclusion and Future work
4.1 Introduction:
4.3 Code and Results:
3.1 collecting bioactivity data from ChEMBL
3.1 Downloading this 2 files padel.zip and padel.sh
3.3 Training Model
3.4 Testing Model
References

Abstract

Predicting drug-target interactions (DTIs) is crucial for drug discovery, providing insights
into the efficacy and safety of potential therapeutic compounds. Recent advancements in
machine learning (ML) have significantly enhanced the accuracy and efficiency of DTI
predictions by analyzing complex biological data and identifying patterns that traditional
methods may overlook. This project employs a structured methodology comprising several
critical steps. Initially, we focus on data collection, assembling datasets enriched with
specific characteristics such as physicochemical properties essential for absorption,
specificity, and low toxicity. These datasets are then processed to generate mathematical
descriptors from molecular sequences, transforming them into matrices suitable for ML
algorithm processing. Following this, we perform feature selection to identify the most
relevant subset of variables, minimizing redundancy while preserving biologically
significant information. The refined dataset is subsequently used to train and validate ML
models, employing optimal algorithms and robust validation techniques such as cross-
validation. This systematic approach ensures the development of predictive models
capable of enhancing drug discovery by accurately forecasting drug interactions and
potential adverse effects.

List of Figures

Figure 1.1: Drug discovery stages from target identification and validation

Figure 1.2: Stages in the discovery of new drugs in the context of precision medicine
Figure 1.3: Machine learning methodology commonly used for drug discovery

Figure 1.3.2: Generation of Mathematical Descriptors

Figure 1.3.4: Cross-validation

Figure 3.1: Prediction of Drug-Target Interactions Using Machine Learning system components

Figure 2.4.5: Use-Case diagram

List of Tables

Table 2.1.1: Requirement specifications table

Table 2.2.2: User stories table

Table 2.3.3: Work backlog table

Table 2.3.3: Case diagram table

List of Abbreviations

ML: Machine Learning

pd: pandas library

np: numpy library

CSV: Comma-Separated Values

CHAPTER 1: Business Case for
Developing a Predictive Drug Discovery
ML Model

1.1 Introduction:

The process of discovering new drugs is fraught with challenges that impede the
development of effective and safe therapeutic compounds. Over recent decades, the
pharmaceutical industry has faced increasing difficulties, including escalating research and
development (R&D) costs and declining approval rates for new drugs. The number of new
drugs approved by the US Food and Drug Administration (FDA) per billion US dollars
(inflation-adjusted) spent on R&D has halved approximately every nine years. This trend,
known as Eroom's Law, reflects a significant decrease in R&D productivity. The pattern of
decline is consistent over various ten-year periods, even when accounting for different
assumptions about the delay between R&D spending and drug approval.

Furthermore, many potential drugs fail during clinical trials due to unforeseen toxicity or
lack of efficacy. Between Phase II (including Phase I/II) and submission, there were 218
reported failures, with 174 of these having stated reasons for failure. These setbacks not
only represent substantial financial losses but also delay the availability of potentially life-
saving treatments.

The complexity of human biology and the vastness of chemical space add to the
challenges. Human biological systems are intricate and unpredictable, making it difficult to
foresee how a drug will interact with these systems. Additionally, the vastness of chemical
space, which encompasses all possible small molecules, poses a significant hurdle in
identifying compounds with optimal therapeutic properties.

To address these challenges, we propose developing a predictive drug discovery model
using advanced machine learning (ML) techniques. This model aims to enhance the
efficiency and accuracy of predicting drug-target interactions (DTIs), thereby improving the
drug discovery process.

Figure 1.1: Drug discovery stages from target identification and validation through filing
with the FDA (Food and Drug Administration); IND (Investigational New Drug); NDA (New Drug
Application) [4]

1.2. Problem Statement
The drug discovery process faces numerous challenges that significantly impact the
efficiency and success rate of developing new therapeutic compounds. Key issues include
escalating costs, declining approval rates, and unforeseen failures during clinical trials.
These challenges are compounded by the complexities of human biology and the
limitations of chemical space exploration.

Key Challenges

1. Escalating Costs and Declining Approval Rates:


○ The number of new drugs approved by the US Food and Drug Administration
(FDA) per billion US dollars (inflation-adjusted) spent on research and
development (R&D) has halved approximately every nine years. This trend,
often referred to as Eroom's Law (the opposite of Moore's Law), indicates a
significant decrease in R&D productivity.
○ The rate of decline in the approval of new drugs per billion US dollars spent
has been consistent over various ten-year periods. This pattern remains
robust even when considering different assumptions about the average delay
between R&D spending and drug approval.

2. High Failure Rates in Clinical Trials:


○ Many potential drugs fail during clinical trials due to unforeseen toxicity or
lack of efficacy. Between Phase II (including Phase I/II) and submission, there
were 218 reported failures. Of these, 174 had stated reasons for failure,
which were used in the subsequent analysis. These failures represent a
substantial loss of investment and time.
○ Addressing the root causes of these failures is critical for improving the
success rate of drug development.

3. Complex Biological Systems:


○ Human biology is highly complex, and understanding the intricate interactions
between drugs and biological systems poses a significant challenge. This
complexity often leads to unpredictable outcomes in drug efficacy and safety.

4. Limited Chemical Space:
○ The vastness of chemical space, which encompasses all possible small
molecules, makes it difficult to identify compounds with optimal therapeutic
properties. Navigating this space efficiently is crucial for discovering effective
drugs.

Figure 1.2 Stages in the discovery of new drugs in the context of precision medicine.

Figure 1.3 Machine Learning methodology commonly used for drug discovery.

1.3. Scope:

In the ML methodology applied in drug discovery, the following steps are differentiated:

1. Data Collection:
○ Gathering datasets with specific characteristics, including physicochemical
properties that aid in absorption, specificity, and low toxicity.
2. Mathematical Descriptors:
○ Generating mathematical descriptors from molecular sequences to convert
them into matrices suitable for ML algorithm processing.
3. Feature Selection:
○ Identifying the best subset of variables to reduce redundant data and retain
biologically relevant information.
4. Model Training and Validation:
○ Training ML models with the optimal subset of variables, selecting
appropriate algorithms, and validating the models using techniques such as
cross-validation.

3.1 Data Collection

The goal is to gather a comprehensive and high-quality dataset that includes information
about various compounds and their characteristics. Here are some more details:

Types of Data

The data collected typically includes information about the physicochemical properties of
various compounds, such as molecular weight, polarity, and charge. It must also include
characteristics that allow the compounds to be easily produced and handled in the
laboratory, excluding large proteins or extremely complex molecules. These properties
influence how a compound interacts with biological systems and can be important
predictors of its potential as a drug.

Format of Data

● Main Compounds: The primary focus is on small molecules and peptides.


● Representation Formats:
○ SMILES (Simplified Molecular-Input Line-Entry System): A line notation for
describing chemical structures using short ASCII strings. SMILES encodes the
nodes (atoms) and edges (bonds) of a molecular graph and remains widely used,
although it was developed in the late 1980s by Daylight Chemical
Information Systems.
○ FASTA Format: A text-based format for representing nucleotide or amino acid
(protein) sequences using single-letter codes.
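
To make the two formats concrete, the following minimal sketch shows a SMILES string and a small FASTA reader written with the Python standard library only. The function name `parse_fasta` and the two example records are hypothetical and serve only as illustration; they are not taken from the project code.

```python
# A SMILES string is plain text; "CCO" is the SMILES notation for ethanol.
ethanol_smiles = "CCO"


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A line starting with ">" opens a new record.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may wrap, so collect them and join later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record input; the sequences are made up for illustration.
example = """>protein_A
MKTAYIAK
QRQISFVK
>protein_B
GSHMLEDP
"""
sequences = parse_fasta(example)
```

In practice a dedicated library would normally be used for both formats; the sketch only shows that each is a simple text representation.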

Sources of Data

The data can come from a variety of sources, including:

● Scientific Literature: Research papers and articles that provide relevant compound
information.
● Existing Databases: Public repositories such as DrugBank, PubChem, ChEMBL, and
ZINC store a large amount of useful data for drug discovery.
● Laboratory Experiments: Data generated through experimental research.

Data Quality

High-quality data is crucial, meaning it must be accurate, reliable, and relevant to the
problem being studied. Data cleaning and preprocessing steps are often necessary to
ensure data quality.

Data Quantity

The quantity of data is also important as machine learning models often require large
amounts of data to train effectively. Therefore, researchers aim to collect as much data as
possible.

3.2 Mathematical Descriptors

Sequencing Technologies

New sequencing technologies have greatly advanced the generation of sequence data for
DNA, RNA, proteins, small molecules, and more. These sequences serve as the starting
point in drug discovery.

Data Conversion

To make predictions, these sequences need to be converted into matrices that can be
processed by machine learning (ML) algorithms.
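
As an illustration of this conversion, the sketch below turns protein sequences into a numeric matrix using amino acid composition, one of the simplest sequence descriptors. The choice of descriptor and the function names are assumptions made for this example; the project itself may rely on other descriptors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        # No standard residues found: return an all-zero descriptor.
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def build_matrix(sequences):
    """Stack one descriptor vector per sequence into a list-of-rows matrix."""
    return [amino_acid_composition(s) for s in sequences]


# Two made-up sequences give a 2 x 20 matrix.
matrix = build_matrix(["ACAC", "KKKK"])
```

Each row has the same length regardless of sequence length, which is what allows sequences of different sizes to be fed to the same ML algorithm.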

Labeling

In drug discovery, supervised learning models are commonly used. The labels defined by
the researchers are essential to the experimental process, because they provide the
targets needed to train the models.
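
A minimal sketch of how such labels might be derived is given below, assuming the label comes from an IC50 measurement in nanomolar units (IC50 is the bioactivity type named in the system requirements). The two cutoffs are hypothetical values chosen only for illustration and would have to be set by the researchers for a given target.

```python
import math

# Hypothetical cutoffs, in nM; not taken from the project.
ACTIVE_NM = 1000.0      # IC50 at or below 1 uM is treated as active
INACTIVE_NM = 10000.0   # IC50 at or above 10 uM is treated as inactive


def label_from_ic50(ic50_nm):
    """Assign a bioactivity class from an IC50 value given in nM."""
    if ic50_nm <= ACTIVE_NM:
        return "active"
    if ic50_nm >= INACTIVE_NM:
        return "inactive"
    return "intermediate"


def pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L), with the IC50 supplied in nM."""
    return -math.log10(ic50_nm * 1e-9)


labels = [label_from_ic50(v) for v in (50.0, 5000.0, 50000.0)]
```

The class labels suit a classification model, while the pIC50 value is a common regression target because it compresses the wide range of IC50 values onto a logarithmic scale.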

Data Processing

Once mathematical descriptors are generated, a dataset is obtained which the ML model
can process. This dataset is typically divided into two subsets: a larger one for training the
model and a smaller one for testing it.
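
The split described above can be sketched as follows, using only the standard library. The 80/20 ratio and the fixed seed are assumptions for the example, not values prescribed by the project.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))  # stand-in for ten descriptor rows
train, test = train_test_split(rows)
```

Fixing the seed makes the split reproducible, so that results from different experimental runs can be compared on the same held-out rows.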

Figure 1.3.2 Generation of Mathematical Descriptors

3.3 Feature Selection

The generation of mathematical descriptors produces a large number of numerical
variables. The main objective of this step is to remove as many uninformative or
redundant variables as possible. Several techniques are used for this purpose:

PCA: Principal Component Analysis

Principal Component Analysis (PCA) is one of the oldest and most extensively used
approaches to reduce the dimensionality of large datasets. It transforms a large set of
variables into a smaller one that retains most of the information in the original set. PCA
works by finding orthogonal vectors in a dataset that account for the greatest amount of
variation, where each orthogonal vector is a linear combination of all the features in the
original dataset.
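
The idea can be sketched in a few lines of standard-library Python. The example below finds only the first principal component, and it does so with power iteration on the covariance matrix, a technique swapped in here to keep the example dependency-free; numerical libraries typically compute PCA through a singular value decomposition instead. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first component.

    Assumes at least two rows and non-constant data; the all-ones start
    vector must not be orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Toy data lying exactly on the line y = 2x, so one component explains it all.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, ratio = first_principal_component(data)
```

The returned direction is a linear combination of both original features, which is why principal components are harder to interpret biologically than a selected subset of features.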

FS: Feature Selection

Feature Selection (FS) techniques obtain a subset of features from the original set
without modifying the content of the variables. This provides a justification that is
understandable at a biological level, which is why a large majority of researchers use
these techniques in their experimental designs.
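
One of the simplest FS techniques is a variance filter, sketched below under the assumption that a descriptor which never varies across compounds carries no information. The descriptor names and values are hypothetical.

```python
from statistics import pvariance


def select_by_variance(matrix, names, threshold=0.0):
    """Keep only the columns whose variance exceeds the threshold."""
    columns = list(zip(*matrix))  # transpose rows into columns
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in kept] for row in matrix]
    return reduced, [names[j] for j in kept]


# Hypothetical descriptor table: the middle column is constant.
matrix = [
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 3.5],
    [3.0, 0.0, 2.5],
]
names = ["mol_weight", "const_flag", "logp"]
reduced, kept_names = select_by_variance(matrix, names)
```

Because the surviving columns keep their original names and values, the result remains directly interpretable, in contrast to the transformed variables produced by PCA.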

3.4 Model Training and Validation

Selection of Algorithms and Parameters

The choice of algorithms and their parameters is critical to ensure they are suitable for the
problem at hand, as well as for the quantity and type of data available. Careful selection
helps optimize model performance and accuracy.

Training and Validation

Multiple experimental runs are conducted using the training data. The original dataset is
split into two subsets: the training set and the validation set. This division allows for the
assessment of the model's performance on unseen data.

Cross-Validation (CV)

Cross-validation (CV) is a technique employed to evaluate the generalization ability of the
model during the training phase. It assesses the model’s performance and estimates its
effectiveness with unknown data. In the basic approach, known as k-fold CV, the training
set is divided into k smaller subsets. The model is trained on k-1 of these subsets and
validated on the remaining subset. This process is repeated k times, with each of the k
subsets used exactly once as the validation data. By evaluating the model on multiple
validation sets, CV provides a more realistic estimate of the model’s generalization
performance, measuring its ability to perform well on new, unseen data.
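
The fold construction described above can be sketched as follows. The helper name `k_fold_indices` and the interleaved assignment of samples to folds are choices made for this example; in practice the data is usually shuffled before the folds are formed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Interleaved assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
```

Every sample appears in exactly one validation fold, so each observation contributes once to the validation estimate and k-1 times to training.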

Parameter Tuning

The ultimate goal of the cross-validation (CV) process is to identify the optimal
combination of parameters for each algorithm. Fine-tuning these parameters is crucial to
maximize model performance.
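
A minimal sketch of this tuning loop is shown below. It uses a one-dimensional k-nearest-neighbours classifier as a stand-in model and tunes its number of neighbours with 5-fold CV; the model, the candidate values, and the toy data are all assumptions for illustration and are not the algorithms used in the project.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k_neighbors, n_folds):
    """Mean validation accuracy over interleaved folds."""
    fold_scores = []
    for f in range(n_folds):
        val_idx = list(range(f, len(xs), n_folds))
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], k_neighbors) == ys[i]
            for i in val_idx
        )
        fold_scores.append(correct / len(val_idx))
    return sum(fold_scores) / n_folds


def grid_search(xs, ys, candidates, n_folds=5):
    """Return the candidate with the best CV accuracy (first one wins ties)."""
    best_k, best_score = None, -1.0
    for k in candidates:
        score = cv_accuracy(xs, ys, k, n_folds)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy data: two well-separated groups on a line.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 5.0, 5.2, 5.4, 5.6, 5.8]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
best_k, best_score = grid_search(xs, ys, candidates=[1, 3, 5])
```

The same loop generalizes to several parameters by iterating over every combination of candidate values, at a cost that grows with the size of the grid.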

Performance Measurement

Once the parameters are selected, the performance of each model is evaluated. The best
model is the one that achieves the highest performance metrics while maintaining the
lowest overall cost.

Final Validation

The final validation involves using the test set, which was initially set aside from the
original dataset, to validate the best model obtained from the CV process. This step
ensures that the model's performance on the test set is consistent with the CV results.

Figure 1.3.4 Cross-Validation

If the final validation results are statistically significant, this indicates that a robust
predictive model has been developed. This process is essential in machine learning to confirm
the model’s reliability and robustness. It helps prevent overfitting, a situation where the
model performs exceptionally well on training data but poorly on new, unseen data.

1.4. Tentative Project Time-line:

1. Week 1-2: Data collection and preprocessing

● Collect data from the ChEMBL database
● Clean and preprocess the data

2. Week 3-4: Model selection and development

● Select appropriate machine learning models for classification
● Develop and train models on the preprocessed data
● Fine-tune models using hyperparameter tuning techniques

3. Week 5-6: Evaluation and testing

● Evaluate the performance of the models using appropriate metrics
● Compare the results with other approaches and datasets
● Test the models on unseen data to assess generalization performance

4. Week 7-8: Documentation and reporting

● Write a comprehensive report of the project, including the methodology,
results, and analysis
● Create visualizations and presentations to support the report
● Review the report and prepare it for submission

1.5. Conclusion:
In conclusion, the Prediction of Drug-Target Interactions Using Machine Learning
project aims to develop an accurate model for predicting drug-target interactions
based on a dataset extracted from the ChEMBL database. The project objectives are
specific, measurable, achievable, relevant, and time-bound, ensuring that the project
is well-defined and manageable. The project timeline is tentative, allowing for
flexibility in case of unexpected challenges. The success of this project has the
potential to contribute to improved prediction of drug-target interactions.

CHAPTER 2: Literature Review

2.1. Introduction

1.1 Background

Drug discovery involves identifying potential therapeutic compounds and their interactions
with biological targets. Predicting drug-target interactions (DTIs) is crucial for
understanding drug efficacy and safety. Bioinformatics employs computational tools and
data analysis to understand biological data, playing a pivotal role in drug discovery by
identifying potential drug targets and elucidating their mechanisms of action.

1.2 Significance of Machine Learning in Drug Discovery

Machine learning (ML), a branch of artificial intelligence (AI), has transformed various
fields, including drug discovery. ML algorithms can analyze extensive datasets, uncovering
patterns and relationships that might not be evident through traditional methods. This
capability is particularly advantageous in predicting DTIs, where complex biological data
can be efficiently processed to identify potential interactions.

1.3 Objectives of the Literature Review

This literature review aims to synthesize current research on ML methods for predicting
DTIs, highlighting key advancements, challenges, and future directions. By examining
various ML approaches and their applications, this review seeks to provide a
comprehensive understanding of the field's current state and its potential impact on drug
discovery.

2.2. Fundamentals of Drug-Target Interactions

2.1 Biological Basis of Drug-Target Interactions

DTIs occur when drug molecules bind to biological targets such as proteins, enzymes, or
receptors, influencing their function. Understanding these interactions is critical for
developing effective therapeutics. The binding process can involve various types of
chemical bonds and molecular forces, which determine the specificity and strength of the
interaction.

2.2 Experimental Methods for Identifying Drug-Target Interactions

Traditional methods for identifying DTIs include high-throughput screening, in vitro and in
vivo assays, and computational docking. High-throughput screening involves testing large
libraries of compounds against biological targets to identify potential interactions. In vitro
and in vivo assays provide detailed insights into the biological activity of compounds.
Computational docking predicts how small molecules interact with target proteins based
on their three-dimensional structures.

2.3. Machine Learning Techniques in Predicting Drug-Target Interactions

3.1 Overview of Machine Learning Algorithms

ML algorithms are broadly classified into supervised, unsupervised, and deep learning
techniques:

Supervised Learning: Algorithms like support vector machines (SVMs), random forests, and
neural networks are trained on labeled datasets to predict DTIs.

Unsupervised Learning: Techniques such as clustering and dimensionality reduction help
uncover hidden patterns in unlabeled data, which can be useful for understanding
complex biological systems.

Deep Learning: Advanced models like convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) can automatically extract features from raw data, improving the
accuracy of DTI predictions.

3.2 Data Sources and Feature Engineering

ML models for DTI prediction utilize diverse data sources, including chemical properties,
genomic data, and proteomic data. Effective feature engineering techniques are crucial for
transforming raw data into meaningful inputs for ML models. Common approaches include
molecular descriptors, fingerprints, and embeddings.

3.3 Model Training and Evaluation

Training ML models involves splitting data into training, validation, and test sets. Cross-
validation techniques ensure the models generalize well to unseen data. Performance
metrics such as accuracy, precision, recall, F1 score, and ROC-AUC are used to evaluate
model effectiveness.
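
For a binary interaction label, the threshold-based metrics named above follow directly from the confusion counts, as the sketch below shows. ROC-AUC is omitted because it requires ranked prediction scores rather than hard labels. The function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true interactions, 5 non-interactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On this example accuracy is 0.75 while precision and recall are both 2/3, which illustrates why accuracy alone can flatter a model when non-interacting pairs dominate the data.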

2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction

4.1 Predictive Models

Recent advancements have led to the development of sophisticated ML models for DTI
prediction. For example, Patel et al. (2020) discuss various ML methods that have shown
promise in drug discovery, highlighting their ability to predict DTIs accurately. Carracedo-
Reboredo et al. (2021) provide an overview of trends in ML approaches, emphasizing the
integration of different data types to enhance prediction accuracy. Additionally, Zhang et
al. (2019) review recent advances in ML-based DTI prediction, showcasing progress in
algorithm development and application.

4.2 Integrative Approaches

Integrative approaches combine multi-omics data, enabling a comprehensive
understanding of biological systems. Network-based methods and graph neural networks
(GNNs) model the interactions within biological networks, improving DTI predictions by
considering the broader context of molecular interactions (Bagherian et al., 2021).

Monteiro et al. (2020) present an end-to-end deep learning approach for DTI prediction,
demonstrating the power of deep learning in capturing complex biological relationships.

4.3 Handling Imbalanced Data

Class imbalance is a significant challenge in DTI datasets, where negative samples often
outnumber positive samples. Techniques such as the synthetic minority over-sampling
technique (SMOTE) and cost-sensitive learning help address this issue, improving model performance
on imbalanced datasets (Gupta et al., 2021). El-Behery et al. (2021) develop an efficient
ML model for predicting DTIs, specifically addressing data imbalance in the context of
COVID-19 drug discovery.
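
As a small illustration of cost-sensitive learning, the sketch below computes per-class weights that are inversely proportional to class frequency, using the common heuristic weight = n_samples / (n_classes × class_count). The 90/10 class ratio is a made-up example, and the sketch does not reproduce the method of any of the cited papers.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the label list."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 90 non-interacting pairs, 10 interacting.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

With these weights a misclassified positive sample costs nine times as much as a misclassified negative one, which pushes the model away from simply predicting the majority class.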

2.5. Challenges and Limitations

5.1 Interpretability and Explainability

Interpretable models are essential for understanding the biological significance of
predicted DTIs. Efforts to enhance model transparency include feature importance analysis
and the development of interpretable algorithms (Réda et al., 2020). Zheng and Wu (2021)
propose an ML-based DTI prediction method for a tripartite heterogeneous network,
focusing on improving model interpretability.

5.2 Data Quality and Availability

Data quality and availability are critical for training reliable ML models. Issues such as
noise, biases, and incomplete data can hinder model performance. Standardizing data
collection and improving data sharing practices can help address these challenges (Dara et
al., 2022).

5.3 Generalization and Robustness

Ensuring ML models generalize well across different datasets is crucial for their practical
application. Techniques such as domain adaptation and transfer learning can improve
model robustness, enabling them to perform well on diverse datasets (Patel et al., 2020).

2.6. Future Directions

6.1 Emerging Machine Learning Techniques

Emerging techniques such as transfer learning and meta-learning hold promise for DTI
prediction. These approaches leverage knowledge from related tasks to improve model
performance on new tasks, addressing issues related to limited data availability (Bagherian
et al., 2021).

6.2 Integration with Other Computational Methods

Integrating ML with other computational methods, such as molecular dynamics
simulations and quantum computing, can enhance DTI predictions. These synergies enable
more accurate modeling of molecular interactions and their dynamics (Gupta et al., 2021).

6.3 Personalized Medicine

ML has the potential to revolutionize personalized medicine by predicting DTIs based on
individual genetic profiles. This approach can lead to tailored treatment plans, improving
therapeutic outcomes for patients (Carracedo-Reboredo et al., 2021).

2.7. Conclusion

7.1 Summary of Key Findings

ML has significantly advanced the field of DTI prediction, offering powerful tools for drug
discovery. Key findings from the reviewed literature highlight the effectiveness of various
ML approaches and their integration with biological data.

7.2 Implications for Drug Discovery

The application of ML in DTI prediction holds substantial potential for accelerating drug
discovery, reducing costs, and improving the success rate of new therapeutics.

7.3 Final Thoughts

Continued innovation in ML methods and interdisciplinary collaboration will further
enhance our ability to predict DTIs, ultimately leading to more effective and personalized
treatments.

CHAPTER 3: System Analysis and Design

3.1. System Architecture

The architecture of the Improving the Prediction of Drug-Target Interactions Using
Machine Learning system consists of four components: Data Collection, Mathematical
Descriptors, Feature Selection, and Model Training and Validation. These components are
described in detail below.

● Data Collection: Gathering datasets with specific characteristics, including physical-
chemical properties that aid in absorption, specificity, and low toxicity.

● Mathematical Descriptors: Generating mathematical descriptors from molecular
sequences to convert them into matrices for ML algorithm processing.

● Feature Selection: Identifying the best subset of variables to reduce redundant data
and retain biologically relevant information.

● Model Training and Validation: Training ML models with the optimal subset of
variables, selecting appropriate algorithms, and validating the model using
techniques like cross-validation.
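
The descriptor and feature selection components can be illustrated with a minimal sketch. It assumes amino-acid composition as the descriptor and a zero-variance filter as the selection rule; both are stand-ins chosen for brevity, not necessarily the descriptors or the selection method used in this project, and the toy sequences are hypothetical.

```python
from statistics import pvariance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    sequence = sequence.upper()
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def drop_constant_columns(matrix, threshold=0.0):
    """Keep only columns whose population variance exceeds the threshold."""
    columns = list(zip(*matrix))
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in kept] for row in matrix], kept


# Hypothetical toy sequences, for illustration only.
sequences = ["ACDA", "ACCA", "AADD"]

# Descriptor step: one row of 20 numbers per sequence.
matrix = [composition(s) for s in sequences]

# Feature selection step: remove columns that carry no information.
reduced, kept = drop_constant_columns(matrix)
print([AMINO_ACIDS[i] for i in kept])  # ['C', 'D']
```

Residue A has the same fraction in all three sequences and the unused residues are always zero, so only the C and D columns survive. scikit-learn's VarianceThreshold applies the same idea to larger matrices.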

Figure 3.1 Improving the Prediction of Drug-Target Interactions Using Machine Learning
system components

3.2. System requirements specification:

Due to project scope, in this section we will discuss requirements analysis through
functional and non-functional requirements.

2.1 Functional requirements:

2.1.1. Requirement specifications:


Table 2.1.1: Requirement specifications table

Identifier  Priority  Requirement
REQ-01      5         The system shall be able to extract data from the ChEMBL database.
REQ-02      5         The system shall be able to define the target protein as has been stated.
REQ-03      5         The system shall clean and prepare the data for ML.
REQ-04      5         The system shall retrieve only bioactivity data for.
REQ-05      5         The system shall filter using standard type "IC50".
REQ-06      5         The system shall install the following libraries: chembl_webresource_client, scikit-learn, numpy, and pandas.
REQ-07      5         The system shall train and evaluate the performance of each machine learning algorithm using cross-validation.
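
The filtering described in REQ-03 and REQ-05 can be sketched offline as follows. The records are hypothetical placeholders with invented identifiers and values; the field names are assumed to follow ChEMBL's activity naming. In practice the records would come from the chembl_webresource_client library, which is not called here because it needs network access.

```python
# Hypothetical records shaped like ChEMBL activity entries (invented values).
activities = [
    {"molecule_chembl_id": "CHEMBL_A", "standard_type": "IC50", "standard_value": "120.0"},
    {"molecule_chembl_id": "CHEMBL_B", "standard_type": "Ki", "standard_value": "15.0"},
    {"molecule_chembl_id": "CHEMBL_C", "standard_type": "IC50", "standard_value": None},
    {"molecule_chembl_id": "CHEMBL_D", "standard_type": "IC50", "standard_value": "8500.0"},
]


def keep_ic50(records):
    """Keep IC50 rows that have a measured value, converting it to float."""
    cleaned = []
    for rec in records:
        if rec.get("standard_type") != "IC50":
            continue  # other bioactivity types are out of scope
        if rec.get("standard_value") is None:
            continue  # rows without a measurement cannot be used for training
        cleaned.append({**rec, "standard_value": float(rec["standard_value"])})
    return cleaned


ic50_rows = keep_ic50(activities)
print([r["molecule_chembl_id"] for r in ic50_rows])  # ['CHEMBL_A', 'CHEMBL_D']
```

Dropping rows with a missing value at this stage is one possible cleaning rule; the project may apply additional steps.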

2.2 User stories:
Table 2.2.2: User stories table

Identifier  User story                                                                                   Size
ST-01       As a developer, I need to collect target protein data from the ChEMBL database.             10pt
ST-02       As an ML engineer, I need to preprocess collected data.                                      6pt
ST-03       As an ML engineer, I need to analyze and visualize collected data.                           2pt
ST-04       As a developer, I need to implement machine learning models for drug discovery prediction.   18pt
ST-05       As a developer, I need to evaluate the performance of implemented models.                    4pt
ST-06       As a software engineer, I need to document the project and present the findings.             18pt

2.3 Work backlog:

Table 2.3.3: Work backlog table

Work item no  User story                                Iteration    Estimated work duration
01            ST-01 Collect dataset of target protein   Iteration 1  4pt (2 days)
02            ST-02 Preprocess the collected data       Iteration 2  6pt (3 days)
04            ST-04 Implement machine learning models   Iteration 4  18pt (9 days)
05            ST-05 Implement cross-validation          Iteration 5  4pt (2 days)
06            ST-06 Document the project                Iteration 6  18pt (9 days)

2.3.2 Project duration


Work duration = path size / travel velocity

Path size = total work size = 50pt and travel velocity = 2pt per day, so
work duration = 50 / 2 = 25 days
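
The same arithmetic can be checked with a short script. The point values are copied from the work backlog table; the velocity of 2 points per day is an assumption inferred from entries such as "4pt (2 day)".

```python
# Story points per backlog item, taken from the work backlog table.
backlog_points = {"ST-01": 4, "ST-02": 6, "ST-04": 18, "ST-05": 4, "ST-06": 18}

# Assumed velocity in points per day, inferred from "4pt (2 day)".
VELOCITY = 2

total_points = sum(backlog_points.values())
duration_days = total_points / VELOCITY
print(total_points, duration_days)  # 50 25.0
```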

2.4. Use-Case Diagram:

[Figure 2.4.5: Use-Case diagram. Actors: Research Scientist, ML Engineer. Use cases:
Collect Data, Generate Mathematical Descriptors, Select Features, Train Model, Validate
Model, Predict Drug-Target Interactions.]

2.5 Actor Description:
Table 2.5.4: Actor description table

Actor               Actor's Goal
Research Scientist  Collect Data
                    Select Features
ML Engineer         Generate Mathematical Descriptors
                    Train Model
                    Validate Model

CHAPTER 4: Conclusion and Future Work

4.1 Introduction:
We used several models to maximize prediction accuracy on this valuable dataset. The
results are stated here to conclude our work and to set a checkpoint for the next
milestone. It is worth noting that this topic is very important to the drug discovery
field.

- The dataset used in this project is a real-life dataset collected from ChEMBL, which is
a manually curated database of bioactive molecules with drug-like properties. It
brings together chemical, bioactivity and genomic data to aid the translation of
genomic information into effective new drugs.

- Generating mathematical descriptors from molecular sequences to convert them
into matrices for ML algorithm processing.

- Training ML models with the optimal subset of variables, selecting appropriate
algorithms, and validating the model using techniques like cross-validation.
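
The cross-validation step named in the last item can be sketched from first principles. This is a minimal illustration under stated assumptions: the "model" is a mean-value baseline rather than any of the project's algorithms, the activity values are hypothetical, and the helper names are invented for this example.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(y, k=5):
    """Score a mean-value baseline with k-fold CV; returns per-fold MSE."""
    scores = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = mean(train)  # the "model": predict the training mean
        scores.append(mean((y[i] - prediction) ** 2 for i in held_out))
    return scores


# Hypothetical activity values, for illustration only.
y = [5.1, 6.3, 4.8, 7.2, 5.9, 6.1, 4.5, 6.8, 5.5, 7.0]
fold_mse = cross_validate(y, k=5)
print(len(fold_mse), round(mean(fold_mse), 3))
```

Each sample is held out exactly once, so every prediction is scored on data the model did not see during fitting. scikit-learn provides KFold and cross_val_score for the same purpose with real estimators.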

4.3 Code and Results:
3.1 Collecting bioactivity data from ChEMBL

The above code will extract a .csv file with the bioactivity data of the target protein.

3.2 Downloading the two files padel.zip and padel.sh

Sample of Data

3.3 Training Model

Running the above code produces the following output.

3.4 Testing Model

References

Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M., & Ahsan, M. J. (2022). Machine
learning in drug discovery: a review. Artificial Intelligence Review, 55(3), 1947-1999.

Patel, L., Shukla, T., Huang, X., Ussery, D. W., & Wang, S. (2020). Machine learning
methods in drug discovery. Molecules, 25(22), 5277.

Carracedo-Reboredo, P., Liñares-Blanco, J., Rodríguez-Fernández, N., Cedrón, F., Novoa, F.
J., Carballal, A., ... & Fernandez-Lozano, C. (2021). A review on machine learning
approaches and trends in drug discovery. Computational and structural biotechnology
journal, 19, 4538-4558.

Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., & Kumar, P. (2021). Artificial
intelligence to deep learning: machine intelligence approach for drug discovery. Molecular
diversity, 25, 1315-1360.

Réda, C., Kaufmann, E., & Delahaye-Duriez, A. (2020). Machine learning applications in
drug development. Computational and structural biotechnology journal, 18, 241-252.

Bagherian, M., Sabeti, E., Wang, K., Sartor, M. A., Nikolovska-Coleska, Z., & Najarian, K.
(2021). Machine learning approaches and databases for prediction of drug–target
interaction: a survey paper. Briefings in bioinformatics, 22(1), 247-269.

Zhang, W., Lin, W., Zhang, D., Wang, S., Shi, J., & Niu, Y. (2019). Recent advances in the
machine learning-based drug-target interaction prediction. Current drug metabolism,
20(3), 194-202.

Monteiro, N. R., Ribeiro, B., & Arrais, J. P. (2020). Drug-target interaction prediction: end-
to-end deep learning approach. IEEE/ACM transactions on computational biology and
bioinformatics, 18(6), 2364-2374.

El-Behery, H., Attia, A. F., El-Fishawy, N., & Torkey, H. (2021). Efficient machine learning
model for predicting drug-target interactions with case study for Covid-19. Computational
Biology and Chemistry, 93, 107536.

Zheng, Y., & Wu, Z. (2021). A machine learning-based biological drug–target interaction
prediction method for a tripartite heterogeneous network. ACS omega, 6(4), 3037-3045.

Hughes, J., Rees, S., Kalindjian, S., & Philpott, K. (2011). Principles of early drug discovery.
British Journal of Pharmacology, 162, 1239-1249.
https://doi.org/10.1111/j.1476-5381.2010.01127.x

Scannell, J., Blanckley, A., Boldon, H., et al. (2012). Diagnosing the decline in
pharmaceutical R&D efficiency. Nat Rev Drug Discov, 11, 191-200.
https://doi.org/10.1038/nrd3681

Harrison, R. (2016). Phase II and phase III failures: 2013-2015. Nat Rev Drug Discov, 15,
817-818. https://doi.org/10.1038/nrd.2016.184
