
Cairo University

Faculty of Graduate Studies for Statistical Research

Department of Information Systems & Technology

Improving the Prediction of Drug-Target Interactions Using Machine Learning

A Project Presented for Fulfillment
For Master Project in Computer Science

Submitted by

1. Fatma Omar Abdelmohsen Mohamed

2. Bassam Tarek Farouk
3. Rada Kamel Saleh

Supervised by

Dr. Tarek ElGahzaly

Cairo, Egypt
Jun 2024
Table of Contents

Abstract
List of Figures
List of Tables
List of Abbreviations
CHAPTER 1: Business Case for Developing a Predictive Drug Discovery ML Model
1.1 Introduction:
1.2. Problem Statement
1.3. Scope:
3.3 Feature Selection
3.4 Model Training and Validation
1.4. Tentative Project Time-line:
1.5. Conclusion:
CHAPTER 2: Literature Review
2.1. Introduction
1.1 Background
2.2. Fundamentals of Drug-Target Interactions
2.3. Machine Learning Techniques in Predicting Drug-Target Interactions
2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction
2.5. Challenges and Limitations
2.6. Future Directions
2.7. Conclusion
CHAPTER 3: System Analysis and design
3.1. System Architecture
3.2. System requirements specification:
2.1 Functional requirements:
2.1.1. Requirement specifications:
2.2 User stories:
2.3 work backlog:
2.3.2 Project duration
2.4. Use-Case Diagram:
2.5 Actor Description:
CHAPTER 4: Conclusion and Future work
4.1 Introduction:
4.3 Code and Results:
3.1 collecting bioactivity data from ChEMBL
3.1 Downloading this 2 files padel.zip and padel.sh
3.3 Training Model
3.4 Testing Model
References

Abstract

Predicting drug-target interactions (DTIs) is crucial for drug discovery, providing insights
into the efficacy and safety of potential therapeutic compounds. Recent advancements in
machine learning (ML) have significantly enhanced the accuracy and efficiency of DTI
predictions by analyzing complex biological data and identifying patterns that traditional
methods may overlook. This project employs a structured methodology comprising several
critical steps. Initially, we focus on data collection, assembling datasets enriched with
specific characteristics such as physicochemical properties essential for absorption,
specificity, and low toxicity. These datasets are then processed to generate mathematical
descriptors from molecular sequences, transforming them into matrices suitable for ML
algorithm processing. Following this, we perform feature selection to identify the most
relevant subset of variables, minimizing redundancy while preserving biologically
significant information. The refined dataset is subsequently used to train and validate ML
models, employing optimal algorithms and robust validation techniques such as cross-
validation. This systematic approach ensures the development of predictive models
capable of enhancing drug discovery by accurately forecasting drug interactions and
potential adverse effects.

List of Figures

Figure 1.1: Drug discovery stages from target identification and validation

Figure 1.2: Stages in the discovery of new drugs in the context of precision medicine

Figure 1.3: Machine learning methodology commonly used for drug discovery

Figure 1.3.2: Generation of Mathematical Descriptors

Figure 1.3.4: Cross-validation

Figure 3.1: Prediction of Drug-Target Interactions Using Machine Learning system components

Figure 2.4.5: Use-Case diagram

List of Tables

Table 2.1.1: Requirement specifications table

Table 2.2.2: User stories table

Table 2.3.3: Work backlog table

Table 2.3.3: Case diagram table

List of Abbreviations

ML: Machine Learning

pd: pandas library

np: NumPy library

CSV: Comma-Separated Values

CHAPTER 1: Business Case for Developing a Predictive Drug Discovery ML Model

1.1 Introduction:

The process of discovering new drugs is fraught with challenges that impede the
development of effective and safe therapeutic compounds. Over recent decades, the
pharmaceutical industry has faced increasing difficulties, including escalating research and
development (R&D) costs and declining approval rates for new drugs. The number of new
drugs approved by the US Food and Drug Administration (FDA) per billion US dollars
(inflation-adjusted) spent on R&D has halved approximately every nine years. This trend,
known as Eroom's Law, reflects a significant decrease in R&D productivity. The pattern of
decline is consistent over various ten-year periods, even when accounting for different
assumptions about the delay between R&D spending and drug approval.

Furthermore, many potential drugs fail during clinical trials due to unforeseen toxicity or
lack of efficacy. Between Phase II (including Phase I/II) and submission, there were 218
reported failures, with 174 of these having stated reasons for failure. These setbacks not
only represent substantial financial losses but also delay the availability of potentially life-
saving treatments.

The complexity of human biology and the vastness of chemical space add to the
challenges. Human biological systems are intricate and unpredictable, making it difficult to
foresee how a drug will interact with these systems. Additionally, the sheer size of chemical
space, which encompasses all possible small molecules, poses a significant hurdle in identifying
compounds with optimal therapeutic properties.

To address these challenges, we propose developing a predictive drug discovery model
using advanced machine learning (ML) techniques. This model aims to enhance the
efficiency and accuracy of predicting drug-target interactions (DTIs), thereby improving the
drug discovery process.

Figure 1.1: Drug discovery stages from target identification and validation through filing
with the FDA (Food and Drug Administration); IND (Investigational New Drug); NDA (New Drug
Application) [4]

1.2. Problem Statement
The drug discovery process faces numerous challenges that significantly impact the
efficiency and success rate of developing new therapeutic compounds. Key issues include
escalating costs, declining approval rates, and unforeseen failures during clinical trials.
These challenges are compounded by the complexities of human biology and the
limitations of chemical space exploration.

Key Challenges

1. Escalating Costs and Declining Approval Rates:

○ The number of new drugs approved by the US Food and Drug Administration
(FDA) per billion US dollars (inflation-adjusted) spent on research and
development (R&D) has halved approximately every nine years. This trend,
often referred to as Eroom's Law (the opposite of Moore's Law), indicates a
significant decrease in R&D productivity.
○ The rate of decline in the approval of new drugs per billion US dollars spent
has been consistent over various ten-year periods. This pattern remains
robust even when considering different assumptions about the average delay
between R&D spending and drug approval.

2. High Failure Rates in Clinical Trials:

○ Many potential drugs fail during clinical trials due to unforeseen toxicity or
lack of efficacy. Between Phase II (including Phase I/II) and submission, there
were 218 reported failures. Of these, 174 had stated reasons for failure,
which were used in the subsequent analysis. These failures represent a
substantial loss of investment and time.
○ Addressing the root causes of these failures is critical for improving the
success rate of drug development.

3. Complex Biological Systems:

○ Human biology is highly complex, and understanding the intricate interactions
between drugs and biological systems poses a significant challenge. This
complexity often leads to unpredictable outcomes in drug efficacy and safety.

4. Limited Exploration of Chemical Space:
○ The vastness of chemical space, which encompasses all possible small
molecules, makes it difficult to identify compounds with optimal therapeutic
properties. Navigating this space efficiently is crucial for discovering effective
drugs.

Figure 1.2 Stages in the discovery of new drugs in the context of precision medicine.

Figure 1.3 Machine Learning methodology commonly used for drug discovery.

1.3. Scope:

In the ML methodology applied in drug discovery, the following steps are differentiated:

1. Data Collection:
○ Gathering datasets with specific characteristics, including physicochemical
properties that aid in absorption, specificity, and low toxicity.
2. Mathematical Descriptors:
○ Generating mathematical descriptors from molecular sequences to convert
them into matrices suitable for ML algorithm processing.
3. Feature Selection:
○ Identifying the best subset of variables to reduce redundant data and retain
biologically relevant information.
4. Model Training and Validation:
○ Training ML models with the optimal subset of variables, selecting
appropriate algorithms, and validating the models using techniques such as
cross-validation.

3.1 Data Collection

The goal is to gather a comprehensive, high-quality dataset that describes various
compounds and their characteristics. The main considerations are outlined below.

Types of Data

The data collected typically includes information about the physicochemical properties of
various compounds, such as molecular weight, polarity, and charge. It must also include
characteristics that allow the compounds to be easily produced and handled in the
laboratory, excluding large proteins or extremely complex molecules. These properties
influence how a compound interacts with biological systems and can be important
predictors of its potential as a drug.

Format of Data

● Main Compounds: The primary focus is on small molecules and peptides.

● Representation Formats:
○ SMILES (Simplified Molecular-Input Line-Entry System): A line notation for
describing chemical structures using short ASCII strings. SMILES encodes the
nodes (atoms) and edges (bonds) of a molecular graph. Although it was developed
in the late 1980s by Daylight Chemical Information Systems, it remains widely used.
○ FASTA Format: A text-based format for representing nucleotide or amino acid
(protein) sequences using single-letter codes.
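
To make the two representation formats concrete, the following minimal sketch shows a few SMILES strings for well-known small molecules and a plain-Python reader for FASTA text. The record names (protein_A, protein_B) and their sequences are made up for illustration and do not come from the project's dataset.

```python
# SMILES strings for three common small molecules (illustrative only).
smiles_examples = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}

# Hypothetical FASTA text: each record starts with a '>' header line,
# followed by one or more lines of single-letter residue codes.
fasta_text = """>protein_A hypothetical example
MKTAYIAKQR
QISFVKSHFS
>protein_B hypothetical example
GSHMLEDPVD
"""


def parse_fasta(text):
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {rid: "".join(parts) for rid, parts in records.items()}


sequences = parse_fasta(fasta_text)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```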

Sources of Data

The data can come from a variety of sources, including:

● Scientific Literature: Research papers and articles that provide relevant compound
information.
● Existing Databases: Public repositories such as DrugBank, PubChem, ChEMBL, and
ZINC store a large amount of useful data for drug discovery.
● Laboratory Experiments: Data generated through experimental research.

Data Quality

High-quality data is crucial, meaning it must be accurate, reliable, and relevant to the
problem being studied. Data cleaning and preprocessing steps are often necessary to
ensure data quality.
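
As an illustration of typical cleaning steps, the sketch below removes records with missing fields, non-positive measurements, and duplicates. The field names (molecule_id, smiles, ic50_nM) and all values are assumptions chosen for this example, not the project's actual schema.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"molecule_id": "M1", "smiles": "CCO", "ic50_nM": "120.0"},
    {"molecule_id": "M2", "smiles": "", "ic50_nM": "45.0"},        # missing structure
    {"molecule_id": "M3", "smiles": "CC(=O)O", "ic50_nM": None},   # missing measurement
    {"molecule_id": "M1", "smiles": "CCO", "ic50_nM": "120.0"},    # duplicate of the first
    {"molecule_id": "M4", "smiles": "c1ccccc1", "ic50_nM": "-5"},  # impossible value
]


def clean_records(records):
    """Drop incomplete, invalid, and duplicate records; convert values to float."""
    cleaned = []
    seen = set()
    for rec in records:
        # Skip records without a structure or without a measurement.
        if not rec["smiles"] or rec["ic50_nM"] is None:
            continue
        value = float(rec["ic50_nM"])
        # A concentration must be strictly positive.
        if value <= 0:
            continue
        key = (rec["molecule_id"], rec["smiles"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, "ic50_nM": value})
    return cleaned


cleaned_records = clean_records(raw_records)
print(cleaned_records)
```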

Data Quantity

The quantity of data is also important as machine learning models often require large
amounts of data to train effectively. Therefore, researchers aim to collect as much data as
possible.

3.2 Mathematical Descriptors

Sequencing Technologies

New sequencing technologies have greatly advanced the generation of sequence data for
DNA, RNA, proteins, small molecules, and more. These sequences serve as the starting
point in drug discovery.

Data Conversion

To make predictions, these sequences need to be converted into matrices that can be
processed by machine learning (ML) algorithms.
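
One simple way to turn sequences into a numeric matrix is the amino acid composition descriptor, sketched below: each protein becomes a vector of 20 residue frequencies. This is only one of many possible descriptors and is not necessarily the one used in this project; the example sequences are hypothetical.

```python
# The 20 standard amino acids, in a fixed order that defines the columns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


# Hypothetical sequences; each row of the matrix describes one sequence.
example_sequences = ["MKTAYIAK", "GGAVL"]
feature_matrix = [composition_vector(seq) for seq in example_sequences]

for seq, row in zip(example_sequences, feature_matrix):
    print(seq, [round(value, 3) for value in row])
```

Every row has the same length regardless of sequence length, which is what makes the result usable as an input matrix for an ML algorithm.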

Labeling

In drug discovery, supervised learning models are commonly used. The labels defined by
the researchers are essential to the experimental process because they provide the
targets needed to train the models.

Data Processing

Once mathematical descriptors are generated, a dataset is obtained which the ML model
can process. This dataset is typically divided into two subsets: a larger one for training the
model and a smaller one for testing it.
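
The division into a larger training subset and a smaller test subset can be sketched as follows. The 80/20 ratio and the fixed random seed are assumptions for illustration; the helper name train_test_split is defined here and is not a library call.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Ten placeholder rows standing in for descriptor vectors.
all_rows = list(range(10))
train_rows, test_rows = train_test_split(all_rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```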

Figure 1.3.2 Generation of Mathematical Descriptors

3.3 Feature Selection

The generation of mathematical descriptors produces a large number of numerical
variables. The main objective of this step is to remove as many uninformative or
redundant variables as possible. Several techniques are used for this purpose:

PCA: Principal Component Analysis

Principal Component Analysis (PCA) is one of the oldest and most extensively used
approaches to reduce the dimensionality of large datasets. It transforms a large set of
variables into a smaller one that retains most of the information in the original set. PCA
works by finding orthogonal vectors in a dataset that account for the greatest amount of
variation, where each orthogonal vector is a linear combination of all the features in the
original dataset.
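
A minimal sketch of PCA is given below, assuming NumPy is available. It centers the data and uses the singular value decomposition to obtain the orthogonal directions of greatest variation. The small matrix is invented for illustration; its third column is exactly twice the first, so two components capture all of the variance.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    X = np.asarray(X, dtype=float)
    # Center each feature so the components describe variation around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    projected = X_centered @ components.T
    return projected, components, explained_ratio


# Toy data: 5 samples, 3 features (third feature duplicates the first, scaled).
X = [
    [1, 2, 2],
    [2, 1, 4],
    [3, 4, 6],
    [4, 3, 8],
    [5, 6, 10],
]
projected, components, explained_ratio = pca(X, n_components=2)
print(projected.shape)
print(explained_ratio)
```

Each row of components is a linear combination of all original features, which is why the reduced variables are harder to interpret than a selected subset of the original ones.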

FS: Feature Selection

Feature Selection (FS) techniques obtain a subset of features from the original set
without modifying the variables themselves. Because the selected variables keep their
original meaning, the results can be justified at a biological level, which is why a large
majority of researchers use these techniques in their experimental designs.
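
As one possible illustration of feature selection, the sketch below applies two simple filters: it drops constant variables and then drops any variable that is almost perfectly correlated with one already kept. The 0.95 threshold, the column names, and the values are assumptions for this example; other FS techniques exist and the project may use a different one.

```python
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation between two equally long, non-constant lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_features(columns, corr_threshold=0.95):
    """Return the names of the columns kept after both filters."""
    kept = []
    for name, values in columns.items():
        # Filter 1: a constant column carries no information.
        if pvariance(values) == 0:
            continue
        # Filter 2: skip columns that duplicate an already kept column.
        if any(abs(pearson(values, columns[k])) > corr_threshold for k in kept):
            continue
        kept.append(name)
    return kept


# Hypothetical descriptor columns (values are made up).
columns = {
    "mol_weight": [180.2, 46.1, 60.1, 78.1],
    "mol_weight_x2": [360.4, 92.2, 120.2, 156.2],  # redundant: twice mol_weight
    "constant_flag": [1, 1, 1, 1],                 # uninformative
    "logp": [1.2, -0.3, -0.2, 2.1],
}
selected = select_features(columns)
print(selected)
```

The kept columns are original variables, so they remain directly interpretable.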

3.4 Model Training and Validation

Selection of Algorithms and Parameters

The choice of algorithms and their parameters is critical to ensure they are suitable for the
problem at hand, as well as for the quantity and type of data available. Careful selection
helps optimize model performance and accuracy.

Training and Validation

Multiple experimental runs are conducted using the training data. The original dataset is
split into two subsets: the training set and the validation set. This division allows for the
assessment of the model's performance on unseen data.

Cross-Validation (CV)

Cross-validation (CV) is a technique employed to evaluate the generalization ability of the
model during the training phase. It assesses the model’s performance and estimates its
effectiveness with unknown data. In the basic approach, known as k-fold CV, the training
set is divided into k smaller subsets. The model is trained on k-1 of these subsets and
validated on the remaining subset. This process is repeated k times, with each of the k
subsets used exactly once as the validation data. By evaluating the model on multiple
validation sets, CV provides a more realistic estimate of the model’s generalization
performance, measuring its ability to perform well on new, unseen data.
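
The k-fold procedure described above can be sketched with a small index generator. The function name k_fold_indices is defined here for illustration; in practice the samples would usually be shuffled before being divided into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train} validation={validation}")
```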

Parameter Tuning

The ultimate goal of the cross-validation (CV) process is to identify the optimal
combination of parameters for each algorithm. Fine-tuning these parameters is crucial to
maximize model performance.
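
Parameter tuning with cross-validation can be sketched as a small grid search. The example below is a toy setup under stated assumptions: a one-dimensional nearest-neighbour classifier written from scratch, an invented dataset, and a candidate grid of 1, 3, and 5 neighbours. It is not the model or the grid used in this project.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in nearest[:n_neighbors])
    return 1 if votes * 2 > n_neighbors else 0


def cv_accuracy(xs, ys, n_neighbors, k=4):
    """Fraction of samples predicted correctly while held out for validation."""
    correct = 0
    for train, validation in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in validation
        )
    return correct / len(xs)


# Toy data: small values belong to class 0, large values to class 1.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.35, 0.65]
ys = [0, 1, 0, 1, 0, 1, 0, 1]

grid = [1, 3, 5]
scores = {n: cv_accuracy(xs, ys, n) for n in grid}
best_n_neighbors = max(scores, key=scores.get)
print(scores, "best:", best_n_neighbors)
```

The parameter value with the highest cross-validated score is retained, and only then is the held-out test set used.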

Performance Measurement

Once the parameters are selected, the performance of each model is evaluated. The best
model is the one that achieves the highest performance metrics while maintaining the
lowest overall cost.

Final Validation

The final validation involves using the test set, which was initially set aside from the
original dataset, to validate the best model obtained from the CV process. This step
ensures that the model's performance on the test set is consistent with the CV results.

Figure 1.3.4 Cross-Validation

If the final validation results are statistically significant, this indicates that a robust predictive
model has been developed. This process is essential in machine learning to confirm the
model’s reliability and robustness. It helps guard against overfitting, a situation where the
model performs exceptionally well on training data but poorly on new, unseen data.

1.4. Tentative Project Time-line:

1. Week 1-2: Data collection and preprocessing

● Collect data from ChEMBL database

● Clean and preprocess the data

2. Week 3-4: Model selection and development

● Select appropriate machine learning models for classification

● Develop and train models on the preprocessed data
● Fine-tune models using hyperparameter tuning techniques

3. Week 5-6: Evaluation and testing

● Evaluate the performance of the models using appropriate metrics

● Compare the results with other approaches and datasets
● Test the models on unseen data to assess generalization performance

4. Week 7-8: Documentation and reporting

● Write a comprehensive report of the project including the methodology,
results, and analysis
● Create visualizations and presentations to support the report
● Review the report and prepare it for submission

1.5. Conclusion:
In conclusion, the Prediction of Drug-Target Interactions Using Machine Learning
project aims to develop an accurate model for predicting drug-target interactions
based on a dataset extracted from the ChEMBL database. The project objectives are
specific, measurable, achievable, relevant, and time-bound, ensuring that the
project is well-defined and manageable. The project timeline is tentative, allowing
for flexibility in case of unexpected challenges. If successful, the project has the
potential to contribute to the development of drug-target interaction prediction
methods.

CHAPTER 2: Literature Review

2.1. Introduction

1.1 Background

Drug discovery involves identifying potential therapeutic compounds and their interactions
with biological targets. Predicting drug-target interactions (DTIs) is crucial for
understanding drug efficacy and safety. Bioinformatics employs computational tools and
data analysis to understand biological data, playing a pivotal role in drug discovery by
identifying potential drug targets and elucidating their mechanisms of action.

1.2 Significance of Machine Learning in Drug Discovery

Machine learning (ML), a branch of artificial intelligence (AI), has transformed various
fields, including drug discovery. ML algorithms can analyze extensive datasets, uncovering
patterns and relationships that might not be evident through traditional methods. This
capability is particularly advantageous in predicting DTIs, where complex biological data
can be efficiently processed to identify potential interactions.

1.3 Objectives of the Literature Review

This literature review aims to synthesize current research on ML methods for predicting
DTIs, highlighting key advancements, challenges, and future directions. By examining
various ML approaches and their applications, this review seeks to provide a
comprehensive understanding of the field's current state and its potential impact on drug
discovery.

2.2. Fundamentals of Drug-Target Interactions

2.1 Biological Basis of Drug-Target Interactions

DTIs occur when drug molecules bind to biological targets such as proteins, enzymes, or
receptors, influencing their function. Understanding these interactions is critical for
developing effective therapeutics. The binding process can involve various types of
chemical bonds and molecular forces, which determine the specificity and strength of the
interaction.

2.2 Experimental Methods for Identifying Drug-Target Interactions

Traditional methods for identifying DTIs include high-throughput screening, in vitro and in
vivo assays, and computational docking. High-throughput screening involves testing large
libraries of compounds against biological targets to identify potential interactions. In vitro
and in vivo assays provide detailed insights into the biological activity of compounds.
Computational docking predicts how small molecules interact with target proteins based
on their three-dimensional structures.

2.3. Machine Learning Techniques in Predicting Drug-Target Interactions

3.1 Overview of Machine Learning Algorithms

ML algorithms are broadly classified into supervised, unsupervised, and deep learning
techniques:

Supervised Learning: Algorithms like support vector machines (SVMs), random forests, and
neural networks are trained on labeled datasets to predict DTIs.

Unsupervised Learning: Techniques such as clustering and dimensionality reduction help
uncover hidden patterns in unlabeled data, which can be useful for understanding
complex biological systems.

Deep Learning: Advanced models like convolutional neural networks (CNNs) and recurrent
neural networks (RNNs) can automatically extract features from raw data, improving the
accuracy of DTI predictions.

3.2 Data Sources and Feature Engineering

ML models for DTI prediction utilize diverse data sources, including chemical properties,
genomic data, and proteomic data. Effective feature engineering techniques are crucial for
transforming raw data into meaningful inputs for ML models. Common approaches include
molecular descriptors, fingerprints, and embeddings.

3.3 Model Training and Evaluation

Training ML models involves splitting data into training, validation, and test sets. Cross-
validation techniques ensure the models generalize well to unseen data. Performance
metrics such as accuracy, precision, recall, F1 score, and ROC-AUC are used to evaluate
model effectiveness.
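
For a binary interaction label (1 = interacts, 0 = does not), accuracy, precision, recall, and F1 score follow directly from the counts of true and false positives and negatives. The sketch below uses invented labels and predictions; ROC-AUC is omitted because it requires ranked scores rather than hard predictions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```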

2.4. Current Advances in Machine Learning for Drug-Target Interaction Prediction

4.1 Predictive Models

Recent advancements have led to the development of sophisticated ML models for DTI
prediction. For example, Patel et al. (2020) discuss various ML methods that have shown
promise in drug discovery, highlighting their ability to predict DTIs accurately. Carracedo-
Reboredo et al. (2021) provide an overview of trends in ML approaches, emphasizing the
integration of different data types to enhance prediction accuracy. Additionally, Zhang et
al. (2019) review recent advances in ML-based DTI prediction, showcasing progress in
algorithm development and application.

4.2 Integrative Approaches

Integrative approaches combine multi-omics data, enabling a comprehensive
understanding of biological systems. Network-based methods and graph neural networks
(GNNs) model the interactions within biological networks, improving DTI predictions by
considering the broader context of molecular interactions (Bagherian et al., 2021).

Monteiro et al. (2020) present an end-to-end deep learning approach for DTI prediction,
demonstrating the power of deep learning in capturing complex biological relationships.

4.3 Handling Imbalanced Data

Class imbalance is a significant challenge in DTI datasets, where negative samples often
outnumber positive samples. Techniques such as the synthetic minority over-sampling
technique (SMOTE) and cost-sensitive learning help address this issue, improving model
performance on imbalanced datasets (Gupta et al., 2021). El-Behery et al. (2021) developed
an efficient ML model for predicting DTIs that specifically addresses data imbalance in the
context of COVID-19 drug discovery.
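
The idea behind cost-sensitive learning can be illustrated with class weights that are inversely proportional to class frequency, so that errors on the rare positive class cost more. The sketch below shows only this weighting idea on invented labels; it does not implement SMOTE, and it is not taken from the cited works.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Error rate in which each sample counts according to its class weight."""
    mistakes = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / sum(weights[t] for t in y_true)


# Invented imbalanced labels: 8 non-interacting pairs, 2 interacting pairs.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)

# A trivial model that always predicts "no interaction".
always_negative = [0] * len(labels)
plain_error = sum(t != p for t, p in zip(labels, always_negative)) / len(labels)
error = weighted_error(labels, always_negative, weights)
print(weights, plain_error, error)
```

Without weighting, the trivial model looks acceptable with an error of 0.2; with balanced weights its error rises to 0.5, which exposes that it never finds a positive sample.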

2.5. Challenges and Limitations

5.1 Interpretability and Explainability

Interpretable models are essential for understanding the biological significance of
predicted DTIs. Efforts to enhance model transparency include feature importance analysis
and the development of interpretable algorithms (Réda et al., 2020). Zheng and Wu (2021)
propose an ML-based DTI prediction method for a tripartite heterogeneous network,
focusing on improving model interpretability.

5.2 Data Quality and Availability

Data quality and availability are critical for training reliable ML models. Issues such as
noise, biases, and incomplete data can hinder model performance. Standardizing data
collection and improving data sharing practices can help address these challenges (Dara et
al., 2022).

5.3 Generalization and Robustness

Ensuring ML models generalize well across different datasets is crucial for their practical
application. Techniques such as domain adaptation and transfer learning can improve
model robustness, enabling them to perform well on diverse datasets (Patel et al., 2020).

2.6. Future Directions

6.1 Emerging Machine Learning Techniques

Emerging techniques such as transfer learning and meta-learning hold promise for DTI
prediction. These approaches leverage knowledge from related tasks to improve model
performance on new tasks, addressing issues related to limited data availability (Bagherian
et al., 2021).

6.2 Integration with Other Computational Methods

Integrating ML with other computational methods, such as molecular dynamics
simulations and quantum computing, can enhance DTI predictions. These synergies enable
more accurate modeling of molecular interactions and their dynamics (Gupta et al., 2021).

6.3 Personalized Medicine

ML has the potential to revolutionize personalized medicine by predicting DTIs based on
individual genetic profiles. This approach can lead to tailored treatment plans, improving
therapeutic outcomes for patients (Carracedo-Reboredo et al., 2021).

2.7. Conclusion

7.1 Summary of Key Findings

ML has significantly advanced the field of DTI prediction, offering powerful tools for drug
discovery. Key findings from the reviewed literature highlight the effectiveness of various
ML approaches and their integration with biological data.

7.2 Implications for Drug Discovery

The application of ML in DTI prediction holds substantial potential for accelerating drug
discovery, reducing costs, and improving the success rate of new therapeutics.

7.3 Final Thoughts

Continued innovation in ML methods and interdisciplinary collaboration will further
enhance our ability to predict DTIs, ultimately leading to more effective and personalized
treatments.

CHAPTER 3: System Analysis and Design

3.1. System Architecture

The architecture of the Improving the Prediction of Drug-Target Interactions Using
Machine Learning system consists of four components: Data Collection, Mathematical
Descriptors, Feature Selection, and Model Training and Validation. These components are
described in detail below.

● Data Collection: Gathering datasets with specific characteristics, including
physical-chemical properties that aid in absorption, specificity, and low toxicity.

● Mathematical Descriptors: Generating mathematical descriptors from molecular
sequences to convert them into matrices for ML algorithm processing.

● Feature Selection: Identifying the best subset of variables to reduce redundant data
and retain biologically relevant information.

● Model Training and Validation: Training ML models with the optimal subset of
variables, selecting appropriate algorithms, and validating the model using
techniques like cross-validation.
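
The last two components can be sketched with scikit-learn as a single pipeline. This is only an illustration under stated assumptions: the descriptor matrix is replaced by synthetic binary fingerprints, feature selection is approximated by a low-variance filter (VarianceThreshold), and a random forest regressor stands in for whichever algorithm is selected. None of these choices is confirmed by the project itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline


def build_pipeline(variance_threshold=0.01, seed=0):
    """Feature selection followed by a model (both are assumed choices)."""
    return Pipeline([
        # Drop descriptor columns that are (almost) constant across molecules.
        ("select", VarianceThreshold(threshold=variance_threshold)),
        ("model", RandomForestRegressor(n_estimators=50, random_state=seed)),
    ])


def cross_validate(X, y, n_splits=5, seed=0):
    """Return one R^2 score per fold of a shuffled k-fold split."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(build_pipeline(seed=seed), X, y, cv=cv, scoring="r2")


# Synthetic stand-in for a descriptor matrix: 120 molecules, 30 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 30)).astype(float)
X[:, :5] = 0.0  # five constant (uninformative) columns
y = 2.0 * X[:, 5] + X[:, 6] + rng.normal(0.0, 0.1, size=120)

scores = cross_validate(X, y)
print("mean cross-validated R^2:", scores.mean())
```

Keeping the selector inside the pipeline means it is refitted on the training part of every fold, so the validation folds do not influence which features are kept.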

Figure 3.1 Improving the Prediction of Drug-Target Interactions Using Machine Learning
system components

3.2. System requirements specification:

Due to project scope, in this section we will discuss requirements analysis through
functional and non-functional requirements.

2.1 Functional requirements:

2.1.1. Requirement specifications:

Table 2.1.1: Requirement specifications table

Identifier  Priority  Requirement
REQ-01      5         The system shall be able to extract data from the ChEMBL database.
REQ-02      5         The system shall be able to define the target protein as stated.
REQ-03      5         The system shall clean and prepare the data for ML.
REQ-04      5         The system shall retrieve only bioactivity data for.
REQ-05      5         The system shall filter using standard type "IC50".
REQ-06      5         The system shall install the following libraries: chembl_webresource_client, scikit-learn, numpy, and pandas.
REQ-07      5         The system shall train and evaluate the performance of each machine learning algorithm using cross-validation.

2.2 User stories:
Table 2.2.2: User stories table

Identifier  User story                                                                                   Size
ST-01       As a developer, I need to collect target protein data from the ChEMBL database.              10pt
ST-02       As an ML engineer, I need to preprocess collected data.                                      6pt
ST-03       As an ML engineer, I need to analyze and visualize collected data.                           2pt
ST-04       As a developer, I need to implement machine learning models for drug discovery prediction.   18pt
ST-05       As a developer, I need to evaluate the performance of implemented models.                    4pt
ST-06       As a software engineer, I need to document the project and present the findings.             18pt

2.3 work backlog:

Table 2.3.3: Work backlog table

Work item no  User story                                 Iteration    Estimated work duration
01            ST-01 Collect dataset of target protein    Iteration 1  4pt (2 days)
02            ST-02 Preprocess the collected data        Iteration 2  6pt (3 days)
04            ST-04 Implement machine learning models    Iteration 4  18pt (9 days)
05            ST-05 Implement cross-validation           Iteration 5  4pt (2 days)
06            ST-06 Document the project                 Iteration 6  18pt (9 days)

2.3.2 Project duration

Work duration = path size / travel velocity

Path size = total work size = 50 pt, and the velocity implied by the backlog estimates is
2 pt per day, so work duration = 50 / 2 = 25 days.
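
The same arithmetic can be checked in a few lines. The point values are copied from the work backlog table, and the velocity of 2 points per day is an assumption inferred from entries such as "4pt (2 day)".

```python
# Story-point estimates copied from the work backlog table.
backlog_points = {"ST-01": 4, "ST-02": 6, "ST-04": 18, "ST-05": 4, "ST-06": 18}

# Assumed velocity, inferred from entries such as "4pt (2 day)".
VELOCITY_POINTS_PER_DAY = 2


def work_duration_days(points, velocity):
    """Work duration = path size (total points) / travel velocity."""
    return sum(points.values()) / velocity


total_points = sum(backlog_points.values())
duration = work_duration_days(backlog_points, VELOCITY_POINTS_PER_DAY)
print(total_points, duration)  # 50 25.0
```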

2.4. Use-Case Diagram:

[Use-case diagram. Actors: Research Scientist, ML Engineer. Use cases: Collect Data,
Generate Mathematical Descriptors, Select Features, Train Model, Validate Model,
Predict Drug-Target Interactions.]

Figure 2.4.5: Use-Case diagram
2.5 Actor Description:
Table 2.5.4: Actor description table

Actor               Actor's Goal
Research Scientist  Collect Data; Select Features
ML Engineer         Generate Mathematical Descriptors; Train Model; Validate Model

CHAPTER 4: Conclusion and Future Work

4.1 Introduction:
We trained several models to reach the highest accuracy we could on this dataset. The
purpose of stating the results here is to conclude our work and mark a checkpoint before
the next milestone. This topic is of considerable importance to the drug discovery field.

- The dataset used in this project is a real-world dataset collected from ChEMBL, which
is a manually curated database of bioactive molecules with drug-like properties. It
brings together chemical, bioactivity and genomic data to aid the translation of
genomic information into effective new drugs.

- Generating mathematical descriptors from molecular sequences to convert them
into matrices for ML algorithm processing.

- Training ML models with the optimal subset of variables, selecting appropriate
algorithms, and validating the model using techniques like cross-validation.

4.3 Code and Results:
3.1 Collecting bioactivity data from ChEMBL
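
As an illustration of this step, the following is a minimal sketch rather than the project's original listing. It assumes the chembl_webresource_client package for the query and pandas for cleaning. The helper names (fetch_ic50_records, to_bioactivity_frame), the selected columns, the deduplication rule and the sample records with placeholder identifiers are all hypothetical.

```python
import pandas as pd


def fetch_ic50_records(target_chembl_id):
    """Query ChEMBL for IC50 activities of one target (needs network access)."""
    # Imported lazily so the cleaning helper below works without the client.
    from chembl_webresource_client.new_client import new_client

    activities = new_client.activity.filter(
        target_chembl_id=target_chembl_id
    ).filter(standard_type="IC50")
    return list(activities)


def to_bioactivity_frame(records):
    """Keep IC50 rows that have both a structure and a numeric value."""
    columns = ["molecule_chembl_id", "canonical_smiles",
               "standard_type", "standard_value"]
    df = pd.DataFrame.from_records(records)[columns]
    df = df[df["standard_type"] == "IC50"]
    df = df.dropna(subset=["canonical_smiles", "standard_value"])
    df = df.assign(
        standard_value=pd.to_numeric(df["standard_value"], errors="coerce")
    )
    df = df.dropna(subset=["standard_value"])
    # Assumed rule: keep one row per structure.
    return df.drop_duplicates(subset="canonical_smiles").reset_index(drop=True)


# Hypothetical records shaped like ChEMBL activity entries (placeholder IDs).
sample = [
    {"molecule_chembl_id": "CHEMBL_A", "canonical_smiles": "CCO",
     "standard_type": "IC50", "standard_value": "120.0"},
    {"molecule_chembl_id": "CHEMBL_B", "canonical_smiles": None,
     "standard_type": "IC50", "standard_value": "50.0"},
    {"molecule_chembl_id": "CHEMBL_C", "canonical_smiles": "CCN",
     "standard_type": "Ki", "standard_value": "10.0"},
    {"molecule_chembl_id": "CHEMBL_D", "canonical_smiles": "CCC",
     "standard_type": "IC50", "standard_value": None},
]

frame = to_bioactivity_frame(sample)
# Without a path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = frame.to_csv(index=False)
print(csv_text)
```

With a live connection, the sample list would be replaced by the result of fetch_ic50_records called with the ChEMBL ID of the chosen target.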

The above code exports a .csv file containing the bioactivity data of the target protein.

3.2 Downloading the two files padel.zip and padel.sh

Sample of Data

3.3 Training Model

Running the code above produces the following result:

3.4 Testing Model
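
One common way to test a trained model is to hold out part of the data and score predictions on it. The sketch below is not the project's original listing. It assumes a regression target, a random forest regressor, an 80/20 split and synthetic binary descriptors, none of which is confirmed by the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for descriptors (X) and activity values (y).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Hold out 20% of the molecules; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score predictions on the held-out set only.
y_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print("test R^2:", test_r2, "test RMSE:", test_rmse)
```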

References

Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M., & Ahsan, M. J. (2022). Machine
learning in drug discovery: a review. Artificial Intelligence Review, 55(3), 1947-1999.

Patel, L., Shukla, T., Huang, X., Ussery, D. W., & Wang, S. (2020). Machine learning
methods in drug discovery. Molecules, 25(22), 5277.

Carracedo-Reboredo, P., Liñares-Blanco, J., Rodríguez-Fernández, N., Cedrón, F., Novoa, F.
J., Carballal, A., ... & Fernandez-Lozano, C. (2021). A review on machine learning
approaches and trends in drug discovery. Computational and structural biotechnology
journal, 19, 4538-4558.

Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., & Kumar, P. (2021). Artificial
intelligence to deep learning: machine intelligence approach for drug discovery. Molecular
diversity, 25, 1315-1360.

Réda, C., Kaufmann, E., & Delahaye-Duriez, A. (2020). Machine learning applications in
drug development. Computational and structural biotechnology journal, 18, 241-252.

Bagherian, M., Sabeti, E., Wang, K., Sartor, M. A., Nikolovska-Coleska, Z., & Najarian, K.
(2021). Machine learning approaches and databases for prediction of drug–target
interaction: a survey paper. Briefings in bioinformatics, 22(1), 247-269.

Zhang, W., Lin, W., Zhang, D., Wang, S., Shi, J., & Niu, Y. (2019). Recent advances in the
machine learning-based drug-target interaction prediction. Current drug metabolism,
20(3), 194-202.

Monteiro, N. R., Ribeiro, B., & Arrais, J. P. (2020). Drug-target interaction prediction: end-
to-end deep learning approach. IEEE/ACM transactions on computational biology and
bioinformatics, 18(6), 2364-2374.

El-Behery, H., Attia, A. F., El-Fishawy, N., & Torkey, H. (2021). Efficient machine learning
model for predicting drug-target interactions with case study for Covid-19. Computational
Biology and Chemistry, 93, 107536.

Zheng, Y., & Wu, Z. (2021). A machine learning-based biological drug–target interaction
prediction method for a tripartite heterogeneous network. ACS omega, 6(4), 3037-3045.

Hughes, J., Rees, S., Kalindjian, S. and Philpott, K. (2011), Principles of early drug discovery.
British Journal of Pharmacology, 162: 1239-1249.
https://doi.org/10.1111/j.1476-5381.2010.01127.x

Scannell, J., Blanckley, A., Boldon, H. et al. Diagnosing the decline in pharmaceutical R&D
efficiency. Nat Rev Drug Discov 11, 191–200 (2012).
https://doi.org/10.1038/nrd3681

Harrison, R. Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 15, 817–818
(2016). https://doi.org/10.1038/nrd.2016.184
