Minor 1
Mr. S. Rajesh,
Assistant Professor
CERTIFICATE
DECLARATION
We hereby declare that the Mini Project entitled “PAIRWISE SEQUENCE ALIGNMENT
IN BIOLOGICAL SEQUENCES USING MACHINE LEARNING” submitted for the B.Tech
Degree is our original work and the dissertation has not formed the basis for
the award of any degree, associateship, fellowship or any other similar titles.
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our guide, Mr. S. Rajesh,
Assistant Professor, for his persistent encouragement, everlasting patience, and
keen interest in discussion, and for the numerous suggestions we received at
every phase of this project.
Abstract
Table of Contents
1 INTRODUCTION
1.1 Basic Concepts
1.1.1 Sequence Alignment
1.1.2 Pairwise Sequence Alignment
1.1.3 Match, Mismatch, and Gap
1.1.4 Alignment Score
1.1.5 Multilayer Perceptron (MLP):
1.1.6 Particle Swarm Optimization (PSO)
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope
1.6 Applications
2 LITERATURE REVIEW
2.1 Sequence Alignment Using Machine Learning-Based Needleman–Wunsch Algorithm
2.2 Performance-Based Analogising of Needleman Wunsch Algorithm to Align DNA Sequences Using GPU and FPGA
2.3 Accelerating DNA pairwise sequence alignment using FPGA and a customized convolutional neural network
2.4 Local Alignment of DNA Sequence Based on Deep Reinforcement learning
2.5 Simple and Efficient Pattern Matching Algorithms for Biological Sequences
2.6 GenieHD: Efficient DNA Pattern Matching Accelerator Using Hyper-dimensional Computing
2.7 Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm
2.8 SLPal: Accelerating Long Sequence Alignment on Many-Core and Multi-Core Architectures
3.2.1 Python 3.9:
3.3 Functional Requirements
3.4 Non-Functional Requirements
4 SOFTWARE DESIGN
4.1 Software Development Lifecycle
4.2 UML Diagrams
4.2.1 Use-Case Diagram
4.2.2 Sequence Diagram
4.2.3 Activity Diagram
5 PROPOSED SYSTEM
5.1 Process Flow Diagram
5.2 Methodology
5.2.1 Data Preprocessing:
5.2.2 Model building and Training:
5.2.3 Hyperparameter Tuning:
5.2.4 Model Evolution and Comparison
5.3 Dataset Collection
6 IMPLEMENTATION
6.1 Output Screenshots
6.1.1 Epoch Output
6.1.2 Accuracy-Epoch Curve
6.1.3 Loss-Epoch Curve
6.1.4 Confusion Matrix
6.1.5 Classification Report
6.2 Results and Analysis
REFERENCES
List of Figures
Chapter 1
INTRODUCTION
1.1 Basic Concepts
• Sequence Alignment
• Match, Mismatch, Gap
• Alignment Score
• Multilayer Perceptron
1.1.5 Multilayer Perceptron (MLP):
MLP is an artificial neural network architecture commonly used in machine
learning. It consists of multiple layers of interconnected nodes (neurons) that
process input data and generate output predictions. MLP is known for its ability
to learn complex patterns and relationships in data through a process called
training. In this project, MLP is utilized to capture patterns and relationships
between nucleotides in DNA sequences, enabling it to predict alignments for new
sequences.
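As a rough illustration of the layered structure described above, the following
sketch runs a forward pass through a tiny MLP in plain Python. The layer sizes,
the hand-picked weights, the sigmoid activation, and the function names (dense,
mlp_forward) are illustrative assumptions, not the network used in this project.

```python
import math

def sigmoid(x):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def mlp_forward(inputs, layers):
    """Pass the input through each (weights, biases) layer in turn."""
    out = inputs
    for weights, biases in layers:
        out = dense(out, weights, biases, sigmoid)
    return out

# Hypothetical 4-input, 3-hidden, 1-output network with hand-picked weights.
layers = [
    ([[0.2, -0.1, 0.4, 0.0],
      [-0.3, 0.5, 0.1, 0.2],
      [0.0, 0.1, -0.2, 0.3]], [0.0, 0.1, -0.1]),
    ([[0.6, -0.4, 0.2]], [0.05]),
]

# One-hot vector for a single nucleotide (assumed order A, C, G, T).
prediction = mlp_forward([1.0, 0.0, 0.0, 0.0], layers)
```

Training would adjust the weights and biases so that the output matches known
targets; the sketch only shows how data flows from the input layer to the output.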
1.2 Motivation
Traditional sequence alignment algorithms, when run over large databases, cannot
deliver results within reasonable time, power, and cost budgets. Faster and more
efficient alignment algorithms would let researchers analyze large amounts of
biological data quickly and accurately, which in turn can accelerate scientific
discovery and facilitate the development of new treatments for diseases.
solution for pairwise sequence alignment in DNA sequences using MLP and PSO.
1.4 Objectives
• To study existing technologies and identify their drawbacks and limitations.
• To train the model on a large dataset of DNA sequences and optimize it to
improve its performance.
1.5 Scope
• The scope of the project is limited to Pairwise Sequence Alignment.
1.6 Applications
• Disease analysis: Identify genetic variations and mutations associated with
diseases by aligning DNA sequences of affected individuals with reference
sequences.
Chapter 2
LITERATURE REVIEW
• The new deep learning model of this paper improves the automatic
classification performance of breast cancer.
• A new method based on CNN model for breast cancer screening and diagnosis
is proposed.
Disadvantages:
Advantages:
• Fast Computation
Disadvantages:
• High speed: The proposed approach can achieve significant speedups compared
to traditional implementations of sequence alignment algorithms.
Disadvantages:
the subsequences that have the highest degree of similarity between two input
sequences. The proposed approach employs a convolutional neural network (CNN)
to learn the sequence features and a DRL model to make optimal alignment
decisions. The DRL model learns to maximize the reward function by aligning the
two sequences to obtain the highest similarity score. The approach is evaluated
on standard benchmark datasets and compared with existing state-of-the-art
methods.
Advantages:
• The proposed DRL-based approach achieves higher accuracy and faster
processing times compared to existing methods.
Disadvantages:
• The approach requires a large amount of training data to learn the optimal
alignment strategy.
• The paper presents two novel algorithms for pattern matching in biological
sequences that are simple and efficient.
• The proposed algorithms have low memory requirements and are well-suited
for use in resource-constrained settings.
Disadvantages:
• The paper does not compare the proposed algorithms to some of the more
recent, state-of-the-art pattern matching algorithms, which could limit their
utility in cutting-edge research.
2.6 GenieHD: Efficient DNA Pattern Matching Accelerator Using
Hyper-dimensional Computing
Methodology: This study proposes GenieHD, which efficiently parallelizes the
DNA pattern matching problem. It uses hyperdimensional (HD) computing, a
brain-inspired paradigm that imitates pattern-based computation in human memory,
to convert the inherently sequential processes of DNA pattern matching into
highly parallelizable workloads. The proposed method first encodes the complete
genome sequence and the target DNA pattern into high-dimensional vectors. Once
encoded, a simple operation on these vectors reveals whether the desired pattern
is present anywhere in the sequence. The study also proposes an accelerator
architecture that drastically reduces the number of memory accesses while
efficiently parallelizing HD-based DNA pattern matching. To meet the needs of
the target system, the architecture can be implemented on a variety of parallel
computing platforms.
Advantages:
Disadvantages:
• When the intermolecular distance between the points is calculated, it may
differ from the normally measured distance.
on a GPU requires programming expertise and familiarity with GPU computing
frameworks such as CUDA or OpenCL. Furthermore, optimized memory access
patterns, load balancing, and optimizations specific to the alignment algorithm
are essential for achieving efficient performance.
Advantages:
• This paper covers the problems of pattern mining and its related
applications.
Disadvantages:
• Not all techniques related to pattern mining are covered in this paper.
Disadvantages:
• Implementation Complexity: Implementing and optimizing SLPal for specific
hardware architectures and alignment algorithms can be complex.
Chapter 3
SOFTWARE REQUIREMENT
ANALYSIS
• Memory: 256GB
Development Environment: Google Colab
3.2.1 Python 3.9:
Python is a general-purpose, versatile programming language that is concise and
easy to read, which makes it a good first language. It is widely used in areas
ranging from web development to machine learning and data science. Because
Python code is simple to write and to understand, working with code from another
developer or with third-party components requires little cognitive overhead,
which matters because code is read more often than it is written. A wide choice
of libraries is one of the main reasons Python is among the most popular
programming languages used for AI. A library is a module or a group of modules,
typically published through a repository such as PyPI, containing pre-written
code that gives users ready-made functionality. Python libraries provide
base-level building blocks so that developers do not have to code them from
scratch every time. ML requires continuous data processing, and Python’s
libraries let you access, handle, and transform data. These include NumPy,
SciPy, scikit-learn, and many more, which together cover the core tasks of
machine learning.
3.3 Functional Requirements
• Input DNA Sequences: The system should allow users to input DNA sequences
for alignment, either by manual entry or by importing from external files.
Chapter 4
SOFTWARE DESIGN
4. The project design is developed, considering the architecture and key
components of the alignment model.
6. Testing is performed using dedicated testing data to assess the model’s
performance and effectiveness.
7. Model evaluation is conducted during the testing phase to measure the
alignment model’s accuracy and reliability.
9. In case additional requirements arise from users, the iterative cycle repeats,
allowing for modifications and enhancements to be made to the alignment
model.
Figure 4.1: Iterative Model[14]
Figure 4.2: Use Case Diagram[15]
4.2.2 Sequence Diagram
In software engineering, a sequence diagram (or system sequence diagram) shows
process interactions arranged in time order. Figure 4.3 presents the sequence
diagram of the proposed system.
4.2.3 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise
activities and actions with support for choice, iteration and concurrency.
Figure 4.4 presents the activity diagram of the proposed system.
Chapter 5
PROPOSED SYSTEM
The process flow diagram represents the overall step-by-step process followed in
the project, from the collection of the dataset to the display of the end result
to the user.
5.2 Methodology
The architecture is mainly divided into 4 steps:
1. Data Preprocessing.
3. Hyperparameter Tuning.
5.2.1 Data Preprocessing:
Data preprocessing is an essential step in preparing the data for analysis and
model training. In the context of Pairwise Sequence Alignment using Multilayer
Perceptron (MLP) and Particle Swarm Optimization (PSO), data preprocessing
involves the following steps:
Data Collection and Organization:
Gather a dataset of DNA sequences that need to be aligned. Ensure that the
dataset is properly organized, with each sequence represented as a separate data
point.
Data Encoding:
Convert the DNA sequences into a numerical representation that can be
understood by the MLP model. Common encoding techniques include one-hot
encoding, where each nucleotide is represented by a binary vector, or numerical
encoding, where each nucleotide is assigned a unique numeric value.
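The two encodings mentioned above can be sketched as follows. The nucleotide
order A, C, G, T and the function names are assumptions made for illustration;
any fixed order works as long as it is used consistently.

```python
# Assumed nucleotide order; this choice is illustrative, not from the report.
NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence):
    """Map each nucleotide to a 4-element binary vector."""
    vectors = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
        vectors.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return vectors

def numeric_encode(sequence):
    """Map each nucleotide to a unique integer (A=0, C=1, G=2, T=3)."""
    return [NUCLEOTIDES.index(base) for base in sequence.upper()]

one_hot = one_hot_encode("ACGT")
numeric = numeric_encode("GATTACA")
```

One-hot encoding avoids implying an order among nucleotides, at the cost of
four values per position instead of one.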
Data Split:
Split the dataset into training and test datasets. The training dataset is used
to train the MLP model, while the test dataset is used to evaluate its
performance.
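A minimal sketch of such a split is shown below. The 80/20 ratio, the fixed
seed, the sample sequence pairs, and the helper name train_test_split are
assumptions for illustration; the report does not state the ratio actually used.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split off the tail as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

# Hypothetical sequence pairs standing in for the real dataset.
pairs = [("ACGT", "ACGA"), ("GGCA", "GGTA"), ("TTAC", "TAAC"),
         ("CGCG", "CGGG"), ("ATAT", "ATTT")]
train, test = train_test_split(pairs, test_fraction=0.2, seed=42)
```

Shuffling before splitting keeps the test set from being biased by the order in
which the sequences were collected.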
Data Normalization:
Normalize the numerical features of the dataset to a common scale. This ensures
that all features contribute equally during model training.
By performing these data preprocessing steps, the DNA sequences are transformed
into a suitable format for training the MLP model using PSO for pairwise
sequence alignment.
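The normalization step could be sketched as below. Min-max scaling to the range
[0, 1] is an assumption, since the report does not name the normalization
method; the function name and the sample rows are likewise illustrative.

```python
def min_max_normalize(rows):
    """Rescale every feature (column) of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized

# Numerically encoded sequences (assumed A=0, C=1, G=2, T=3), one row each.
encoded = [[0, 1, 2, 3], [3, 3, 0, 1], [1, 2, 2, 3]]
scaled = min_max_normalize(encoded)
```

One-hot encoded inputs already lie in [0, 1], so this step mainly matters when
numerical encoding is used.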
the alignment of the sequences. It consists of an input layer, one or more hidden
layers with activation functions, and an output layer.
Particle Swarm Optimization (PSO): Particle Swarm Optimization is a
metaheuristic optimization algorithm inspired by the social behavior of bird
flocking or fish schooling. It involves a population of particles, where each
particle represents a potential solution to the optimization problem. The
particles move through the search space, adjusting their positions and
velocities based on their own experience and the information shared with
neighboring particles.
In the project, PSO is used to optimize the weights and biases of the MLP.
The particles in the PSO algorithm represent different sets of weights and biases
for the MLP. The fitness function is defined based on the performance of the MLP
in aligning the DNA sequences. The particles adjust their positions and velocities
based on their own best position and the best position found by any particle in
the swarm. This iterative process helps to search for the optimal set of weights
and biases that minimize the alignment error.
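The update rule described above can be sketched with a small global-best PSO in
plain Python. To keep the example short, the fitness function is the mean
squared error of a single linear neuron on toy data, standing in for the MLP's
alignment error. The inertia and acceleration values, the swarm size, the toy
data, and all names are illustrative assumptions, not the project's settings.

```python
import random

def pso_minimize(fitness, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over `dim`-dimensional vectors with global-best PSO."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_positions = [p[:] for p in positions]  # each particle's personal best
    best_scores = [fitness(p) for p in positions]
    g = min(range(n_particles), key=best_scores.__getitem__)
    global_best, global_score = best_positions[g][:], best_scores[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c1 * r1 * (best_positions[i][d] - positions[i][d])
                    + c2 * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            score = fitness(positions[i])
            if score < best_scores[i]:
                best_positions[i], best_scores[i] = positions[i][:], score
                if score < global_score:
                    global_best, global_score = positions[i][:], score
    return global_best, global_score

def mse(weights):
    """Toy fitness: error of y = w0 * x + w1 on data from y = 0.5x - 0.25."""
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [-0.25, 0.25, 0.75, 1.25]
    return sum((weights[0] * x + weights[1] - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

best, score = pso_minimize(mse, dim=2)
```

For a full MLP, each particle would hold all weights and biases flattened into
one vector, and the fitness function would run the network on the training data.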
Model Iteration:
• Make incremental changes to the baseline model to improve alignment
accuracy.
• Train the modified models on the training dataset and evaluate their
performance on the validation dataset.
Performance Evaluation:
• Compare the performance of the modified models with the baseline model.
• Consider the models’ ability to correctly align DNA sequences and their
overall performance.
Chapter 6
IMPLEMENTATION
6.1.3 Loss-Epoch Curve
Figure 6.3 shows the loss-epoch curve for MLP-PSO, with the number of epochs on
the x-axis and the loss on the y-axis.
False Positive (FP):
R = Recall
TP = True Positive
FN = False Negative
FP = False Positive
Support: Support is a term used in a classification report to refer to the number of
instances in each class. It is used to identify the imbalanced classes in the dataset,
which may affect the performance of the model.
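The quantities in a classification report follow directly from these counts, as
the sketch below shows using the standard definitions of precision, recall, and
F1-score. The counts passed in and the function name are hypothetical and are
not results from this project.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, F1-score and support for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    support = tp + fn  # number of true instances of the class
    return {"precision": precision, "recall": recall,
            "f1": f1, "support": support}

# Hypothetical counts for one class.
report = classification_metrics(tp=40, fp=10, fn=10)
```

Comparing the support values across classes is a quick way to spot an
imbalanced dataset.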
Chapter 7
CONCLUSION AND FUTURE WORK
Integration of more advanced alignment algorithms: While MLP and PSO offer
effective techniques, consider exploring the integration of other alignment
algorithms, such as Smith-Waterman or Needleman-Wunsch. Combining multiple
algorithms can potentially improve alignment accuracy and handle complex
alignment scenarios.
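For reference, the Needleman-Wunsch global alignment score can be computed with
a short dynamic-programming routine such as the sketch below. The scoring values
(match +1, mismatch -1, gap -2) and the function name are assumptions chosen for
illustration; they are not taken from this project.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap             # gap in b
            left = score[i][j - 1] + gap           # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]

identical = needleman_wunsch_score("GATTACA", "GATTACA")
with_gap = needleman_wunsch_score("ACGT", "AGT")
```

Scores of this kind could serve as ground-truth labels or as a baseline against
which the MLP-PSO predictions are compared.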
Exploration of alternative optimization algorithms: While PSO is a powerful
optimization algorithm, there are other metaheuristic algorithms available, such as
Genetic Algorithms or Ant Colony Optimization. Investigate the applicability of
these algorithms to optimize the MLP model’s weights and biases for alignment.
REFERENCES
[4] Y.-J. Song and D.-H. Cho, “Local Alignment of DNA Sequence Based on Deep
Reinforcement Learning,” in IEEE Open Journal of Engineering in Medicine and
Biology, vol. 2, pp. 170-178, 2021, doi: 10.1109/OJEMB.2021.3076156.
[9] https://study.com/cimages/multimages/16/iterativesdlc.png