Batch 17

This document presents a mini project focused on improving pairwise sequence alignment of DNA using machine learning, specifically a multilayer perceptron (MLP) optimized with particle swarm optimization (PSO). It highlights the limitations of traditional algorithms and proposes a novel approach to enhance computational efficiency and accuracy in analyzing large biological databases. The project includes an outline of research questions, methodologies, and existing technologies related to sequence alignment.

Uploaded by

CS-3-5E0 dinesh


VR SIDDHARTHA ENGINEERING COLLEGE, VIJAYAWADA
Department of Computer Science and Engineering

Sequence Alignment in Biological Sequences Using Machine Learning
20CS6554: B. Tech Mini Project – I (Final Review)
May 25th, 2023
Batch No: 17

Batch Members:
D. Dinesh (208W1A05E0)
P. J. S. Krishna (208W1A05H7)
S. Sushma (218W5A017)

Under the Guidance of:
Mr. S. Rajesh, Assistant Professor

Abstract
Sequence alignment is a way of arranging two (pairwise alignment) or more
(multiple sequence alignment) biological sequences (e.g., DNA, RNA, or protein
sequences) to identify regions of similarity. Similarities may be a
consequence of functional or evolutionary relationships between these sequences. This
paper focuses on the pairwise alignment of DNA sequences, which identifies
the similarities and differences between two DNA sequences. Existing methods for
pairwise sequence alignment, such as the Needleman-Wunsch algorithm and the
Smith-Waterman algorithm, have limitations in terms of computational efficiency and
accuracy. Their running time grows with the product of the two sequence lengths, so
sequence alignment over huge databases cannot produce results within a reasonable
amount of time, power, or money. This paper proposes a novel approach for pairwise
sequence alignment in DNA using a multilayer perceptron (MLP) trained with
particle swarm optimization (PSO). The PSO algorithm is used to optimize the
parameters of the MLP for better alignment performance.
Keywords: Multilayer Perceptron, Sequence Alignment, Particle Swarm
Optimization.

Presentation Outline
1. Aim and Motivation
2. Research Questions
3. Title Justification
4. Introduction
5. Basic Concepts
6. Study on Existing Technologies
7. Gap Analysis
8. Objectives
9. Scope
10. Dataset Description
11. Methodology
11.1. Proposed Model
11.2. Modules of the Proposed Model
11.3. Algorithms
12. SDLC Model
13. UML Diagrams
13.1. Use Case Diagram
13.2. Activity Diagram
13.3. Sequence Diagram
14. Functional and Non-Functional Requirements
15. Implementation and Analysis of Results
15.1. Output Screenshots
16. Conclusion and Future Work
17. Timeline Chart
References
1. Aim and Motivation
Aim: Sequence alignment over large databases cannot yield results within a
reasonable time, power budget, and cost. The aim of the project is to develop a
machine learning model that accelerates sequence alignment.
Motivation:
• Biological sequence databases are growing exponentially.
• Traditional sequence alignment algorithms are time-consuming and require
significant computational resources.
• Developing faster and more efficient algorithms for sequence alignment can
enable researchers to analyze large amounts of biological data quickly and
accurately.
• Faster and more efficient sequence alignment can accelerate scientific
discovery and facilitate the development of new treatments for diseases.

2. Research Questions

1. Can machine learning algorithms improve the speed and accuracy of sequence alignment
over large databases compared to traditional alignment methods?

2. How can machine learning be used to optimize sequence alignment parameters for
different types of biological data (e.g., DNA vs. protein sequences)?

3. Can machine learning models be trained on a subset of a large sequence database to
improve alignment accuracy on the entire database?

4. How do different types of machine learning models (e.g., neural networks, decision
trees, support vector machines) compare in terms of their performance on sequence
alignment tasks?

3. Title Justification
Pairwise sequence alignment is a fundamental task in bioinformatics that
involves comparing two DNA sequences. Dynamic programming algorithms
are computationally expensive for long sequences, leading to the use of
artificial neural networks (ANNs) as an alternative.
A multilayer perceptron (MLP) can be trained using Particle Swarm
Optimization (PSO) to improve its accuracy in sequence alignment. This
approach can handle longer sequences, achieve higher accuracy, and be used
for various types of sequence alignment tasks. However, it requires a large
amount of training data, and the training itself can be computationally intensive.
Overall, the use of an MLP trained with PSO for sequence alignment in DNA is
a promising approach that has the potential to improve the accuracy and
efficiency of sequence alignment in bioinformatics.

4. Introduction
Pairwise sequence alignment in bioinformatics compares two DNA sequences to
identify regions of similarity. Dynamic programming algorithms can be
computationally expensive for long sequences, so artificial neural networks such as
the multilayer perceptron (MLP) are explored as an alternative. An MLP can be trained
with Particle Swarm Optimization (PSO) to improve accuracy in handling longer
sequences. In this project, we explore using an MLP trained with PSO for pairwise
sequence alignment in DNA and compare its performance against other methods.

5. Basic Concepts
Sequence Alignment:
Alignment is the process of arranging sequences in a way that maximizes
their similarity by matching corresponding positions. It involves inserting
gaps in the sequences to account for insertions, deletions, or substitutions
that may have occurred during evolution.

Pairwise Sequence Alignment:
Pairwise sequence alignment in DNA involves comparing and aligning two
DNA sequences to identify regions of similarity or divergence. It is
performed using algorithms such as the Needleman-Wunsch or Smith-
Waterman algorithms, which consider nucleotide matches, mismatches, and
gaps. Pairwise alignment helps reveal evolutionary relationships, detect
genetic variations, and analyze functional elements within DNA sequences.
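
As a point of reference for the dynamic-programming baseline mentioned above, the following is a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name `needleman_wunsch` are illustrative assumptions, not values taken from this project.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not the project's settings.
    """
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # match or mismatch
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table has (n + 1) × (m + 1) cells, which is why the cost grows with the product of the two sequence lengths.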
Match, Mismatch, and Gap:
In sequence alignment, a match occurs when the nucleotides or amino acids
at corresponding positions in the sequences are the same. A mismatch occurs
when they are different. Gaps are introduced when a position in one
sequence aligns with a gap in the other sequence, indicating insertions or
deletions.
Alignment Score:
The alignment score represents the overall quality of an alignment and is
based on the scoring system used. Higher scores indicate better alignments,
reflecting higher similarity between the sequences.
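
The match, mismatch, and gap definitions above can be turned into a score with a few lines of code. This is a sketch under an assumed scoring system (match = 1, mismatch = -1, gap = -2); the function name `alignment_score` is hypothetical and the project's actual scoring values are not stated in the slides.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned sequences column by column.

    '-' marks a gap. The default scoring values are assumptions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score += gap        # insertion or deletion
        elif x == y:
            score += match      # same nucleotide in both sequences
        else:
            score += mismatch   # different nucleotides
    return score


print(alignment_score("ACGT", "A-GT"))  # three matches and one gap
print(alignment_score("ACGT", "ACCT"))  # three matches and one mismatch
```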
Multilayer Perceptron (MLP):
MLP is an artificial neural network architecture commonly used in machine
learning. It consists of multiple layers of interconnected nodes (neurons) that
process input data and generate output predictions. MLP is known for its
ability to learn complex patterns and relationships in data through a process
called training. In this project, MLP is utilized to capture patterns and
relationships between nucleotides in DNA sequences, enabling it to predict
alignments for new sequences.
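
To make the layer structure concrete, here is a minimal forward pass for an MLP with one hidden layer. The layer sizes, the tanh and sigmoid activations, and the name `mlp_forward` are assumptions chosen for illustration; the slides do not specify the project's architecture.

```python
import math
import random


def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass: input -> tanh hidden layer -> single sigmoid output.

    w1 holds one weight vector per hidden neuron; w2 holds one weight
    per hidden neuron for the output unit.
    """
    hidden = [
        math.tanh(sum(xi * wi for xi, wi in zip(x, weights)) + bias)
        for weights, bias in zip(w1, b1)
    ]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)
n_in, n_hidden = 8, 4  # assumed sizes: two one-hot encoded bases, four hidden units
w1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [random.gauss(0, 1) for _ in range(n_hidden)]
b2 = 0.0

x = [1, 0, 0, 0, 0, 1, 0, 0]  # one-hot "A" followed by one-hot "C"
print(mlp_forward(x, w1, b1, w2, b2))  # a value strictly between 0 and 1
```

With random weights the output carries no meaning; training is what adjusts `w1`, `b1`, `w2`, and `b2` so that the output reflects the target.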
Particle Swarm Optimization (PSO):
PSO is a population-based optimization algorithm inspired by the collective
behavior of bird flocks or fish schools. It involves a group of particles
moving through a search space, with each particle representing a potential
solution. The particles communicate with each other to update their positions
based on their individual and group experiences. PSO iteratively adjusts
alignment parameters and scoring functions based on fitness or objective
functions, aiming to optimize the alignment process. In this project, PSO is
used to improve the accuracy and efficiency of pairwise sequence alignment
by optimizing the alignment parameters.
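
The position update described above is usually written in the standard inertia-weight form shown below. The slides do not give the exact variant used in the project, so the symbols here are assumptions: $x_i^t$ and $v_i^t$ are the position and velocity of particle $i$ at iteration $t$, $p_i$ is that particle's best position so far, $g$ is the best position found by the whole swarm, $w$ is the inertia weight, $c_1$ and $c_2$ are the cognitive and social weights, and $r_1, r_2$ are random numbers drawn uniformly from $[0, 1]$.

```latex
v_i^{t+1} = w\, v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right)

x_i^{t+1} = x_i^{t} + v_i^{t+1}
```

The first term keeps the particle moving in its current direction, the second pulls it toward its own best position, and the third pulls it toward the swarm's best position.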
LITERATURE SURVEY
6. Study on Existing Technologies
Title: Sequence Alignment Using Machine Learning-Based Needleman–Wunsch
Algorithm
Journal Details: IEEE Access, date of publication July 26, 2021
Dataset: BAliBASE, HOMSTRAD
Description:
In this paper, the authors propose a new approach that combines the traditional
Needleman-Wunsch algorithm with a neural network to improve the accuracy of
sequence alignment. They first train a neural network on a large dataset of aligned
sequences to learn patterns in the data. Then, they use the trained network to guide
the alignment process by providing a score for each possible alignment position.
Advantages:
1. Improved accuracy
2. Fast computation
Disadvantages:
1. Limited explanation of the neural network architecture
2. Lack of comparison with other machine learning-based alignment methods

Title: Accelerating DNA pairwise sequence alignment using FPGA and a customized
convolutional neural network
Journal Details: IEEE Access, date of publication 30 March 2021
Description: The paper proposes a novel approach to accelerating pairwise sequence
alignment of DNA sequences using field programmable gate arrays (FPGA) and a
customized convolutional neural network (CNN). The authors demonstrate that their
approach can achieve significant speedups compared to traditional implementations
of sequence alignment algorithms, while maintaining high accuracy. The authors also
discuss the potential for their approach to be integrated into existing
bioinformatics pipelines and applications.
Advantages:
High speed: The proposed approach can achieve significant speedups compared to
traditional implementations of sequence alignment algorithms.
Disadvantages:
Hardware dependency: The approach is hardware-dependent and requires access to
FPGA devices, which can limit its accessibility and portability.
Title: Performance-Based Analogising of Needleman Wunsch Algorithm to Align
DNA Sequences Using GPU and FPGA
Journal Details: IEEE Access, date of publication 05 February 2021
Dataset: BAliBASE, SABmark, or OXBench
Description:
The paper proposes an optimized implementation of the Needleman-Wunsch algorithm
for aligning DNA sequences using both GPU and FPGA architectures. The authors
compare the performance of their optimized implementation to existing CPU-based
implementations, demonstrating significant speedups on both GPU and FPGA
platforms. The authors also discuss the potential for further optimization and
improvement in future work.
Advantages:
1. The proposed approach can achieve significant speedups compared to traditional
implementations of sequence alignment algorithms.
2. Fast computation
Disadvantages:
1. Hardware dependency, limited flexibility
Title: Local Alignment of DNA Sequence Based on Deep Reinforcement Learning
Journal Details: IEEE Open Journal of Engineering in Medicine and Biology, date
of publication 27 April 2021
Dataset: CRISPR-Cas9, ChIP-seq
Description: The paper proposes a deep reinforcement learning (DRL) based
approach for local alignment of DNA sequences. Local alignment aims to identify the
subsequences that have the highest degree of similarity between two input
sequences. The proposed approach employs a convolutional neural network (CNN)
to learn the sequence features and a DRL model to make optimal alignment
decisions. The DRL model learns to maximize the reward function by aligning the two
sequences to obtain the highest similarity score. The approach is evaluated on
standard benchmark datasets and compared with existing state-of-the-art methods.
Advantages: The proposed DRL-based approach achieves higher accuracy and faster
processing times compared to existing methods.
Disadvantages: The approach requires a large amount of training data to learn the
optimal alignment strategy.
Title: Simple and Efficient Pattern Matching Algorithms for Biological Sequences
Journal Details: IEEE Access, date of publication 23 January 2020
Description: The paper proposes two novel algorithms for pattern matching in
biological sequences: the Boyer-Moore-Horspool (BMH) algorithm and the Reverse
Complement (RC) algorithm. The authors compare the performance of these
algorithms to existing algorithms such as the Basic Local Alignment Search Tool
(BLAST) and demonstrate that the BMH and RC algorithms can achieve significant
speedups while maintaining high accuracy.
Advantages:
1. The paper presents two novel algorithms for pattern matching in biological
sequences that are simple and efficient.
2. The proposed algorithms have low memory requirements and are well-suited for
use in resource-constrained settings.
Disadvantages: The paper does not compare the proposed algorithms to some of
the more recent, state-of-the-art pattern matching algorithms, which could limit
their utility in cutting-edge research.
Title: GenieHD: Efficient DNA Pattern Matching Accelerator Using Hyper-dimensional
Computing
Journal: IEEE Access, 2020
Methodology: This study proposes GenieHD, which effectively parallelizes the DNA
pattern matching problem. It makes use of hyperdimensional (HD) computing, which is
inspired by the brain and imitates pattern-based calculations in human memory. HD
computing is used to convert the naturally sequential processes involved in DNA
pattern matching into highly parallelizable compute workloads. The complete genome
sequence and the target DNA pattern are first encoded to high-dimensional vectors.
Once encoded, a simple operation on the high-dimensional vectors can reveal whether
the desired pattern is present anywhere in the sequence. The study also proposes an
accelerator architecture that drastically lowers the number of memory accesses
while effectively parallelizing HD-based DNA pattern matching. To fulfill the needs
of the target system, the architecture can be implemented on a variety of parallel
computing platforms.
Advantage: 1. The experimental results show that GenieHD significantly accelerates the
pattern matching procedure, e.g., 44.4× speedup with 54.1× energy-efficiency
improvements when comparing to the existing design on the same FP.
Title: Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront
Algorithm
Journal: IEEE Access, date of publication 13 June 2022
Methodology: The paper explores the utilization of the wavefront algorithm on a
GPU (Graphics Processing Unit) to accelerate the process of sequence alignment,
specifically focusing on the edit-distance sequence alignment. The wavefront
algorithm is known for leveraging parallel computing capabilities to enhance the
speed of alignment algorithms. By employing this algorithm on a GPU, the authors
aim to achieve significant performance improvements. While the paper provides a
conceptual overview of the technique, it highlights that implementing the
wavefront algorithm on a GPU requires programming expertise and familiarity with
GPU computing frameworks such as CUDA or OpenCL.
Advantage: This paper includes the problems of pattern mining and its related
applications.
Disadvantage: All the techniques related to pattern mining were not included in
this paper.
Title: SLPal: Accelerating Long Sequence Alignment on Many-Core and Multi-Core
Architectures
Journal: IEEE Access, 2020
Methodology: SLPal is designed to leverage the computational power of these
architectures to achieve faster and more scalable alignment. The paper presents
SLPal as an innovative approach for accelerating long sequence alignment on
many-core and multi-core architectures. By combining the seed-and-extend
paradigm, divide-and-conquer strategy, SIMD vectorization, and indexing
techniques, SLPal offers a highly efficient and scalable solution for handling the
alignment of long sequences. The work showcases the potential of SLPal in
addressing the computational challenges associated with the analysis of increasingly
longer sequences in genomics and bioinformatics research.
Advantage:
• Accelerated Alignment: SLPal provides significant acceleration in long sequence
alignment compared to traditional methods.
• Scalability: SLPal incorporates a divide-and-conquer strategy, allowing it to
scale efficiently with longer sequences.
Disadvantage: Hardware Dependency: SLPal is specifically designed for many-core
and multi-core architectures. While this allows for efficient parallel processing,
it also restricts its applicability to systems that possess such hardware capabilities.
Table: Summary of existing implementations

| S. No. | Article Title | Journal Details | Algorithms/Models | Dataset | Advantages | Disadvantages |
|---|---|---|---|---|---|---|
| 1 | Sequence Alignment Using Machine Learning-Based Needleman–Wunsch Algorithm | IEEE Access, date of publication July 26, 2021 | Needleman Wunsch | BAliBASE, HOMSTRAD | 1. Higher accuracy 2. Highly scalable | 1. Time-consuming |
| 2 | Local Alignment of DNA Sequence Based on Deep Reinforcement Learning | IEEE Open Journal of Engineering in Medicine and Biology, date of publication 27 April 2021 | Deep Q-Network (DQN) algorithm | Synthetic, Non Synthetic | 1. New and promising technique 2. Handles variation of sequences | 1. Significant amount of training data and computation power |
| 3 | Accelerating DNA pairwise sequence alignment using FPGA and a customized convolutional neural network | IEEE Access, date of publication 30 March 2021 | Field programmable gate arrays (FPGA) and a customized convolutional neural network | - | Significant speedups compared to traditional implementations of sequence alignment algorithms | The approach is hardware-dependent and requires access to FPGA devices, which can limit its accessibility and portability |
| 4 | Performance-Based Analogising of Needleman Wunsch Algorithm to Align DNA Sequences Using GPU and FPGA | IEEE Access, date of publication 05 February 2021 | GPU, FPGA | BAliBASE, SABmark, or OXBench | Significant speedups compared to traditional implementations of sequence alignment algorithms | Hardware dependency, limited flexibility |
| 5 | Simple and Efficient Pattern Matching Algorithms for Biological Sequences | IEEE Access, date of publication 23 January 2020 | Boyer-Moore-Horspool (BMH) algorithm and the Reverse Complement (RC) algorithm | BAliBASE | Two novel algorithms for pattern matching in biological sequences that are simple and efficient | The paper does not compare the proposed algorithms to some of the more recent, state-of-the-art pattern matching algorithms, which could limit their utility in cutting-edge research |
| 6 | GenieHD: Efficient DNA Pattern Matching Accelerator Using Hyper-dimensional Computing | IEEE Access, 2020 | DeepLocaAlign | GenieHD | The experimental results show that GenieHD significantly accelerates the pattern matching procedure | DeepLocaAlign requires a large amount of training data to build an accurate model, which can be time-consuming and resource-intensive. The algorithm may not be suitable for highly divergent sequence |
| 7 | Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm | IEEE Access, date of publication 13 June 2022 | The wavefront algorithm on a GPU | - | This paper includes the problems of pattern mining and its related applications | All the techniques related to pattern mining were not included in this paper |
| 8 | SLPal: Accelerating Long Sequence Alignment on Many-Core and Multi-Core Architectures | IEEE Access, date of publication 23 January 2020 | Seed-and-extend paradigm, SIMD vectorization | - | Accelerated Alignment: SLPal provides significant acceleration in long sequence alignment compared to traditional methods | Hardware Dependency: SLPal is specifically designed for many-core and multi-core architectures. While this allows for efficient parallel processing, it also restricts its applicability to systems that possess such hardware capabilities |

7. Gap Analysis
1. In previous works, sequence alignment over large databases cannot yield
results within a reasonable time, power, and cost.

2. With the Needleman-Wunsch algorithm, the computational time for sequence
alignment is high.

3. In the present work, we use a Multilayer Perceptron and the Particle Swarm
Optimization algorithm to improve the performance of sequence alignment.

8. Objectives
The objectives of this project are listed as follows:

➢ To study existing technologies and identify their drawbacks and limitations.

➢ To collect DNA sequence alignment datasets.

➢ To develop a machine learning model that increases the accuracy of sequence
alignment.

➢ To train the model using a large dataset of DNA sequences and optimize it to
enhance its performance.

➢ To evaluate the performance of the model on a test dataset and compare it with
existing methods.

9. Scope

The scope of the project is limited to Pairwise Sequence Alignment.

10. Dataset Description
Name of the Dataset: DNA SEQUENCE ALIGNMENT DATASETS BASED ON NW ALGORITHM
Link: https://www.kaggle.com/datasets/amrezzeldinrashed/dna-sequence-alignmnet-dataset

Description: The DNA sequence alignment dataset on Kaggle provides over 8
million DNA sequence reads and a reference genome sequence for researchers in
bioinformatics and genetics. By comparing the DNA sequence reads to the reference
genome, researchers can identify genetic variations, mutations, and evolutionary
relationships. The dataset also includes metadata such as sample ID and sequencing
platform used, which can provide additional information for data analysis and
interpretation. This dataset is a valuable resource for researchers interested in
studying DNA sequence alignment, genetic diseases, population genetics, and
evolutionary biology.

Source: Kaggle
Classes: 254
Instances: 65536
Attributes: 16
11. METHODOLOGY
11.1. Proposed Model
Figure 1 represents the proposed system.

Figure 1: Proposed system

11.2. Modules of Proposed Model
Module 1: Data Preprocessing

Step 1: Prepare the dataset with DNA sequences and known alignments.

Step 2: Split the dataset into training and test datasets.

Step 3: Preprocess sequences into numerical representations.
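
The preprocessing steps above can be sketched as follows. One-hot encoding is a common choice of numerical representation, but the slides do not say which encoding the project uses, so it is an assumption here, as are the helper names `one_hot`, `encode_pair`, and `train_test_split` and the 80/20 split.

```python
import random

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat list of 0.0/1.0 values, four per base."""
    encoded = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        encoded.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return encoded


def encode_pair(seq_a, seq_b):
    """Concatenate the encodings of two sequences into one input vector."""
    return one_hot(seq_a) + one_hot(seq_b)


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


print(one_hot("AC"))
train, test = train_test_split(range(10))
print(len(train), len(test))
```

A fixed-size MLP input also requires sequences of equal length, so real data would need padding or windowing on top of this sketch.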

Module 2: MLP Model Training with PSO

Step 1: Initialize the MLP model with hidden layers, activation functions, and
random weights.

Step 2: Implement the PSO algorithm to optimize model weights and biases.

Step 3: Train the MLP model using PSO and update weights based on the best
positions found.

Module 3: Pairwise Sequence Alignment

Step 1: Use the trained MLP model to align DNA sequences.

Step 2: Calculate similarity scores based on model predictions.

Step 3: Align sequences based on similarity scores.

Module 4: Performance Evaluation and Optimization

Step 1: Evaluate the performance of the methodology using metrics such as
accuracy, sensitivity, specificity, and F1-score.

Step 2: Assess performance on a test dataset of known alignments.

Step 3: Perform optimization and fine-tuning of parameters for improved
performance.
11.3 Algorithm
1. Initialize the MLP model with random weights and biases.

2. Initialize the PSO parameters, including the number of particles,
maximum iterations, inertia weight, cognitive weight, and social weight.

3. Generate a population of particles, where each particle represents a
potential solution (set of weights and biases) for the MLP model.

4. Evaluate the fitness of each particle by performing pairwise sequence
alignment using the current MLP model.

5. Identify the particle with the best fitness (alignment score) as the global
best solution.

6. Repeat the following steps (7 to 11) until the termination condition is met
(maximum iterations or convergence):

7. Update the velocity and position of each particle using the PSO equations.

8. Clip the velocity and position values within predefined bounds.

9. Evaluate the fitness of each particle by performing pairwise sequence
alignment using the updated MLP model.

10. Update the local best solution for each particle if a better alignment
score is achieved.

11. Update the global best solution if a particle achieves a better alignment
score than the current global best.

12. Extract the best weights and biases from the global best solution.

13. Train the MLP model using the best weights and biases obtained from
PSO.

14. Perform pairwise sequence alignment using the trained MLP model on
the test sequences.

15. Return the alignment results.
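
The loop in steps 1 to 12 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the project's implementation: the fitness here is a mean squared error on a toy task (predict whether two single bases match) instead of an alignment score, so it is minimized rather than maximized. The 8-4-1 network size, the PSO constants, the bounds, and all function names are likewise assumed.

```python
import math
import random

N_IN, N_HID = 8, 4
DIM = N_IN * N_HID + N_HID + N_HID + 1  # every weight and bias, flattened


def predict(params, x):
    """Unpack a flat parameter vector and run an 8-4-1 MLP on input x."""
    w1 = [params[h * N_IN:(h + 1) * N_IN] for h in range(N_HID)]
    off = N_IN * N_HID
    b1 = params[off:off + N_HID]
    w2 = params[off + N_HID:off + 2 * N_HID]
    b2 = params[-1]
    hidden = [math.tanh(sum(a * b for a, b in zip(x, w)) + b0)
              for w, b0 in zip(w1, b1)]
    z = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-z))


def loss(params, data):
    """Toy fitness (an assumption): mean squared error, lower is better."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def pso_train(data, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5,
              pos_bound=5.0, vel_bound=1.0, seed=0):
    rng = random.Random(seed)
    # Steps 1-3: each particle is one candidate set of weights and biases.
    pos = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n_particles)]
    vel = [[0.0] * DIM for _ in range(n_particles)]
    # Steps 4-5: initial fitness, local bests, and global best.
    pbest = [p[:] for p in pos]
    pbest_loss = [loss(p, data) for p in pos]
    g = min(range(n_particles), key=pbest_loss.__getitem__)
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]
    history = [gbest_loss]
    for _ in range(iters):  # step 6 (fixed iteration budget)
        for i in range(n_particles):
            for d in range(DIM):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))            # step 7
                vel[i][d] = max(-vel_bound, min(vel_bound, v))       # step 8
                pos[i][d] = max(-pos_bound,
                                min(pos_bound, pos[i][d] + vel[i][d]))
            cur = loss(pos[i], data)                                 # step 9
            if cur < pbest_loss[i]:                                  # step 10
                pbest[i], pbest_loss[i] = pos[i][:], cur
                if cur < gbest_loss:                                 # step 11
                    gbest, gbest_loss = pos[i][:], cur
        history.append(gbest_loss)
    return gbest, history  # step 12: best weights and biases found


BASES = "ACGT"


def one_hot(base):
    return [1.0 if base == n else 0.0 for n in BASES]


# Toy data: label 1.0 when the two bases are identical, else 0.0.
data = [(one_hot(a) + one_hot(b), 1.0 if a == b else 0.0)
        for a in BASES for b in BASES]
best, history = pso_train(data)
print(f"best loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because the global best is replaced only when a particle improves on it, the recorded best loss never increases from one iteration to the next.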

12. SDLC Model
1. Initially, DNA sequence data is collected for training and testing the
alignment model.
2. After data collection, a thorough analysis is conducted to ensure the
collected data meets the required criteria.
3. Based on the analysis, a detailed plan is formulated for data preprocessing
and alignment model implementation.
4. The project design is developed, considering the architecture and key
components of the alignment model.
5. Following the design, the alignment model is implemented according to
the planned approach.
6. Testing is performed using dedicated testing data to assess the model’s
performance and effectiveness.
7. Model evaluation is conducted during the testing phase to measure the
alignment model’s accuracy and reliability.
8. Upon successful testing, the model is deployed for practical usage.
9. In case additional requirements arise from users, the iterative cycle
repeats, allowing for modifications and enhancements to be made to the
alignment model.
SDLC Model: Iterative Model

Figure 2: Iterative Model

13. UML DIAGRAMS
13.1 Use Case Diagram
Figure 3 represents the use case diagram.

Figure 3: Use Case Diagram

13.2. Activity Diagram
Figure 4 represents the activity diagram.

Figure 4: Activity Diagram

13.3. Sequence Diagram
Figure 5 represents the sequence diagram.

Figure 5: Sequence Diagram

14. Functional and Non-Functional Requirements

Functional Requirements:

• Input DNA Sequences: The system should allow users to input DNA
sequences for alignment, either by manual entry or by importing from external
files.

• Pairwise Sequence Alignment: The system should perform pairwise
sequence alignment of the input DNA sequences using the Multilayer
Perceptron (MLP) and Particle Swarm Optimization (PSO) algorithm.

• Alignment Results: The system should provide the aligned sequences as
the output, displaying the aligned regions, gaps, and mismatches.

Non-Functional Requirements:
• Performance: The system should be able to align DNA sequences
efficiently, providing results within a reasonable time frame, even for large-
scale datasets.
• Accuracy: The alignment results should be accurate, reflecting the true
homology and evolutionary relationships between the DNA sequences.
• Reliability: The system should be reliable, producing consistent and
accurate alignment results across multiple runs.
• Scalability: The system should be able to handle a large number of DNA
sequences for alignment, accommodating scalability as the dataset size
increases.
• Usability: The system should have a user-friendly interface, making it easy
for users to input sequences, initiate alignment, and interpret the results
without requiring specialized knowledge.
15. IMPLEMENTATION AND ANALYSIS OF RESULTS
15.1 OUTPUT SCREENSHOTS
The output obtained while training the Multilayer Perceptron using Particle Swarm
Optimization:

Figure: Loss-Epoch curve for MLP-PSO during training
Figure: Accuracy-Epoch curve for MLP-PSO during training

Confusion Matrix

Figure: Confusion matrix for MLP-PSO

A confusion matrix is a useful tool for evaluating the performance of a
model in Pairwise Sequence Alignment using Multilayer Perceptron (MLP)
and Particle Swarm Optimization (PSO). It provides insights into how the
model classifies the DNA sequences and the number of correct and
incorrect classifications. The confusion matrix consists of four cells that
represent different outcomes:
True Positive (TP):
• In the context of pairwise sequence alignment, TP refers to the number of
DNA sequences correctly classified as aligned by the model.
False Positive (FP):
• FP represents the number of DNA sequences incorrectly predicted as aligned
by the model when they are actually unaligned.
False Negative (FN):
• FN indicates the number of DNA sequences that are actually aligned but are
incorrectly predicted as unaligned by the model.
True Negative (TN):
• TN represents the number of DNA sequences correctly predicted as unaligned
by the model.
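
The four cells can be counted directly from true and predicted labels. In this sketch, label 1 stands for "aligned" and 0 for "unaligned"; that encoding, the function name `confusion_counts`, and the example labels are assumptions made for illustration, not project results.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 = aligned, 0 = unaligned."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # aligned, predicted aligned
        elif truth == 0 and pred == 1:
            fp += 1   # unaligned, predicted aligned
        elif truth == 1 and pred == 0:
            fn += 1   # aligned, predicted unaligned
        else:
            tn += 1   # unaligned, predicted unaligned
    return tp, fp, fn, tn


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```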
Classification Report
A classification report is a way to evaluate the performance of a classification
model. It provides various metrics such as precision, recall, F1-score, and support
for each class in the dataset. These metrics help assess how well the model is
performing for different classes.

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F1\text{-score} = \frac{2 \cdot P \cdot R}{P + R}$$

where P = Precision, R = Recall, TP = True Positive, FP = False Positive, and
FN = False Negative.
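
The three formulas translate directly into code. The counts in the example are made up to show the arithmetic and are not the project's results; the function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator would be zero (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: P = 6/8 = 0.75, R = 6/10 = 0.6, F1 = 0.9/1.35, about 0.667.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))
```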
16. Conclusion and Future Work
The MLP model trained on DNA sequence pairs demonstrates its ability to
effectively learn and predict alignment patterns. The integration of PSO
further enhances the model's performance by optimizing its weights and
biases, leading to improved alignment accuracy. Evaluation of the model on
a separate dataset provides valuable insights into its effectiveness and
highlights areas for potential improvement.

In future work, it is recommended to explore the integration of more
advanced alignment algorithms, such as Smith-Waterman or Needleman-Wunsch,
in conjunction with MLP and PSO. This integration could further
enhance alignment accuracy and enable the handling of complex alignment
scenarios. Additionally, investigating alternative optimization algorithms,
such as Genetic Algorithms or Ant Colony Optimization, can provide
valuable insights into optimizing the MLP model's weights and biases for
alignment.
17. Timeline Chart

References
[1] A. E. E. -D. Rashed, H. M. Amer, M. El-Seddek and H. E. -D. Moustafa, ”Sequence
Alignment Using Machine Learning-Based Needleman–Wunsch Algorithm,”
in IEEE Access, vol. 9, pp. 109522-109535, 2021, doi: 10.1109/ACCESS.
2021.3100408.
[2] C. Kyal, R. Kumar and A. Zamal, ”Performance-Based Analogising of Needleman
Wunsch Algorithm to Align DNA Sequences Using GPU and FPGA,”
2020 IEEE 17th India Council International Conference (INDICON), New
Delhi, India, 2020, pp. 1-5, doi: 10.1109/INDICON49873.2020.9342078.
[3] A. E. E. -D. Rashed, H. M. Amer, M. El-Seddek and H. E. -D. Moustafa, Accelerating
DNA pairwise sequence alignment using FPGA and a customized
convolutional neural network”, in ScienceDirect,2021.
[4] Y. -J. Song and D. -H. Cho, ”Local Alignment of DNA Sequence Based
on Deep Reinforcement Learning,” in IEEE Open Journal of Engineering in
Medicine and Biology, vol. 2, pp. 170-178, 2021, doi: 10.1109/OJEMB.2021.3076156.
[5] P. Neamatollahi, M. Hadi and M. Naghibzadeh, ”Simple and Efficient Pattern
Matching Algorithms for Biological Sequences,” in IEEE Access, vol. 8,
pp. 23838-23846, 2020, doi: 10.1109/ACCESS.2020.2969038.

49
[6] A. Mishra, B. K. Tripathi and S. Singh Soam, "A Genetic Algorithm based
Approach for the Optimization of Multiple Sequence Alignment," 2020 International
Conference on Computational Performance Evaluation (ComPE),
Shillong, India, 2020, pp. 415-418, doi: 10.1109/ComPE49325.2020.9200060.
[7] Q. Aguado-Puig et al., "Accelerating Edit-Distance Sequence Alignment on
GPU Using the Wavefront Algorithm," in IEEE Access,
doi: 10.1109/ACCESS.2022.3182714.
[8] X. Xu et al., "SLPal: Accelerating Long Sequence Alignment on Many-
Core and Multi-Core Architectures," 2020 IEEE International Conference
on Bioinformatics and Biomedicine (BIBM), Seoul, Korea (South), 2020,
pp. 2242-2249, doi: 10.1109/BIBM49941.2020.9313429.
[9] https://study.com/cimages/multimages/16/iterativesdlc.png