PAIRWISE SEQUENCE ALIGNMENT IN
BIOLOGICAL SEQUENCES USING
MACHINE LEARNING
MINI Project report submitted in partial fulfillment of the
Requirements for the Award of the Degree of
BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE AND ENGINEERING


Submitted by

Dinesh Darsi 208W1A05E0


P.J.S.Krishna 208W1A05H7
S.Sushma Sri 218W5A0517

Under the Guidance of

Mr S.Rajesh,
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


V.R SIDDHARTHA ENGINEERING COLLEGE
Autonomous and Approved by AICTE, NAAC A+, NBA Accredited
Affiliated to Jawaharlal Nehru Technological University, Kakinada
Vijayawada 520007
2023
VELAGAPUDI RAMAKRISHNA SIDDHARTHA
ENGINEERING COLLEGE
(Autonomous, Accredited with ‘A+’ grade by NAAC)
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that the mini project report entitled “PAIRWISE
SEQUENCE ALIGNMENT IN BIOLOGICAL SEQUENCES USING
MACHINE LEARNING” being submitted by
D.DINESH 208W1A05E0
P.J.S.KRISHNA 208W1A05H7
S.SUSHMA SRI 218W5A0517
in partial fulfilment for the award of the Degree of Bachelor of Technology in
Computer Science and Engineering to the Jawaharlal Nehru Technological Uni-
versity, Kakinada, is a record of bonafide work carried out under my guidance and
supervision.

Mr S. Rajesh, M.Tech, Dr.D.Rajeswara Rao, M.Tech, Ph.D

Assistant Professor & Guide Professor & HOD,CSE

DECLARATION

We hereby declare that the MINI Project entitled “PAIRWISE SEQUENCE
ALIGNMENT IN BIOLOGICAL SEQUENCES USING MACHINE LEARNING”
submitted for the B.Tech Degree is our original work and the dissertation has not
formed the basis for the award of any degree, associateship, fellowship or any
other similar titles.

Place: Vijayawada DINESH DARSI (208W1A05E0)


Date: P.J.S.KRISHNA (208W1A05H7)
S.SUSHMA SRI(218W5A0517)

ACKNOWLEDGEMENT

We would like to thank Dr. A. V. Ratna Prasad, Principal of Velagapudi
Ramakrishna Siddhartha Engineering College, for the facilities provided during
the course of the mini project.

We have been bestowed with the privilege of thanking Dr. D. Rajeswara Rao,
Professor and Head of the Department, for his moral and material support.

We would like to express our deep gratitude to our guide, Mr S. Rajesh,
Assistant Professor, for his persistent encouragement, everlasting patience and
keen interest in discussion, and for the numerous suggestions he offered at
every phase of this project.

We owe our acknowledgements to an equally long list of people who helped us
in the mini project work.

Place: Vijayawada DINESH DARSI (208W1A05E0)


Date: P.J.S.KRISHNA (208W1A05H7)
S.SUSHMA SRI(218W5A0517)

Abstract

Sequence alignment is a way of arranging two (pairwise alignment) or more
(multiple sequence alignment) biological sequences (e.g., DNA, RNA, or protein
sequences) to identify regions of similarity. Similarities may be a consequence
of functional or evolutionary relationships between these sequences. This
paper mainly focuses on the pairwise alignment of DNA sequences, which identifies
the similarities and differences between two DNA sequences. Existing methods for
pairwise sequence alignment, such as the Needleman-Wunsch algorithm and the
Smith-Waterman algorithm, have limitations in terms of computational efficiency
and accuracy. Sequence alignment over huge databases cannot produce results
within a reasonable amount of time, power, or cost. This paper proposes a novel
approach for pairwise sequence alignment in DNA using a multilayer perceptron
(MLP) trained with particle swarm optimization (PSO). The PSO algorithm is used
to optimize the parameters of the MLP for better alignment performance.

Keywords: Multilayer Perceptron, Sequence Alignment, Particle Swarm
Optimization.

Table of Contents

1 INTRODUCTION
1.1 Basic Concepts
1.1.1 Sequence Alignment
1.1.2 Pairwise Sequence alignment
1.1.3 Match, Mismatch, and Gap
1.1.4 Alignment Score
1.1.5 Multilayer Perceptron (MLP):
1.1.6 Particle Swarm Optimization (PSO)
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope
1.6 Applications

2 LITERATURE REVIEW
2.1 Sequence Alignment Using Machine Learning-Based Needleman–Wunsch
Algorithm
2.2 Performance-Based Analogising of Needleman Wunsch Algorithm
to Align DNA Sequences Using GPU and FPGA
2.3 Accelerating DNA pairwise sequence alignment using FPGA and a
customized convolutional neural network
2.4 Local Alignment of DNA Sequence Based on Deep Reinforcement
learning
2.5 Simple and Efficient Pattern Matching Algorithms for Biological
Sequences
2.6 GenieHD: Efficient DNA Pattern Matching Accelerator Using
Hyper-dimensional Computing
2.7 Accelerating Edit-Distance Sequence Alignment on GPU Using the
Wavefront Algorithm
2.8 SLPal: Accelerating Long Sequence Alignment on Many-Core and
Multi-Core Architectures

3 SOFTWARE REQUIREMENT ANALYSIS
3.1 Hardware Requirements
3.2 Software Requirements
3.2.1 Python 3.9:
3.3 Functional Requirements
3.4 Non-Functional Requirements

4 SOFTWARE DESIGN
4.1 Software Development Lifecycle
4.2 UML Diagrams
4.2.1 Use-Case Diagram
4.2.2 Sequence Diagram
4.2.3 Activity Diagram

5 PROPOSED SYSTEM
5.1 Process Flow Diagram
5.2 Methodology
5.2.1 Data Preprocessing:
5.2.2 Model building and Training:
5.2.3 Hyperparameter Tuning:
5.2.4 Model Evolution and Comparison
5.3 Dataset Collection

6 IMPLEMENTATION
6.1 Output Screenshots
6.1.1 Epoch Output
6.1.2 Accuracy-Epoch Curve
6.1.3 Loss-Epoch Curve
6.1.4 Confusion Matrix
6.1.5 Classification Report
6.2 Results and Analysis

7 CONCLUSION AND FUTURE WORK

REFERENCES

List of Figures

4.1 Iterative Model[14]
4.2 Use Case Diagram[15]
4.3 Sequence Diagram[16]
4.4 Activity Diagram[17]

5.1 Flow Diagram

6.1 Output after running 5 epochs for MLP-PSO
6.2 Accuracy-Epoch Curve for MLP-PSO
6.3 Loss-Epoch Curve for MLP-PSO
6.4 Confusion matrix for MLP-PSO
6.5 Classification Report for MLP-PSO

Chapter 1
INTRODUCTION

Pairwise sequence alignment is a crucial task in bioinformatics that involves
comparing and aligning two DNA sequences to identify regions of similarity or
homology. Traditional alignment algorithms face challenges in terms of computational
efficiency and scalability, especially when dealing with large DNA sequences. To
overcome these limitations, researchers have explored the integration of Multilayer
Perceptron (MLP) and Particle Swarm Optimization (PSO) techniques to enhance
the accuracy and efficiency of pairwise sequence alignment.

Multilayer Perceptron (MLP) is a type of artificial neural network known for
its ability to learn complex patterns and make predictions. By training an MLP
on a labeled dataset of aligned sequences, it can capture patterns and relation-
ships between nucleotides, enabling it to predict alignments for new sequences.
On the other hand, Particle Swarm Optimization (PSO) is a population-based op-
timization algorithm inspired by social behavior. It iteratively adjusts alignment
parameters and scoring functions based on fitness or objective functions, leading
to improved alignment solutions.

In our study, we propose a new method that combines Multilayer Perceptron
(MLP) and Particle Swarm Optimization (PSO) for aligning DNA sequences. We
conducted experiments using standard datasets to assess the effectiveness of our
approach and compared its performance to traditional alignment algorithms. The
results showed that the MLP-PSO approach achieves comparable accuracy in se-
quence alignment while providing computational benefits. This integration has the
potential to advance pairwise sequence alignment and enhance our understanding
of the relationships and functions of DNA sequences. By overcoming the limi-
tations of traditional methods, this approach has implications for bioinformatics,
genomics, and molecular biology, enabling more efficient analysis of DNA sequences
and facilitating valuable insights into genetic information.

1.1 Basic Concepts
• Sequence Alignment

• Pairwise Sequence alignment

• Match, Mismatch, and Gap

• Alignment Score

• Multilayer Perceptron

• Particle Swarm Optimization

1.1.1 Sequence Alignment


Alignment is the process of arranging sequences in a way that maximizes their
similarity by matching corresponding positions. It involves inserting gaps in the
sequences to account for insertions or deletions that may have occurred during
evolution, while substitutions appear as mismatched positions.

1.1.2 Pairwise Sequence alignment


Pairwise sequence alignment in DNA involves comparing and aligning two DNA
sequences to identify regions of similarity or divergence. It is performed using
algorithms such as the Needleman-Wunsch or Smith-Waterman algorithms, which
consider nucleotide matches, mismatches, and gaps. Pairwise alignment helps
reveal evolutionary relationships, detect genetic variations, and analyze functional
elements within DNA sequences.
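
As an illustration of the classical baseline named above, the following is a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for the example and are not taken from this report.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not the report's settings.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

The table has (n + 1) × (m + 1) cells, so time and memory grow with the product of the two sequence lengths, which is the scalability concern raised for traditional methods.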

1.1.3 Match, Mismatch, and Gap


In sequence alignment, a match occurs when the nucleotides or amino acids at
corresponding positions in the sequences are the same. A mismatch occurs when
they are different. A gap is a placeholder (commonly written as “-”) inserted
into one sequence so that a position in the other sequence has no partner,
indicating an insertion or deletion.

1.1.4 Alignment Score


The alignment score represents the overall quality of an alignment and is based
on the scoring system used. Higher scores indicate better alignments, reflecting
higher similarity between the sequences.
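
To make the scoring idea concrete, here is a small sketch that scores an alignment that has already been produced. It assumes a simple linear scheme (match = 1, mismatch = -1, gap = -2); these values and the function name are illustrative and are not the scoring system used in this project. For the aligned pair GATTACA / GA-TGCA there are five matches, one mismatch, and one gap, giving 5(1) + 1(-1) + 1(-2) = 2.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of two already-aligned, equal-length strings.

    '-' marks a gap. The scoring values are illustrative assumptions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


print(alignment_score("GATTACA", "GA-TGCA"))  # 5 matches, 1 mismatch, 1 gap -> 2
```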

1.1.5 Multilayer Perceptron (MLP):
MLP is an artificial neural network architecture commonly used in machine learn-
ing. It consists of multiple layers of interconnected nodes (neurons) that process
input data and generate output predictions. MLP is known for its ability to learn
complex patterns and relationships in data through a process called training. In
this project, MLP is utilized to capture patterns and relationships between nu-
cleotides in DNA sequences, enabling it to predict alignments for new sequences.
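
The forward pass of an MLP can be sketched as follows. This is a minimal illustration and not the model built in this project: the one-hot encoding of nucleotides, the layer sizes (32 inputs, 8 hidden units, 1 output), and the tanh and sigmoid activations are all assumptions, and the weights are random and untrained, so the output value carries no meaning.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat list with four 0/1 values per nucleotide."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vec


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """One layer: weighted sum of the inputs plus bias, then the activation."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, layers):
    """Apply tanh on every hidden layer and sigmoid on the output layer."""
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, math.tanh)
    weights, biases = layers[-1]
    return dense(x, weights, biases, sigmoid)


rng = random.Random(0)
# Hypothetical input: two length-4 sequences concatenated into 32 features.
features = one_hot("ACGT") + one_hot("AGGT")
layers = [init_layer(32, 8, rng), init_layer(8, 1, rng)]
output = mlp_forward(features, layers)
print(output)  # a single value between 0 and 1 from the untrained network
```

Training then amounts to choosing the weights and biases in `layers` so that the output matches the labels; in this report that search is assigned to PSO.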

1.1.6 Particle Swarm Optimization (PSO)


PSO is a population-based optimization algorithm inspired by the collective be-
havior of bird flocks or fish schools. It involves a group of particles moving through
a search space, with each particle representing a potential solution. The particles
communicate with each other to update their positions based on their individual
and group experiences. PSO iteratively adjusts alignment parameters and scoring
functions based on fitness or objective functions, aiming to optimize the alignment
process. In this project, PSO is used to improve the accuracy and efficiency of
pairwise sequence alignment by optimizing the alignment parameters.
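
The particle update described above can be sketched in a few lines. This is a generic PSO under stated assumptions, not the configuration used in this project: the inertia weight `w`, the coefficients `c1` and `c2`, the search bounds, and the sphere test function are all illustrative. In the project setting the objective would instead measure how well a candidate set of MLP parameters performs.

```python
import random


def pso(objective, dim, n_particles=20, iterations=100,
        w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize objective over a dim-dimensional box; return (best_pos, best_val)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_val = pso(sphere, dim=3)
print(best, best_val)
```

Because the swarm's best value is only replaced by a strictly better one, it never gets worse from one iteration to the next.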

1.2 Motivation
Traditional sequence alignment algorithms cannot yield results over large
databases within reasonable limits of time, power, and cost. Developing faster
and more efficient algorithms for sequence alignment can enable researchers to
analyze large amounts of biological data quickly and accurately. This, in turn,
can accelerate scientific discovery and facilitate the development of new
treatments for diseases.

1.3 Problem Statement


The goal is to improve the accuracy and efficiency of sequence alignment by
leveraging the capabilities of MLP and PSO. The challenges include training the
MLP model on a diverse dataset of DNA sequences with known alignments, optimizing
the model using PSO to find the best alignment parameters, and evaluating the
performance of the model in terms of alignment accuracy and computational
efficiency. By addressing these challenges, we aim to provide a robust and
efficient solution for pairwise sequence alignment in DNA sequences using MLP
and PSO.

1.4 Objectives
• Study existing technologies to identify their drawbacks and limitations.

• Collect DNA sequence alignment datasets.

• Develop a machine learning model that increases the accuracy of sequence
alignment.

• Train the model using a large dataset of DNA sequences, and optimize it to
enhance its performance.

• Evaluate the performance of the model on a test dataset, and compare it
with the existing methods.

1.5 Scope
• The scope of the project is limited to Pairwise Sequence Alignment.

1.6 Applications
• Disease analysis: Identify genetic variations and mutations associated with
diseases by aligning DNA sequences of affected individuals with reference
sequences.

• Evolutionary studies: Study the evolutionary relationships between species
by aligning DNA sequences, identifying conserved regions, and detecting
genetic changes.

• Drug design and discovery: Accelerate drug development by aligning DNA
sequences of target proteins or disease-causing genes with known drug
targets, aiding in the identification of potential drug candidates.

• Comparative genomics: Compare DNA sequences from different organisms or
species to uncover shared regions, conserved genes, and regulatory elements,
providing insights into genetic similarities and differences.

Chapter 2
LITERATURE REVIEW

2.1 Sequence Alignment Using Machine Learning-Based
Needleman–Wunsch Algorithm
Methodology: In this paper, the authors propose a new approach that combines
the traditional Needleman-Wunsch algorithm with a neural network to improve
the accuracy of sequence alignment. They first train a neural network on a large
dataset of aligned sequences to learn patterns in the data. Then, they use the
trained network to guide the alignment process by providing a score for each
possible alignment position.
Advantages:

• Using a trained neural network to score candidate alignment positions is
intended to improve the accuracy of the Needleman-Wunsch alignment.

Disadvantages:

• The neural network must first be trained on a large dataset of aligned
sequences.

2.2 Performance-Based Analogising of Needleman Wunsch Algorithm
to Align DNA Sequences Using GPU and FPGA
Methodology: The paper “Performance-Based Analogising of Needleman Wunsch
Algorithm to Align DNA Sequences Using GPU and FPGA” proposes an
optimized implementation of the Needleman-Wunsch algorithm for aligning DNA
sequences using both GPU and FPGA architectures. The authors compare the
performance of their optimized implementation to existing CPU-based
implementations, demonstrating significant speedups on both GPU and FPGA
platforms. The authors also discuss the potential for further optimization and
improvement in future work.

Advantages:

• The proposed approach can achieve significant speedups compared to
traditional implementations of sequence alignment algorithms.

• Fast computation.

Disadvantages:

• Further improvements can be made to this work by optimizing the
networks and enhancing the performance in terms of space utilization and
execution time.

2.3 Accelerating DNA pairwise sequence alignment using FPGA and a
customized convolutional neural network
Methodology: The paper “Accelerating DNA pairwise sequence alignment using
FPGA and a customized convolutional neural network” proposes a novel
approach to accelerating pairwise sequence alignment of DNA sequences using
field-programmable gate arrays (FPGA) and a customized convolutional neural
network (CNN). The authors demonstrate that their approach can achieve
significant speedups compared to traditional implementations of sequence
alignment algorithms, while maintaining high accuracy. The authors also discuss
the potential for their approach to be integrated into existing bioinformatics
pipelines and applications.
Advantages:

• High speed: The proposed approach can achieve significant speedups com-
pared to traditional implementations of sequence alignment algorithms.

Disadvantages:

• Hardware dependency: The approach is hardware-dependent and requires
access to FPGA devices, which can limit its accessibility and portability.

2.4 Local Alignment of DNA Sequence Based on Deep Reinforcement
learning
Methodology: The paper proposes a deep reinforcement learning (DRL) based
approach for local alignment of DNA sequences. Local alignment aims to identify
the subsequences that have the highest degree of similarity between two input
sequences. The proposed approach employs a convolutional neural network (CNN)
to learn the sequence features and a DRL model to make optimal alignment
decisions. The DRL model learns to maximize the reward function by aligning the
two sequences to obtain the highest similarity score. The approach is evaluated on
standard benchmark datasets and compared with existing state-of-the-art methods.
Advantages:

• The proposed DRL-based approach achieves higher accuracy and faster pro-
cessing times compared to existing methods.

Disadvantages:

• The approach requires a large amount of training data to learn the optimal
alignment strategy.

2.5 Simple and Efficient Pattern Matching Algorithms for
Biological Sequences
Methodology: The paper “Simple and Efficient Pattern Matching Algorithms for
Biological Sequences” proposes two novel algorithms for pattern matching in
biological sequences: the Boyer-Moore-Horspool (BMH) algorithm and the Reverse
Complement (RC) algorithm. The authors compare the performance of these
algorithms to existing algorithms such as the Basic Local Alignment Search Tool
(BLAST) and demonstrate that the BMH and RC algorithms can achieve
significant speedups while maintaining high accuracy.
Advantages:

• The paper presents two novel algorithms for pattern matching in biological
sequences that are simple and efficient.

• The proposed algorithms have low memory requirements and are well-suited
for use in resource-constrained settings.

Disadvantages:

• The paper does not compare the proposed algorithms to some of the more
recent, state-of-the-art pattern matching algorithms, which could limit their
utility in cutting-edge research.

2.6 GenieHD: Efficient DNA Pattern Matching Accelerator Using
Hyper-dimensional Computing
Methodology: This study proposes GenieHD, which effectively parallelizes the
DNA pattern matching problem. It makes use of hyperdimensional (HD)
computing, which is inspired by the brain and imitates pattern-based calculations
in human memory. HD computing is used to convert the naturally sequential
processes involved in DNA pattern matching into highly parallelizable compute
workloads. The complete genome sequence and the target DNA pattern are first
encoded as high-dimensional vectors using the proposed method. Once encoded,
a simple operation on the high-dimensional vectors can reveal whether the desired
pattern is present anywhere in the entire sequence. The study also proposes an
accelerator architecture that drastically lowers the number of memory accesses
while effectively parallelizing HD-based DNA pattern matching. To fulfill the
needs of the target system, the architecture can be implemented on a variety of
parallel computing platforms.
Advantages:

• The experimental results show that GenieHD significantly accelerates the
pattern matching procedure, e.g., 44.4× speedup with 54.1× energy-efficiency
improvements when compared to the existing design on the same FP.

Disadvantages:

• When we calculate the inter molecular distance between the points then it
may show different value when compared with the normal measured distance.

2.7 Accelerating Edit-Distance Sequence Alignment on GPU Using
the Wavefront Algorithm
Methodology: The paper explores the use of the wavefront algorithm on a
GPU (Graphics Processing Unit) to accelerate sequence alignment, specifically
edit-distance sequence alignment. The wavefront algorithm is known for
leveraging parallel computing capabilities to enhance the speed of alignment
algorithms. By employing this algorithm on a GPU, the authors aim to achieve
significant performance improvements. While the paper provides a conceptual
overview of the technique, it highlights that implementing the wavefront
algorithm on a GPU requires programming expertise and familiarity with GPU
computing frameworks such as CUDA or OpenCL. Furthermore, optimizing memory
access patterns, load balancing, and algorithmic optimizations specific to the
alignment algorithm are essential for achieving efficient performance.

Advantages:

• Running the wavefront algorithm on a GPU exploits parallel computing to
speed up edit-distance sequence alignment.

Disadvantages:

• Implementing the approach requires programming expertise and familiarity
with GPU computing frameworks such as CUDA or OpenCL.

2.8 SLPal: Accelerating Long Sequence Alignment on Many-Core
and Multi-Core Architectures
Methodology: The paper presents SLPal as an innovative approach for
accelerating long sequence alignment on many-core and multi-core architectures.
SLPal is designed to leverage the computational power of these architectures to
achieve faster and more scalable alignment. By combining the seed-and-extend
paradigm, divide-and-conquer strategy, SIMD vectorization, and indexing
techniques, SLPal offers a highly efficient and scalable solution for handling
the alignment of long sequences. The work showcases the potential of SLPal in
addressing the computational challenges associated with the analysis of
increasingly longer sequences in genomics and bioinformatics research.
Advantages:

• Accelerated Alignment: SLPal provides significant acceleration in long
sequence alignment compared to traditional methods.

• Scalability: SLPal incorporates a divide-and-conquer strategy, allowing it to
scale efficiently with longer sequences.

Disadvantages:

• Hardware Dependency: SLPal is specifically designed for many-core and
multi-core architectures. While this allows for efficient parallel processing,
it also restricts its applicability to systems that possess such hardware
capabilities.

• Implementation Complexity: Implementing and optimizing SLPal for specific
hardware architectures and alignment algorithms can be complex.

Chapter 3
SOFTWARE REQUIREMENT
ANALYSIS

Requirements analysis, also known as requirements engineering, is the process
of determining user expectations for a new or modified product. It is a team
activity that draws on a mix of experience in hardware, software, and human
factors engineering, as well as skills in communicating with people. The
resulting requirements have to be quantifiable, specific, and detailed; in
software engineering they are also termed functional specifications.
Requirements analysis is an important part of project management and requires
regular contact with the intended users to establish the functionality they
expect and to resolve conflicts or ambiguities in the requirements requested by
different users or groups of users.

A software requirements specification (SRS) is a detailed description of the
intended function and environment of the program under development. It explains
what the program is going to do and how it is expected to behave. An SRS reduces
the time and effort developers need to reach the desired goals, and therefore
lowers development cost. A good SRS also defines how an application interacts
with the system hardware, other programs, and human users in a wide range of
real-world situations.

3.1 Hardware Requirements


• RAM: 8 GB

• Storage: 256 GB

• Processor: Intel Core i5, 9th generation

3.2 Software Requirements


Operating System: Windows 10 or above
Python Libraries: Orange, scikit-learn
Development Environment: Google Colab

3.2.1 Python 3.9:
Python is a general-purpose, versatile programming language. It is a good
first language because it is concise and easy to read, and it is widely used in
web development, machine learning, and data science. Python code is easy to
write and simple to understand, which keeps the cognitive overhead low when
working with code from another developer or with third-party components. This
matters because code is read more often than it is written. The wide choice of
libraries is one of the main reasons Python is so popular for AI work. A library
is a module or a group of modules, distributed through sources such as PyPI,
that contains pre-written code providing some functionality. Python libraries
provide base-level building blocks so that developers do not have to code them
from scratch every time. ML requires continuous data processing, and Python's
libraries let you access, handle, and transform data. These include NumPy,
SciPy, scikit-learn, and many more, which cover the core tasks of machine
learning.

3.3 Functional Requirements
• Input DNA Sequences: The system should allow users to input DNA
sequences for alignment, either by manual entry or by importing from external
files.

• Pairwise Sequence Alignment: The system should perform pairwise
sequence alignment of the input DNA sequences using the Multilayer
Perceptron (MLP) and Particle Swarm Optimization (PSO) algorithm.

• Alignment Results: The system should provide the aligned sequences as
the output, displaying the aligned regions, gaps, and mismatches.
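
The input requirement above could be met along the following lines. This is only a sketch: the report does not state which file format is used, so the FASTA format, the allowed alphabet (A, C, G, T, N), and both function names are assumptions made for illustration.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text (assumed format)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # a new record starts at each '>'
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line.upper())
    # Join the wrapped lines of each record into a single sequence string.
    return {h: "".join(parts) for h, parts in records.items()}


def validate_dna(seq, alphabet="ACGTN"):
    """Return seq unchanged, or raise ValueError if it has unexpected symbols."""
    bad = sorted(set(seq) - set(alphabet))
    if bad:
        raise ValueError("unexpected symbols in DNA sequence: " + ", ".join(bad))
    return seq


sample = ">seq1\nACGT\nacgt\n>seq2\nGGCC\n"
sequences = {name: validate_dna(s) for name, s in parse_fasta(sample).items()}
print(sequences)
```

Manual entry can reuse `validate_dna` directly on the typed string, so both input paths reject invalid characters in the same way.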

3.4 Non-Functional Requirements


• Performance: The system should be able to align DNA sequences
efficiently, providing results within a reasonable time frame, even for
large-scale datasets.

• Accuracy: The alignment results should be accurate, reflecting the true
homology and evolutionary relationships between the DNA sequences.

• Reliability: The system should be reliable, producing consistent and
accurate alignment results across multiple runs.

• Scalability: The system should be able to handle a large number of DNA
sequences for alignment, accommodating scalability as the dataset size
increases.

• Usability: The system should have a user-friendly interface, making it easy
for users to input sequences, initiate alignment, and interpret the results
without requiring specialized knowledge.

Chapter 4
SOFTWARE DESIGN

Software design is the phase of a software development methodology that results
in a description of how best to solve the problem at hand. The design goes
through several iterations before it is finished. The design [14] can cover
various elements of the program, such as solution architecture, application
structure, database design, and integration techniques. The input to the design
phase is the specification produced during analysis and planning. Software
design applies not only to the system as a whole but also to each individual
part of it. The design is followed by the coding phase.

4.1 Software Development Lifecycle


1. Initially, DNA sequence data is collected for training and testing the
alignment model.

2. After data collection, a thorough analysis is conducted to ensure the collected
data meets the required criteria.

3. Based on the analysis, a detailed plan is formulated for data preprocessing
and alignment model implementation.

4. The project design is developed, considering the architecture and key
components of the alignment model.

5. Following the design, the alignment model is implemented according to the
planned approach.

6. Testing is performed using dedicated testing data to assess the model’s
performance and effectiveness.

7. Model evaluation is conducted during the testing phase to measure the
alignment model’s accuracy and reliability.

8. Upon successful testing, the model is deployed for practical usage.

9. In case additional requirements arise from users, the iterative cycle repeats,
allowing for modifications and enhancements to be made to the alignment
model.

Figure 4.1: Iterative Model[14]

4.2 UML Diagrams


4.2.1 Use-Case Diagram
A use case diagram is a graphical depiction of a user's possible interactions
with a system. Figure 4.2 presents the use case diagram of the proposed system.

Figure 4.2: Use Case Diagram[15]

4.2.2 Sequence Diagram
In software engineering, a sequence diagram (or system sequence diagram) shows
process interactions arranged in time sequence. Figure 4.3 presents the
sequence diagram of the proposed system.

Figure 4.3: Sequence Diagram[16]

4.2.3 Activity Diagram
Activity diagrams are graphical representations of workflows of stepwise
activities and actions, with support for choice, iteration, and concurrency.
Figure 4.4 presents the activity diagram of the proposed system.

Figure 4.4: Activity Diagram[17]

Chapter 5
PROPOSED SYSTEM

5.1 Process Flow Diagram


Figure 5.1 shows the process flow of the proposed system.

Figure 5.1: Flow Diagram

The process flow diagram represents the overall step-by-step process followed
in the project, from the collection of the dataset to the display of the end
result to the user.

5.2 Methodology
The architecture is mainly divided into 4 steps:

1. Data Preprocessing.

2. Model building and Training.

3. Hyperparameter Tuning.

4. Model Evolution and Comparison.

5.2.1 Data Preprocessing:
Data preprocessing is an essential step in preparing the data for analysis and
model training. In the context of Pairwise Sequence Alignment using Multilayer
Perceptron (MLP) and Particle Swarm Optimization (PSO), data preprocessing
involves the following steps:
Data Collection and Organization:
Gather a dataset of DNA sequences that need to be aligned. Ensure that the
dataset is properly organized, with each sequence represented as a separate data
point.
Data Encoding:
Convert the DNA sequences into a numerical representation that can be
understood by the MLP model. Common encoding techniques include one-hot
encoding, where each nucleotide is represented by a binary vector, and
numerical encoding, where each nucleotide is assigned a unique numeric value.
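The two encodings described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code; the A, C, G, T ordering and the function names `one_hot_encode` and `integer_encode` are assumptions made for the example.

```python
# Illustrative DNA encodings. The A/C/G/T ordering is an assumption;
# any fixed ordering works as long as it is used consistently.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element binary vector per nucleotide in `sequence`."""
    vectors = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        vectors.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return vectors


def integer_encode(sequence):
    """Return one integer per nucleotide (A=0, C=1, G=2, T=3)."""
    return [NUCLEOTIDES.index(base) for base in sequence.upper()]


print(one_hot_encode("GATC"))  # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
print(integer_encode("GATC"))  # [2, 0, 3, 1]
```

One-hot vectors avoid implying an ordering between nucleotides, while integer codes are more compact; which one suits the MLP better is an empirical question.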
Data Split:
Split the dataset into training and test datasets. The training dataset is used
to train the MLP model, while the test dataset is used to evaluate its
performance.
Data Normalization:
Normalize the numerical features of the dataset to a common scale. This ensures
that all features contribute equally during model training.
By performing these data preprocessing steps, the DNA sequences are transformed
into a suitable format for training the MLP model using PSO for pairwise
sequence alignment.
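The split and normalization steps can be sketched in plain Python as shown below. The 80/20 split ratio, the min-max scaling scheme, the toy data, and all function names are assumptions for illustration; the report does not state which ratio or scaler was used.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_min_max(rows):
    """Compute per-feature minimum and maximum from the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lows, highs):
    """Scale each feature with the supplied minimum and maximum."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical integer-encoded samples, three features each.
data = [[0, 3, 1], [2, 1, 3], [3, 0, 2], [1, 2, 0], [3, 3, 3]]
train, test = train_test_split(data, test_fraction=0.2)

# Fit the scaling on the training rows only, then reuse it for the test rows.
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
test_scaled = apply_min_max(test, lows, highs)
```

Fitting the scaler on the training rows only keeps information from the test set out of training; test values may then fall slightly outside the range 0 to 1.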

5.2.2 Model building and Training:


Multilayer Perceptron (MLP): A Multilayer Perceptron is a type of artificial
neural network that consists of multiple layers of interconnected artificial neu-
rons. Each neuron receives inputs, applies weights to those inputs, and passes the
weighted sum through an activation function to produce an output. The MLP is
trained using a process called backpropagation, where the weights of the neurons
are adjusted iteratively to minimize the difference between the predicted output
and the actual output.
In this project, the MLP is used for pairwise sequence alignment of DNA
sequences. The input to the MLP is the encoded numerical representation of the
DNA sequences obtained during the data preprocessing phase. The MLP is trained
to learn the patterns and relationships in the data and to predict the
alignment of the sequences. It consists of an input layer, one or more hidden
layers with activation functions, and an output layer.
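The forward computation of such a network (weighted sum, then activation, layer by layer) can be sketched as follows. The layer sizes, the sigmoid activation, and the random initial weights are assumptions for illustration only; the report does not specify the architecture that was used.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, inputs):
    """Propagate `inputs` through every layer and return the final outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


rng = random.Random(0)
# Hypothetical architecture: 8 inputs, one hidden layer of 5 neurons, 1 output.
layers = [init_layer(8, 5, rng), init_layer(5, 1, rng)]
x = [1, 0, 0, 0, 0, 0, 1, 0]  # two one-hot encoded nucleotides
output = forward(layers, x)
print(output)
```

Training then amounts to adjusting the entries of `layers` so that `output` moves closer to the target value.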
Particle Swarm Optimization (PSO): Particle Swarm Optimization is a
metaheuristic optimization algorithm inspired by the social behavior of bird flock-
ing or fish schooling. It involves a population of particles, where each particle
represents a potential solution to the optimization problem. The particles move
through the search space, adjusting their positions and velocities based on their
own experience and the information shared with neighboring particles.
In the project, PSO is used to optimize the weights and biases of the MLP.
The particles in the PSO algorithm represent different sets of weights and biases
for the MLP. The fitness function is defined based on the performance of the MLP
in aligning the DNA sequences. The particles adjust their positions and velocities
based on their own best position and the best position found by any particle in
the swarm. This iterative process helps to search for the optimal set of weights
and biases that minimize the alignment error.
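The update rule described above can be sketched as a small global-best PSO. This is a minimal sketch under stated assumptions, not the project's implementation: the inertia and acceleration coefficients, the swarm size, the toy samples, and the single-neuron model standing in for the MLP are all hypothetical.

```python
import math
import random


def pso_minimize(fitness, dim, n_particles=20, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO: returns (best position, best fitness, history)."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                                    + c2 * r2 * (gbest[d] - positions[i][d]))
                positions[i][d] += velocities[i][d]
            fit = fitness(positions[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy samples: two input features and a target value.
SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]


def alignment_error(params):
    """Mean squared error of a single neuron with weights w1, w2 and bias b."""
    w1, w2, b = params
    return sum((sigmoid(w1 * x1 + w2 * x2 + b) - y) ** 2
               for (x1, x2), y in SAMPLES) / len(SAMPLES)


best, best_fit, history = pso_minimize(alignment_error, dim=3)
print(best_fit)
```

Each particle is a flat vector of weights and biases, so the same routine applies to a full MLP once its parameters are flattened into one list and the fitness function runs the network on the training data.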

5.2.3 Hyperparameter Tuning:


Identify the hyperparameters that significantly affect the performance of the
models. These may include learning rate, batch size, number of epochs,
optimizer type, regularization techniques (e.g., dropout, weight decay), and
architecture-specific parameters (e.g., depth, width, kernel size). Determine a
range of values or discrete options for each hyperparameter to be explored.
Here we use batch size and number of epochs as hyperparameters:
batch size = 40
number of epochs = 5
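To make the role of these two hyperparameters concrete, the sketch below shows how batch size and number of epochs determine how many update steps a training loop performs. The 100 placeholder samples and the helper `iter_batches` are assumptions for illustration; only the values 40 and 5 come from this report.

```python
BATCH_SIZE = 40  # value stated in this report
NUM_EPOCHS = 5   # value stated in this report


def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(100))  # placeholder for encoded training samples
updates = 0
for epoch in range(NUM_EPOCHS):
    for batch in iter_batches(samples, BATCH_SIZE):
        updates += 1  # a real loop would update the model on `batch` here

print(updates)  # 3 batches per epoch * 5 epochs = 15
```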

5.2.4 Model Evolution and Comparison


Model Evolution and Comparison for Pairwise Sequence Alignment using Mul-
tilayer Perceptron (MLP) and Particle Swarm Optimization (PSO) involves the
following steps:
Baseline Model:

• Start by building a baseline model using MLP and PSO.

• Train the baseline model on the training dataset.

• Evaluate its performance on the validation dataset using metrics specific to


sequence alignment.

Model Iteration:

• Make incremental changes to the baseline model to improve alignment accu-
racy.

• Explore modifications like adjusting the model architecture, hyperparame-


ters, or data preprocessing techniques.

• Train the modified models on the training dataset and evaluate their perfor-
mance on the validation dataset.

Performance Evaluation:

• Compare the performance of the modified models with the baseline model.

• Use metrics such as alignment accuracy, similarity scores, or alignment qual-


ity measures to assess the models’ performance.

• Consider the models’ ability to correctly align DNA sequences and their
overall performance.

5.3 Dataset Collection


Name of the Dataset: DNA SEQUENCE ALIGNMENT DATASETS BASED
ON NW ALGORITHM
Link: https://www.kaggle.com/datasets/amrezzeldinrashed/dna-sequence-alignmnet-dataset
Description: The DNA sequence alignment dataset on Kaggle provides over
8 million DNA sequence reads and a reference genome sequence for researchers
in bioinformatics and genetics. By comparing the DNA sequence reads to the
reference genome, researchers can identify genetic variations, mutations, and evo-
lutionary relationships. The dataset also includes metadata such as sample ID
and sequencing platform used, which can provide additional information for data
analysis and interpretation. This dataset is a valuable resource for researchers
interested in studying DNA sequence alignment, genetic diseases, population ge-
netics, and evolutionary biology.
Source: Kaggle
Classes: 254
Instances: 65536
Attributes: 16

Chapter 6
IMPLEMENTATION

6.1 Output Screenshots


6.1.1 Epoch Output
Figure 6.1 shows the output obtained after running the MLP-PSO model for 15
epochs.

Figure 6.1: Output after running 5 epochs for MLP-PSO

6.1.2 Accuracy-Epoch Curve


Figure 6.2 shows the accuracy-epoch curve for pairwise sequence alignment, with
the number of epochs on the x-axis and accuracy on the y-axis.

Figure 6.2: Accuracy-Epoch Curve for MLP-PSO

6.1.3 Loss-Epoch Curve
Figure 6.3 shows the loss-epoch curve for MLP-PSO, with the number of epochs on
the x-axis and loss on the y-axis.

Figure 6.3: Loss-Epoch Curve for MLP-PSO

6.1.4 Confusion Matrix


Figure 6.4 presents the confusion matrix obtained for the MLP-PSO model.

Figure 6.4: Confusion matrix for MLP-PSO

A confusion matrix is a useful tool for evaluating the performance of a model


in Pairwise Sequence Alignment using Multilayer Perceptron (MLP) and Particle
Swarm Optimization (PSO). It provides insights into how the model classifies
the DNA sequences and the number of correct and incorrect classifications. The
confusion matrix consists of four cells that represent different outcomes:
True Positive (TP):
• In the context of pairwise sequence alignment, TP refers to the number of
DNA sequences correctly classified as aligned by the model.

False Positive (FP):

• FP represents the number of DNA sequences incorrectly predicted as aligned


by the model when they are actually unaligned.

False Negative (FN):

• FN indicates the number of DNA sequences that are incorrectly predicted


as unaligned by the model, but they are actually aligned.

True Negative (TN):

• TN represents the number of DNA sequences correctly predicted as unaligned


by the model.
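The four outcomes defined above can be counted directly from the actual and predicted labels. The sketch below assumes a binary labeling with the strings "aligned" and "unaligned"; the label names and the sample data are hypothetical.

```python
def confusion_counts(actual, predicted, positive="aligned"):
    """Count TP, FP, FN and TN for a binary labeling."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted aligned, actually aligned
        elif p == positive:
            fp += 1  # predicted aligned, actually unaligned
        elif a == positive:
            fn += 1  # predicted unaligned, actually aligned
        else:
            tn += 1  # predicted unaligned, actually unaligned
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


actual = ["aligned", "aligned", "unaligned", "unaligned", "aligned", "unaligned"]
predicted = ["aligned", "unaligned", "unaligned", "aligned", "aligned", "unaligned"]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```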

6.1.5 Classification Report


A classification report is a way to evaluate the performance of a classification
model. It provides various metrics such as precision, recall, F1-score, and support
for each class in the dataset. These metrics help assess how well the model is
performing for different classes. Figure 6.5 presents the Classification Report for
MLP-PSO

Figure 6.5: Classification Report for MLP-PSO

Precision (exactness): the percentage of tuples that the classifier labeled as
positive that are actually positive.
P = TP / (TP + FP)
Recall (completeness): the percentage of positive tuples that the classifier
labeled as positive.
R = TP / (TP + FN)
F1 score: the harmonic mean of precision and recall.
F1 score = 2 × (P × R) / (P + R)
where
P = Precision
R = Recall
TP = True Positive
FN = False Negative
FP = False Positive
Support: Support is a term used in a classification report to refer to the number of
instances in each class. It is used to identify the imbalanced classes in the dataset,
which may affect the performance of the model.
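As a worked example of these definitions, the sketch below computes precision, recall, F1 score, and support. The counts and labels are made up for illustration and are not results from this project.

```python
from collections import Counter


def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727

# Support is the number of actual instances of each class.
labels = ["aligned"] * 12 + ["unaligned"] * 5
support = Counter(labels)
print(support["aligned"], support["unaligned"])  # 12 5
```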

6.2 Results and Analysis


Pairwise sequence alignment in DNA using Multilayer Perceptron and Particle
Swarm Optimization yields the following results. Classification accuracy serves
as the primary metric. It indicates how effectively the model calculates the
alignment scores, which sequence alignment algorithms such as the
Needleman-Wunsch algorithm then use to align DNA sequences. After running the
MLP-PSO model for 15 epochs, we obtained an accuracy of 94.6%. The
accuracy-epoch curve shows how accuracy changes with the number of epochs. The
confusion matrix is a useful tool for evaluating the performance of a model in
pairwise sequence alignment using Multilayer Perceptron (MLP) and Particle
Swarm Optimization (PSO). It provides insights into how the model classifies
the DNA sequences and the number of correct and incorrect classifications.
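For reference, the classical Needleman-Wunsch alignment score mentioned above can be computed with dynamic programming as sketched below. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the report does not state the scoring scheme behind the dataset.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences `a` and `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("GAT", "GT"))     # 1  (G A T over G - T)
print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the cost grows with the product of the two sequence lengths.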

Chapter 7
CONCLUSION AND FUTURE WORK

In conclusion, the development of a model for pairwise sequence alignment in DNA


sequences using Multilayer Perceptron (MLP) and Particle Swarm Optimization
(PSO) is a promising approach. By combining the power of deep learning with
optimization techniques, we can achieve accurate and efficient alignment results.
Through the various modules of data preparation, model design, training, PSO
integration, evaluation, and fine-tuning, we have laid out a systematic methodology
to guide the development process.
The MLP model trained on the DNA sequence pairs demonstrates the ability
to learn and predict alignment patterns accurately. The integration of PSO fur-
ther optimizes the model’s weights and biases, improving alignment accuracy and
enhancing its performance. The evaluation of the model on the separate evalua-
tion dataset provides insights into its effectiveness and aids in identifying areas for
improvement.

Integration of more advanced alignment algorithms: While MLP and PSO of-
fer effective techniques, consider exploring the integration of other alignment al-
gorithms, such as Smith-Waterman or Needleman-Wunsch. Combining multiple
algorithms can potentially improve alignment accuracy and handle complex align-
ment scenarios.
Exploration of alternative optimization algorithms: While PSO is a powerful
optimization algorithm, there are other metaheuristic algorithms available, such as
Genetic Algorithms or Ant Colony Optimization. Investigate the applicability of
these algorithms to optimize the MLP model’s weights and biases for alignment.

REFERENCES

[1] A. E. E.-D. Rashed, H. M. Amer, M. El-Seddek and H. E.-D. Moustafa,
"Sequence Alignment Using Machine Learning-Based Needleman–Wunsch Algorithm,"
in IEEE Access, vol. 9, pp. 109522-109535, 2021,
doi: 10.1109/ACCESS.2021.3100408.

[2] C. Kyal, R. Kumar and A. Zamal, "Performance-Based Analogising of
Needleman Wunsch Algorithm to Align DNA Sequences Using GPU and FPGA,"
2020 IEEE 17th India Council International Conference (INDICON), New
Delhi, India, 2020, pp. 1-5, doi: 10.1109/INDICON49873.2020.9342078.

[3] A. E. E.-D. Rashed, H. M. Amer, M. El-Seddek and H. E.-D. Moustafa,
"Accelerating DNA pairwise sequence alignment using FPGA and a customized
convolutional neural network," in ScienceDirect, 2021.

[4] Y.-J. Song and D.-H. Cho, "Local Alignment of DNA Sequence Based
on Deep Reinforcement Learning," in IEEE Open Journal of Engineering in
Medicine and Biology, vol. 2, pp. 170-178, 2021,
doi: 10.1109/OJEMB.2021.3076156.

[5] P. Neamatollahi, M. Hadi and M. Naghibzadeh, "Simple and Efficient
Pattern Matching Algorithms for Biological Sequences," in IEEE Access, vol. 8,
pp. 23838-23846, 2020, doi: 10.1109/ACCESS.2020.2969038.

[6] A. Mishra, B. K. Tripathi and S. Singh Soam, "A Genetic Algorithm based
Approach for the Optimization of Multiple Sequence Alignment," 2020
International Conference on Computational Performance Evaluation (ComPE),
Shillong, India, 2020, pp. 415-418, doi: 10.1109/ComPE49325.2020.9200060.

[7] Q. Aguado-Puig et al., "Accelerating Edit-Distance Sequence Alignment on
GPU Using the Wavefront Algorithm," in IEEE Access,
doi: 10.1109/ACCESS.2022.3182714.

[8] X. Xu et al., "SLPal: Accelerating Long Sequence Alignment on Many-Core
and Multi-Core Architectures," 2020 IEEE International Conference
on Bioinformatics and Biomedicine (BIBM), Seoul, Korea (South), 2020,
pp. 2242-2249, doi: 10.1109/BIBM49941.2020.9313429.

[9] https://study.com/cimages/multimages/16/iterativesdlc.png
