Accelerating DNA Pairwise Sequence Alignment Using FPGA and A Customized Convolutional Neural Network - ScienceDirect
https://doi.org/10.1016/j.compeleceng.2021.107112
Abstract
This study presents an optimized software and hardware digital implementation, based on a lookup
table (LUT), of two widely used DNA sequence alignment algorithms. These algorithms are the
best means for identifying similar regions between sequences. The proposed implementation relies on
the complete parallelization of these foundational algorithms under certain limitations to overcome
most of the problems of dynamic programming and hardware implementation. The proposed method
takes O(N/4) calculation steps, where N is the length of each sequence and is a multiple of four
(i.e., N = 4, 8, 12, …). The performance of the proposed algorithm is compared with the state of the
art for both software and hardware implementations. Combinational circuits are used for the FPGA-based
hardware implementation of the DNA sequence alignment algorithms. Performance and device resource
usage are evaluated for different hardware designs. A customized convolutional neural network model is
used to implement global alignment and achieves 98.3% accuracy.
https://www.sciencedirect.com/science/article/abs/pii/S0045790621001178 1/9
3/21/23, 11:55 AM Accelerating DNA pairwise sequence alignment using FPGA and a customized convolutional neural network - ScienceDirect
Introduction
Deoxyribonucleic acid (DNA) is a complex molecule and the “hereditary material” present inside each
cell of all living beings. It contains the instructions associated with the organism's development, life,
and reproduction. These instructions direct the cells with regard to their role in our bodies. Nearly all
the cells in a human body contain the same DNA, and most of it is present in the cell nucleus. The
information in DNA is stored as a genetic code consisting of four chemical nucleotides (NTs),
namely, adenine (A), guanine (G), cytosine (C), and thymine (T), i.e., a four-letter set {A, C, G, T}. The
complete human DNA contains about three billion NTs. The order of these NTs determines the
biological instructions in the genome for building and maintaining an organism, much as letters
appear in a particular order to form words and sentences.
Ribonucleic acid (RNA) is “a complex compound of high molecular weight that functions in cellular
protein synthesis and replaces DNA as a carrier of genetic codes in certain viruses. It consists of four
ribose NTs or nitrogenous bases: adenine (A), guanine (G), cytosine (C), and uracil (U).” U replaces the T
present in DNA. Thus, the alphabet for RNA sequences is also a four-letter set {A, C, G, U}, and the
alphabet for protein sequences is a 20-letter set {A, C–I, K–N, P–T, V, W, Y}.
By aligning two or more DNA, RNA, or protein sequences with a sequence alignment algorithm, it is
possible to determine where they match and where they mismatch. Sequence alignment is
a widely used process in bioinformatics for arranging two (pairwise alignment) or more (multiple
sequence alignment) biological sequences (e.g., DNA, RNA, and protein sequences) to
identify regions of similarity. It seeks the optimal alignment with the highest total score, i.e.,
the maximum number of base-to-base matches, without altering the order of bases in either sequence.
In addition, gap-to-gap matches are prohibited. Mismatches and gaps can be interpreted as mutations and
indels, respectively, so differences between sequences with a common origin can be identified.
Hence, this process is considered a foundational step in detecting the structural or functional
importance of unknown sequences. It also aids in detecting the gene responsible for a
specified disease or disorder, or in determining the gene or genes that encode a specified protein. A
large number of DNA sequencing projects have contributed to the growth of bioinformatics and
computational biology, and sequence alignment has numerous significant real-world applications.
Pairwise sequence alignment (PWSA) methods align two sequences at a time to
identify regions of similarity.
Fig. 1 shows the three foundational techniques for obtaining pairwise alignments: the dot-matrix
technique introduced by Gibbs and McIntyre; dynamic programming (DP), which was first
developed by Charles DeLisi in the USA for protein–DNA binding and by Georgii Gurskii and Alexander
Zasedatelev in the USSR; and word techniques, which are heuristic methods that cannot guarantee an
optimal alignment. Word techniques are best known from database search tools such as the
FASTA and BLAST families and SIM2. For large-scale database
searches or long sequences, computational efficiency is often achieved by replacing the DP algorithm
with a heuristic one that trades accuracy for computation time, as shown in Table I. The DP
technique can produce global alignments through the Needleman–Wunsch (NW) algorithm
or the Hirschberg algorithm. It can also produce local alignments through the Smith–Waterman
(SW) algorithm, the Gotoh algorithm, or the Miller–Myers algorithm.
Table II notes that the NW and SW algorithms require O(MN) calculation steps and O(MN) run time,
where M is the length of the first sequence and N is the length of the second sequence. These
algorithms support different scores for exact residue matches, similar residues, and gaps. A
substitution matrix such as PAM or BLOSUM can be used to weight residue matching scores without
affecting the time and space complexity. Space-optimized methods such as Miller–Myers and Hirschberg
reduce the space complexity to O(M+N).
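The O(MN) cost can be made concrete with a minimal sketch of the NW score recurrence. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the function names are illustrative assumptions, not the parameters used in the paper. The second function keeps only two rows of the table; this is the score-only trick that linear-space methods build on, not the Hirschberg or Miller–Myers algorithm itself.

```python
def nw_score_full(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using the full (M+1) x (N+1) table.

    Returns the score and the number of inner cells computed (M * N).
    Scoring values are illustrative assumptions.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        H[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, n + 1):
        H[0][j] = j * gap          # prefix of b aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # match or mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H[m][n], m * n


def nw_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Same score, but only the previous and current rows are stored."""
    n = len(b)
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]


score, cells = nw_score_full("ACGT", "AGT")
print(score, cells)                      # 2 12
print(nw_score_two_rows("ACGT", "AGT"))  # 2
```

Both functions still perform M × N cell updates; only the memory footprint differs.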
From a mathematical perspective, DP algorithms guarantee an optimal alignment for a specified set of
scoring functions. Although they do not strictly require a gap penalty, gap penalties are essential for
their efficient operation. In addition, they become slow for multiple sequences (more than two
sequences) or very long sequences: aligning more than two sequences is costly, time consuming, and
requires a substantial amount of calculation. In general, the outputs of DP-based sequence
alignment algorithms are classified as global or local alignments. Table III
compares the SW and NW algorithms. Another important aspect is the alignment
array produced by each algorithm. In Table III, the text highlighted in green is the actual alignment
result produced by the SW algorithm. Unlike the SW algorithm, the NW algorithm displays the complete set
of input letters in the alignment result, or alignment array.
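The difference between the two outputs can be illustrated with a small sketch that runs both variants with a traceback. The scoring values (match = 1, mismatch = -1, linear gap = -1), the function name `align`, and the example sequences are assumptions chosen for illustration; the paper's own alignment array format is not reproduced here.

```python
def align(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b).

    local=False: Needleman-Wunsch style global alignment.
    local=True:  Smith-Waterman style local alignment.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:
        for i in range(1, m + 1):
            H[i][0] = i * gap
        for j in range(1, n + 1):
            H[0][j] = j * gap
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                H[i][j] = max(0, H[i][j])     # local scores never drop below zero
                if H[i][j] > best:
                    best, bi, bj = H[i][j], i, j
    # Global traceback starts in the corner, local traceback at the best cell.
    i, j = (bi, bj) if local else (m, n)
    score = H[i][j]
    top, bottom = [], []
    while i > 0 or j > 0:
        if local and H[i][j] == 0:
            break                              # local alignment ends at a zero cell
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


print(align("GGACGTCC", "TTACGTAA", local=True))   # only the shared region
print(align("GGACGTCC", "TTACGTAA", local=False))  # every input letter appears
```

With these inputs the local result contains only the shared region `ACGT`, whereas the global result contains all eight letters of each input.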
The length of a DNA or RNA sequence is variable. Thus, constructing algorithms that produce an
optimal alignment and a high score between sequences consisting of the four letters A, C, G, and T (for
DNA), or A, C, G, and U (for RNA), is challenging. This study analyzes two
commonly used sequence alignment algorithms and realizes them effectively on cost-efficient,
high-performance, and high-speed platforms. To aid the design process, the alignment array is
reshaped as a 1-D array rather than a 2-D array in this study.
The remainder of this study is organized as follows: The related work is covered in Section 2. The main
issues encountered in the DNA sequence and hardware implementations are explained in Section 3.
The limitations and restrictions of our proposed technique are described in Section 4. The proposed
algorithm design, statistics, and software implementations are illustrated in Section 5. The hardware
implementation of the SW algorithm and the NW algorithm using the FPGA platform is reported in
Section 6. The design and implementation of the NW algorithm using a deep learning (DL)
convolutional neural network (CNN) are demonstrated in Section 7, before the conclusions are drawn.
Almost all the sections of this study describe a comparison with other studies or other tools. Future
work and potential enhancements or modifications to our optimized design to improve the prediction
results are presented in the final section.
Section snippets
Related work
In [1], Strengholt and Brobbel explained a technique to store the values of the similarity score matrix of
the SW algorithm differentially. They also described a systematic approach to designing an accelerator
that realizes this technique on an Intel FPGA platform. The authors stated that
this technique could deliver an overall performance of 94 GCU/s, which may be up to
5× that of classic CPUs.
Problem definition
Biological sequence alignment algorithms are time consuming even when implemented on
accelerating hardware platforms such as CPUs, GPUs, or FPGAs, for the following reasons: (1) The number
of sequences is large, and each of them can be very long. (2) Table II shows that the algorithms
used to align the sequences require O(MN) calculation steps and consume O(MN) time (M and N are
the lengths of the two input sequences). (3) Basic sequence alignment algorithms are internally
dependent…
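The snippet above is truncated, so the following is only a general illustration of data dependence in DP alignment, not the paper's own argument. In the standard SW recurrence, each cell depends on its upper, left, and upper-left neighbours, so only cells on the same anti-diagonal can be computed independently. The sketch below (function names and scoring values are assumptions) fills the table in anti-diagonal "wavefront" order and counts the sequential steps, which is M + N − 1 for non-empty inputs even if every cell on a diagonal is computed in parallel.

```python
def sw_table_rowmajor(a, b, match=1, mismatch=-1, gap=-1):
    """Reference fill order: row by row."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H


def sw_table_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Fill order by anti-diagonals (i + j = d); returns the table and step count."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    steps = 0
    for d in range(2, m + n + 1):
        # All cells with i + j == d depend only on diagonals d - 1 and d - 2,
        # so this inner loop could run in parallel on suitable hardware.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
        steps += 1
    return H, steps


table, steps = sw_table_wavefront("GGACGTCC", "TTACGT")
print(steps)                                              # 13 = 8 + 6 - 1
print(table == sw_table_rowmajor("GGACGTCC", "TTACGT"))   # True
```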
Limitations
According to the divide-and-conquer (DC) approach, the alignment problem can be broken down into
smaller subproblems. These subproblems can then be solved optimally, and their results can be used to
construct the optimal solution to the main problem. In this study, we propose using equal-length
sequences whose length is a multiple of four (N = 4, 8, 12, ...). This can be applied to DNA or RNA
sequences because they consist of four letters of the alphabet, representing four NTs, unlike the protein sequence …
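As a rough illustration of this restriction, the sketch below partitions two equal-length sequences into aligned blocks of four NTs, which yields N/4 block pairs. The helper names are hypothetical, and how the paper combines the per-block results into a full alignment is not covered by this excerpt, so only the partitioning step is shown.

```python
def split_blocks(seq, size=4):
    """Split a sequence into consecutive blocks of `size` nucleotides."""
    if len(seq) % size != 0:
        raise ValueError("sequence length must be a multiple of %d" % size)
    return [seq[k:k + size] for k in range(0, len(seq), size)]


def block_pairs(seq1, seq2, size=4):
    """Pair up the k-th block of each sequence; requires equal lengths."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return list(zip(split_blocks(seq1, size), split_blocks(seq2, size)))


pairs = block_pairs("ACGTTTGA", "ACGAATGA")
print(pairs)        # [('ACGT', 'ACGA'), ('TTGA', 'ATGA')]
print(len(pairs))   # 2, i.e. N / 4 for N = 8
```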
Proposed algorithm
Fig. 3 shows that our implementation depends on building a truth table, or LUT, of all
feasible combinations of the two DNA input sequences after converting the DNA sequences from
letters into a binary representation. The truth table maps each feasible combination of DNA input
sequences, as presented to the alignment algorithm, to the resulting alignment or alignment array
(the output) for that combination.
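A minimal sketch of the lookup-table idea follows, under several assumptions that are not confirmed by this excerpt: a 2-bit encoding (A=00, C=01, G=10, T=11), blocks of four NTs so that two blocks form a 16-bit key, and an NW score (match = 1, mismatch = -1, gap = -1) stored as the table value in place of the paper's full alignment array.

```python
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit encoding


def encode(block):
    """Pack a 4-nucleotide block into an 8-bit integer."""
    value = 0
    for nt in block:
        value = (value << 2) | CODE[nt]
    return value


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (illustrative scoring), two-row DP."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def build_lut():
    """Precompute one entry per 16-bit key: 4**4 * 4**4 = 65,536 entries."""
    lut = [0] * (1 << 16)
    blocks = ["".join(p) for p in product("ACGT", repeat=4)]
    for a in blocks:
        for b in blocks:
            lut[(encode(a) << 8) | encode(b)] = nw_score(a, b)
    return lut


LUT = build_lut()


def block_scores(seq1, seq2):
    """One table lookup per pair of 4-NT blocks: N/4 lookups in total."""
    if len(seq1) != len(seq2) or len(seq1) % 4 != 0:
        raise ValueError("need equal lengths that are a multiple of four")
    return [LUT[(encode(seq1[k:k + 4]) << 8) | encode(seq2[k:k + 4])]
            for k in range(0, len(seq1), 4)]


print(block_scores("ACGTAAAA", "ACGTCCCC"))   # [4, -4]
```

Once the table exists, no DP is run at query time; each block pair costs a single indexed read, which is what makes a combinational hardware realization plausible.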
In the previous sections, we established that our technique is faster than the state-of-the-art
implementations for long DNA sequences. We now demonstrate the implementation of the SW
algorithm on a Xilinx FPGA.
Fig. 5 shows the basic steps required for our implementation. The DNA input sequences are converted
from letters into a binary representation to construct a truth table of all the possibilities for hardware
implementation. This conversion will be used for local and global hardware…
Fig. 7 shows that implementing the NW algorithm first requires constructing a truth table whose
inputs are the two DNA input sequences (16 bits in total, after converting letters into a binary
representation) and whose output is the NW alignment array after its characters are
encoded into a binary representation (54 bits). Then, 54 Boolean functions are derived from the truth
table. Two proposed class reduction techniques are used. The first reduction technique reduces …
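The bit widths quoted above are consistent with an 18-character alignment array encoded at 3 bits per character (18 × 3 = 54). The sketch below illustrates that arithmetic under loudly stated assumptions: the 8-symbol output alphabet `ACGT-*: `, the three-row example string, and the helper names are all hypothetical and not taken from the paper. The `minterms` helper shows how one Boolean function per output bit can be read off a truth table, using a toy table rather than the real 65,536-row one.

```python
SYMBOLS = "ACGT-*: "   # assumed 8-symbol output alphabet -> 3 bits per character
WIDTH = 18             # characters in the alignment array (18 * 3 = 54 bits)


def encode_alignment(text):
    """Pack an 18-character alignment array into a 54-bit integer."""
    if len(text) != WIDTH:
        raise ValueError("alignment array must have %d characters" % WIDTH)
    value = 0
    for ch in text:
        value = (value << 3) | SYMBOLS.index(ch)
    return value


def decode_alignment(value):
    """Inverse of encode_alignment."""
    chars = []
    for _ in range(WIDTH):
        chars.append(SYMBOLS[value & 0b111])
        value >>= 3
    return "".join(reversed(chars))


def minterms(truth_table, bit):
    """Inputs for which output bit `bit` is 1.

    This is the sum-of-minterms form of one Boolean function; a 54-bit
    output would give 54 such functions, one per bit position.
    """
    return sorted(k for k, out in truth_table.items() if (out >> bit) & 1)


example = "ACGT  " + "****  " + "ACGT  "   # hypothetical 3 x 6 layout
packed = encode_alignment(example)
print(packed < (1 << 54))                   # True
print(decode_alignment(packed) == example)  # True

toy_table = {0: 0b01, 1: 0b11, 2: 0b10}     # toy input -> 2-bit output
print(minterms(toy_table, 0))               # [0, 1]
print(minterms(toy_table, 1))               # [1, 2]
```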
Ten traditional classifiers, including MLP, support vector machine (SVM), decision tree, SGD, and
random forest, are tested on four datasets (Table XXVIII) using the Python scikit-learn library with
default hyperparameters. We use an 80/20 split for training and testing data. No reasonable
accuracy is achieved because the input features are dependent. The original dataset is the third dataset.
It has a binary input of 16 bits and 254 classes.
DL has attracted considerable interest in research centers. Compared with traditional neural network
architecture, it exhibits substantial advantages in feature extraction and model fitting. In addition, it is
highly effective at discovering increasingly abstract feature representations whose generalization
capability is strong from the raw input data. It has successfully solved certain issues that were
considered complicated to resolve by AI in the past. The use of big data for training and…
Conclusion
Most previous studies aimed to accelerate the alignment algorithms in different ways without
providing an effective solution to the problem of sequential processing. Our proposed algorithms depend on
the parallelization of common alignment algorithms for DNA sequences under certain limitations to
overcome the main problems of DP and hardware implementation. They can also be applied to RNA. This
technique can be applied to any other local or global alignment method and for short as well as very…
Future work
Using different gap-opening values in the NW design does not substantially affect the hardware
performance or the number of characters representing the alignment array (still 18 characters), but reviewing
and standardizing the alignment array can (e.g., using a single pattern for all full-mismatch conditions
['****::::****'] or a single symbol to represent the mismatch condition [colons only] instead of two
symbols [space and colon]). In addition, this standardization will reduce the number of…
Contributions
Author statement
Amr Ezz El-Din Rashed: Conceptualization, Methodology, Software, Hardware, Writing - Original draft
preparation.
References (29)
L Ji
One-dimensional pairwise CNN for the global alignment of two DNA sequences
Neurocomputing (2015)
Yi-L Liao
Adaptively Banded Smith-Waterman Algorithm for Long Reads and Its Hardware
Accelerator
Strengholt, B; Brobbel, M. Acceleration of the Smith-Waterman algorithm for DNA sequence alignment
using an FPGA...
W.R. Pearson
Comparison of methods for searching protein sequence databases
Protein Sci (1995)
Fa Zhang et al.
A parallel Smith-Waterman algorithm based on divide and conquer
P Zhang et al.
Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing
platform
M. Kim
Accelerating Next Generation Genome Reassembly in FPGAs: Alignment Using Dynamic
Programming Algorithms
(2011)
M N Isa
High performance reconfigurable architectures for biological sequence alignment
(2013)
E Rucci
SWIFOLD: Smith-Waterman implementation on FPGA with OpenCL for long DNA sequences
BMC Syst Biol (2018)
D Zou et al.
Optimization schemes and performance evaluation of Smith–Waterman algorithm on CPU,
GPU and FPGA
Concurr Comput: Pract Exp (2012)
Amr Ezz El-Din Rashed is a Ph.D. student at the Electronics and Communications Engineering Department, Faculty of
Engineering, Mansoura University, Egypt. He is now a lecturer at the Computer Engineering Department, Faculty of
Computers and Information Technology, Taif University, KSA. His main research interests include bioinformatics,
biomedical image processing, speaker recognition, computer vision, machine learning, deep learning applications,
and embedded systems including FPGA and VHDL.
Marwa Ismael Obayya is an Associate Professor at the Electronics and Communications Engineering Department, Faculty of
Engineering, Mansoura University, Egypt. She is now Director of the Communications Engineering Program,
Electrical Engineering Department, at Princess Nora Bent Abdurrahman University, Riyadh, KSA. Her research
interests include image processing, signal processing, optimization, and machine learning. She
has several publications in biomedical engineering, optimization, and intelligent machine learning.
Hossam El-Din Moustafa is an Associate Professor at the Department of Electronics and Communications Engineering
and the founder and executive manager of the Biomedical Engineering Program (BME) at the Faculty of Engineering,
Mansoura University. His main research interests include biomedical image and signal processing and deep learning
applications.
Reviews processed and recommended for publication by Guest Editor Feiran Huang.