2016 - AAlign A SIMD Framework For Pairwise Sequence Alignment On X86-Based Multi - and Many-Core Processors

This document summarizes a research paper presented at the IEEE International Parallel and Distributed Processing Symposium in Chicago in May 2016. The paper proposes a framework called AAlign that can automatically vectorize pairwise sequence alignment algorithms across instruction set architectures. AAlign takes sequential alignment algorithms as input and generates efficient vectorized code using either a "striped-iterate" or "striped-scan" strategy. It also uses a hybrid approach that selects the best strategy at runtime based on the algorithm, configuration, and input sequences to achieve optimal performance. Evaluation shows the generated vector code achieves up to a 26-fold speedup over sequential code on Intel processors.

IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, USA, May 2016.

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors

Kaixi Hou, Hao Wang, and Wu-chun Feng


Department of Computer Science
Virginia Tech, Blacksburg, USA
{kaixihou, hwang121, wfeng}@vt.edu

Abstract: Pairwise sequence alignment algorithms, e.g., Smith-Waterman and Needleman-Wunsch, with adjustable gap penalty systems are widely used in bioinformatics. The strong data dependencies in these algorithms, however, prevent compilers from effectively auto-vectorizing them. When programmers manually vectorize them on multi- and many-core processors, two vectorizing strategies are usually considered, both of which initially ignore data dependencies and then appropriately correct in a subsequent stage: (1) iterate, which vectorizes and then compensates the scoring results with multiple rounds of corrections and (2) scan, which vectorizes and then corrects the scoring results primarily via one round of parallel scan. However, manually writing such vectorizing code efficiently is non-trivial, even for experts, and the code may not be portable across ISAs. In addition, even highly vectorized and optimized codes may not achieve optimal performance because selecting the best vectorizing strategy depends on the algorithms, configurations (gap systems), and input sequences.

Therefore, we propose a framework called AAlign to automatically vectorize pairwise sequence alignment algorithms across ISAs. AAlign ingests a sequential code (which follows our generalized paradigm for pairwise sequence alignment) and automatically generates efficient vector code for iterate and scan. To reap the benefits of both vectorization strategies, we propose a hybrid mechanism where AAlign automatically selects the best vectorizing strategy at runtime no matter which algorithms, configurations, and input sequences are specified. On Intel Haswell and MIC, the generated codes for Smith-Waterman and Needleman-Wunsch achieve up to a 26-fold speedup over their sequential counterparts. Compared to the highly optimized and multi-threaded sequence alignment tools, e.g., SWPS3 and SWAPHI, our codes can deliver up to 2.5-fold and 1.6-fold speedups, respectively.

Keywords: parallelization; vectorization; SIMD; automated code generation; alignment; pairwise sequence search; multicore; many-core; framework

I. INTRODUCTION

Pairwise sequence alignment algorithms, e.g., Smith-Waterman (SW) [1] and Needleman-Wunsch (NW) [2], are important computing kernels in bioinformatics applications ([3], [4], [5]) to quantify the similarity between pairs of DNA, RNA, or protein sequences. This similarity is captured by a matching score, which indicates the minimum number of deletion, insertion, or substitution operations with penalty or award values to transform one sequence to another. To boost their performance on modern multi- and many-core processors, it is crucial to utilize the vector processing units (VPUs), which support single-instruction, multiple-data (SIMD) operations. However, the strong data dependencies among neighboring cells prevent such algorithms from taking advantage of compiler auto-vectorization. Thus, programmers need to explicitly vectorize their code or even resort to writing assembly code to attain better performance.

The manual vectorization of such algorithms often relies on two potential strategies: (1) iterate [6], [4], [5], which partially ignores the dependencies in one direction, vectorizes the computations, and then may compensate the results by using multiple rounds of corrections; or (2) scan [7], which completely ignores the dependencies in one direction, vectorizes the computations, and re-calculates the results with "weighted scan" operations and another round of correction. Each strategy has its own benefits depending on the selected algorithms (e.g., SW or NW), gap systems (linear or affine), and input sequences.

Two main challenges face programmers. First, manual vectorization requires substantial coding effort in order to handle idiosyncratic vector instructions. For applications having complex data dependencies, expert knowledge of vector instruction sets and proficient skills to organize vector instructions are necessary to achieve the desired functionality. Moreover, current vector ISAs evolve quickly and some versions are not backwards-compatible [8]. Porting existing vectorized codes to other platforms becomes a laborious and tedious task. Second, even highly optimized vector codes may not achieve optimal performance at the application level. For pairwise sequence alignment, the combination of algorithms, vectorization strategies, configurations (gap penalty systems), and input sequences at runtime may lead to significant variability in performance. Furthermore, it increases the complexity of optimizing applications on modern multi- and many-core processors. Therefore, looking for a way to get around these obstacles is of great importance.

In this paper, we propose a framework called AAlign to automatically vectorize pairwise sequence alignment algorithms across ISAs. Our framework takes sequential algorithms, which need to follow our generalized paradigm for pairwise sequence alignment, as the input and generates vectorized computing kernels as the output by using the formalized vector code constructs and linking to the platform-specific vector primitives. Two vectorizing strategies are formalized as "striped-iterate" and "striped-scan" in our framework. In addition, we propose a hybrid approach that automatically switches between striped-iterate and striped-scan vectorization based on the context at runtime, which in turn provides better performance than past approaches.

The major contributions of our work are three-fold. First, we propose the AAlign framework, which can automatically generate parallel codes for pairwise sequence alignment via a combination of algorithms, vectorizing strategies, and configurations. Second, we show that the existing vectorizing strategies cannot always deliver optimal performance even when the codes are highly vectorized and optimized. As a result, we design a hybrid approach that takes advantage of two vectorizing strategies: iterate and scan. Third, using AAlign, we generate various parallel codes for the different combinations of algorithms (SW and NW), vectorizing strategies (striped-iterate, striped-scan, and hybrid), and configurations (linear and affine gap penalty systems) on two x86-based platforms, i.e., Advanced Vector eXtension (AVX2)-supported multicore and Initial Many Core Instruction (IMCI)-supported manycore.

We conduct a series of evaluations of the generated vector codes. Compared to the optimized sequential codes on the Haswell CPU, our codes using striped-scan vectorization can deliver 4.0-fold to 6.2-fold speedup, while using striped-iterate vectorization, our codes can provide 4.7-fold to 10.0-fold speedup. The vectorized codes also show performance benefit on Intel MIC and can achieve 9.1-fold to 16.0-fold speedup using striped-scan and 9.5-fold to 25.9-fold speedup using striped-iterate over the optimized sequential counterparts, respectively. We also compare our proposed hybrid approach with striped-iterate and striped-scan vectorization and show that our hybrid approach can achieve better performance on both platforms. Next, after wrapping our vector codes with multi-threading, we compare our hybrid approach to highly optimized sequence alignment tools, i.e., SWPS3 [4] on CPU and SWAPHI [5] on MIC. When aligning query sequences to an entire database, our codes achieve up to a 2.5-fold speedup over SWPS3 on CPU and 1.6-fold speedup over SWAPHI on MIC.

II. BACKGROUND

This section provides a brief overview of (1) the vector ISAs of x86-based processors for both CPU and MIC and (2) the pairwise sequence alignment algorithms.

A. Vector ISA

Modern x86-based processors (e.g., CPU and MIC) are equipped with vector processing units (VPUs). These VPUs can carry out a single operation over a set of data simultaneously. Alongside, the vector ISA provides an abundant set of instructions and continues to evolve and expand to provide additional functionality, e.g., Advanced Vector Extensions (AVX), Initial Many Core Instructions (IMCI), and the incoming AVX-512 [9]. Meanwhile, the vector width has also extended from 128 bits (4 floats) to 256 bits (8 floats) to the current 512 bits (16 floats), improving the throughput of systems and offering potential benefits for applications. In this paper, we focus on AVX2 and IMCI.

AVX2 Instructions: The width of AVX2 registers is 256 bits, consisting of two 128-bit lanes. The ISA is available since the Haswell architecture. AVX2 expands most vector integer SSE and AVX instructions to 256 bits and supports variable-length integers. In addition, AVX2 introduces gather, cross-lane permute, and per-element shift instructions.

IMCI Instructions: The width of IMCI registers is 512 bits in four 128-bit lanes. IMCI works on the Knights Corner MIC architecture. Although IMCI provides additional functionality, e.g., scatter, gather, reduce, etc., it does not support previous vector ISAs, e.g., SSE and AVX. Because IMCI does not support 8-bit or 16-bit integers, we only consider 32-bit integers on IMCI in this paper.

B. Pairwise Sequence Alignment Algorithms

Algorithms for pairwise sequence alignment aim to quantify the best-matching score between two input sequences of DNA, RNA, or proteins. Specifically, the alignment uses the edit distance to describe how to transform one sequence into another by using a minimum number of predefined operations, including insertion, deletion, and substitution, along with an associated penalty or award. A common technique leverages dynamic programming, which uses tabular computations as shown in Fig. 1. If the input sequences are query $Q_m$ with $m$ characters and subject $S_n$ with $n$ characters, we need an $m \times n$ table $T$, and every cell $T_{i,j}$ in the table stores the optimal score of matching the substrings $Q_i$ and $S_j$. To assist in the computation, we define three additional tables: $L_{i,j}$, $U_{i,j}$, $D_{i,j}$ denoting the optimal scores of matching the substrings $Q_i$ and $S_j$ but ending with the insertion, deletion, and substitution, respectively. We can derive:

$$T_{i,j} = \max(L_{i,j}, U_{i,j}, D_{i,j}) \tag{1}$$

Fig. 1 also shows the data dependencies. Visually, $L_{i,j}$, $U_{i,j}$, $D_{i,j}$ rely on the left, upper, and diagonal neighbors, respectively. Although the algorithm takes $O(m \times n)$ time and space complexity, we use a double-buffering technique, as shown by the two solid rectangles in the figure, to address the former while lowering the space complexity to $O(m)$, assuming the computation goes along $Q_m$.

There are two major classes of pairwise sequence alignment algorithms: local alignment and global alignment. For the former, the Smith-Waterman (SW) algorithm [1] can quantify the optimal match with respect to partial (local) regions. For the latter, the Needleman-Wunsch (NW) algorithm [2] can find the best-matching score over entire sequences. Both algorithms have multiple variants by using
linear or affine gap penalties. In Sec. IV, we show a generalized paradigm for pairwise sequence alignment algorithms.

[Figure 1: Data dependencies in the alignment algorithms using dynamic programming. Panel labels: (a) double-buffering, (b) horizontal dependency; diagonal and vertical dependencies are also marked. Axes: subject $S_n$ and query $Q_m$.]

III. CHALLENGES

Alg. 1 shows the sequential code of SW with an affine gap penalty. Though writing the sequential code is relatively simple, vectorizing such an algorithm is non-trivial due to the strong data dependencies among the neighbors shown in Fig. 1.

Algorithm 1: Sequential Smith-Waterman following the paradigm (Sec. IV)
    /* GAPOPEN and GAPEXT are constants; BLOSUM62 is a substitution matrix;
       ctoi is a user-defined function to map a given character to its index
       in the substitution matrix */
     1  for i ← 0; i < n+1; i++ do
     2      T_{i,0} = U_{i,0} = L_{i,0} = 0;
     3  for j ← 0; j < m+1; j++ do
     4      T_{0,j} = U_{0,j} = L_{0,j} = 0;
     5  for i ← 1; i < n+1; i++ do
     6      for j ← 1; j < m+1; j++ do
     7          L_{i,j} = max(L_{i−1,j} + GAPEXT, T_{i−1,j} + GAPOPEN);
     8          U_{i,j} = max(U_{i,j−1} + GAPEXT, T_{i,j−1} + GAPOPEN);
     9          D_{i,j} = T_{i−1,j−1} + BLOSUM62[ctoi(Q_{j−1})][ctoi(S_{i−1})];
    10          T_{i,j} = max(0, L_{i,j}, U_{i,j}, D_{i,j});
    11  // resultant score is the maximum value in T

In Sec. I, we introduced two vectorizing strategies to reconstruct the data dependencies. Here we describe the major differences between the two vectorizing strategies: (1) iterate [6], which partially ignores the vertical dependencies in Fig. 1, and processes the vertical cells simultaneously along the column¹ and (2) scan [7], which completely ignores the vertical dependencies at the beginning. The vertical cells can be processed in a SIMD way, giving us the preliminary results. After that, a parallel max-scan operation is conducted on the preliminary results, and the scan results are applied to correct the results in another round of computation. The fundamental difference in these two strategies is in the correction. While iterate may not need any correction or will finish the correction with one or several steps of re-computations, scan will always take two rounds of re-computations, i.e., the scan on all vertical cells and then a round of much lighter correction.

¹ This round of computations only ensures a portion of the results as correct, leading to potentially multiple rounds of corrections.

Comparing the vector codes in Alg. 2 and Alg. 3² to the sequential code in Alg. 1, we see that writing vector codes involves expert knowledge of the algorithms and the platform-specific ISAs, even though the detailed low-level intrinsics are hidden by our formalized codes. As a result, the first question we want to answer is whether we can automatically vectorize these types of applications with multiple combinations of parameters.

² Although we use our formalized codes as the examples, the hand-written vector codes presented in previous research, e.g., [6], are similar to ours.

[Figure 2: Example of comparing two vectorizing strategies under various conditions on MIC (the cases are from Sec. VI)]

On the other hand, the two strategies each have their own benefits. Fig. 2 shows our other motivation: because the algorithms, configurations, and input sequences at runtime can affect performance and because no one combination can always provide the best performance, the second question is whether we can design a mechanism to automatically select the favorable vectorization strategies at runtime.

IV. GENERALIZED PAIRWISE ALIGNMENT PARADIGM

Here we present our generalized paradigm for the pairwise sequence alignment algorithms with adjustable gap penalties. Any sequential codes following the paradigm can be processed by our framework to generate real vector codes.

$$
T_{i,j} = \max
\begin{cases}
0 \\
\max_{0 \le l < j} \left( T_{i,l} + \theta_{i,l} + \sum_{k=l+1}^{j} \beta_{i,k} \right) \\
\max_{0 \le l < i} \left( T_{l,j} + \theta'_{l,j} + \sum_{k=l+1}^{i} \beta'_{k,j} \right) \\
T_{i-1,j-1} + \gamma_{i,j}
\end{cases}
\tag{2}
$$

In the paradigm in Eq. (2), $T$ is the working-set table and $T_{i,j}$ stores the suboptimal score. The 0 is optional and used only in local alignment. $\theta_{i,l}$ ($\theta'_{l,j}$) is the gap penalty of initiating a gap at the position $l$ of $Q_m$ ($S_n$). $\beta_{i,k}$ ($\beta'_{k,j}$) is the gap penalty of continuing a gap at the position $k$ of $Q_m$ ($S_n$). $\gamma_{i,j}$ is the substitution score of matching base $j$ of $Q_m$ and base $i$ of $S_n$. In bioinformatics, the substitution scores are usually from a scoring matrix, such as BLOSUM62. Both $\theta_{i,l}$ ($\theta'_{l,j}$) and $\beta_{i,k}$ ($\beta'_{k,j}$) can be configured to be either constants or variables. By using dynamic programming, one can use three assistant symbols, i.e., $U_{i,j}$, $L_{i,j}$, $D_{i,j}$, to represent the influence from $T_{i,j}$'s upper, left, and diagonal neighbors. Therefore, the paradigm is equivalent to Eq. (3)-(6).

$$
T_{i,j} = \max
\begin{cases}
0 \\
U_{i,j} \\
L_{i,j} \\
D_{i,j}
\end{cases}
\tag{3}
$$

$$
U_{i,j} = \max
\begin{cases}
U_{i,j-1} + \beta_{i,j} \\
T_{i,j-1} + \theta_{i,j-1} + \beta_{i,j}
\end{cases}
\tag{4}
$$

$$
L_{i,j} = \max
\begin{cases}
L_{i-1,j} + \beta'_{i,j} \\
T_{i-1,j} + \theta'_{i-1,j} + \beta'_{i,j}
\end{cases}
\tag{5}
$$

$$
D_{i,j} = T_{i-1,j-1} + \gamma_{i,j} \tag{6}
$$

[Figure 4: The original and SIMD-friendly striped layouts. Original: i a b c d e A B C D E. Striped: i a A b B c C d D e E, held in vectors v1 to v5.]

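To make the recurrences in Eq. (3)-(6) concrete, the following is a minimal sequential sketch in Python. It is not the code AAlign consumes or generates. The function names `pairwise_score` and `match_score` are hypothetical, the gap penalties are assumed to be constants shared by both directions ($\theta = \theta'$, $\beta = \beta'$), and the boundary handling for global alignment is an assumption of this sketch.

```python
NEG_INF = float("-inf")


def pairwise_score(query, subject, gamma, theta, beta, local=True):
    """Score two sequences with the recurrences of Eq. (3)-(6).

    Assumptions of this sketch (not taken from the paper):
      * theta and beta are constants shared by both gap directions;
      * gamma is a callable gamma(q_char, s_char) -> int;
      * penalties are passed as non-positive numbers.
    """
    m, n = len(query), len(subject)

    def boundary(d):
        # Local alignment restarts from 0; global alignment pays for a leading gap.
        return 0 if (local or d == 0) else theta + d * beta

    # Row index i runs over the subject S, column index j over the query Q.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    U = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    L = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = boundary(i)
    for j in range(m + 1):
        T[0][j] = boundary(j)

    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            U[i][j] = max(U[i][j - 1] + beta, T[i][j - 1] + theta + beta)  # Eq. (4)
            L[i][j] = max(L[i - 1][j] + beta, T[i - 1][j] + theta + beta)  # Eq. (5)
            D = T[i - 1][j - 1] + gamma(query[j - 1], subject[i - 1])      # Eq. (6)
            t = max(U[i][j], L[i][j], D)                                   # Eq. (3)
            if local:
                t = max(0, t)  # the optional 0 term, local alignment only
                best = max(best, t)
            T[i][j] = t
    # Local: maximum over the table. Global: bottom-right cell.
    return best if local else T[n][m]


def match_score(a, b):
    # Toy substitution score standing in for a matrix such as BLOSUM62.
    return 2 if a == b else -1
```

With these toy scores, `pairwise_score("ACGT", "ACGT", match_score, theta=-2, beta=-1)` evaluates to 8 (four matches). Passing `theta=0` gives a linear gap system, and `local=False` drops the 0 term to give a global score.
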
Now, we can fit the real algorithms into the paradigm.

Smith-Waterman: Because it is a local alignment algorithm, we need to keep the 0 term. If we simply use the linear gap penalty, $\theta_{i,l}$ ($\theta'_{l,j}$) is set to 0 and $\beta_{i,k}$ ($\beta'_{k,j}$) is the gap penalty value. If we use the affine gap penalty, $\theta_{i,l}$ ($\theta'_{l,j}$) is the gap open penalty value and $\beta_{i,k}$ ($\beta'_{k,j}$) is the gap extension penalty value. If these parameters are variables, other gap penalty systems can be used.

Needleman-Wunsch: Because it is a global alignment algorithm, we don't need the 0. The configuration of the other parameters is similar to that of SW. Indeed, ln. 7 to ln. 10 in Alg. 1 follow the paradigm, with the necessary initialization codes in ln. 1 to ln. 4.

V. AALIGN FRAMEWORK

The AAlign framework adopts "striped-iterate" and "striped-scan" as the basic vectorization strategies. We make a few modifications to the original methods derived from [6] and [7] to fit our framework. Fig. 3 illustrates the overview of AAlign. The framework can accept any kind of sequential codes following our generalized paradigm in Sec. IV. After analyzing the Abstract Syntax Tree (AST) of the sequential code, AAlign can obtain the required information, such as the type of the given alignment algorithm and the selected gap penalty system. Then, AAlign will input the information to the "vec code constructs", which are formalized according to the aforementioned vectorizing strategies. Finally, the framework can generate real codes by using proper vector modules. These modules include primitive vector operations whose implementation is ISA-specific.

[Figure 3: High-level overview of the structure of AAlign framework. Labels in the figure: seq code, Clang, AST (seq code), traverse, identify, AST (vec code constructs), vec code constructs, ISA-specific vec modules, use hybrid method, build vec code, vec code.]

A. Vector Code Constructs

In this section, we will first describe the SIMD-friendly data layout used in AAlign. Based on it, we will present two vector code constructs containing the vector modules (Sec. V-C) and the configurable parameters (Sec. V-D).

Striped layout: AAlign always conducts the tabular computation along the query sequence $Q_m$. After loading the data from the same column in Fig. 1 to the buffer, AAlign transforms the data layout to the striped format, which is SIMD-friendly because the data dependencies among adjacent elements are eliminated. Fig. 4 shows the data layouts before and after the striped transformation. In the original buffer, we have 20 elements from the same column of the table, and each element depends on its preceding neighbor (the vertical direction in Fig. 1). If we load the elements directly into five vectors, the data dependencies will hinder efficient vector operations. By rearranging the buffer into the striped format, dependent elements are distributed to different vectors, so that the interaction happens among vectors rather than within vectors.

Algorithm 2: Vector code constructs for striped-iterate
    /* m is the aligned length of Q, n is the length of S, k is number of
       vectors in Q, equal to m/veclen. If the linear gap penalty system is
       taken, AAlign will ignore the asterisked statements */
     1  vec vTdia, vTleft, vTup, vT;
     2  vec vTmax = broadcast(INT_MIN);
     3  vec vGapTleft = broadcast(GAP_LEFT);
     4  vec vGapTup = broadcast(GAP_UP);
     5  *vec vL, vU;
     6  *vec vGapL = broadcast(GAP_LEFT_EXT);
     7  *vec vGapU = broadcast(GAP_UP_EXT);
     8  *vec vZero = broadcast(0);
     9  for i ← 0; i < n; i++ do
    10      vTdia = rshift_x_fill(arrT1 + (k − 1) ∗ veclen, 1, INIT_T);
    11      vTup = set_vector(m, INIT_T, GAP_UP);
    12      vTup = add_vector(vTup, vGapTup);
    13      *vU = set_vector(m, INIT_U, GAP_UP_EXT);
    14      *vU = add_vector(vU, vGapU);
    15      *vU = max_vector(vU, vTup);
    16      for j ← 0; j < k; j++ do
    17          vTdia = add_array(prof + ctoi(S_i) ∗ m + j ∗ veclen, vTdia);
    18          vTleft = add_array(arrT1 + j ∗ veclen, vGapTleft);
    19          *vL = add_array(arrL + j ∗ veclen, vGapL);
    20          *vL = max_vector(vL, vTleft);
    21          *store_vector(arrL + j ∗ veclen, vL);
    22          vT = max_vector(vTdia, MAX_OPRD);
    23          store_vector(arrT2 + j ∗ veclen, vT);
    24          vTmax = max_vector(vTmax, vT);
    25          vTdia = load_vector(arrT1 + j ∗ veclen);
    26          vTup = vT;
    27          vTup = add_vector(vTup, vGapTup);
    28          *vU = add_vector(vU, vGapU);
    29          *vU = max_vector(vTup, vU);
    30      REC_UP = rshift_x_fill(REC_UP, 1, REC_FILL);
    31      int j = 0;
    32      vT = load_vector(arrT2 + j ∗ veclen);
    33      while influence_test(REC_UP, REC_CRT) do
    34          vT = max_vector(vT, REC_UP);
    35          store_vector(arrT2 + j ∗ veclen, vT);
    36          vTmax = max_vector(vTmax, vT);
    37          REC_UP = add_vector(REC_UP, REC_UP_GAP);
    38          if ++j >= k then
    39              REC_UP = rshift_x_fill(REC_UP, 1, REC_FILL);
    40              j = 0;
    41          vT = load_vector(arrT2 + j ∗ veclen);
    42      swap(arrT1, arrT2);
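
The striped layout described in this section can be illustrated with a short Python sketch. The helper names `to_striped` and `from_striped` are hypothetical, plain lists stand in for vector registers, and the buffer length is assumed to be a multiple of `veclen` (padding is left out). The index mapping follows the usual striped scheme, in which lane `l` of vector `j` holds original element `l * k + j`, with `k` the number of vectors.

```python
def to_striped(buf, veclen):
    """Rearrange a column buffer into the striped layout.

    Assumes len(buf) is a multiple of veclen; padding is not handled here.
    """
    k = len(buf) // veclen  # number of vectors
    if k * veclen != len(buf):
        raise ValueError("buffer length must be a multiple of veclen")
    # Vector j, lane l holds original element l*k + j, so neighbouring
    # (dependent) elements end up in different vectors.
    return [buf[l * k + j] for j in range(k) for l in range(veclen)]


def from_striped(striped, veclen):
    """Inverse of to_striped: restore the original element order."""
    k = len(striped) // veclen
    out = [None] * len(striped)
    for j in range(k):
        for l in range(veclen):
            out[l * k + j] = striped[j * veclen + l]
    return out
```

For a 20-element buffer `0..19` with `veclen = 4`, the first vector is `[0, 5, 10, 15]` and the second is `[1, 6, 11, 16]`. Elements 0 and 1, which depend on each other in the original order, now sit in the same lane of two consecutive vectors.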

Striped-iterate: This vectorizing strategy is based on [6]. The modified vector code constructs are shown in Alg. 2. We use two m-element buffers arrT1 and arrT2 to store the best-matching scores. Additionally, an m-element buffer arrL stores the scores denoting best-matching with an ending gap in Q. The scores denoting best-matching with an ending gap in S are stored in the vector register vTup, or vU if the affine gap penalty system is taken. In this strategy, we first partially ignore the data dependencies within the buffer (along Q) and use the predefined vectors (ln. 11 and ln. 13) to set lower bounds. In the predefined vectors (vTup or vU), only the first elements come from the real initialization expressions (INIT_T and INIT_U), while the other elements are derived from them and the corresponding gap penalties (GAP_UP and GAP_UP_EXT). As a result, the first round of preliminary computations (ln. 16 to ln. 29) only ensures that the first elements in each vector are correct (a-e cells in Fig. 4).

We need to correct the results if the updated predefined vectors affect the results (ln. 33). The re-computations of correction (ln. 34 to ln. 41) will take at most veclen − 1 times to ensure all the other elements in the vectors are correct. After that, we continue the for loop (ln. 9) to process the next character in S, which corresponds to another column in Fig. 1.

Algorithm 3: Vector code constructs for striped-scan
    // m is the aligned length of Q, n is the length of S, k is number of
    // vectors in Q, equal to m/veclen. If the linear gap penalty system is
    // taken, AAlign will ignore the asterisked statements
     1  vec vTdia, vTleft, vTup, vT;
     2  vec vTmax = broadcast(INT_MIN);
     3  vec vGapTleft = broadcast(GAP_LEFT);
     4  *vec vL;
     5  *vec vGapL = broadcast(GAP_LEFT_EXT);
     6  *vec vZero = broadcast(0);
     7  for i ← 0; i < n; i++ do
     8      vTdia = rshift_x_fill(arrT1 + (k − 1) ∗ veclen, 1, INIT_T);
     9      for j ← 0; j < k; j++ do
    10          vTdia = add_array(prof + ctoi(S_i) ∗ m + j ∗ veclen, vTdia);
    11          vTleft = add_array(arrT1 + j ∗ veclen, vGapTleft);
    12          *vL = add_array(arrL + j ∗ veclen, vGapL);
    13          *vL = max_vector(vL, vTleft);
    14          *store_vector(arrL + j ∗ veclen, vL);
    15          vT = max_vector(vTdia, MAX_OPRD);
    16          store_vector(arrT2 + j ∗ veclen, vT);
    17          vTdia = load_vector(arrT1 + j ∗ veclen);
    18      wgt_max_scan(arrT2, arrScan, m, INIT_T, GAP_UP_EXT, GAP_UP);
    19      for j ← 0; j < k; j++ do
    20          vTup = load_vector(arrScan + j ∗ veclen);
    21          vT = load_vector(arrT2 + j ∗ veclen);
    22          vT = max_vector(vT, vTup);
    23          vTmax = max_vector(vTmax, vT);
    24          store_vector(arrT2 + j ∗ veclen, vT);
    25      swap(arrT1, arrT2);

Striped-scan: The scan strategy in AAlign is based on the GPU method [7]. We modify it by using the striped format on x86-based platforms, as shown in Alg. 3. Similar to striped-iterate, we define three m-element buffers arrT1, arrT2, and arrL. In addition, an extra buffer arrScan is used to store the scan results. In this strategy, we first completely ignore the data dependencies within the buffer (along Q) to do the tentative computation (ln. 9 to ln. 17). Unlike striped-iterate, we conduct a "weighted" scan over the tentative results arrT2 and store the scan results to arrScan (ln. 18). Finally, we use the values in arrScan to correct the results (ln. 19 to ln. 24). After that, we continue to process the next character in S (ln. 7).

B. Hybrid Method

As we discussed in Sec. III, no one combination of the algorithms (SW or NW), vectorizing strategies (iterate or scan), and gap penalty systems (linear or affine) can always provide optimal performance for different pairs of input sequences. Before we provide a better solution, we investigate under what circumstances a specific combination can win. We test various query sequences, whose lengths range from 100 to 36k characters. We fix the algorithm to SW and the gap penalty system to the affine gap, and change the vectorizing strategies. We find that the striped-scan strategy performs better when the number of re-computations in striped-iterate is around 1.5 times more on MIC, and 2.5 times on Haswell. (For other combinations of algorithms and gap systems, the ratios are similar due to the similar computational pattern and workloads.) Generally, if the best-matching score before the re-computations is high, meaning that the two input sequences may be close to each other, striped-iterate has to carefully and iteratively check each position with more re-computation steps in order to eliminate the false negatives; while in striped-scan, no matter what the matching scores are, a fixed number of re-computations is needed. Paradoxically, we cannot rely on this observation to determine which strategy should be taken, because unless we finish the alignment algorithm and get the real matching scores, we don't know how similar or dissimilar the input pair of sequences is, or even a specific range of pairs.

In this paper, we propose an input-agnostic hybrid method that can automatically select the efficient vectorizing strategy at runtime. Our hybrid method starts from the striped-iterate strategy, in which we maintain a counter to record the number of re-computations. When the counter exceeds the configured threshold, the method will switch to striped-scan. For example, based on the experiments for the combination of SW with the affine gap presented in the previous paragraph, we set the threshold to be 2 for MIC and 3 for Haswell CPU. However, switching back from striped-scan to striped-iterate is nontrivial, because we don't know the amount of re-computations for striped-iterate when the algorithm is working in the striped-scan mode. Alternatively, we design a solution to "probe" the re-computation overhead at a configurable interval stride. That way, after processing stride characters in the subject sequence using striped-scan, we tentatively switch back to striped-iterate and rely on the counter to determine the next switch. Once
the counter is above the threshold, we switch back again to striped-scan for another round of processing stride characters. Otherwise, our method will stay in the striped-iterate mode and continue checking the counter.

Fig. 5 shows an example of the hybrid method. If we only rely on the striped-iterate method, the re-computations in the middle part of the subject sequence will kill the performance due to the overhead of re-computations. In contrast, if we only use striped-scan, the benefits of the head and tail parts in striped-iterate will be wasted. Our hybrid method uses the counter to detect that the amount of re-computations is above the threshold around the 800th character, and thus switches to the striped-scan method. Then, it will probe the counter periodically by going back to the iterate method until the counter drops below the threshold or the end of the sequence S is reached.

[Figure 5: The mechanism of the hybrid method]

One may wonder why the hybrid mechanism starts from striped-iterate, conservatively switches from striped-iterate to striped-scan only when the counter exceeds the threshold, and aggressively switches back by using the proactive probe. The reason is related to the characteristics of sequence search: although sequence alignment is designed to find similar sequences in databases for the input query, it cannot identify too many similar sequences because statistically most of the sequences in databases are dissimilar to a specific input. Even if a sequence is determined to be similar to the input, the exactly matching regions are few. Considering the much faster convergence speed of striped-iterate for dissimilar pairs, we prefer it, and conservatively switch to striped-scan only when we find that the current aligned regions are highly matched.

are wrapper functions of the directly-mapped ISA intrinsics. In contrast, the second group of modules carries out application-specific operations, customized to our formalized vector code constructs.

Table I: The vector modules in AAlign

| Module Name | Description |
|---|---|
| Basic Vector Operation API | |
| load_vector(void *ad); store_vector(void *ad, vec v); | Load/store a vector from/to the memory address ad, which can be char*, short*, or int* (the same below) |
| add_vector(vec va, vec v); add_array(void *ad, vec v); | Add a vector of va or from the memory address ad by vector v |
| max_vector(vec v1, ...); | Take any count of input vectors, and return the vector with largest integers in each aligned position |
| App-specific Vector Operation API | |
| set_vector(int m, int i, int g); | Init a new vector, in which i is the default T_{i,j} or F_{i,j} value when j=0, g is their corresponding gap β_{i,j} or θ_{i,j} |
| rshift_x_fill(vec v, int n, ...); rshift_x_fill(void *ad, int n, ...); | Right shift the vector of v or loaded from ad by n of positions and fill the gaps with specified values |
| influence_test(vec va, vec vb); | Check if vector va can affect the values in vb |
| wgt_max_scan(void *in, void *out, int m, int i, int g, int G); | "Weighted" max-scan over the values in in of the striped format, store the results to out. i is the default T_{i,j} value when j=0; g, G are the corresponding β_{i,j}, θ_{i,j} |

set_vector: sets the lower-bound vector in the striped-iterate strategy. Fig. 6 shows that AAlign will set the first value of the lower-bound vector to be the initial value i. Then, the lower-bound values of the rest are set to be i + l ∗ k ∗ g, where l is the element's index, k is the total number of vectors, and g is the associated gap penalty. The implementation of the module uses the proper _mm256/512_set intrinsics.

[Figure 6: Vector modules used in the striped-iterate. Annotations in the figure: set_vector(20, i, g) for the lower-bound vector; rshift_x_fill(v5, 1, x); "1st round of computation can only ensure the 1st column is correct"; influence_test(v0 - vGapβ, v1 - vGapθ); "2nd round of re-computation might be avoided depending on the influence_test result".]

rshift_x_fill: right-shifts the vector elements with the value x filled. AAlign uses this module to adjust the data dependencies between vectors. As shown in Fig. 6, the 1st round of computation can ensure the values in the first
C. Vector Modules column (a-e cells) are correct, since they are calculated based
We have already seen the usage of the vector modules on the real initial value i. Therefore, the test of the need
in Alg. 2 and Alg. 3. These vector modules are designed to for correction is required. Before that, we observe that in
express the required primitive vector operations in our vector the 2nd round, the current “true” value e would affect A
code constructs and hide the ISA-specific vector instruction according to the original layout in Fig. 4. As a result, we
details. Therefore, when the platform changes, AAlign only shift the vector v5 to right by 1 position and fill the gap
needs to re-link the vector code to the proper set of vector using a small enough number x to make sure there is no
modules. Tab. I defines the vector primitive modules. The influence caused by it.
first group of modules are designed to conduct basic vector The implementation is essentially a combination of data-
operations over given arrays or vectors. Specifically, they reordering operations. However, the selection of instructions
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, USA, May 2016.

is quite different because of different ISAs and desired data types. Fig. 7 shows how to achieve the same functionality with different intrinsics. Because the shortest integer data type supported by IMCI is 32-bit, we only show IMCI with 32-bit int, which uses a combination of the cross 128-bit lane permutevar and swizzle intrinsics. In contrast, on AVX2 with 32-bit int we directly insert the value x after the permutevar completes. If we work on 16-bit values, there is no equivalent permutevar intrinsic, so we use the shufflehi/lo, permutevar8x32 and blend intrinsics for this functionality, followed by the insert.

Fig. 8 (annotations recoverable from the figure), for wgt_max_scan(arrT, arrScan, i, β, θ):
Original: i a b c d e f g h; Striped: v1 = (a c e g), v2 = (b d f h)
1. Inter-vector weighted scan: u1 = v1 + vGapθ + vGapβ; u2 = max(v2 + vGapθ + vGapβ, u1 + vGapβ)
2. Intra-vector weighted scan: s1 = i + θ + β; s2 = max(u2_1, s1 + 2β); s3 = max(u2_2, s2 + 2β); s4 = max(u2_3, s3 + 2β)
3. Inter-vector weighted broadcast: r1 = s; r2 = u1 + (s + vGapβ)
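The lane movement that Fig. 7 describes for rshift_x_fill can be modelled without any intrinsics. The following is a minimal scalar sketch in Python, not the AAlign implementation: a plain list stands in for a vector register, and the names `rshift_x_fill_model` and `NEG_FILL` are hypothetical, introduced only for illustration. It first rotates the lanes to the right (the role of the permute step) and then overwrites the wrapped-around lanes with x (the role of the insert or masked swizzle step).

```python
# Scalar model of rshift_x_fill; a Python list stands in for a SIMD register.
# NEG_FILL is an assumed "small enough" value, not a constant from the paper.
NEG_FILL = -10**9


def rshift_x_fill_model(v, n, x=NEG_FILL):
    """Shift the lanes of v right by n positions and fill the gap with x."""
    lanes = list(v)
    if not 0 <= n <= len(lanes):
        raise ValueError("n must be between 0 and the number of lanes")
    # Step 1 (permute-like): rotate right, so the last n lanes wrap to the front.
    split = len(lanes) - n
    rotated = lanes[split:] + lanes[:split]
    # Step 2 (insert-like): overwrite the wrapped lanes with the fill value.
    return [x] * n + rotated[n:]


if __name__ == "__main__":
    # Eight lanes shifted by one position, as in the AVX2 32-bit row of Fig. 7.
    print(rshift_x_fill_model([1, 2, 3, 4, 5, 6, 7, 8], 1, "x"))
```

Because x is chosen small enough, the filled lane can never win a later max operation, so the shifted vector carries the dependency from the previous column without introducing a spurious one.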

rshift_x_fill (IMCI 32-bit int)
  input                         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  __m512_permutevar_epi32       16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  __m512_set1_epi32             x x x x x x x x x x x x x x x x
  __m512_mask_swizzle_epi32     x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
rshift_x_fill (AVX2 32-bit int)
  input                         1 2 3 4 5 6 7 8
  __m256_permutevar_epi32       8 1 2 3 4 5 6 7
  __m256_insert_epi32           x 1 2 3 4 5 6 7
rshift_x_fill (AVX2 16-bit int)
  input                         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  __m256_shufflehi/lo_epi16     4 1 2 3 8 5 6 7 12 9 10 11 16 13 14 15
  __m256_permutevar8x32_epi16   16 13 14 15 4 1 2 3 8 5 6 7 12 9 10 11
  __m256_blend_epi16            16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  __m256_insert_epi16           x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Figure 7: Example of chosen ISA intrinsics for rshift_x_fill (only blend operations are shown with arrows)

influence_test checks if an extra re-computation of correction is necessary in the striped-iterate method. Specifically, the module is a vector comparison. Comparison results containing 1s mean the 1st operand will affect the 2nd one. In IMCI, the results are stored in a 16-bit mask, and we simply check whether this value is larger than 0. However, in AVX2, the "mask" is stored in a 256-bit vector, and there is no single instruction to peek at how many bits are set inside. Our solution is to split the vector into two 128-bit SSE vectors and use the intrinsic _mm_test_all_zeros to detect if there are set bits.
wgt_max_scan implements the "weighted" scan along the buffer holding the tentative results (denoted as \tilde{T}_{i,j} and stored in arrT1 from ln. 18 of Alg. 3). Mathematically, we perform the calculation of $\max_{0 \le l < j} \bigl(\tilde{T}_{i,l} + \theta_{i,l} + \sum_{k=l+1}^{j} \beta_{i,k}\bigr)$. For simplicity, let's suppose θ_{i,l} and β_{i,k} are two constants θ and β, and only use 8 characters as the example for the striped sequence shown in Fig. 4. In Fig. 8, we use three steps to achieve the wgt_max_scan. First, we conduct a preliminary round of inter-vector weighted scan on v1 and v2 with initial weight θ + β and extensive weight β. The results are stored in the intermediate vectors u1 and u2. Second, an intra-vector and exclusive weighted scan is performed on vector u2 with the weight of k * β, where k is the total number of vectors. The results are stored in s. Third, the last round of inter-vector and exclusive weighted broadcast is performed on s, u1 and u2 with the weight of β. The final scan results are stored in arrT1.

Figure 8: Orchestration mechanism in the wgt_max_scan (Maximum operations are applied on each cell)

D. Code Translation
The AAlign framework takes the sequential codes following our generalized paradigm as the input. After the analysis of the codes, the framework will decide how to modify the vector code constructs. We make use of the Clang driver [10] to create the Abstract Syntax Tree (AST) for both the sequential codes and the vector code constructs, shown in Fig. 3. To traverse the trees, we build our Matcher and Visitor classes in Clang's AST Consumer class. Once the information from the AST nodes of interest is identified and retrieved, we use our Rewriter class to modify the AST of the vector code constructs with the information and its derivative results. Note that the present framework only supports constant gap penalties (e.g., β_{i,k}, θ_{i,l}). We leave it to future work to support variable penalties used in, for example, the dynamic time warping (DTW) algorithm.
Tab. II shows the configurable expressions in Alg. 2 and Alg. 3. The information can be retrieved from the sequential codes in four groups: 1. Identify which type of the pairwise alignment algorithm is used: local or global. This can be done by checking if there is a constant 0 set to T or not. 2. Identify what kind of gap penalty system is used. We can check if θ is set to 0 or not (rows 1-4 in Tab. II). 3. Learn how to initialize the boundary values (rows 5, 6). 4. Derive other information on how to organize the vectors (rows 7-11). After the vector code constructs have been rewritten, we use the hybrid method to generate our pairwise sequence alignment kernels.
E. Multi-threaded Version
The AAlign framework can also utilize the thread-level parallelism of the multi- and many-cores to align a given query with all subject sequences in a database. We first assign the generated kernel to each thread, and a thread will get a subject sequence from the database to conduct the alignment until all subject sequences are aligned. After
we sort the database by the subject sequence length, this dynamic binding mechanism is extremely efficient because of the load balance among threads. For the implementation, we don't need to create the profile array of the substitution matrix for the query every time (prof in ln. 17 of Alg. 2 or ln. 10 of Alg. 3). Therefore, the only change to the kernel is to extract the part that builds the profile array and perform it once before launching multiple threads.

Table II: Configurable expressions in vector code constructs
Expression      Description & Format                                                            Example*
GAP_LEFT        Gap penalty from the left T cell (i.e. θ'+β'); constants or variables          GAPOPEN (ln.7)
GAP_UP          Gap penalty from the upper T cell (i.e. θ+β); constants or variables           GAPOPEN (ln.8)
GAP_LEFT_EXT    Gap penalty from the left L cell (i.e. β'); constants or variables             GAPEXT (ln.7)
GAP_UP_EXT      Gap penalty from the upper U cell (i.e. β); constants or variables             GAPEXT (ln.8)
INIT_T          Upper boundary value of T cell; func(i)                                         0 (ln.2)
INIT_U          Upper boundary value of U cell; func(i)                                         0 (ln.2)
MAX_OPRD        Operands required by the max operation; vec variables                           vU, vL, vZero
REC_FILL        Value to fill the right shifted gap; constant                                   GAPOPEN (ln.8)
REC_UP          Operand for checking the re-computation; vec variable                           vU
REC_UP_GAP      Gap operand for REC_UP; vec variable                                            vGapU
REC_CRIT        Criterion for checking re-computation; vec variable                             vGapTup - vGapU
*: The examples are fetched or derived from Alg. 1

VI. EVALUATION
In this section, we evaluate the AAlign-generated pairwise sequence alignment codes on Haswell CPU and Knights Corner MIC. For Haswell, we use 2 sockets of E5-2680 v3, which contain 24 cores in total running at 2.5 GHz with 128 GB DDR3 memory. Each core has 32 KB L1, 256 KB L2, and shares 30 MB L3 cache. For MIC, we use the Intel Xeon Phi 5110P coprocessor in the native mode. The coprocessor consists of 60 cores running at 1.05 GHz with 8 GB GDDR5 memory, and each core includes 32 KB L1 and 512 KB L2 cache. We use icpc in Intel compiler 15.3 with the -O3 option to compile the codes. To specify the desired vector ISA, we also include -xCORE-AVX2 for CPU and -mmic for MIC. All the sequences are from the NCBI-protein database [11]. The number of characters is integrated into the query name.
Our objectives include the following: (1) compare AAlign-generated codes with the optimized sequential codes; (2) compare our proposed hybrid method with the iterate and scan methods, respectively; and (3) compare multi-threaded versions of AAlign-generated codes with existing state-of-the-art tools.
A. Speedups from Our Framework
We first compare the AAlign-generated codes (32-bit int) with the sequential codes (32-bit int) to evaluate the vectorization efficiency. The subject sequence is a Q282. The sequential codes follow the same logic as the vector codes. We also add "#pragma vector always" in the inner loop of the codes. The speedups, shown in Fig. 9, are the performance benefits brought by AAlign using striped-iterate and striped-scan respectively. By using the striped-scan, the SW and NW can achieve an average of 4.8 and 13.6-fold speedups over the sequential codes on CPU and MIC respectively. In contrast, the speedups of the striped-iterate SW and NW vary in a wider range of 4.7 to 10-fold on CPU and 9.5 to 25.9-fold on MIC. The superlinear speedups of the striped-iterate are mainly because the striped-iterate avoids a considerable amount of computation along the Q if the influence_test fails.
We can see that the performance variance of the striped-scan is smaller than that of the striped-iterate. For example, though the SW approximates the NW in terms of computational workloads, the performance of the striped-iterate SW-affine (Fig. 9c) and NW-affine (Fig. 9d) changes a lot, while the striped-scan stays relatively consistent. Actually, the performance difference of the two methods depends on the processed numerical values, which are affected by the algorithms, gap systems, and input sequences.

Figure 9: AAlign codes vs. Baseline sequential codes. The baselines are different and they are optimized to follow the similar logic with the corresponding AAlign codes. Panels: (a) SW (CPU), (b) NW (CPU), (c) SW (MIC), (d) NW (MIC).

B. Performance for Pairwise Alignment
In the preceding section, we observe that the algorithm and gap penalty system will affect the choice of the better vectorizing strategy. This section changes the input sequences. We first borrow the concepts of query coverage (QC) and max identity (MI) [12] from the bioinformatics community to describe the similarity of the input sequences. QC means the percent of query sequence Q overlapping the subject S, while MI is the percentage of the similarity between Q and S over the length of the overlapped
area. Additionally, we define three ranges of hi (>70%), md (30%-70%), and lo (<30%). That way, we have nine combinations of QC_MI to represent the similarity and dissimilarity of two input sequences. For example, lo_hi means only a small portion of two sequences overlaps each other, but the overlapped areas are highly similar. In the experiment, we use Q2000 against the "nr" database using NCBI-BLAST [12] and pick out nine typical subjects for the aforementioned criteria.

Figure 10: AAlign codes using striped-iterate, striped-scan, and hybrid method. The x-axis represents the similarity of the two sequences using the format of QC_MI, in which the query coverage (QC) and max identity (MI) metrics are in three levels: high (>70%), medium (70%-30%), and low (<30%). Panels: (a) SW w/ linear gap (CPU), (b) SW w/ affine gap (CPU), (c) NW w/ linear gap (CPU), (d) NW w/ affine gap (CPU), (e) SW w/ linear gap (MIC), (f) SW w/ affine gap (MIC), (g) NW w/ linear gap (MIC), (h) NW w/ affine gap (MIC).

Fig. 10 shows the performance of AAlign using different vectorizing strategies, including striped-iterate, striped-scan, and hybrid, on CPU and MIC. For the alignment algorithms with linear gap penalty, the striped-iterate method always outperforms the striped-scan, because the zero θ causes the number of re-computations to fall to a very small number. The results also show that with the linear gap penalty, our hybrid method will fall back to the striped-iterate and has very similar performance to it. For the algorithms with affine gap penalty, the striped-scan is better than the striped-iterate when two sequences have high or medium scores of QC and MI, meaning that the input sequences are very similar. For example, for the sequences labeled as hi_hi, hi_md, md_hi, md_md in Fig. 10b, 10d, 10f, and 10h, striped-scan is the better solution, thanks to its fixed rounds of re-computation. In the cases of the NW with the affine gap, the striped-scan can deliver up to 3.5-fold speedup on MIC and up to 1.9-fold speedup on CPU over the striped-iterate. For other inputs (dissimilar input sequences), the striped-iterate is better. Because the hybrid method can automatically switch to the better solution, in most test cases the hybrid method has better performance than either the striped-iterate or the striped-scan method. In the corner cases, the hybrid method approximates the better solution instead of the worse one.
C. Performance for Multi-threaded Codes
In this section, we compare AAlign's multi-threaded SW with affine gap penalty system with the tools SWPS3 and SWAPHI. The database is "swiss-prot", containing more than 570k sequences [13]. SWPS3 [4] uses a modified version of the striped-iterate method working on CPUs. The buffers of the table T are of char and short data types. SWAPHI [5] supports both inter-sequence and intra-sequence vectorization in the multi-threaded mode on MIC. In the experiment, we only focus on their intra-sequence method of int data type. Correspondingly, we use our generated kernel of short and int data type on CPU and MIC respectively.

Figure 11: AAlign Smith-Waterman w/ affine gap vs. existing highly-optimized tools in billion cell updates per second (GCUPS). Panels: (a) vs. SWPS3 on CPU, (b) vs. SWAPHI on MIC.

Fig. 11 presents the results of the AAlign SW algorithms compared with the two highly-optimized tools. On the CPU, the generated AAlign codes can outperform SWPS3 by up to 2.5 times, especially for the short query sequences. However, in Fig. 11a, for the long sequence Q4000, SWPS3 is better. This is mainly because rather than working entirely
on the short data type (16 bits), SWPS3 also uses the char-type (8 bits) buffers. Only when overflow occurs does the tool switch to the short. This is especially beneficial for long query sequences by lowering the cache pressure. For the MIC, we can outperform SWAPHI by an average of 1.6 times, thanks to our hybrid method and the efficient vector modules.
VII. RELATED WORK
To fully utilize the computing power of modern accelerators, it is crucial to utilize the SIMD units within. However, low programmability is still an obstacle facing non-expert programmers. Though some applications can naturally enjoy the benefits brought by the compiler auto-vectorization techniques [14], there are still many applications not belonging to this category. As a result, programmers have to smartly design and hand-code the SIMD codes. Chhugani et al. [15] propose a fast SIMD sorting algorithm using CPU vector intrinsics. The works of [5], [6], [7] operate on Smith-Waterman by manually writing compiler intrinsics and GPU kernel codes. Heinecke et al. [16] optimize the Linpack Benchmark by using assembly codes on MIC. Unfortunately, explicitly writing vector codes is still neither productive nor portable.
Some compiler-based solutions are proposed to ease the situation. The polyhedral compiler [17] uses a set of loop transformations, optimizations and vectorization to generate efficient codes. ISPC [18] provides SIMD-friendly data structures and function APIs. However, these solutions still require expert knowledge of vectorization and applications.
Other research works focus on specialized vectorization patterns and code generation. Ren et al. [19] present a set of novel code transformations to facilitate vectorization of recursive programs. PeerWave [20] explores the wavefront parallelism on GPUs, including intra-tile parallelism on SIMD units. Ren et al. [21] propose a code generation and optimization engine targeted at using SIMD resources for irregular data-traversal applications. ASPaS [22] is designed to generate optimized and efficient vector codes for the sorts. Compared to the existing work, the distinctive aspect of our work is to automatically generate the vector codes based on different vectorizing strategies. Our solution is able to switch among these strategies at runtime, no matter the selected algorithms, configurations, and inputs. In addition, our codes are portable among different x86-based systems.
VIII. CONCLUSION
The AAlign framework can generate the vector codes based on "striped-iterate" and "striped-scan". Moreover, we design an input-agnostic hybrid method, which can take advantage of both vectorization strategies. The generated codes will be linked to a set of platform-specific vector modules. To do this, AAlign only needs the input sequential codes following our generalized paradigm. The results show that the vector codes can deliver considerable performance gains over the sequential counterparts by utilizing the data-level parallelism and decreasing the amount of computation. We also demonstrate that our hybrid method is able to automatically switch to the better vectorization strategy at runtime. Finally, compared to the existing highly-optimized multi-threaded tools, the multi-threaded AAlign codes can also achieve competitive performance.
ACKNOWLEDGEMENT
This research was supported in part by NSF-BIGDATA program via IIS-1247693.
REFERENCES
[1] T. F. Smith and M. S. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 1981.
[2] S. B. Needleman and C. D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, 1970.
[3] P. Rice, I. Longden, A. Bleasby et al., "EMBOSS: the European Molecular Biology Open Software Suite."
[4] A. Szalkowski, C. Ledergerber, P. Krähenbühl, and C. Dessimoz, "SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2," BMC Res Notes, 2008.
[5] Y. Liu and B. Schmidt, "SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors," in the Int'l Conf. on Application-specific Systems, Architectures and Processors (ASAP), 2014.
[6] M. Farrar, "Striped Smith-Waterman Speeds Database Searches Six Times over other SIMD Implementations," Bioinformatics, 2007.
[7] A. Khajeh-Saeed, S. Poole, and J. B. Perot, "Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors," Journal of Computational Physics, 2010.
[8] R. Rahman, Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, 1st ed. Apress, 2013.
[9] Intel. Intel Architecture Instruction Set Extensions Programming Reference. Document ID: 319433-023.
[10] Clang: a C Language Family Frontend for LLVM. http://clang.llvm.org/.
[11] NCBI-protein. http://blast.ncbi.nlm.nih.gov/protein.
[12] BLAST. http://blast.ncbi.nlm.nih.gov/Blast.cgi.
[13] UniProt: Universal Protein Resource. http://www.uniprot.org/.
[14] K. Hou, H. Wang, and W.-c. Feng, "Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study," in The 43rd IEEE Int'l Conf. on Parallel Processing Workshops (ICCPW), 2014.
[15] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey, "Efficient Implementation of Sorting on Multi-core SIMD CPU Architecture," Proc. of the VLDB Endowment (PVLDB), 2008.
[16] A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov, G. Henry, A. G. Shet, G. Chrysos, and P. Dubey, "Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems based on Intel Xeon Phi Coprocessor," in the IEEE Int'l Symp. on Parallel & Distributed Processing (IPDPS), 2013.
[17] M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan, "When Polyhedral Transformations Meet SIMD Code Generation," in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013.
[18] M. Pharr and W. Mark, "ispc: A SPMD compiler for high-performance CPU programming," in Innovative Parallel Computing (InPar), 2012.
[19] B. Ren, Y. Jo, S. Krishnamoorthy, K. Agrawal, and M. Kulkarni, "Efficient Execution of Recursive Programs on Commodity Vector Hardware," in Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 2015.
[20] M. E. Belviranli, P. Deng, L. N. Bhuyan, R. Gupta, and Q. Zhu, "PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization," in Proc. of the ACM on Int'l Conf. on Supercomputing (ICS), 2015.
[21] B. Ren, T. Mytkowicz, and G. Agrawal, "A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures," ACM Trans. Archit. Code Optim. (TACO), 2014.
[22] K. Hou, H. Wang, and W.-c. Feng, "ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors," in Proc. of the ACM Int'l Conf. on Supercomputing (ICS), 2015.
