
3 Bit-parallel string matching

This exposition has been developed by C. Gröpl, G. Klau, D. Weese, and K. Reinert. It is based on the following
sources, which are all recommended reading:

1. G. Myers (1999) A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,
Journal of the ACM, 46(3): 395-415

2. Heikki Hyyrö (2001) Explaining and Extending the Bit-parallel Approximate String Matching Algorithm of
Myers, Technical report, University of Tampere, Finland.
3. Heikki Hyyrö (2003) A bit-vector algorithm for computing Levenshtein and Damerau edit distances, Nordic
Journal of Computing, 10: 29-39

3.1 Bit vector based approximate string matching


In the following we will focus on pairwise string alignments minimizing the edit distance. The algorithms
covered are:

1. The classical dynamic programming algorithm, discovered and rediscovered many times since the 1960's.
It computes a DP table indexed by the positions of the text and the pattern.
2. This can be improved using an idea due to Ukkonen if we are interested in hits of a pattern in a text with
bounded edit distance (which is usually the case).

3. The latter can be further sped up using bit-parallelism, by an algorithm due to Myers. The key idea is
to represent the differences between the entries of the DP matrix instead of their absolute values.

3.2 The classical algorithm


We want to find all occurrences of a query P = p1 p2 . . . pm that occur with at most k ≥ 0 differences (substitutions
and indels) in a text T = t1 t2 . . . tn .
The classic approach computes in time O(mn) an (m + 1) × (n + 1) dynamic programming matrix C[0..m, 0..n]
using the recurrence

C[i, j] = min { C[i − 1, j − 1] + δij , C[i − 1, j] + 1 , C[i, j − 1] + 1 } ,

where δij = 0 if pi = tj , and δij = 1 otherwise.
Each location j in the last (m-th) row with C[m, j] ≤ k is a solution to our query.
The matrix C is initialized:

• at the upper boundary by C[0, j] = 0, since an occurrence can start anywhere in the text, and
• at the left boundary by C[i, 0] = i, according to the edit distance.

Example. We want to find all occurrences with at most 2 differences of the query annual in the text
annealing. The DP matrix looks as follows:
Bit vector algorithms for approximate string matching, by C. Gröpl, G. Klau, K. Reinert, October 28, 2013

A N N E A L I N G
0 0 0 0 0 0 0 0 0 0
A 1 0 1 1 1 0 1 1 1 1
N 2 1 0 1 2 1 1 2 2 2
N 3 2 1 0 1 2 2 2 2 2
U 4 3 2 1 1 2 3 3 3 3
A 5 4 3 2 2 1 2 3 4 4
L 6 5 4 3 3 2 1 2 3 4
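As a sketch (function name ours, not from the sources), the recurrence and initialization translate directly into Python; the printed last row reproduces the bottom row of the table above:

```python
# Classical O(mn) DP for approximate matching: C[i][j] is the minimal number
# of differences between p[1..i] and some substring of t ending at position j.

def dp_matrix(p, t):
    m, n = len(p), len(t)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                                  # left boundary: edit distance
    for j in range(1, n + 1):                        # upper boundary C[0][j] = 0
        for i in range(1, m + 1):
            delta = 0 if p[i - 1] == t[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + delta,   # diagonal (match/substitution)
                          C[i - 1][j] + 1,           # vertical
                          C[i][j - 1] + 1)           # horizontal
    return C

C = dp_matrix("annual", "annealing")
print(C[6])  # → [6, 5, 4, 3, 3, 2, 1, 2, 3, 4]
```

Every position j with C[6][j] ≤ k = 2, i.e. j = 5, 6, 7, is a hit.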

3.3 Computing the score in linear space


A basic observation is that the score computation can be done in O(m) space because computing a column C j
only requires knowledge of the previous column C j−1 . Hence we can scan the text from left to right updating
our column vector and reporting a match every time C[m, j] ≤ k.
[Figure: column-wise evaluation of the DP matrix for sequences x and y, showing the initial case and the
recursive case; at any point only the current column is "in memory / active".]
The approximate matches of the pattern can be output on the fly as the final entry C[m, j] of each column is
computed.
The algorithm still requires O(mn) time, but it uses only O(m) memory.
Note that in order to obtain the actual alignment (not just its score), we need to trace back by some other
means from the point where the match was found, as the preceding columns of the DP matrix are no longer
available.
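As a sketch in Python (function name ours), the column update and on-the-fly reporting look like this; the diagonal entry is saved before it is overwritten:

```python
# O(m)-space scan: keep one column of the DP matrix, report every text
# position j with C[m][j] <= k as soon as the column is complete.

def search(p, t, k):
    m = len(p)
    col = list(range(m + 1))             # column 0: C[i][0] = i
    hits = []
    for j, tj in enumerate(t, start=1):
        prev_diag = col[0]               # C[i-1][j-1] before overwriting
        col[0] = 0                       # C[0][j] = 0: a hit may start anywhere
        for i in range(1, m + 1):
            delta = 0 if p[i - 1] == tj else 1
            new = min(prev_diag + delta,  # diagonal
                      col[i - 1] + 1,     # vertical (already updated)
                      col[i] + 1)         # horizontal (still column j-1)
            prev_diag, col[i] = col[i], new
        if col[m] <= k:
            hits.append(j)
    return hits

print(search("annual", "annealing", 2))  # → [5, 6, 7]
```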

3.4 Ukkonen’s algorithm


Ukkonen studied the properties of the dynamic programming matrix and came up with a simple twist to the
classical algorithm that retains all of its flexibility while reducing the running time to O(kn) (as opposed to
O(mn)) on average.
The idea is that, since the pattern will normally not match the text, the entries of each column, read from
top to bottom, quickly reach k + 1. Let us call an entry of the DP matrix active if its value is at most k.
Ukkonen's algorithm maintains an index lact pointing to the last active cell and updates it accordingly. Due to
the properties of lact, it can avoid working on the cells below.
In the exercises you will prove that the value of lact can decrease by more than one in a single iteration, but
it can never increase by more than one.

3.5 Pseudocode of Ukkonen’s algorithm


(1) // Preprocessing
(2) for i ∈ 0 . . . m do Ci = i; od
(3) lact = k + 1;
(4) // Searching
(5) for pos ∈ 1 . . . n do
(6) Cp = 0; Cn = 0;
(7) for i ∈ 1 . . . lact do
(8) if pi = tpos
(9) then Cn = Cp;
(10) else
(11) if Cp < Cn then Cn = Cp; fi
(12) if Ci < Cn then Cn = Ci ; fi
(13) Cn++;
(14) fi
(15) Cp = Ci ; Ci = Cn;
(16) od
(17) // Updating lact
(18) while Clact > k do lact−−; od
(19) if lact = m then report occurrence
(20) else lact++;
(21) fi
(22) od
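A direct Python transcription of the pseudocode above (a sketch; it assumes k < m so that the initial lact = k + 1 is a valid row index):

```python
# Ukkonen's cut-off: C holds the current column, lact the index of the last
# active cell (value <= k); cells below lact + 1 need not be computed.
# Assumes k < m.

def ukkonen_search(p, t, k):
    m = len(p)
    C = list(range(m + 1))               # column 0: C[i] = i
    lact = k + 1
    hits = []
    for pos, tc in enumerate(t, start=1):
        Cp, Cn = 0, 0                    # Cp: diagonal value, Cn: cell above
        for i in range(1, lact + 1):
            if p[i - 1] == tc:
                Cn = Cp                  # match: copy the diagonal value
            else:
                Cn = min(Cp, Cn, C[i]) + 1
            Cp, C[i] = C[i], Cn
        while C[lact] > k:               # shrink to the last active cell
            lact -= 1
        if lact == m:
            hits.append(pos)             # C[m] <= k: occurrence at pos
        else:
            lact += 1                    # lact grows by at most one per column
    return hits

print(ukkonen_search("annual", "annealing", 2))  # → [5, 6, 7]
```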

3.6 Running time of Ukkonen’s algorithm


Ukkonen’s algorithm behaves like a standard dynamic programming algorithm, except that it maintains and
uses in the main loop the variable lact, the last active cell.
The value of lact can decrease by more than one in a single iteration, but it can never increase by more than
one (why?). Thus the total time spent updating lact over the run of the algorithm is O(n). In other words, lact is
maintained in amortized constant time per column.
One can show that on average the value of lact is bounded by O(k). Thus Ukkonen’s modification of the
classical DP algorithm has an average running time of O(kn).

3.7 Encoding and parallelizing the DP matrix


Next we will look at Myers' bit-parallel algorithm, which, in combination with Ukkonen's trick, yields a
remarkably fast algorithm; it was used, e.g., for the overlap computations performed at Celera that are the
starting point for genome assembly.
For simplicity, we assume that m is smaller than w, the length of a machine word.
The main idea of the bit-vector algorithm is to parallelize the computation of the dynamic programming
matrix. We will compute each column as a whole in a series of bit-level operations. In order to do so, we need to

1. encode the dynamic programming matrix using bit vectors, and


2. resolve the dependencies (especially within the columns).

3.8 Encoding the DP matrix


The binary encoding is done by considering the differences between consecutive rows and columns instead of
their absolute values. We introduce the following nomenclature for these differences (“deltas”):

horizontal adjacency property: ∆hi,j = Ci,j − Ci,j−1 ∈ {−1, 0, +1}
vertical adjacency property: ∆vi,j = Ci,j − Ci−1,j ∈ {−1, 0, +1}
diagonal property: ∆di,j = Ci,j − Ci−1,j−1 ∈ {0, +1}

Exercise. Prove that these deltas are indeed within the claimed ranges.
The delta vectors are encoded as bit-vectors by the following boolean variables:

• VPij ≡ (∆vi, j = +1), the vertical positive delta vector


• VNij ≡ (∆vi, j = −1), the vertical negative delta vector
• HPij ≡ (∆hi, j = +1), the horizontal positive delta vector
• HNij ≡ (∆hi, j = −1), the horizontal negative delta vector

• D0ij ≡ (∆di, j = 0), the diagonal zero delta vector

The deltas and bits are defined such that

∆vi,j = VPi,j − VNi,j
∆hi,j = HPi,j − HNi,j
∆di,j = 1 − D0i,j .

It is also clear that these values "encode" the entire DP matrix C[0..m, 0..n] via C(i, j) = Σ r=1..i ∆vr,j . Below is
our example matrix with the ∆vi,j values.

∆vi, j : A N N E A L I N G
0 0 0 0 0 0 0 0 0 0
A 1 0 1 1 1 0 1 1 1 1
N 1 1 -1 0 1 1 0 1 1 1
N 1 1 1 -1 -1 1 1 0 0 0
U 1 1 1 1 0 0 1 1 1 1
A 1 1 1 1 1 -1 -1 0 1 1
L 1 1 1 1 1 1 -1 -1 -1 0

We denote by score j the edit distance of a pattern occurrence ending at text position j. The key ideas of
Myers’ algorithm are as follows:

1. Instead of computing C we compute the ∆ values, which in turn are represented as bit-vectors.
2. We compute the matrix column by column as in Ukkonen’s version of the DP algorithm.

3. We maintain the value score j using the fact that score0 = m and score j = score j−1 + ∆hm, j .

In the following slides we will make some observations about the dependencies of the bit-vectors.

3.9 Observations
Let's have a look at the ∆s. The first observation is that

HNi, j ⇔ VPi,j−1 AND D0i,j .

Proof. If HNi,j holds, then ∆hi,j = −1 by definition, and hence ∆vi,j−1 = 1 and ∆di,j = 0; otherwise the ranges
of possible values ({−1, 0, +1} resp. {0, +1}) would be violated.

[Figure: cells (i−1,j−1) and (i−1,j) above, cells (i,j−1) = x+1 and (i,j) = x below.]
By symmetry we have:
VNi, j ⇔ HPi−1, j AND D0i,j .

The next observation is that

HPi, j ⇔ VNi, j−1 OR NOT (VPi, j−1 OR D0i, j ) .

Proof. If HPi,j holds, then VPi,j−1 cannot hold without violating the ranges. Hence ∆vi,j−1 is −1 or 0. In the first
case VNi,j−1 is true, whereas in the second case neither VPi,j−1 nor D0i,j holds.

[Figure: cells (i−1,j−1) and (i−1,j) above, holding x or x+1; cells (i,j−1) = x and (i,j) = x+1 below.]
Again by symmetry we have
VPi, j ⇔ HNi−1, j OR NOT (HPi−1,j OR D0i, j ) .

Finally, D0i, j = 1 iff C[i, j] and C[i − 1, j − 1] have the same value. This can be true for three possible reasons,
which correspond to the three cases of the DP recurrence:

1. pi = t j , that is the query at position i matches the text at position j.


2. C[i, j] is obtained by propagating a lower value from the left, that is C[i, j] = 1 + C[i, j − 1]. Then we have
VNi,j−1 .
3. C[i, j] is obtained by propagating a lower value from above, that is C[i, j] = 1 + C[i − 1, j]. Then we have
HNi−1,j .

So writing this together yields:


D0i, j ⇔ (pi = t j ) OR VNi, j−1 OR HNi−1, j .

Taking everything together we have the following equivalences:


HNi, j ⇔ VPi, j−1 AND D0i, j (3.1)
VNi, j ⇔ HPi−1, j AND D0i, j (3.2)
HPi, j ⇔ VNi, j−1 OR NOT (VPi,j−1 OR D0i, j ) (3.3)
VPi, j ⇔ HNi−1, j OR NOT (HPi−1, j OR D0i, j ) (3.4)
D0i, j ⇔ (pi = t j ) OR VNi, j−1 OR HNi−1, j (3.5)
Can these be used to update the five bit-vectors as the algorithm searches through the text?
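They can: each equivalence holds for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, given the boundary values C[0, j] = 0 and C[i, 0] = i (so ∆vi,0 = 1 and ∆h0,j = 0). A Python sketch (names ours) that verifies (3.1)-(3.5) on the running example:

```python
# Check equivalences (3.1)-(3.5) for every interior cell of the DP matrix.

def dp_matrix(p, t):
    m, n = len(p), len(t)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            delta = 0 if p[i - 1] == t[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + delta, C[i - 1][j] + 1, C[i][j - 1] + 1)
    return C

def check_equivalences(p, t):
    C = dp_matrix(p, t)
    dv = lambda i, j: C[i][j] - C[i - 1][j]   # vertical delta, needs i >= 1
    dh = lambda i, j: C[i][j] - C[i][j - 1]   # horizontal delta, needs j >= 1
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            D0 = C[i][j] == C[i - 1][j - 1]
            assert (dh(i, j) == -1) == (dv(i, j - 1) == +1 and D0)   # (3.1)
            assert (dv(i, j) == -1) == (dh(i - 1, j) == +1 and D0)   # (3.2)
            assert (dh(i, j) == +1) == \
                (dv(i, j - 1) == -1 or not (dv(i, j - 1) == +1 or D0))  # (3.3)
            assert (dv(i, j) == +1) == \
                (dh(i - 1, j) == -1 or not (dh(i - 1, j) == +1 or D0))  # (3.4)
            assert D0 == (p[i - 1] == t[j - 1]
                          or dv(i, j - 1) == -1 or dh(i - 1, j) == -1)  # (3.5)
    return True

assert check_equivalences("annual", "annealing")
```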

3.10 Resolving circular dependencies


In principle the overall strategy is clear: we traverse the text from left to right and keep track of five bit-vectors
VN j , VP j , HN j , HP j , D0 j , each containing the bits for 1 ≤ i ≤ m. For moderately sized patterns (m ≤ w) these
fit into a machine word. The reduction to linear space works as for C.
Clearly we initialize VP0 = 1m , VN0 = 0m , so perhaps we can compute the other bit-vectors using the
above observations.
One problem remains to be solved: D0i,j depends on HNi−1,j , which in turn depends on D0i−1,j , a value
we have not computed yet. But there is a solution.
Let us have a closer look at D0i, j and expand it.
D0i, j = (pi = t j ) OR VNi,j−1 OR HNi−1, j
HNi−1, j = VPi−1, j−1 AND D0i−1, j
⇒ D0i, j = (pi = t j ) OR VNi, j−1 OR (VPi−1, j−1 AND D0i−1,j )

The formula
D0i, j = (pi = t j ) OR VNi, j−1 OR (VPi−1, j−1 AND D0i−1, j )
is of the form
D0i = Xi OR (Yi−1 AND D0i−1 ) ,
where
Xi := (t j = pi ) OR VNi
and Yi := VPi .
Here we omit the index j from the notation for clarity. We solve for D0i . Unrolling the first few terms we
get:
D01 = X1 ,
D02 = X2 OR (X1 AND Y1 ) ,
D03 = X3 OR (X2 AND Y2 ) OR (X1 AND Y1 AND Y2 ) .
In general we have:
D0i = OR r=1..i ( Xr AND Yr AND Yr+1 AND . . . AND Yi−1 ) .

To put this into words, let s < i be such that Ys . . . Yi−1 = 1 and Ys−1 = 0. Then D0i = 1 if and only if Xr = 1
for some s ≤ r ≤ i. That is, D0i is set if Xi = 1 itself, or if the run of 1s in Y immediately to the right of
position i contains a position r with Xr = 1. Here is an example:
Y = 00011111000011
X = 00001010000101
D0 = 00111110000111

3.11 Computing D0
Now we solve the problem of computing D0 from X and Y using bit-vector and arithmetic operations.
The first step is to compute X & Y and add Y: by carry propagation, every bit Xr = 1 that is aligned with a
bit Yr = 1 is moved one position past the left end of its run of 1s in Y.
Again our example:
Y = 00011111000011
X = 00001010000101
X & Y = 00001010000001
(X & Y) + Y = 00101001000100
···
D0 = 00111110000111
Note the two 1s that got propagated just past the left ends of the runs of 1s in Y.
However, note also the single 1 that is not in the solution but was introduced by adding a 1 in Y to a 0 in
(X & Y).
As a remedy we XOR the term with Y, so that only the bits that changed during the propagation stay
turned on. Again our example:
Y = 00011111000011
X = 00001010000101
X & Y = 00001010000001
(X & Y) + Y = 00101001000100
((X & Y) + Y) ∧ Y = 00110110000111
···
D0 = 00111110000111

Now we are almost done. The only thing left to fix is that there may be several Xr bits under the same run of
1s in Y. Of those, all but the first remain unchanged and hence will not be marked by the XOR.
To fix this, and to account for set bits of X outside the runs, we OR the final result with X. Again our example:
Y = 00011111000011
X = 00001010000101
X & Y = 00001010000001
(X & Y) + Y = 00101001000100
((X & Y) + Y) ∧ Y = 00110110000111
(((X & Y) + Y) ∧ Y)) | X = 00111110000111
= D0 = 00111110000111
Now we can substitute X back by (pi = t j ) OR VN and Y by VP.
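In Python, whose integers behave as arbitrarily wide bit-vectors, the whole computation is four word operations; the values below are the 14-bit example vectors from above:

```python
# Carry propagation computes D0 from X and Y in four operations.
Y = 0b00011111000011          # plays the role of VP
X = 0b00001010000101          # plays the role of (p_i = t_j) OR VN

# add: propagate through runs; xor: keep only changed bits; or: add X itself
D0 = (((X & Y) + Y) ^ Y) | X
print(format(D0, "014b"))     # → 00111110000111
```

In a fixed-width machine word the sum would additionally be masked to w bits.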

3.12 Preprocessing the alphabet


The only thing left in the computation of D0 is the expression pi = t j .
Of course we cannot check for every i whether pi equals t j ; this would take time O(m) and defeat
our goal. However, we can preprocess the query pattern and the alphabet Σ: we use | Σ | bit-vectors
B[α], α ∈ Σ, with the property that B[α]i = 1 iff pi = α. These vectors can easily be precomputed in time
O(| Σ | m).
Now we have everything together.

3.13 Myers’ bit-vector algorithm

(1) // Preprocessing
(2) for c ∈ Σ do B[c] = 0m od
(3) for j ∈ 1 . . . m do B[p j ] = B[p j ] | 0m− j 10 j−1 od
(4) VP = 1m ; VN = 0m ;
(5) score = m;

(1) // Searching
(2) for pos ∈ 1 . . . n do
(3) X = B[tpos ] | VN;
(4) D0 = ((VP + (X & VP)) ∧ VP) | X;
(5) HN = VP & D0;
(6) HP = VN | ∼ (VP | D0);
(7) X = HP << 1;
(8) VN = X & D0;
(9) VP = (HN << 1) | ∼ (X | D0);
(10) // Scoring and output
(11) if HP & 10m−1 ≠ 0m
(12) then score += 1;
(13) else if HN & 10m−1 ≠ 0m
(14) then score −= 1;
(15) fi
(16) fi
(17) if score ≤ k report occurrence at pos fi;
(18) od

Note that this algorithm also computes the edit distance of P and t1 . . . t j if we append in line (7) the code
fragment | 0m−1 1. This is because in that case there is a horizontal increment in the 0-th row (C[0, j] = j),
whereas in the case of reporting all occurrences each column starts with 0.
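Putting the preprocessing of Section 3.12 and the search loop together gives the following Python sketch (names ours). Python's unbounded integers serve as bit-vectors, so we mask to m bits wherever a machine word would wrap; bit i − 1 of a vector corresponds to row i of the DP matrix.

```python
# Myers' bit-parallel approximate search; score equals C[m][pos] after
# processing text position pos.

def myers_search(p, t, k):
    m = len(p)
    mask = (1 << m) - 1
    high = 1 << (m - 1)                  # bit of the last row m
    B = {}                               # pattern match masks (Section 3.12)
    for i, c in enumerate(p):
        B[c] = B.get(c, 0) | (1 << i)
    VP, VN = mask, 0
    score = m
    hits, scores = [], []
    for pos, c in enumerate(t, start=1):
        X = B.get(c, 0) | VN
        D0 = ((((X & VP) + VP) ^ VP) | X) & mask
        HN = VP & D0
        HP = (VN | ~(VP | D0)) & mask
        X = (HP << 1) & mask
        VN = X & D0
        VP = ((HN << 1) | ~(X | D0)) & mask
        if HP & high:
            score += 1
        elif HN & high:
            score -= 1
        scores.append(score)
        if score <= k:
            hits.append(pos)
    return hits, scores

hits, scores = myers_search("annual", "annealing", 2)
print(hits)    # → [5, 6, 7]
print(scores)  # → [5, 4, 3, 3, 2, 1, 2, 3, 4]
```

The score trace matches the step-by-step example in Section 3.14.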

3.14 The example


If we try to find annual in annealing, we first run the preprocessing, resulting in:

a 0 1 0 0 0 1
l 1 0 0 0 0 0
n 0 0 0 1 1 0
u 0 0 1 0 0 0
* 0 0 0 0 0 0

In addition we initialize:

VN = 000000
VP = 111111
score 6

We read the first letter of the text, an a.

Reading a 010001
D0 = 111111
HN = 111111
HP = 000000
VN = 000000
VP = 111110
score 5

We read the second letter of the text, an n.

Reading n 000110
D0 = 111110
HN = 111110
HP = 000001
VN = 000010
VP = 111101
score 4

We read the third letter of the text, an n.

Reading n 000110
D0 = 111110
HN = 111100
HP = 000010
VN = 000100
VP = 111001
score 3

We read the fourth letter of the text, an e.



Reading e 000000
D0 = 000100
HN = 000000
HP = 000110
VN = 000100
VP = 110001
score 3

We read the fifth letter of the text, an a.

Reading a 010001
D0 = 110111
HN = 110011
HP = 001100
VN = 010000
VP = 100110
score 2

And so on. In the exercises you will extend the algorithm to handle queries longer than w.

3.15 Banded Myers’ bit vector algorithm


• Often only a small band of the DP matrix needs to be computed, e. g.:
– Global alignment with up to k errors (left)
[Figure: two banded DP matrices, left for global alignment of s and t with up to k errors, right for
verifying a potential match of p in t.]
– Verification of a potential match returned by a filter (right)

Myers' bit-vector algorithm is efficient if the number of DP rows is at most the machine word length
(typically 64). For larger matrices the bitwise operations must be emulated using multiple words, at the
expense of running time.
Banded Myers’ bit vector algorithm:

• Adaptation of Myers' algorithm to calculate a banded alignment (Hyyrö 2003)

• A very efficient alternative if the band width is at most the machine word length

Algorithm outline:

• Calculate VP and VN for column j from column j − 1



• Right-shift VP and VN (see b)
• Use either D0 or HP, HN to track the score (dark cells in c)

[Figure 5 of Hyyrö (2003): a) horizontal tiling (left) and diagonal tiling (right); b) the diagonal step aligns
the (j − 1)th column vector one step above the jth column vector; c) the region of diagonals filled according
to Ukkonen's rule, with the cells on the lower boundary in a darker tone.]

3.16 Preprocessing and Searching

• Given a text t of length n and a pattern p of length m
• We consider a band of the DP matrix:
– consisting of w consecutive diagonals,
– where the leftmost diagonal is the main diagonal shifted by c to the left.

[Figure: the band of width w in the DP matrix, starting at column c in row 0 and ending at columns
m − c . . . n in row m.]

Now we don't encode whole DP columns but the intersection of each column with the band (plus one
diagonal left of the band). Thus we store only w vertical deltas in VP and VN; the lowest bits encode the
differences between the rightmost and the left adjacent diagonal. The pattern bit mask computation remains
the same.

(1) // Preprocessing
(2) for i ∈ Σ do B[i] = 0m od
(3) for j ∈ 1 . . . m do B[p j ] = B[p j ] | 0m− j 10 j−1 ; od
(4) VP = 1w ; VN = 0w ;
(5) score = c;

Instead of shifting HP/HN to the left, we shift D0 to the right. Pattern bit masks must be shifted accordingly:
Another necessary modification is in the way the pattern match vector P Mj is
Bit vector algorithms for approximate string matching, by C. Gröpl, G. Klau, K. Reinert, October 28, 2013, 14:053011

// Original                            // Banded
for pos ∈ 1 . . . n do                 for pos ∈ 1 . . . n do
                                       // Use shifted pattern mask
                                       B = (B[tpos ] 0w >> (pos + c)) & 1w ;
X = B[tpos ] | VN;                     X = B | VN;
D0 = ((VP + (X & VP)) ∧ VP) | X;       D0 = ((VP + (X & VP)) ∧ VP) | X;
HN = VP & D0;                          HN = VP & D0;
HP = VN | ∼ (VP | D0);                 HP = VN | ∼ (VP | D0);
X = HP << 1;                           X = D0 >> 1;
VN = X & D0;                           VN = X & HP;
VP = (HN << 1) | ∼ (X | D0);           VP = HN | ∼ (X | HP);
// Scoring and output                  // Scoring and output
. . .                                  . . .
od                                     od

The score value is tracked along the leftmost diagonal and along the last row, beginning with score = c.
Bit w of D0 is used to track the diagonal differences, and bit (w − 1) − (pos − (m − c + 1)) of HP/HN is used to
track the horizontal differences.

(1) // Scoring and output
(2) if pos ≤ m − c
(3) then
(4) score += 1 − ((D0 >> (w − 1)) & 1);
(5) else
(6) s = (w − 2) − (pos − (m − c + 1));
(7) score += (HP >> s) & 1;
(8) score −= (HN >> s) & 1;
(9) fi
(10) if pos ≥ m − c ∧ score ≤ k report occurrence at pos fi;

3.17 Edit distance


In the beginning, the bit vectors partially encode cells outside the DP matrix. Depending on the search mode
(approximate search or edit distance computation), we initialize VP/VN according to the scheme below and
zero the pattern masks.

[Figure: initial cell values covered by the band. Approximate search: the virtual rows above the matrix hold
−5, −4, . . . , 0 from top to bottom, so all vertical differences are +1. Edit distance computation: every virtual
row above the matrix repeats the first row 0, 1, 2, 3, 4, 5, so the vertical differences there are 0.]

Approximate search: VP = 1w , VN = 0w
Edit distance computation: VP = 1c+1 0w−c−1 , VN = 0w
