ReSMA Accelerating Approximate String Matching Using
Figure 3: ReSMA memory architecture
[Panels: (A) ReSMA Architecture, (B) Data Transfer Controller, (C) Filters Group, (D) PEs Group, (F) CAM MAT, (G) ReCAM Array, (J) ReRAM Crossbar]

1   Procedure FilterQ1(Q1, 𝑉𝑥)
2     𝑉𝑥.VoS ← 0, i = 0 ;
3     while not the last 𝑞-gram of Q1 do
4       𝑉𝑥’s Key register ← Q1[i], j = 0 ;
5       while not the last 𝑞-gram of 𝑉𝑥 do
6         For rows: activate DRVs ;
7         For columns: activate the Mask register of 𝑉𝑥’s j-th 𝑞-gram ;
8         VoS of tagged rows plus 1 ;
9         invalidate tagged rows ;
10        j=j+1;
11      i=i+1;
12      remove all marks (line 9) ;
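As a rough software analogue of the FilterQ1 listing above, the following minimal sketch counts, for every stored row, how many 𝑞-grams of Q1 it contains. It is not the authors' simulator. The names `qgrams` and `filter_q1`, the list-of-strings stand-in for 𝑉𝑥, and the use of overlapping positional 𝑞-grams are assumptions made for illustration. The innermost loop over rows runs sequentially here, whereas the ReCAM array would compare all rows in one step.

```python
def qgrams(s, q):
    """Overlapping substrings of length q (assumed q-gram definition)."""
    return [s[k:k + q] for k in range(len(s) - q + 1)]


def filter_q1(q1, rows, q):
    """Return the VoS (one counter per row) after scanning all keys in q1.

    q1   : list of distinct q-grams of the query (the non-repeated sub-set)
    rows : list of stored strings, a hypothetical stand-in for vector Vx
    """
    row_grams = [qgrams(r, q) for r in rows]
    n_pos = max((len(g) for g in row_grams), default=0)
    vos = [0] * len(rows)                      # line 2: VoS <- 0
    for key in q1:                             # lines 3-4: load next key
        valid = [True] * len(rows)             # line 12: marks cleared per key
        for j in range(n_pos):                 # line 5: j-th q-gram position
            # lines 6-7: compare the key with the j-th q-gram of every row
            for r, grams in enumerate(row_grams):
                if valid[r] and j < len(grams) and grams[j] == key:
                    vos[r] += 1                # line 8: tagged row scores 1
                    valid[r] = False           # line 9: skip this row for this key
    return vos


if __name__ == "__main__":
    print(filter_q1(["ab", "bc"], ["abcd", "bcbc", "xxxx"], 2))  # [2, 1, 0]
```

In the example, the row "bcbc" contains "bc" twice but scores only 1 for that key, because the row is invalidated after its first match. This mirrors the role of line 9 in the listing.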
in both filtering and verification motivates us to design a novel PIM-featured ASM solution to accelerate both filtering and verification algorithms without any off-chip data transfer overhead.

3 RESMA

3.1 Overview
As shown in Fig. 3 (A), ReSMA comprises several Tiles, and each Tile contains a filters group (FG) (C) to perform the 𝑞-gram filtering and an ED processing elements (PEs) group (PG) (D) to compute the ED. The Data Transfer Controller (DTC) (B) loads the datasets from the off-chip Solid State Disk (SSD) to the ReCAM filters via the Peripheral Component Interconnect express (PCIe) bus. When a query string arrives at ReSMA, the FG selects candidate strings without off-chip data access. These candidate strings are transported to the PG via the DTC (on-chip bandwidth), where the ED is computed using highly parallel MVM operations and then compared with 𝜃. Finally, ReSMA reads out the strings with ed(𝑠, 𝑡) ≤ 𝜃 as the query results and stores them back to the off-chip SSD via the DTC.

3.2 Filtering Phase
As Fig. 3 (E) shows, a ReCAM filter (RF) contains a register (REG), a simple arithmetic and logical unit (sALU), a MAT, and a Controller-F (CTRL-F). The REG stores the query strings and the ED threshold 𝜃. The sALU computes the 𝑞-gram sets of the query strings. As shown in (F), the MAT contains many ReCAM arrays, which store and filter the string datasets. The CTRL-F is responsible for generating the control signals for all components in one filter. (G) further gives a sketch of the ReCAM array used in our filters.

Assume that 8 bits are used to store the ASCII value of each character. As Fig. 4(a) shows, we divide the ReCAM arrays into column-vectors from Vector_1 to Vector_n. Each row of the array stores one element of each vector, and each element has two parts. The […] number of the common 𝑞-grams, named vector of score (VoS). The columns marked in yellow are left as a Buffer to store the TAG information. We store strings with the same length in the same vector to avoid memory waste. When processing a dataset with long sequences, we divide each sequence into several short strings, which can be stored in different rows. By processing all short strings in parallel, we can quickly obtain the results between long sequences.

Figure 4: (a) The data mapping of the ReCAM filter, (b) An example comparison matrix, (c) The rotated comparison matrix, and (d) The comparison matrix in the crossbar

The filtering phase works as follows. With the dataset stored in the array and the query string stored in the Key register, the system computes |𝑄𝑠 ∩ 𝑄𝑡| between the query string and all strings in the dataset. The current ReCAM array cannot support the intersection operation between two multi-sets. Hence, we divide 𝑄𝑠 into two sub-sets. The first sub-set 𝑄1 stores the non-repetitive 𝑞-grams. The second sub-set 𝑄2 is a set of two-tuples ⟨𝑄2.𝑠𝑡𝑟, 𝑄2.𝑐𝑜𝑢𝑛𝑡⟩ that stores the repeated 𝑞-grams and their repetition counts. Taking the multi-set {a, a, a, b, b, c, d} as an example, its 𝑄1 is {c, d} and its 𝑄2 is {⟨𝑎, 3⟩, ⟨𝑏, 2⟩}. We design two algorithms to separately compute |𝑄1 ∩ 𝑄𝑡| and |𝑄2 ∩ 𝑄𝑡|; adding the two results yields |𝑄𝑠 ∩ 𝑄𝑡|. Finally, the system sends the corresponding score threshold 𝜏 (which is computed from the ED threshold 𝜃 according to Equation (2)) to the Key register and compares 𝜏 with the VoS to finish the 𝑞-gram filtering.

|𝑄1 ∩ 𝑄𝑡| Computation. Fig. 5 shows the algorithm to count the common 𝑞-grams between 𝑄1 and vector_x (𝑉𝑥). In line 2, the VoS of 𝑉𝑥 is initialized to ‘0’. The CTRL-F sends the first 𝑞-gram in 𝑄1 to the Key register of 𝑉𝑥 (line 4). In line 6, the system activates the DRVs. The Mask register activates the corresponding bit line (BL) and bit-not line (BNL) to perform a vector-scalar comparison between the Key register and the first 𝑞-gram of all strings in 𝑉𝑥 (line 7). If a row matches during the comparison, it has a common 𝑞-gram with the query string, and its VoS is incremented by one (line 8). Then, the system invalidates the matched rows for the subsequent comparisons (line 9). After comparing the first 𝑞-gram, the system iteratively compares the remaining 𝑞-grams between 𝑄1 and 𝑉𝑥 (lines 3 and 5).

|𝑄2 ∩ 𝑄𝑡| Computation. Fig. 6 introduces the algorithm for counting the common 𝑞-grams between 𝑄2 and all strings in 𝑉𝑥. Initially, the Buffer is set to ‘0’ (line 4). The CTRL-F then sends the first element of 𝑄2 to the Key register of 𝑉𝑥 (line 5). Further, the system activates all DRVs, and the Mask register will activate the
corresponding BL and BNL to compare the first 𝑞-gram between 𝑄2 and 𝑉𝑥 (lines 7 and 8). Afterward, the Buffer of the tagged rows is incremented by one to record a common 𝑞-gram (line 9). Since all 𝑞-grams in 𝑄2 occur more than once, the arrays apply the row and column signals again to compare the information in the Buffer with 𝑄2.𝑐𝑜𝑢𝑛𝑡[𝑖] (lines 10 and 11). Note that the rows matched in lines 10 and 11 cannot have more common 𝑞-grams. Therefore, the system adds the tagged rows’ Buffer to their VoS to update the common 𝑞-gram count (line 12). The whole procedure is iterative (lines 3 and 6). The system removes all marks after processing one 𝑞-gram, which is needed for the next iteration (line 16). Finally, the VoS of all un-tagged rows is updated (line 17).

1   Procedure FilterQ2(Q2, 𝑉𝑥)
2     i=0 ;
3     while not the last 𝑞-gram of Q2 do
4       B (Buffer) ← 0, j = 0 ;
5       Send Q2.str[i] and Q2.count[i] to 𝑉𝑥’s Key register ;
6       while not the last 𝑞-gram of 𝑉𝑥 do
7         For rows: activate DRVs ;
8         For columns: activate the Mask register of 𝑉𝑥’s j-th 𝑞-gram ;
9         Buffer of tagged rows plus 1 ;
10        For rows: activate DRVs ;
11        For columns: activate Mask register of the Buffer ;
12        Add tagged rows’ Buffer to their VoS ;
13        invalidate tagged rows ;
14        j=j+1;
15      i=i+1;
16      Remove all marks (line 13) ;
17      Add all Buffers to their VoS ;
Figure 6: Parallel computation algorithm of |𝑄2 ∩ 𝑄𝑡|

3.3 Verification Phase
As shown in Fig. 3 (H), one ReRAM PE (RP) contains a register (REG-P), a comparator (CMP), an address mapping table (AMT), a ReRAM string element (SE), and a Controller-P (CTRL-P). The REG-P buffers the candidate strings received from the filtering phase via the DTC (on-chip bandwidth). The CMP calculates the comparison matrix (CM) (according to the calculation of 𝑀[𝑖, 𝑗] in Section 2.1). Once a CM is generated, we map it to the ReRAM array and use the AMT to record the addresses of the mapped cells. The SE (I) contains many ReRAM crossbars (XBs) and analog-to-digital converters (ADCs) to compute the ED. Every eight XBs share one ADC to save area. If necessary, several neighboring arrays can work as one holistic crossbar array to process long strings (note that these neighboring arrays do not need to communicate). Each crossbar contains drivers (DRVs), a MIN comparator, and a sample and hold (S/H) unit (Fig. 3 (J)). The CTRL-P controls all components in one PE.

Fig. 4(b) shows a comparison matrix of "ab" and "cd". The processing of this CM is anti-diagonal, but the MVM operation is row- and column-parallel. To compute the ED using the ReRAM array, we propose a new data mapping strategy, as Fig. 4(d) shows. First, we rotate the CM from Fig. 4(b) to Fig. 4(c). Then, we map the 3×3 CM in Fig. 4(c) to a 5×5 ReRAM crossbar, as the red-framed rectangle in Fig. 4(d) shows. Our mapping strategy transforms the anti-diagonal parallelism into row- and column-parallelism. The last two rows of the ReRAM array are set to repeating ‘01’ and ‘10’ sequences to perform the ‘plus 1’ operation in Equation (1).

The key idea of our verification algorithm is to process the cells in the same layer in parallel and to process different layers sequentially from top to bottom. Fig. 7 shows an example that details the edit distance computation between "ab" and "cd". In Fig. 7(a), the system computes the red cell’s ED first. The row-select and column-select signals are shown as brown triangles. The MVM operation is performed on the selected cells (marked with a blue-dotted rectangle). The S/H unit holds ‘2’, ‘1’, and ‘2’ as the results of the MVM operation. The system chooses the minimum (here ‘1’) among the three values and writes it back to the red cell. In Fig. 7(b), the system computes the ED of the second layer, which is also marked in red. The row-select and column-select signals are again shown as brown triangles. The results of the MVM operation flow to the S/H unit, and the system chooses the minimum among these values according to Equation (1). Hence, ‘2’ and ‘2’ are the EDs of this layer and are written back to the two red cells. Fig. 7(c) shows the last layer’s computation. The S/H unit holds ‘3’, ‘2’, and ‘3’, the minimum of which is ‘2’. This ‘2’ is the ED of the last layer, which is also the final ED between "ab" and "cd".

Figure 7: (a) The ReRAM array state when the first layer is processed, (b) Computations for the second layer, and (c) Computations for the last layer

4 EXPERIMENTAL EVALUATION

4.1 Experimental Setup
ReSMA Configurations. We employ a cycle-accurate simulator written in Python to model the filtering phase. For the verification phase, we obtain the time and energy consumption by using NVSim [17]. We use a 1000 GB/s On-Chip Interconnect (OCI) for the on-chip transfer bandwidth. ReSMA’s energy consumption is obtained with 7 pJ per bit as in [18].

Table 1: Datasets Settings
Dataset      # of strings   Max. Length   Avg. Length   q
Author       29.48 M        53            24.3          2
Title        42.37 M        129           47.9          2
Actor        17.16 M        77            21.2          2
DNA          35.87 M        100           100           5
Dictionary   0.15 M         30            8.77          2
Tweets       61.8 M         137           67.65         3

The ReSMA configurations are summarized in Table 2. We configure ReSMA with 32 Tiles, and each Tile includes a Filters Group and a PEs Group. Each ReCAM filter consists of 477 512×512 crossbars configured as ReCAM arrays, while a ReRAM PE has 32 2-bit 256×256
Table 2: ReSMA Configurations

Figure 10: Response time with different DNA dataset sizes
[y-axis: Res. time (ms); series: CPU, GPU, FPGA, ASIC, PIM, ReSMA]

5 CONCLUSION
This paper introduces a novel PIM-featured ASM accelerator, namely ReSMA, based on ReCAM- and ReRAM-arrays. Following the filter-and-verify paradigm, we analyze the in-situ ASM challenges when adopting the ReCAM array to perform 𝑞-gram filtering. We, therefore, design ReCAM-friendly filters and corresponding algorithms to solve these challenges. We also present a new data mapping strategy and a new edit distance computation algorithm, enabling