ReSMA: Accelerating Approximate String Matching Using ReRAM-based Content Addressable Memory


Huize Li, Hai Jin, Long Zheng, Yu Huang, Xiaofei Liao, Zhuohui Duan, Dan Chen, Chuangyi Gui
National Engineering Research Center for Big Data Technology and System/ Services Computing Technology and
System Lab/ Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, 430074, China
{huizeli,hjin,longzh,yuh,xfliao,zhduan,cdhust,chygui}@hust.edu.cn

ABSTRACT

Approximate string matching (ASM) functions as the basic operation kernel for a large number of string processing applications. Existing Von-Neumann-based ASM accelerators suffer from huge intermediate data with the ever-increasing string data, leading to massive off-chip data transmissions. This paper presents a novel ASM processing-in-memory (PIM) accelerator, namely ReSMA, based on ReCAM- and ReRAM-arrays to eliminate the off-chip data transmissions in ASM. We develop a novel ReCAM-friendly filter and filtering algorithm to process the 𝑞-gram filtering in ReCAM memory. We also design a new data mapping strategy and a new verification algorithm, which enable computing the edit distances entirely in ReRAM crossbars for energy saving. Experimental results show that ReSMA outperforms the CPU-, GPU-, FPGA-, ASIC-, and PIM-based solutions by 268.7×, 38.6×, 20.9×, 707.8×, and 14.7× in terms of performance, and 153.8×, 42.2×, 31.6×, 18.3×, and 5.3× in terms of energy-saving, respectively.

This work is supported by the NSFC (No. 61832006, 62072195, and 61825202) and Huawei Technologies Co., Ltd. The correspondence of this paper should be addressed to Long Zheng.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DAC '22, July 10–14, 2022, San Francisco, CA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9142-9/22/07...$15.00
https://doi.org/10.1145/3489517.3530559

1 INTRODUCTION

Various websites and mobile devices generate a considerable amount of data, which follows the "Moore's Law" in data traffic. This data explosion means that a large volume of string data needs to be processed. Approximate string matching (ASM) works as the primitive operation in many string processing applications, e.g., data cleaning [1], information retrieval [2], and biological sequence analysis [3]. Making ASM performant is of great importance.

Many ASM hardware accelerators have emerged, based on Application-Specific Integrated Circuits (ASIC) [4], Graphics Processing Units (GPU) [5], and Field-Programmable Gate Arrays (FPGA) [6]. These efforts generate lots of intermediate data and introduce massive off-chip data transmissions. Consider the commonly-used filter-and-verify ASM algorithm. The filtering phase generates lots of sub-strings that need to be transported from the memory to the processor [7]. The verification phase involves lots of edit distance (ED) computations that construct massive dynamic programming matrices, increasing the off-chip transmission overhead [8].

Recent work shows the promise of PIM-based architectures to relieve the memory bandwidth pressure in ASM. AlignS presents hardware-friendly alignment algorithms based on SOT-MRAM to execute DNA short read alignment [9]. BioSEAL is architected as an energy-efficient ReCAM-based accelerator for biological sequence alignment [11]. RADAR shows efficient hardware methods to accelerate the Basic Local Alignment Search Tool (BLAST) using 3D ReCAM arrays [10]. However, these DNA sequence alignment accelerators are deficient in processing non-DNA sequences for two reasons. First, DNA sequence alignment usually uses a seed-and-extend method such as BLAST. In contrast, non-DNA sequence ASM uses a filter-and-verify approach, such as 𝑞-gram filtering and edit distance computing. Therefore, existing DNA sequence alignment accelerators can hardly extend to non-DNA ASM efficiently, since different algorithms involve different design principles in a specific accelerator. Second, existing PIM-based ASM accelerators choose to accelerate filtering [9, 10] or verification [11] only, involving substantial off-chip data transfers by leaving the other component to be processed by a processor. We find that off-chip memory access causes the main latency in ASM. Therefore, using PIM-based methods to accelerate filtering and verification simultaneously can eliminate the off-chip data transfers and speed up ASM.

Emerging Resistive Random Access Memory (ReRAM)-based content addressable memory (ReCAM) stores data as a memory while processing data in situ as a processor [12]. The core operation of filtering is sub-string comparison, which can be processed in parallel by the ReCAM array. The verification phase involves parallel addition operations, which can be transformed into matrix-vector multiplication (MVM) for efficient processing using ReRAM crossbars [13]. Therefore, in this paper, we are motivated to present a novel ReCAM-based PIM-featured ASM Accelerator, namely ReSMA, to accelerate both filtering and verification of ASM algorithms. However, achieving this idea still faces many challenges.

First, the ReCAM array supports only the comparison between two single sets, whereas existing filtering techniques specialized for traditional architectures focus on intersection operations between multiple sets and are therefore inapplicable to ReCAM arrays. Second, the computation fashion of the existing verification algorithm is an anti-diagonal parallel addition operation, which is also inapplicable to the row- or column-parallel MVM operation of ReRAM crossbars. To cope with the above challenges, ReSMA features two technical innovations. We design a dedicated filtering algorithm transforming the intersection operation between multi-sets to a

Figure 1: (a) The CM 𝑀[𝑖, 𝑗] between "abc" and "cab", (b) the EM 𝑀′[𝑖, 𝑗] between "abc" and "cab", and (c) response time breakdown of a CPU-based ASM accelerator on the Author, Actor, Title, DNA, and Tweets datasets (y-axis: operations ratio (%); Filt-M, Filt-C, Verify-M, and Verify-C refer to the filtering phase's memory access time, CPU execution time, the verification phase's memory access time, and CPU execution time, respectively)

Figure 2: (a) The ReRAM cell structure, (b) the 1D1R ReRAM crossbar, and (c) the 2T2R ReCAM array

two-step intersection operation between single-sets. We also develop a ReRAM-based verification algorithm that exploits MVM operations in a row- and column-parallel fashion.

This paper makes the following contributions:
• We present a novel PIM-featured ASM accelerator based on ReCAM arrays and ReRAM crossbars with a high memory density and few off-chip data transmissions.
• We design novel ReCAM-friendly filters and filtering algorithms to accelerate ASM filtering effectively. We also develop a new ReRAM-based verification architecture to compute the edit distance efficiently.
• We compare ReSMA against the state-of-the-art CPU-, GPU-, FPGA-, ASIC-, and PIM-based solutions, yielding significant improvements in terms of performance and energy-saving.

2 BACKGROUND AND MOTIVATION

2.1 Approximate String Matching

The ED between two strings is the minimum number of substitution, insertion, and deletion operations that are required to transform one string into the other. Given a set of strings D, a query string s, and an ED threshold 𝜃, ED-based ASM finds all strings t in D such that ed(𝑠, 𝑡) ≤ 𝜃, where ed(𝑠, 𝑡) represents the ED between s and t.

$$ M'[i, j] = \min \begin{cases} M'[i-1, j] + 1 \\ M'[i, j-1] + 1 \\ M'[i-1, j-1] + M[i, j] \end{cases} \qquad (1) $$

Equation (1) presents the algorithm for using the comparison matrix (CM) 𝑀[𝑖, 𝑗] to compute an ED matrix (EM) 𝑀′[𝑖, 𝑗] between s and t. Fig. 1(a) shows the CM between "abc" and "cab". Initially, 𝑀[𝑖, 0] (𝑀′[𝑖, 0]) = 𝑖 (blue-dotted rectangle) and 𝑀[0, 𝑗] (𝑀′[0, 𝑗]) = 𝑗 (red-dotted rectangle). For 𝑖 > 0 and 𝑗 > 0, 𝑀[𝑖, 𝑗] is 1 if s[𝑖−1] ≠ t[𝑗−1] and 0 otherwise. We mark the cells on the same anti-diagonal with green-dotted lines because they can be processed in parallel. Fig. 1(b) further shows the EM between "abc" and "cab" (ed(𝑠, 𝑡) = 𝑀′[3, 3] in this case). However, the time complexity of the ED computation is O(𝑁²), which is costly, particularly when processing large datasets. Prior research performs a filtering phase to significantly reduce the ED computation in the verification phase. We focus on 𝑞-gram filtering [8, 14] for its high efficiency and powerful filtering capability.

We use 𝑄𝑠 to represent the 𝑞-gram set of s, which contains all of its (|𝑠| + 𝑞 − 1) sub-strings with length 𝑞 (the sub-string length can be less than 𝑞 at both ends of the string). |𝑠| indicates the length of the string s. For example, the 3-gram set of the string "algorithm" is {a, al, alg, lgo, gor, ori, rit, ith, thm, hm, m}. 𝑄𝑠 is a multi-set, in which the same 𝑞-gram at different positions is included multiple times.

Given two strings s and t and the ED threshold 𝜃, the 𝑞-gram filtering [14] is shown in Equation (2).

$$ |Q_s \cap Q_t| \ge \max\{|s|, |t|\} + q - 1 - q\theta \qquad (2) $$

where the result of 𝑄𝑠 ∩ 𝑄𝑡 is a multi-set since both 𝑄𝑠 and 𝑄𝑡 are multi-sets. |𝑄| is the number of elements in 𝑄.

2.2 ReCAM Basics

ReRAM is widely used for its low memory access latency, high density, and non-volatile features [13]. One ReRAM cell contains a metal oxide layer sandwiched between a top electrode and a bottom electrode, as Fig. 2(a) shows. ReRAM cells are often organized in a one-diode-one-ReRAM-cell (1D1R) crossbar layout to process matrix-vector multiplication (MVM) operations efficiently [13], denoted as $I_j = \sum_{i=0}^{N} V_i \times G_{(i,j)}$, as Fig. 2(b) shows.

ReCAM is another computation mode using ReRAM cells for in-situ comparison purposes. Fig. 2(c) shows a typical representation of a ReCAM array using a two-transistor, two-memristor (2T2R) ReCAM bit-cell [12]. A 2T2R ReCAM bit-cell contains a pair of ReRAM cells to represent one ReCAM bit. We use the states (0,1) and (1,0) of the two ReRAM cells to represent logic '1' and '0' of one ReCAM bit, respectively. A ReCAM array contains a Key/Mask register, a ReCAM cell array, drivers (DRV), and tag registers (TAG). ReCAM arrays can perform vector-scalar comparisons in parallel [12]. With all word-lines set to high voltage, the TAG latches a '1' signal if a row matches the Key register and a '0' otherwise.

2.3 Motivation

We break down the response time of a CPU-based ASM accelerator [8] for five real-world datasets, as shown in Fig. 1(c). Across all five datasets, the memory access of the filtering phase takes an average of 71.7% of the response time, while the CPU execution only takes 28.3%. The reasons for this result are as follows. Besides the original string storage, the 𝑞-gram based filtering adopted in the CPU-based architecture needs an extra 𝑞× memory space to store the strings' 𝑞-gram sets and causes a memory explosion problem [7]. This further causes an enormous number of 𝑞-grams to be transported from memory to the CPU. Also, the memory access of the verification phase takes an average of 63.9% of the response time across the five datasets, while the CPU execution only takes 36.1%. That is because the verification phase constructs millions of intermediate ED matrices, transported from the memory to the processor, to compute the ED between candidate strings and the query string. In this context, many off-chip data transmissions happen, which greatly increases the memory access latency. The massive off-chip data transmission
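As a minimal sketch of Equation (1) in Section 2.1, the following plain-Python code builds the comparison matrix (CM) and the ED matrix (EM) for two strings. It is a software illustration only, not the ReRAM mapping developed later in the paper; the function names `comparison_matrix`, `edit_matrix`, and `edit_distance` are assumptions introduced for this example.

```python
def comparison_matrix(s, t):
    """Build the CM M: borders hold i and j, inner cells are 1 on mismatch."""
    m = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        m[i][0] = i
    for j in range(len(t) + 1):
        m[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            m[i][j] = 1 if s[i - 1] != t[j - 1] else 0
    return m


def edit_matrix(s, t):
    """Build the EM M' from the CM using the three-way minimum of Equation (1)."""
    cm = comparison_matrix(s, t)
    em = [row[:] for row in cm]  # the borders of M and M' are identical
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            em[i][j] = min(
                em[i - 1][j] + 1,             # deletion
                em[i][j - 1] + 1,             # insertion
                em[i - 1][j - 1] + cm[i][j],  # match or substitution
            )
    return em


def edit_distance(s, t):
    """ed(s, t) is the bottom-right cell of the EM."""
    return edit_matrix(s, t)[len(s)][len(t)]
```

For the strings of Fig. 1, `edit_distance("cab", "abc")` evaluates the bottom-right cell 𝑀′[3, 3]. The nested loops make the O(𝑁²) cost of verification explicit, which is why a filtering phase is applied first.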

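The 𝑞-gram multi-set and the count filter of Equation (2) in Section 2.1 can be sketched in software as follows. This is an illustrative Python version under the paper's definition of 𝑞-grams (sub-strings truncated at both ends of the string); the helper names `qgrams`, `common_qgrams`, and `passes_count_filter` are assumptions, not names from the paper.

```python
from collections import Counter


def qgrams(s, q):
    """Return the |s| + q - 1 q-grams of s, truncated at both ends of the string."""
    return [s[max(k, 0):k + q] for k in range(-(q - 1), len(s))]


def common_qgrams(s, t, q):
    """Size of the multi-set intersection |Qs ∩ Qt|."""
    overlap = Counter(qgrams(s, q)) & Counter(qgrams(t, q))  # per-gram minimum count
    return sum(overlap.values())


def passes_count_filter(s, t, q, theta):
    """Equation (2): keep t as a candidate only if enough q-grams are shared."""
    return common_qgrams(s, t, q) >= max(len(s), len(t)) + q - 1 - q * theta
```

Strings that fail `passes_count_filter` are discarded without computing their edit distance; only the surviving candidates go on to verification.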
Figure 3: ReSMA memory architecture

Figure 4: (a) The data mapping of the ReCAM filter, (b) an example comparison matrix, (c) the rotated comparison matrix, and (d) the comparison matrix in the crossbar

in both filtering and verification motivates us to design a novel PIM-featured ASM solution to accelerate both filtering and verification algorithms without any off-chip data transfer overhead.

3 RESMA

3.1 Overview

As shown in Fig. 3 (A), ReSMA is comprised of several Tiles, and each Tile contains a filters group (FG) (C) to process the 𝑞-gram filtering and an ED processing elements (PEs) group (PG) (D) to compute the ED. The Data Transfer Controller (DTC) (B) loads the datasets from the off-chip Solid State Disk (SSD) to the ReCAM filters via the Peripheral Component Interconnect express (PCIe) bus. When a query string arrives, ReSMA uses the FG to select candidate strings without off-chip data access. These candidate strings are transported to the PG via the DTC (on-chip bandwidth) to compute the ED using highly parallel MVM operations, and the result is then compared with 𝜃. Finally, ReSMA reads out the strings with ed(𝑠, 𝑡) ≤ 𝜃 as the query results and stores them back to the off-chip SSD via the DTC.

3.2 Filtering Phase

As Fig. 3 (E) shows, a ReCAM filter (RF) contains a register (REG), a simple arithmetic and logical unit (sALU), a MAT, and a Controller-F (CTRL-F). REG is used to store the query strings and the ED threshold 𝜃. The sALU computes the 𝑞-gram sets of the query strings. As shown in (F), the MAT contains many ReCAM arrays, which store and filter the string datasets. The CTRL-F is responsible for generating the control signals for all components in one filter. (G) further gives a sketch of the ReCAM array used in our filters.

Assume that 8 bits are used for storing the ASCII value of each character. As Fig. 4(a) shows, we divide ReCAM arrays into column-vectors from Vector_1 to Vector_n. Each row of the array stores one element of each vector, and each element has two parts. The first part (marked in red) stores the digital value of one string, denoted as V𝑛.Str𝑚. The second part (marked in blue) records the number of common 𝑞-grams, named vector of score (VoS). The columns marked in yellow are left as a Buffer to store the TAG information. We store strings with the same length in the same vector to avoid memory waste. When processing a dataset with long sequences, we divide each sequence into several short strings, which can be stored in different rows. By processing all short strings in parallel, we can get the results between long sequences quickly.

The filtering phase works as follows. With the dataset stored in the array and the query string stored in the Key register, the system computes |𝑄𝑠 ∩ 𝑄𝑡| between the query string and all strings in the dataset. The current ReCAM array cannot support the intersection operation between two multi-sets. Hence, we divide 𝑄𝑠 into two sub-sets. The first sub-set 𝑄1 stores the non-repetitive 𝑞-grams. The second sub-set 𝑄2 is a two-tuple set ⟨𝑄2.𝑠𝑡𝑟, 𝑄2.𝑐𝑜𝑢𝑛𝑡⟩, storing the repeated 𝑞-grams and their repetition counts. Taking the multi-set {a, a, a, b, b, c, d} as an example, its 𝑄1 is {c, d} and its 𝑄2 is {⟨𝑎, 3⟩, ⟨𝑏, 2⟩}. We design two algorithms to separately compute |𝑄1 ∩ 𝑄𝑡| and |𝑄2 ∩ 𝑄𝑡|. By adding up |𝑄1 ∩ 𝑄𝑡| and |𝑄2 ∩ 𝑄𝑡|, we get |𝑄𝑠 ∩ 𝑄𝑡|. Finally, the system sends the corresponding score threshold 𝜏 (which is computed from the ED threshold 𝜃 according to Equation (2)) to the Key register and compares 𝜏 with the VoS to finish the 𝑞-gram filtering.

 1 Procedure FilterQ1(Q1, 𝑉𝑥)
 2   𝑉𝑥.VoS ← 0, i = 0 ;
 3   while not the last 𝑞-gram of Q1 do
 4     𝑉𝑥's Key register ← Q1[i], j = 0 ;
 5     while not the last 𝑞-gram of 𝑉𝑥 do
 6       For rows: activate DRVs ;
 7       For columns: activate the Mask register of 𝑉𝑥's j-th 𝑞-gram ;
 8       VoS of tagged rows plus 1 ;
 9       invalidate tagged rows ;
10       j = j + 1 ;
11     i = i + 1 ;
12     remove all marks (line 9) ;

Figure 5: Parallel computation algorithm of |𝑄1 ∩ 𝑄𝑡|

|𝑄1 ∩ 𝑄𝑡| Computation. Fig. 5 shows the algorithm to count the common 𝑞-grams between 𝑄1 and vector_x (𝑉𝑥). In line 2, the VoS of 𝑉𝑥 is initialized to '0'. The CTRL-F sends the first 𝑞-gram in 𝑄1 to the Key register of 𝑉𝑥 (line 4). In line 6, the system activates the DRVs. The Mask register activates the corresponding bit line (BL) and bit-not line (BNL) to perform a vector-scalar comparison between the Key register and the first 𝑞-gram of all strings in 𝑉𝑥 (line 7). If a row matches during the comparison, the matched row has a common 𝑞-gram with the query string, and this row's VoS is incremented by one (line 8). Then, the system invalidates these matched rows for the following comparisons (line 9). After comparing the first 𝑞-gram, the system iteratively compares the remaining 𝑞-grams between 𝑄1 and 𝑉𝑥 (lines 3 and 5).

|𝑄2 ∩ 𝑄𝑡| Computation. Fig. 6 introduces the algorithm for counting the common 𝑞-grams between 𝑄2 and all strings in 𝑉𝑥. Initially, the Buffer is set to '0' (line 4). The CTRL-F then sends the first element of 𝑄2 to the Key register of 𝑉𝑥 (line 5). Further, the system activates all DRVs, and the Mask register will activate the
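The split of the query's 𝑞-gram multi-set into 𝑄1 and 𝑄2 described in Section 3.2 can be illustrated with a short software sketch. It assumes 𝑞-grams are given as plain Python lists; `split_query_grams` and `common_count` are hypothetical helper names, and the code only models the arithmetic, not the ReCAM operations.

```python
from collections import Counter


def split_query_grams(grams):
    """Split a q-gram multi-set into Q1 (unique grams) and Q2 (gram -> repeat count)."""
    counts = Counter(grams)
    q1 = {g for g, c in counts.items() if c == 1}
    q2 = {g: c for g, c in counts.items() if c > 1}
    return q1, q2


def common_count(query_grams, target_grams):
    """|Qs ∩ Qt| obtained as |Q1 ∩ Qt| + |Q2 ∩ Qt|."""
    q1, q2 = split_query_grams(query_grams)
    target = Counter(target_grams)
    part1 = sum(1 for g in q1 if target[g] > 0)               # each unique gram counts once
    part2 = sum(min(c, target[g]) for g, c in q2.items())     # repeated grams count up to c
    return part1 + part2
```

Because every gram of 𝑄1 occurs once in the query, a single match per target string is enough for it, while each gram of 𝑄2 contributes at most its repetition count. The sum therefore equals the ordinary multi-set intersection size.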

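The FilterQ1 procedure of Fig. 5 can be emulated in software to see what each line contributes. The sketch below is an assumption-laden model: every ReCAM row is represented as a Python list of 𝑞-grams, the vector-scalar comparison is a loop over rows, and the "invalidate" marks are a boolean list. The name `filter_q1` and the data layout are illustrative, not taken from the paper.

```python
def filter_q1(q1, rows):
    """Emulate Fig. 5: count, per row, how many grams of Q1 occur in that row.

    q1   -- iterable of q-grams that occur exactly once in the query
    rows -- list of q-gram lists, one list per stored string (one ReCAM row each)
    """
    vos = [0] * len(rows)                                # line 2: VoS <- 0
    slots = max((len(r) for r in rows), default=0)
    for key in q1:                                       # lines 3-4: load the Key register
        valid = [True] * len(rows)                       # all rows start unmarked
        for j in range(slots):                           # line 5: walk the q-gram slots
            for r, row in enumerate(rows):               # lines 6-7: compare slot j of every row
                if valid[r] and j < len(row) and row[j] == key:
                    vos[r] += 1                          # line 8: VoS of tagged rows plus 1
                    valid[r] = False                     # line 9: invalidate tagged rows
        # line 12: the marks are dropped when `valid` is rebuilt for the next key
    return vos
```

In this model the invalidation step ensures that a row containing the same 𝑞-gram several times is still counted only once for a 𝑄1 element, matching the multi-set semantics. In the real array the inner loop over rows is a single parallel comparison.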
 1 Procedure FilterQ2(Q2, 𝑉𝑥)
 2   i = 0 ;
 3   while not the last 𝑞-gram of Q2 do
 4     B (Buffer) ← 0, j = 0 ;
 5     Send Q2.str[i] and Q2.count[i] to 𝑉𝑥's Key register ;
 6     while not the last 𝑞-gram of 𝑉𝑥 do
 7       For rows: activate DRVs ;
 8       For columns: activate the Mask register of 𝑉𝑥's j-th 𝑞-gram ;
 9       Buffer of tagged rows plus 1 ;
10       For rows: activate DRVs ;
11       For columns: activate Mask register of the Buffer ;
12       Add tagged rows' Buffer to their VoS ;
13       invalidate tagged rows ;
14       j = j + 1 ;
15     i = i + 1 ;
16     Remove all marks (line 13) ;
17     Add all Buffers to their VoS ;

Figure 6: Parallel computation algorithm of |𝑄2 ∩ 𝑄𝑡|

Figure 7: (a) The ReRAM array state when the first layer is processed, (b) computations for the second layer, and (c) computations for the last layer

Table 1: Datasets Settings
Dataset      # of strings   Max. Length   Avg. Length   q
Author       29.48 M        53            24.3          2
Title        42.37 M        129           47.9          2
Actor        17.16 M        77            21.2          2
DNA          35.87 M        100           100           5
Dictionary   0.15 M         30            8.77          2
Tweets       61.8 M         137           67.65         3

corresponding BL and BNL to compare the first 𝑞-gram between 𝑄2 and 𝑉𝑥 (lines 7 and 8). Afterward, the Buffer of the tagged rows is incremented by one to record a common 𝑞-gram (line 9). Since all 𝑞-grams in 𝑄2 occur more than once, the arrays apply row and column signals again to compare the value in the Buffer with 𝑄2.𝑐𝑜𝑢𝑛𝑡[𝑖] (lines 10 and 11). Note that the rows matched in lines 10 and 11 cannot have more common 𝑞-grams. Therefore, the system adds the tagged rows' Buffer to their VoS to update the number of common 𝑞-grams in line 12. The whole procedure is iterative (lines 3 and 6). The system removes all marks after processing one 𝑞-gram, which is needed for the next iteration (line 16). Finally, the VoS of all un-tagged rows are updated (line 17).

3.3 Verification Phase

As shown in Fig. 3 (H), one ReRAM PE (RP) contains a register (REG-P), a comparator (CMP), an address mapping table (AMT), a ReRAM string element (SE), and a Controller-P (CTRL-P). The REG-P is used to buffer the candidate strings received from the filtering phase via the DTC (on-chip bandwidth). The CMP is used to calculate the comparison matrix (CM) (according to the calculation of 𝑀[𝑖, 𝑗] in Section 2.1). After one CM is generated, we map this comparison matrix to the ReRAM array and use the AMT to record the addresses of the mapped cells. The SE (I) contains many ReRAM crossbars (XB) and analog-to-digital converters (ADCs) to compute the ED. Every eight XBs share one ADC for area saving. If necessary, several neighboring arrays can work as a holistic crossbar array to process long strings (note that these neighboring arrays do not need communication). Each crossbar contains drivers (DRVs), a MIN comparator, and a sample and hold (S/H) unit (Fig. 3 (J)). The CTRL-P is used to control all components in one PE.

Fig. 4(b) shows a comparison matrix of "ab" and "cd". The processing of this CM is anti-diagonal, but the MVM operation is row- and column-parallel. To compute the ED using the ReRAM array, we propose a new data mapping strategy, as Fig. 4(d) shows. First, we rotate the CM from Fig. 4(b) to Fig. 4(c). Then, we map the 3×3 CM in Fig. 4(c) to a 5×5 ReRAM crossbar, as the red-framed rectangle in Fig. 4(d) shows. Our mapping strategy transforms the anti-diagonal parallelism into row- and column-parallelism. The last two rows of the ReRAM array are set to repeating '01' and '10' sequences to perform the 'plus 1' operation in Equation (1).

The key idea of our verification algorithm is to perform parallel processing for cells in the same layer and sequential processing for different layers from top to bottom. Fig. 7 shows an example to introduce the details of the edit distance computation between "ab" and "cd". In Fig. 7(a), the system computes the red cell's ED first. The row-select and column-select signals are shown as brown triangles. The MVM operation is performed on the selected cells (marked with a blue-dotted rectangle). The S/H unit holds '2', '1', and '2' as the results of the MVM operation. The system chooses the minimum (here '1') among the three values, which is written back to the red cell. In Fig. 7(b), the system computes the ED of the second layer, which is also marked in red. The row-select and column-select signals are again shown as brown triangles. The results of the MVM operation flow to the S/H unit, and the system chooses the minimum among these values according to Equation (1). Hence, '2' and '2' are the EDs of this layer and are written back to the two red cells. Fig. 7(c) shows the last layer's computation. The S/H unit holds '3', '2', and '3', the minimum among which is '2'. '2' is the ED of the last layer, which is also the final ED between "ab" and "cd".

4 EXPERIMENTAL EVALUATION

4.1 Experimental Setup

ReSMA Configurations. We employ a cycle-accurate simulator written in Python for modeling the filtering phase. For the verification phase, we obtain the time and energy consumption by using NVSim [17]. We use the 1000GB/s On-Chip Interconnect (OCI) for the on-chip transfer bandwidth. ReSMA's energy consumption is obtained with 7pJ per bit as in [18].

The ReSMA configurations are summarized in Table 2. We configure ReSMA with 32 Tiles, and each Tile includes a Filters Group and a PEs Group. Each ReCAM filter consists of 477 512×512 crossbars configured as ReCAM arrays while a ReRAM PE has 32 2-bits 256×256
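The FilterQ2 procedure of Fig. 6 can likewise be emulated in software. The sketch below reuses the same modeling assumptions as a row-wise software view of the ReCAM array: each row is a Python list of 𝑞-grams, the Buffer is a per-row counter, and invalidation is a boolean mark. The name `filter_q2` and the dictionary representation of 𝑄2 are illustrative assumptions.

```python
def filter_q2(q2, rows):
    """Emulate Fig. 6: per row, add min(count in row, count in query) for each gram of Q2.

    q2   -- dict mapping each repeated query q-gram to its repetition count (>= 2)
    rows -- list of q-gram lists, one list per stored string (one ReCAM row each)
    """
    vos = [0] * len(rows)
    slots = max((len(r) for r in rows), default=0)
    for gram, count in q2.items():                       # lines 3 and 5: load str and count
        buf = [0] * len(rows)                            # line 4: Buffer <- 0
        valid = [True] * len(rows)
        for j in range(slots):                           # line 6: walk the q-gram slots
            for r, row in enumerate(rows):
                if valid[r] and j < len(row) and row[j] == gram:   # lines 7-8
                    buf[r] += 1                          # line 9: Buffer of tagged rows plus 1
                    if buf[r] == count:                  # lines 10-11: Buffer equals count
                        vos[r] += buf[r]                 # line 12: add Buffer to VoS
                        valid[r] = False                 # line 13: invalidate the row
        for r in range(len(rows)):                       # lines 16-17: flush un-tagged rows
            if valid[r]:
                vos[r] += buf[r]
    return vos
```

A row stops counting once its Buffer reaches the query's repetition count, so each row contributes the smaller of the two counts. Adding the result to the FilterQ1 score gives the full |𝑄𝑠 ∩ 𝑄𝑡| per row.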

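The layer-by-layer verification order of Section 3.3 (cells of one anti-diagonal layer in parallel, layers processed sequentially) can be sketched in software as below. This models only the dependency structure of Equation (1), not the rotated crossbar mapping or the MVM and S/H hardware; `layered_edit_distance` is a hypothetical name.

```python
def layered_edit_distance(s, t):
    """Compute ed(s, t) one anti-diagonal layer at a time; return (ed, per-layer values)."""
    m, n = len(s), len(t)
    em = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        em[i][0] = i
    for j in range(n + 1):
        em[0][j] = j
    layers = []
    for d in range(2, m + n + 1):                        # layer d holds cells with i + j == d
        cells = [(i, d - i) for i in range(max(1, d - n), min(m, d - 1) + 1)]
        if not cells:
            continue
        # Every cell reads only layers d-1 and d-2, so the cells of one layer are independent.
        values = [
            min(em[i - 1][j] + 1,
                em[i][j - 1] + 1,
                em[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
            for i, j in cells
        ]
        for (i, j), v in zip(cells, values):             # write the whole layer back at once
            em[i][j] = v
        layers.append(values)
    return em[m][n], layers
```

For "ab" and "cd", the per-layer values reproduce the sequence walked through for Fig. 7: one cell in the first layer, two in the second, and the final ED in the last layer.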
Table 2: ReSMA Configurations
Component     Area (mm²)   Power (mW)   Params.         Spec.
RF properties (16 RFs per FG)
XB Array      0.262        273.47       Bits per Cell   1
                                        Size            512 × 512
                                        Total           477
Key/Mask      0.0109       0.608        Total           477
DRV           0.0132       54.85        Total           477 × 512
CTRL-F        0.0014       0.391        Total           1
TAG           0.018        4.38         Total           477 × 512
REG           0.0007       0.221        Size            256B
sALU          0.0054       0.814        Total           1
RF Total      0.312        334.74       Size            7.81MB
RP properties (16 RPs per PG)
ADC           0.016        11.84        Resolution      8 Bits
                                        Total           4
XB Array      0.0096       22.04        Bits per Cell   2
                                        Size            256 × 256
                                        Total           4 × 8
S/H           0.00032      0.074        Total           32 × 256
MIN           0.00051      2.19         Inputs          3
                                        Total           32 × 128
DRV           0.00088      6.36         Total           32 × 512
CTRL-P        0.0012       0.357        Total           1
REG-P         0.0075       4.47         Size            8KB
AMT           0.0046       2.03         Size            4KB
CMP           0.000031     0.0019       Inputs          2
                                        Total           3
RP Total      0.041        49.36        Size            269KB
Tile properties (32 Tiles for ReSMA)
FG            4.99         5,356.21     RFs             16
                                        Size            124.97MB
PG            0.65         789.81       RPs             16
                                        Size            4.31MB
ReSMA properties
Tiles Total   180.64       196.67K      Total           32
DTC           2.17         447.32       Total           1
ReSMA         182.88       197.15K      Size            4.16GB

crossbars configured as ReRAM arrays. ReSMA crossbars are designed under the 32nm process with a 533MHz clock frequency [13]. Based on TaO𝑥 ReRAM cells from [13], we conduct SPICE simulation for the crossbar configuration (1T1R for ReCAM and 1D1R for ReRAM). We use CACTI 6.5 in 32nm technology to evaluate the power and area of all registers (REG, REG-P, and AMT). To obtain the characteristics of the components, we use Cadence-simulator [20] for the DRV, S/H, and Key/Mask in a crossbar and adopt the ADC with 8-bit resolution and 750MS/s sampling rate from [16]. The crossbars are read and written in a row-parallel manner [19]. The area and energy of the sALU, MIN, TAG, CMP, and CTRLs are established by SPICE circuits.

Datasets. We use six real string datasets from different domains. Author and Title are extracted from DBLP, containing the names of authors and publications. Actor contains the names of actors from IMDB. DNA is extracted from the Sequence Read Archive [15]. Tweets is a dataset of tweets from Twitter with special characters and figures removed. Dictionary is all the words from an English dictionary. We generate 10,000 query strings through random sampling for each dataset above. The detailed statistics of the six datasets are summarized in Table 1. We set 𝑞 = (log 10 + log |𝑄|)/log |Γ| as proposed in [8], where |𝑄| is the average 𝑞-gram set size and |Γ| is the alphabet size of characters. We assign the typical 𝜃 values 2, 4, 6, 8, 10, and 12 according to the length of the query strings, corresponding to query string lengths of <26, 26-50, 51-75, 76-100, 101-125, and >125, respectively.

Figure 8: Energy consumption of ReSMA against other platforms (y-axis: energy cost (J), log scale; x-axis: DNA, Author, Actor, Dict., Title, Tweets; series: CPU, GPU, FPGA, ASIC, PIM, ReSMA)

Figure 9: Response time of ReSMA against other platforms (y-axis: response time (ms), log scale; x-axis: DNA, Author, Actor, Dict., Title, Tweets; series: CPU, GPU, FPGA, ASIC, PIM, ReSMA)

We use the elapsed time as the performance metric and energy consumption as the energy metric. All results are total execution times of running 10,000 query strings, including filtering and verification times. We assume an in-memory dataset, i.e., all strings are pre-loaded into memory. So the off-chip SSD data pre-load time is not included in the elapsed time.

Platforms. We choose the new hash-based algorithm proposed in [8] as the CPU baseline, which is run on a machine running Linux with Intel Xeon E5-2640v4 [email protected], 15MB L3 Cache, 128GB memory, and 95Watt TDP. For the GPU platform, we choose the new generic inverted index framework GENIE [5], which is run on an NVIDIA RTX 3060@1780MHz with 3584 Stream Processors, 12 GB Graphic Memory, and 170Watt TDP. Cinti et al. propose a novel FPGA-based algorithm for Online Approximate String Matching (OASM) [6]. We choose their HW-OASM implementation as the FPGA platform. Tandon et al. propose a custom hardware accelerator for similarity measurement [4], aiming to reduce the energy consumption of ASM acceleration, which is our ASIC comparison platform. BioSEAL is a new PIM-based accelerator for large-scale genomic data [11]. We extend BioSEAL (with architecture and algorithm unchanged) to support non-DNA sequences since the authors evaluate DNA datasets only.

4.2 Results and Analysis

Energy Efficiency. The energy consumption results are shown in Fig. 8. ReSMA processes all datasets with less energy than all other platforms. Across all six workloads, ReSMA achieves an average of 153.8×, 42.2×, 31.6×, 18.3×, and 5.3× energy saving against the CPU-, GPU-, FPGA-, ASIC-, and PIM-based platforms. The energy saving over the non-PIM platforms is due to the significant reduction in data movement. Compared with the PIM-based BioSEAL, we design a new ReCAM filtering architecture, which avoids computing the ED of many strings by performing a more energy-saving filtering phase.

Performance. Fig. 9 shows the elapsed time results. ReSMA has significant performance improvement against state-of-the-art platforms. Across all six workloads, ReSMA achieves an average of 268.7×, 38.6×, 20.9×, 707.8×, and 14.7× speedup against the CPU-, GPU-, FPGA-, ASIC-, and PIM-based platforms. The ASIC platform performs even worse than the CPU baseline because the specialized architecture adopted in [4] was developed early, with hardware capability much weaker than that of the state-of-the-art CPU baseline. ReSMA performs better than CPU, GPU, and FPGA platforms since

Figure 10: Response time with different DNA dataset sizes (y-axis: response time (ms), log scale; x-axis: 20%, 40%, 60%, 80%, and full DNA dataset; series: CPU, GPU, FPGA, ASIC, PIM, ReSMA)

Figure 11: Response time with various DNA dataset lengths (y-axis: response time (ms), log scale; x-axis: DNA, 2×, 4×, 6×, 8×; series: CPU, GPU, FPGA, ASIC, PIM, ReSMA)

we avoid generating intermediate data and the frequent off-chip memory accesses in the filtering phase. Another reason is that we design the ReRAM PE for the verification phase, which can compute the ED in linear time while avoiding lots of off-chip data transfers. ReSMA also performs better than the PIM-based platform BioSEAL. That is because BioSEAL cannot support in-memory filtering, and therefore it relies on off-chip filters and introduces massive off-chip data access overhead. Furthermore, we use ReCAM's vector-scalar comparison to design a filtering algorithm and minimize memory use for reading and writing operations. In contrast, BioSEAL has frequent data read and write operations, which are bandwidth-bounded for ReCAM arrays. Note that, when processing the DNA dataset, ReSMA achieves a 10.9× speedup over BioSEAL. But when processing the other non-DNA datasets, ReSMA achieves an average 15.5× speedup over BioSEAL. These results reveal that a straightforward extension of a DNA-specific accelerator to support non-DNA se-

5 CONCLUSION

This paper introduces a novel PIM-featured ASM accelerator, namely ReSMA, based on ReCAM- and ReRAM-arrays. Following the filter-and-verify paradigm, we analyze the in-situ ASM challenges when adopting the ReCAM array to perform 𝑞-gram filtering. We, therefore, design ReCAM-friendly filters and corresponding algorithms to solve these challenges. We also present a new data mapping strategy and a new edit distance computation algorithm, enabling ReRAM crossbars to compute edit distances effectively and efficiently. Our experimental results show that ReSMA outperforms the contemporary CPU-, GPU-, FPGA-, ASIC-, and PIM-based platforms significantly in terms of performance and energy-saving.

REFERENCES
[1] S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," In Proceedings of ICDE'06, pp. 5–5, 2006.
[2] M. Krallinger, O. Rabal, A. Lourenco, J. Oyarzabal, and A. Valencia, "Information Retrieval and Text Mining Technologies for Chemistry," Chemical Reviews, vol. 117, no. 12, pp. 7673–7761, 2017.
[3] D. S. Cali, G. S. Kalsi, Z. Bingol, C. Firtina, L. Subramanian, J. S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand, A. Nori, A. Scibisz, S. Subramoney, C. Alkan, S. Ghose, and O. Mutlu, "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis," In Proceedings of MICRO'20, pp. 951–966, 2020.
[4] P. Tandon, V. Qazvinian, J. Chang, P. Ranganathan, R. G. Dreslinski, and T. F. Wenisch, "Hardware acceleration for similarity measurement in natural language processing," In Proceedings of ISLPED'13, pp. 409–414, 2013.
[5] J. Zhou, Q. Guo, H. Jagadish, L. Krcal, S. Liu, W. Luan, A. K. Tung, Y. Yang, and Y. Zheng, "A generic inverted index framework for similarity search on the GPU," In Proceedings of ICDE'18, pp. 893–904, 2018.
[6] A. Cinti, F. M. Bianchi, A. Martino, and A. Rizzi, "A novel algorithm for online inexact string matching and its FPGA implementation," Cognitive Computation, vol. 12, no. 2, pp. 369–387, 2020.
[7] W. Lu, X. Du, M. Hadjieleftheriou, and B. C. Ooi, "Efficiently Supporting Edit Distance Based String Similarity Search Using B+-Trees," IEEE TKDE, vol. 26, no. 12, pp. 2983–2996, 2014.
[8] H. Wei, J. X. Yu, and C. Lu, "String similarity search: A hash-based approach,"
quences will suffer great performance loss. IEEE TKDE, vol. 30, no. 1, pp. 170–184, 2017.
[9] S. Angizi, J. Sun, W. Zhang, and D. Fan, "AlignS: A Processing-In-Memory Accel-
erator for DNA Short Read Alignment Leveraging SOT-MRAM," In Proceedings
4.3 Scalability Study of DAC’19, pp. 1–6, 2019.
We first study the scalability with respect to the number of strings [10] W. Huangfu, S. Li, X. Hu, and Y. Xie, "RADAR: A 3D-ReRAM based DNA Align-
ment Accelerator Architecture," In Proceedings of DAC’18, pp. 1–6, 2018.
as Fig. 10 shows. We set 𝜃 = 10 in this test and randomly sample [11] R. Kaplan, L. Yavits, and R. Ginosasr, "BioSEAL: In-Memory Biological Sequence
20%, 40%, 60%, 80%, and 100% strings from 𝐷𝑁 𝐴 dataset. In the Alignment Accelerator for Large-Scale Genomic Data," In Proceedings of SYS-
20% situation, ReSMA has only 181.7×, 25.9×, 11.6×, 530.2×, and TOR’20, pp. 36–48, 2020.
[12] L. Yavits, S. Kvatinsky, A. Morad, and R. Ginosar, "Resistive Associative Processor,"
8.7× speedup against the CPU-, GPU-, FPGA-, ASIC-, and PIM- IEEE Computer Architecture Letters, vol. 14, no. 2, pp. 148–151, 2015.
based platforms. When the setting grows to 100%, ReSMA offers [13] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of Cross-
point Metal-oxide ReRAM Emphasizing Reliability and Cost," In Proceedings of
significant speedups by 545.5×, 77.2×, 34.9×, 1590.9×, and 14.7× ICCAD’13, pp. 17–23, 2013.
against these platforms. As the number of strings grows, ReSMA [14] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Sri-
performs better than all other platforms. That is because ReSMA is vastava, "Approximate string joins in a database (almost) for free,” In Proceedings
of VLDB’01, pp. 491–500, 2001.
a PIM-based architecture. More memory indicates more Tiles for [15] X. Yang, Y. Wang, B. Wang, and W. Wang, "Local filtering: Improving the per-
in-situ processing in parallel. formance of approximate queries on string collections," In Proceedings of SIG-
We also study the scalability concerning the string length with 𝜃 MOD’15, pp. 377–392, 2015.
[16] Y. C. Lien, "A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC
= 20. To generate a long-length string, we randomly select 𝑘 (e.g., 2, in 28-nm CMOS technology," In Proceedings of VLSIC’12, pp. 88–89, 2012.
4, 6, 8) strings from 𝐷𝑁 𝐴 and merge the 𝑘 strings to be a new long [17] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "Nvsim: A circuit-level performance,
energy, and area model for emerging nonvolatile memory," IEEE TCAD, vol. 31,
string. As shown in Fig. 11, ReSMA performs better than all other no. 7, pp. 994–1007, 2012.
platforms as the string length grows. In the 𝐷𝑁 𝐴 dataset, ReSMA [18] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie,
offers 545.5×, 77.2×, 34.9×, 1590.9×, and 14.7× speedups against "Hygcn: A GCN accelerator with hybrid architecture," In Proceedings of HPCA’20,
pp. 15–29, 2020.
CPU-, GPU-, FPGA-, ASIC-, and PIM-based platforms. In particular, [19] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y.
the case of 𝑘 = 8 offers even 2066.8×, 280.6×, 129.2×, 5516.4×, and Xie, "Overcoming the challenges of crossbar resistive memory architectures," In
32.4× speedups against these platforms. All other platforms have Proceedings of HPCA’15, pp. 476–488, 2015.
[20] O. Krestinskaya, I. Fedorova, and A. P. James, "Memristor load current mirror
well-optimized linear scalability to the string length. In ReSMA, a circuit," In Proceedings of ICACCI’15, pp. 538–542, 2015.
long string is divided into several short strings, and we can process
short strings in parallel to make full use of ReCAM’s parallelism.
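As a supplementary illustration of the workload setup used in the evaluation, the threshold assignment by query length and the long-string construction of the scalability study can be sketched in software as follows. This is only a minimal sketch under stated assumptions, not the ReSMA implementation: the function names (threshold_for_length, make_long_string, split_into_segments) are hypothetical, the 𝑘 strings are assumed to be drawn without replacement and merged by plain concatenation, and the division of a long string into short strings is assumed to be a fixed-length, non-overlapping split, since the text does not specify these details.

```python
import random
from bisect import bisect_left

# Inclusive upper bounds of the query-length buckets (<26, 26-50, 51-75,
# 76-100, 101-125); anything longer falls into the last bucket (>125).
LENGTH_UPPER_BOUNDS = [25, 50, 75, 100, 125]
THRESHOLDS = [2, 4, 6, 8, 10, 12]


def threshold_for_length(query_length):
    """Return the edit-distance threshold assigned to a query of this length."""
    if query_length < 0:
        raise ValueError("query_length must be non-negative")
    # bisect_left finds the first bucket whose upper bound is >= query_length.
    return THRESHOLDS[bisect_left(LENGTH_UPPER_BOUNDS, query_length)]


def make_long_string(dataset, k, rng):
    """Merge k randomly selected strings into one long string.

    Assumption: selection is without replacement and merging is plain
    concatenation; the text does not specify either detail.
    """
    return "".join(rng.sample(dataset, k))


def split_into_segments(text, segment_length):
    """Divide a long string into short segments.

    Assumption: a fixed-length, non-overlapping split (hypothetical); the
    last segment may be shorter than segment_length.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [text[i:i + segment_length]
            for i in range(0, len(text), segment_length)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    toy_dataset = ["ACGT", "TTGA", "CCCA", "GGTA"]  # toy data, not the DNA dataset
    long_string = make_long_string(toy_dataset, 2, rng)
    print(threshold_for_length(len(long_string)))
    print(split_into_segments(long_string, 3))
```

In this sketch each short segment could be handed to an independent worker, which mirrors, in software only, the idea of processing the short strings in parallel.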
