
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 39, NO. 11, NOVEMBER 2020

ReSQM: Accelerating Database Operations Using ReRAM-Based Content Addressable Memory

Huize Li, Graduate Student Member, IEEE, Hai Jin, Fellow, IEEE, Long Zheng, Member, IEEE, and Xiaofei Liao, Member, IEEE

Abstract—The huge amount of data exerts great pressure on the processing efficiency of database systems. By leveraging the in-situ computing ability of emerging nonvolatile memory, processing-in-memory (PIM) technology shows great potential in accelerating database operations against traditional architectures without data movement overheads. In this article, we introduce ReSQM, a novel ReCAM-based accelerator, which can dramatically reduce the response time of database systems. The key novelty of ReSQM is that some commonly used database queries that would otherwise be processed inefficiently in previous studies can be accomplished in situ with massively high parallelism by exploiting the PIM-enabled ReCAM array. ReSQM supports some typical database queries (such as SELECTION, SORT, and JOIN) effectively based on the limited computational mode of the ReCAM array. ReSQM is also equipped with a series of hardware-algorithm co-designs to maximize efficiency. We present a new data mapping mechanism that enables in-situ in-memory computations for SELECTION operating upon intermediate results. We also develop a count-based ReCAM-specific algorithm to enable in-memory sorting without any row swapping. The relational comparisons are integrated for accelerating inequality join by making a few modifications to the ReCAM cells with negligible hardware overhead. The experimental results show that ReSQM can improve the (energy) efficiency by 611× (193×), 19× (17×), 59× (43×), and 307× (181×) in comparison to a 10-core Intel Xeon E5-2630v4 processor for SELECTION, SORT, equi-join, and inequality join, respectively. In contrast to state-of-the-art CMOS-based CAM, GPU, FPGA, NDP, and PIM solutions, ReSQM can also offer 2.2×–39× speedups.

Index Terms—Content addressable memory (CAM), database query, nonvolatile memory, processing-in-memory (PIM).

Manuscript received April 16, 2020; revised June 12, 2020; accepted July 6, 2020. Date of publication October 2, 2020; date of current version October 27, 2020. This work was supported by the National Natural Science Foundation of China under Grant 61832006, Grant 61702201, Grant 61825202, and Grant 61929103. This article was presented in the International Conference on Hardware/Software Codesign and System Synthesis 2020 and appears as part of the ESWEEK-TCAD special issue. (Corresponding author: Long Zheng.) The authors are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 1037, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TCAD.2020.3012860. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

IN THE big data era, modern enterprise data and Internet traffic have been exploding exponentially, with a per-year growth amount that exceeds the total amount of data in the past years [1]. That exerts tremendous pressure on the existing database systems in a wide variety of data-intensive applications (such as biodiversity research [2]) with real-time data analytics demand, such that the response time of database operations must be much faster than ever before.

A wealth of the existing database systems are built upon the CPU [3], [27], [32], [33], which, however, can hardly satisfy the low-latency requirement due to its limited computational parallelism [8]. Alternatively, some efforts have been made in accelerating database operations with dedicated hardware. For instance, traditional CMOS-based content addressable memory (CAM) has been developed as a coprocessor for the CPU to achieve data-parallel computing for multiple database operations. However, it still relies on the CPU to manage data transfer between the CAM and main memory. In addition, due to the well-known scalability problem of CMOS transistors, the computing ability of the CMOS-based CAM often suffers greatly in practice [9], [10]. Many studies leverage the massive parallelism of GPUs [11], [12], [14], [15] (or FPGAs [16], [18]) for (energy) efficiency improvement. Nevertheless, because of the separate computation-storage hierarchy that follows the von Neumann architecture, these earlier studies suffer from the "memory wall" problem.

To address the above problem, near-data processing (NDP) integrates processing units into the memory or storage. Although significant data movement can be reduced by an NDP accelerator, such designs still suffer from computing-ability-limited logic units in memory with considerable integration cost [19]–[21], [31]. Processing-in-memory (PIM) technology provides a promising way with in-situ computing ability and massive parallelism. Sun et al. [22] presented a first PIM-enabled design to accelerate SQL query operations based on resistive random access memory (ReRAM). They exploit the bipolar structure characteristic of the ReRAM crossbar and present a hybrid of columnwise and row-wise dot-product computations. Since the SELECTION operation contains some inherent comparison semantics that ReRAM does not support, they attach a simple peripheral scalar comparison unit to each row of the ReRAM crossbar. This PIM-featured approach can offer orders-of-magnitude energy efficiency gains over the traditional architecture, but its practicability still suffers. It is extremely difficult, if not impossible, for their approach to area-efficiently support complex but important database operations, such as SORT and JOIN, which can involve several million times more comparisons than SELECTION for even a moderately sized database [4]. Yet, different
database operations involve different peripheral circuit layouts, making their design extraordinarily complex.

Recently, there emerges ReRAM-based content addressable memory (ReCAM), which takes the best of both worlds of nonvolatile ReRAM [34], [35] and specialized CAM hardware, with large capacity and the PIM feature [24]. In addition to scalar comparison, ReCAM is also naturally capable of making comparisons at a vector granularity, also known as vector–scalar comparison, at a time with higher parallelism. ReCAM is promising to enable in-situ in-memory computing that handles the database table for a wide variety of database operations efficiently. More importantly, the array structure of ReCAM can be intuitively regarded as a database table layout, making for easy access to data and a fast mapping onto ReCAM crossbars.

Nevertheless, exploiting ReCAM for accelerating database queries remains tremendously challenging. First, to support processing a database query, it is challenging to store and handle a lot of intermediate results. NVQuery [29] presents the first ReCAM-based accelerator for database operations. However, in order to obtain the final results of a query, NVQuery often relies on the main processor to process the intermediate results. Therefore, substantial data movements can be incurred between the processor and the ReCAM array, limiting the overall efficiency. Second, since ReCAM functions as both storage and processing units, the raw data of the database table in ReCAM must be kept consistent without data pollution for subsequent operations. This requirement may potentially suppress the efficiency of many database operations, such as SORT, which often involves (substantial) data reordering (if not carefully designed). Besides, vector–scalar comparison in ReCAM can compute only the equality between a given number and every element in a vector, affecting the applicability to handle some database operations, such as inequality join, which needs to know the relativity [7].

In this article, we make the following contributions.
1) We identify that the existing PIM-based database-oriented accelerators can support only a subset of database operations. None can support SELECTION, SORT, and JOIN queries simultaneously. Yet, these existing studies also typically rely on the main processor to assist the PIM architecture in handling a lot of intermediate results, which can become a bottleneck limiting the overall efficiency. To the best of our knowledge, ReSQM is the first ReCAM-based architecture that can process various database queries in memory effectively and efficiently without the assistance of a CPU processor.
2) We develop a series of hardware-algorithm co-designs to improve the efficiency of performance acceleration on different database operations. For SELECTION, we present a new data mapping mechanism that enables in-situ in-memory computations of the SELECTION query operating upon intermediate results for performance acceleration. For SORT, we develop a count-based ReCAM-specific algorithm to enable in-memory sorting. For inequality join, we make a slight modification to the basic ReCAM cell to support the relational comparison with negligible hardware overhead.
3) We conduct a comprehensive evaluation. We compare ReSQM with not only the traditional CPU-based, GPU-based, FPGA-based, and CMOS-based efforts but also the emerging NDP-enabled and PIM-enabled accelerators. Results show that ReSQM outperforms the state of the art significantly.

The remainder of this article is organized as follows. Section II describes the background and motivation. Section III presents the architectural designs of ReSQM. Section IV shows the experimental results. Section V concludes the work.

II. BACKGROUND AND MOTIVATION

A. Database Operations

In this article, we mainly focus on the relational database since it is widespread in the current mainstream market. In a relational database, those records with the same attributes are called tuples. In general, each tuple is distributed row by row to form a table. Each column of the table indicates an attribute of the table. In this article, we focus on three fundamental kernels of database queries as follows.

SELECTION: The selection query aims to choose tuples by querying a table via a restricted statement, which usually contains several arithmetic expressions connected with each other using various logical operators, such as AND, OR, NAND, and NXOR. The arithmetic operators used in the arithmetic expressions may also get involved, e.g., "+," "−," "×," "=," "≠," "≤," "≥," ">," "<."

SORT: The sort query aims to reorder the tuples in an expected (e.g., ascending or descending) order according to some attributes.

JOIN: The join query aims to generate a new table using the Cartesian product of two relational attributes. In this article, we consider two typical join operations: 1) equi-join and 2) inequality join. The former indicates a join operation with a condition containing the equality operator =. The latter represents a join condition with the inequality operators, e.g., > and <.

B. ReCAM Basics

Fig. 1 illustrates the basics of ReCAM, which consists of a MASK register, a KEY register, an array of ReCAM bit-cells organized in a crossbar architecture, and TAG registers. The MASK register decides which columns will be selected to do read, write, and match operations. The KEY register stores a data word that will be used for a write or match operation. As shown in Fig. 1(a), a ReCAM bit-cell is organized with two-transistor, two-memristor (2T2R) elements with one bit line and one bit-not line. The match/word line of the ReCAM array is attached to a TAG register [Fig. 1(b)] in which each ReRAM array row is connected to a signal amplifier (SA) and a TAG latch. The TAG registers mark those matched rows that satisfy the condition of comparison. Unlike the row-oriented or column-oriented storage in a traditional memory [11], [28], the ReCAM crossbar is a natural fit to store the database table, with bit lines representing attributes and each match line showing a tuple.
Fig. 1. Basics of the ReCAM array. (a) Sketch of ReCAM bitcell. (b) TAG register organization.

Fig. 2. Common computational patterns for (a) SELECTION, (b) SORT, and (c) JOIN.

By using ReCAM, we can perform vector–scalar comparisons with massive parallelism.

As applied in [25] and [30], we use the high-resistance state (HRS) to represent logic "1" (i.e., the switch-off state), while the low-resistance state (LRS) represents logic "0" (i.e., the switch-on state). Since a ReCAM cell often uses two memristive cells to represent one logic bit, we use the "10" state of the two memristive cells to represent logic 1, and vice versa.

Vector–Scalar Comparison: Initially, the given scalar data that need to be compared are stored in the KEY register. All match lines are precharged with a high voltage, while the KEY register is set on the bit and bit-not lines. Note that the precharged signal and the signals operating upon the bit line and bit-not line of the KEY register are activated at the same time. The bit and bit-not lines of those columns that do not need to be compared are set to the low voltage by the MASK register. For each row (i.e., a vector element), if all selected bits match the given data, the corresponding precharged word line will keep its high voltage, which can be captured by the corresponding SA and also held in the TAG latch. Otherwise, if a mismatch of any one bit happens, leakage current will flow through that cell, and the voltage of the word line will drop off. Note that all per-row vector elements of the selected columns can be compared against the scalar data in parallel and finished in one cycle.

C. Related Work

GPU and FPGA Acceleration: A lot of efforts have been put into speeding up database operations based on the traditional architectures, such as GPUs and FPGAs [11], [12], [14], [15], [18]. For instance, Schaa and Kaeli [12] pointed out that the Peripheral Component Interconnect Express (PCIe) bus will also become a bottleneck on multiple GPUs unless the complete dataset can be placed in the memory of the GPU. StoreGPU proposes to accelerate several hashing-based primitives for a distributed storage system [17]. By initializing the input data in the pinned host memory, StoreGPU protects the GPU driver from an extra memory copy with reduced data transfers. Asymmetric distributed shared memory [13] is proposed to maintain a shared logical memory space for reducing the amount of data movement between the host and the accelerator. An in-memory FPGA-based architecture has been developed to accelerate table joins [16]. Compared with CPUs, these studies can provide superior results. Also, both GPU and FPGA acceleration of SQL operations can be designed with the flexibility to deal with a larger set of SQL operations, types, and column/row sizes. However, currently, GPUs and FPGAs still suffer from limited memory size such that they have to read/write through the host system from/to SSD/HDD storage with I/O bottlenecks.

NDP and PIM Accelerators: Near-data computing integrates the processing units into storage or memory to reduce data access overhead [19]–[21], [31]. Although near-data computing can improve computing efficiency by reducing data movement, it still faces several challenges. The processing ability of the computing logic integrated into the storage and memory is quite limited, and computational parallelism also suffers. Integrating logic units into stacked memory dies may also lead to a potentially high cost.

Sun et al. [22] presented the first PIM-enabled design based on ReRAM to accelerate SQL query operations. Due to the limited computational paradigm of the ReRAM array, this work can support only some operations of a SELECTION query. ReCAM has been widely used in many fields. Yavits et al. [23] replaced the last-level cache with ReCAM as an associative processor. Kaplan et al. [25] leveraged ReCAM to accelerate the Smith–Waterman algorithm for DNA sequence alignment. To the best of our knowledge, NVQuery [29] is the most related ReCAM-based work specialized for accelerating database applications. NVQuery presents a heterogeneous solution. It enables supporting some basic database operations based on ReCAM, such as nearest distance search, equi-join, and some bitwise operations. To obtain the final results of a query, NVQuery relies on the main processor to process the intermediate results. Therefore, an amount of data movement can be incurred between the processor and the ReCAM array, limiting the overall efficiency.
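The vector–scalar match primitive described in Section II-B can be modeled behaviorally in a few lines. The sketch below treats each match line as an integer word and mirrors how any single mismatching selected bit discharges the line; it is a software analogy only, not the circuit:

```python
def vector_scalar_match(rows, key, mask):
    """Behavioral model of a ReCAM vector-scalar comparison.

    rows: list of integers, one stored word per match line
    key:  integer held in the KEY register
    mask: integer whose set bits mark the columns selected by MASK
    Returns the TAG bits: True where all selected bits match.
    """
    # A row keeps its precharged (matched) state iff no selected bit
    # differs from the KEY; conceptually all rows compare in one cycle.
    return [(row ^ key) & mask == 0 for row in rows]

tags = vector_scalar_match(rows=[0b1010, 0b1110, 0b0010],
                           key=0b1010, mask=0b1111)
# tags -> [True, False, False]
```

Restricting `mask` to a subset of bit positions models comparing on a single attribute column while ignoring the rest of the tuple.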
Fig. 3. Overview of ReSQM. (a) Layout of ReSQM chip. (b) Architecture of database structure query unit. (c) Truth table for supporting arithmetic and logical operators. (d) Interconnect of DSQ mat. (e) Table layout for the ReCAM array. The pink arrow shows the workflow for the SELECTION query. The blue arrow indicates the workflow of the SORT and JOIN queries.

That is particularly true and serious for handling a large database table. ReSQM differs from NVQuery in two ways: 1) we use an in-memory reserved region to buffer the intermediate results so as to perform operations between the intermediate results and the original data in memory and 2) each database structured query (DSQ) unit self-contains the arithmetic and logic units (ALUs) and a stack register, which can function to parse a restricted expression in SELECTION queries (without the assistance of the main processor), avoiding the substantial data transfer overheads. We particularly note that the way of performing an addition operation in [29] is based on breaking it down into a series of NOR operations, which is entirely different from ours that applies a lightweight and straightforward truth table (inspired by [24] and [25]). Although both are based on ReCAM, the architectures are different, and all algorithms that drive database operations are also different.

D. Motivation

Fig. 2 shows the computational patterns of SELECTION, SORT, and JOIN. We observe that database operations often involve many different practical demands that may be beyond the vector–scalar comparison pattern of the ReCAM array. For example, in addition to comparison operators, the restricted expressions in SELECTION often involve many noncomparison operators operating upon bitwise vector–vector computing, where only elements in the same row of two vectors are needed for the computation [Fig. 2(a)]. Although SORT has the vector–scalar meta-operations [as shown in Fig. 2(b)], the existing sorting algorithms (such as radix sort and merge-sort) also involve the row swappings that ReCAM cannot support effectively. More importantly, the matching principle of ReCAM based on the leakage current mechanism can check only whether two elements are equal or not (i.e., equal comparison). However, most comparisons amongst database operations (such as inequality join) need to know more concerning which element is greater or smaller (i.e., the relational comparison).

III. ReSQM

Fig. 3(a) shows the overview of the ReSQM chip, which consists of multiple DSQ units connected through a bus that is used for receiving (sending) the (results of) queries from (to) users. Initially, we partition a database table into multiple slices such that each piece can fit into a DSQ unit. For handling an even larger table that cannot fit into ReSQM's memory entirely, ReSQM can also work effectively by putting the large table in solid-state-disk (SSD) storage and sending it to ReSQM in batches for processing.

In this section, we first elaborate on the architectural details of ReSQM and then show the fundamental designs for accelerating different database queries.

A. Architecture

Fig. 3(b) shows the architecture of a DSQ unit, which is the core of performing the execution for every received query. Note that ReSQM currently processes all queries serially. Supporting concurrent query execution can be considered interesting future work. A DSQ unit contains some necessary components to support effective query execution. Next, we discuss them as follows.
1) Structured Query (SQ) Buffer: It is mainly used to store the queries that will be processed by the DSQ unit. Yet, it can also function to identify the type of a query, i.e., JOIN, SORT, or SELECTION.
2) ALUs and Stack Register: ReSQM includes some simple ALUs to convert a restricted expression used in SELECTION queries into a suffix one that can show the correct execution priority of the operators. The stack register stores the operands and operators during the expression parsing. The ALUs and the stack register enable ReSQM to work independently from the CPU to accelerate database queries.
3) Look-Up Table (LUT): It is introduced to enable ReCAM to support basic arithmetic or logic operations based on the comparison paradigm of ReCAM. The LUT stores the precalculated truth tables of basic instructions, as shown in Fig. 3(c).
4) Address Information: This is an address register that records and specifies in which columns a particular attribute is stored in the DSQ mat.
5) Ctrl: This is a local microcontroller that manages the components in the DSQ unit to perform the corresponding database operations and sends some control signals to the DSQ mat.
6) DSQ Mat: It is the main storage and computing component in the DSQ unit. It contains many processing elements that are connected through an H-Tree, as shown in Fig. 3(d).
7) Processing Element (PE): Similar to prior work [24], we configure each PE with a ReRAM array size of 512 rows and 512 columns. Fig. 3(e) composes a sketch of the data organization. We reserve the first 64 b (marked with "R") as a buffer to store the intermediate results (when necessary). The rest of the columns in a PE are used to store the table.
8) SSD: An off-chip SSD is optionally used to store a large number of the results of a JOIN query when the ReCAM's on-chip memory is not sufficient.

In the ReSQM designs, we hold the argument that ReCAM arrays should function as both storage and computing units to eliminate the data movement between the processor and the memory. Based on this design philosophy, we next present how these key components are designed to accelerate SELECTION, SORT, and JOIN operations effectively and efficiently by exploiting ReCAM.

B. Accelerating SELECTION Queries

ReCAM can perform a variety of bitwise operations based on its vector–scalar comparison paradigm [24], [25]. To support SELECTION, NVQuery [29] proposes to transfer the operations of a SELECTION query into a series of bitwise operations, which can generate a large number of intermediate results. To obtain the final results of the query, NVQuery relies on the main processor to process these intermediate results. Considering that the restricted expression in a SELECTION query contains a variety of bitwise operations, NVQuery often needs to transfer lots of intermediate results to the processor, degrading the overall efficiency.

To avoid the off-chip data transfers, we make two core designs to perform a SELECTION query in memory. First, we reserve some memory spaces as R regions, as shown in Fig. 3(e), which will be used to store the intermediate results of SELECTION queries. Since the R region also has computing ability, the intermediate results can also be computed with the original data for generating the next intermediate results, which can be further stored in the R region. In this way, ReSQM can get the final results of SELECTION queries by rational use of the R region. Second, we architect some ALUs and a stack register in each DSQ unit, as shown in Fig. 3(b). The ALUs will parse the restricted expression of a SELECTION query and obtain the correct execution order of the restricted expression, which will be stored in the stack register in the form of operands and operators. The processing of the restricted expressions is as follows.

In the beginning, two operands and one operator will be popped from the stack register and sent to the address information register and the LUT, respectively. According to the truth table of this operator in the LUT, the controller will generate suitable signals for the DSQ mat. With control signals applied to the memory addresses of these two operands recorded by the address information register, the DSQ mat can calculate the results of these two vectors, and the results will be stored in the R region. After that, the stack register will pop an operator and an operand out to control the corresponding vector calculation with the intermediate results. When all operands have been processed, we can get the final result of the restricted expression in the R region. A logic 1 stored in the R region means that the value of the restricted expression is true on this tuple. Finally, ReSQM can get the results of this SELECTION query through a memory read request according to the values in the R region.

Example: Suppose two 32-b numbers "A" and "B" need an addition. Next, we introduce how this simple operation is performed on ReSQM. This is a typical multibit addition case [23]–[25], which is often processed by transforming it into multiple single-bit additions. Then, we can use a truth table for the single-bit addition to process the operation. The procedure works as follows. First, the lowest bits of A and B will do a single-bit addition. The carry-out and result will be stored in the R region. Afterward, the carry-out will work as the carry-in and do an addition with the second-lowest bits of A and B to generate the next carry-out and result. This process is repeated until the highest bits of A and B are processed. Finally, the R region will hold the final result of "A + B". Note that the additions of the two elements in different rows are computed in parallel.

The multiplication can be considered multistep additions [23]. The row-wise max instruction is used to find the maximum number among the corresponding elements of two vectors in the same row.

In ReSQM, we perform the addition on two 32-b vectors that operate on an 8-row truth table row by row. This thus takes 32 × 8 × 2 = 512 cycles for a vector–vector addition. Other instructions are similar. Table I lists the instructions supported in ReSQM.

TABLE I: Common Row-Wise Vector–Vector Instructions Used in Database Operations

C. Accelerating SORT Queries

The sorting of some attribute columns based on the ReCAM array often needs first to perform interrow comparisons and then reorder the attribute in different rows. However, the ReCAM array supports intercolumn comparison only. Also, the substantial amount of row swapping would incur significant overheads in efficiency and energy consumption. Therefore, traditional sorting algorithms are often difficult to apply to ReCAM.
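As an aside, the truth-table-driven, bit-serial addition described in Section III-B above can be mimicked in software. The sketch below ripples a carry from the lowest to the highest bit of every row pair, with a Python dictionary standing in for the single-bit truth table of Fig. 3(c); this is a behavioral model under our own naming, not the hardware datapath:

```python
# Software model of the bit-serial, truth-table-driven vector addition.
# Each row pair (a_i, b_i) is conceptually processed in parallel;
# within a row, the carry ripples over the 32 bit positions.
WIDTH = 32

# Single-bit full-adder truth table, standing in for the LUT contents.
FULL_ADDER = {(a, b, c): ((a + b + c) & 1, (a + b + c) >> 1)
              for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def vector_add(va, vb):
    out = []
    for a, b in zip(va, vb):          # each row computed independently
        result, carry = 0, 0
        for i in range(WIDTH):        # lowest to highest bit position
            s, carry = FULL_ADDER[(a >> i) & 1, (b >> i) & 1, carry]
            result |= s << i
        out.append(result & (2 ** WIDTH - 1))
    return out

# vector_add([3, 10], [5, 7]) -> [8, 17]
```

The inner loop visits 32 bit positions per row, which is the software counterpart of the 32 × 8 × 2 = 512-cycle figure quoted for a vector–vector addition.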
Imani et al. [29] used the difference between discharging currents to perform the nearest distance search. Its key idea is that, for a given number, the closer the number on a match line is to the given number, the faster the current on the match line leaks. They use this approach to find MIN and MAX results, which are a subset of SORT queries. However, as the number of records increases, this method becomes challenging for differentiating data depending upon the discharging currents. Therefore, the method in [29] is hardly usable for SORT queries, which often operate upon millions of records in a table.

We present a count-based algorithm to support SORT queries on ReCAM effectively and efficiently. It can complete the ranking using the vector–scalar comparisons of ReCAM without any row swapping. The main idea is to construct a list of binary groups ⟨digit, cnt⟩, where digit is an element from the attribute column that needs to be sorted and cnt represents the repetition times of digit. These binary groups are generated serially according to the size of digit, from the smallest to the largest. They will be stored in the R region [Fig. 3(e)] from the first line to the last line after the generation. Therefore, we can get a well-sorted attribute quickly by: 1) merely reading these binary groups in ascending order of rows and 2) then replicating the digit element cnt times. That is, the data in the R region can be treated as a well-sorted attribute that is visible to users as the final result of the SORT query. Note that the number of binary groups might be large. Storing them in the R region can not only save space but also avoid the extra overheads of creating other data structures. Technically, applying the data structure of binary groups in ReSQM can also reduce a large number of writes on the ReCAM cells, boosting energy efficiency significantly.

The question is then how to generate a binary group according to the size of digit. Let us take the ascending order as an example. Fig. 4 shows the procedure of finding digit_min and its corresponding cnt on attribute columns, from the highest bit to the lowest bit, via FindMinimumDigit.

Fig. 4. Finding a minimum digit and its count.

Initially, we clear all bits of the KEY register to 0. FindMinimumDigit works like a filter algorithm, in which we step by step determine every bit of digit_min and get rid of those elements that are definitely not digit_min, from the highest bit to the lowest bit. First, the MASK register will activate the highest bit to do a match operation between the highest bit of the KEY register and all elements of the attribute to be sorted. If some rows are tagged by the TAG register, it means the highest bit of digit_min must be 0, and the unmatched rows will not be precharged any more because their highest bit is 1 and they can never be the smallest digit. If no row is tagged, this indicates that the highest bit of digit_min must be 1, and the highest bit of the KEY register will be set to 1. Thus, the precharge information will stay unchanged.

Afterward, the DSQ mat [as described in Fig. 3(b)] will activate the second highest bit of the MASK register to determine the second highest bit of digit_min to be 0 or 1. The DSQ mat will repeat the same procedure until the lowest bit is matched. After this phase, the number in the KEY register has stored every bit of digit_min. All the rows matched can be considered as digit_min, and their number indicates the cnt of digit_min. Through Sizeof(Vector) (e.g., 32 in this article) cycles, we can find a minimum digit and its corresponding count. All the rows matched to this digit_min will not be precharged, so as to find the next digit_min. Note that the ith minimum digit can be easily found by disabling matching on the (i − 1)th minimum digit.

For every sorting query, the same operations are performed upon the KEY and MASK registers of all processing elements (PEs). Each PE can execute FindMinimumDigit in parallel under the control of the KEY and the MASK registers. Since each PE returns a cnt_i for the same digit, the cnt of digit_min can be obtained by simply adding all cnt_i in the ALU.

Computational Complexity: The computational complexity of our sorting procedure is O(NM), where N is the number of elements and M is the number of distinct elements. Suppose all the elements are unique; then the worst complexity will be O(N^2), which can be finished in N cycles under ReSQM.

D. Accelerating JOIN Queries

Compared with SELECTION and SORT operations, the results of a JOIN query can be too large and might exceed the memory size of ReSQM. In this case, ReSQM can optionally store the massive results of a JOIN query into an off-chip SSD instead of the R region. Once a join-induced matching (a part of the final results) is found, it can be (optionally) transferred to the SSD (if necessary). Note that these off-chip data transfers can be conducted in an overlapping fashion with the normal executions. Hence, the impact of off-chip data movements can be mitigated as well.

Equi-Join: Imani et al. [29] used a so-called "exact search" mode to support equi-joins based on the LUTs. For equi-join, the number of table lookups can often be thousands of times that of SELECTION. Once the LUT is occupied by some operations, they often require considerable overheads to finish, since frequent switching of control signals occurs. The LUT becomes a bottleneck for equi-join. ReSQM copes with this issue by performing a data reading in advance to use the vector–scalar comparison ability of ReCAM, without the assistance of LUTs for equi-join.
4036 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 39, NO. 11, NOVEMBER 2020
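As a point of reference for the SORT procedure of Section III-C above, the following is a minimal software analogy (an illustrative sketch under our own function names, not the ReCAM microcode) of the bit-serial FindMinimumDigit and the binary-group encoding:

```python
def find_minimum_digit(values, bits=32):
    """Software analogy of FindMinimumDigit: resolve the minimum one bit at
    a time, MSB first, dropping rows that can no longer be the minimum.
    Assumes unsigned integers that fit in `bits` bits."""
    candidates = list(values)
    for b in reversed(range(bits)):
        zeros = [v for v in candidates if not (v >> b) & 1]
        if zeros:
            # some rows matched a 0 at this bit -> the minimum has 0 here;
            # rows holding a 1 are no longer "precharged" (discarded)
            candidates = zeros
        # otherwise every candidate holds a 1 here -> the KEY bit is set to 1
    return candidates[0], len(candidates)   # (digit_min, cnt)

def binary_group_sort(column, bits=32):
    """Emit (digit, cnt) binary groups in ascending order of digit."""
    groups, remaining = [], list(column)
    while remaining:
        digit, cnt = find_minimum_digit(remaining, bits)
        groups.append((digit, cnt))
        remaining = [x for x in remaining if x != digit]
    return groups

def expand(groups):
    """Replicating each digit cnt times yields the fully sorted attribute."""
    return [d for d, c in groups for _ in range(c)]
```

For example, `binary_group_sort([5, 3, 3, 7])` yields the groups `[(3, 2), (5, 1), (7, 1)]`, and expanding them gives the sorted attribute `[3, 3, 5, 7]`; duplicates are visited only once, which mirrors how binary groups save writes on the ReCAM cells.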

The equi-join implementation of ReSQM can be as simple as performing some per-tuple scalar–vector comparisons. Suppose we hold two attribute columns A and B from two tables. All records of A are read out in turn and sent to the KEY register to be compared with B simultaneously. All matched rows tagged by the TAG register become part of the results of the equi-join of A and B. Note that the intermediate results of the equi-join between B and each element of A can be optionally written into the SSD. Although our method introduces extra read operations, ReSQM still preserves its efficiency for the following reason: the reading of A can run in an overlapping manner with the processing of B, since A and B are stored in different tables.
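The flow just described can be mimicked in plain Python (a behavioral sketch under our own naming, not hardware code): each record of A plays the role of the KEY register, and one scalar–vector comparison tags all matching rows of B at once.

```python
def equi_join(a_column, b_column):
    """Behavioral sketch of the equi-join flow: N reads of A, each followed
    by one row-parallel comparison against the whole of B."""
    results = []
    for i, key in enumerate(a_column):
        # in ReCAM this comparison happens for every row of B simultaneously;
        # here it is an ordinary scan producing the TAG-ed row indices
        tags = [j for j, b in enumerate(b_column) if b == key]
        # matched pairs may be streamed out (e.g., to the off-chip SSD)
        results.extend((i, j) for j in tags)
    return results
```

For instance, `equi_join([1, 2], [2, 1, 2])` returns the index pairs `[(0, 1), (1, 0), (1, 2)]`; the outer loop corresponds to the N read cycles, while each inner comparison would complete in a single ReCAM match cycle.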
Inequality Join: Unlike SELECTION, SORT, and equi-join, which perform equal comparisons that fit the ReCAM computational paradigm well, inequality join involves relational comparison. To enable relational comparison, we make a slight modification to the ReCAM bit-cell organization with respect to the current leakage mechanism.

Fig. 5. Modified ReCAM bit-cell organization. Each TAG-G is shared.

TABLE II
Workload Characteristics Where m = 1, 4, 8, 16

Fig. 5 shows the modified structure of the ReCAM bit-cell, in which we add a TAG-G register to each row of ReCAM. The main idea is to architect an extra TAG to capture the direction of the leakage current on a mismatch. We architect a TAG-G register between all memristors of the bit line and the ground wire to detect the potential leakage current. When ReSQM performs SELECTION, SORT, and equi-join queries, the switch controller (SC) in TAG-G is in the switch-on state to make the SA and the latch invalid. The SC is in the switch-off state if an inequality join query is under processing. Suppose the KEY register stores 1 and the ReCAM bit-cell stores 0; leakage current will flow through the switch-on memristor on the bit line to the TAG-G, since the memristor on the bit-not line is in the switch-off state in this case. Then, both TAG-M and TAG-G will store a logic 1. The modified bit-cell may have three cases: 1) no leakage current occurs, and only the TAG-M captures the voltage, indicating equality; 2) the TAG-G and the TAG-M both capture a leakage current, indicating a greater value in the KEY register; and 3) neither TAG-M nor TAG-G captures a signal, meaning a smaller-than relation.
Note that we architect TAG-G based on the off-the-shelf TAG architecture. The timing correctness of its SA and latch controller has been demonstrated in previous studies [23]–[25]. The working mechanism of the SA and latch is easy to understand. Both the SA and the latch have an internal resistance. Therefore, the current prefers to leak out of the memristor preferentially. If and only if the current cannot flow through the memristor, the match line will hold a high voltage. In this case, the SA and latch start working. The activation timing of the precharge signal and of the signals operating upon the KEY register is the key to the correctness of the TAG circuit. We also note that these signals are activated at the same time, and thus, the correctness can be ensured.
The basic principle of performing the relational comparison between two data words based on the modified bit-cells works as follows. We can perform a bitwise comparison from the highest bit to the lowest bit of the two data words. The relational comparison between their highest bits can directly indicate their size relationship. If their highest bits are equal, we can iteratively compare their subsequent bits from the high to the low bit until a nonequal relation is found. Otherwise, the two data are essentially equal. Similar to equi-join, inequality join also performs the meta-operation of scalar–vector comparison.
Zhao et al. [30] presented a relational comparator for the random forest. They divide the original bit lines and bit-not lines into two separate ReRAM arrays. Through precharging the two arrays individually, the relativity of two data can be obtained by computing the voltage difference between the two arrays. Unfortunately, this approach suffers in accelerating database operations, which involve not only relational comparisons but also a large number of equal comparisons (as in SELECTION and SORT operations) or even noncomparison operations. Their separate architecture, in many cases, might double the overheads of those database operations since at least double the rows need to be precharged. In contrast, ReSQM adds only a neat and cheap TAG register attached to each row with nearly negligible modification to the ReCAM architecture, without sacrificing any potential parallelism of the ReCAM array.
Computational Complexity: The JOIN query often needs O(NM) matchings, where N and M represent the lengths of two attribute columns A and B, which can be accelerated in N cycles under ReSQM if A is read to match B.

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup
Workloads: Table II lists the workloads used for ReSQM. We used the GNU library [11] to create two tables M and N. Both have nine attributes, among which the key has 4 B, and two attributes are 2-B integers used for the 16-b multiplication while the rest are 4-B integers. The key attribute was

marked as attr0, while the others are marked from attr1 to attr8 in turn. As shown in Table III, these attributes are generated based on the Bernoulli, uniform, and Gaussian distributions. In particular, we generate table M at four different scales for the sensitivity study.

TABLE III
Attribute Distribution of the Table M at Different Sizes, and the Table N

Fig. 6. Throughput (tuples processed per second) of ReSQM when handling all the four database operations on the table M at different sizes. (a) SE. (b) SO. (c) EJ. (d) IJ.

Measurement: We evaluate ReSQM by two metrics: 1) response time and 2) energy consumption. All results for ReSQM and the baseline are obtained by averaging ten different queries for each query type. SELECTION and SORT are performed on the table M at all four scales. Equi-join and inequality join are performed on table N and the four scales of table M. We list each typical query as follows.
1) SELECTION (SE): Select attr0, attr2, attr5, and attr8 from the table M where 2 × (attr5 + attr6 − attr7) > 3000 AND (attr2 − attr4 − attr1) < 500 OR 4 × (attr8 + attr5) + 5 × (attr7 − attr6) > 1000.
2) SORT (SO): Select attr3 from the table M order by attr3.
3) Equi-Join (EJ): Select M.attr2, M.attr4, N.attr0, and N.attr3 from the tables M and N where M.attr2 = N.attr3.
4) Inequality Join (IJ): Select M.attr5, M.attr3, N.attr3, and N.attr1 from the tables M and N where M.attr3 > N.attr1.
Cycle-Accurate Simulation: We use a cycle-accurate simulator in which the underlying mathematical model constraints have been proved to ensure the correctness and accuracy of program executions [23]. ReSQM applies this with a three-step simulation [23] for the ReCAM hardware. The first step is data mapping; ReSQM can work as a memory. The two tables M and N are written into the DSQ Mat of all the DSQ units, and their attribute locations are recorded in the address information register. The second step is to decompose a database query into a series of arithmetic and data communication operations. This step is managed by the controller in the DSQ unit. Consider the SELECTION query as an example: the original query statement is parsed into the arithmetic expressions shown in Table I. Finally, these arithmetic operations are converted into a series of ReCAM atomic operations, such as read, write, and comparison. By looking up Table I, we can hence obtain the running time of each query accurately. The simulation for SORT and JOIN needs only the first and the last steps, since we perform their algorithmic operations straightforwardly based on the ReCAM atomic logic rather than on arithmetic operations.
ReSQM Configurations: ReSQM runs at 1 GHz with 12 DSQ units. We use the SPICE simulator to obtain the energy consumption, area parameters, and performance of ReSQM. Each DSQ unit has 437.5-MB memory and 200-W power at full capacity. The area of each DSQ Mat is 103.7 mm² and each ReCAM array takes 0.0034 mm². The read and write operations are atomic, and their latencies are 8.31 and 17.42 ns, respectively. The matching operation is also atomic during the computation; its latency is 1 ns, the inverse of the frequency. In general, the read and write operations are more expensive than the match operation, and hence we implement our algorithms using read and write operations at a minimum level. Taking SORT as an example, we do not use any reads and use writes only when the binary groups need to be stored.
In this article, we conservatively set the size of the KEY register to 32 b for careful consideration of supporting the bitwise operations that are frequently used in SELECTION. As discussed in Table I, performing bitwise operations is often sensitive to the bit number. As the size of the KEY register increases, the number of cycles required can significantly increase. In this case, the KEY register is scanned and processed one bit at a time in a bit-serial manner [23]–[25]. In actuality, the size of the KEY register can still be set to 256 b if only SORT and/or JOIN queries are considered.
We evaluate ReSQM against a baseline with a 10-core Intel Xeon E5-2630 v4 CPU@2.20 GHz, 25-MB cache, 68.3 GB/s, and 85-W TDP. We run SQLite selection, radix sort [14], sort-merge join [27], and inequality join on PostgreSQL v9.4 [5] for our baseline comparison. We use RAPL [26] to measure the energy of the CPU. ReSQM is a PIM accelerator, and all tables and the results of SELECTION and SORT are stored in the memory. The baseline also stores all tables and the results of SELECTION and SORT in the memory. The JOIN results of ReSQM and the baseline are both stored in the off-chip SSD. The SSD card connected to ReSQM is the same as that connected to the CPU, with a size of 480 GB. The SSD interface is based on SATA3; its read and write speeds are 562 and 420 MB/s, respectively. To make apple-to-apple comparisons, we benchmark ReSQM against the CPU baselines using the same workloads and the same benchmarks. For preserving fairness, the loading time of the original data from the disk to the memory is not counted.

B. Overall Efficiency
We first evaluate the speedup and energy consumption of ReSQM against the CPU baseline for SELECTION (SE), SORT (SO), Equi-Join (EJ), and Inequality Join (IJ).
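As a concrete reading of the SE workload listed in Section IV-A, its predicate and projection can be written out as follows (a plain-Python sketch of the query semantics only, with SQL's AND-before-OR precedence made explicit; this is not ReSQM code):

```python
def se_predicate(t):
    """WHERE 2*(attr5+attr6-attr7) > 3000 AND (attr2-attr4-attr1) < 500
       OR 4*(attr8+attr5) + 5*(attr7-attr6) > 1000 (AND binds tighter)."""
    return (2 * (t["attr5"] + t["attr6"] - t["attr7"]) > 3000
            and (t["attr2"] - t["attr4"] - t["attr1"]) < 500) or \
           4 * (t["attr8"] + t["attr5"]) + 5 * (t["attr7"] - t["attr6"]) > 1000

def se_query(table):
    """SELECT attr0, attr2, attr5, attr8 FROM M WHERE <predicate>.
    Tuples are modeled as plain dicts keyed by attribute name."""
    return [{k: t[k] for k in ("attr0", "attr2", "attr5", "attr8")}
            for t in table if se_predicate(t)]
```

In ReSQM, the two arithmetic subexpressions of this predicate are the parts decomposed into ReCAM atomic operations by the controller, whereas a CPU evaluates the same predicate tuple by tuple as above.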

Fig. 7. Response time of ReSQM against CPU for all the four database operations on the table M at different sizes.

Fig. 8. Energy consumption of ReSQM against CPU for all the four database operations on the table M at different sizes.

Fig. 9. Response time of ReSQM against CPU with varying query result sizes. All results are obtained on M@16M. (a) SE. (b) SO. (c) EJ. (d) IJ.

Throughput: Fig. 6 shows the throughput of ReSQM and the CPU-based platform. For the SELECTION query, we see that ReSQM can achieve 162G tuples per second, while the CPU can only process 359M tuples per second on the table M@4M. In particular, when the table is scaled to 64M, ReSQM continues improving the throughput to up to 1083G tuples per second while the throughput of the CPU stays immutable. For the SORT query, ReSQM has an average throughput of 1096M tuples per second while the CPU has an average throughput of only 81M tuples per second over the four scales of the table M. For the equi-join query, ReSQM has an average throughput of 513M tuples per second while the CPU has an average throughput of only 13M tuples per second. For the inequality join query, ReSQM has an average throughput of 3.5M tuples per second. The CPU has a typically small throughput of 15.2K tuples per second for table M@4M. When the table size is at 64M, the throughput of the CPU becomes worse, at 7.6K tuples per second.
Speedup: Fig. 7 shows the speedup results. Overall, ReSQM significantly outperforms the CPU for all database operations. For example, for the table M@4M, ReSQM completes SE, SO, EJ, and IJ in 0.025 ms, 3.65 ms, 7.8 ms, and 1.15 s, while the CPU completes them in 11.13 ms, 49.37 ms, 319.5 ms, and 263.6 s, yielding speedups of 445×, 13×, 41×, and 229×, respectively. More importantly, ReSQM shows better scalability than the CPU as the data size increases. Taking IJ as an example, its response time increases from 17.9 to 8469 s when the table M varies from 4M to 64M, while ReSQM keeps the response time between 1.15 and 17.9 s, a significant improvement of two to three orders of magnitude. That yields speedups of 721×, 25×, 95×, and 471× for the table M@64M.
Energy Efficiency: Fig. 8 further shows the energy results. We can see that ReSQM completes all the database operations with far less energy consumption than the baseline due to the substantial reduction of data movement. For the table M@4M, ReSQM reduces energy consumption by 164×, 12×, 21×, and 114× for SE, SO, EJ, and IJ, respectively. The better scalability of ReSQM further improves these reductions to 239×, 23×, 55×, and 237× for the table M@64M.

C. Systematic Impact of Query Result Size
We further show the systematic impact on ReSQM when the result size of a query increases for SE, SO, EJ, and IJ.
SE: Fig. 9(a) characterizes the performance of ReSQM for SE against the CPU. We can see that the CPU seems to be insensitive to the query result size, while ReSQM is sensitive. The reason is as follows. In a CPU architecture, no matter how many tuples match the restricted expression, it always needs to load all tuples from memory to cache for a global analysis, thereby yielding a relatively stable performance. On the contrary, ReSQM performs the in-situ computation of the restricted expression, with only those columns that need comparisons computed. Despite the rising tendency in response time, ReSQM still has a faster response speed than the baseline due to fewer data movements.
SO: Fig. 9(b) characterizes the performance of ReSQM against radix sort on the CPU with different repetition times of every unique element on an attribute that needs to be sorted. As the repetition times increase, the results show that the response time of the CPU maintains a stable level while that of ReSQM is reduced significantly. Radix sort is easy to understand, since it needs to compare every element in each round. In ReSQM, the same elements are reduced into one unique binary tuple, such that we access this unique binary tuple only once and remove the redundant accesses.
EJ: Fig. 9(c) shows the response time of EJ, for which the CPU has an increasing overhead while ReSQM's is degraded. The reason is as follows. CPU-based EJ is a two-step sort-merge approach. As discussed above, the performance of sorting keeps stable. For the merging operation, a large result size often implies more row-wise comparisons, thereby leading to longer response time. The case is completely different in ReSQM: it has no sort step. Also, the EJ is a natural fit

for vector–scalar comparisons. More comparisons involved in a large query result size can be well parallelized by exploiting the massive parallelism of the ReCAM array.

TABLE IV
Comparisons Between the Original ReCAM Array and Our Modified Array

TABLE V
Overhead Breakdown

TABLE VI
Energy Breakdown

TABLE VII
Area Breakdown

TABLE VIII
Performance of ReSQM Against GPU, FPGA, NDP, and PIM Platforms (Normalized to CPU Platform)

IJ: Fig. 9(d) shows the response time of IJ. By rearchitecting the ReCAM bit-cell, ReSQM also exposes the massive parallelism of ReCAM to handle the relational vector–scalar comparisons. Thus, the performance of ReSQM against the CPU for processing IJ shows a similar variation trend as for processing EJ.

D. Overheads and Breakdown
TAG-G Overheads: We evaluate the overheads of the TAG-G register. Table IV shows the latency and energy consumption of the match operations on the original ReCAM array and on our modified array. We see that only the match latency of the inequality join is 0.1 ns longer than on the original array. The latency of the other database operations is the same as on the original array. The reason is apparent: TAG-G is not used when performing SELECTION, SORT, and equi-join, so their performance is not affected by this modification. The energy consumption of all database operations is not influenced either, because the TAG-G makes full use of the existing leakage-current mechanism rather than architecting new hardware components. Finally, we also see that the area of the DSQ Mat with TAG-G is 103.7 mm², introducing only an extra 2.3 mm² against the original array without TAG-G.
Controller Overheads: We further investigate the controller overheads when handling SELECTION, SORT, and JOIN queries, respectively. Table V depicts the results. "CAM" indicates the overhead from the DSQ Mat. We can see that the overhead of the controller takes 16.23% of the total overheads for the SELECTION query. The controller can be as small as 1.72% and 2.44% for the SORT and JOIN queries, respectively.
The reasons for the low overheads of the controllers are simple. The execution logic of the DSQ Mat is driven by the KEY and MASK registers, which depend on the data transferred from the DSQ unit and, further, on the controllers. For the SORT and JOIN queries, the MASK and KEY registers work in a regular way. For example, in FindMinimumDigit, the MASK register is activated bit by bit from the highest bit to the lowest bit, and the KEY register is initially set to 0. Therefore, with just one signal from the controller, the MASK register can work 32 times and find the digitmin and its count. However, for the SELECTION query, the situation is different. The execution of each arithmetic operation needs control signals from the LUT. Therefore, one signal from the controller can manage the write of only one row to the KEY register.
Energy and Area Breakdown: We also investigate the energy consumption and area of each component in ReSQM. In Table VI, we can see that the ReCAM array consumes most (97.93%) of the energy, among which dynamic computations (relying on the precharging, the KEY register, the MASK register, etc.) occupy 89.79% of the energy consumption while leakage current takes 8.14%. The other components beyond the DSQ Mat cost only 2.07% of the energy. Table VII further shows the area breakdown of ReSQM. The ReCAM array occupies 88.39% of the total area, with the H-tree's ratio at 0.04%. All buffers, ALUs, and microcontrollers take only 7.57%, 1.59%, and 2.41% of the area, respectively. By adding small add-on peripheral circuits, ReSQM functions well as a promising in-memory device to accelerate database operations.

E. Compared With Other Platforms
We finally evaluate ReSQM against some state-of-the-art GPU, FPGA, NDP, PIM, and CMOS-CAM-based efforts. Note that some of these studies may support only a part of the four database operations involved in this work.
For GPU, we use an NVIDIA GTX1080@1733 MHz, with 2560 CUDA cores, 2-MB shared L2 cache, 8-GB graphics memory, and 180-W TDP. The SELECTION algorithm is introduced from [11], the SORT algorithm from [14], equi-join from [15], and inequality join from [6]. For FPGA, we use the architecture and algorithms from [16] for SELECTION, SORT, and equi-join queries. The NDP baseline is from [31] for SELECTION only. As for the PIM baseline, we select

NVQuery [29] for SELECTION and equi-join queries. To ensure fairness, we evaluate ReSQM against these baselines by running the same benchmarks on the same workloads.
Performance Comparisons and Analysis: Table VIII shows the performance results for the GPU, FPGA, NDP, and PIM platforms. We can see that ReSQM shows the best performance, with speedups of 15×, 2.2×, 6.8×, and 39× over the best performer among the GPU, FPGA, NDP, and PIM platforms for SE, SO, EJ, and IJ, respectively. For SELECTION, the NDP accelerator offers the worst acceleration effect compared with the other platforms. This is because, for a large table, [31] relies on a CPU to process lots of operators and intermediate results. Thus, the data transfer bottleneck limits the overall efficiency. Compared with NVQuery [29], ReSQM offers more than 30× speedup, due to the reduced number of intermediate result transfers. Since SELECTION is simple and exhibits good data parallelism, the GPU and FPGA platforms show superior results over NVQuery for all database tables.
Note that ReSQM on SO shows a relatively smaller speedup than those on SE, EJ, and IJ due to the underutilization of ReCAM bit-cells. Actually, only 5% of the bit-cells are used for SO in ReSQM. The rest (unrelated to a sorting attribute) is aggressively disabled for correctness. We leave accelerating SO further by fully utilizing the ReCAM resources as future work. For equi-join, which has higher complexity than SELECTION, we see that NVQuery becomes superior to the GPU and FPGA. Without the lookup overheads of the LUT, ReSQM offers more than 6× speedup over NVQuery. For inequality join, only the GPU and ReSQM can support it currently. However, we still find that ReSQM outperforms the GPU by 39×, due to the in-situ computing ability and massive parallelism of the ReCAM array.
The CMOS-based CAM often suffers from a severe scalability issue, with only limited dataset sizes supported. To facilitate comparison with the existing work, we use workloads similar to [9] by performing SO on a 40 000-tuple table, and running EJ and IJ on two tables with 20 000 tuples and 40 000 tuples, respectively. For SO, EJ, and IJ, the CMOS-based CAM can offer speedups of 1.59×, 7.3×, and 11.2× against the CPU, while our accelerator offers 7.7×, 21×, and 136×.

F. Discussion
So far, using ReCAM to handle string types has some difficulties, with many challenges faced, particularly the lack of an effective data mapping: 1) using a fixed size to represent a character is often difficult, if not impossible, for supporting an arbitrary-length string. Supposing we use the 26 English letters as a collection to generate strings, so that each character is represented by 5 b, one row of the ReCAM array can often support a string of at most 50 characters; 2) using multilevel cells (MLCs) can mitigate the above issue to support a relatively long string. However, this needs a strict MLC production process and also introduces a precision problem; and 3) using a fixed size of the ReCAM array to support irregular strings is also difficult, which needs a valid tradeoff between computational parallelism and storage efficiency.
This work is just small-step research in using ReCAM to accelerate some database queries. Although supporting strings remains an open question, we believe that ReSQM has still addressed several critical challenges in this timely topic and will facilitate subsequent research on handling strings effectively and efficiently.

V. CONCLUSION
This article identified a spectrum of comparison semantics in relational database operations. We introduced ReSQM, a novel ReCAM-based accelerator, which can boost the performance of many typical database operations by flexibly exploiting the inherent parallelism of the ReCAM array. Results showed that ReSQM significantly outperforms existing CPU, CMOS-based CAM, GPU, FPGA, NDP, and PIM solutions by orders of magnitude, with speedups of 2.2× ∼ 39×, and ReSQM also achieved 17× ∼ 193× energy saving compared with the CPU baseline.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their insightful comments and valuable feedback.

REFERENCES
[1] K. G. Coffman and A. M. Odlyzko, "Internet growth: Is there a 'Moore's law' for data traffic?" in Handbook of Massive Data Sets. Boston, MA, USA: Springer, 2001, pp. 47–93.
[2] S. Kelling et al., "Data-intensive science: A new paradigm for biodiversity studies," BioScience, vol. 59, no. 7, pp. 613–620, 2009.
[3] M. Korkmaz, M. Karsten, K. Salem, and S. Salihoglu, "Workload-aware CPU performance scaling for transactional database systems," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2018, pp. 291–306.
[4] K. Ono and G. M. Lohman, "Measuring the complexity of join enumeration in query optimization," in Proc. VLDB, 1990, pp. 314–325.
[5] G. Smith, PostgreSQL 9.0 High Performance. Birmingham, U.K.: Packt Publ., 2010.
[6] D. R. Augustyn and L. Warchal, "GPU-accelerated method of query selectivity estimation for non equi-join conditions based on discrete Fourier transform," in New Trends in Database and Information Systems II. Cham, Switzerland: Springer, 2015, pp. 215–227.
[7] P. Mishra and M. H. Eich, "Join processing in relational databases," ACM Comput. Surveys, vol. 24, no. 1, pp. 63–113, 1992.
[8] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, 1995.
[9] N. Bandi, D. Agrawal, and A. E. Abbadi, "Fast computation of database operations using content-addressable memories," in Proc. 17th Int. Conf. Database Expert. Syst. Appl. (DEXA), 2006, pp. 389–398.
[10] D. Agrawal and A. E. Abbadi, "Hardware acceleration for database systems using content-addressable memories," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2005, pp. 1–7.
[11] P. Bakkum and K. Skadron, "Accelerating SQL database operations on a GPU with CUDA," in Proc. 3rd Workshop Gen. Purpose Comput. Graph. Process. Units (GPGPU), 2010, pp. 94–103.
[12] D. Schaa and D. Kaeli, "Exploring the multiple-GPU design space," in Proc. IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), 2009, pp. 1–12.
[13] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2010, pp. 347–358.
[14] N. Satish et al., "Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2010, pp. 351–362.
[15] T. Kaldewey, G. Lohman, R. Mueller, and P. Volk, "GPU join processing revisited," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2012, pp. 55–62.

[16] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2014, pp. 151–160.
[17] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, "StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems," in Proc. Int. Symp. High Perform. Distrib. Comput. (HPDC), 2008, pp. 165–174.
[18] B. Sukhwani et al., "Database analytics acceleration using FPGAs," in Proc. Int. Conf. Parallel Archit. Comp. Tech. (PACT), 2012, pp. 411–420.
[19] J. Do, Y. Kee, J. M. Patel, C. Park, K. Park, and D. J. Dewitt, "Query processing on smart SSDs: Opportunities and challenges," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2013, pp. 1221–1230.
[20] Y. Kang, Y. Kee, E. L. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proc. IEEE Symp. Mass Stor. Syst. Tech. (MSST), 2013, pp. 1–12.
[21] R. Balasubramonian et al., "Near-data processing: Insights from a MICRO-46 workshop," IEEE Micro, vol. 34, no. 4, pp. 36–42, Jul./Aug. 2014.
[22] Y. Sun, Y. Wang, and H. Yang, "Bidirectional database storage and SQL query exploiting RRAM-based process-in-memory structure," ACM Trans. Stor., vol. 14, no. 1, p. 8, 2018.
[23] L. Yavits, A. Morad, and R. Ginosar, "Computer architecture with associative processor replacing last-level cache and SIMD accelerator," IEEE Trans. Comput., vol. 64, no. 2, pp. 368–381, Feb. 2015.
[24] L. Yavits, S. Kvatinsky, A. Morad, and R. Ginosar, "Resistive associative processor," IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 148–151, Jul.–Dec. 2015.
[25] R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, "A resistive CAM processing-in-storage architecture for DNA sequence alignment," IEEE Micro, vol. 37, no. 4, pp. 20–28, Aug. 2017.
[26] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, "RAPL: Memory power estimation and capping," in Proc. ACM/IEEE Int. Symp. Low Power Elect. Design (ISLPED), 2010, pp. 189–194.
[27] S. Blanas and J. M. Patel, "Memory footprint matters: Efficient equi-join algorithms for main memory data processing," in Proc. Annu. Symp. Cloud Comput. (SOCC), 2013, pp. 1–16.
[28] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu, "Multi-core, main-memory joins: Sort vs. hash revisited," in Proc. VLDB Endow., vol. 7, no. 1, 2013, pp. 85–96.
[29] M. Imani, S. Gupta, S. Sharma, and T. S. Rosing, "NVQuery: Efficient query processing in nonvolatile memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 4, pp. 628–639, Apr. 2019.
[30] L. Zhao, Q. Deng, Y. Zhang, and J. Yang, "RFAcc: A 3D ReRAM associative array based random forest accelerator," in Proc. ACM Int. Conf. Supercomput. (ICS), 2019, pp. 473–483.
[31] S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, "Beyond the wall: Near-data processing for databases," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2015, pp. 1–10.
[32] L. Li, H. Wang, J. Li, and H. Gao, "A survey of uncertain data management," Front. Comput. Sci., vol. 14, no. 1, pp. 162–190, 2020.
[33] M. Zhang, H. Wang, J. Li, and H. Gao, "Diversification on big data in query processing," Front. Comput. Sci., vol. 14, no. 4, 2020, Art. no. 144607.
[34] J. Cao and R. Li, "Fixed-time synchronization of delayed memristor-

Huize Li (Graduate Student Member, IEEE) is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China.
His current research interests include computer architecture and emerging nonvolatile memory.

Hai Jin (Fellow, IEEE) received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1994.
He is a Cheung Kung Scholars Chair Professor of computer science and engineering with HUST. He worked with the University of Hong Kong, Hong Kong, from 1998 to 2000, and as a Visiting Scholar with the University of Southern California, Los Angeles, CA, USA, from 1999 to 2000. He has coauthored 15 books and published over 600 research papers. His research interests include computer architecture, virtualization technology, cluster computing and cloud computing, peer-to-peer computing, network storage, and network security.
Dr. Jin was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. In 1996, he was awarded a German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz in Germany. He is a fellow of CCF and a member of ACM.

Long Zheng (Member, IEEE) received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016.
He is currently an Associate Professor with the School of Computer Science and Technology, HUST. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.

Xiaofei Liao (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2005.
He is currently the Vice Dean with the School of Computer Science and Technology, HUST. He has served as a Reviewer for many conferences and jour-
based recurrent neural networks,” Sci. China Inf. Sci., vol. 60, no. 3, nal papers. His research interests are in the areas of
2017, Art. no. 032201. system software, P2P system, cluster computing, and
[35] D. Wang, W. Zhao, W. Chen, H. Xie, and W. Yin, “Fully coupled streaming services.
electrothermal simulation of resistive random access memory (RRAM) Dr. Liao is a Member of the IEEE Computer
array,” Sci. China Inf. Sci., vol. 63, no. 8, 2020, Art. no. 189401. Society.
