ReSQM: Accelerating Database Operations Using ReRAM-Based Content Addressable Memory
Abstract—The huge amount of data puts great pressure on the processing efficiency of database systems. By leveraging the in-situ computing ability of emerging nonvolatile memory, processing-in-memory (PIM) technology shows great potential in accelerating database operations against traditional architectures without data movement overheads. In this article, we introduce ReSQM, a novel ReCAM-based accelerator, which can dramatically reduce the response time of database systems. The key novelty of ReSQM is that some commonly used database queries that would otherwise be processed inefficiently in previous studies can be accomplished in situ with massively high parallelism by exploiting the PIM-enabled ReCAM array. ReSQM supports some typical database queries (such as SELECTION, SORT, and JOIN) effectively despite the limited computational mode of the ReCAM array. ReSQM is also equipped with a series of hardware-algorithm co-designs to maximize efficiency. We present a new data mapping mechanism that allows in-situ in-memory computations for SELECTION queries operating upon intermediate results. We also develop a count-based ReCAM-specific algorithm to enable in-memory sorting without any row swapping. Relational comparisons are integrated for accelerating inequality join by making a few modifications to the ReCAM cells with negligible hardware overhead. The experimental results show that ReSQM can improve the (energy) efficiency by 611× (193×), 19× (17×), 59× (43×), and 307× (181×) in comparison with a 10-core Intel Xeon E5-2630v4 processor for SELECTION, SORT, equi-join, and inequality join, respectively. In contrast to state-of-the-art CMOS-based CAM, GPU, FPGA, NDP, and PIM solutions, ReSQM can also offer 2.2×–39× speedups.

Index Terms—Content addressable memory (CAM), database query, nonvolatile memory, processing-in-memory (PIM).

I. INTRODUCTION

Database systems face real-time data analytics demand in a wide variety of data-intensive applications (such as biodiversity research [2]), such that the response time of database operations must be much faster than ever before.

A wealth of the existing database systems are built upon CPUs [3], [27], [32], [33], which, however, find it difficult to satisfy the low-latency requirement due to their limited computational parallelism [8]. Alternatively, some efforts have been made to accelerate database operations with dedicated hardware. For instance, traditional CMOS-based content addressable memory (CAM) has been developed as a coprocessor for the CPU to achieve data-parallel computing for multiple database operations. However, it still relies on the CPU to manage data transfer between the CAM and main memory. In addition, due to the well-known scalability problem of CMOS transistors, the computing ability of CMOS-based CAM often suffers greatly in practice [9], [10]. Many studies leverage the massive parallelism of GPUs [11], [12], [14], [15] (or FPGAs [16], [18]) for (energy) efficiency improvement. Nevertheless, because of the separate computation-storage hierarchy of the von Neumann architecture, these earlier studies suffer from the "memory wall" problem.

To address the above problem, near-data processing (NDP) integrates processing units into the memory or storage. Although significant data movement can be reduced by an NDP accelerator, such designs still suffer from the limited computing ability of the logic units in memory and considerable integration cost [19]–[21], [31]. Processing-in-memory (PIM) technology provides a promising alternative with the in-situ computing ability of emerging nonvolatile memory.
However, different database operations involve different peripheral circuit layouts, making their design extraordinarily complex.

Recently, there has emerged ReRAM-based content addressable memory (ReCAM), which takes the best of both worlds of nonvolatile ReRAM [34], [35] and specialized CAM hardware, with large capacity and the PIM feature [24]. In addition to scalar comparison, ReCAM is also naturally capable of making comparisons at a vector granularity, also known as vector–scalar comparison, at a time with higher parallelism. ReCAM is promising to enable in-situ in-memory computing to handle the database table for a wide variety of database operations efficiently. More importantly, the array structure of ReCAM can be intuitively regarded as a database table layout, making for easy access to data and a fast mapping onto ReCAM crossbars.

Nevertheless, exploiting ReCAM for accelerating database queries remains tremendously challenging. First, to support processing a database query, it is challenging to store and handle a lot of intermediate results. NVQuery [29] presents the first ReCAM-based accelerator for database operations. However, in order to obtain the final results of a query, NVQuery often relies on the main processor to process the intermediate results. Therefore, substantial data movements can be incurred between the processor and the ReCAM array, limiting the overall efficiency. Second, since ReCAM functions as both storage and processing units, the raw data of the database table in ReCAM must be kept consistent, without data pollution, for subsequent operations. This requirement may potentially suppress the efficiency of many database operations, such as SORT, which often involves (substantial) data reordering (if not carefully designed). Besides, vector–scalar comparison in ReCAM can compute only the equality between a given number and every element in a vector, limiting the applicability to some database operations, such as inequality join, which needs to know the relative order [7].

In this article, we make the following contributions.
1) We identify that the existing PIM-based database-oriented accelerators can support only a subset of database operations. None can support SELECTION, SORT, and JOIN queries simultaneously. Moreover, these existing studies also typically rely on the main processor to assist the PIM architecture in handling a lot of intermediate results, which can become a bottleneck limiting the overall efficiency. To the best of our knowledge, ReSQM is the first ReCAM-based architecture that can process various database queries in memory effectively and efficiently without the assistance of a CPU processor.
2) We develop a series of hardware-algorithm co-designs to improve the efficiency of performance acceleration on different database operations. For SELECTION, we present a new data mapping mechanism that allows in-situ in-memory computations of the SELECTION query operating upon intermediate results. For SORT, we develop a count-based ReCAM-specific algorithm to enable in-memory sorting. For inequality join, we make a slight modification to the basic ReCAM cell to support the relational comparison with negligible hardware overhead.
3) We conduct a comprehensive evaluation. We compare ReSQM with not only the traditional CPU-based, GPU-based, FPGA-based, and CMOS-based efforts but also the emerging NDP-enabled and PIM-enabled accelerators. Results show that ReSQM outperforms the state of the art significantly.

The remainder of this article is organized as follows. Section II describes the background and motivation. Section III presents the architectural designs of ReSQM. Section IV shows the experimental results. Section V concludes the work.

II. BACKGROUND AND MOTIVATION

A. Database Operations

In this article, we mainly focus on the relational database since it is widespread in the current mainstream market. In a relational database, records with the same attributes are called tuples. In general, the tuples are laid out row by row to form a table, and each column of the table indicates an attribute. In this article, we focus on three fundamental kernels of database queries as follows.

SELECTION: The selection query aims to choose tuples by querying a table via a restricted statement, which usually contains several arithmetic expressions connected with each other using various logical operators, such as AND, OR, NAND, and NXOR. Arithmetic operators may also get involved in the arithmetic expressions, e.g., "+," "−," "×," "=," "≠," "≤," "≥," ">," "<."

SORT: The sort query aims to reorder the tuples in an expected (e.g., ascending or descending) order according to some attributes.

JOIN: The join query aims to generate a new table using the Cartesian product over two relational attributes. In this article, we consider two typical join operations: 1) equi-join and 2) inequality join. The former indicates a join condition containing the equality operator =. The latter represents a join condition with inequality operators, e.g., > and <.

B. ReCAM Basics

Fig. 1 illustrates the basics of ReCAM, which consists of a MASK register, a KEY register, an array of ReCAM bit-cells organized in a crossbar architecture, and TAG registers. The MASK register decides which columns will be selected for read, write, and match operations. The KEY register stores a data word that will be used for a write or match operation. As shown in Fig. 1(a), a ReCAM bit-cell is organized as a two-transistor, two-memristor (2T2R) element with one bit line and one bit-not line. The match/word line of the ReCAM array is attached to a TAG register [Fig. 1(b)] in which each ReRAM array row is connected to a sense amplifier (SA) and a TAG latch. The TAG registers mark those matched rows that satisfy the condition of comparison. Unlike the row-oriented or column-oriented storage in a traditional memory [11], [28], the ReCAM crossbar is a natural fit to store the database table
4032 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 39, NO. 11, NOVEMBER 2020

Fig. 1. Basics of the ReCAM array. (a) Sketch of ReCAM bitcell. (b) TAG register organization.

Fig. 2. Common computational patterns for (a) SELECTION, (b) SORT, and (c) JOIN.

with bit lines representing attributes and each match line showing a tuple. By using ReCAM, we can perform vector–scalar comparisons with massive parallelism.

Following [25] and [30], we use the high-resistance state (HRS) to represent logic "1" (i.e., the switch-off state), while the low-resistance state (LRS) represents logic "0" (i.e., the switch-on state). Since a ReCAM cell uses two memristive cells to represent one logic bit, we use the "10" pattern of the two memristive cells to represent logic 1, and vice versa.

Vector–Scalar Comparison: Initially, the given scalar data to be compared are stored in the KEY register. All match lines are precharged with high voltage, while the KEY register is set on the bit and bit-not lines. Note that the precharge signal and the signals operating upon the bit line and bit-not line of the KEY register are activated at the same time. The bit and bit-not lines of those columns that do not need to be compared are set to low voltage by the MASK register. For each row (i.e., a vector element), if all selected bits match the given data, the corresponding precharged word line keeps its high voltage, which can be captured by the corresponding SA and also held in the TAG latch. Otherwise, if any one bit mismatches, leakage current will flow through that cell, and the voltage of the word line will drop off. Note that all per-row vector elements of the selected columns can be compared against the scalar data in parallel and finished in one cycle.

C. Related Work

GPU and FPGA Acceleration: A lot of effort has been put into speeding up database operations on traditional architectures, such as GPUs and FPGAs [11], [12], [14], [15], [18]. For instance, Schaa and Kaeli [12] pointed out that the peripheral component interconnect express (PCIe) bus will become a bottleneck on multiple GPUs unless the complete dataset can be placed in GPU memory. StoreGPU proposes to accelerate several hashing-based primitives for a distributed storage system [17]. By initializing the input data in pinned host memory, StoreGPU saves the GPU driver an extra memory copy, reducing data transfers. Asymmetric distributed shared memory [13] is proposed to maintain a shared logical memory space for reducing the amount of data movement between the host and the accelerator. An in-memory FPGA-based architecture has been developed to accelerate table joins [16]. Compared with CPUs, these studies can provide superior results. Also, both GPU and FPGA acceleration of SQL operations can be designed with the flexibility to deal with a larger set of SQL operations, types, and column/row sizes. However, GPUs and FPGAs currently still suffer from limited memory size, such that they have to read/write through the host system from/to SSD/HDD storage, with I/O bottlenecks.

NDP and PIM Accelerators: Near-data computing integrates processing units into storage or memory to reduce data access overhead [19]–[21], [31]. Although near-data computing can improve computing efficiency by reducing data movement, it still faces several challenges. The processing ability of the computing logic integrated into storage and memory is quite limited, and computational parallelism also suffers. Integrating logic units into stacked memory dies may also lead to a potentially high cost.

Sun et al. [22] presented the first PIM-enabled design based on ReRAM to accelerate SQL query operations. Due to the limited computational paradigm of the ReRAM array, this work can support only some operations of a SELECTION query. ReCAM has been widely used in many fields. Yavits et al. [23] replaced the last-level cache with ReCAM as an associative processor. Kaplan et al. [25] leveraged ReCAM to accelerate the Smith–Waterman algorithm for DNA sequence alignment. To the best of our knowledge, NVQuery [29] is the most related ReCAM-based work specialized for accelerating database applications.

NVQuery presents a heterogeneous solution. It supports some basic database operations based on ReCAM, such as nearest-distance search, equi-join, and some bitwise operations. To obtain the final results of a query, NVQuery relies on the main processor to process the intermediate results. Therefore, a substantial amount of data must be transferred between the processor and the ReCAM array, limiting the
LI et al.: ReSQM: ACCELERATING DATABASE OPERATIONS USING ReRAM-BASED CAM 4033

Fig. 3. Overview of ReSQM. (a) Layout of ReSQM chip. (b) Architecture of database structure query unit. (c) Truth table for supporting arithmetic and logical operators. (d) Interconnect of DSQ mat. (e) Table layout for the ReCAM array. The pink arrow shows the workflow for the SELECTION query. The blue arrow indicates the workflow of the SORT and JOIN queries.

overall efficiency. This is particularly true and serious for handling a large database table. ReSQM differs from NVQuery in two ways: 1) we use an in-memory reserved region to buffer the intermediate results so as to perform operations between the intermediate results and the original data in memory and 2) each database structured query (DSQ) unit self-contains arithmetic and logic units (ALUs) and a stack register, which can parse a restricted expression in SELECTION queries (without the assistance of the main processor), avoiding the substantial data transfer overheads. We particularly note that the way of performing an addition operation in [29] is based on breaking it down into a series of NOR operations, which is entirely different from ours, which applies a lightweight and straightforward truth table (inspired by [24] and [25]). Although both are based on ReCAM, the architectures are different, and all algorithms that drive the database operations are also different.

D. Motivation

Fig. 2 shows the computational patterns of SELECTION, SORT, and JOIN. We observe that database operations often involve many different practical demands that may be beyond the vector–scalar comparison pattern of the ReCAM array. For example, in addition to comparison operators, the restricted expressions in SELECTION often involve many noncomparison operators operating upon bitwise vector–vector computing, where only elements in the same row of the two vectors need to take part in the computation [Fig. 2(a)]. Although SORT has vector–scalar meta-operations [as shown in Fig. 2(b)], the existing sorting algorithms (such as radix sort and merge sort) also involve row swappings that ReCAM cannot support effectively. More importantly, the matching principle of ReCAM, based on the leakage-current mechanism, can check only whether two elements are equal or not (i.e., equality comparison). However, many comparisons among database operations (such as inequality join) need to know which element is greater or smaller (i.e., relational comparison).

III. ReSQM

Fig. 3(a) shows the overview of the ReSQM chip, which consists of multiple DSQ units connected through a bus that is used for receiving queries from users and sending results back. Initially, we partition a database table into multiple slices such that each slice can fit into a DSQ unit. For handling an even larger table that cannot fit into ReSQM's memory entirely, ReSQM can also work effectively by putting the large table in solid-state-disk (SSD) storage and sending it to ReSQM in batches for processing.

In this section, we first elaborate on the architectural details of ReSQM and then show the fundamental designs for accelerating different database queries.

A. Architecture

Fig. 3(b) shows the architecture of a DSQ unit, which is the core that performs the execution of every received query. Note that ReSQM currently processes all queries serially; supporting concurrent query execution can be considered interesting future work. A DSQ unit contains the necessary components to support effective query execution. We discuss them as follows.
1) Structured Query (SQ) Buffer: It is mainly used to store the queries that will be processed by the DSQ unit. It also serves to identify the type of a query, i.e., JOIN, SORT, or SELECTION.
2) ALUs and Stack Register: ReSQM includes some simple ALUs to convert a restricted expression used in SELECTION queries into a suffix (postfix) form that reflects the correct execution priority of the operators. The stack register stores the operands and operators during expression parsing. The ALUs and the stack register enable ReSQM to work independently of the CPU to accelerate database queries.
3) Look-Up Table (LUT): It is introduced to enable ReCAM to support basic arithmetic and logic operations based on the comparison paradigm of ReCAM. The LUT stores the precalculated truth tables of basic instructions, as shown in Fig. 3(c).
TABLE I
COMMON ROW-WISE VECTOR–VECTOR INSTRUCTIONS USED IN DATABASE OPERATIONS

4) Address Information: This is an address register that records and specifies in which columns of the DSQ mat a particular attribute is stored.
5) Ctrl: This is a local microcontroller that manages the components in the DSQ unit to perform the corresponding database operations and sends control signals to the DSQ mat.
6) DSQ Mat: It is the main storage and computing component in the DSQ unit. It contains many processing elements that are connected through an H-tree, as shown in Fig. 3(d).
7) Processing Element (PE): Similar to prior work [24], we configure each PE with a ReRAM array of 512 rows and 512 columns. Fig. 3(e) shows a sketch of the data organization. We reserve the first 64 b (marked with "R") as a buffer to store the intermediate results (when necessary). The rest of the columns in a PE are used to store the table.
8) SSD: An off-chip SSD is optionally used to store a large number of JOIN query results when the ReCAM's on-chip memory is not sufficient.

In the ReSQM design, we hold the argument that ReCAM arrays should function as both storage and computing units to eliminate the data movement between the processor and the memory. Based on this design philosophy, we next present how these key components are designed to accelerate SELECTION, SORT, and JOIN operations effectively and efficiently by exploiting ReCAM.

B. Accelerating SELECTION Queries

ReCAM can perform a variety of bitwise operations based on its vector–scalar comparison paradigm [24], [25]. To support SELECTION, NVQuery [29] proposes to transform the operations of a SELECTION query into a series of bitwise operations, which can generate a large number of intermediate results. To obtain the final results of the query, NVQuery relies on the main processor to process these intermediate results. Since the restricted expression in a SELECTION query contains a variety of bitwise operations, NVQuery often needs to transfer lots of intermediate results to the processor, degrading the overall efficiency.

To avoid the off-chip data transfers, we make two core design choices to perform a SELECTION query in memory. First, we reserve some memory space as the R region, as shown in Fig. 3(e), which is used to store the intermediate results of SELECTION queries. Since the R region also has computing ability, the intermediate results can be computed with the original data to generate the next intermediate results, which can be further stored in the R region. In this way, ReSQM can get the final results of SELECTION queries by rational use of the R region. Second, we architect some ALUs and a stack register in each DSQ unit, as shown in Fig. 3(b). The ALUs parse the restricted expression of a SELECTION query and obtain its correct execution order, which is stored in the stack register in the form of operands and operators. The processing of the restricted expressions is as follows.

In the beginning, two operands and one operator are popped from the stack register and sent to the address information register and the LUT, respectively. According to the truth table of this operator in the LUT, the controller generates suitable signals for the DSQ mat. With the control signals applied to the memory addresses of these two operands recorded by the address information register, the DSQ mat calculates the result of these two vectors, and the result is stored in the R region. After that, the stack register pops an operator and an operand out to control the corresponding vector calculation with the intermediate results. When all operands have been processed, we get the final result of the restricted expression in the R region. A logic 1 stored in the R region means that the value of the restricted expression is true for this tuple. Finally, ReSQM can get the results of this SELECTION query through a memory read request according to the values in the R region.

Example: Suppose two 32-b numbers "A" and "B" need an addition. Next, we introduce how this simple operation is performed on ReSQM. This is a typical multibit addition case [23]–[25], which is often processed by transforming it into multiple single-bit additions. Then, we can use a truth table for the single-bit addition to process the operation. The procedure works as follows. First, the lowest bits of A and B undergo a single-bit addition. The carry-out and result are stored in the R region. Afterward, the carry-out works as the carry-in of an addition with the second-lowest bits of A and B to generate the next carry-out and result. This process is repeated until the highest bits of A and B are processed. Finally, the R region holds the final result of "A + B." Note that the additions of the element pairs in different rows are computed in parallel.

The multiplication can be considered as multistep additions [23]. The row-wise max instruction is used to find the maximum number among the corresponding elements of two vectors in the same row.

In ReSQM, we perform the addition of two 32-b vectors by operating on an 8-row truth table row by row. This thus takes 32 × 8 × 2 = 512 cycles for a vector–vector addition. Other instructions are similar. Table I lists the instructions supported in ReSQM.

C. Accelerating SORT Queries

Sorting on some attribute column based on the ReCAM array often needs first to perform interrow comparisons and then reorder the attribute values in different rows. However, the ReCAM array supports intercolumn comparison only. Also, a substantial amount of row swapping would be involved, which ReCAM cannot support effectively.
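To illustrate how sorting can avoid row swapping altogether, here is a minimal software sketch of a count-based ranking scheme, in the spirit of the count-based algorithm mentioned in Section I. This is our own illustrative model (identifiers are ours), not ReSQM's hardware algorithm: the rank of each key is obtained purely by counting comparison matches, so no row in the array is ever moved.

```python
# Count-based, swap-free sorting sketch. Each key's rank is the number of
# smaller keys plus the number of equal keys in earlier rows (a stable
# tie-break), so the sorted order is just a permutation of row indices.

def count_based_ranks(keys):
    """Return the output position (rank) of every row without moving rows."""
    ranks = []
    for row, key in enumerate(keys):
        smaller = sum(1 for other in keys if other < key)
        earlier_equal = sum(1 for r in range(row) if keys[r] == key)
        ranks.append(smaller + earlier_equal)
    return ranks

def sorted_view(keys):
    """Materialize the ascending order by reading rows rank by rank."""
    ranks = count_based_ranks(keys)
    out = [None] * len(keys)
    for row, rank in enumerate(ranks):
        out[rank] = keys[row]
    return out

print(sorted_view([42, 7, 7, 19]))  # [7, 7, 19, 42]
```

In this software model the counting is a loop over rows; on a ReCAM array, the per-row comparisons against a given key can instead be issued as parallel match operations, which is what makes the scheme attractive in memory.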
TABLE III
ATTRIBUTE DISTRIBUTION OF THE TABLE M AT DIFFERENT SIZES, AND THE TABLE N

Fig. 7. Response time of ReSQM against CPU for all the four database operations on the table M at different sizes.

Fig. 9. Response time of ReSQM against CPU with varying query result sizes. All results are obtained on M@16M. (a) SE. (b) SO. (c) EJ. (d) IJ.

TABLE IV
COMPARISONS BETWEEN THE ORIGINAL ReCAM ARRAY AND OUR MODIFIED ARRAY

TABLE V
OVERHEAD BREAKDOWN

TABLE VI
ENERGY BREAKDOWN

TABLE VII
AREA BREAKDOWN

TABLE VIII
PERFORMANCE OF ReSQM AGAINST GPU, FPGA, NDP, AND PIM PLATFORMS (NORMALIZED TO CPU PLATFORM)
NVQuery [29] for SELECTION and equi-join queries. To This work is just small-step research of using ReCAM to
ensure fairness, we evaluate ReSQM against these baselines accelerate some database queries. Although supporting strings
by running the same benchmarks on the same workloads. remains an open question, we believe that ReSQM still has
Performance Comparisons and Analysis: Table VIII shows addressed several critical challenges in this timely topic and
the performance results for GPU, FPGA, NDP, and PIM plat- would facilitate the subsequent research of handling strings
forms. We can see that ReSQM shows the best performance effectively and efficiently.
by the speedups of 15×, 2.2×, 6.8×, and 39× over the better
performer among GPU, FPGA, NDP, and PIM platforms for V. C ONCLUSION
SE, SO, EJ, and IJ, respectively. For SELECTION, the NDP
This article identified a spectrum of comparison semantics
accelerator offers the worst acceleration effect compared with
in the relational database operations. We introduce ReSQM,
other platforms. This is because, for a large table, [31] relies
a novel ReCAM-based accelerator, which can boost the
on a CPU to process lots of operators and intermediate results.
performance for many typical database operations by flexi-
Thus, the data transfer bottleneck limits the overall efficiency.
bly exploiting the inherent parallelism of the ReCAM array.
Compared with NVQuery [29], ReSQM offers more than 30×
Results showed ReSQM significantly outperform existing
speedup, due to the reduced number of intermediate result
CPU, CMOS-based CAM, GPU, FPGA, NDP, and PIM solu-
transfers. Since SELECTION is as simple as being with good
tions by the orders of magnitude improvement in terms of
data parallelism, GPU and FPGA platforms show the superior
the speedups of (2.2× ∼ 39×), and ReSQM also achieved
results over NVQuery for all database tables.
17× ∼ 193× energy saving compared with the CPU baseline.
Note that ReSQM on SO shows a relatively less speedup
than those on SE, EJ, and IJ due to the underutilization of
ReCAM bit-cells. Actually, only 5% of bit-cells are used for ACKNOWLEDGMENT
SO in ReSQM. The rest (unrelated to a sorting attribute) The authors would like to thank the anonymous reviewers
is aggressively disabled for correctness. On accelerating SO for their insightful comments and valuable feedback.
faster by fully utilizing ReCAM resources better, we leave it
as future work. For equi-join, which represents higher com- R EFERENCES
plexities than SELECTION, we see that NVQuery becomes [1] K. G. Coffman and A. M. Odlyzko, “Internet growth: Is there a “Moore’s
superior against GPU and FPGA. Without the lookup over- law” for data traffic?” in Handbook of Massive Data Sets. Boston, MA,
heads of LUT, ReSQM offers more than 6× speedup over USA: Springer, 2001, pp. 47–93.
[2] S. Kelling et al., “Data-intensive science: A new paradigm for biodiver-
NVQuery. For inequality join, only GPU and ReSQM can sity studies,” BioScience, vol. 59, no. 7, pp. 613–620, 2009.
support it currently. However, we still find that ReSQM out- [3] M. Korkmaz, M. Karsten, K. Salem, and S. Salihoglu, “Workload-aware
performs GPU by 39×, due to the in-situ computing ability CPU performance scaling for transactional database systems,” in Proc.
and massive parallelism of the ReCAM array. ACM SIGMOD Int. Conf. Manag. Data, 2018, pp. 291–306.
[4] K. Ono and G. M. Lohman, “Measuring the complexity of join
For the CMOS-based CAM, it often suffers from the severe enumeration in query optimization,” in Proc. VLDB, 1990, pp. 314–325.
scalability issue with the limited dataset supported. To facil- [5] G. Smith, PostgreSQL 9.0 High Performance. Birmingham, U.K.: Packt
itate comparison with the existing work, we use similar Publ., 2010.
[6] D. R. Augustyn and L. Warchal, “GPU-accelerated method of query
workloads to [9] by performing SO on a 40 000-tuple table, selectivity estimation for non equi-join conditions based on discrete
and running EJ and IJ on two tables with 20 000 tuples Fourier transform,” in New Trends in Database and Information
and 40 000 tuples, respectively. For SO, EJ, and IJ, CMOS- Systems II. Cham, Switzerland: Springer, 2015, pp. 215–227.
[7] P. Mishra and M. H. Eich, “Join processing in relational databases,”
based CAM can offer the speedups of 1.59×, 7.3×, and ACM Comput. Surveys, vol. 24, no. 1, pp. 63–113, 1992.
11.2× against CPU, while our accelerator offers 7.7×, 21×, and 136×.

F. Discussion

So far, using ReCAM to handle string types remains difficult, mainly because an effective data mapping is lacking: 1) a fixed-size character encoding makes it difficult, if not impossible, to support arbitrary-length strings. Suppose strings are drawn from the 26 English letters, so that each character is represented by 5 bits; one row of the ReCAM array can then often hold a string of at most about 50 characters; 2) multilevel cells (MLCs) can mitigate this issue and support relatively long strings, but they require a strict MLC production process and also introduce a precision problem; and 3) using a fixed ReCAM array size to support irregular strings is also difficult, as it requires a careful tradeoff between computational parallelism and storage efficiency.

[8] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, 1995.
[9] N. Bandi, D. Agrawal, and A. E. Abbadi, "Fast computation of database operations using content-addressable memories," in Proc. 17th Int. Conf. Database Expert. Syst. Appl. (DEXA), 2006, pp. 389–398.
[10] D. Agrawal and A. E. Abbadi, "Hardware acceleration for database systems using content-addressable memories," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2005, pp. 1–7.
[11] P. Bakkum and K. Skadron, "Accelerating SQL database operations on a GPU with CUDA," in Proc. 3rd Workshop Gen. Purpose Comput. Graph. Process. Units (GPGPU), 2010, pp. 94–103.
[12] D. Schaa and D. Kaeli, "Exploring the multiple-GPU design space," in Proc. IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), 2009, pp. 1–12.
[13] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2010, pp. 347–358.
[14] N. Satish et al., "Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2010, pp. 351–362.
[15] T. Kaldewey, G. Lohman, R. Mueller, and P. Volk, "GPU join processing revisited," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2012, pp. 55–62.
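The fixed-width mapping arithmetic in the discussion above can be sketched as follows. The 256-bit row width, the lowercase alphabet, and the function names are illustrative assumptions, not parameters taken from the ReSQM design:

```python
import math

# Hypothetical parameters: a 26-letter alphabet packed at
# ceil(log2(26)) = 5 bits per character into one ReCAM row.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BITS_PER_CHAR = math.ceil(math.log2(len(ALPHABET)))  # 5
ROW_BITS = 256  # assumed row width, not specified in the text

def max_chars_per_row(row_bits=ROW_BITS, bits_per_char=BITS_PER_CHAR):
    # A 256-bit row holds floor(256 / 5) = 51 characters, i.e.,
    # roughly the 50-character ceiling noted in the discussion.
    return row_bits // bits_per_char

def encode(s):
    # Pack a lowercase string into one integer, 5 bits per character;
    # strings longer than one row cannot be mapped at this width.
    if len(s) > max_chars_per_row():
        raise ValueError("string does not fit in one ReCAM row")
    code = 0
    for ch in s:
        code = (code << BITS_PER_CHAR) | ALPHABET.index(ch)
    return code
```

The sketch makes the tradeoff concrete: shrinking the per-character width raises the row capacity but shrinks the supported alphabet, while widening it has the opposite effect.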
LI et al.: ReSQM: ACCELERATING DATABASE OPERATIONS USING ReRAM-BASED CAM 4041

[16] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2014, pp. 151–160.
[17] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, "StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems," in Proc. Int. Symp. High Perform. Distrib. Comput. (HPDC), 2008, pp. 165–174.
[18] B. Sukhwani et al., "Database analytics acceleration using FPGAs," in Proc. Int. Conf. Parallel Archit. Comp. Tech. (PACT), 2012, pp. 411–420.
[19] J. Do, Y. Kee, J. M. Patel, C. Park, K. Park, and D. J. Dewitt, "Query processing on smart SSDs: Opportunities and challenges," in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2013, pp. 1221–1230.
[20] Y. Kang, Y. Kee, E. L. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proc. IEEE Symp. Mass Stor. Syst. Tech. (MSST), 2013, pp. 1–12.
[21] R. Balasubramonian et al., "Near-data processing: Insights from a MICRO-46 workshop," IEEE Micro, vol. 34, no. 4, pp. 36–42, Jul./Aug. 2014.
[22] Y. Sun, Y. Wang, and H. Yang, "Bidirectional database storage and SQL query exploiting RRAM-based process-in-memory structure," ACM Trans. Stor., vol. 14, no. 1, p. 8, 2018.
[23] L. Yavits, A. Morad, and R. Ginosar, "Computer architecture with associative processor replacing last-level cache and SIMD accelerator," IEEE Trans. Comput., vol. 64, no. 2, pp. 368–381, Feb. 2015.
[24] L. Yavits, S. Kvatinsky, A. Morad, and R. Ginosar, "Resistive associative processor," IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 148–151, Jul.–Dec. 2015.
[25] R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, "A resistive CAM processing-in-storage architecture for DNA sequence alignment," IEEE Micro, vol. 37, no. 4, pp. 20–28, Aug. 2017.
[26] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, "RAPL: Memory power estimation and capping," in Proc. ACM/IEEE Int. Symp. Low Power Elect. Design (ISLPED), 2010, pp. 189–194.
[27] S. Blanas and J. M. Patel, "Memory footprint matters: Efficient equi-join algorithms for main memory data processing," in Proc. Annu. Symp. Cloud Comput. (SOCC), 2013, pp. 1–16.
[28] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu, "Multi-core, main-memory joins: Sort vs. hash revisited," in Proc. VLDB Endow., vol. 7, no. 1, 2013, pp. 85–96.
[29] M. Imani, S. Gupta, S. Sharma, and T. S. Rosing, "NVQuery: Efficient query processing in nonvolatile memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 4, pp. 628–639, Apr. 2019.
[30] L. Zhao, Q. Deng, Y. Zhang, and J. Yang, "RFAcc: A 3D ReRAM associative array based random forest accelerator," in Proc. ACM Int. Conf. Supercomput. (ICS), 2019, pp. 473–483.
[31] S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, "Beyond the wall: Near-data processing for databases," in Proc. Int. Workshop Data Manag. New Hardw. (DaMoN), 2015, pp. 1–10.
[32] L. Li, H. Wang, J. Li, and H. Gao, "A survey of uncertain data management," Front. Comput. Sci., vol. 14, no. 1, pp. 162–190, 2020.
[33] M. Zhang, H. Wang, J. Li, and H. Gao, "Diversification on big data in query processing," Front. Comput. Sci., vol. 14, no. 4, 2020, Art. no. 144607.
[34] J. Cao and R. Li, "Fixed-time synchronization of delayed memristor-based recurrent neural networks," Sci. China Inf. Sci., vol. 60, no. 3, 2017, Art. no. 032201.
[35] D. Wang, W. Zhao, W. Chen, H. Xie, and W. Yin, "Fully coupled electrothermal simulation of resistive random access memory (RRAM) array," Sci. China Inf. Sci., vol. 63, no. 8, 2020, Art. no. 189401.

Huize Li (Graduate Student Member, IEEE) is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China.
His current research interests include computer architecture and emerging nonvolatile memory.

Hai Jin (Fellow, IEEE) received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1994.
He is a Cheung Kong Scholars Chair Professor of computer science and engineering with HUST. He worked with the University of Hong Kong, Hong Kong, from 1998 to 2000, and as a Visiting Scholar with the University of Southern California, Los Angeles, CA, USA, from 1999 to 2000. He has coauthored 15 books and published over 600 research papers. His research interests include computer architecture, virtualization technology, cluster computing and cloud computing, peer-to-peer computing, network storage, and network security.
Dr. Jin was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. In 1996, he was awarded a German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz in Germany. He is a fellow of the CCF and a member of ACM.

Long Zheng (Member, IEEE) received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016.
He is currently an Associate Professor with the School of Computer Science and Technology, HUST. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.

Xiaofei Liao (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2005.
He is currently the Vice Dean of the School of Computer Science and Technology, HUST. He has served as a reviewer for many conferences and journal papers. His research interests are in the areas of system software, P2P systems, cluster computing, and streaming services.
Dr. Liao is a Member of the IEEE Computer Society.