The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach

Puya Memarzia, Suprio Ray, Virendra C. Bhavsar
Faculty of Computer Science, University of New Brunswick, Canada. Email: {pmemarzi, sray, bhavsar}@unb.ca

Abstract—Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, rewriting the application from the ground up is not always feasible.

In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. Our findings indicate that the operating system default configurations can be detrimental to query performance. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. With extensive experimental evaluation, we demonstrate that the methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory query processing tasks, on three different hardware architectures. Furthermore, we show that these strategies can improve the performance of five popular database systems running a TPC-H workload. Lastly, we summarize our findings in a decision flowchart for practitioners.

I. INTRODUCTION

The digital world is producing large volumes of data at increasingly higher rates. The breadth of applications that depend on fast and efficient data processing has grown dramatically. Main memory query processing systems have been increasingly utilized to satisfy the growing demands of the data analytics industry [1]. As hardware moves toward greater parallelism and scalability, taking advantage of the hardware's full potential remains a key challenge for these systems.

NUMA architectures are pervasive in multi-socket and in-memory rack-scale systems, as well as a growing range of CPUs with on-chip NUMA. It is clear that NUMA is ubiquitous and here to stay, and that software needs to evolve and keep pace with these changes. Although these advances have opened a path toward greater performance, the burden of efficiently leveraging the hardware mostly falls on developers. NUMA systems include a wide range of CPU architectures, topologies, and interconnect technologies. As such, there is no standard for what a NUMA system's topology looks like. Due to the variety of NUMA topologies and applications, fine-tuning an algorithm to a single machine configuration will not necessarily deliver better performance for other machines. Furthermore, achieving optimal performance on different system configurations can be costly and time-consuming. As a result, we were motivated to pursue strategies that can improve performance across-the-board without code tuning.

In an effort to provide a general solution that speeds up applications on NUMA systems, some researchers have proposed using NUMA schedulers that co-exist with the operating system (OS). These schedulers monitor running applications in real time and attempt to improve performance by migrating threads and memory pages to address load balancing issues [2]–[4]. However, some of these approaches are not architecture- or OS-independent. For instance, Carrefour [5] requires an AMD CPU that is based on the K10 architecture, in addition to a modified OS kernel. Moreover, researchers have argued that these schedulers may not be beneficial for multi-threaded in-memory query processing [6].

Lately, researchers have started to pay attention to the issues affecting query performance on NUMA systems. These researchers have favored a more application-oriented approach that involves algorithmic tweaks to the application's source code, particularly in the context of query processing engines. Among these works, some are static solutions that attempted to make query operators NUMA-aware [7], [8]. Others are dynamic solutions that focused on work allocation to threads using work-stealing [9], data placement [10], [11], and task scheduling with adaptive data repartitioning [12]. These approaches can be costly and time-consuming to implement, and incorporating these solutions into commercial database engines will take time. Regardless, our work is orthogonal to these efforts, as we explore application-agnostic approaches to improve query performance.

Main memory query processing systems leverage data parallelism on large sets of memory-resident data, thus diminishing the influence of disk I/O. However, applications that are not NUMA-aware do not fully utilize the hardware's potential [10]. Furthermore, rewriting the application is not always an option. Solving this problem without extensively modifying the code requires tools and tuning strategies that are application-agnostic. In this work, we evaluate the viability and impact of several key parameters (shown in Table IV) that aim to achieve this. We demonstrate that significant performance gains can be achieved by managing dynamic memory allocators, thread placement and scheduling, memory placement policies, indexing, and the OS configuration. In this context, the impact and role of memory allocators have been under-appreciated and overlooked by researchers.

We center our investigation around five different memory-intensive query workloads (shown in Table I) that prominently feature joins and aggregations, arguably two of the most popular and computationally expensive workloads used in data analytics. We selected the open-source MonetDB, PostgreSQL, MySQL, and Quickstep database systems, as well as a commercial database system DBMSx for evaluation. These systems were selected due to their significantly divergent architectures as well as their popularity.

TABLE I: Experiment Workloads
Workload | SQL Equivalent
W1) Holistic Aggregation (Hashtable-based) [14] | SELECT groupkey, MEDIAN(val) FROM records GROUP BY groupkey;
W2) Distributive Aggregation (Hashtable-based) [14] | SELECT groupkey, COUNT(val) FROM records GROUP BY groupkey;
W3) Hash Join [15]; W4) Index Nested Loop Join (ART [16], Masstree [17], B+tree [18], Skip List [19]) | SELECT * FROM table1 INNER JOIN table2 ON table1.pk = table2.fk;
W5) TPC-H [20] | 22 analytical queries (Q1, Q2, ..., Q22)

An important finding from our research is that the default (out-of-the-box) OS environment can be surprisingly sub-optimal for high-performance query processing. For instance, the default Linux memory allocator ptmalloc can significantly lag behind other alternatives. Furthermore, with extensive experimental evaluation, we demonstrate that it is possible to systematically utilize application-agnostic (or black-box) approaches to obtain speedups on a variety of data analytics workloads. We show that a hash join workload can achieve a 3× speedup on Machine C (see machine topologies in Figure 1 and specifications in Table II), by replacing the memory allocator. This speedup can be further improved to 20× by optimizing the memory placement policy and modifying the OS configuration. We also show that our findings apply to other hardware configurations, by evaluating the experiments on three machines with different hardware architectures and NUMA topologies. Lastly, we show how database system performance can be improved by systematically modifying the default OS configuration and overriding the memory allocator. For example, we demonstrate that MonetDB's query latency in the TPC-H workload can be reduced by up to 43%.

The main contributions of this paper are as follows:
• Categorization and analysis of strategies to improve application performance on NUMA systems
• The first study on NUMA systems (to our knowledge) that explores the combined impact of different memory allocators, thread and memory placement policies, and OS-level configurations, on different analytics workloads
• Extensive experimental evaluation, involving different workloads, indexes and database systems on different machine architectures and topologies, with profiling and performance counters, and microbenchmarks
• A decision flowchart (Figure 10) to help practitioners speed up query processing on NUMA systems with minimal code modifications

The paper is organized as follows: we provide some background on the problem and the workloads in Section II. In Section III we discuss the strategies for improving query performance on NUMA systems. We present our setup and experimental results in Section IV. Finally, we discuss related work in Section V and conclude the paper in Section VI.

II. BACKGROUND

A NUMA system is divided into several NUMA nodes. Each node consists of one or more processors and their local memory resources. Multiple NUMA nodes are linked together using an interconnect to form a NUMA topology. The topology of our machines is shown in Figure 1. A local memory access involves data that resides on the same node, whereas accessing data on any other node is considered a remote access. Remote data travels over the interconnect, and may need to hop through one or more nodes to reach its destination. Consequently, remote memory access is slower.

In addition to remote memory access, contention is another possible cause of sub-optimal performance on NUMA systems. Due to the memory wall [13], modern CPUs are capable of generating memory requests at a very high rate, which can easily saturate the interconnect or memory controller bandwidth [3]. Lastly, the abundance of hardware threads in NUMA systems presents a challenge in terms of scalability, particularly in scenarios with many concurrent memory allocation requests. In Section III, we explore strategies which can be used to mitigate these issues.
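A minimal sketch of how a process can query this topology at runtime, assuming Linux with libnuma available (link with -lnuma):

```cpp
// Sketch: query the NUMA topology with libnuma (Linux, link with -lnuma).
// Shown only to make the topology discussion concrete; not part of the
// paper's experimental code.
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();
    std::printf("NUMA nodes: %d\n", max_node + 1);

    // numa_distance() reports the relative access cost between two nodes
    // (10 = local); remote nodes report larger values, mirroring the
    // per-hop latency ratios listed in Table II.
    for (int i = 0; i <= max_node; ++i) {
        for (int j = 0; j <= max_node; ++j)
            std::printf("%4d", numa_distance(i, j));
        std::printf("\n");
    }
    return 0;
}
```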
A. Experiment Workloads

Our goal is to analyze the effects of NUMA on query processing workloads, and show effective strategies to gain speedups in these workloads. We have selected five workloads, shown in Table I, to represent a variety of data operations that are common in data analytics and decision support systems. The implementation of these workloads is described in more detail in Section IV-B. We now provide some background on the experiment workloads.

Joins and aggregations are ubiquitous, essential data processing primitives used in many different applications. When used for in-memory query processing, they are notably demanding on cache and memory. Joins and aggregations are integral components in analytical queries and are frequently used in popular database benchmarks, such as TPC-H [20]. Although we do not evaluate transactional workloads such as TPC-C, we note that processing many concurrent transactions in-memory is also taxing on the cache and memory.

A typical aggregation workload involves grouping tuples by a designated grouping column and then applying an aggregate function to each group. Aggregate functions are divided into three categories: distributive, algebraic, and holistic. Distributive functions, such as the Count function used in W2 (see Table I), can be decomposed and processed in a distributed manner. This means that the input can be split up, processed, and recombined to produce the final result. Algebraic functions combine two or more distributive functions.

Fig. 1: Machine NUMA Topologies (machine specifications in Table II). Panels: (a) Machine A, (b) Machine B, (c) Machine C.
For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions, such as the Median function used in W1, are computed by analyzing the entire input at once. Although approximation can be used to accelerate holistic aggregation, accurate results require the processing of all input tuples for each group. As a result, these aggregate functions are typically more expensive in terms of computing resources.
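The distinction can be illustrated with a minimal, single-threaded sketch (the actual W1/W2 implementations use a shared concurrent hash table and multiple worker threads; see Section IV-B):

```cpp
// Illustrative sketch of W1/W2-style aggregation using std::unordered_map;
// the paper's implementation uses a shared concurrent hash table [35].
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Record { uint64_t groupkey; int64_t val; };

// Distributive aggregate (W2): a running count per group suffices; partial
// counts produced by different threads could simply be summed.
std::unordered_map<uint64_t, uint64_t> count_by_group(const std::vector<Record>& in) {
    std::unordered_map<uint64_t, uint64_t> counts;
    for (const auto& r : in) ++counts[r.groupkey];
    return counts;
}

// Holistic aggregate (W1): the exact median requires buffering every value
// of each group before the answer can be produced.
std::unordered_map<uint64_t, int64_t> median_by_group(const std::vector<Record>& in) {
    std::unordered_map<uint64_t, std::vector<int64_t>> groups;
    for (const auto& r : in) groups[r.groupkey].push_back(r.val);

    std::unordered_map<uint64_t, int64_t> medians;
    for (auto& kv : groups) {
        auto& vals = kv.second;
        auto mid = vals.begin() + vals.size() / 2;
        std::nth_element(vals.begin(), mid, vals.end());
        medians[kv.first] = *mid;  // middle element (upper median for even sizes)
    }
    return medians;
}
```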
W3 represents a hash join query. As described in [15], the query joins two tables with a size ratio of 1:16, which is designed to mimic common decision support systems. The join is performed by building a hash table on the smaller table and probing the larger table for matching keys.
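A minimal single-threaded sketch of this build-and-probe pattern (the actual W3 workload uses the parallel non-partitioning hash join implementation from [15]):

```cpp
// Simplified build-and-probe hash join, illustrating the structure of W3.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

std::vector<std::pair<Tuple, Tuple>>
hash_join(const std::vector<Tuple>& small_rel, const std::vector<Tuple>& large_rel) {
    // Build phase: hash table over the smaller relation.
    std::unordered_map<uint64_t, Tuple> ht;
    ht.reserve(small_rel.size());
    for (const auto& t : small_rel) ht.emplace(t.key, t);

    // Probe phase: scan the larger relation and look up matching keys.
    std::vector<std::pair<Tuple, Tuple>> result;
    for (const auto& t : large_rel) {
        auto it = ht.find(t.key);
        if (it != ht.end()) result.emplace_back(it->second, t);
    }
    return result;
}
```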
W4 is an index nested loop join using the same dataset as W3. The main difference between W3 and W4 is that W3 builds an ad hoc hash table to perform the join, whereas W4 uses a pre-built in-memory index that accelerates lookups to one of the join relations. W5 is a database system workload, using the well-known queries and datasets from the TPC-H benchmark [20]. We evaluate W5 on five database systems: MonetDB [21], PostgreSQL [22], MySQL [23], DBMSx, and Quickstep [24]. In order to analyze query performance under memory-bound (rather than I/O-bound) situations, we configure the databases to use large buffer caches where applicable. Furthermore, we measure multiple warm runs for each query.

III. IMPROVING QUERY PERFORMANCE ON NUMA SYSTEMS

Achieving good performance on NUMA systems involves careful consideration of thread placement, memory management, and load balancing. We explore application-agnostic strategies that can be applied to the data analytics application in either a black box manner, or with minimal tweaks to the code. Some strategies are exclusive to NUMA systems, whereas others may also yield benefits on uniform memory access (UMA) systems. These strategies consist of: overriding the memory allocator, defining a thread placement and affinity scheme, using a memory placement policy, and changing the operating system configuration. In this section, we describe these strategies and outline the options used for each one.

A. Dynamic Memory Allocators

Dynamic memory allocators track and manage dynamic memory during the lifetime of an application. The performance impact of memory allocators is often overlooked in favor of exploring ways to tweak the application's algorithms. It can be argued that this makes them one of the most under-appreciated system components. Both UMA and NUMA systems can benefit from faster or more efficient memory allocators. However, the potential is greater on NUMA systems, as the performance penalties caused by inefficient memory or cache behavior can be significantly higher. Key allocator attributes include allocation speed, fragmentation, and concurrency. Most developers use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete) and trust that their library will perform these operations efficiently. In recent years, with the growing popularity of multi-threaded applications, there has been a renewed interest in memory allocators, and several alternative allocators have been proposed. Earlier iterations of malloc used a single lock, resulting in serialized access to the global memory pool. Although recent malloc implementations provide support for multi-threaded scalability, there are now several competing memory allocators that aim for faster performance and reduced contention and memory consumption overhead. We evaluate the following allocators: ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc, mcmalloc, and supermalloc.

1) ptmalloc (pthreads malloc): The standard memory allocator that ships with most Linux distributions. ptmalloc aims to attain a balance between speed, portability, and space-efficiency. It supports multi-threaded applications by employing multiple mutexes to synchronize and protect access to its data structures. The downside of this approach is the possibility of lock contention on the mutexes. In order to mitigate this issue, ptmalloc creates additional regions of memory (arenas) whenever contention is detected. Allocated memory can never move between arenas. ptmalloc employs a per-thread cache for small allocations. This helps to further reduce lock contention by skipping access to the memory arenas when possible.

2) jemalloc (Jason Evans malloc) [25]: First appearing as an SMP-aware memory allocator for the FreeBSD operating system, jemalloc was later expanded and adapted for use as a general purpose memory allocator. When a thread requests memory from jemalloc for the first time, it is assigned a memory allocation arena. Arena assignments for multi-threaded applications follow a round-robin order. In order to further improve performance, this allocator also uses thread-specific caches, which allows some allocation operations to completely avoid arena synchronization. Lock-free radix trees track allocations across all arenas.

TABLE II: Machine Specifications
System | Machine A | Machine B | Machine C
CPUs/Model | 8×Opteron 8220 | 4×Xeon E7520 | 4×Xeon E7-4850 v4
CPU Frequency | 2.8GHz | 2.1GHz | 2.1GHz
Cores/Threads | 16/16 | 16/32 | 32/64
Last Level Cache | 2MB | 18MB | 40MB
4KB TLB Capacity | L1: 32×4KB, L2: 512×4KB | L1: 64×4KB, L2: 512×4KB | L1: 64×4KB, L2: 1536×4KB
2MB TLB Capacity | L1: 8×2MB | L1: 32×2MB | L1: 32×2MB, L2: 1536×2MB
NUMA Nodes | 8 | 4 | 4
NUMA Topology | Twisted Ladder | Fully Connected | Fully Connected
Relative NUMA Node Memory Latency | Local: 1.0, 1 hop: 1.2, 2 hops: 1.4, 3 hops: 1.6 | Local: 1.0, 1 hop: 1.1 | Local: 1.0, 1 hop: 2.1
Interconnect Bandwidth | 2GT/s | 4.8GT/s | 8GT/s
Memory Capacity | 16GB/node, 128GB total | 16GB/node, 64GB total | 768GB/node, 3TB total
Memory Clock | 800MHz | 1600MHz | 2400MHz
Operating System | Ubuntu 16.04 | Ubuntu 18.04 | CentOS 7.5
Linux Kernel | 4.4 | 4.15 | 3.10

jemalloc attempts to reduce memory fragmentation by packing allocations into contiguous blocks of memory and re-using the first available low address. This allocator maintains allocation arenas on a per-CPU basis and associates threads with their parent CPU's arena. We use jemalloc version 5.1.0 for our experiments.

3) tcmalloc (thread-caching malloc) [26]: Developed by Google and included as part of the gperftools library, its goal is to provide faster memory allocations in memory-intensive multi-threaded applications. tcmalloc handles small allocations using thread-private caches that do not require locking. Large allocations use a central heap that is organized into contiguous groups of pages called "spans". Each span stores multiple allocations of a particular size class. However, applications that use many different size classes may waste memory due to under-utilization of the memory spans. The central heap uses fine-grained locking on a per-span basis. As a result, two threads requesting memory from the central heap can do so concurrently, as long as their requests fall in different class categories. We use tcmalloc from gperftools release 2.7.

4) Hoard [27]: A standalone cross-platform allocator replacement designed specifically for multi-threaded applications, Hoard's main design goals are to provide memory efficiency, reduce allocation contention, and prevent false sharing. At its core, Hoard consists of a global heap (the "hoard") that is protected by a lock and accessible by all threads, as well as per-thread heaps that are mapped to each thread using a hash function. Hoard uses heuristics to detect temporal locality and fill cache lines with objects that were allocated by the same thread, thus reducing false sharing. We evaluate Hoard version 3.13 in our experiments.

5) tbbmalloc: The tbbmalloc allocator is included as part of the Intel Thread Building Blocks (TBB) library [28]. This allocator pursues better performance and scalability for multi-threaded applications, and generally considers increased memory consumption as an acceptable tradeoff. Allocations in tbbmalloc are supported by per-thread memory pools. If the allocating thread is the owner of the target memory pool, no locking is required. If the target pool belongs to a different thread, then the request is placed in a synchronized linked list, and the owner of the pool will allocate the object. We used version 2019 Update 4 of the TBB library for our experiments.

6) supermalloc [29]: This malloc replacement synchronizes concurrent memory allocation requests using hardware transactional memory (HTM) if available, and falls back to pthread mutexes if HTM is not available. supermalloc prefetches all necessary data while waiting to acquire a lock in order to minimize the amount of time spent in the critical section. It uses homogeneous chunks of objects for allocations smaller than 1MB, and supports larger objects using operating system primitives. Given a pointer to an object, its corresponding chunk is tracked using a lookup table. This lookup table is implemented as a large 512MB array, which takes advantage of the fact that most of its virtual memory will not be committed to physical memory by the OS. For our experiments, we use the latest publicly released source code, which was last updated in October 2017.

7) mcmalloc (many-core malloc) [30]: This allocator focuses on mitigating multi-threaded lock contention by reducing calls to kernel space, dynamically adjusting the memory pool structures, and using fine-grained locking. Similar to other allocators, mcmalloc uses a global and local (per-thread) memory pool layout. It monitors allocation requests, and dynamically splits its global memory pool into two categories: frequently used memory chunk sizes, and infrequently used memory chunk sizes. Dedicated homogeneous memory pools are created to support frequently used chunk sizes. Infrequent memory chunk sizes are handled using size-segregated memory pools. mcmalloc reduces system calls by batching multiple chunk allocations together. We use the latest mcmalloc source code, which was updated in March 2018.

8) Memory Allocator Microbenchmark: We now describe a multi-threaded microbenchmark that we use to gain insight on the relative performance of these memory allocators. Our goal is to answer the question: how well do these allocators scale up on a NUMA machine? The microbenchmark simulates a memory-intensive workload with multiple threads utilizing the allocator at the same time. Each thread completes 100 million memory operations, consisting of allocating memory and writing to it, or reading an existing item and then deallocating it. The distribution of allocation sizes is inversely proportional to the size class (smaller allocations are more frequent). We use two metrics to compare the allocators: execution time and memory allocation overhead. The execution time gives an idea of how fast an allocator is, as well as its efficiency when being used in a NUMA system by concurrent threads.
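A simplified sketch of such a microbenchmark loop is shown below; it is a reconstruction from the description above rather than the exact benchmark code. Because the allocators are drop-in malloc replacements, the same binary can be re-run under each allocator without recompilation, for example by interposing the allocator's proxy library with LD_PRELOAD.

```cpp
// Sketch of an allocator stress test in the spirit of Section III-A8:
// each thread mixes allocate-and-write with read-and-free operations,
// with small size classes chosen more frequently than large ones.
#include <cstdlib>
#include <cstring>
#include <random>
#include <thread>
#include <vector>

static void worker(std::size_t num_ops) {
    std::mt19937_64 rng(std::random_device{}());
    // Size classes from 64B to 8KB; the weight of each class is inversely
    // proportional to its size, so small allocations dominate.
    const std::vector<std::size_t> sizes = {64, 256, 1024, 8192};
    std::discrete_distribution<> pick({1.0 / 64, 1.0 / 256, 1.0 / 1024, 1.0 / 8192});

    std::vector<void*> live;
    for (std::size_t i = 0; i < num_ops; ++i) {
        if (live.empty() || (rng() & 1)) {           // allocate and write
            std::size_t sz = sizes[pick(rng)];
            void* p = std::malloc(sz);
            if (p == nullptr) continue;
            std::memset(p, 0xAB, sz);
            live.push_back(p);
        } else {                                      // read an item, then free it
            void* p = live.back();
            live.pop_back();
            volatile char c = *static_cast<char*>(p);
            (void)c;
            std::free(p);
        }
    }
    for (void* p : live) std::free(p);
}

int main(int argc, char** argv) {
    unsigned threads = (argc > 1) ? std::atoi(argv[1]) : 16;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker, 1000000);  // the paper uses 100 million ops/thread
    for (auto& th : pool) th.join();
    return 0;
}
```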

Fig. 2: Memory Allocator Microbenchmark - Machine A. (a) Multi-threaded Scalability: execution time (s) for 1-16 threads; (b) Memory Consumption Overhead (used/requested) for 1, 2, 4, 8, and 16 threads. Allocators: ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc, mcmalloc, supermalloc.
In Figure 2a, we vary the number of threads in order to see how each allocator behaves under contention. The results show that tcmalloc provides the fastest single-threaded performance, but immediately falls behind the competition once the number of threads is increased. Hoard and tbbmalloc show good scalability and outperform the other allocators by a considerable margin. In Figure 2b, we show each allocator's overhead. This is calculated by measuring the amount of memory allocated by the OS (as maximum resident set size), and dividing it by the amount of memory that was requested by the microbenchmark. This experiment shows considerably higher memory overhead for mcmalloc as the number of threads increases. Hoard and tbbmalloc are slightly more memory-hungry than the other allocators, which highlights jemalloc as a low memory overhead alternative with decent performance. We omit supermalloc and mcmalloc from subsequent experiments due to their poor performance in terms of scalability and memory overhead respectively.

B. Thread Placement and Scheduling

Defining an efficient thread placement strategy is a well-known and essential step toward obtaining better performance on NUMA systems. By default, the kernel thread scheduler is free to migrate threads created by the program between all available processors. The reasons for doing so include power efficiency and balancing the heat output of different processors. This behavior is not ideal for large data analytics applications and may result in significantly reduced query throughput. The thread migrations slow down the program due to cache invalidation, as well as a likelihood of moving threads away from their data. The combination of cache invalidation, loss of locality, and non-deterministic behavior of the OS scheduler can result in fluctuating runtimes (as shown in Figure 3 with 16 threads). Binding threads to processor cores can solve this issue by preventing the OS from migrating threads. However, deciding how to place the threads requires careful consideration of the topology and workload.

A thread placement strategy details the manner in which threads are assigned to processors. We explore two strategies for assigning thread affinity: Dense and Sparse. A Dense thread placement involves packing threads in as few processors as possible. The idea behind this approach is to minimize remote access distance and maximize resource sharing. In contrast, the Sparse strategy attempts to maximize memory bandwidth utilization by spreading the threads out among the processors. There are a variety of ways to implement and manage thread placement, depending on the level of access to the source code and the library used to provide multithreading. Applications built on OpenMP can use the OMP_PROC_BIND and OMP_PLACES environment variables in order to set a thread placement strategy.
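For code that manages its own threads directly, a comparable placement can be enforced with explicit CPU affinity masks. A minimal sketch using pthread_setaffinity_np on Linux is shown below; the core list encodes the chosen strategy.

```cpp
// Sketch: pinning worker threads to cores on Linux (compile with g++ -pthread).
// The core list encodes the placement strategy; on a machine with 4 nodes and
// 4 cores per node, Sparse could be {0, 4, 8, 12, 1, 5, ...} (spread across
// nodes) and Dense {0, 1, 2, 3, 4, ...} (fill one node before the next).
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void run_pinned(const std::vector<int>& cores, void (*task)(int thread_id)) {
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        workers.emplace_back(task, static_cast<int>(i));
        // Restrict the new thread to a single core so the OS cannot migrate it.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cores[i], &set);
        pthread_setaffinity_np(workers.back().native_handle(), sizeof(set), &set);
    }
    for (auto& w : workers) w.join();
}
```

With OpenMP, OMP_PROC_BIND=spread together with OMP_PLACES=cores approximates the Sparse strategy, while OMP_PROC_BIND=close approximates the Dense strategy.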

To demonstrate the impact of affinitization, we evaluate workload W1 from Table I. The workload involves building a hash table with key-value pairs taken from a moving cluster distribution. Figure 3 depicts 10 consecutive runs of this workload on Machine A. The runtime of the default configuration (no affinity) is expressed in relation to the affinitized configuration. The results highlight the inconsistency of the default OS behavior. In the best case, the affinitized configuration is several orders of magnitude faster, and the worst case runtime is still around 27% faster. In order to gain a better understanding of how each configuration affects the workload, we use the perf tool to measure several key metrics. The results, depicted in Table III, show that the operating system migrates the worker threads many times during the course of the workload. The Sparse affinity configuration prevents migration-induced cache invalidation, which in turn reduces cache misses. Furthermore, the affinitized configuration increases the ratio of local memory accesses.

Fig. 3: OS thread scheduler behavior vs thread affinity strategy - relative runtime of 10 consecutive runs of W1 (No Affinity vs Affinitized (Sparse)) - Machine A

TABLE III: Profiling thread placement - W1 on Machine A - Default (managed by OS) vs Modified (Sparse policy)
Performance Metric | Default | Modified | Percent Change
Thread Migrations | 33196 | 16 | −99.95%
Cache Misses | 1450M | 972M | −32.95%
Local Memory Accesses | 367M | 374M | +2.06%
Remote Memory Accesses | 159M | 108M | −31.95%
Local Access Ratio | 0.70 | 0.78 | +10.77%

In Figure 4 we evaluate the Sparse and Dense thread affinity strategies on workload W1, and vary the number of threads. We also vary the dataset (see Section IV-B) in order to ensure that the distribution of the data records is not the defining factor. The goal of this experiment is to determine whether threads benefit more from being on the same NUMA node, or from utilizing a greater number of the system's memory controllers. The Sparse policy achieves better performance when the workload is not using all available hardware threads. This is due to the threads having access to additional memory bandwidth, which plays a major role in memory-intensive workloads. When all hardware threads are occupied, the two policies perform almost identically. Henceforth, we use the Sparse configuration (when applicable) for all our experiments.

Fig. 4: Comparison of Sparse and Dense thread affinitization strategies - W1 - Machine A (runtime in billion CPU cycles for 2, 4, 8, and 16 threads on the Moving Cluster, Sequential, and Zipf datasets)

C. Memory Placement Policies

Memory pages are not always accessed from the same threads that allocated them. Memory placement policies are used to control the location of memory pages in relation to the NUMA topology. As a general rule of thumb, data should be on the same node as the thread that processes it, and sharing should be kept to a minimum. However, too much consolidation can lead to congestion of the interconnects and contention on the memory controllers. The numactl tool applies a memory placement policy to a process, which is then inherited by all its children (threads). We evaluate the following policies: First Touch, Interleave, Localalloc, and Preferred. We also use hardware counters to measure the ratio of local to total (local + remote) memory accesses.

Modern Linux systems employ a memory placement policy called First Touch. In First Touch, each memory page is allocated to the first node that performs a read or write operation on it. If the selected node does not have sufficient free memory, an adjacent node is used. This is the most popular memory placement policy and represents the default configuration for most Linux distributions. Interleave places memory pages on all NUMA nodes in a round-robin fashion. In some prior works, memory interleaving was used to spread a shared hash table across all available NUMA nodes [9], [31], [32]. In Localalloc, the memory pages are placed on the same NUMA node as the thread performing the allocation. The Preferred policy places all newly allocated memory pages on a single node that is selected by the user. This policy will fall back to using other nodes for allocation when the selected node has run out of free space and cannot fulfill the allocation.
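These policies can be applied without touching the application, for example with numactl --interleave=all ./app, or from inside the process through libnuma when small code changes are acceptable. A minimal sketch, assuming Linux with libnuma (link with -lnuma):

```cpp
// Sketch: selecting a memory placement policy with libnuma.
// numactl applies the same policies externally; this in-process variant is
// useful when only part of the application should be affected.
#include <numa.h>
#include <cstddef>

void use_interleaved_allocations() {
    if (numa_available() < 0) return;   // fall back to the default policy

    // Interleave all future heap allocations across every NUMA node,
    // mirroring "numactl --interleave=all".
    numa_set_interleave_mask(numa_all_nodes_ptr);
}

void* allocate_interleaved_buffer(std::size_t bytes) {
    // Alternatively, interleave just one explicit allocation.
    return numa_alloc_interleaved(bytes);   // release with numa_free(ptr, bytes)
}
```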
D. Operating System Configuration

In this section, we outline two key operating system mechanisms that affect NUMA applications: Virtual Memory Page Management (Transparent Hugepages), and Load Balancing Schedulers (AutoNUMA). These mechanisms are enabled out-of-the-box on most Linux distributions.

1) Virtual Memory Page Management: OS memory management works at the virtual page level. Pages represent chunks of memory, and their size determines the granularity at which memory is tracked and managed. Most Linux systems use a default memory page size of 4KB in order to minimize wasted space. The CPU's TLB caches can only hold a limited number of page entries. When the page size is larger, each TLB entry spans a greater memory area. Although the TLB capacity is even smaller for large entries, the total volume of cached memory space is increased. As a result, larger page sizes may reduce the occurrence of TLB misses. Transparent Hugepages (THP) is an abstraction layer that automates the process of creating large memory pages from smaller pages. THP is not to be confused with Hugepages, which depends on the application explicitly interfacing with it and is usually disabled by default. We use the global THP toggles on our Linux machines to configure its behavior.

2) Automatic NUMA Load Balancing: There have been several projects to develop NUMA-aware schedulers that facilitate automatic load balancing. Among these projects, Dino [2] and AsymSched [4] do not provide any source code, and Numad [33] is designed for multi-process load balancing. Carrefour [3] provides public source code, but requires an AMD CPU based on the K10 architecture (with instruction-based sampling), as well as a modified operating system kernel. Consequently, we opted to evaluate the AutoNUMA scheduler, which is open-source and supports all hardware architectures. AutoNUMA was initially developed by Red Hat and later on merged with the Linux kernel. It attempts to maximize data and thread co-location by migrating memory pages and threads. AutoNUMA has two key limitations: 1) workloads that utilize data sharing can be mishandled due to the unnecessary migration of memory pages between nodes, 2) it does not factor in the cost of migration or contention, and thus aims to improve locality at any cost. AutoNUMA has received continuous updates, and is considered to be one of the most well-rounded kernel-based NUMA schedulers. We use the numa_balancing kernel parameter to toggle the scheduler.
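Both mechanisms are exposed as pseudo-files and normally require superuser privileges to change: THP through /sys/kernel/mm/transparent_hugepage/enabled and AutoNUMA through /proc/sys/kernel/numa_balancing. A minimal sketch that reads and, where permitted, disables both:

```cpp
// Sketch: inspecting and toggling THP and AutoNUMA from a process.
// Writing these files requires root; reading them does not.
#include <fstream>
#include <iostream>
#include <string>

static std::string read_setting(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::getline(in, line);
    return line;
}

static bool write_setting(const std::string& path, const std::string& value) {
    std::ofstream out(path);
    out << value;
    return static_cast<bool>(out);
}

int main() {
    const std::string thp = "/sys/kernel/mm/transparent_hugepage/enabled";
    const std::string autonuma = "/proc/sys/kernel/numa_balancing";

    // The THP file shows e.g. "always madvise [never]"; numa_balancing is 0 or 1.
    std::cout << "THP: " << read_setting(thp) << "\n"
              << "AutoNUMA: " << read_setting(autonuma) << "\n";

    // Disable both, as recommended for the analytics workloads in this paper.
    if (!write_setting(thp, "never") || !write_setting(autonuma, "0"))
        std::cerr << "could not change settings (superuser privileges required)\n";
    return 0;
}
```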
IV. EVALUATION

In this section we describe our setup and evaluate the effectiveness of our strategies. In Section IV-A we outline the hardware/software specifications of our machines. Section IV-B describes the datasets, implementations, and systems used. We analyze the impact of the OS configuration in Section IV-C. We explore the effects of overriding the default system memory allocator in Section IV-D. In Section IV-E we evaluate these techniques on database engines running TPC-H queries. Finally, we summarize our findings in Section IV-F.

A. Experimental Setup

We run our experiments on three different machines based on different architectures. This is done to ensure that the applicability of our findings is not biased to a particular system's characteristics. The NUMA topologies of these machines are depicted in Figure 1 and their specifications are outlined in Table II.

We used LIKWID [34] to measure each system's relative memory access latencies, and the remainder of the specifications were obtained using product pages, spec sheets, and Linux system queries. We now outline some of the key hardware specifications for each machine. Machine A is an eight-socket AMD-based server with a total of 128GB of memory. As the only machine with eight NUMA nodes, Machine A provides us with an opportunity to study NUMA effects on a larger scale. The twisted ladder topology shown in Figure 1a is designed to minimize inter-node latency with three HyperTransport interconnect links per node. As a result, Machine A has three remote memory access latencies, depending on the number of hops between the source and the destination. Each node contains an AMD Opteron 8220 CPU running at 2.8GHz and 16GB of memory. Machine B is a quad-socket Intel server with four NUMA nodes and a total memory capacity of 64GB. The NUMA nodes are fully connected, and each node consists of an Intel Xeon E7520 CPU running at 1.87GHz with 16GB of memory. Lastly, Machine C contains four sockets populated with Intel Xeon E7-4850 v4 processors. Each processor constitutes a NUMA node with 768GB of memory, providing a total system memory capacity of 3TB. The NUMA nodes of this machine are fully connected.

Our experiments are coded in C++ and compiled using GCC 7.3.0 with the -O3 and -march=native flags. Likewise, all dynamic memory allocators and database systems are compiled from source. Machines B and C are owned and maintained by external parties and are based on different Linux distributions. The experiments are configured to utilize all available hardware threads on each machine.

TABLE IV: Experiment Parameters (system defaults: no thread placement, First Touch, ptmalloc, AutoNUMA on, THP on)
Parameter | Values
Experiment Workload | W1) Holistic Aggregation [14]; W2) Distributive Aggregation [14]; W3) Hash Join [15]; W4) Index Nested Loop Join using: 1) ART [16], 2) Masstree [17], 3) B+tree [18], 4) Skip List [19]; W5) TPC-H Queries (Q1 to Q22) [20]
Thread Placement Strategy | None (OS scheduler is free to migrate threads), Sparse, Dense
Memory Placement Policy | First Touch, Interleave, Localalloc, Preferred
Memory Allocator | ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc
Dataset Distribution | Moving Cluster (default for W1), Sequential (default for W3 and W4), Zipfian (default for W2), TPC-H (W5)
Database System (W5) | MonetDB [21], PostgreSQL [22], MySQL [23], DBMSx, Quickstep [24]
OS Configuration | AutoNUMA on/off, Transparent Hugepages (THP) on/off
Hardware System | Machine A, Machine B, Machine C

B. Datasets and Implementation Details

In this section, we outline the datasets and code used for the experiments. Unless otherwise noted, all workloads operate on datasets that are stored in memory-resident data structures, hence avoiding any I/O bottlenecks.

The aggregation workloads (W1 and W2) evaluate a typical hash-based aggregation query, based on a state-of-the-art concurrent hash table [35], which is implemented as a shared global hash table [14]. The datasets used for the aggregation workloads are based on three different data distributions: Moving Cluster (default), Sequential, and Zipfian. Each dataset consists of 100 million records with a group-by cardinality of one million. In the Moving Cluster dataset, the keys are chosen from a window that gradually slides. The Moving Cluster dataset provides a gradual shift in data locality that is similar to workloads encountered in streaming or spatial applications. In the Sequential dataset, we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. This dataset mimics transactional data where the key incrementally increases. In the Zipfian dataset, the distribution of the keys approximates Zipf's law. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The Zipfian distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population.
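A minimal sketch of this Zipfian key generation, following the description above (parameter names are illustrative, not the authors' generator):

```cpp
// Sketch: building a Zipfian key column as described in Section IV-B.
// c = group-by cardinality, e = Zipf exponent, n = number of records.
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint64_t> make_zipf_keys(uint64_t c, double e, uint64_t n, uint64_t seed = 42) {
    // Weight of rank k is 1 / k^e (k = 1..c), the classic Zipf popularity curve.
    std::vector<double> weights(c);
    for (uint64_t k = 1; k <= c; ++k)
        weights[k - 1] = 1.0 / std::pow(static_cast<double>(k), e);

    std::mt19937_64 rng(seed);
    std::discrete_distribution<uint64_t> zipf(weights.begin(), weights.end());

    // Draw n samples from the Zipf-distributed sequence to build n record keys.
    std::vector<uint64_t> keys(n);
    for (auto& key : keys) key = zipf(rng);
    return keys;
}
```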
The join workloads (W3 and W4) evaluate a typical join query involving two tables. W3 is a non-partitioning hash join based on the code and dataset from [15]. The dataset contains two tables sized at 16M and 256M tuples, and is designed to simulate a decision support system. W4 is an index nested loop join that uses the same dataset as W3. We evaluated several in-memory indexes for this workload: ART [16], Masstree [17], B+tree [18], and Skip List [19]. ART [16] is based on the concept of a radix tree. Masstree [17] is a hybrid index that uses a trie of B+trees to store keys. B+tree [18] is a cache-optimized in-memory B+tree. Skip List is a canonical implementation of a skip list [19].

We use the TPC-H workload (W5) to investigate how our strategies can benefit database systems. This entails some limitations, as databases are complex systems with less flexibility compared to microbenchmarks and codelets. Although there are many available database systems that are TPC-H compliant, we note that comparing an extensive variety of systems is beyond the scope of this paper. We evaluate W5 on the MonetDB [21] (version 11.33.3), PostgreSQL [22] (version 11.4), MySQL [23] (version 8.0.17), DBMSx, and Quickstep [24] (latest GitHub version as of October 2019) database systems. MonetDB is an open-source columnar store that uses memory-mapped files with demand paging and multiple worker threads for its query processing. PostgreSQL is a widely-used open-source row store that supports intra-query parallelism using multiple worker processes and a shared memory pool for communication. We configured PostgreSQL with a 42GB buffer pool. MySQL is an open-source row store that remains highly popular. DBMSx is a commercial hybrid row/column store with a parallel in-memory query execution engine. Quickstep is an open-source hybrid store with a focus on in-memory analytical workloads. W5 uses version 2.18 of the TPC-H dataset specifications.

Fig. 5: Effect of AutoNUMA load balancing and THP page merging on memory placement policies and allocators - W1. (a) AutoNUMA effect on execution time - Machine A (runtime in billion CPU cycles for First Touch, Interleave, and Localalloc); (b) Profiling effect of AutoNUMA on Local Access Ratio - Machine A; (c) Impact of THP on memory allocators - Machine A; (d) Combined effect of AutoNUMA and THP on different memory placement policies - variable machine.
We evaluate the impact of the OS configuration on each database system, using all 22 queries and a dataset scale factor of 20. Additionally, we use Queries 5 and 18 to show the impact of utilizing different memory allocators, as both queries involve a combination of joins and aggregations.

The experimental parameters are shown in Table IV. We use the maximum number of hardware threads supported by each machine. In W1-W4, we measure the average workload execution time using the timer from [15]. In W5, we use the built-in query timing features of each database system.

C. Operating System Configuration Experiments

In this section, we evaluate three key OS mechanisms that affect NUMA behavior: NUMA Load Balancing (AutoNUMA), Transparent Hugepages (THP), and the system's memory placement policy. The experiments demonstrate each parameter's effect on query performance. We also examine how these variables are affected by other experiment parameters, such as hardware architecture, and the interaction between THP and memory allocators.

1) AutoNUMA Load Balancing Experiments: In Figures 5a and 5b, we evaluate W1 and toggle the state of AutoNUMA Load Balancing between On (the system default) and Off. The results in Figure 5a show that AutoNUMA slows down the runtime for the First Touch, Interleave, and Localalloc memory placement policies. In most cases, AutoNUMA's overhead dominates any performance gained by migrating threads and memory pages. The best runtime is obtained by applying the Interleave policy and disabling AutoNUMA. If AutoNUMA is enabled, the best approach is to apply the Interleave policy, which may be useful for scenarios where superuser access is unavailable. We observed similar behavior for the other workloads and machines: AutoNUMA had a significantly detrimental effect on runtime, and the best overall approach is to use memory interleaving and disable AutoNUMA. The Local Access Ratio (LAR) shown in Figure 5b specifies the ratio of memory accesses that were satisfied with local memory [3] compared to all memory accesses. AutoNUMA attempts to increase LAR without considering other costs, such as moving threads and memory, or memory controller contention. Due to this, the First Touch policy with AutoNUMA enabled (system default) is 86% slower than Interleave without AutoNUMA, despite a higher LAR measurement. In summary, we obtain significant speedups using a modified OS configuration, and note that LAR is not necessarily an accurate predictor of performance on NUMA systems.

2) Transparent Hugepages Experiments: Next we evaluate the effect of the Transparent Hugepages (THP) configuration, which automatically merges groups of 4KB memory pages into 2MB memory pages. As shown in Figure 5c, THP's impact on the workload execution time ranges from detrimental in most cases to negligible in other cases. As THP alters the composition of the operating system's memory pages, support for THP within the memory allocators is the defining factor in whether it is detrimental to performance. tcmalloc, jemalloc, and tbbmalloc do not currently handle THP well. We hope that future versions of these memory allocators will rectify this issue out-of-the-box. Although most Linux distributions enable THP by default, our results indicate that it is better to disable THP for high-performance data analytics.

3) Hardware Architecture Experiments: Here we show how the performance of data analytics applications running on different machines with different hardware architectures is affected by the memory placement strategies. For all machines, the default configuration uses the First Touch memory placement, and both AutoNUMA and THP are enabled. The results depicted in Figure 5d show that Machine A is slower than Machine B when both machines are using the default configuration. However, using the Interleave memory placement policy and disabling the operating system switches allows Machine A to outperform Machine B by up to 15%. Machine A shows the most significant improvement from operating system and memory placement policy changes, and the workload runtime is reduced by up to 46%. The runtime for Machine C is reduced by up to 21%. The performance improvement on Machine B is around 7%, which is fairly modest compared to the other machines. Although Machines B and C have a similar inter-socket topology, the relative local and remote memory access latencies are much closer in Machine B (see Table II). Henceforth, we keep AutoNUMA and THP disabled for our experiments, unless otherwise noted.

Fig. 6: Comparison of memory allocators - variable workload, memory placement policy, and machine (runtime in billion CPU cycles for First Touch, Interleave, and Localalloc). Panels: (a) W1 - Machine A; (b) W1 - Machine B; (c) W1 - Machine C; (d) W2 - Machine A; (e) W2 - Machine B; (f) W2 - Machine C; (g) W3 - Machine A; (h) W3 - Machine B; (i) W3 - Machine C; (j) W1 - Machine A - Effect of dataset distribution (Moving Cluster, Sequential, Zipf).
D. Memory Allocator Experiments

In Section III-A8, we used a memory allocator microbenchmark to show that there are significant differences in both multi-threaded scalability and memory consumption overhead. In this section, we explore the performance impact of overriding the system default memory allocator on four in-memory data analytics workloads.

1) Hashtable-based Experimental Workloads: In Figure 6, we show our results for the holistic aggregation (W1), distributive aggregation (W2), and hash join (W3) workloads running on each of the machines. In addition to the memory allocators, we vary the memory placement policies for each workload. The results show significant runtime reductions on all three machines, particularly when using tbbmalloc in conjunction with the Interleave memory placement policy. The holistic aggregation workload (W1), shown in Figures 6a to 6c, extensively uses memory allocation during its runtime to store the tuples for each group and calculate their aggregate value. Utilizing tbbmalloc reduced the runtime of W1 by up to 62% on Machine A, 83% on Machine B, and 72% on Machine C, compared to the default allocator (ptmalloc). The results for the join query (W3), depicted in Figures 6g to 6i, also show significant improvements, with tbbmalloc reducing workload execution time by 70% on Machine A, 94% on Machine B, and 92% on Machine C. The distributive aggregation query (W2), shown in Figures 6d to 6f, speeds up by 44%, 27%, and 28% on Machines A, B, and C respectively. This speedup is almost entirely due to the Interleave memory placement policy. Although W2 is not allocation-heavy and does not gain much benefit from a faster memory allocator, it can still be accelerated using a more efficient memory placement policy.

2) Impact of Dataset Distribution: The performance of query workloads and memory allocators can be sensitive to the access patterns induced by the dataset distribution. The three tested datasets have the same number of records, but differ in the way the record keys are distributed (see Section IV-B for more information). In our previous experiments, we used the Moving Cluster dataset as the default dataset for W1. In Figure 6j, we vary the dataset distribution to investigate its impact on different memory allocators. The results show that tbbmalloc continues to produce the largest speedups on both the Zipf and Sequential datasets. We also observe this trend on Machines B and C, but omit those results due to space constraints.

3) Effect on In-memory Indexing: In W4, we investigate index nested loop join query processing with different in-memory indexes. The type of index used to accelerate the nested loop join workload (W4) plays a key role in determining its speed. We evaluate four in-memory indexes: ART [16], Masstree [17], B+tree [18], and Skip List [19]. As the index is pre-built, the workload is relatively light in terms of the number of memory allocations during the join phase, hence factors such as scan/lookup times, materialization, and locality play a greater role. For each index, we vary the memory allocator and memory placement policy and measure the join time. The results, depicted in Figures 7a to 7c, show that runtime can be significantly improved for most of the tested indexes. In Figure 7a, we show that ART's join time can be substantially improved using the jemalloc or tbbmalloc allocators. A key characteristic of ART is that it uses variable node sizes and a variety of compression techniques for its trie, thus requesting a greater variety of size classes from the memory allocator, compared to the other indexes.

Fig. 7: Index nested loop join workload (W4) - variable memory allocators and memory placements - Machine A (runtime in billion CPU cycles for First Touch, Interleave, and Localalloc). Panels: (a) ART Index - Join Times; (b) Masstree Index - Join Times; (c) B+tree Index - Join Times; (d) Skip List Index - Join Times; (e) Index build and join times (best configuration).
7c Masstree and B+tree show a notable improvement with the hardware. MySQL’s query latency is reduced by up to 49%
Hoard allocator. Both indexes rely on grouping many keys per with an average reduction of 12%. Lastly, we observe that
node, which is favorable for Hoard’s straightforward global DBMSx query latency improved by up to 43% with an average
heap approach. Skip List breaks the trend as the only index that of 21%. Lastly, Quickstep query latency speeds up by up to
runs fastest with ptmalloc. Finally, we summarize the results 40% and an average of 7%. All five database systems obtained
in Figure 7e, which depicts each index’s build and join times speedups from modifying the default OS configuration.
using their fastest configuration. The results show that we were Next we investigate the effect of memory allocator overrid-
able to speed up the two fastest indexes (ART and B+tree) ing on MonetDB. To do so, we select queries 5 and 18 due to
despite their inherent lack of NUMA-awareness. their usage of both joins and aggregation. The results, shown
E. Database Engine Experiments

In this section, we evaluate the TPC-H workload (W5) on five database systems: MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep. Investigating NUMA strategies on database systems is more challenging than on standalone in-memory microbenchmark workloads, as there is considerably more complexity involved in storing and loading the data, and great care must be taken to ensure that disk I/O and caching do not skew the results. To ensure fair and consistent results, we clear the page cache through the /proc/sys/vm/drop_caches interface before running each query, disregard the first (cold) run, and measure the mean runtime over five additional runs. In a similar vein to our previous experiments, we evaluate the impact of the OS configuration, memory placement policies, and memory allocators. Due to an issue with PostgreSQL producing severely sub-optimal plans for queries 17 and 22, we evaluate modified versions of these two queries that use joins instead of nested queries. All other database systems run the original versions of queries 17 and 22. We used the following parameters to speed up W5: First Touch memory placement, AutoNUMA disabled, THP disabled (for all except DBMSx), and the tbbmalloc memory allocator.
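As an illustration of this setup, the following minimal C++ sketch disables AutoNUMA and THP and drops the page cache before each measured run. It requires superuser privileges, uses the standard Linux procfs/sysfs interfaces, and leaves the query launch as a placeholder; error handling is intentionally minimal.

// Minimal sketch of the per-query measurement setup described above.
#include <fstream>
#include <unistd.h>

static bool write_setting(const char *path, const char *value) {
    std::ofstream f(path);
    f << value;
    return static_cast<bool>(f);
}

int main() {
    // One-time OS configuration (the previous values should be restored afterwards).
    write_setting("/proc/sys/kernel/numa_balancing", "0");                  // AutoNUMA off
    write_setting("/sys/kernel/mm/transparent_hugepage/enabled", "never");  // THP off

    const int runs = 6;  // the first (cold) run is discarded; the mean is taken over the rest
    for (int run = 0; run < runs; ++run) {
        sync();                                          // flush dirty pages to disk first
        write_setting("/proc/sys/vm/drop_caches", "3");  // drop page cache, dentries, and inodes
        // run_query();  // placeholder: launch and time the database query under test
    }
    return 0;
}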
In Figure 8, we present the speedups obtained across all 22 TPC-H queries for each of the database systems. The results show that MonetDB's query latencies improved by up to 43%, with an average improvement of 14.5%. In comparison, the gains for PostgreSQL are less consistent: query latency improved by up to 27.6%, but the average improvement is 3%, and seven queries take slightly longer to complete. We believe these variances are due to PostgreSQL's rigid multi-process query processing approach, which sometimes opts to use only one worker process and thus fails to fully utilize the hardware. MySQL's query latency is reduced by up to 49%, with an average reduction of 12%. We also observe that DBMSx query latency improved by up to 43%, with an average of 21%. Lastly, Quickstep query latency speeds up by up to 40%, with an average of 7%. All five database systems obtained speedups from modifying the default OS configuration.

Next, we investigate the effect of memory allocator overriding on MonetDB. To do so, we select queries 5 and 18 due to their usage of both joins and aggregation. The results, shown in Figure 9a, indicate that tbbmalloc can provide an average query latency reduction of up to 11% for Query 5 and 20% for Query 18, compared to ptmalloc. As with the other memory allocator experiments, we measure the average of five runs.
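The allocator override itself requires no changes to the database. The following minimal C++ sketch is a launcher that preloads a replacement allocator before starting the target process; the library path is illustrative and installation-specific, and the same effect can be achieved directly on the shell by setting LD_PRELOAD before starting the server.

// Minimal sketch: overriding the default allocator without recompiling the target.
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }
    // e.g., libtbbmalloc_proxy.so (TBB) or libjemalloc.so (jemalloc); path is illustrative.
    setenv("LD_PRELOAD", "/usr/lib/libtbbmalloc_proxy.so", 1);
    execvp(argv[1], &argv[1]);   // replaces this process; malloc/free now resolve to the preloaded allocator
    std::perror("execvp");
    return 1;
}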

[Figure 8 plot omitted in this rendering: query latency reduction (%) on the y-axis for MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep across TPC-H queries Q1-Q22 on the x-axis; per-query reductions range from roughly -4.5% to 49.2%.]
Fig. 8: 22 TPC-H queries (W5) scale factor 20 - Query latency reduction - Variable database systems - Machine A

[Figure 9 plots omitted in this rendering: query latency in seconds for each memory allocator, panels (a) TPC-H Q5 and (b) TPC-H Q18.]
Fig. 9: Effect of memory allocator on TPC-H query latency - MonetDB - Machine A

Fig. 10: Application-agnostic Decision Flowchart. [Diagram omitted in this rendering. The flowchart covers: thread placement (if placement is not already managed, affinitize threads, adopting a Sparse strategy when bound by memory bandwidth and a Dense strategy otherwise); memory placement (if none is defined, optimize it, e.g., with Interleave); the memory allocator (for allocation-heavy workloads, evaluate and override the allocator, preloading jemalloc or tbbmalloc depending on whether free memory is constrained); and, given superuser access, configuring the OS to disable AutoNUMA and THP.]

F. Summary

The strategies explored in this paper, when carefully applied, can significantly speed up query processing workloads without the need for source code modification. The effectiveness and applicability of these strategies to a workload depend on several factors. Figure 10 shows a strategic plan for practitioners. The flowchart outlines a systematic guide to improving performance on NUMA systems, along with some general recommendations. We base these recommendations on our extensive experimental evaluation using multiple machine architectures and workloads.

Starting with thread management, we showed that thread affinitization can be critical on NUMA systems and, more importantly, that a Sparse placement approach can maximize performance in situations that are memory-bandwidth-bound. We then showed that the default OS configuration can have a significant detrimental effect on query performance. The overhead of AutoNUMA and THP was demonstrated to be too costly for high-performance data analytics workloads. Although superuser privileges are required to modify AutoNUMA and THP, we observed that optimizing the memory placement policy (such as using Interleave) can mostly mitigate their negative impact. We also investigated dynamic memory allocators using a microbenchmark. The microbenchmark results showed that there are considerable differences between the allocators, both in terms of scalability and efficiency. In our evaluation, we demonstrated that these differences translate into real gains in analytical query processing workloads, although the performance gains depend on the way the workloads allocate memory. For example, allocation-heavy workloads, such as the hash join (W3), benefited the most, whereas the index nested loop join (W4) exhibited smaller gains due to the prebuilt index. Although we have shown that tbbmalloc frequently outperformed its competitors on different machines and workloads, we recommend experimenting with new and updated memory allocators before selecting a solution.
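As a concrete illustration of the thread placement step, the following C++ sketch pins worker threads under either a Dense or a Sparse strategy using Linux CPU affinity. The socket and core counts, and the assumption that logical CPUs are numbered socket-major, are illustrative; a real implementation would query the hardware topology (e.g., with hwloc or libnuma).

// Minimal sketch of Dense vs. Sparse thread placement (compile with: g++ -pthread placement.cpp).
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

constexpr int kSockets = 4;          // assumed topology
constexpr int kCoresPerSocket = 16;  // assumed topology

// Dense: fill one socket completely before spilling onto the next.
int dense_cpu(int tid) { return tid; }

// Sparse: round-robin threads across sockets to spread memory bandwidth demand.
int sparse_cpu(int tid) {
    return (tid % kSockets) * kCoresPerSocket + (tid / kSockets);
}

void pin(std::thread &t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    const int threads = 8;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([] { /* query processing work goes here */ });
        pin(pool.back(), sparse_cpu(i));   // use dense_cpu(i) when the workload is not bandwidth-bound
    }
    for (auto &t : pool) t.join();
    return 0;
}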
V. RELATED WORK

The rising demand for high performance parallel computing has motivated many works on leveraging NUMA architectures. We now outline existing research that is relevant to our work.

In [36], Kiefer et al. evaluated the performance impact of NUMA effects on multiple independent instances of the MySQL database system. Popov et al. [37] explored the combined effect of thread and page placement using supercomputing benchmarks running on NUMA systems. They observed that co-optimizing thread and memory page placement can provide significant speedups. Durner et al. [38] explored the performance impact of dynamic memory allocators on a database system running TPC-DS. The authors obtained significant speedups utilizing jemalloc and tbbmalloc, which agrees with our findings. In this paper, we evaluate a broader and newer range of allocators, as well as additional NUMA parameters, indexes, datasets, databases, and workloads.

Some prior work has pursued automatic load balancing approaches that can improve NUMA system performance in an application-agnostic manner. These approaches generally focus on improving performance by altering the process and/or memory placement. Some examples include Dino [2], Carrefour [3], AsymSched [4], Numad [33], and AutoNUMA [33]. These schedulers have been shown to improve performance in some cases, particularly on systems running multiple independent processes. However, some researchers have claimed that these schedulers do not provide much benefit for multi-threaded query processing applications [6], [7].

A different approach involves either extensively modifying or completely replacing the OS, with the goal of providing a custom-tailored environment for the application. Some researchers have pursued this direction to provide an OS that is more suitable for large database applications [39]–[41]. Custom operating systems aim to reduce the burden on developers, but their adoption has been limited. In the past, researchers in the systems community proposed a few new OSes for multicore architectures, including Corey [42], Barrelfish [43], and fos [44]. Although none were widely adopted by the industry, we believe these efforts underscore the need to investigate the impact of system and architectural aspects on query performance.

Some researchers have favored an application-oriented approach that fine-tunes query processing algorithms to the hardware. Wang et al. [8] proposed an aggregation algorithm for NUMA systems, based on radix partitioning. The authors also proposed a load balancing algorithm that focuses on inter-socket task stealing, and prohibits task stealing until a socket's local tasks have been completed. Leis et al. [9] presented a NUMA-aware parallel scheduling algorithm for hash joins, which uses dynamic task stealing in order to deal with dataset skew. Schuh et al. [7] conducted an in-depth comparison of thirteen main memory join algorithms on a NUMA system. Our work is orthogonal to these approaches, and they can benefit from applying the application-agnostic strategies that we have suggested.

VI. CONCLUSION

In this work, we have outlined and investigated several application-agnostic strategies to speed up query processing on NUMA machines. Our experiments on five analytics workloads have shown that it is possible to obtain significant speedups by utilizing these strategies. We also demonstrated that current operating system default configurations are generally sub-optimal for in-memory data analytics. Our results, surprisingly, indicate that many elements of the default OS environment, such as AutoNUMA, Transparent Hugepages, the default memory allocator (e.g., ptmalloc), and the OS thread scheduler, should be disabled or customized for high-performance analytical query processing, regardless of the hardware generation. We have also demonstrated that memory allocator performance on NUMA systems can be a major bottleneck and that this under-appreciated topic is ripe for investigation. We obtained large speedups for our query processing workloads by overriding the default dynamic memory allocator with alternatives such as tbbmalloc.

As our approach does not target a specific NUMA topology, we have shown that our findings can be applied to systems with different architectures. As hardware architectures continue to advance towards greater parallelism and greater levels of memory access partitioning, we hope our results and decision flowchart can help practitioners to accelerate data analytics.

REFERENCES

[1] A. Kemper and T. Neumann, “HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots,” in ICDE, 2011, pp. 195–206.
[2] S. Blagodurov, A. Fedorova, S. Zhuravlev, and A. Kamali, “A case for NUMA-aware contention management on multicore systems,” in PACT, 2010, pp. 557–558.
[3] M. Dashti et al., “Traffic management: a holistic approach to memory placement on NUMA systems,” SIGPLAN Notices, vol. 48, no. 4, 2013.
[4] B. Lepers, V. Quema, and A. Fedorova, “Thread and memory placement on NUMA systems: Asymmetry matters,” in USENIX ATC, 2015.
[5] J. Corbet, “AutoNUMA: the other approach to NUMA scheduling,” LWN.net, 2012.
[6] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki, “Scaling up concurrent main-memory column-store scans: towards adaptive NUMA-aware data and task placement,” VLDBJ, vol. 8, no. 12, 2015.
[7] S. Schuh, X. Chen, and J. Dittrich, “An experimental comparison of thirteen relational equi-joins in main memory,” in SIGMOD, 2016.
[8] L. Wang, M. Zhou, Z. Zhang, M.-C. Shan, and A. Zhou, “NUMA-aware scalable and efficient in-memory aggregation on large domains,” TKDE, vol. 27, no. 4, 2015.
[9] V. Leis, P. Boncz, A. Kemper, and T. Neumann, “Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age,” in SIGMOD, 2014, pp. 743–754.
[10] T. Kissinger et al., “ERIS: A NUMA-aware in-memory storage engine for analytical workloads,” VLDB Endow., vol. 7, no. 14, pp. 1–12, 2014.
[11] D. Porobic, E. Liarou, P. Tozun, and A. Ailamaki, “ATraPos: Adaptive transaction processing on hardware islands,” in ICDE, 2014, pp. 688–699.
[12] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki, “Adaptive NUMA-aware data placement and task scheduling for analytical workloads in main-memory column-stores,” VLDB, pp. 37–48, 2016.
[13] K. Asanovic et al., “The landscape of parallel computing research: A view from Berkeley,” Technical Report UCB/EECS-2006-183, University of California, Berkeley, Tech. Rep., 2006.
[14] P. Memarzia, S. Ray, and V. C. Bhavsar, “A six-dimensional analysis of in-memory aggregation,” in EDBT, 2019, pp. 289–300.
[15] S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main memory hash join algorithms for multi-core CPUs,” in SIGMOD, 2011.
[16] V. Leis, F. Scheibner, A. Kemper, and T. Neumann, “The ART of practical synchronization,” in DaMoN. ACM, 2016, pp. 1–8.
[17] Y. Mao, E. Kohler, and R. T. Morris, “Cache craftiness for fast multicore key-value storage,” in EuroSys. ACM, 2012, pp. 183–196.
[18] T. Bingmann, “STX B+ Tree,” panthema.net/2007/stx-btree, 2019.
[19] S. Vokes, “skiplist,” github.com/silentbicycle/skiplist, 2016.
[20] “TPC-H benchmark specification 2.18.0 rc2,” 2019.
[21] MonetDB B.V., “MonetDB,” monetdb.org, 2018.
[22] “PostgreSQL,” postgresql.org, 2019.
[23] Oracle Corporation, “MySQL,” mysql.com, 2019.
[24] J. M. Patel et al., “Quickstep: A data platform based on the scaling-up approach,” VLDBJ, vol. 11, no. 6, pp. 663–676, 2018.
[25] J. Evans, “A scalable concurrent malloc(3) implementation for FreeBSD,” in BSDCan, 2006.
[26] S. Ghemawat and P. Menage, “TCMalloc: Thread-caching malloc,” github.com/gperftools/, 2015.
[27] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson, “Hoard: A scalable memory allocator for multithreaded applications,” SIGARCH, vol. 28, no. 5, pp. 117–128, 2000.
[28] W. Kim and M. Voss, “Multicore desktop programming with Intel Threading Building Blocks,” IEEE Software, vol. 28, no. 1, pp. 23–31, Jan 2011.
[29] B. C. Kuszmaul, “SuperMalloc: a super fast multithreaded malloc for 64-bit machines,” in SIGPLAN Notices, vol. 50. ACM, 2015, pp. 41–55.
[30] A. Umayabara and H. Yamana, “MCMalloc: A scalable memory allocator for multithreaded applications on a many-core shared-memory machine,” in IEEE Big Data, 2017, pp. 4846–4848.
[31] C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu, “Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware,” in ICDE, 2013, pp. 362–373.
[32] H. Lang, V. Leis, M.-C. Albutiu, T. Neumann, and A. Kemper, “Massively parallel NUMA-aware hash joins,” in IMDM, 2013.
[33] Red Hat Inc., “Red Hat Enterprise Linux Product Documentation,” 2018.
[34] G. Hager, G. Wellein, and J. Treibig, “LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments,” in ICPP. IEEE, 2010, pp. 207–216.
[35] X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman, “Algorithmic improvements for fast concurrent cuckoo hashing,” in EuroSys, 2014, pp. 1–14.
[36] T. Kiefer, B. Schlegel, and W. Lehner, “Experimental evaluation of NUMA effects on database management systems,” BTW, 2013.
[37] M. Popov, A. Jimborean, and D. Black-Schaffer, “Efficient thread/page/parallelism autotuning for NUMA systems,” in ICS. ACM, 2019, pp. 342–353.
[38] D. Durner, V. Leis, and T. Neumann, “On the impact of memory allocation on high-performance query processing,” in DaMoN. ACM, 2019, pp. 21:1–21:3.
[39] J. Giceva, “Operating systems support for data management on modern hardware,” sites.computer.org/debull/A19mar/p36.pdf, 2019.
[40] J. Giceva, A. Schüpbach, G. Alonso, and T. Roscoe, “Towards database/operating system co-design,” in SFMA, vol. 12, 2012.
[41] J. Giceva, G. Zellweger, G. Alonso, and T. Roscoe, “Customized OS support for data-processing,” in DaMoN, 2016, pp. 1–6.
[42] S. Boyd-Wickizer et al., “Corey: An operating system for many cores,” in USENIX OSDI, 2008, pp. 43–57.
[43] A. Baumann et al., “The multikernel: A new OS architecture for scalable multicore systems,” in SIGOPS SOSP, 2009, pp. 29–44.
[44] D. Wentzlaff and A. Agarwal, “Factored operating systems (fos): The case for a scalable operating system for multicores,” SIGOPS Oper. Syst. Rev., vol. 43, no. 2, pp. 76–85, 2009.

