MONTRES-NVM: An External Sorting Algorithm For Hybrid Memory
Abstract—DRAM technology is approaching its scaling limit, and the use of emerging NVM is seen as one possible solution to this issue. As NVM technologies are not yet mature enough and do not outperform DRAM, several studies expect the use of hybrid main memories containing both DRAM and PCM NVM. Redesigning applications for such systems is mandatory, as PCM does not have the same performance model as DRAM. In this context, we designed a hybrid memory-aware sorting algorithm called MONTRES-NVM. Since an NVM-based hybrid memory presents a performance gap between DRAM and PCM, we believe that the sorting algorithm falls in the external sorting category. We therefore extended our previously designed flash-based external sorting algorithm, MONTRES, for a hybrid memory by taking advantage of byte addressability and of the performance asymmetry between reads and writes. MONTRES-NVM enhances the performance of the merge sort algorithm on PCM by more than 60%, the merge sort on DRAM by 3-40%, and MONTRES (on a hybrid memory) by 3-33%, depending on the proportion of already sorted data in the dataset.

Index Terms—Sorting algorithm, Hybrid memory, Non Volatile Memory, Phase Change Memory.

I. INTRODUCTION

Nowadays, the scaling of DRAM memory is approaching its limit [3], and increasing its density imposes an exponential cost penalty. Emerging memory technologies, such as Phase Change Memory (PCM), may be part of the solution thanks to the high density they can provide [2]. PCM is a byte-addressable memory. It has small-sized cells and a good endurance compared to NAND flash memory. PCM may change our view of the memory hierarchy. It can be integrated either horizontally, where it is considered an extension of an existing memory level, or vertically, where it is interleaved between two existing memory levels [2]. However, compared to DRAM, PCM has a higher access latency, especially for write operations, and thus a higher energy cost [4].

The volume of data is growing exponentially and is expected to reach 185 zettabytes in 2025 [5]. To take advantage of this huge amount of data, for instance in real-time analytics applications, fast processing becomes a necessity. Sorting data is one of the most important computational problems for which algorithms have been developed [9]. For instance, the CPU spends 60% of its time sorting data [11], and most operations in a Database Management System (DBMS) use these algorithms.

In the literature, many algorithms have been proposed to sort data in the main memory (DRAM), such as quick sort, radix sort or merge sort. When dealing with large data volumes (i.e., whose size is larger than the allocated main memory size), external sorting algorithms have to be used. They are composed of two phases: a run generation phase and a run merge phase [8]. The run generation phase splits data into chunks that fit into the main memory, sorts them, and writes the sorted chunks into intermediate files (called runs). Then, runs are merged and written into the final sorted file. The performance of these algorithms depends heavily on the way they manage I/O operations. External sorting algorithms were designed to optimize I/O requests on traditional magnetic drives [8]. They were then optimized to benefit from flash memory (SSD) performance [10], [7], [6].

In our work, we considered a hybrid main memory with a large proportion of PCM compared to DRAM, as in several state-of-the-art studies [2]. Since PCM has higher latencies than DRAM and asymmetric read/write performance [4][2], we believe that sorting in a PCM/DRAM main memory has some similarity with external sorting.

In this paper, we present a new hybrid memory-aware sorting algorithm named MONTRES-NVM. This sorting algorithm is based on a previously developed flash memory-based external sorting algorithm named MONTRES (Merge ON-The-Run External Sorting) [10]. The main idea of MONTRES-NVM is to take advantage of the small DRAM to accelerate the sorting process while minimizing the number of write operations performed on the PCM. To do so, MONTRES-NVM is composed of three main phases: (1) a first read operation is performed on the data to detect already sorted sub-sequences, which are then indexed; (2) unsorted sub-sequences are divided into blocks that fit into the DRAM workspace and are sorted in DRAM, using MONTRES' merge-on-the-fly mechanism to store parts of the sorted data in PCM; (3) finally, all sorted data sub-parts are merged using a dichotomy technique and a heap data structure.

MONTRES-NVM has been compared with both in-memory and external sorting algorithms. On random data, when compared with the (in-memory) merge sort algorithm, MONTRES-NVM decreases the sorting time on PCM by more than 60%, and on DRAM by about 14%. It outperformed MONTRES by about 6%. With partially sorted data, MONTRES-NVM

Authorized licensed use limited to: Ingrid Nurtanio. Downloaded on October 01,2020 at 02:36:32 UTC from IEEE Xplore. Restrictions apply.
External sorting algorithms are tailored for data volumes that are larger than the main memory. Data are initially stored in a storage device, and the sorting algorithm uses an allocated amount of DRAM to work. As the performance gap between DRAM and storage devices is significant, external sorting algorithms try to minimize the number of I/O operations performed. External sorting algorithms were first designed for Hard Disk Drives (HDD). To reduce the cost of I/O operations, they relied massively on sequential I/O operations. They were then optimized for Solid State Drives (SSD), mainly by relying on random reads to minimize the number of I/O operations and, at the same time, by reducing the amount of write operations to preserve SSD lifetime.

An external sorting algorithm is composed of two phases: (1) a run generation phase and (2) a run merge phase. In the first phase, a chunk of data is loaded from the storage device into the DRAM. It is then sorted with an in-memory sorting algorithm. The sorted chunk is written back to the storage device in an intermediate file called a run (see Fig. 1). In the second phase, the sorting algorithm merges the runs by iteratively loading sub-parts into the DRAM until the runs are entirely merged (see Fig. 2).

Fig. 2: Illustration of the run merge phase [10].

III. MONTRES-NVM DESIGN

In this section, we first discuss the motivation behind this work, then we describe MONTRES-NVM, the hybrid memory-aware sorting algorithm we have designed.

A. Motivation: sorting in PCM vs sorting in DRAM

The objective of this section is to show the impact of the PCM performance properties on the execution of some traditional sorting algorithms compared to their execution on DRAM. We highlight the need for revising state-of-the-art sorting algorithms for PCM-based memories.

We have evaluated the execution of the following popular in-memory sorting algorithms: merge sort, quick sort, heap sort, counting sort and radix sort. For each algorithm, we measured the execution time for two memory configurations: the first is a full DRAM main memory, and the second is a full PCM main memory, emulated using PCMSim [1].

Fig. 3 presents the execution times of in-memory sorting algorithms on the two memory configurations. Several observations can be drawn from this figure: (1) using PCM slows down the sorting algorithms, which is quite intuitive as PCM presents higher access latencies than DRAM. (2) The performance degradation due to PCM is highly variable from one algorithm to another. Indeed, the performance degradation for counting sort, radix sort, merge sort, quick sort and heap sort is 45%, 194%, 27%, 39% and 61% respectively. This is mainly due to the proportion of write
operations performed by these algorithms. (3) Finally, for a given data set to sort, the ranking of the sorting algorithms changes from one memory to the other. For instance, radix sort is better than merge sort on DRAM, but on PCM radix sort is more than two times slower.

Those results motivated us to revise sorting algorithms for NVM, especially for hybrid memories that consist of PCM and DRAM. In such a configuration, we assume a large PCM volume (as PCM presents a higher density) and a small DRAM. Since the performance gap between PCM and DRAM is high for write operations, we believe that external sorting algorithms are better suited for such hybrid memory configurations. In the case of external sorting, the data to sort are stored in PCM and chunks are successively brought into DRAM for sorting. The final sorted data are written back to PCM.

B. The design of MONTRES-NVM

The more the input data contains already sorted values, the longer the sorted sequences are, and the lower the amount of data to sort in the next phases. The remaining unsorted data are processed during the run generation phase. This first phase did not exist in MONTRES.

Example 1. In Fig. 4, we give an example of the sorted data detection phase. In this example, the input data contain 16 values stored in PCM, and the number of already sorted sequences to extract is set to L = 2. MONTRES-NVM therefore finds the L = 2 longest already sorted sequences, each containing 4 sorted values. The first already sorted sequence is located between positions 4 and 7 in the input data. The second one is located between positions 10 and 13. Finally, these sequences are inserted into the primary index.

[Fig. 4 labels: PCM; Primary index (already sorted data index); DRAM]
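The sorted data detection phase of Example 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_longest_sorted_runs`, the single-scan strategy, the 0-based positions and the input values are all assumptions made for illustration (the values of Fig. 4 are not recoverable from the text). The input was chosen so that, as in Example 1, the two longest sorted sequences span positions 4 to 7 and 10 to 13.

```python
def find_longest_sorted_runs(data, L):
    """Return the L longest non-decreasing runs as (start, end) positions.

    Positions are 0-based and inclusive (an assumption of this sketch).
    """
    runs = []
    start = 0
    for i in range(1, len(data) + 1):
        # A run ends at the end of the input or where the order breaks.
        if i == len(data) or data[i] < data[i - 1]:
            runs.append((start, i - 1))
            start = i
    # Longest runs first; Python's stable sort keeps earlier runs on ties.
    longest = sorted(runs, key=lambda r: r[0] - r[1])[:L]
    # Return the selected runs in input order, ready to be indexed.
    return sorted(longest)


# Hypothetical 16-value input, read once from start to end.
values = [9, 3, 8, 50, 10, 20, 30, 40, 7, 60, 11, 22, 33, 44, 6, 1]
primary_index = find_longest_sorted_runs(values, L=2)
print(primary_index)  # [(4, 7), (10, 13)]
```

Only the positions of the sequences are recorded, so the sorted values themselves stay where they are in PCM and no write is needed for them in this phase.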
data into chunks of m values, m being the number of values able to fit in the available DRAM space dedicated to the sorting process. These chunks are indexed using a secondary index. Each entry of the index includes the minimum value of the chunk and the positions of the remaining data belonging to that chunk (starting and ending positions).

The run generation phase starts by retrieving from the secondary index the chunk containing the lowest value. This chunk is loaded into the DRAM space and sorted using the in-memory merge sort algorithm. Merge sort was used since it gives a good trade-off between performance and memory footprint. Once the chunk is sorted in DRAM, the sorted values greater than the next minimum value in the secondary index are written into an intermediate run in PCM. The lower values are merged on-the-fly with previously sorted data (written into a previously generated run, if any) and already sorted data (see [10] for more details). This phase generates at most n/m runs stored in PCM.

While MONTRES loads data block by block during this phase, MONTRES-NVM uses the byte addressability property to load chunks of data of different sizes according to the secondary index.

Example 2. In Fig. 6, MONTRES-NVM gathers unsorted data to create chunks containing m = 4 values. These chunks are then inserted into the secondary index and sorted according to their minimum value. Chunks are processed successively, starting from the one having the lowest value. In Fig. 6, the first chunk (chunk 0), which contains the lowest value in the secondary index, is loaded from PCM into DRAM and then sorted. Values in the sorted chunk that are greater than the next minimum value in the secondary index are written into the intermediate run in PCM. The remaining values are merged on-the-fly with already sorted data (see Fig. 7). The on-the-fly merge presented in Fig. 7 considers three inputs: the remaining values of the sorted chunk in DRAM (6 and 9) and two already sorted sequences belonging to the input data stored in PCM. In this case, only one intermediate run has been created, and its values are all greater than the next minimum value, 14, which is why the intermediate run does not take part in the on-the-fly merge. Values merged on-the-fly are written directly into the final sorted data space in PCM.

Fig. 6: Sorting chunks in the DRAM

[Fig. 7 labels: Chunk; Next min; Min values in chunk; DRAM; Input data; Already sorted data]

3) Run merge phase: The run merge phase of MONTRES-NVM

algorithm retrieves from the obtained run all minimum values lower than the next minimum value in the min-heap. These minimum values are written directly into the final sorted data. Therefore, this mechanism allows MONTRES-NVM to write several values to the output array in every iteration, whereas MONTRES writes only one.

Example 3. Fig. 8 illustrates the merge of the two generated runs with the L = 2 already sorted data sequences. The merge mechanism builds a min-heap with the minimum values from the generated runs and the already sorted data. Once the min-heap is created, MONTRES-NVM starts the merge process by retrieving the first minimum value from the min-heap (min = 14, located in intermediate run 1). Then, all the values in intermediate run 1 that are lower than the next minimum value in the min-heap (17), namely 14 and 15 in this case, are retrieved and written into the final sorted data structure. Then, MONTRES-NVM updates the heap by inserting the new minimum value of intermediate run 1, 40 in this case.
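The two mechanisms illustrated by Examples 2 and 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: `split_sorted_chunk` and `merge_batches` are hypothetical names, plain lists stand in for data held in PCM and DRAM, and the use of binary search (`bisect`) to find how many values of a run can be emitted at once is only our reading of the "dichotomy technique" mentioned in the introduction. Values other than 6, 9, 14, 15, 17 and 40 are invented for the example.

```python
import bisect
import heapq


def split_sorted_chunk(sorted_chunk, next_min):
    """Split a sorted chunk around the next minimum of the secondary index.

    Values greater than next_min would go to an intermediate run; the
    others can be merged on-the-fly into the final output.
    """
    cut = bisect.bisect_right(sorted_chunk, next_min)
    return sorted_chunk[:cut], sorted_chunk[cut:]


def merge_batches(runs):
    """Merge sorted lists, yielding (run index, batch of values).

    Each heap pop emits every value of the selected run that is lower than
    the next minimum in the heap, instead of a single value.
    """
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        _, idx, pos = heapq.heappop(heap)
        run = runs[idx]
        if heap:
            # Binary search for the first value not lower than the next minimum.
            end = bisect.bisect_left(run, heap[0][0], pos + 1)
        else:
            end = len(run)  # last remaining run: emit everything left
        yield idx, run[pos:end]
        if end < len(run):
            heapq.heappush(heap, (run[end], idx, end))


# Run generation (Example 2): 6 and 9 are below the next minimum, 14.
low, high = split_sorted_chunk(sorted([9, 25, 6, 31]), next_min=14)

# Run merge (Example 3): the first pop emits 14 and 15, then 40 is re-inserted.
runs = [[14, 15, 40], [17, 30], [18, 19, 50]]
batches = list(merge_batches(runs))
merged = [value for _, batch in batches for value in batch]
```

Here `low` holds the values merged on-the-fly and `high` the values written to the intermediate run. In the merge, fewer heap operations are performed than values written, which is the benefit described above when runs contain long stretches below the next minimum.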
B. Results and discussion
[Fig. 8 labels: Min-heap; DRAM; Intermediate run; Intermediate run; PCM]
Fig. 10: Execution time speedup of MONTRES-NVM on partially sorted data as compared to merge sort on DRAM, merge sort on PCM, and MONTRES: (a) partially sorted data 20%, (b) partially sorted data 40%, (c) partially sorted data 60%.

too low to justify the use of more DRAM. In effect, the added CPU overhead is not compensated by the memory read/write savings induced by the use of MONTRES-NVM.

In the case of 60% partially sorted data, MONTRES-NVM improves on merge sort on PCM by up to 80%, on merge sort on DRAM by up to 40%, and on MONTRES by up to 33%. In fact, when there are more partially sorted data, the primary index does a good job of avoiding several block sorting operations (thus avoiding several read and write operations on PCM). In addition, the merge phase is also accelerated.

V. CONCLUSION

This paper presents an external sorting algorithm named MONTRES-NVM for a hybrid main memory. This algorithm is an adaptation of MONTRES, a flash-based external sorting algorithm. We believe that in a hybrid memory, traditional in-memory sorting algorithms are not well suited, as the performance behaviors of DRAM and PCM are different. MONTRES-NVM uses a small part of DRAM to sort a data set on PCM. MONTRES-NVM tries to reduce the number of write operations performed on the PCM while maintaining a set of structures in DRAM to accelerate the sorting process.

Less effort has been made in state-of-the-art work to optimize the CPU overhead of external sorting algorithms as compared to in-memory algorithms. Traditionally, as I/O operations are very time consuming, the CPU overhead is hidden. When performing external sorting on hybrid memory, one should pay particular attention to the CPU overhead. We will investigate different ways to reduce the CPU overhead to make better use of the DRAM space during the sorting process in MONTRES-NVM. We will also work toward reducing the energy consumption overhead of sorting algorithms on hybrid memories for embedded systems.

REFERENCES

[1] PCMSim, https://github.com/huwan/pcmsim
[2] J. Boukhobza, S. Rubini, R. Chen, and Z. Shao, “Emerging NVM,” In: ACM Transactions on Design Automation of Electronic Systems 23.2 (Nov. 2017), pp. 1-32.
[3] O. Mutlu, “Main Memory Scaling: Challenges and Solution Directions,” In: More than Moore Technologies for Next Generation Computer Design. New York, NY: Springer New York, 2015, pp. 127-153.
[4] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, “Phase-change technology and the future of main memory,” In: IEEE Micro 30.1 (2010).
[5] J. Boukhobza and P. Olivier, “Flash Memory Integration: Performance and Energy Issues,” 1st ed. UK: ISTE Press - Elsevier, 2017.
[6] H. Park and K. Shim, “FAST: Flash-aware external sorting for mobile database systems,” In: Journal of Systems and Software 82.8 (2009), pp. 1298-1312.
[7] J. Lee, H. Roh, and S. Park, “External Mergesort for Flash-Based Solid State Drives,” In: IEEE Transactions on Computers 65.5 (May 2016), pp. 1518-1527.
[8] G. Graefe, “Implementing Sorting in Database Systems,” In: ACM Comput. Surv. 38.3 (Sept. 2006).
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, “Introduction to Algorithms, Third Edition,” 3rd ed. The MIT Press, 2009.
[10] A. Laga, J. Boukhobza, F. Singhoff, and M. Koskas, “MONTRES: Merge ON-the-Run External Sorting Algorithm for Large Data Volumes on SSD Based Storage Systems,” In: IEEE Transactions on Computers 66.10 (Oct. 2017), pp. 1689-1702.
[11] D. E. Knuth, “The art of computer programming: sorting and searching,” Vol. 3. Pearson Education, 1998.