
A Performance Study of Big Data on Small Nodes

Dumitrel Loghin, Bogdan Marius Tudor, Hao Zhang, Beng Chin Ooi, Yong Meng Teo
Department of Computer Science
National University of Singapore
{dumitrel,bogdan,zhangh,ooibc,teoym}@comp.nus.edu.sg

ABSTRACT

The continuous increase in volume, variety and velocity of Big Data exposes datacenter resource scaling to an energy utilization problem. Traditionally, datacenters employ x86-64 (big) server nodes with power usage of tens to hundreds of Watts. But lately, low-power (small) systems originally developed for mobile devices have seen significant improvements in performance. These improvements could lead to the adoption of such small systems in servers, as announced by major industry players. In this context, we systematically conduct a performance study of Big Data execution on small nodes in comparison with traditional big nodes, and present insights that would be useful for future development. We run Hadoop MapReduce, MySQL and in-memory Shark workloads on clusters of ARM big.LITTLE boards and Intel Xeon server systems. We evaluate execution time, energy usage and total cost of running the workloads on self-hosted ARM and Xeon nodes. Our study shows that there is no one-size-fits-all rule for judging the efficiency of executing Big Data workloads on small and big nodes. But small memory size, low memory and I/O bandwidths, and software immaturity concur in canceling the lower-power advantage of ARM servers. We show that I/O-intensive MapReduce workloads are more energy-efficient to run on Xeon nodes. In contrast, database query processing is always more energy-efficient on ARM servers, at the cost of slightly lower throughput. With minor software modifications, CPU-intensive MapReduce workloads are almost four times cheaper to execute on ARM servers.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact the copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 7. Copyright 2015 VLDB Endowment 2150-8097/15/03.

1. INTRODUCTION

The explosion of Big Data analytics is a major driver for datacenter computing. As the volume, variety and velocity of the data routinely collected by commercial, scientific and governmental users far exceed the capacity of a single server, scaling performance in the Big Data era is primarily done by increasing the number of servers. But this approach of scaling performance leaves Big Data computing exposed to an energy utilization problem, and mounting operational overheads related to datacenter costs such as hosting space, cooling and manpower.

Concomitantly with the explosion of Big Data, the past few years have seen a spectacular evolution of the processing speed of ARM-based mobile devices such as smartphones and tablets. Most high-end mobile devices routinely have processors with four or eight cores and clock frequencies exceeding 2 GHz, memory sizes of up to 4 GB, and fast flash-based storage reaching up to 128 GB. Moreover, the latest generations of mobile hardware can run full-fledged operating systems such as Linux, and the entire stack of user-space applications that is available under Linux. Due to their smaller size, smaller power requirements, and lower performance, these systems are often called small nodes or wimpy nodes [14].

As a result of the fast-evolving landscape of mobile hardware, and in a bid to reduce energy-related costs, many companies and research projects are increasingly looking at using non-traditional hardware as server platforms [27, 31]. For example, Barcelona Supercomputing Center is looking at using ARM-based systems as the basis for their exascale platform [23]. Key hardware vendors such as Dell, HP and AppliedMicro have launched server prototypes based on ARM processors [28], and a plethora of startups are looking into adopting ARM solutions in the enterprise computing landscape. Even AMD, which historically has shipped only server processors based on the x86/x64 architecture, targets to launch ARM-based servers [2].

Scaling Big Data performance requires multiple server nodes with good CPU and I/O resources. Intuitively, high-end ARM-based servers could fit this bill well, as they have a relatively good balance of these two resources. Furthermore, their low energy consumption, low price, and small physical size make them attractive for cluster deployments. This naturally raises the research question of the feasibility of low-power ARM servers as contenders to traditional Intel/AMD x64 servers for Big Data processing. If an ARM-based cluster can match the performance of a traditional Intel/AMD cluster with lower energy or cost, this could usher in a new era of green computing that can help Big Data analytics reach new levels of performance and cost-efficiency.

Energy-efficient data processing has long been a common interest across the entire landscape of systems research. The bulk of related work can be broadly classified into two main categories: energy-proportionality studies, which aim to improve the correlation between server power consumption and server utilization [19, 17, 10, 26], and building blocks for energy-efficient servers, where various software and hardware techniques are used to optimize bottleneck resources in servers [8, 26, 31, 24, 27].

As the dominating architecture in datacenters has traditionally been Intel/AMD x64, most of the studies are geared toward these types of servers. In the database community, studies on Big Data workloads almost always use Intel/AMD servers, with a few works using Intel Atom systems as low-power servers [8, 1, 31, 24]. The shift to ARM-based servers is a game-changer, as these systems have different CPU architectures, embedded-class storage systems, and a typically lower power consumption than even the most energy-efficient Intel systems [25]. As the basic building blocks of servers are changing, this warrants a new look at the energy-efficiency of Big Data workloads.

In this paper, we conduct a measurement-driven analysis of the feasibility of executing data analytics workloads on ARM-based small server nodes, and make the following contributions:

• We present the first comparative study of Big Data performance on ARM big.LITTLE small nodes versus traditional Intel Xeon nodes. This analysis covers the performance, energy-efficiency and total cost of ownership (TCO) of data analytics. Our analysis shows that there is no one-size-fits-all rule regarding the efficiency of the two server systems: sometimes ARM systems are better, other times Intel systems are better. Surprisingly, deciding which workloads are more efficient on small nodes versus traditional systems is not primarily affected by the bottleneck hardware resource, as observed in high-performance computing and traditional server workloads [27]. Instead, software immaturity and the challenges imposed by limited RAM size and bandwidth are the main culprits that compromise the performance on current-generation ARM devices. Due to these limitations, Big Data workloads on ARM servers often cannot make full use of the native CPU or I/O performance. During this analysis, we identify a series of features that can improve the performance of ARM devices by a factor of five, with minor software modifications.

• Using Google's TCO model [15], our analysis shows that ARM servers could potentially lead to four times cheaper CPU-intensive data analytics. However, I/O-intensive jobs may incur higher cost on these small server nodes.

The rest of the paper is organized as follows. In Section 2 we provide a detailed performance comparison between single-node Xeon and ARM systems. In Section 3 we present measurement results of running MapReduce and query processing on clusters of Xeon and ARM nodes. We discuss our TCO analysis in Section 4 and present the related work in Section 5. Finally, we conclude in Section 6.

2. SYSTEMS CHARACTERIZATION

This section provides a detailed performance characterization of an ARM big.LITTLE system by comparison with a server-class Intel Xeon system. We employ a series of widely used micro-benchmarks to assess the static performance of the CPU, memory bandwidth, network and storage.

2.1 Setup

2.1.1 ARM Server Nodes
The ARM server node analyzed throughout this paper is the Odroid XU development board with the Samsung Exynos 5410 System on a Chip (SoC). This board is representative of high-end mobile phones. For example, the Samsung Exynos 5410 is used in the international version of the Samsung Galaxy S4 phones. Other high-end contemporary mobile devices employ SoCs with very similar performance characteristics, such as Qualcomm Snapdragon 80x and NVIDIA Tegra 4.

Specific to the Exynos 5410 SoC is that the CPU has two types of cores: ARM Cortex-A7 little cores, which consume a small amount of power and offer slow in-order execution, and ARM Cortex-A15 big cores, which support faster out-of-order execution, but with a higher power consumption. This heterogeneous CPU architecture is termed ARM big.LITTLE. The CPU has a total of eight cores, split in two groups¹ of cores: one group of four ARM Cortex-A7 little cores, and one group of four ARM Cortex-A15 big cores. Each core has a pair of dedicated L1 data and instruction caches, and each group of cores has an L2 unified cache.

¹ In the computer architecture literature, this group of cores is termed a cluster of cores. However, due to potential confusion with a cluster of nodes encountered in distributed computing, we shall use the term group of cores.

Although the CPU has eight cores, the Exynos 5410 allows either the four big cores or the four little cores to be active at one moment. To save energy, when one group is active, the other one is powered down. Thus, a program cannot execute on both the big and the little cores at the same time. Instead, the operating system (OS) can alternate the execution between them. Switching between the two groups incurs a small performance price, as the L2 and L1 caches of the newly activated group must warm up.

The core clock frequency of the little cores ranges from 250 to 600 MHz, and that of the big cores ranges from 600 MHz to 1.60 GHz. Dynamic voltage and frequency scaling (DVFS) is employed to increase the core frequency in response to the increase in CPU utilization. On this ARM big.LITTLE architecture, the OS can be instructed to operate the cores in three configurations:

1. little: only use the ARM Cortex-A7 little cores, and their frequency is allowed to range from 250 to 600 MHz.

2. big: only use the ARM Cortex-A15 big cores, and their frequency is allowed to range from 600 to 1600 MHz.

3. big.LITTLE: the OS is allowed to switch between the two groups of cores. The switching frequency is 600 MHz.

Each Odroid XU node has 2 GB of low-power DDR3 memory, a 64 GB eMMC flash storage and a 100 Mbit Ethernet card. However, to improve the network performance, we connect a Gbit Ethernet adapter on the USB 3.0 interface.

2.1.2 Intel Server Nodes
We compare the Odroid XU nodes with server-class Intel x86-64 nodes. We use a Supermicro 813M 1U server system based on two Intel Xeon E5-2603 CPUs with four cores each. This system has 8 GB DDR3 memory, 1 TB hard disk and 1 Gbit Ethernet network card. For a fair comparison with the 4-core ARM nodes, we remove one of the Xeon CPUs from each node. The idle power of the 4-core Xeon node is 35 W, and its peak power is around 55 W. Hence, this node has a lower power profile compared to traditional nodes.

Table 1: Systems characterization

                                          Xeon              ARM Cortex-A7 (LITTLE)   ARM Cortex-A15 (big)
  Specs
    ISA                                   x86-64            ARMv7l                   ARMv7l
    Cores                                 4                 4                        4
    Frequency                             1.20 - 1.80 GHz   250 - 600 MHz            0.60 - 1.60 GHz
    L1 Data Cache                         128 KB            32 KB                    32 KB
    L2 Cache                              1 MB              2 MB                     2 MB
    L3 Cache                              10 MB             N/A                      N/A
  CPU (one core, max frequency)
    Dhrystone [MIPS/MHz]                  5.8               3.7                      3.1
      CPU power [W]                       15.0              0.5                      3.4
      System power [W]                    50.0              4.4                      7.3
    CoreMark [iterations/MHz]             5.3               5.0                      3.52
      CPU power [W]                       15.6              0.3                      2.5
      System power [W]                    50.6              4.2                      6.4
    Java [MIPS/MHz]                       0.36              0.40                     0.38
      CPU power [W]                       16.5              0.3                      3.4
      System power [W]                    51.5              3.0                      6.1
  Storage
    Write throughput [MB/s]               165.0             32.6                     39.2
    Read throughput [MB/s]                173.0             118.0                    121.0
    Buffered read throughput [GB/s]       4.6               0.8                      1.2
    Write latency [ms]                    9.8               14.2                     14.6
    Read latency [ms]                     2.7               0.9                      0.8
  Network
    TCP bandwidth [Mbits/s]               942               199                      308
    UDP bandwidth [Mbits/s]               811               295                      420
    Ping latency [ms]                     0.2               0.7                      0.7

2.1.3 Software Setup
The ARM-based Odroid XU board runs the Ubuntu 13.04 operating system with Linux kernel 3.4.67, which is the latest kernel version working on this platform. For compiling native C/C++ programs, we use gcc 4.7.3 arm-linux-gnueabihf. The Xeon server runs Ubuntu 13.04 with Linux kernel 3.8.0 for the x64 architecture. The C/C++ compiler available on this system is gcc 4.7.3. We install Oracle's Java Virtual Machine (JVM) version 1.7.0_45 on both systems.

2.2 Benchmark Results

Big Data applications stress all system components, such as CPU cores, memory, storage and network I/O. Hence, we first evaluate the individual peak performance of these components before running complex data-intensive workloads. For this evaluation, we employ benchmarks that are widely used in industry and systems research. For example, we measure how many Million Instructions per Second (MIPS) a core can deliver using the traditional Dhrystone [29, 5] and emerging CoreMark [4] benchmarks. For storage and network throughput and latency, we use Linux tools such as dd, ioping, iperf and ping. Because Odroid XU is a heterogeneous system, we individually benchmark both the little and the big core configurations. Table 1 summarizes system characteristics in terms of CPU, storage and network I/O, and Figure 1 compares the memory bandwidth of Xeon and all three Odroid XU configurations.

We measure native CPU MIPS performance by initially running the traditional Dhrystone benchmark [29]. We compile the code with gcc using the maximum level of optimization, -O3, and tuning the code for the target processor (e.g. -mcpu=cortex-a7 -mtune=cortex-a7 for little cores). In terms of Dhrystone MIPS per MHz, we obtain a surprising result: little cores perform 21% better than big cores per MHz. This is unexpected because ARM reports that Cortex-A7 has lower Dhrystone MIPS per MHz than Cortex-A15, but they use the internal armcc compiler [5]. We conclude that it is the way gcc generates machine code that leads to these results. To check our results, we run the newer CoreMark CPU benchmark, which is increasingly used by embedded market players, including ARM [4]. We use compiler optimization flags to match those employed in the reported performance results for an ARM Cortex-A15. More precisely, we activate NEON SIMD (-mfpu=neon), hardware floating point operations (-mfloat-abi=hard) and aggressive loop optimizations (-faggressive-loop-optimizations). We obtain a score of 3.52 per core per MHz, as opposed to the reported 4.68. We attribute this difference to the different compiler and system setup. However, little cores are again more energy-efficient, obtaining more than half the score of big cores with only 0.3 W of power. The difference between ARM cores and Xeon cores is similar for both the Dhrystone and CoreMark benchmarks: Xeon cores obtain almost two times higher scores per MHz than ARM cores.

Since Big Data frameworks such as Hadoop and Spark run on top of the Java Virtual Machine, we also benchmark Java execution. We develop a synthetic benchmark performing integer and floating point operations such that it stresses the core's pipeline. As expected, the little Cortex-A7 cores obtain less than half the MIPS of the Cortex-A15 cores. On the other hand, the big Cortex-A15 cores achieve just 7% fewer MIPS than Xeon cores, but using a quarter of the power. Thus, in terms of core performance-per-power, little cores are the best with around 800 MIPS/W, big cores come second with 180 MIPS/W, and Xeon cores are the worst with 40 MIPS/W.
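The performance-per-power figures above follow directly from the per-MHz Java scores, maximum clock frequencies and CPU power values listed in Table 1. The short Python sketch below only reproduces this arithmetic for illustration; it is not part of our measurement tooling.

    # Performance-per-power from Table 1 (Java benchmark, one core, max frequency).
    # MIPS = (MIPS per MHz) * (clock in MHz); MIPS/W = MIPS / CPU power.
    cores = {
        # name: (MIPS per MHz, max frequency [MHz], CPU power [W])
        "Xeon E5-2603":           (0.36, 1800, 16.5),
        "ARM Cortex-A15 (big)":   (0.38, 1600, 3.4),
        "ARM Cortex-A7 (LITTLE)": (0.40, 600, 0.3),
    }

    for name, (mips_per_mhz, freq_mhz, cpu_power_w) in cores.items():
        mips = mips_per_mhz * freq_mhz
        print(f"{name}: {mips:.0f} MIPS, {mips / cpu_power_w:.0f} MIPS/W")

Running it yields roughly 40, 180 and 800 MIPS/W, matching the rounded figures quoted above.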

Table 2: Workloads

  Workload            Input Type          Input Size
  TestDFSIO           synthetic           12 GB
  Terasort            synthetic           12 GB
  Pi                  -                   16 Gsamples
  Kmeans              Netflix             4 GB
  Wordcount           Wikipedia           12 GB
  Grep                Wikipedia           12 GB
  TPC-C Benchmark     TPC-C Dataset       12 GB
  TPC-H Benchmark     TPC-H Dataset       2 GB
  Shark Scan Query    Ranking             21 GB
  Shark Join Query    Ranking/UserVisit   43 MB/1.3 GB and 86 MB/2.5 GB

Figure 1: Memory bandwidth comparison (Xeon E5-2603 versus ARM Cortex-A7, Cortex-A15 and big.LITTLE; bandwidth in GB/s over memory access sizes from 1 kB to 1 GB)

Figure 2: Experimental setup (the cluster under test, Xeon or Odroid XU, is connected to a controller system over 1 Gbps Ethernet and a serial interface; a Yokogawa WT210 power meter on the 240 V AC outlet measures cluster power)

To measure memory bandwidth we use pmbw 0.6.2 (Parallel Memory Bandwidth Benchmark) [7]. Figure 1 plots the memory bandwidth of Xeon and the three ARM configurations, in log-log scale. When data fits into cache, Xeon has a bandwidth of 450 GB/s when using eight cores and 225 GB/s when using only four cores. The four Cortex-A15 cores have around ten times less bandwidth than Xeon, while the Cortex-A7 cores have 20 times less. When accessing the main memory, the gap decreases. Main memory bandwidth for Cortex-A15 cores and Cortex-A7 cores is two and four times, respectively, less than Xeon's.

We measure storage I/O read and write throughput and latency using dd (version 8.20) and ioping (version 0.7), respectively. Write throughput when using big cores is four times worse than for the Xeon node. When using little cores, the throughput is even smaller, suggesting that the disk driver and file system have significant CPU usage. Since modern operating systems tend to cache small files in memory, we also measured buffered reads. The results are correlated with the memory bandwidth values, considering that only one core is used. For example, buffered read on big cores has 1.2 GB/s throughput, while the main memory bandwidth when using all four cores is 4.9 GB/s. One surprising result is that eMMC write latency is bigger even than traditional hard-disk latency. This can be explained by the fact that (i) eMMC uses NAND flash, which has big write latency, and (ii) modern hard-disks have caches and intelligent controllers to hide the write latency.

Lastly, we measure networking subsystem bandwidth and latency using iperf (version 2.0.5) and ping (from iputils-sss20101006 on Xeon and iputils-s20121221 on Odroid XU). We measure both TCP and UDP bandwidth since modern server software may use both. TCP bandwidth is three times lower on Odroid XU when using big cores and more than four times lower when using little cores. For UDP, the gap is smaller since ARM bandwidth is two and three times lower on big and little cores, respectively. On the one hand, the difference can be explained by the fact that we use an adapter connected through USB 3.0. Even if USB 3.0 has a theoretical bandwidth of 5 Gbit/s, the actual implementation can be much slower. Moreover, the communication path is longer, a fact shown by the three times bigger latency of Odroid XU. On the other hand, the difference when using little and big cores is explained by the fact that the TCP/IP stack has significant CPU usage. Moreover, there are frequent context switches between user and kernel space, which lower the overall system performance.

Summary. Our characterization shows that the Odroid XU ARM-based platform has lower overall performance than a representative server system based on the Intel Xeon processor. The ARM system is significantly vulnerable at the memory level, with its only 2 GB of RAM and its four to twenty times lower memory bandwidth.

3. MEASUREMENTS-DRIVEN ANALYSIS

3.1 Methodology
We characterize Big Data execution on small nodes in comparison with traditional server-class nodes, by evaluating the performance of the Hadoop Distributed File System (HDFS), Hadoop MapReduce and query processing frameworks such as MySQL and Shark. Our analysis is based on measuring execution time and total energy at cluster level. We run well-known MapReduce applications on Hadoop, the widely-used open-source implementation of the MapReduce framework [20]. We use Hadoop 1.2.1 running on top of Oracle Java 1.7.0_45. We choose workloads that stress all system components (CPU, memory, I/O) as described in Table 2. All workloads are part of the Hadoop examples, except Kmeans, which was adapted from the PUMA benchmark suite [1]. For all workloads except Pi and Kmeans, we use a 12 GB input size such that, even when running on a 6-node cluster, each node processes 2 GB of data, which cannot be accommodated by the Odroid XU RAM. For Pi with 12 billion samples, the execution time on Xeon nodes is too small, thus we increase the input size to 16 billion samples. For Kmeans with a 12 GB input, execution on Odroid XU takes too long, thus we reduce the input size to 4 GB. For Wordcount and Grep, we use the latest dump of Wikipedia articles and trim it to 12 GB.
Figure 3: HDFS performance (TestDFSIO write and read throughput and energy on 1-node and 6-node clusters for Xeon, big, little and big.LITTLE)

Figure 4: MapReduce Pi estimator in Java and C++ (execution time on 1, 2 and 6 nodes for big, little, big.LITTLE and Xeon)

We evaluate disk-based query processing frameworks by running the TPC-C [11] and TPC-H [12] benchmarks on MySQL (version 5.6.16). For TPC-C, we populate the database with 130 warehouses, which occupy about 12 GB of storage. We set the scale factor of the TPC-H data generator to 2, thus creating 2 GB in total for all 8 tables. To evaluate distributed, in-memory query processing, we choose scan and join queries from the AMPLab Big Data Benchmark [3] running on Shark [32]. We use the Ranking and UserVisit datasets from S3, where the benchmark prepares datasets of different sizes. We use Shark 0.9.1 running on top of Oracle's Java 1.7.0_45.

We run the workloads on clusters of Odroid XU and Intel Xeon E5-2603-based nodes. For power and energy measurements, we use a Yokogawa WT210 power monitor connected to the cluster's AC input line. A controller system is used to start the benchmarks and collect all the logs. This setup is summarized in Figure 2. Since we want to analyze the behavior of Big Data applications and their usage of different subsystems, we use the dstat tool (version 0.7.2) to log the utilization of CPU, memory, storage and network. All the experiments are repeated three times and the average values are reported. For Hadoop measurements on Xeon, the standard deviation (SD) is less than 10% and 16% of the average for time and energy, respectively. On ARM nodes, the SD is less than 24% and 26% of the average for time and energy, respectively. We observe that the biggest SD values are obtained by I/O-intensive workloads, such as TestDFSIO. On the other hand, CPU-intensive workloads have lower SDs. For example, some of the measurements for Pi have an SD of zero. For the query processing measurements, the SD on Xeon nodes is less than 9% and 6% of the average for time and energy, respectively, while for ARM nodes it is less than 13% and 14%, respectively.
3.2 HDFS
HDFS is the underlying file system for many Big Data frameworks such as Hadoop, Hive and Spark, among others. We measure the throughput and energy usage of HDFS read and write distributed operations using Hadoop's TestDFSIO benchmark with 12 GB input. Figure 3 plots the throughput, as reported by TestDFSIO, and the measured energy consumption of write and read on single-node and 6-node clusters. The throughput significantly decreases when writing on multiple nodes, especially for Xeon nodes. This decrease occurs because of the HDFS replication mechanism, which, by default, replicates each block three times. The additional network and storage operations due to replication increase the execution time and lower the overall throughput. This observation is validated by the less visible degradation of throughput for the read operation. The increasing execution time of write on multiple nodes leads to higher energy consumption, especially for Xeon nodes. On a 6-node cluster, the write throughput of Xeon is two times higher compared to ARM, but the energy usage is more than four times bigger. For read, Xeon's throughput is three times better than ARM's big.LITTLE, while the energy ratio is five. On ARM nodes with little cores, the execution times of HDFS write and read operations increase due to lower JVM performance. Hence, the energy consumption is higher compared to running on the big and big.LITTLE configurations.

Summary. ARM big.LITTLE is more energy-efficient than Xeon when executing HDFS read and write operations, at the cost of 2-3 times lower throughput.

3.3 Hadoop
We evaluate the time performance and energy-efficiency of Hadoop by running five widely used workloads, as shown in Table 2. We use default Hadoop settings, except that we set the number of slots to four such that it equals the number of cores on each node. Using this configuration, all workloads run without errors, except for Terasort and Kmeans, which fail on Odroid XU due to insufficient memory. After experimenting with more alternative configurations, we found two that allow both programs to finish without failure. Firstly, we decrease the number of slots to two on Odroid XU. Secondly, we keep using four slots but limit io.sort.mb to 50 MB, half of its default value. These two settings have different effects on the two programs. For example, on a 4-node cluster, Terasort running on two slots is 10-20% faster than using a limited io.sort.mb. This result is due to the fact that Terasort is data-intensive, hence it benefits less
from using more cores but having a limited memory buffer. On the other hand, Kmeans benefits more from running on four slots, being 20% faster on big cores and 35% faster on little cores, compared to running on two slots. Kmeans is a CPU-intensive workload executing a large number of floating point operations in both map and reduce phases. Thus, it benefits from running on higher core counts. In the remainder of this paper, we present the results on two slots for Terasort, and on four slots with io.sort.mb of 50 for Kmeans, when running on ARM big.LITTLE nodes.

Figure 5: MapReduce scaling (execution time in log scale versus cluster size for Pi Java, Pi C++, Grep, Kmeans, Terasort and Wordcount on Xeon, ARM big, ARM LITTLE and ARM big.LITTLE)

Figure 6: Xeon-ARM performance equivalence (execution time and energy of one Xeon node versus the number of ARM nodes needed to match it; the 8-node ARM point for Wordcount is estimated)

Figure 7: MapReduce on 6-node cluster (time, power and energy in log scale for the six workloads on Xeon, ARM big, ARM LITTLE and ARM big.LITTLE)

When running the experiments, we observe low performance of Pi on Odroid XU. Compared to Xeon, Pi on big and big.LITTLE runs 7-9 times slower, and on little cores up to 20 times slower. This is surprising because Pi is CPU-intensive and we show in Section 2 that the performance ratio between Xeon and ARM cores is at most five. We further investigate the cause of this result. Firstly, we profile TaskTracker execution on Odroid XU. We observed that the JVM spends 25% of the time in __udivsi3. This function emulates 32-bit unsigned integer division in software, although the Exynos 5410 SoC on the Odroid XU board supports the UDIV hardware instruction. But other SoCs may not implement this instruction, since it is defined as optional in the ARMv7-A ISA [6]. Thus, the JVM uses the safer approach of emulating it in software. Secondly, we port Pi to C++ and run it using the Hadoop Pipes mechanism. We use the same gcc compilation flags as for the native benchmarks in Section 2. The comparison between the Java and C++ implementations is shown in Figure 4. Compared to the original Java version, the C++ implementation is around five times faster on ARM nodes and only 1.2 times faster on Xeon-based nodes. With this minor software porting, we obtain a significant improvement in execution time which leads to energy savings, as we further show. In the remainder of this section, we shall show the results for both the Pi Java and Pi C++ implementations.
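For readers unfamiliar with the workload, the sketch below illustrates the computation a Pi map task performs and the final reduction. It is a simplified, single-process Python illustration: it uses plain pseudo-random samples rather than the quasi-random sequence of the Hadoop example, and omits all Hadoop and Pipes plumbing.

    import random

    def pi_map(num_samples, seed):
        """Map task: count samples that fall inside the unit quarter-circle."""
        rng = random.Random(seed)
        inside = 0
        for _ in range(num_samples):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                inside += 1
        return inside

    def pi_reduce(inside_counts, samples_per_map):
        """Reduce: combine per-map counts into the final estimate of pi."""
        total_inside = sum(inside_counts)
        total = samples_per_map * len(inside_counts)
        return 4.0 * total_inside / total

    counts = [pi_map(1_000_000, seed) for seed in range(4)]   # four "map tasks"
    print(pi_reduce(counts, 1_000_000))

The inner loop is dominated by arithmetic on counters and indices, which is why a software-emulated integer division in the JVM has such a visible effect on this workload.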
We present the time and energy performance of the six workloads on Xeon and ARM clusters. First, since scalability is a main feature of the MapReduce framework, we investigate how Hadoop scales on clusters of small nodes. We show time scaling in log scale on four cluster sizes in Figure 5. All workloads exhibit sublinear scaling on both Intel and ARM nodes, which we attribute to housekeeping overheads of Hadoop when running on more nodes. When the overheads dominate the useful work, the scaling degrades. For the Pi workload running on six nodes there is too little useful work for the mappers to perform, hence there is not much improvement in the execution time on both types of servers. On the other hand, Kmeans and Grep exhibit higher speedup on the 6-node ARM cluster compared to Xeon because the slower ARM cores have enough CPU-intensive work to perform.

Secondly, Figure 6 shows how many ARM-based nodes can achieve the execution time of one Xeon node. We select the ARM big.LITTLE configuration, which exhibits the closest execution time compared to one Xeon. For Wordcount, the difference between six ARM nodes and one Xeon node is large, and thus we estimate based on the scaling behavior that eight ARM nodes exhibit a closer execution time.

Thirdly, Figure 7 shows the time, power and energy of 6-node clusters using log scale. Based on the energy usage, the workloads can be categorized into three classes:

• Pi Java and Kmeans execution times are much larger on ARM compared to Xeon. Both workloads incur high CPU usage on ARM, which results in high power usage. The combined effect is a slightly higher energy usage on ARM nodes.

• Pi C++ and Grep exhibit a much smaller execution
time gap. Both are CPU-intensive and have high power usage, but overall, their energy usage is significantly lower on ARM.

• Wordcount and Terasort are I/O-intensive workloads, as indicated by the lower power usage on ARM compared to the other workloads. They obtain better execution time on Xeon due to higher memory and storage bandwidths. However, the time improvement does not offset the higher power usage of Xeon, therefore energy on ARM is lower.

Table 3: MapReduce Performance-to-power Ratio (columns are cluster sizes of 1, 2, 4 and 6 nodes for each system)

  Workload    Unit          Xeon                       ARM big                    ARM LITTLE                 ARM big.LITTLE
                            1     2     4     6        1     2     4     6        1     2     4     6        1     2     4     6
  Pi Java     Msamples/J    1.44  1.58  0.88  0.63     0.68  0.60  0.60  0.56     0.78  0.83  0.80  0.58     0.67  0.60  0.61  0.57
  Pi C++      Msamples/J    2.51  1.89  1.04  0.71     3.23  3.03  2.95  2.64     4.56  4.37  4.01  2.78     3.33  2.95  2.78  2.56
  Grep        MB/J          0.56  0.46  0.27  0.21     1.03  0.93  0.92  0.92     1.47  1.34  1.31  1.27     1.03  0.93  0.86  0.92
  Kmeans      MB/J          0.50  0.41  0.25  0.22     0.21  0.19  0.19  0.20     0.28  0.25  0.23  0.23     0.21  0.19  0.18  0.20
  Terasort    MB/J          0.28  0.22  0.15  0.14     0.31  0.25  0.30  0.27     0.35  0.28  0.35  0.30     0.32  0.25  0.30  0.27
  Wordcount   MB/J          0.17  0.14  0.09  0.08     0.12  0.11  0.10  0.09     0.18  0.16  0.12  0.10     0.12  0.11  0.10  0.10

Summary. We sum up by showing the performance-to-power ratio (PPR) of all workloads on all cluster configurations as a heat-map in Table 3. PPR is defined as the amount of useful work performed per unit of energy. For workloads that scan all input, we compute the PPR as the ratio between input size and energy. For Pi, the input file contains the number of samples to be generated during the map phase. Hence, we express the PPR as millions of samples (Msamples) per unit of energy. A higher (green) PPR represents a more energy-efficient execution. In correlation with our classification, Pi Java and Kmeans exhibit better PPR on Xeon, while all other workloads have the highest PPR on ARM little cores. As indicated in Table 3, the 1-node cluster achieves maximum PPR because there is no communication overhead and fault-tolerance mechanism as on multi-node clusters.
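As a concrete illustration of how the entries in Table 3 are derived, the sketch below computes PPR from input size (or sample count) and measured energy. The energy figures used here are placeholders for illustration, not values from our runs.

    def ppr(useful_work, energy_j):
        """Performance-to-power ratio: useful work per Joule of energy."""
        return useful_work / energy_j

    # Data-scanning workload: useful work = input size in MB.
    grep_input_mb = 12 * 1024          # 12 GB input
    grep_energy_j = 9_000.0            # placeholder energy, in Joules
    print(f"Grep PPR: {ppr(grep_input_mb, grep_energy_j):.2f} MB/J")

    # Pi: useful work = millions of generated samples (Msamples).
    pi_msamples = 16_000               # 16 Gsamples = 16,000 Msamples
    pi_energy_j = 6_000.0              # placeholder energy, in Joules
    print(f"Pi PPR: {ppr(pi_msamples, pi_energy_j):.2f} Msamples/J")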
3.4 Query Processing

3.4.1 MySQL
To show the performance of OLTP and OLAP workloads on Odroid XU and Xeon nodes, we run the TPC-C (for OLTP) and TPC-H (for OLAP) benchmarks on MySQL, the widely-used database system in both academia and industry. We use the default MySQL configuration and conduct the experiments on a single-node cluster. The input dataset settings are shown in Table 2.

3.4.1.1 TPC-C Benchmark.

Table 4: TPC-C performance

  System           tpmC    90th-Percentile/Max RT [s]   RT < 5s   Average Power [W]
  Xeon             315.5   2.75/6.6                     99.6%     38.2
  ARM big          125.1   4.2/7.1                      93.2%     4.9
  ARM LITTLE       112.1   8.2/16.5                     33.4%     4.2
  ARM big.LITTLE   130.8   4.1/7.3                      98.4%     4.9

Figure 8: MySQL TPC-C CDF of response time

The TPC-C workload is data-intensive and mostly incurs random data access with a 1.9:1 read-to-write ratio [9]. In the TPC-C benchmark experiment, we tune the configuration by modifying the number of simultaneous connections to achieve the best throughput and response time on each type of server. Hence, we set one connection on the ARM server and 64 connections on the Xeon server. We summarize the TPC-C results in Table 4 and plot the cumulative distribution function (CDF) of response time in Figure 8, from which we can see that the TPC-C throughput (tpmC) on Xeon is more than two times higher than on each Odroid XU configuration, while the transaction response time (RT) is similar on Xeon and ARM big/big.LITTLE. On ARM little cores, however, response time performance is about two times worse. Our key observations are:

• Although storage read latency on Odroid XU is about 3.5 times lower than on Xeon, the file system cache hides this, and Xeon's lower memory latency leads to better response time.

• Write performance is affected by the 1.5 times higher storage write latency on Odroid XU, as shown in Table 1.

Nevertheless, the average power consumption of Xeon is around ten times higher than Odroid XU, compared to only a two times throughput performance gain.

3.4.1.2 TPC-H Benchmark.
TPC-H queries are read-only, I/O bounded, and represent the most common analytics scenarios in databases. We run all 22 queries of the TPC-H benchmark, and attempt to eliminate experimental variations by flushing the file system cache before running each query. We plot the TPC-H time performance and energy usage for both Xeon and ARM in Figure 9. We observe two opposing performance results for different queries.

• For scan-based queries (e.g. Q1, Q6, Q12, Q14, Q15, Q21) and long-running queries with a large working set (e.g. Q9, Q17 and Q20), Xeon performs 2 to 5 times better than all three ARM configurations. This behavior is due to the higher read throughput and larger memory of the Xeon node. Moreover, the higher amount of free memory used as file cache can significantly reduce read latency in subsequent accesses.

• For random access queries with a small working set (e.g. Q3, Q5, Q7, Q8, Q11, Q16, Q19, Q22), ARM has 1.2–6.9 times better performance, which we mainly attribute to the lower read latency of Odroid XU's flash-based storage.

Summary. In terms of energy, the Xeon node consumes 1.4 to 44 times more than the ARM-based node. Overall, the energy-efficiency of ARM executing queries is higher compared to traditional Xeon.

Figure 9: MySQL TPC-H performance (execution time and energy in log scale for queries Q1-Q22 on Xeon, ARM big, ARM LITTLE and ARM big.LITTLE)

Figure 10: Shark scan query performance (time and energy versus cluster size of 1, 2, 4 and 6 nodes)

Figure 11: Shark join query performance (time and energy versus cluster size of 1, 2, 4 and 6 nodes)

3.4.2 Shark
We investigate the performance of in-memory Big Data analytics using the Shark framework [32]. This is an open source distributed SQL query engine built on top of Spark [33], a distributed in-memory processing framework. With the increasing velocity of Big Data analytics, Shark and Spark are increasingly used because of their low latency and fault-tolerance. In this experiment, we evaluate the performance of scan and join queries. We list the scan query:

  SELECT pageURL, pageRank
  FROM rankings WHERE pageRank > X

and the join query:

  SELECT sourceIP, totalRevenue, avgPageRank FROM
    (SELECT sourceIP,
            AVG(pageRank) as avgPageRank,
            SUM(adRevenue) as totalRevenue
     FROM Rankings AS R, UserVisits AS UV
     WHERE R.pageURL = UV.destURL
       AND UV.visitDate BETWEEN Date('1980-01-01')
       AND Date('X')
     GROUP BY UV.sourceIP)
  ORDER BY totalRevenue DESC LIMIT 1

from the Big Data Benchmark provided by AMPLab [3]. We set the cache size of Shark at 896 MB for both types of nodes to leave enough memory for other processes, such as the Spark master, worker, and HDFS daemons, on Odroid XU. To show the potential of an unrestricted system, we conduct an experiment with the Spark cache set to 5 GB on the Xeon node. We observe that for the scan query, the cache size does not affect the performance, hence we present the results with identical cache size configurations. For the join query, we use two datasets as shown in Table 2. We choose these two sets to investigate three scenarios: (i) both servers have enough memory, (ii) Xeon has enough memory while ARM does not, (iii) both servers do not have enough memory. These scenarios cover memory management issues in modern database systems [34].
The join query is both memory and I/O bounded, as the smaller table is usually used to build an in-memory structure and the other table is scanned once from the storage. The in-memory structure is either a hash table for a hash join implementation, or a table kept in memory for a nested join implementation. Moreover, Spark swaps data between memory and disk, thus it benefits from larger cache sizes. The results for the scan query are shown in Figure 10. For the join query, we only plot the third scenario in Figure 11 due to space limitation. Based on these results, we formulate the following comments.

• For the scan query, ARM big and big.LITTLE are just 1.1-1.7 times slower than Xeon, but more than three times better in energy usage. ARM little cores are twice as slow, but more than four times better in energy usage. Therefore, in terms of PPR, ARM is much better for this kind of query.

• For the join query, when both types of server nodes have enough cache, ARM big/big.LITTLE are about 2–4 times slower than the Xeon node, and 1–3.6 times better in energy usage. ARM LITTLE is 3–7 times slower than Xeon at runtime, and just 1.5–2.9 times better in energy usage.

• For the join query, when both servers do not have enough cache, the runtime and energy ratios are slightly decreasing. For runtime, ARM is slower by up to 2.4 times on big and big.LITTLE, and 4.2 times on little cores. However, ARM has up to 4 times better energy usage, meaning a better PPR compared to Xeon. This happens because Xeon has to read data from storage, hence it does not benefit from its much higher memory bandwidth.

• For the join query, when the Xeon node has enough memory, the runtime gap increases, leading to high energy usage on ARM. The ratio between Xeon and ARM energy usage is between 1.4 and 2.5, thus increasing the PPR of traditional server systems.

Summary. We conclude that ARM has much better energy efficiency when processing I/O bounded scan queries. Nevertheless, for memory and I/O bounded join queries, traditional servers are more suitable because of their larger memory and higher memory and I/O bandwidths.

4. TCO ANALYSIS

We analyze the total cost of ownership (TCO) of executing Big Data applications on emerging low-power ARM servers, in comparison with traditional x86-64 servers. We derive lower and upper bounds for the per-hour cost of CPU- and I/O-intensive workloads on single nodes. We consider Pi and Terasort as representatives of CPU- and I/O-intensive workloads, respectively, as discussed in Section 3.3. Moreover, we use the execution time and energy results of the Pi C++ implementation because it better exploits ARM nodes.

Throughout this section we use a series of notations and default values as summarized in Table 5. All costs are expressed in US dollars. The values in Table 5 are either based on our direct measurements or taken from the literature, as indicated by the markers explained below the table. For example, we assume three years of typical server lifetime and a 12-year lifetime for a datacenter [15]. For typical server utilization, we consider a lower bound of 10%, typical for cloud servers [21], and an upper bound of 75% as exhibited by Google datacenters [15].

Table 5: TCO notations and values

  Notation   Value          Description
  Cs,Xeon    $1100 (+)      cost of Xeon-based server node
  Cs,ARM     $280 (+)       cost of ARM-based server node
  Ts         3 years (∗)    server lifetime
  Ul         10% (∗)        low server utilization
  Uh         75% (∗)        high server utilization
  Cd         (#)            datacenter total costs
  Td         12 years (∗)   datacenter lifetime
  Cp         (#)            electricity total costs
  Cph        (∗)            electricity cost per hour
  Pa         (+)            average server power
  Pp,Xeon    55 W (+)       Xeon-based node peak power
  Pp,ARM     16 W (+)       ARM-based node peak power
  Pi,Xeon    35 W (+)       Xeon-based node idle power
  Pi,ARM     4 W (+)        ARM-based node idle power

Values marked with ∗ are taken from the literature, values marked with + are based on our measurements, and values marked with # represent output values.

The cost of electricity is a key factor in the overall datacenter costs. But the electricity price is not the same all over the world. Thus, we consider several alternatives for server location and electricity price [30]. Among these alternatives, we select a lower bound of 0.024 $/kWh (the price of electricity in Russia) and an upper bound of 0.398 $/kWh (the price in Australia). Although we acknowledge that datacenter location may also influence equipment, hosting and manpower costs, throughout this study we consider only the difference in electricity price.

4.1 Marginal Cost
We begin by describing a simple cost model which incorporates equipment and electricity costs. This model estimates the marginal cost of self-hosted systems, being suitable for small, in-house computing clusters. The total cost is

  C = Cs + Cp    (1)

where the electricity cost for the server lifetime period is:

  Cp = Ts · Cph · (U · Pa + (1 − U) · Pi)    (2)
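A minimal sketch of this marginal-cost model, expressed per hour of ownership (equipment amortized over the server lifetime, plus the electricity term of Equation 2), is shown below. The average-power values under load are illustrative assumptions; Table 6 reports the results derived from our actual measurements.

    HOURS_PER_YEAR = 365 * 24

    def marginal_cost_per_hour(c_server, lifetime_years, c_kwh,
                               utilization, p_avg_w, p_idle_w):
        """Equations (1)-(2) expressed as a cost per hour of ownership."""
        equipment = c_server / (lifetime_years * HOURS_PER_YEAR)
        avg_power_kw = (utilization * p_avg_w + (1 - utilization) * p_idle_w) / 1000.0
        electricity = c_kwh * avg_power_kw
        return equipment + electricity

    # Illustrative inputs: equipment costs and idle powers from Table 5,
    # the lower electricity bound from the text, assumed average power under load.
    xeon = marginal_cost_per_hour(1100, 3, 0.024, 0.10, p_avg_w=50, p_idle_w=35)
    arm  = marginal_cost_per_hour(280, 3, 0.024, 0.20, p_avg_w=14, p_idle_w=4)
    print(f"Xeon: {xeon:.3f} $/h, ARM: {arm:.3f} $/h")

With these inputs the equipment amortization dominates, which is why low-utilization costs are largely insensitive to the workload type.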
We further investigate the effects of server utilization and idle power on the marginal cost. As we define lower and upper bounds for server utilization, there are two scenarios for evaluating electricity costs. Firstly, given a low Xeon server utilization of 10% and the execution times of the Pi and Terasort workloads on Xeon and ARM nodes, we obtain two ARM-based server utilizations. For Pi, the ARM server exhibits 20% utilization, while for Terasort, the utilization increases to almost 50%. Secondly, given the upper bound of 75% for Xeon server utilization, we obtain over 100% utilization for the ARM server. Thus, we must employ more than one ARM server to execute the workload of one Xeon. We use the server substitution ratios derived in Section 3.3 and depicted in Figure 6. For CPU-intensive Pi, we use two ARM servers with 82% utilization to achieve the performance of one Xeon server. For I/O-intensive Terasort, we use six ARM servers with 86% utilization to execute the same workload as one 75%-utilized Xeon. The six ARM servers occupy less space than one rack-mounted traditional server but may have a higher equipment cost. We present the results for both scenarios as cost per hour in Table 6.

Table 6: Effect of server utilization on marginal cost

  Job type   Utilization   Server   Min cost [$/h]     Max cost [$/h]
             ratio [%]     ratio    Xeon     ARM       Xeon     ARM
  CPU-int.   10:20         1:1      0.043    0.011     0.044    0.013
  I/O-int.   10:49         1:1      0.043    0.011     0.056    0.013
  CPU-int.   75:82         1:2      0.043    0.022     0.060    0.031
  I/O-int.   75:86         1:6      0.043    0.065     0.059    0.079

Table 7: Effect of server utilization on TCO

  Job type   Utilization   Server   Min cost [$/h]     Max cost [$/h]
             ratio [%]     ratio    Xeon     ARM       Xeon     ARM
  CPU-int.   10:20         1:1      0.066    0.018     0.086    0.025
  I/O-int.   10:49         1:1      0.066    0.017     0.085    0.021
  CPU-int.   75:82         1:2      0.066    0.035     0.086    0.051
  I/O-int.   75:86         1:6      0.066    0.104     0.085    0.127

Figure 12: Effect of idle power on marginal cost (cost per hour with and without idle power for Xeon and ARM, CPU- and I/O-intensive jobs, at minimum and maximum electricity price)

Figure 13: Costs per month (break-down into datacenter, server and power costs for Xeon and ARM, CPU- and I/O-intensive jobs)

For low utilization, the cost per hour of ARM is almost four times lower compared to Xeon. Moreover, CPU- and I/O-intensive jobs have the same cost. On the other hand, the cost of highly utilized servers is slightly higher. Surprisingly, for I/O-intensive jobs, ARM incurs up to 50% higher cost because six ARM servers are required to perform the work of one Xeon.

Next, we investigate the influence of idle power, as a key factor in total electricity costs. This influence may be alleviated by employing energy-saving strategies, such as the All-In Strategy [17]. This strategy assumes that servers can be inducted into a low-power state during inactive periods. At certain intervals, they are woken up to execute the jobs, and afterwards put back to sleep. In reality, servers consume a small amount of power in deep-sleep or power-off mode and may incur high power usage during the wake-up phase. However, we assume that during inactive periods servers draw no power, and perform the study on both utilization scenarios described above. With these assumptions, the influence of idle power is more visible on low-utilized Xeon servers, as shown in Figure 12. In this case, putting Xeon servers to sleep can reduce the hourly cost by 22%. For ARM servers, the cost reduction is 6–10% since the idle power is much lower. At high utilization, the reductions are smaller because the servers are active most of the time.

4.2 TCO
We analyze a more complex TCO model which includes datacenter costs. We use the Google TCO calculator, which implements the model described in [15]. For this model, the total cost is

  C = Cd + Cs + Cp    (3)

We conduct our study based on the following assumptions regarding all three components of the TCO model. Firstly, datacenter costs include capital and operational expenses. Capital expenses represent the cost of designing and building a datacenter. This cost depends on the datacenter power capacity, and it is expressed as a price per Watt. We use a default value of 15 $/W as in [15]. Operational expenses represent the cost of maintenance and security, and depend on the datacenter size which, in turn, is proportional to its power capacity. We use a default value of 0.04 $/kW·month [15]. Secondly, for server costs, besides the equipment itself, there are operational expenses related to maintenance. These expenses are expressed as an overhead per Watt per year. We use the default value of 5% for both types of servers. Moreover, for building a real datacenter, the business may take a loan. The model includes the interest rate for such a loan. We use a value of 8% per year, although for building a datacenter with emerging ARM systems this rate may be higher due to the potential risk associated with this emerging server platform. Thirdly, electricity expenses are modeled based on the average power consumption. In addition, the overhead costs, such as those for cooling, are expressed based on the Power Usage Effectiveness (PUE) of the servers. For the employed Xeon servers, we use the lowest PUE value of 1.1, representing the most energy-efficient Google servers [15]. For ARM servers, we use a higher PUE of 1.5 to incorporate the less energy-efficient power supply and the power drawn by the fan, which is up to 1.5 W and represents ∼10% of the 16 W peak power.
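To make the structure of Equation 3 concrete, the sketch below computes an approximate monthly break-down from the parameters just listed (datacenter capital cost per Watt, operational overheads, server price, PUE and electricity price). It is a simplified approximation, not the Google TCO calculator itself: loan interest and several second-order terms are omitted, and the maintenance overhead is read simply as a fraction of server cost per year.

    def monthly_tco(server_cost, server_peak_w, server_avg_w, pue, c_kwh,
                    dc_capex_per_w=15.0, dc_opex_per_kw_month=0.04,
                    dc_lifetime_months=12 * 12, server_lifetime_months=3 * 12,
                    server_opex_rate=0.05):
        """Approximate monthly cost split into datacenter, server and power (Eq. 3)."""
        # Datacenter: capital cost amortized over its lifetime + size-dependent opex.
        datacenter = (dc_capex_per_w * server_peak_w / dc_lifetime_months
                      + dc_opex_per_kw_month * server_peak_w / 1000.0)
        # Server: equipment amortized over its lifetime + assumed maintenance overhead.
        server = (server_cost / server_lifetime_months
                  + server_opex_rate * server_cost / 12.0)
        # Power: average draw scaled by PUE, for roughly a month of operation.
        power = c_kwh * pue * (server_avg_w / 1000.0) * 24 * 30
        return datacenter, server, power

    dc, srv, pwr = monthly_tco(server_cost=1100, server_peak_w=55,
                               server_avg_w=50, pue=1.1, c_kwh=0.024)
    print(f"Xeon node: datacenter {dc:.2f}, server {srv:.2f}, power {pwr:.2f} $/month")

Even this rough breakdown shows equipment amortization as the dominant term, which is consistent with the discussion of Figure 13 below.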
In Figure 13 we present the TCO values for the high utilization scenario. We show these values as a break-down of the monthly cost into datacenter, server equipment and power costs, as defined in Equation 3. The cost is dominated by equipment expenses. For I/O-intensive workloads, the equipment and power expenses of the six ARM nodes make low-power servers more expensive than traditional Xeon. We summarize the TCO values for both utilization scenarios in Table 7.

5. RELATED WORK
Related work analyzing the energy efficiency of Big Data execution focuses mostly on the traditional x86/x64 architecture, with some projects considering heterogeneous clusters of low-power Intel Atom and high-performance Intel Xeon processors [8, 1, 31, 24]. More generally, the related work can be classified into two categories: energy-proportionality studies [19, 17, 26, 10] and building blocks for energy-efficient servers [8, 16, 31, 24, 27], as we further present.

5.1 Energy Proportionality
The survey in [18] highlights two techniques for saving energy in Hadoop MapReduce deployments: Covering Set (CS) [19] and All-In Strategy (AIS) [17]. Both techniques propose shutting down or hibernating the systems when the cluster is underutilized. CS proposes to shut down all the nodes but a small set (the Covering Set) which keeps at least one replica of each HDFS block. On the other hand, AIS shows it is more energy-efficient to use the entire cluster and finish the MapReduce jobs faster, and then shut down all nodes. Berkeley Energy Efficient MapReduce (BEEMR) [10] proposes to split the MapReduce cluster into interactive and batch zones. The nodes in the batch zone are kept in a low-power state when inactive. This technique is appropriate for MapReduce with Interactive Analysis (MIA) workloads. For this kind of workload, interactive MapReduce jobs tend to access only a fragment of the whole data. Hence, an interactive cluster zone is obtained by identifying these interactive jobs and their required input data. The rest of the jobs are executed on the batch zone at certain time intervals. Using both simulation and validation on Amazon EC2, BEEMR reports energy savings of up to 50%. Feller et al. study the time performance and power consumption of Hadoop on clusters with collocated and separated data and compute nodes [13]. Two unsurprising results are highlighted, namely, that (i) the PPR of collocated nodes is better compared to a separated data and compute deployment, and (ii) power varies across job phases. Tarazu [1] optimizes Hadoop on a heterogeneous cluster with nodes based on Intel Xeon and Atom processors. It proposes three optimizations for reducing the imbalance between low-power and high-performance nodes, which lead to a speedup of 1.7. However, no energy usage study is conducted. Tsirogiannis et al. propose a study on the performance and power of database operators on different system configurations [26]. One of their conclusions is that, almost always, the best-performing configuration is also the most energy-efficient. However, our study shows that this may not be the case, especially if the performance gain cannot offset the high power usage.

5.2 Energy-efficient Servers
With the evolution of low-power processors and flash storage, many research projects combine them to obtain fast, energy-efficient data processing systems [8, 24]. For example, Gordon [8] uses systems with Intel Atom processors and flash-based storage to obtain 2.5 times more performance-per-power than disk-based solutions. For energy evaluation, they use a power model, whereas we directly measure the power consumption. The study in [16] investigates the energy efficiency of a series of embedded, notebook, desktop and server x86-64 systems. This work shows that high-end notebooks with Intel Core processors are 300% and 80% more energy-efficient than low-power server systems and low-power embedded systems, respectively. Moreover, embedded systems based on Intel Atom processors suffer from a poor I/O subsystem. This is in concordance with our findings on recent high-end ARM-based systems. KnightShift [31] is a heterogeneous server architecture which couples a wimpy Atom-based node with a brawny Xeon-based node to achieve energy proportionality for datacenter workloads. This architecture can achieve up to 75% energy savings, which also leads to cost savings. WattDB [24] is an energy-efficient query processing cluster. It uses nodes with Intel Atom and SSDs, and dynamically powers them on or off depending on load. Running TPC-H queries, the authors show that dynamic configurations achieve the performance of static configurations while saving energy.

The impressive evolution of ARM-based systems leads to their possible adoption as servers [2, 28]. In this context, Tudor and Teo investigate the energy-efficiency of an ARM Cortex-A9 based server executing both compute- and data-intensive server workloads [27]. The study shows that the ARM system is unsuitable for network I/O- and memory-intensive jobs. This is correlated with our evaluation showing that even for newer, ARM big.LITTLE-based servers, small memory size and low memory and I/O bandwidths lead to inefficient data-intensive processing. Mühlbauer et al. show the performance of ARM big.LITTLE systems executing OLTP and OLAP workloads, in comparison with Xeon servers [22]. They show a wider performance gap between small and big nodes executing the TPC-C and TPC-H benchmarks. However, they run these benchmarks on a custom, highly optimized, in-memory database system, while we use disk-based MySQL.

In summary, related work lacks a study of Big Data execution on the fast-evolving high-end ARM systems. Our work addresses this by investigating how far these types of systems are from efficient data analytics processing.

6. CONCLUSIONS
In this paper, we present a performance study of executing Big Data analytics on emerging low-power nodes in comparison with traditional server nodes. We build clusters of Odroid XU boards representing the high-end ARM big.LITTLE architecture, and Intel Xeon systems as representative of traditional server nodes. We evaluate the time, energy and cost performance of the well-known Hadoop MapReduce framework and MySQL database system, and of emerging in-memory query processing using Shark. We run workloads exercising CPU cores, memory and I/O in different proportions. The results show that there is no one-size-fits-all rule for the efficiency of the two types of server nodes. However, small memory size, low memory and I/O bandwidth, and software immaturity concur in canceling the lower-power advantage of ARM nodes. For the CPU-intensive MapReduce Pi estimator implemented in Java, a software-emulated instruction results in ten times slower execution time on ARM. Implementing this workload in C++ improves the execution time by a factor of five, leading to almost four times cheaper data analytics on ARM servers compared to Xeon. For I/O-intensive workloads, such as Terasort, six ARM nodes are required to perform the work of one 75%-utilized Xeon. This substitution leads to 50% higher TCO for ARM servers. Lastly, for query processing, ARM servers are much more energy-efficient, at the cost of slightly lower throughput. Moreover, small, random database accesses are even faster on ARM due to lower I/O latency.

On the other hand, sequential database scans benefit more from the bigger memory size of Xeon servers, which acts as a cache. In the future, with the development of 64-bit ARM server systems having bigger memory and faster I/O, and with software improvements, ARM-based servers are well positioned to become a serious contender to traditional Intel/AMD server systems.

7. ACKNOWLEDGMENTS
This work was in part supported by the National Research Foundation, Prime Minister's Office, Singapore, under its Competitive Research Programme (CRP Award No. NRF-CRP8-2011-08). We thank the anonymous reviewers for their insightful comments and suggestions, which helped us improve this paper.

8. REFERENCES
[1] F. Ahmad, S. T. Chakradhar, A. Raghunathan, T. N. Vijaykumar, Tarazu: Optimizing MapReduce on Heterogeneous Clusters, Proc. of 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 61–74, 2012.
[2] AMD, AMD to Accelerate the ARM Server Ecosystem with the First ARM-based CPU and Development Platform from a Server Processor Vendor, http://www.webcitation.org/6PgFAdEFp, 2014.
[3] AMPLab, Big Data Benchmark, https://amplab.cs.berkeley.edu/benchmark, 2014.
[4] ARM, ARM Announces Support For EEMBC CoreMark Benchmark, http://www.webcitation.org/6RPwNECop, 2009.
[5] ARM, Dhrystone and MIPs Performance of ARM Processors, http://www.webcitation.org/6RPwC2TUb, 2010.
[6] ARM, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition, ARM, 2012.
[7] T. Bingmann, Parallel Memory Bandwidth Benchmark / Measurement, http://panthema.net/2013/pmbw/, 2013.
[8] A. M. Caulfield, L. M. Grupp, S. Swanson, Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications, Proc. of 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 217–228, 2009.
[9] S. Chen, A. Ailamaki, M. Athanassoulis, P. B. Gibbons, R. Johnson, I. Pandis, R. Stoica, TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study, SIGMOD Record, 39(3):5–10, 2011.
[10] Y. Chen, S. Alspaugh, D. Borthakur, R. Katz, Energy Efficiency for Large-scale MapReduce Workloads with Significant Interactive Analysis, Proc. of 7th ACM European Conference on Computer Systems, pages 43–56, 2012.
[11] T. P. P. Council, TPC-C benchmark specification, http://www.tpc.org/tpcc, 2010.
[12] T. P. P. Council, TPC-H benchmark specification, http://www.tpc.org/tpch, 2013.
[13] E. Feller, L. Ramakrishnan, C. Morin, On the Performance and Energy Efficiency of Hadoop Deployment Models, Proc. of 2013 IEEE International Conference on Big Data, pages 131–136, 2013.
[14] V. Gupta, K. Schwan, Brawny vs. Wimpy: Evaluation and Analysis of Modern Workloads on Heterogeneous Processors, Proc. of 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 74–83, 2013.
[15] U. Hoelzle, L. A. Barroso, The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 1st edition, 2009.
[16] L. Keys, S. Rivoire, J. D. Davis, The Search for Energy-efficient Building Blocks for the Data Center, Proc. of 2010 International Conference on Computer Architecture, pages 172–182, 2012.
[17] W. Lang, J. M. Patel, Energy Management for MapReduce Clusters, Proc. of VLDB Endowment, 3(1-2):129–139, 2010.
[18] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, B. Moon, Parallel Data Processing with MapReduce: A Survey, SIGMOD Record, 40(4):11–20, 2012.
[19] J. Leverich, C. Kozyrakis, On the Energy (in)Efficiency of Hadoop Clusters, SIGOPS Oper. Syst. Rev., 44(1):61–65, 2010.
[20] F. Li, B. C. Ooi, M. T. Özsu, S. Wu, Distributed Data Management Using MapReduce, ACM Computing Surveys, 46(3):31:1–31:42, 2014.
[21] H. Liu, A Measurement Study of Server Utilization in Public Clouds, Proc. of IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, pages 435–442, 2011.
[22] T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, T. Neumann, One DBMS for All: The Brawny Few and the Wimpy Crowd, Proc. of ACM SIGMOD International Conference on Management of Data, pages 697–700, 2014.
[23] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, A. Ramirez, The Low Power Architecture Approach Towards Exascale Computing, Journal of Computational Science, 4(6):439–443, 2013.
[24] D. Schall, T. Härder, Energy-proportional Query Execution Using a Cluster of Wimpy Nodes, Proc. of the Ninth International Workshop on Data Management on New Hardware, pages 1:1–1:6, 2013.
[25] A. L. Shimpi, The ARM vs x86 Wars Have Begun: In-Depth Power Analysis of Atom, Krait & Cortex A15, http://www.webcitation.org/6RIqMPQKg, 2013.
[26] D. Tsirogiannis, S. Harizopoulos, M. A. Shah, Analyzing the Energy Efficiency of a Database Server, Proc. of ACM SIGMOD International Conference on Management of Data, pages 231–242, 2010.
[27] B. M. Tudor, Y. M. Teo, On Understanding the Energy Consumption of ARM-based Multicore Servers, Proc. of SIGMETRICS, pages 267–278, 2013.
[28] S. J. Vaughan-Nichols, Applied Micro, Canonical claim the first ARM 64-bit server production software deployment, http://www.webcitation.org/6RLczwpch, 2014.
[29] R. P. Weicker, Dhrystone: A Synthetic Systems Programming Benchmark, Commun. of ACM, 27(10):1013–1030, 1984.
[30] Wikipedia, Electricity Pricing, http://www.webcitation.org/6R9bgVRLG, 2013.
[31] D. Wong, M. Annavaram, KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity, Proc. of 45th International Symposium on Microarchitecture, pages 119–130, 2012.
[32] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, I. Stoica, Shark: SQL and Rich Analytics at Scale, Proc. of ACM SIGMOD International Conference on Management of Data, pages 13–24, 2013.
[33] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 15–28, 2012.
[34] H. Zhang, G. Chen, W.-F. Wong, B. C. Ooi, S. Wu, Y. Xia, "Anti-Caching"-based Elastic Data Management for Big Data, Proc. of 31st International Conference on Data Engineering, 2015.