A Performance Study of Big Data on Small Nodes
Dumitrel Loghin, Bogdan Marius Tudor, Hao Zhang, Beng Chin Ooi, Yong Meng Teo
Department of Computer Science
National University of Singapore
{dumitrel,bogdan,zhangh,ooibc,teoym}@comp.nus.edu.sg
energy-efficient servers, where various software and hardware techniques are used to optimize bottleneck resources in servers [8, 26, 31, 24, 27]. As the dominating architecture in datacenters has traditionally been Intel/AMD x64, most of the studies are geared toward these types of servers. In the database community, studies on Big Data workloads almost always use Intel/AMD servers, with a few works using Intel Atom systems as low-power servers [8, 1, 31, 24]. The shift to ARM-based servers is a game-changer, as these systems have different CPU architectures, embedded-class storage systems, and typically lower power consumption than even the most energy-efficient Intel systems [25]. As the basic building blocks of servers are changing, this warrants a new look at the energy-efficiency of Big Data workloads.

In this paper, we conduct a measurement-driven analysis of the feasibility of executing data analytics workloads on ARM-based small server nodes, and make the following contributions:

• We present the first comparative study of Big Data performance on ARM big.LITTLE small nodes versus traditional Intel Xeon nodes. This analysis covers the performance, energy-efficiency and total cost of ownership (TCO) of data analytics. Our analysis shows that there is no one-size-fits-all rule regarding the efficiency of the two server systems: sometimes ARM systems are better, other times Intel systems are better. Surprisingly, deciding which workloads are more efficient on small nodes versus traditional systems is not primarily affected by the bottleneck hardware resource, as observed in high-performance computing and traditional server workloads [27]. Instead, software immaturity and the challenges imposed by limited RAM size and bandwidth are the main culprits that compromise the performance on current-generation ARM devices. Due to these limitations, Big Data workloads on ARM servers often cannot make full use of the native CPU or I/O performance. During this analysis, we identify a series of features that can improve the performance of ARM devices by a factor of five, with minor software modifications.

• Using Google's TCO model [15], our analysis shows that ARM servers could potentially lead to four times cheaper CPU-intensive data analytics. However, I/O-intensive jobs may incur higher cost on these small server nodes.

The rest of the paper is organized as follows. In Section 2 we provide a detailed performance comparison between single-node Xeon and ARM systems. In Section 3 we present measurement results of running MapReduce and query processing on clusters of Xeon and ARM nodes. We discuss our TCO analysis in Section 4 and present the related work in Section 5. Finally, we conclude in Section 6.

2. SYSTEMS CHARACTERIZATION

This section provides a detailed performance characterization of an ARM big.LITTLE system by comparison with a server-class Intel Xeon system. We employ a series of widely used micro-benchmarks to assess the static performance of the CPU, memory bandwidth, network and storage.

2.1 Setup

2.1.1 ARM Server Nodes
The ARM server node analyzed throughout this paper is the Odroid XU development board with the Samsung Exynos 5410 System on a Chip (SoC). This board is representative of high-end mobile phones. For example, the Samsung Exynos 5410 is used in the international version of the Samsung Galaxy S4 phones. Other high-end contemporary mobile devices employ SoCs with very similar performance characteristics, such as Qualcomm Snapdragon 80x and NVIDIA Tegra 4.

Specific to the Exynos 5410 SoC is that the CPU has two types of cores: ARM Cortex-A7 little cores, which consume a small amount of power and offer slow in-order execution, and ARM Cortex-A15 big cores, which support faster out-of-order execution but with a higher power consumption. This heterogeneous CPU architecture is termed ARM big.LITTLE. The CPU has a total of eight cores, split in two groups¹ of cores: one group of four ARM Cortex-A7 little cores, and one group of four ARM Cortex-A15 big cores. Each core has a pair of dedicated L1 data and instruction caches, and each group of cores has a unified L2 cache.

¹ In the computer architecture literature, this group of cores is termed a cluster of cores. However, due to potential confusion with a cluster of nodes encountered in distributed computing, we shall use the term group of cores.

Although the CPU has eight cores, the Exynos 5410 allows either the four big cores or the four little cores to be active at one moment. To save energy, when one group is active, the other one is powered down. Thus, a program cannot execute on both the big and the little cores at the same time. Instead, the operating system (OS) can alternate the execution between them. Switching between the two groups incurs a small performance price, as the L2 and L1 caches of the newly activated group must warm up.

The core clock frequency of the little cores ranges from 250 to 600 MHz, and that of the big cores ranges from 600 MHz to 1.60 GHz. Dynamic voltage and frequency scaling (DVFS) is employed to increase the core frequency in response to the increase in CPU utilization. On this ARM big.LITTLE architecture, the OS can be instructed to operate the cores in three configurations:

1. little: only use the ARM Cortex-A7 little cores, and their frequency is allowed to range from 250 to 600 MHz.

2. big: only use the ARM Cortex-A15 big cores, and their frequency is allowed to range from 600 to 1600 MHz.

3. big.LITTLE: the OS is allowed to switch between the two groups of cores. The switching frequency is 600 MHz.

Each Odroid XU node has 2 GB of low-power DDR3 memory, 64 GB of eMMC flash storage and a 100 Mbit Ethernet card. However, to improve the network performance, we connect a Gbit Ethernet adapter on the USB 3.0 interface.

2.1.2 Intel Server Nodes
We compare the Odroid XU nodes with server-class Intel x86-64 nodes. We use a Supermicro 813M 1U server system based on two Intel Xeon E5-2603 CPUs with four cores each. This system has 8 GB DDR3 memory, a 1 TB hard disk and a 1 Gbit Ethernet network card.
Table 1: Systems characterization

                                       Xeon              ARM Cortex-A7 (LITTLE)   ARM Cortex-A15 (big)
Specs
  ISA                                  x86-64            ARMv7l                   ARMv7l
  Cores                                4                 4                        4
  Frequency                            1.20 - 1.80 GHz   250 - 600 MHz            0.60 - 1.60 GHz
  L1 Data Cache                        128 KB            32 KB                    32 KB
  L2 Cache                             1 MB              2 MB                     2 MB
  L3 Cache                             10 MB             N/A                      N/A
CPU (one core, max frequency)
  Dhrystone [MIPS/MHz]                 5.8               3.7                      3.1
    CPU power [W]                      15.0              0.5                      3.4
    System power [W]                   50.0              4.4                      7.3
  CoreMark [iterations/MHz]            5.3               5.0                      3.52
    CPU power [W]                      15.6              0.3                      2.5
    System power [W]                   50.6              4.2                      6.4
  Java [MIPS/MHz]                      0.36              0.40                     0.38
    CPU power [W]                      16.5              0.3                      3.4
    System power [W]                   51.5              3.0                      6.1
Storage
  Write throughput [MB/s]              165.0             32.6                     39.2
  Read throughput [MB/s]               173.0             118.0                    121.0
  Buffered read throughput [GB/s]      4.6               0.8                      1.2
  Write latency [ms]                   9.8               14.2                     14.6
  Read latency [ms]                    2.7               0.9                      0.8
Network
  TCP bandwidth [Mbits/s]              942               199                      308
  UDP bandwidth [Mbits/s]              811               295                      420
  Ping latency [ms]                    0.2               0.7                      0.7
For a fair comparison with the 4-core ARM nodes, we remove one of the Xeon CPUs from each node. The idle power of the 4-core Xeon node is 35 W, and its peak power is around 55 W. Hence, this node has a lower power profile compared to traditional server nodes.

2.1.3 Software Setup
The ARM-based Odroid XU board runs the Ubuntu 13.04 operating system with Linux kernel 3.4.67, which is the latest kernel version working on this platform. For compiling native C/C++ programs, we use gcc 4.7.3 arm-linux-gnueabihf. The Xeon server runs Ubuntu 13.04 with Linux kernel 3.8.0 for the x64 architecture. The C/C++ compiler available on this system is gcc 4.7.3. We install on both systems Oracle's Java Virtual Machine (JVM) version 1.7.0_45.

2.2 Benchmark Results

Big Data applications stress all system components, such as CPU cores, memory, storage and network I/O. Hence, we first evaluate the individual peak performance of these components, before running complex data-intensive workloads. For this evaluation, we employ benchmarks that are widely used in industry and systems research. For example, we measure how many Million Instructions per Second (MIPS) a core can deliver using the traditional Dhrystone [29, 5] and the emerging CoreMark [4] benchmarks. For storage and network throughput and latency, we use Linux tools such as dd, ioping, iperf and ping. Because Odroid XU is a heterogeneous system, we individually benchmark both the little and the big core configurations. Table 1 summarizes the system characteristics in terms of CPU, storage and network I/O, and Figure 1 compares the memory bandwidth of Xeon and all three Odroid XU configurations.

We measure CPU MIPS native performance by initially running the traditional Dhrystone benchmark [29]. We compile the code with gcc using the maximum level of optimization, -O3, and tuning the code for the target processor (e.g. -mcpu=cortex-a7 -mtune=cortex-a7 for little cores). In terms of Dhrystone MIPS per MHz, we obtain a surprising result: little cores perform 21% better than big cores per MHz. This is unexpected because ARM reports that Cortex-A7 has lower Dhrystone MIPS per MHz than Cortex-A15, but they use the internal armcc compiler [5]. We conclude that it is the way gcc generates machine code that leads to these results. To check our results, we run the newer CoreMark CPU benchmark, which is increasingly used by embedded market players, including ARM [4]. We use compiler optimization flags that match those employed in the reported performance results for an ARM Cortex-A15. More precisely, we activate NEON SIMD (-mfpu=neon), hardware floating point operations (-mfloat-abi=hard) and aggressive loop optimizations (-faggressive-loop-optimizations). We obtain a score of 3.52 per core per MHz, as opposed to the reported 4.68. We attribute this difference to the different compiler and system setup. However, little cores are again more energy efficient, obtaining more than half the score of big cores with only 0.3 W of power. The difference between ARM cores and Xeon cores is similar for both the Dhrystone and CoreMark benchmarks: Xeon cores obtain almost two times higher scores per MHz than ARM cores.

Since Big Data frameworks, such as Hadoop and Spark, run on top of the Java Virtual Machine, we also benchmark Java execution. We develop a synthetic benchmark performing integer and floating point operations such that it stresses the core's pipeline. As expected, the little Cortex-A7 cores obtain less than half the MIPS of the Cortex-A15 cores. On the other hand, the big Cortex-A15 cores achieve just 7% fewer MIPS than Xeon cores, but using a quarter of the power. Thus, in terms of core performance-per-power, little cores are the best with around 800 MIPS/W, big cores come second with 180 MIPS/W and Xeon cores are the worst with 40 MIPS/W.
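These performance-per-power figures can be checked against Table 1 by scaling each core type's per-MHz Java score by its maximum frequency and dividing by the CPU power measured with one core active. This is an approximate back-of-the-envelope check, not a separate measurement:

    little:  0.40 MIPS/MHz x 600 MHz  / 0.3 W  ~ 800 MIPS/W
    big:     0.38 MIPS/MHz x 1600 MHz / 3.4 W  ~ 180 MIPS/W
    Xeon:    0.36 MIPS/MHz x 1800 MHz / 16.5 W ~ 39 MIPS/W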
[Figure 1: Memory bandwidth comparison. Bandwidth (GB/s, log scale) versus memory access size from 1 kB to 1 GB, for Xeon E5-2603, ARM Cortex-A7, ARM Cortex-A15 and ARM big.LITTLE.]

Table 2: Workloads

Workload            Input Type           Input Size
TestDFSIO           synthetic            12 GB
Terasort            synthetic            12 GB
Pi                  -                    16 Gsamples
Kmeans              Netflix              4 GB
Wordcount           Wikipedia            12 GB
Grep                Wikipedia            12 GB
TPC-C Benchmark     TPC-C Dataset        12 GB
TPC-H Benchmark     TPC-H Dataset        2 GB
Shark Scan Query    Ranking              21 GB
Shark Join Query    Ranking/UserVisit    43 MB / 1.3 GB and 86 MB / 2.5 GB

[Figure: experimental setup. The cluster under test (Xeon or Odroid XU nodes) is connected over 1 Gbps Ethernet to a controller system with a serial interface.]
[Figure: throughput (MB/s) and energy (kJ) on 1, 2 and 6 nodes for the big, little, big.LITTLE and Xeon configurations.]

[Figure 4: Execution time (s) of the Pi Java and Pi C++ implementations on the big, little, big.LITTLE and Xeon configurations.]

[Figure 5: Hadoop execution time (log scale) on 1, 2, 4 and 6 nodes for Kmeans, Terasort and Wordcount on Xeon and ARM big, LITTLE and big.LITTLE.]

[Figure 6: Xeon-ARM performance equivalence — time and energy of 2, 6 and 8 ARM big.LITTLE nodes versus one Xeon node for Pi Java, Pi C++, Grep, Kmeans, Terasort and Wordcount.]

[Figure 7: Time, power and energy (log scale) of 6-node clusters for Pi Java, Pi C++, Grep, Kmeans, Terasort and Wordcount on Xeon and ARM big, LITTLE and big.LITTLE.]

form using more cores but having limited memory buffer. On the other hand, Kmeans benefits more from running on four slots, being 20% faster on big cores and 35% faster on little cores, compared to running on two slots. Kmeans is a CPU-intensive workload executing a large number of floating point operations in both map and reduce phases. Thus, it benefits from running on higher core counts. In the remainder of this paper, we present the results on two slots for Terasort, and on four slots with io.sort.mb of 50 for Kmeans, when running on ARM big.LITTLE nodes.

When running the experiments, we observe low performance of Pi on Odroid XU. Compared to Xeon, Pi on big and big.LITTLE runs 7-9 times slower, and on little cores up to 20 times slower. This is surprising because Pi is CPU-intensive and we show in Section 2 that the performance ratio between Xeon and ARM cores is at most five. We further investigate the cause of this result. Firstly, we profile TaskTracker execution on Odroid XU. We observe that the JVM spends 25% of the time in __udivsi3. This function emulates 32-bit unsigned integer division in software, although the Exynos 5410 SoC on the Odroid XU board supports the UDIV hardware instruction. But other SoCs may not implement this instruction, since it is defined as optional in the ARMv7-A ISA [6]. Thus, the JVM uses the safer approach of emulating it in software. Secondly, we port Pi to C++ and run it using the Hadoop Pipes mechanism (an illustrative sketch of such a Pipes job is given at the end of this subsection). We use the same gcc compilation flags as for the native benchmarks in Section 2. The comparison between the Java and C++ implementations is shown in Figure 4. Compared to the original Java version, the C++ implementation is around five times faster on ARM nodes and only 1.2 times faster on Xeon-based nodes. With this minor software porting, we obtain a significant improvement in execution time which leads to energy savings, as we further show. In the remainder of this section, we show the results for both the Pi Java and Pi C++ implementations.

We present the time and energy performance of the six workloads on Xeon and ARM clusters. First, since scalability is a main feature of the MapReduce framework, we investigate how Hadoop scales on clusters of small nodes. We show time scaling in log scale on four cluster sizes in Figure 5. All workloads exhibit sublinear scaling on both Intel and ARM nodes, which we attribute to housekeeping overheads of Hadoop when running on more nodes. When the overheads dominate the useful work, the scaling degrades. For the Pi workload running on six nodes there is too little useful work for the mappers to perform, hence there is not much improvement in the execution time on both types of servers. On the other hand, Kmeans and Grep exhibit higher speedup on the 6-node ARM cluster compared to Xeon because the slower ARM cores have enough CPU-intensive work to perform.

Secondly, Figure 6 shows how many ARM-based nodes can achieve the execution time of one Xeon node. We select the ARM big.LITTLE configuration, which exhibits the closest execution time compared to one Xeon. For Wordcount, the difference between six ARM nodes and one Xeon node is large, and thus we estimate, based on the scaling behavior, that eight ARM nodes exhibit a closer execution time.

Thirdly, Figure 7 shows the time, power and energy of the 6-node clusters using log scale. Based on the energy usage, the workloads can be categorized into three classes:

• Pi Java and Kmeans execution times are much larger on ARM compared to Xeon. Both workloads incur high CPU usage on ARM, which results in high power usage. The combined effect is a slightly higher energy usage on ARM nodes.

• Pi C++ and Grep exhibit a much smaller execution
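The Pi port mentioned above relies on the Hadoop Pipes C++ API. The following is a minimal sketch of how such a Monte-Carlo Pi job can be written against that API; it is illustrative only and not the exact code used in this study. The class names, the convention that each input value carries a sample count, and the random-number source are assumptions made for the example, and the header paths may differ across Hadoop versions.

    #include <cstdlib>
    #include <ctime>
    #include <sstream>
    #include <string>
    #include <unistd.h>

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"

    // Helper for this sketch: format a 64-bit counter as text for emit().
    static std::string toText(long long v) {
      std::ostringstream out;
      out << v;
      return out.str();
    }

    class PiMapper : public HadoopPipes::Mapper {
    public:
      explicit PiMapper(HadoopPipes::TaskContext&) {
        // Seed the per-task random number generator.
        srand48(time(NULL) ^ getpid());
      }
      // Assumption: the input value holds the number of samples this map task
      // should draw; the mapper counts how many fall inside the unit circle.
      void map(HadoopPipes::MapContext& context) {
        long long samples = strtoll(context.getInputValue().c_str(), NULL, 10);
        long long inside = 0;
        for (long long i = 0; i < samples; i++) {
          double x = drand48();
          double y = drand48();
          if (x * x + y * y <= 1.0) inside++;
        }
        context.emit("inside", toText(inside));
        context.emit("total", toText(samples));
      }
    };

    class SumReducer : public HadoopPipes::Reducer {
    public:
      explicit SumReducer(HadoopPipes::TaskContext&) {}
      // Sums the per-mapper counts; pi is then estimated as 4 * inside / total.
      void reduce(HadoopPipes::ReduceContext& context) {
        long long sum = 0;
        while (context.nextValue())
          sum += strtoll(context.getInputValue().c_str(), NULL, 10);
        context.emit(context.getInputKey(), toText(sum));
      }
    };

    int main() {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<PiMapper, SumReducer>());
    }

Like the native benchmarks, such a binary would be cross-compiled with the gcc flags listed in Section 2 and submitted through the Pipes job runner; the point of the port is simply that the integer and floating-point arithmetic runs natively instead of on top of the JVM.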
Table 3: MapReduce Performance-to-power Ratio

                          Xeon (1 / 2 / 4 / 6 nodes)   ARM big (1 / 2 / 4 / 6)      ARM LITTLE (1 / 2 / 4 / 6)   ARM big.LITTLE (1 / 2 / 4 / 6)
Workload    Unit
Pi Java     Msamples/J    1.44  1.58  0.88  0.63       0.68  0.60  0.60  0.56       0.78  0.83  0.80  0.58       0.67  0.60  0.61  0.57
Pi C++      Msamples/J    2.51  1.89  1.04  0.71       3.23  3.03  2.95  2.64       4.56  4.37  4.01  2.78       3.33  2.95  2.78  2.56
Grep        MB/J          0.56  0.46  0.27  0.21       1.03  0.93  0.92  0.92       1.47  1.34  1.31  1.27       1.03  0.93  0.86  0.92
Kmeans      MB/J          0.50  0.41  0.25  0.22       0.21  0.19  0.19  0.20       0.28  0.25  0.23  0.23       0.21  0.19  0.18  0.20
Terasort    MB/J          0.28  0.22  0.15  0.14       0.31  0.25  0.30  0.27       0.35  0.28  0.35  0.30       0.32  0.25  0.30  0.27
Wordcount   MB/J          0.17  0.14  0.09  0.08       0.12  0.11  0.10  0.09       0.18  0.16  0.12  0.10       0.12  0.11  0.10  0.10
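If the MB/J entries are read as input data processed per Joule of cluster energy (an interpretation, using the input sizes from Table 2), the table also implies absolute energy figures. For example, under that reading, Grep over the 12 GB Wikipedia input corresponds to roughly 12288 MB / 0.56 MB/J ~ 22 kJ on one Xeon node, versus 12288 MB / 1.27 MB/J ~ 9.7 kJ on six ARM LITTLE nodes.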
[Figure: TPC-H query execution time (s) and energy (kJ), log scale, for queries Q1-Q22 on Xeon and ARM big, LITTLE and big.LITTLE.]
[Figure 10: Shark scan query performance — time (s) and energy (kJ) on 1, 2, 4 and 6 nodes for Xeon and ARM big, LITTLE and big.LITTLE.]

[Figure 11: Shark join query performance — time (s) and energy (kJ) on 1, 2, 4 and 6 nodes for Xeon and ARM big, LITTLE and big.LITTLE.]
• For random access queries with a small working set (e.g. Q3, Q5, Q7, Q8, Q11, Q16, Q19, Q22), ARM has 1.2-6.9 times better performance, which we mainly attribute to the lower read latency of the Odroid XU flash-based storage.

Summary. In terms of energy, the Xeon node consumes 1.4 to 44 times more than an ARM-based node. Overall, the energy-efficiency of ARM executing queries is higher compared to traditional Xeon.

3.4.2 Shark
We investigate the performance of in-memory Big Data analytics using the Shark framework [32]. Shark is an open-source distributed SQL query engine built on top of Spark [33], a distributed in-memory processing framework. With the increasing velocity of Big Data analytics, Shark and Spark are increasingly used because of their low latency and fault tolerance. In this experiment, we evaluate the performance of scan and join queries. We list the scan query:

SELECT pageURL, pageRank
FROM rankings WHERE pageRank > X

and the join query:

SELECT sourceIP, totalRevenue, avgPageRank FROM
  (SELECT sourceIP,
          AVG(pageRank) as avgPageRank,
          SUM(adRevenue) as totalRevenue
   FROM Rankings AS R, UserVisits AS UV
   WHERE R.pageURL = UV.destURL
     AND UV.visitDate BETWEEN Date('1980-01-01')
       AND Date('X')
   GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 1

from the Big Data Benchmark provided by AMPLab [3]. We set the cache size of Shark to 896 MB for both types of nodes to leave enough memory for other processes, such as the Spark master, worker, and HDFS daemons, on Odroid XU. To show the potential of an unrestricted system, we conduct an experiment with the Spark cache set to 5 GB on the Xeon node. We observe that for the scan query, the cache size does not affect the performance, hence we present the results with identical cache size configurations. For the join query, we use two datasets as shown in Table 2. We choose these two sets to investigate three scenarios: (i) both servers have enough memory, (ii) Xeon has enough memory while ARM does not, and (iii) both servers do not have enough memory. These scenarios cover memory management issues in modern database systems [34].
The join query is both memory and I/O bound, as the smaller table is usually used to build an in-memory structure and the other table is scanned once from storage. The in-memory structure is either a hash table for a hash join implementation, or a table kept in memory for a nested join implementation. Moreover, Spark swaps data between memory and disk, thus it benefits from larger cache sizes. The results for the scan query are shown in Figure 10. For the join query, we only plot the third scenario in Figure 11 due to space limitation. Based on these results, we formulate the following comments.

• For the scan query, ARM big and big.LITTLE are just 1.1-1.7 times slower than Xeon, but more than three times better in energy usage. ARM little cores are twice as slow, but more than four times better in energy usage. Therefore, in terms of PPR, ARM is much better for this kind of query.

Table 5: TCO notations and values

Notation    Value        Description
Cs,Xeon     $1100 +      cost of Xeon-based server node
Cs,ARM      $280 +       cost of ARM-based server node
Ts          3 years ∗    server lifetime
Ul          10% ∗        low server utilization
Uh          75% ∗        high server utilization
Cd          #            datacenter total costs
Td          12 years ∗   datacenter lifetime
Cp          #            electricity total costs
Cph         ∗            electricity cost per hour
Pa          +            average server power
Pp,Xeon     55 W +       Xeon-based node peak power
Pp,ARM      16 W +       ARM-based node peak power
Pi,Xeon     35 W +       Xeon-based node idle power
Pi,ARM      4 W +        ARM-based node idle power
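As a rough sketch of how the notations in Table 5 combine into an hourly marginal cost (our reading, for intuition only; the exact formulas of the Google model [15] differ in detail), the average power of a server running at utilization U can be interpolated between idle and peak power, and the hourly cost then adds the amortized equipment cost and the electricity cost:

    Pa ~ Pi + U (Pp - Pi)
    cost per hour ~ Cs / Ts + Pa x Cph

For example, under this reading a Xeon node at Uh = 75% would draw about 35 + 0.75 x (55 - 35) = 50 W on average, while an ARM node at the same utilization would draw about 4 + 0.75 x (16 - 4) = 13 W.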
Table 6: Effect of server utilization on marginal cost

Job type   Utilization   Server   Min cost [$/h]     Max cost [$/h]
           ratio [%]     ratio    Xeon     ARM       Xeon     ARM
CPU-int.   10:20         1:1      0.043    0.011     0.044    0.013
I/O-int.   10:49         1:1      0.043    0.011     0.056    0.013
CPU-int.   75:82         1:2      0.043    0.022     0.060    0.031
I/O-int.   75:86         1:6      0.043    0.065     0.059    0.079

Table 7: Effect of server utilization on TCO

Job type   Utilization   Server   Min cost [$/h]     Max cost [$/h]
           ratio [%]     ratio    Xeon     ARM       Xeon     ARM
CPU-int.   10:20         1:1      0.066    0.018     0.086    0.025
I/O-int.   10:49         1:1      0.066    0.017     0.085    0.021
CPU-int.   75:82         1:2      0.066    0.035     0.086    0.051
I/O-int.   75:86         1:6      0.066    0.104     0.085    0.127
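A quick check of the headline numbers in Table 6: for low-utilization CPU-intensive jobs the Xeon-to-ARM ratio of minimum hourly costs is 0.043 / 0.011 ~ 3.9, i.e. ARM is roughly four times cheaper; for high-utilization I/O-intensive jobs, where six ARM nodes replace one Xeon, ARM is (0.065 - 0.043) / 0.043 ~ 51%, i.e. about 50%, more expensive.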
[Figure 12: Effect of idle power on marginal cost — hourly cost ($/h) with and without idle power for Xeon and ARM, CPU- and I/O-intensive jobs, minimum and maximum, at low utilization.]

[Figure 13: Costs per month — monthly cost ($) broken down into datacenter, server and power costs for Xeon and ARM, CPU- and I/O-intensive jobs, at high utilization.]
as cost per hour in Table 6. For low utilization, the cost per hour of ARM is almost four times lower compared to Xeon. Moreover, CPU- and I/O-intensive jobs have the same cost. On the other hand, the cost of highly utilized servers is slightly higher. Surprisingly, for I/O-intensive jobs, ARM incurs up to 50% higher cost because six ARM servers are required to perform the work of one Xeon.

Next, we investigate the influence of idle power, as a key factor in total electricity costs. This influence may be alleviated by employing energy-saving strategies, such as the All-In Strategy [17]. This strategy assumes that servers can be put into a low-power state during inactive periods. At certain intervals, they are woken up to execute the jobs, and afterwards put back to sleep. In reality, servers consume a small amount of power in deep-sleep or power-off mode and may incur high power usage during the wake-up phase. However, we assume that during inactive periods servers draw no power, and perform the study on both utilization scenarios described above. With these assumptions, the influence of idle power is more visible on low-utilized Xeon servers, as shown in Figure 12. In this case, putting Xeon servers to sleep can reduce the hourly cost by 22%. For ARM servers, the cost reduction is 6-10% since the idle power is much lower. At high utilization, the reductions are smaller because the servers are active most of the time.

4.2 TCO

We analyze a more complex TCO model which includes datacenter costs. We use the Google TCO calculator which implements the model described in [15]. For this model, the total cost is defined in Equation 3.

C = Cd + Cs + Cp    (3)

We conduct our study based on the following assumptions regarding all three components of the TCO model. Firstly, datacenter costs include capital and operational expenses. Capital expenses represent the cost of designing and building a datacenter. This cost depends on the datacenter power capacity, and it is expressed as a price per Watt. We use a default value of 15 $/W as in [15]. Operational expenses represent the cost of maintenance and security, and depend on the datacenter size which, in turn, is proportional to its power capacity. We use a default value of 0.04 $/kW-month [15]. Secondly, for server costs, besides the equipment itself, there are operational expenses related to maintenance. These expenses are expressed as an overhead per Watt per year. We use the default value of 5% for both types of servers. Moreover, for building a real datacenter, the business may take a loan. The model includes the interest rate for such a loan. We use a value of 8% per year, although for building a datacenter with emerging ARM systems this rate may be higher due to the potential risk associated with this emerging server platform. Thirdly, electricity expenses are modeled based on the average power consumption. In addition, the overhead costs, such as those for cooling, are expressed based on the Power Usage Effectiveness (PUE) of the servers. For the employed Xeon servers, we use the lowest PUE value of 1.1, representing the most energy-efficient Google servers [15]. For ARM servers, we use a higher PUE of 1.5 to incorporate the less energy-efficient power supply and the power drawn by the fan, which is up to 1.5 W and represents ~10% of the 16 W peak power.

In Figure 13 we present the TCO values for the high utilization scenario. We show these values as a break-down of the monthly cost into datacenter, server equipment and power costs, as in Equation 3. The cost is dominated by equipment expenses. For I/O-intensive workloads, the equipment and power expenses of the six ARM nodes make low-power servers more expensive than traditional Xeon. We summarize the TCO values for both utilization scenarios in Table 7.
5. RELATED WORK

Related work analyzing the energy efficiency of Big Data execution focuses mostly on the traditional x86/x64 architecture, with some projects considering heterogeneous clusters of low-power Intel Atom and high-performance Intel Xeon processors [8, 1, 31, 24]. More generally, the related work can be classified in two categories: energy-proportionality studies [19, 17, 26, 10] and building blocks for energy-efficient servers [8, 16, 31, 24, 27], as we further present.

5.1 Energy Proportionality

The survey in [18] highlights two techniques for saving energy in Hadoop MapReduce deployments: Covering Set (CS) [19] and All-In Strategy (AIS) [17]. Both techniques propose shutting down or hibernating the systems when the cluster is underutilized. CS proposes to shut down all the nodes but a small set (the Covering Set) which keeps at least one replica of each HDFS block. On the other hand, AIS shows it is more energy-efficient to use the entire cluster and finish the MapReduce jobs faster, and then shut down all nodes. Berkeley Energy Efficient MapReduce (BEEMR) [10] proposes to split the MapReduce cluster into interactive and batch zones. The nodes in the batch zone are kept in a low-power state when inactive. This technique is appropriate for MapReduce with Interactive Analysis (MIA) workloads. For this kind of workloads, interactive MapReduce jobs tend to access only a fragment of the whole data. Hence, an interactive cluster zone is obtained by identifying these interactive jobs and their required input data. The rest of the jobs are executed on the batch zone at certain time intervals. Using both simulation and validation on Amazon EC2, BEEMR reports energy savings of up to 50%. Feller et al. study the time performance and power consumption of Hadoop on clusters with collocated and separated data and compute nodes [13]. Two unsurprising results are highlighted, namely, that (i) the PPR of collocated nodes is better compared to a separated data and compute deployment, and (ii) power varies across job phases. Tarazu [1] optimizes Hadoop on a heterogeneous cluster with nodes based on Intel Xeon and Atom processors. It proposes three optimizations for reducing the imbalance among low-power and high-performance nodes, which lead to a speedup of 1.7. However, no energy usage study is conducted. Tsirogiannis et al. propose a study on the performance and power of database operators on different system configurations [26]. One of the conclusions is that, almost always, the best-performing configuration is also the most energy-efficient. However, our study shows that this may not be the case, especially if the performance gain cannot offset the high power usage.

5.2 Energy-efficient Servers

With the evolution of low-power processors and flash storage, many research projects combine them to obtain fast, energy-efficient data processing systems [8, 24]. For example, Gordon [8] uses systems with Intel Atom processors and flash-based storage to obtain 2.5 times more performance-per-power than disk-based solutions. For energy evaluation, they use a power model, whereas we directly measure the power consumption. The study in [16] investigates the energy efficiency of a series of embedded, notebook, desktop and server x86-64 systems. This work shows that high-end notebooks with Intel Core processors are 300% and 80% more energy-efficient than low-power server systems and low-power embedded systems, respectively. Moreover, embedded systems based on Intel Atom processors suffer from a poor I/O subsystem. This is in concordance with our findings on recent high-end ARM-based systems. KnightShift [31] is a heterogeneous server architecture which couples a wimpy Atom-based node with a brawny Xeon-based node to achieve energy proportionality for datacenter workloads. This architecture can achieve up to 75% energy savings, which also leads to cost savings. WattDB [24] is an energy-efficient query processing cluster. It uses nodes with Intel Atom and SSDs, and dynamically powers them on or off depending on load. Running TPC-H queries, the authors show that dynamic configurations achieve the performance of static configurations while saving energy.

The impressive evolution of ARM-based systems leads to their possible adoption as servers [2, 28]. In this context, Tudor and Teo investigate the energy-efficiency of an ARM Cortex-A9 based server executing both compute- and data-intensive server workloads [27]. The study shows that the ARM system is unsuitable for network I/O- and memory-intensive jobs. This is correlated with our evaluation showing that even for newer, ARM big.LITTLE-based servers, small memory size and low memory and I/O bandwidths lead to inefficient data-intensive processing. Mühlbauer et al. show the performance of ARM big.LITTLE systems executing OLTP and OLAP workloads, in comparison with Xeon servers [22]. They show a wider performance gap between small and big nodes executing TPC-C and TPC-H benchmarks. However, they run these benchmarks on a custom, highly optimized, in-memory database system, while we use disk-based MySQL.

In summary, related work lacks a study of Big Data execution on the fast-evolving high-end ARM systems. Our work addresses this by investigating how far these types of systems are from efficient data analytics processing.

6. CONCLUSIONS

In this paper, we present a performance study of executing Big Data analytics on emerging low-power nodes in comparison with traditional server nodes. We build clusters of Odroid XU boards representing the high-end ARM big.LITTLE architecture, and Intel Xeon systems as representative of traditional server nodes. We evaluate the time, energy and cost performance of the well-known Hadoop MapReduce framework and MySQL database system, and of emerging in-memory query processing using Shark. We run workloads exercising CPU cores, memory and I/O in different proportions. The results show that there is no one-size-fits-all rule for the efficiency of the two types of server nodes. However, small memory size, low memory and I/O bandwidth, and software immaturity concur in canceling the lower-power advantage of ARM nodes. For the CPU-intensive MapReduce Pi estimator implemented in Java, a software-emulated instruction results in ten times slower execution time on ARM. Implementing this workload in C++ improves the execution time by a factor of five, leading to almost four times cheaper data analytics on ARM servers compared to Xeon. For I/O-intensive workloads, such as Terasort, six ARM nodes are required to perform the work of one 75%-utilized Xeon. This substitution leads to 50% higher TCO of ARM servers. Lastly, for query processing, ARM servers are much more energy efficient, at the cost of slightly lower throughput. Moreover, small, random database accesses are even faster on ARM due to lower I/O latency. On the other hand, sequential database scans benefit more from the bigger memory size of Xeon servers, which acts as a cache. In the future, with the development of 64-bit ARM server systems having bigger memory and faster I/O, and with software improvements, ARM-based servers are well positioned to become a serious contender to traditional Intel/AMD server systems.

7. ACKNOWLEDGMENTS

This work was in part supported by the National Research Foundation, Prime Minister's Office, Singapore, under its Competitive Research Programme (CRP Award No. NRF-CRP8-2011-08). We thank the anonymous reviewers for their insightful comments and suggestions, which helped us improve this paper.

8. REFERENCES

[1] F. Ahmad, S. T. Chakradhar, A. Raghunathan, T. N. Vijaykumar, Tarazu: Optimizing MapReduce on Heterogeneous Clusters, Proc. of 17th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 61-74, 2012.
[2] AMD, AMD to Accelerate the ARM Server Ecosystem with the First ARM-based CPU and Development Platform from a Server Processor Vendor, http://www.webcitation.org/6PgFAdEFp, 2014.
[3] AMPLab, Big Data Benchmark, https://amplab.cs.berkeley.edu/benchmark, 2014.
[4] ARM, ARM Announces Support For EEMBC CoreMark Benchmark, http://www.webcitation.org/6RPwNECop, 2009.
[5] ARM, Dhrystone and MIPs Performance of ARM Processors, http://www.webcitation.org/6RPwC2TUb, 2010.
[6] ARM, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition, ARM, 2012.
[7] T. Bingmann, Parallel Memory Bandwidth Benchmark / Measurement, http://panthema.net/2013/pmbw/, 2013.
[8] A. M. Caulfield, L. M. Grupp, S. Swanson, Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications, Proc. of 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 217-228, 2009.
[9] S. Chen, A. Ailamaki, M. Athanassoulis, P. B. Gibbons, R. Johnson, I. Pandis, R. Stoica, TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study, SIGMOD Record, 39(3):5-10, 2011.
[10] Y. Chen, S. Alspaugh, D. Borthakur, R. Katz, Energy Efficiency for Large-scale MapReduce Workloads with Significant Interactive Analysis, Proc. of 7th ACM European Conference on Computer Systems, pages 43-56, 2012.
[11] T. P. P. Council, TPC-C benchmark specification, http://www.tpc.org/tpcc, 2010.
[12] T. P. P. Council, TPC-H benchmark specification, http://www.tpc.org/tpch, 2013.
[13] E. Feller, L. Ramakrishnan, C. Morin, On the Performance and Energy Efficiency of Hadoop Deployment Models, Proc. of 2013 IEEE International Conference on Big Data, pages 131-136, 2013.
[14] V. Gupta, K. Schwan, Brawny vs. Wimpy: Evaluation and Analysis of Modern Workloads on Heterogeneous Processors, Proc. of 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 74-83, 2013.
[15] U. Hoelzle, L. A. Barroso, The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 1st edition, 2009.
[16] L. Keys, S. Rivoire, J. D. Davis, The Search for Energy-efficient Building Blocks for the Data Center, Proc. of 2010 International Conference on Computer Architecture, pages 172-182, 2012.
[17] W. Lang, J. M. Patel, Energy Management for MapReduce Clusters, Proc. of VLDB Endowment, 3(1-2):129-139, 2010.
[18] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, B. Moon, Parallel Data Processing with MapReduce: A Survey, SIGMOD Record, 40(4):11-20, 2012.
[19] J. Leverich, C. Kozyrakis, On the Energy (in)Efficiency of Hadoop Clusters, SIGOPS Oper. Syst. Rev., 44(1):61-65, 2010.
[20] F. Li, B. C. Ooi, M. T. Özsu, S. Wu, Distributed Data Management Using MapReduce, ACM Computing Surveys, 46(3):31:1-31:42, 2014.
[21] H. Liu, A Measurement Study of Server Utilization in Public Clouds, Proc. of IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, pages 435-442, 2011.
[22] T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, T. Neumann, One DBMS for All: The Brawny Few and the Wimpy Crowd, Proc. of ACM SIGMOD International Conference on Management of Data, pages 697-700, 2014.
[23] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, A. Ramirez, The Low Power Architecture Approach Towards Exascale Computing, Journal of Computational Science, 4(6):439-443, 2013.
[24] D. Schall, T. Härder, Energy-proportional Query Execution Using a Cluster of Wimpy Nodes, Proc. of the Ninth International Workshop on Data Management on New Hardware, pages 1:1-1:6, 2013.
[25] A. L. Shimpi, The ARM vs x86 Wars Have Begun: In-Depth Power Analysis of Atom, Krait & Cortex A15, http://www.webcitation.org/6RIqMPQKg, 2013.
[26] D. Tsirogiannis, S. Harizopoulos, M. A. Shah, Analyzing the Energy Efficiency of a Database Server, Proc. of ACM SIGMOD International Conference on Management of Data, pages 231-242, 2010.
[27] B. M. Tudor, Y. M. Teo, On Understanding the Energy Consumption of ARM-based Multicore Servers, Proc. of SIGMETRICS, pages 267-278, 2013.
[28] S. J. Vaughan-Nichols, Applied Micro, Canonical claim the first ARM 64-bit server production software deployment, http://www.webcitation.org/6RLczwpch, 2014.
[29] R. P. Weicker, Dhrystone: A Synthetic Systems Programming Benchmark, Commun. of ACM, 27(10):1013-1030, 1984.
[30] Wikipedia, Electricity Pricing, http://www.webcitation.org/6R9bgVRLG, 2013.
[31] D. Wong, M. Annavaram, KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity, Proc. of 45th International Symposium on Microarchitecture, pages 119-130, 2012.
[32] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, I. Stoica, Shark: SQL and Rich Analytics at Scale, Proc. of ACM SIGMOD International Conference on Management of Data, pages 13-24, 2013.
[33] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 15-28, 2012.
[34] H. Zhang, G. Chen, W.-F. Wong, B. C. Ooi, S. Wu, Y. Xia, "Anti-Caching"-based Elastic Data Management for Big Data, Proc. of 31st International Conference on Data Engineering, 2015.