Rio Yokota · Weigang Wu (Eds.)
Supercomputing Frontiers
LNCS 10776
Supercomputing Frontiers
4th Asian Conference, SCFA 2018
Singapore, March 26–29, 2018
Proceedings
Lecture Notes in Computer Science 10776
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zurich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7407
Rio Yokota · Weigang Wu (Eds.)
Supercomputing Frontiers
4th Asian Conference, SCFA 2018
Singapore, March 26–29, 2018
Proceedings
Editors
Rio Yokota
Tokyo Institute of Technology
Tokyo, Japan

Weigang Wu
Sun Yat-sen University
Guangzhou, China
© The Editor(s) (if applicable) and The Author(s) 2018, corrected publication 2018. This book is an open
access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
We would like to express our gratitude to all our colleagues for submitting papers to
the SCA18 scientific sessions, as well as to the members of the Program Committee for
organizing this year’s attractive program.
– Big Data
– GPU/FPGA
– Performance Tools
– Linear Algebra
Pin Chen1, Xin Yan1, Jiahui Li1, Yunfei Du1,2(&), and Jun Xu1(&)
1
National Supercomputer Center in Guangzhou and Research Center
for Drug Discovery, School of Data and Computer Science and School
of Pharmaceutical Sciences, Sun Yat-Sen University,
132 East Circle at University City, Guangzhou 510006, China
[email protected], [email protected]
2
School of Computer Science, National University of Defense Technology,
Changsha 410073, China
1 Introduction
from tens of thousands to millions, produce a lots-of-small-files workload for a virtual screening campaign. With the development of high-performance computers, virtual drug screening is accelerating. However, HTVS still faces challenges when a large-scale virtual screening application is executed on High-Performance Computing (HPC) resources, such as distributing massive numbers of tasks, analyzing lots of small molecular structure files, and implementing fault tolerance.
Tools have been developed to accelerate HTVS on HPC resources. Falkon [5] is a lightweight execution framework that enables loosely coupled programs to run on petascale systems. The benchmark in [6] shows that DOCK5 can scale up to 116,000 cores with high efficiency under Falkon. VinaMPI is an MPI program based on the ADV package that uses a large number of cores to speed up individual docking tasks. VinaMPI ran successfully on 84,672 cores of the Kraken supercomputer and efficiently reduced the total time-to-completion. All of the above works focus on the performance and efficiency of task distribution while ignoring the rest of the HTVS process, for instance robustness, recoverability, and result analysis. FireWorks (FWS) [7] is a workflow system for high-throughput calculations on supercomputers that effectively solves the problems of concurrent task distribution and fault-tolerance management and provides an intuitive graphical interface; however, FWS emphasizes versatility and usability. DVSDMS [8] is a distributed virtual screening data management system that focuses only on the data management issues of the high-throughput docking process. Therefore, the architecture of high-performance computers, as well as the computational characteristics of the application, needs to be considered when designing a framework for HTVS on high-performance computers.
In this work, we report a general framework - High-performance High-throughput
Virtual Screening Framework (HHVSF) - to enable large-scale, multitasking and
small-size input and output (IO) applications to efficiently execute on HPC resources.
This framework contains task management and data management systems, which can
handle thousands of tasks, manage a large volume of lots-of-small files, and reduce the
long processing time for analyzing. The purpose of HHVSF is to provide high com-
putational performance based on portability, availability, serviceability and stability
(PASS).
The framework of HHVSF comprises two parts: task management and distributed data management (see Fig. 1). To access and store data efficiently and flexibly, program executions are loosely coupled through the MongoDB C driver, so the application codes do not need to be modified. The following three subsections document the overall framework of HHVSF; the simulation parameters and the data sets of the experiments are introduced at the end of this section. ADV [9] and WEGA [10] are chosen as representative applications for the experiments, and other applications can be integrated into HHVSF in a similar way.
HHVSF: A Framework to Accelerate Drug-Based HTVS 5
Fig. 1. Architecture of HHVSF: HTCondor dispatches tasks to execute machines on the HPC cluster, MongoDB shards (Shard 1 ... Shard N-1) store the data, and the login node checks job status and retrieves ranked results.
tolerance in this case: one is monitoring the job status during the run through the job management system, the other is marking each task as successful or failed after it finishes. HTCondor provides a checkpoint mechanism in the standard universe by using condor_compile to relink the executable with the HTCondor libraries, but the wrapper programs vina_wrapper and wega_wrapper contain system calls such as system() and therefore cannot use HTCondor's checkpointing services. As a result, we choose the second method. When a worker successfully calls the ADV or WEGA executable, a tag representing the task status is inserted into the corresponding document in the MongoDB database. After the job finishes, the documents are checked for failed tags and the failed jobs are restarted.
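The tag-and-restart scheme above can be sketched as follows. In the real HHVSF the tag is written through the MongoDB C driver; here the update-document shape and the restart scan are shown in plain Python, and all field names (status, _id) are illustrative assumptions, not taken from the paper.

```python
# Sketch of per-task status tagging (field names are illustrative).
def make_status_tag(task_id, returncode):
    """Build the MongoDB-style update marking a task successful or failed."""
    return {
        "filter": {"_id": task_id},
        "update": {"$set": {"status": "success" if returncode == 0 else "failed"}},
    }

def find_failed(tags):
    """After the job finishes, collect task ids tagged 'failed' for restart."""
    return [t["filter"]["_id"] for t in tags
            if t["update"]["$set"]["status"] == "failed"]

tags = [make_status_tag("mol_001", 0), make_status_tag("mol_002", 1)]
print(find_failed(tags))  # tasks to be restarted
```

In a deployment, each update would be applied with a MongoDB update call and the restart scan would be a query on the status field.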
compound library. Hence, the RAMDISK in the computing nodes is used to temporarily store the IO files needed by the applications (see Fig. 3). The RAMDISK provides high-speed, low-latency IO operations for handling lots of small files, while the high-capacity shared file storage is still used to store the resulting data. By relocating data between MongoDB and the RAMDISK, the IO pressure on the shared file storage is effectively mitigated.
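The relocation step can be sketched as staging one small input file into the node-local RAMDISK before the docking executable runs. A temporary directory stands in for a RAMDISK mount such as /dev/shm (an assumption; the paper does not name the mount point), and the file content is a placeholder.

```python
# Minimal sketch of staging a lots-of-small-files input into RAMDISK.
import os
import tempfile

RAMDISK = tempfile.mkdtemp()  # in production this would be a RAMDISK mount

def stage_ligand(name, pdbqt_text):
    """Write one small input file to the RAMDISK and return its path."""
    path = os.path.join(RAMDISK, name + ".pdbqt")
    with open(path, "w") as f:
        f.write(pdbqt_text)  # content would come from the database layer
    return path

path = stage_ligand("ZINC000001", "REMARK  VINA RESULT\n")
assert os.path.exists(path)
```

Only the (small) result file then needs to travel to the shared store, which is what relieves the shared-file-system pressure.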
Fig. 3. Data placement by access speed: local RAM in the node (fast) versus the NoSQL database.
insert mol2 files into MongoDB. MGLTools (version 1.5.6) was used to convert the mol2 files into pdbqt files for docking. We prepared five data sets of different sizes (zinc_ligand_1–5), as shown in Table 1. All data sets are sorted by heavy-atom number in ascending order. After molecular docking, the resulting pdbqt files were converted to mol format with the Open Babel package [23] (version 2.4.0).
The protein target is a crystal structure of the alpha subunit of glycyl-tRNA synthetase (PDB code: 5F5W). The (x, y, z) coordinates (in Å) of the center of the docking site are (−94.666, 51.401, 8.991), and the dimensions of the cubic box are (14, 18, 12). The num_modes argument is set to 1.
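These docking-site parameters map directly onto an AutoDock Vina configuration file. A sketch of generating it (the receptor and ligand file names are placeholders, not from the paper):

```python
# Build the text of an AutoDock Vina config file from the parameters above.
def vina_config(center, size, num_modes=1):
    cx, cy, cz = center
    sx, sy, sz = size
    return (
        "receptor = target_5F5W.pdbqt\n"   # placeholder file names
        "ligand = ligand.pdbqt\n"
        f"center_x = {cx}\ncenter_y = {cy}\ncenter_z = {cz}\n"
        f"size_x = {sx}\nsize_y = {sy}\nsize_z = {sz}\n"
        f"num_modes = {num_modes}\n"
    )

cfg = vina_config((-94.666, 51.401, 8.991), (14, 18, 12))
print(cfg)
```

The wrapper would write this text next to the staged ligand and invoke vina with the config file.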
2.3.2 WEGA
An SDF file containing about twenty million molecules was obtained from the ZINC database FTP server. Approximately 958 million conformers were generated from the SDF file using the CAESAR algorithm [24] in Discovery Studio (version 3.5) [25] for shape-feature similarity calculation. To stay within MongoDB's 16 MB document limit, the conformer files were split into smaller files of 15 MB each and then inserted into the database. Table 1 gives the two data sets for WEGA (zinc_conformer_1 and zinc_conformer_2).
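The splitting step can be sketched by chunking a multi-record SDF stream on its "$$$$" record delimiter so that each chunk stays under the size limit. The limit is a parameter here so the sketch is testable with a tiny value; the paper's actual cutoff is 15 MB.

```python
# Split a multi-record SDF text into chunks no larger than `limit` bytes,
# never breaking an individual record ("$$$$" ends each SDF record).
def split_sdf(text, limit):
    records = [r + "$$$$\n" for r in text.split("$$$$\n") if r.strip()]
    chunks, current = [], ""
    for rec in records:
        if current and len(current) + len(rec) > limit:
            chunks.append(current)
            current = ""
        current += rec
    if current:
        chunks.append(current)
    return chunks

sdf = "molA\n...\n$$$$\nmolB\n...\n$$$$\n"
chunks = split_sdf(sdf, limit=15)
assert all(c.endswith("$$$$\n") for c in chunks)
```

Each resulting chunk would then be inserted into MongoDB as one document.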
The query molecule is 4-amino-1-[4,5-dihydroxy-3-(2-hydroxyethyl)-1-cyclopent-2-enyl]-pyrimidin-2-one (ZINC ID: ZINC03834084). The method for molecular overlay is set to 2 (combining shape similarity and pharmacophore similarity). Each result SDF file contains up to 100 similar molecules. Table 1 shows detailed information on the data sets used throughout the article.
Table 1. Data sets for testing. The zinc_ligand_1–5 databases are prepared for AutoDock Vina; zinc_ligand_2–5 were extracted from zinc_ligand_1 in fixed proportions. The zinc_conformer_1–2 databases are prepared for WEGA; zinc_conformer_2 was extracted randomly from zinc_conformer_1.

Database name    | Number        | Description
zinc_ligand_1    | 20,430,347    | ZINC purchasable subset
zinc_ligand_2    | 10^7          | One of every 2 molecules of the ZINC purchasable subset
zinc_ligand_3    | 10^6          | One of every 20 molecules of the ZINC purchasable subset
zinc_ligand_4    | 10^5          | One of every 200 molecules of the ZINC purchasable subset
zinc_ligand_5    | 10^4          | One of every 2000 molecules of the ZINC purchasable subset
zinc_conformer_1 | ≈9.58 × 10^8  | Up to 50 conformers per molecule of the ZINC purchasable subset
zinc_conformer_2 | ≈10^6         | Up to 50 conformers per molecule of the ZINC purchasable subset
10 P. Chen et al.
Fig. 4. Computing time of a single molecular docking (y-axis, logarithmic scale) versus the number of heavy atoms in the compound (x-axis), based on the zinc_ligand_5 data set.
Fig. 5. (a) The computing time of each worker without load balancing; the red line is the average computing time per task. (b) The computing time per worker with load balancing; the red line is the average computing time per task. (Color figure online)
accomplished, the resulting data (score, structural conformation, and running status) are inserted into a MongoDB collection. Figure 6 shows the per-second "connection" and "insert" operations on the MongoDB server while running vina_wrapper over the whole computing period; the inverted-triangle points clearly reveal the three stages of running tasks: startup, steady running, and finish. The startup phase took 1,396 s to start 16,000 workers, averaging 11 tasks per second. The rhombus points rise gradually as time progresses, reaching up to 1,324 molecules per second and averaging 145 molecules per second. Table 2 gives the results for the other data sets. As for WEGA, Fig. 7 shows that the data throughput reaches up to 9,448 molecules per second, averaging 6,430 molecules per second, indicating high performance and high data throughput.
Fig. 6. The number of "insert" and "connection" operations on the MongoDB server when running the ADV application. The zinc_ligand_3 data set was run on 16,000 cores.
Table 2. Data throughput for ADV and WEGA on different data sets.

Program | Test number | Cores  | Startup time (s) | Max throughput (molecules/s) | Avg throughput (molecules/s)
ADV     | 10^7        | 16,000 | 1,222            | 1,957                        | 130
ADV     | 10^6        | 16,000 | 1,396            | 1,324                        | 145
ADV     | 10^5        | 8,000  | 564              | 473                          | 76
WEGA    | 95,712      | 4,000  | 313              | 9,448                        | 6,430
Fig. 7. The number of “insert” operation and “connection” operation in MongoDB’s server
when running WEGA application. The zinc_conformer_1 data set was used to run on 4,000
cores.
3.3 Scalability
To test scalability, we measured speedup and parallel efficiency on the zinc_ligand_4 and zinc_ligand_3 data sets. Figure 8a shows that the zinc_ligand_4 data set scales to 8,000 cores with a parallel efficiency of 0.84, and the zinc_ligand_3 data set scales to 16,000 cores with a parallel efficiency of 0.83 (see Fig. 8b). The parallel efficiency decreases sharply when the computing resource is scaled beyond 8,000 cores. This is because more cores mean more workers, and thus more time is spent by HTCondor starting those workers.
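The speedup and parallel-efficiency figures above follow the usual relative definitions; a small sketch (the timings below are illustrative, not the paper's measurements):

```python
# Relative speedup and parallel efficiency against a baseline core count.
def speedup(t_base, t_n):
    """Speedup of a run taking t_n versus the baseline run taking t_base."""
    return t_base / t_n

def efficiency(t_base, n_base, t_n, n):
    """Parallel efficiency when scaling from n_base cores to n cores."""
    return speedup(t_base, t_n) * n_base / n

# Illustrative: doubling cores from 4,000 to 8,000 while the time drops
# from 100 to 59.5 units gives an efficiency of about 0.84.
print(round(efficiency(100.0, 4000, 59.5, 8000), 2))
```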
Fig. 8. (a) Speedup (right triangle) and parallel efficiency (red dot) of the molecular docking experiment on the zinc_ligand_4 data set. (b) Speedup (black dot) and parallel efficiency (upper triangle) of the molecular docking experiment on the zinc_ligand_3 data set.
Table 3. Failure rate and computing time for ADV and WEGA on different data sets.

Program | Data set         | Cores  | Failure rate | Last task time | Average time
ADV     | zinc_ligand_2    | 16,000 | 0.01171      | 22.31 h        | 20.14 h
ADV     | zinc_ligand_3    | 16,000 | 0.00390      | 3.34 h         | 2.43 h
ADV     | zinc_ligand_4    | 8,000  | 0.00001      | 48.10 min      | 31.31 min
WEGA    | zinc_conformer_1 | 4,000  | 0.00002      | 34.12 min      | 28.20 min
4 Conclusions
HHVSF includes task management and relocating data management, and supports high-throughput applications with large-scale, multitasking, small-sized IO files running on HPC resources. There are two types of virtual drug screening applications: (1) computation-intensive applications (such as molecular docking), and (2) data-intensive applications (such as molecular-structure-similarity-based virtual screening campaigns). With HHVSF, both types of applications run on the Tianhe-2 supercomputer with high performance. Testing shows that, using ADV for molecular docking against the protein target (PDB code: 5F5W), nearly half of the compounds from the ZINC database were screened within one day on 16,000 cores. For WEGA, 958 million conformations were screened in about half an hour on 4,000 cores. The ranked ligands or conformers can be accessed in milliseconds by specifying the "sort" method in the database. Meanwhile, the IO pressure that lots of small files place on shared file storage in HPC resources is mitigated. Thus, HHVSF can significantly accelerate HTVS campaigns on HPC resources.
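The millisecond-level retrieval of ranked results corresponds to a sorted, indexed query. In a deployment this would be a MongoDB find-and-sort on the score field; the sketch below emulates the same query shape over plain dicts, and all field names are illustrative.

```python
# Emulation of: collection.find().sort("score", 1).limit(2)
# (lower docking score = better binding, hence ascending order).
results = [
    {"ligand": "ZINC03", "score": -9.1},
    {"ligand": "ZINC01", "score": -7.4},
    {"ligand": "ZINC02", "score": -10.2},
]

top2 = sorted(results, key=lambda d: d["score"])[:2]
print([d["ligand"] for d in top2])
```

With an index on the score field, the database serves this query without scanning the whole collection, which is what makes millisecond access possible.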
Acknowledgments. We would like to thank Prof. Xin Yan for permission to use the WEGA program for testing. Helpful discussions with Guixin Guo, Lin Li, and Wen Wan, and technical assistance by the HTCondor Team (University of Wisconsin-Madison), are gratefully acknowledged. This work was performed under the auspices of the NSFC (U1611261) and the GD Frontier & Key Technology Innovation Program (2015B010109004).
References
1. Manglik, A., Lin, H., Aryal, D.K., Mccorvy, J.D., Dengler, D., Corder, G., Levit, A., Kling,
R.C., Bernat, V., Hübner, H.: Structure-based discovery of opioid analgesics with reduced
side effects. Nature 537(7619), 1 (2016)
2. Rodrigues, T., Reker, D., Schneider, P., Schneider, G.: Counting on natural products for
drug design. Nat. Chem. 8(6), 531–541 (2016)
3. Hao, G.F., Wang, F., Li, H., Zhu, X.L., Yang, W.C., Huang, L.S., Wu, J.W., Berry, E.A.,
Yang, G.F.: Computational discovery of picomolar Q(o) site inhibitors of cytochrome bc1
complex. J. Am. Chem. Soc. 134(27), 11168–11176 (2012)
4. Forli, S., Huey, R., Pique, M.E., Sanner, M.F., Goodsell, D.S., Olson, A.J.: Computational
protein-ligand docking and virtual drug screening with the AutoDock suite. Nat. Protoc.
11(5), 905 (2016)
5. Raicu, I.: Falkon: a Fast and Light-weight tasK executiON framework, p. 43 (2007)
6. Raicu, I., Zhao, Z., Wilde, M., Foster, I., Beckman, P., Iskra, K., Clifford, B.: Toward
loosely coupled programming on petascale systems, pp. 1–12 (2008)
7. Jain, A., Ong, S.P., Chen, W., Medasani, B., Qu, X., Kocher, M., Brafman, M., Petretto, G.,
Rignanese, G.M., Hautier, G.: FireWorks: a dynamic workflow system designed for
high-throughput applications. Concurr. Comput. Pract. Exp. 27(17), 5037–5059 (2015)
8. Zhou, T., Caflisch, A.: Data management system for distributed virtual screening. J. Chem.
Inf. Model. 49(1), 145–152 (2009)
9. Trott, O., Olson, A.J.: AutoDock Vina: improving the speed and accuracy of docking with a
new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31(2),
455–461 (2010)
10. Yan, X., Li, J., Liu, Z., Zheng, M., Ge, H., Xu, J.: Enhancing molecular shape comparison by weighted Gaussian functions. J. Chem. Inf. Model. 53(8), 1967–1978 (2013)
11. Jones, G., Willett, P., Glen, R.C., Leach, A.R., Taylor, R.: Development and validation of a
genetic algorithm for flexible docking. J. Mol. Biol. 267(3), 727–748 (1997)
12. Friesner, R.A., Banks, J.L., Murphy, R.B., Halgren, T.A., Klicic, J.J., Mainz, D.T., Repasky,
M.P., Knoll, E.H., Shelley, M., Perry, J.K.: Glide: a new approach for rapid, accurate
docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47(7),
1739–1749 (2004)
13. Rarey, M., Kramer, B., Lengauer, T., Klebe, G.: A fast flexible docking method using an
incremental construction algorithm. J. Mol. Biol. 261(3), 470–489 (1996)
14. Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple linux utility for resource
management. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003.
LNCS, vol. 2862, pp. 44–60. Springer, Heidelberg (2003). https://doi.org/10.1007/
10968987_3
15. Bode, B., Halstead, D.M., Kendall, R., Lei, Z., Jackson, D.: The Portable batch scheduler
and the Maui Scheduler on Linux Clusters (2000)
16. Gentzsch, W.: Sun Grid Engine: Towards Creating a Compute Power Grid, pp. 35–36
(2001)
17. Thain, D., Tannenbaum, T., Livny, M.: Distributed computing in practice: the Condor experience. Concurr. Comput. Pract. Exp. 17(2–4), 323–356 (2005)
18. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J., Bryant, S.H.: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 37(Web Server issue), W623 (2009)
19. Irwin, J.J., Sterling, T., Mysinger, M.M., Bolstad, E.S., Coleman, R.G.: ZINC: a free tool to
discover chemistry for biology. J. Chem. Inf. Model. 52(7), 1757–1768 (2012)
20. Gaulton, A., Bellis, L.J., Bento, A.P., Chambers, J., Hersey, A., Light, Y., Mcglinchey, S.,
Michalovich, D., Allazikani, B.: ChEMBL: a large-scale bioactivity database for drug
discovery. Nucleic Acids Res. 40(Database issue), D1100 (2012)
21. Chen, J., Swamidass, S.J., Dou, Y., Bruand, J., Baldi, P.: ChemDB: a public database of
small molecules and related chemoinformatics resources. Bioinformatics 21(22), 4133–4139
(2005)
22. Banker, K.: MongoDB in Action. Manning Publications Co., Greenwich (2011)
23. O’Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T., Hutchison, G.R.:
Open babel: an open chemical toolbox. J. Cheminform. 3(1), 1–14 (2011)
24. Li, J., Ehlers, T., Sutter, J., Varma-O’Brien, S., Kirchmair, J.: CAESAR: a new conformer
generation algorithm based on recursive buildup and local rotational symmetry consider-
ation. J. Chem. Inf. Model. 47(5), 1923–1932 (2007)
25. Discovery Studio Visualizer, Release 3.5. Accelrys Inc., San Diego (2012)
26. Jaghoori, M.M., Bleijlevens, B., Olabarriaga, S.D.: 1001 ways to run AutoDock Vina for
virtual screening. J. Comput. Aided Mol. Des. 30(3), 1–13 (2016)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appro-
priate credit to the original author(s) and the source, provide a link to the Creative Commons
license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
HBasechainDB – A Scalable Blockchain
Framework on Hadoop Ecosystem
1 Introduction
A blockchain is a distributed ledger of blocks that records all the transactions that have taken place. It was first popularized in 2008 by a person or group under the pseudonym Satoshi Nakamoto, who introduced Bitcoin [11]: a peer-to-peer electronic cash system. This technology revolutionized the decentralized paradigm by introducing and using a consensus mechanism: Proof-of-Work (PoW). Proof-of-Work requires an expensive calculation, also called mining, to be performed in order to create a new trustless set of transactions, also called a block, on the blockchain. The major breakthrough of Bitcoin was the hash-based blockchain, which made the blocks of transactions tamper-proof, transparent, and resistant to DoS attacks.
Blockchains can support a variety of applications like decentralized financial
services, Internet-of-Things [12], smart properties, etc. Several works have cen-
tered around the evaluation of potential use cases for the blockchain [3,9,13].
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 18–29, 2018.
https://doi.org/10.1007/978-3-319-69953-0_2
HBasechainDB – A Scalable Blockchain Framework on Hadoop Ecosystem 19
A lot of work has been underway to address the scalability of blockchains. Vukolic [14] contrasted PoW-based blockchains with those based on BFT state-machine replication for scalability. Eyal et al. [8] introduce Bitcoin-NG as a scalable blockchain protocol based on BFT consensus protocols. These approaches
20 M. S. Sahoo and P. K. Baruah
are focused on improving the consensus protocol. McConaghy et al. [10] adopted a different approach to scalability. They started with a distributed database, MongoDB, and added the blockchain features of decentralized control and immutability, while supporting the creation and movement of digital assets, to provide a scalable, decentralized database, BigchainDB. The major contribution of BigchainDB that enables this scalability is the concept of blockchain pipelining. In blockchain pipelining, blocks are added to the blockchain without waiting for the current block to be agreed upon by the other nodes. Consensus is taken care of by the underlying database. The validation of blocks is not done during block addition but eventually, by a process of voting among nodes. This brings huge performance gains, and BigchainDB points to transaction throughputs of over a million transactions per second and sub-second latencies.
In creating HBasechainDB, we have adopted an approach similar to that of
BigchainDB. Instead of using MongoDB as the underlying database, we use the
Hadoop database, Apache HBase. Apache HBase is a distributed, scalable Big
Data store. It supports random, real-time read/write access to Big Data. Apache
HBase is an open-source, distributed, versioned, non-relational, column-family
oriented database modeled after Google’s Bigtable [6]. HBase provides both lin-
ear and modular scaling, along with strongly consistent reads/writes. HBase
tables are distributed on the cluster via regions. HBase supports automatic
sharding by splitting and re-distributing regions automatically as data grows.
HBasechainDB is a scalable, decentralized data store akin to BigchainDB.
3 Terminology
– Blockchain: A chain of blocks where every block has a hash link to the previous block, i.e., every block stores the hash of the previous block. An advantage is that, just by storing the hash of the last block, we can easily detect whether any change has been made to any of the blocks.
– Double spending: An attack in which an asset is spent in more than one transaction. To prevent double spending, a blockchain framework needs to check whether a particular asset was spent in any previous transaction. For instance: user U2 wants to spend/transfer an asset A1, in transaction T2, to another user U3. Say the asset A1 was transferred to U2 by user U1 in some previous transaction T1. U2 specifies T1's id in T2, which shows that T1 was the transaction that contained asset A1 and that U2 got it from U1. Now U2 wants to spend/transfer it to U3. So before validating transaction T2 with asset A1, a blockchain framework checks, in order, all the transactions with asset A1 that occurred between T1 and T2. If A1 does not occur in any of those transactions, then A1 is not double-spent; otherwise it is.
– Blockchain pipelining: Blocks are written to the underlying database without waiting for the vote that confirms the block's validity. Voting for a block and forming the chain happen as a separate layer.
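The hash-link property in the first bullet can be sketched directly with SHA3-256, the hashing scheme HBasechainDB uses (the block layout here is illustrative): re-hashing from the first block and comparing against the stored hash of the last block detects any modification.

```python
# Hash-linked chain: each block stores the SHA3-256 hash of its predecessor.
import hashlib

def block_hash(block):
    return hashlib.sha3_256(
        (block["prev_hash"] + block["payload"]).encode()
    ).hexdigest()

def build_chain(payloads):
    chain, prev = [], ""
    for p in payloads:
        b = {"prev_hash": prev, "payload": p}
        prev = block_hash(b)
        chain.append(b)
    return chain, prev  # prev is now the hash of the last block

def verify(chain, last_hash):
    """Re-hash the whole chain; any tampered block breaks the link."""
    prev = ""
    for b in chain:
        if b["prev_hash"] != prev:
            return False
        prev = block_hash(b)
    return prev == last_hash

chain, last = build_chain(["tx-set-1", "tx-set-2", "tx-set-3"])
assert verify(chain, last)
chain[1]["payload"] = "tampered"  # any change is detected
assert not verify(chain, last)
```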
4 Architecture
4.1 Data Model of Transaction
The transaction model of all blockchain platforms has three important fields: Transaction Id, List of Inputs, and List of Outputs. Apart from these, there are platform-dependent fields. HBasechainDB's transaction model consists of: Transaction Id, Asset, List of Inputs, List of Outputs, and Metadata.
HBasechainDB. Let us say there are n federation nodes N1, N2, ..., Nn. When
a client submits a transaction t, it is assigned to one of the federation nodes,
say Nk. The node Nk is now responsible for entering this transaction into the
blockchain. Nk first checks for the validity of the transaction. Validity of a trans-
action includes having a correct transaction hash, correct signatures, existence
of the inputs to the transaction, if any, and the inputs not having been already
spent. Once Nk has validated a set of transactions, it bundles them together in
a block, and adds it to the blockchain. Any block can only contain a specified
maximum number of transactions. Let us say t was added in the block B.
When the block B is added to the blockchain its validity is undecided.
Since the federation is allowed to grow or shrink during the operation of
HBasechainDB, blocks also include a list of voters based on the current fed-
eration. All the nodes in the voter list for a block vote upon B for its validity.
For voting upon a block, a node validates all the transactions in the block. A
block is voted valid only if all the transactions are found to be valid, else it is
voted invalid. If a block gathers a majority of valid or invalid votes, its validity
changes from undecided to valid or invalid respectively. Only the transactions in
a valid block are considered to have been recorded in the blockchain. The ones in
the invalid blocks are ignored altogether. However, the chain retains both valid
and invalid blocks. A block being invalid does not imply that all the transac-
tions in the block are invalid. Therefore, the transactions from an invalid block
are re-assigned to federation nodes to give the transactions further chance of
inclusion in the blockchain. The reassignment is done randomly. This way, if a
particular rogue node was trying to add an invalid transaction to the blockchain,
this transaction will likely be assigned to a different node the second time and
dropped from consideration. Thus, if block B acquires a majority of valid votes,
then transaction t would have been irreversibly added to the blockchain. On the
other hand, if B were invalid, then t would be reassigned to another node and
so on until it is included in the chain or removed from the system.
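The validity rule described above (undecided until a majority of the block's voter list votes one way) can be sketched as a tally. Vote values and field names are illustrative, not HBasechainDB's actual vote schema.

```python
# A block's validity flips from "undecided" once a strict majority of the
# voter list has voted valid (True) or invalid (False).
def block_validity(voter_list, votes):
    """votes: dict node -> True (valid) / False (invalid); absent = not yet voted."""
    majority = len(voter_list) // 2 + 1
    valid = sum(1 for n in voter_list if votes.get(n) is True)
    invalid = sum(1 for n in voter_list if votes.get(n) is False)
    if valid >= majority:
        return "valid"
    if invalid >= majority:
        return "invalid"
    return "undecided"

nodes = ["N1", "N2", "N3", "N4", "N5"]
print(block_validity(nodes, {"N1": True, "N2": True}))              # undecided
print(block_validity(nodes, {"N1": True, "N2": True, "N3": True}))  # valid
```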
As discussed in the previous section, the chain is not formed when blocks are created. When a block is entered into the hbasechain table, the blocks are stored in HBase in the lexicographical order of their ids. The chain is actually formed
during vote time. When a node votes on a block, it also specifies the previous
block that it had voted upon. Thus, instead of waiting for all the federation nodes
to validate the current block before proceeding to the creation of a new block,
blocks are created independent of validation. This is the technique of blockchain
pipelining described earlier. Over time, the blockchain accumulates a mix of
valid and invalid blocks. The invalid blocks are not deleted from the chain to
keep the chain immutable. What we also note here is that while it would seem
that different nodes could have a different view of the chain depending upon
the order in which they view the incoming blocks, it is not seen in practice in
HBasechainDB due to the strong consistency of HBase and the fact that the
blocks to be voted upon are ordered based on their timestamp. Thus, each node
sees the same order of blocks, and we have the same chain view for different
nodes.
To tamper with any block in the blockchain, an adversary will have to modify
the block, leading to a change in its hash. This changed hash would not match
the vote information for the block in the votes table, and also in subsequent
votes that refer to this block as the previous block. Thus an adversary would
have to modify the vote information all the way up to the present. However,
we require that all the votes being appended by nodes are signed. Thus, unless
an adversary can forge a node’s signature, which is cryptographically hard, he
cannot modify the node’s votes. In fact, he has to forge multiple signatures to
affect any change in the blockchain preventing any chances of tampering. This
way HBasechainDB provides a tamper-proof blockchain over HBase.
the transactions are written only once but read many times for various pur-
poses like checking double spending and performing checks on whether any
tampering took place.
3. HBase provides us with various ways in which we can run our custom code on
the region-server. HBase co-processor and custom filters are two such ways.
HBase co-processor can act as database triggers. In our implementation we
use these features in following ways:
(a) The check for double spending is generally done by loading the transac-
tions onto the federation nodes (i.e., the client systems). Loading so many
transactions from the region-servers to the federation nodes is a major
bottleneck for system throughput. In our approach, instead of pulling
the data required for the double-spending check onto the client system,
we push the check to the region-servers using an HBase custom
filter. This approach improves performance in two ways:
i. Data does not move to the computation node; rather, the computation
moves to the data nodes. Since the code is orders of magnitude smaller
than the data, we improve the system by decreasing the communication
time.
ii. The double-spending check runs in parallel on multiple region-servers,
compared with the traditional approach of checking on a single
client node.
(b) A changefeed brings great benefit to a blockchain framework. We use
an HBase co-processor to implement a changefeed that immediately
notifies the federation whenever an adversary tries to change or delete
the contents of the database.
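The contrast between pulling data to the client and pushing the filter to the region-servers can be illustrated with a toy Python model. Real HBase co-processors and filters are written in Java and run inside the region-server; the region contents and function names below are purely hypothetical:

```python
# Two hypothetical "regions", each mapping a spent output id to the
# transaction that spent it (a stand-in for rows on a region-server).
REGIONS = [
    {"out1": "txA", "out2": "txB"},
    {"out3": "txC", "out4": "txD"},
]

def double_spend_client_pull(output_id):
    """Traditional approach: pull every row to the client, then scan.
    All rows cross the network before any check happens."""
    pulled = [item for region in REGIONS for item in region.items()]
    return any(out == output_id for out, _tx in pulled)

def double_spend_server_push(output_id):
    """Filter-pushdown approach: each region checks locally and returns
    only matches, so at most a handful of rows cross the network."""
    matches = []
    for region in REGIONS:  # on real HBase this runs per region-server, in parallel
        matches.extend(tx for out, tx in region.items() if out == output_id)
    return bool(matches)
```

Both functions return the same answer; the difference is that the pushdown variant moves only the filtered result over the network, which is the effect the custom filter achieves in HBasechainDB.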
5 Implementation Details
The federation nodes in HBasechainDB are initialized with an Ed25519 [2,4]
key pair for signing. The SHA3-256 [5] hashing scheme is used for hashing
transactions and blocks. The current implementation of HBasechainDB uses six
HBase tables. A critical issue in the design of the HBase tables is the row
key, since region splits and scans on HBase tables are done in lexicographical
order of the row key. The row key pattern depends on the access pattern for
the data in the HBase table.
Following is the description of the HBase tables:
1. backLog: When a transaction is submitted to the Federation nodes, the trans-
action is randomly assigned to one of the nodes. All such assigned transactions
are stored in the backlog table with each transaction stored in a single row.
A node scanning the backlog table should only have to read the transactions
assigned to itself. Thus, the first segment of the row key for the backlog
table is the public key of the node to which the transaction was assigned,
so that a node can scan the backlog table with the row prefix being its own
public key. The last segment of the row key contains the transaction reference
id. So the row key looks like: <publicKey> <transactionId>
2. block: This table contains all the blocks in the blockchain. Each block
is a logical block containing only the ids of the transactions present in
the block. The actual transaction details are stored in the "hbasechaindb"
table. Since the access pattern for this table is looking up blocks by block
id, the row key is just the block id: <blockId>
3. hbasechaindb: This is the table where all the transaction details are stored
after a transaction is put on the blockchain. Each row corresponds to a single
transaction. Since the access pattern for this table is looking up a
transaction by its transaction link id, the row key is <transaction link id>,
which consists of <block id> <transaction id>. The transaction link id of a
previous output is used in the inputs of the current transaction when spending
an asset.
4. toVote: Every new block created has to be voted upon by the federation
nodes. For this, we need to inform the federation nodes of their need to vote
upon a newly created block. To this end, every block created is added to
this table to signal the nodes to vote. A block is removed from the table once
the node has finished voting on it. The row key of this table is: <federation
node's signing key> <block id>
5. vote: This is the table in which all the votes are recorded, with one entry
for every vote cast by a federation node on a block. The row key of the table
is: <block id> <decision> <Fed. Node public key>
6. reference: This table stores the mapping between transaction link id and
transaction id, and acts as an index when the details of a transaction are
queried. Since the access pattern of the table is by transaction reference id,
the row key is just the transaction reference id:
<transaction link id>
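The row-key conventions above can be sketched in Python. The separator and helper names are assumptions for illustration (the text does not specify how the key segments are delimited), but the prefix-scan behavior mirrors how HBase serves scans over lexicographically sorted row keys:

```python
def backlog_row_key(public_key, tx_id):
    """backLog table key: <publicKey><separator><transactionId>, so a node
    can prefix-scan for exactly the transactions assigned to it."""
    return f"{public_key}|{tx_id}"

def hbasechaindb_row_key(block_id, tx_id):
    """hbasechaindb table key: the transaction link id <block id><transaction id>."""
    return f"{block_id}|{tx_id}"

def prefix_scan(table, prefix):
    """Return values whose row key starts with `prefix`, in lexicographic
    key order, mimicking an HBase prefix scan."""
    return [v for k, v in sorted(table.items()) if k.startswith(prefix)]
```

For example, a federation node scans the backlog with its own public key as the prefix and sees only its own assignments, in key order.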
6 Performance
6.1 Experimental Setup
We have used three nodes for the initial performance testing of HBasechainDB
with the following configurations:
– 3 nodes, each with an Intel Core i5-4670 CPU @ 3.40 GHz (4 cores) and 16 GB
of memory, running Ubuntu 16.04.
– Each of the 3 nodes runs an HBase region-server; an HBase master runs on
one of them.
– The replication factor of the underlying HDFS is 3.
– HBase is backed by a ZooKeeper quorum of three nodes.
– We consider only the creation of transactions in our tests.
6.2 Results
We have tested HBasechainDB for scalability over three nodes. We describe the
performance of HBasechainDB with two parameters:
– Transaction Latency: This is defined as the time elapsed since the sub-
mission of a transaction to HBasechainDB until the block in which it has
been recorded is validated. The transaction latency is found for streaming
transactions.
– Throughput: This is the number of transactions that are recorded in the
blockchain per second. To find the peak throughput the blockchain is capable
of, we store the transactions in the backlog beforehand and then run the
nodes. The throughput observed then is the peak throughput.
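The two metrics can be computed from per-transaction timestamps as follows; this is a generic measurement sketch, not code from HBasechainDB:

```python
def latency_and_throughput(records):
    """records: list of (submit_time, block_validated_time) pairs, in seconds.
    Returns (mean transaction latency, throughput in transactions/second)."""
    latencies = [done - sub for sub, done in records]
    mean_latency = sum(latencies) / len(latencies)
    # Throughput: committed transactions over the wall-clock span of the run.
    span = max(done for _, done in records) - min(sub for sub, _ in records)
    throughput = len(records) / span
    return mean_latency, throughput
```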
The main reason behind the linear scaling of HBasechainDB is that almost all
the computation, including the transaction-validity and double-spending
checks, is pushed to the server side. Therefore, if we increase the number of
HBase nodes while keeping the number of federation nodes constant, the system
scales linearly.
7 Conclusion
Blockchain technologies can be very useful in the Big Data scenario by helping
us immutably record data and by decentralizing data services. However, current
blockchain implementations, with their extremely low transaction throughputs
and high transaction latencies, do not lend themselves to Big Data. Discussions
on improving blockchain scalability have largely focused on using better
consensus protocols in place of the PoW protocol used by Bitcoin. BigchainDB
provides an alternative idea: instead of scaling blockchains into
scalable data stores, it implements a blockchain over an existing scalable
distributed database. Such an implementation inherits the scalability of the
underlying database while adding the immutability and decentralization offered
by blockchains. While BigchainDB is implemented over the MongoDB and
RethinkDB databases, our work provides an alternative implementation
over HBase. HBasechainDB is a hitherto unavailable blockchain implementation
integrated with the Hadoop ecosystem. It supports very high transaction
throughputs with sub-second latencies, and supports the creation and movement
of digital assets. HBasechainDB scales linearly and is also a good platform
for analyzing data stored on the blockchain.
References
1. https://blockchain.info/charts/blocks-size
2. https://github.com/str4d/ed25519-java/tree/master/src/net/i2p/crypto/eddsa
3. Aron, J.: Automatic world (2015)
4. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.-Y.: High-speed high-
security signatures. J. Cryptographic Eng. 2(2), 1–13 (2012)
5. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Keccak specifications. Sub-
mission to NIST (Round 2) (2009)
6. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M.,
Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for
structured data. ACM Trans. Comput. Syst. (TOCS) 26(2), 4 (2008)
7. Croman, K., et al.: On scaling decentralized blockchains. In: Clark, J., Meiklejohn,
S., Ryan, P.Y.A., Wallach, D., Brenner, M., Rohloff, K. (eds.) FC 2016. LNCS,
vol. 9604, pp. 106–125. Springer, Heidelberg (2016). https://doi.org/10.1007/
978-3-662-53357-4_8
8. Eyal, I., Gencer, A.E., Sirer, E.G., Van Renesse, R.: Bitcoin-NG: a scalable
blockchain protocol. In: NSDI, pp. 45–59 (2016)
9. Liebenau, J., Elaluf-Calderwood, S.M.: Blockchain innovation beyond bitcoin and
banking (2016)
10. McConaghy, T., Marques, R., Müller, A., De Jonghe, D., McConaghy, T.,
McMullen, G., Henderson, R., Bellemare, S., Granzotto, A.: BigchainDB: a scalable
blockchain database. White paper, BigChainDB (2016)
11. Nakamoto, S.: Bitcoin: A Peer-to-Peer Electronic Cash System (2008)
12. Panikkar, S., Nair, S., Brody, P., Pureswaran, V.: Adept: an IoT practitioner per-
spective. IBM Institute for Business Value (2014)
13. Swan, M.: Blockchain: Blueprint for a New Economy. O’Reilly Media Inc.,
Sebastopol (2015)
14. Vukolić, M.: The quest for scalable blockchain fabric: proof-of-work vs. BFT repli-
cation. In: Camenisch, J., Kesdoğan, D. (eds.) iNetSec 2015. LNCS, vol. 9591, pp.
112–125. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39028-4_9
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
DETOUR: A Large-Scale Non-blocking
Optical Data Center Fabric
1 Introduction
(1) High Fan-Out. Traces from production clusters (e.g., Microsoft [16], Face-
book [18], and Google [19]) show that source top-of-rack (ToR) electrical
packet switches (EPSes) usually communicate with tens to hundreds of other
EPSes simultaneously, and that these patterns are stable across time periods
from seconds to days. Constructing high fan-out EPS connections at large
scale is significant for improving network throughput and reducing flow
completion times (FCT).
(2) Various Communication Patterns. The iterative computing frameworks
(e.g., MapReduce, Spark, Hadoop) for large-scale data analytics contain
various communication patterns, such as unicast, multicast, and broadcast
(*-cast). Multicast and broadcast data dissemination are often the perfor-
mance bottleneck for data analytics applications [11].
As the scale of DCNs expands, the ultimate goal is to provide
non-blocking network services at large scale with high flexibility. However,
existing optical switching networks fail to meet all of these goals (as
summarized in Table 1). Most designs are based on Microelectromechanical
System (MEMS) Optical Circuit Switches (OCSes), Wavelength Division
Multiplexing (WDM), and Wavelength Selective Switches (WSSes).
The rest of this paper is organized as follows. Section 2 describes the archi-
tecture of DETOUR. Section 3 details the algorithms realized in the controller.
In Sect. 4, we implement a flow-level simulator and evaluate the performance of
DETOUR. Section 5 summarizes the related work. Finally, Sect. 6 concludes the
paper.
used to boost the DWDM signals before they are broadcast out. This ensures
that the intensity of every dropped signal remains greater than the
transceivers' receiver sensitivity, so that all signals can be recovered
correctly.
To ensure the consistency of the OCS architecture and simplify the connection
between neighboring OCSes, the OCS takes advantage of a uniform passive routing
fabric (PRF) to reroute the broadcast signals. The PRF also couples with
passive drop-continue splitters to drop the broadcast signals. The ratio of
drop to continue is determined by the scale of DETOUR. As shown in Fig. 2,
each OCS contains 4 PRFs, 2 per dimension. PRF1 and PRF3 are used for signals
broadcast from the same dimension; PRF2 and PRF4 are used for signals
forwarded from other dimensions. Focusing on one dimension, for each OCS:
– The source DWDM signals are transmitted out from port E1 and the source
forwarded signals are transmitted out from port EN+1.
– The signals from port Wi (1 ≤ i < N ) are transmitted out from port Ei+1
and dropped to the (N + i)-th (5:5) splitter.
– The signals from port Wi (i = N ) are only dropped to the 2N -th (5:5) splitter.
– The signals from port Wi (N + 1 ≤ i < 2N ) are transmitted out from port
Ei+1 and dropped to the i-th port of WSS4.
– The signals from port Wi (i = 2N ) are only dropped to the 2N -th port of
WSS4.
36 J. Bao et al.
Thus, the OCS has a consistent architecture and can be directly connected with
neighboring OCSes to construct a 2D-Torus topology. The OCS uses 2N-fiber
optical ribbon to reduce cabling complexity, as shown in Fig. 3.
– The signals from port Wi (1 ≤ i ≤ N ) mean that the source OCS of these
signals is the i-th OCS on the west of this OCS. The dropped signals are
equally split by the (N + i)-th splitter, then transmitted to the i-th port of
WSS2 and the (N + i)-th port of WSS3.
– The signals from port Wi (N + 1 ≤ i ≤ 2N ) mean that they are forwarded
by the (i − N )-th OCS on the west of this OCS, and the source OCS of these
signals is on the south-north dimension passing the forwarding OCS. Then
the dropped signals are transmitted to the i-th port of WSS4.
– The signals from port Si (1 ≤ i ≤ N ) mean that the source OCS of these
signals is the i-th OCS on the south of this OCS. The dropped signals are
equally split by the i-th splitter, then transmitted to the i-th port of WSS1
and WSS3.
– The signals from port Si (N + 1 ≤ i ≤ 2N ) mean that they are forwarded by
the (i − N )-th OCS on the south of this OCS, and the source OCS of these
signals is on the west-east dimension passing the forwarding OCS. Then the
dropped signals are transmitted to the (i − N )-th port of WSS4.
– If the source and destination OCSes are not in the same dimension, the
WSSes of the forwarding OCS and the destination OCS must be configured
jointly. WSS1 or WSS2 of the forwarding OCS selects the demanded wavelengths
from the input port associated with the source OCS and broadcasts them to the
orthogonal dimension. Then, WSS4 of the destination OCS passes the demanded
wavelengths from the input port associated with the forwarding OCS. As shown
in Fig. 3, the optical channel from EPS1 to EPS4 is assigned a red wavelength
and forwarded by OCS2. So WSS2 of OCS2 passes the red wavelength from the
first port and forwards it to the south-north dimension, and then WSS4 of
OCS4 passes the red wavelength from the first port.
Contents                        Specifications
Transceiver [1]                 Output power: −1 ∼ 3 dBm; receiver sensitivity: −7 ∼ −23 dBm
EDFA [2]                        Input power range: −32 ∼ −1 dBm; saturated output power: 17.3 ± 0.3 dBm
1 × 2 splitter (m:n)            Dropped (m) loss: −10log(m/(m + n)) dB; passed (n) loss: −10log(n/(m + n)) dB
Connector loss                  1 dB
WSS loss                        4 dB
Coupler loss                    1 dB
DeMux loss                      2.5 dB
1 × 2 splitter/1 × 4 splitter   3.5 dB/7 dB
The signal loss S^R_Loss for the receiving side of the i-th switch is
calculated as follows:

S^R_Loss = −10 log(l) · (i − 1) − 10 log(1 − l) + i + 11

The signal loss S^F_Loss for the forwarding side of the i-th switch is
calculated as follows:

S^F_Loss = −10 log(l) · (i − 1) − 10 log(1 − l) + i + 11

And the forwarding signal loss S^FR_Loss for the receiving side of the i-th
switch is calculated as follows:

S^FR_Loss = −10 log(l) · (i − 1) − 10 log(1 − l) + i + 7.5
hops i and the splitter transmittance l. From the related work [6], the number
of optical splits increases with the transmittance l. When the transmittance
reaches 0.9, i can be as large as 13. Thus, DETOUR can support up to 27 OCSes
in one dimension.
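A quick sanity check of the hop bound, assuming the receiving-side loss formula above and a link budget derived from the component table (EDFA saturated output 17.3 dBm minus worst-case receiver sensitivity −23 dBm, i.e. about 40.3 dB; any extra engineering margin in the paper's own derivation is not modeled here):

```python
import math

def receive_loss_db(i, l):
    """S^R_Loss for the receiving side of the i-th switch, in dB,
    where l is the splitter transmittance."""
    return -10 * math.log10(l) * (i - 1) - 10 * math.log10(1 - l) + i + 11

def max_hops(l, budget_db):
    """Largest hop count i whose receiving-side loss still fits the budget."""
    i = 1
    while receive_loss_db(i + 1, l) <= budget_db:
        i += 1
    return i

# With l = 0.9 and a 40.3 dB budget, the loss at i = 13 is about 39.5 dB,
# while i = 14 would need about 40.9 dB, matching the 13-hop limit in the text.
```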
Through the above analysis, there are at most min(27, N + 1) OCSes in
each dimension to construct a non-blocking optical fabric. With state-of-the-art
technologies, the port count N of an N × 1 WSS can be as high as 32 at
reasonable cost [25]. So DETOUR can scale to 27 × 27 OCSes; such a 2D-Torus
network connects up to 729 OCSes. As described in the ITU-T G.692 standard,
the C-band can be divided into up to 96 wavelengths at
to y dimension by OCS f , and fyx does the opposite. Edge (4, 2) and edge (6,
8) are conflicting as they are assigned the same color and both forwarded from
x dimension to y dimension by OCS 5. To avoid the conflict, we use OCS 9 to
forward edge (6, 8), as shown in Fig. 5(e).
Considering the constraint of forwarding OCSes, we cast the wavelength
assignment problem as a constrained edge-coloring problem on a bipartite
multigraph. König's theorem [23] states that any bipartite graph G has an edge
coloring with Δ(G) (the maximal degree) colors. The challenge in our situation
is whether the bipartite multigraph Gb converted from the wavelength demand
matrix Gw is Δ(Gb)-colorable. Since Δ(Gb) ≤ k, the demand φw can always be
satisfied if Gb is Δ(Gb)-colorable. We solve this problem by designing a
conflict-avoiding algorithm that utilizes the properties of DETOUR.
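For the unconstrained case, König's bound is constructive: the classic missing-color/alternating-path argument colors any bipartite multigraph with Δ colors. The sketch below illustrates that baseline (DETOUR's conflict-avoiding algorithm adds the forwarding constraints on top of this and is not reproduced here):

```python
from collections import defaultdict

def edge_color_bipartite(edges):
    """Color the edges of a bipartite multigraph with Δ colors.
    edges: list of (u, v) pairs, u in the left part, v in the right part.
    Returns one color index (0..Δ-1) per edge."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[("L", u)] += 1
        deg[("R", v)] += 1
    delta = max(deg.values(), default=0)
    at = defaultdict(dict)           # at[vertex][color] -> edge index
    color_of = [None] * len(edges)

    def missing(x):                  # smallest color not used at vertex x
        return next(c for c in range(delta) if c not in at[x])

    for i, (u, v) in enumerate(edges):
        U, V = ("L", u), ("R", v)
        a, b = missing(U), missing(V)
        if a != b:
            # Walk the alternating a/b path starting at V; it never reaches U.
            path, x, c = [], V, a
            while c in at[x]:
                e = at[x][c]
                path.append(e)
                eu, ev = ("L", edges[e][0]), ("R", edges[e][1])
                x = ev if x == eu else eu
                c = b if c == a else a
            # Swap colors a and b along the path, freeing color a at V.
            for e in path:
                old = color_of[e]
                del at[("L", edges[e][0])][old]
                del at[("R", edges[e][1])][old]
            for e in path:
                new = b if color_of[e] == a else a
                at[("L", edges[e][0])][new] = e
                at[("R", edges[e][1])][new] = e
                color_of[e] = new
        color_of[i] = a
        at[U][a] = i
        at[V][a] = i
    return color_of
```

In DETOUR's setting the two vertex parts correspond to the x and y dimensions and a color corresponds to a wavelength; the extra difficulty the paper addresses is that two demands sharing a forwarding OCS must not reuse a color.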
3.3 Reconfiguration
4 Evaluation
(1) Simulator: Existing packet-level simulators (e.g., NS2, NS3) are time-
consuming when simulating hundreds to thousands of servers, and we are more
interested in network throughput than in packet-level behavior. There-
fore, we implemented an event-based flow-level simulator to perform simula-
tions at large scale. The simulator takes flows with start time, size, source
server, and destination server as input. When the network status changes
(e.g., flow arrival, flow departure, EPS and OCS reconfiguration), it updates
the rate and remaining size of all active flows. The rate of each active flow
is calculated by the progressive filling algorithm [3], which allocates band-
width satisfying max-min fairness without considering detailed transport-
layer protocol behavior. A flow transmission is finished when the receiver
receives all the data. In this simulator, we also realized a centralized con-
troller, which maintains a global view of the network and manages all the
EPSes and OCSes. It periodically (every 0.1 s in our simulation) predicts the
traffic demand between ToRs and assigns optical wavelengths to meet the
demand. The OCS reconfiguration and controller communication overhead is set
to 10 ms.
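A minimal version of the progressive filling step used by such a flow-level simulator can be sketched as follows; the flow routes and link capacities here are illustrative inputs, not DETOUR's actual topology:

```python
def progressive_filling(flows, capacity, eps=1e-9):
    """Max-min fair rate allocation via progressive filling.
    flows: list of link-id lists (the links each flow traverses; at least one).
    capacity: dict mapping link id -> capacity.
    Returns one rate per flow."""
    rate = [0.0] * len(flows)
    active = set(range(len(flows)))
    residual = dict(capacity)
    while active:
        # Number of active flows crossing each link.
        load = {l: 0 for l in residual}
        for f in active:
            for l in flows[f]:
                load[l] += 1
        # Grow all active rates equally until some link saturates.
        inc = min(residual[l] / load[l] for l in residual if load[l] > 0)
        for f in active:
            rate[f] += inc
        for l in residual:
            residual[l] -= inc * load[l]
        # Flows crossing a saturated link are frozen at their current rate.
        saturated = {l for l in residual if load[l] > 0 and residual[l] <= eps}
        active = {f for f in active if not any(l in saturated for l in flows[f])}
    return rate
```

For example, with links a (capacity 10) and b (capacity 4) and flows over [a], [a, b], and [b], the two flows sharing b are frozen at rate 2 when b saturates, and the first flow then grows to 8, the max-min fair allocation.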
Fig. 7. (a) CDF distribution of FCT and (b) overall average FCT
From the figure, we find that DETOUR reduces the overall average energy con-
sumption by ∼21% and ∼30% compared with OvS and Jellyfish, respectively.
The reason is that flows in DETOUR traverse fewer EPSes and OCSes than in
OvS and Jellyfish.
5 Related Work
Firefly [15] equips ToR EPSes with free-space optics and uses galvo or switch-
able mirrors to dynamically establish optical links. ProjecToR [16] combines
digital micromirror devices (DMDs) and a mirror assembly to construct a
high-fanout free-space topology. However, the beam of FSO is narrow and
susceptible to interference.
6 Conclusion
We presented DETOUR, a large-scale non-blocking optical data center fabric,
which supports up to 700+ racks and 69K+ servers. We designed a recursive
wavelength assignment algorithm based on the architecture of DETOUR, and we
implemented a flow-level simulator realizing the control algorithms. Extensive
evaluation results show that DETOUR delivers performance comparable to a
non-blocking switching fabric: it achieves up to 2.14× higher throughput, and
reduces FCT by 34% and energy consumption by 21%, compared with the
state-of-the-art works.
References
1. Cisco DWDM SFP+ module. http://www.cisco.com/c/en/us/products/collateral/
interfaces-modules/dwdm-transceiver-modules/data_sheet_c78-711186.html
2. Cisco ONS15501 erbium doped fiber amplifier. http://www.cisco.com/en/US/
products/hw/optical/ps2011/products_data_sheet09186a008008870d.html
3. Progressive filling algorithm. https://en.wikipedia.org/wiki/Max-min_fairness
4. Al-Fares, M., Loukissas, A., Vahdat, A.: A scalable, commodity data center network
architecture. In: ACM SIGCOMM (2008)
5. Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., Vahdat, A.: Hedera:
dynamic flow scheduling for data center networks. In: NSDI (2010)
6. Bao, J., Dong, D., Zhao, B., Luo, Z., Wu, C., Gong, Z.: FlyCast: free-space optics
accelerating multicast communications in physical layer. In: ACM SIGCOMM
(2015)
7. Barker, K.J., Benner, A., Hoare, R., Hoisie, A., Jones, A.K., Kerbyson, D.K., Li,
D., Melhem, R., Rajamony, R., Schenfeld, E., et al.: On the feasibility of optical
circuit switching for high performance computing systems. In: IEEE SC (2005)
8. Chen, K., Singla, A., Singh, A., Ramachandran, K., Xu, L., Zhang, Y., Wen, X.,
Chen, Y.: OSA: an optical switching architecture for data center networks with
unprecedented flexibility. In: NSDI (2012)
9. Chen, K., Wen, X., Ma, X., Chen, Y., Xia, Y., Hu, C., Dong, Q.: WaveCube:
a scalable, fault-tolerant, high-performance optical data center architecture. In:
IEEE INFOCOM (2015)
10. Chen, L., Chen, K., Zhu, Z., Yu, M., Porter, G., Qiao, C., Zhong, S.: Enabling
wide-spread communications on optical fabric with megaswitch. In: NSDI 2017,
Boston, MA, pp. 577–593 (2017)
11. Chowdhury, M., Stoica, I.: Coflow: a networking abstraction for cluster applica-
tions. In: ACM HotNets (2012)
12. Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V.,
Fainman, Y., Papen, G., Vahdat, A.: Helios: a hybrid electrical/optical switch
architecture for modular data centers. In: ACM SIGCOMM (2010)
13. Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., Lu, S.:
BCube: a high performance, server-centric network architecture for modular data
centers. In: ACM SIGCOMM (2009)
14. Halperin, D., Kandula, S., Padhye, J., Bahl, P., Wetherall, D.: Augmenting data
center networks with multi-gigabit wireless links. In: ACM SIGCOMM (2011)
15. Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S.R., Longtin, J.P., Shah,
H., Tanwer, A.: Firefly: a reconfigurable wireless data center fabric using free-space
optics. In: ACM SIGCOMM (2015)
16. Ghobadi, M., Mahajan, R., Phanishayee, A., Devanur, N.R., Ranade, G.,
Kulkarni, J., et al.: ProjecToR: agile reconfigurable data center interconnect.
In: ACM SIGCOMM (2016)
17. Porter, G., Strong, R., Farrington, N., Forencich, A., Chen-Sun, P., Rosing, T.,
Fainman, Y., Papen, G., Vahdat, A.: Integrating microsecond circuit switching
into the data center. In: ACM SIGCOMM (2013)
18. Roy, A., Zeng, H., Bagga, J., Porter, G., Snoeren, A.C.: Inside the social network’s
(datacenter) network. In: ACM SIGCOMM (2015)
19. Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R.,
Boving, S., Desai, G., Felderman, B., Germano, P., et al.: Jupiter rising: a decade
of clos topologies and centralized control in Google’s datacenter network. In: ACM
SIGCOMM (2015)
20. Singla, A., Hong, C.Y., Popa, L., Godfrey, P.B.: Jellyfish: networking data centers
randomly. In: NSDI (2012)
21. Wang, G., Andersen, D.G., Kaminsky, M., Papagiannaki, K., Ng, T., Kozuch, M.,
Ryan, M.: c-Through: part-time optics in data centers. In: ACM SIGCOMM (2010)
22. Wang, H., Chen, L., Chen, K., Li, Z., Zhang, Y., Guan, H., Qi, Z., Li, D., Geng,
Y.: Flowprophet: generic and accurate traffic prediction for data-parallel cluster
computing. In: IEEE ICDCS (2015)
23. Wikipedia: König’s theorem (graph theory) – wikipedia, the free encyclopedia
(2015)
24. Zhou, X., Zhang, Z., Zhu, Y., Li, Y., Kumar, S., Vahdat, A., Zhao, B.Y., Zheng,
H.: Mirror mirror on the ceiling: flexible wireless links for data centers. In: ACM
SIGCOMM (2012)
25. Zhu, Z., Zhong, S., Chen, L., Chen, K.: Fully programmable and scalable optical
switching fabric for petabyte data center. Opt. Express 23(3), 3563–3580 (2015)
Querying Large Scientific Data Sets
with Adaptable IO System ADIOS
1 Introduction
Modern scientific experiments such as large accelerators rely heavily on high-
performance simulations for design, calibration and data analysis [13,24]. These
simulation programs typically need to read and write a vast amount of data,
for example to read the definition of the complex geometry of an accelerator
design, to checkpoint the state of the simulation, and to produce analysis output
[23]. The output from these simulations is used to understand the experimental
observations and to guide the next experiment. Often, the critical information is
only a small fraction of a large data collection. Reading and writing the necessary
data records efficiently is the challenge we address in this work.
The rights of this work are transferred to the extent transferable according to
Title 17 U.S.C. § 105.
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 51–69, 2018.
https://doi.org/10.1007/978-3-319-69953-0_4
52 J. Gu et al.
2 Related Work
In commercial applications, large datasets are typically managed by a data man-
agement system [15,18,25]. These systems take control of the user data and pro-
vide high-level languages for analysis tasks. In contrast, most scientific projects
store their data as files and use the file systems as the primary tools for data
management [24]. This file-based approach allows users full control of their
data and their analysis tasks; however, it also requires users to spend much
more time managing their data than they would with a data management system.
In this work,
we combine a number of techniques to reduce the data management time, more
specifically the time to select a relatively small fraction of the data records. In
this section, we briefly review the key technologies involved.
For example, ADIOS accepts an XML configuration file for users to describe the
variables, their types, and the path to take from memory to disk. This capability
allows the users to change how they process the data without changing the sim-
ulation program. This approach gives a level of adaptability that no other IO
system could match. A special feature created by this flexibility is the in situ
processing capability to be described next. To effectively support the query-
ing capability over ADIOS files, we also utilize this in situ capability to create
indexes, which reduces the effort required to generate indexes and ensures the
indexes are available as soon as the data is available.
To effectively support the queries, the system needs to create indexes [12],
such as B-Trees [6], bitmap indexes [26], and hashing [28]. Because scientific
data collections are typically analyzed without modification (or with
infrequent modifications), we concentrate on indexing techniques designed
for query-intensive workloads. Queries on scientific data typically return
a number of data records instead of a single record. Additionally, users
often explore a large variety of combinations of query conditions. From the
research literature, we see that bitmap indexes satisfy these requirements.
To support the newly designed ADIOS query interface, we choose an open-source
bitmap index library named FastBit [26]. At the same time, we are also
exploring additional indexing techniques that might be better suited for
ADIOS [27].
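The appeal of bitmap indexes for such workloads is that a query becomes bitwise operations over precomputed bitmaps, one per distinct value (FastBit adds compression and binning on top of this). A minimal uncompressed sketch:

```python
def build_bitmap_index(values):
    """One bitmap per distinct value, stored as a Python int used as a
    bitmask: bit i is set iff row i holds that value."""
    index = {}
    for i, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << i)
    return index

def query_rows(index, predicate):
    """OR together the bitmaps of all values satisfying the predicate and
    return the matching row ids."""
    mask = 0
    for v, bitmap in index.items():
        if predicate(v):
            mask |= bitmap
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

Range conditions and combinations of conditions reduce to further ANDs and ORs of the same bitmaps, which is why bitmap indexes suit query-intensive, read-mostly scientific data.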
3 ADIOS Overview
In the next three sections, we describe our work on ADIOS to address various
IO challenges. We start with the basic bulk IO operation of checkpointing, then
move on to in situ processing and querying in the next two sections.
ADIOS is known for its simple API and high performance. The core insight
guiding the ADIOS design is to separate the description of IO operations from
the IO strategies employed for the actual lower level operations. This allows
the application programming interface (API) to only describe what variables
to read or write, while leaving the responsibility of selecting the actual trans-
port operations to the ADIOS system. In particular, ADIOS has implemented
a variety of transport mechanisms [16]. Its ability to seamlessly select the best
transport mechanism is also at the root of its support for in situ operations.
Other important factors contributing to the high performance include a
log-based file format, buffered writing, subfiling, asynchronous transport
operations, and so on [16].
ADIOS was designed in 2005 to reduce the IO time of a number of mission-
critical applications [17]. Since then, ADIOS has been the leading software
system for in situ data processing on many of the largest high-performance
computers.
Some of the early success stories include improving the IO rate of S3D check-
point operation by more than a factor of 10 from 3 GB/s to over 30 GB/s [16].
The developers of ADIOS have published a number of studies showing the dra-
matic improvement of IO performance for various applications. Next, we add
our experience with the IMPACT code.
IMPACT employs the particle-in-cell paradigm to model the dynamics of
particles. Each particle has immutable properties such as rest mass and charge,
and dynamic properties such as position and momentum, recorded as x, y, z, px ,
py , and pz . IMPACT produces two types of output for analyses: checkpoint files
and particle statistics. We describe our work on checkpointing in this section and
the work on utilizing in situ processing to accelerate the production of particle
statistics in the next section.
Because the particles on each processor are independent of those on other
processors, IMPACT produces its checkpoint files by writing one file per
processor. This option has the advantage of minimizing the coordination needed
among the processors and can significantly reduce the time spent on IO
operations.
ADIOS offers a variety of IO options, and the parallel file system (Lustre)
additionally offers a number of file system parameters; all of these parameters
can affect IO performance. Instead of providing an exhaustive exploration of
these parameters, in Fig. 1 we provide one set of performance measurements to
show that ADIOS is able to support very efficient IO operations. This
particular set of measurements was collected on Edison at NERSC, on a Lustre
file system with 24 OSTs and a peak IO rate of 168 GB/s. To avoid contention
with other active jobs, we only used 16 OSTs for each ADIOS file, which have
a nominal peak IO rate of about 112 GB/s.
Fig. 1. (a) Time (seconds) to write checkpoint files; (b) time (seconds) to read checkpoint files
The write tests were performed with a fixed number of particles on each MPI
process. The IO rates reported in Fig. 1 are computed using the median observed
IO time. The write operations reported in Fig. 1(a) all use 16 OSTs and about
2 million particles per MPI process. At 1024 processes, the average write
speed rises to over 50 GB/s.
In Fig. 1(b), we report the observed performance of reading the different
checkpoint files. Clearly, the number of OSTs used to store the files has a strong
influence on the observed read performance. One important feature we want to
demonstrate is the reading of the same checkpoint files with a different number
of processes. In this particular case, reading the same file with different numbers
of processes took about the same amount of time and produced about the same
aggregate IO speed.
4 In Situ Indexing
The checkpoint files capture the position and momentum of each particle
periodically, but infrequently. To capture more dynamic behavior of the particles,
IMPACT also computes high-level statistics about the particles at a much
higher frequency. However, these statistics are programmed by the developers
of the IMPACT code and are difficult for end users to modify to suit their own
needs. The in situ processing capability of ADIOS offers a flexible mechanism for
introducing such custom statistics. It can also be used to provide asynchronous
computation, including index building, without blocking the main simulation
computation. Next we describe a simple test that computes histograms at every
simulation step to demonstrate this capability of ADIOS.
Using the ADIOS framework, IMPACT sends the positions and momenta
to the libsim system, and a histogram function from VTK is attached to produce
1-D and 2-D histograms for each of the six variables. The histogram functions
are instructed to divide the data records into 100 equal-width bins between the
minimum and maximum values.
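The binning step can be sketched in a few lines of plain Python. This is illustrative only; in the actual workflow the histograms are produced by a VTK function attached through libsim, and the 100-bin default is the only detail taken from the text.

```python
def equal_width_histogram(values, nbins=100):
    """Count records in nbins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0   # guard against constant data
    counts = [0] * nbins
    for v in values:
        # values equal to the maximum are clamped into the last bin
        counts[min(int((v - lo) / width), nbins - 1)] += 1
    return lo, width, counts
```

A 2-D histogram works the same way, with one bin coordinate computed per variable and a 2-D array of counts.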
58 J. Gu et al.
Fig. 2. Histogram of px and pz at time steps 0, 500, 1000, 1500 and 2000 of an IMPACT
run. Histogram of py is similar to that of px .
Fig. 3. Histogram of z and pz at time steps 500, 1000, 1500, and 2000 of an IMPACT
run (time progresses from left to right, top to bottom). The two tall peaks appear to
move to the right indicating the bulk of the particles are moving along the z direction
with increased momentum pz .
Fig. 4. The fraction of total execution time spent on I/O operations: using file I/O vs.
using the ADIOS in situ capability to stage the output data before committing to disk.
to stage the data away from the compute nodes before committing the data to
disk7. We note that the fraction of time spent on I/O operations is dramatically
reduced. More importantly, it is possible to attach an index generation function
and the above-mentioned statistics computation to the in situ workflow without
delaying the main simulation computation.
5 Query API
A common query interface is the web search box in a web browser, where a user
enters a set of keywords to locate relevant pages on the web. A similar interface
for finding interesting data records in large scientific data collections would also
be very useful; however, this functionality is not widely available. An important
reason for this lack of querying functionality is that most scientific datasets are
stored as files. Because POSIX file systems treat a file as a container of bytes,
there is no general way of extracting meaningful data records for querying. The
first step in breaking this limitation is to have a model to describe the data
records. In this work, we use the ADIOS library and follow its array data
model. In the remainder of this section, we describe this data model and
the query use cases. The latest ADIOS release contains the query interface,
and a detailed description of how to use the functions is available in the
user's manual8.
7 This time measurement was obtained with a large XGC simulation running on Titan at ORNL.
8 ADIOS source code and documentation can be found at https://www.olcf.ornl.gov/center-projects/adios/.
Case 1: Regular mesh data, all variables named explicitly. Given a dataset
defined on m dimensions, D1, D2, ..., Dm, the n physical properties, such as
temperature, pressure, and humidity, can be defined as separate m-dimensional
arrays: A1, A2, ..., An. Each of these variables can be thought of as a column
of a relational table and each point of the mesh as a row of the same table.
Given this simple mapping between the multidimensional data model and the
relational data model, we can translate all SQL queries into queries on mesh data.
For example, "select humidity from mesh data where temperature > 280 and
pressure > 100000" selects all mesh points where the temperature and
pressure values satisfy the specified conditions and then outputs the values of
humidity on those mesh points.
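The column/row mapping can be sketched as follows. This is a toy in-memory version: the `select` helper and the flat Python lists are hypothetical; in practice the columns would be m-dimensional arrays read through the ADIOS query interface.

```python
# Each variable is a flat list over the same mesh points: one "column" per
# variable, one "row" per mesh point.
mesh = {
    "temperature": [275.0, 285.0, 290.0, 270.0],
    "pressure":    [99000.0, 101000.0, 102000.0, 98000.0],
    "humidity":    [0.20, 0.55, 0.70, 0.10],
}

def select(data, output, where):
    """Emulate: SELECT output FROM data WHERE where(row)."""
    npoints = len(next(iter(data.values())))
    rows = ({name: col[i] for name, col in data.items()} for i in range(npoints))
    return [row[output] for row in rows if where(row)]

# "select humidity from mesh data where temperature > 280 and pressure > 100000"
result = select(mesh, "humidity",
                lambda r: r["temperature"] > 280 and r["pressure"] > 100000)
```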
Case 2: Using bounding boxes to partition arrays of the same shape and size.
Given a dataset defined on a 3-D mesh of size 10 × 20 × 30, we might divide
this mesh among 8 processors as 2 × 2 × 2 blocks. To accommodate this use case,
we define a set of 8 non-overlapping bounding boxes, one for each of the 8
processors. This allows each processor to answer queries on 1/8th of the
data, corresponding to a mesh of size 5 × 10 × 15.
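A minimal sketch of how such a set of non-overlapping bounding boxes could be computed (pure Python; `partition` is a hypothetical helper, not part of the ADIOS API, and it assumes each dimension divides evenly):

```python
import itertools

def partition(dims, splits):
    """Split an array of shape `dims` into non-overlapping bounding boxes,
    splits[d] equal pieces along dimension d. Each box is (start, count)."""
    boxes = []
    for idx in itertools.product(*(range(s) for s in splits)):
        start = tuple(i * (d // s) for i, d, s in zip(idx, dims, splits))
        count = tuple(d // s for d, s in zip(dims, splits))
        boxes.append((start, count))
    return boxes

# the 10 x 20 x 30 mesh divided 2 x 2 x 2: one 5 x 10 x 15 box per processor
boxes = partition((10, 20, 30), (2, 2, 2))
```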
A query over this structure consists of 3 parts:
Fig. 6. Illustration of a use case with different variables packed as another dimension
of the data array.
Fig. 7. Illustration of a user query involving multiple arrays of different shapes and
sizes.
complex relationship than described above. For example, the values for
temperature, pressure, and humidity could be produced by different measuring
instruments and recorded at different resolutions in space and time, as
illustrated in Fig. 7.
Now if we want to compare values for a particular city, we need to use
different bounding boxes on these arrays. This use case is similar to the previous
one; the key difference is that the array names would be different. Again, the
bounding boxes are required to be of the same size, i.e., to have the same number
of data points.
Reading Multiple Variables. To start with, the current design of the ADIOS
query interface retrieves values from one variable at a time. If a use case requires
multiple output variables, the caller needs to repeat the invocation of the read
function. Introducing a mechanism to specify multiple output variables at once
will increase the likelihood of additional optimization in the implementation.
However, we choose to keep the interface relatively simple so that we can explore
the implementation challenges associated with the basic tasks of integrating
with indexing techniques. This and other performance optimization issues will
be considered in the future.
Expressing Query Conditions. To avoid the need to introduce a query parser,
we have opted to provide a set of functions for users to compose query expressions,
instead of allowing the user to specify the query conditions in string
form, even though the string form is a more common query interface.
This choice also has the benefit of not imposing any restrictions on the variable
names. A typical database management system supports queries in the SQL
language, which imposes a number of restrictions on variable names, such as
not allowing punctuation, which would introduce extra challenges in expressing
the bounding boxes.
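A hedged sketch of what such composition functions might look like. The class and function names here are illustrative, not the actual ADIOS query API; the point is that expressions are built programmatically, so variable names with punctuation need no quoting or escaping.

```python
class Term:
    """One comparison on a named variable, optionally tied to a bounding box."""
    def __init__(self, var, op, value, box=None):
        self.var, self.op, self.value, self.box = var, op, value, box

class Combine:
    """AND/OR combination of previously built expressions."""
    def __init__(self, logic, left, right):
        self.logic, self.left, self.right = logic, left, right

def q_and(a, b): return Combine("and", a, b)
def q_or(a, b):  return Combine("or", a, b)

def evaluate(expr, row):
    """Evaluate an expression tree against one record (dict of values)."""
    if isinstance(expr, Term):
        ops = {">": lambda x, y: x > y, "<": lambda x, y: x < y,
               "=": lambda x, y: x == y}
        return ops[expr.op](row[expr.var], expr.value)
    sub = (evaluate(expr.left, row), evaluate(expr.right, row))
    return all(sub) if expr.logic == "and" else any(sub)

# temperature > 280 AND pressure > 100000, built without a string parser
q = q_and(Term("temperature", ">", 280), Term("pressure", ">", 100000))
```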
6 Query Performance
The naive way of resolving a query would be to read through all data records to
find those satisfying the user-specified conditions. This approach is generally known
as scanning. In this work, we plan to use two different indexing techniques, Fast-
Bit index and block MinMax index, to accelerate the query answering process.
FastBit implements a number of different bitmap indexes that have been
shown to work well for a number of scientific use cases [26]. The block MinMax
index is a structure that keeps the minimum and maximum for each variable in
a data block. It is a mechanism developed to take advantage of the metadata
already captured in the ADIOS BP file format. When processing a user query,
this mechanism first examines each data block's header information to determine,
using the minimum and maximum values, whether there are any possible entries
satisfying the specified conditions. It examines the data values in a block
only if there are possible hits, which allows it to skip some blocks entirely. For
the data blocks with hits, since the minimum unit of an IO operation is a block,
this query answering mechanism reads the minimum number of blocks and
performs the minimum amount of IO operations.
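The block-skipping logic can be sketched as follows (illustrative pure Python; in ADIOS the per-block min/max come from the BP block headers rather than being recomputed, and blocks are read from disk rather than held in memory):

```python
def build_minmax(blocks):
    """Per-block (min, max) metadata, as kept in ADIOS BP block headers."""
    return [(min(b), max(b)) for b in blocks]

def query_blocks(blocks, index, lo, hi):
    """Find values in [lo, hi]; read a block only when its [min, max]
    range overlaps the query range."""
    hits, blocks_read = [], 0
    for bid, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue                      # block skipped: no possible hits
        blocks_read += 1                  # possible hits: scan the block
        hits.extend((bid, i) for i, v in enumerate(blocks[bid]) if lo <= v <= hi)
    return hits, blocks_read
```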
Figure 8 shows the time needed to resolve the queries with three different
mechanisms: scanning the raw data, using FastBit indexes, and using the MinMax
mechanism. We see that using either of the two indexing schemes can dramatically
reduce the query processing time compared to scanning the raw data. Comparing
these two indexing mechanisms, we see that FastBit indexing is typically
faster; however, the FastBit indexes take up far more storage space than the
MinMax mechanism, which can be regarded as using no extra space in the
ADIOS BP format.
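To make the space/time tradeoff concrete, here is a minimal equality-encoded bitmap index in the spirit of FastBit (Python ints serve as bitmaps; this is a sketch, not FastBit itself, which uses compressed bitmaps and several other encodings):

```python
def build_bitmap_index(values, bin_edges):
    """One bitmap per bin; bit i is set if record i falls in that bin."""
    bitmaps = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        b = sum(1 for e in bin_edges if v >= e)  # bin number for record i
        bitmaps[b] |= 1 << i
    return bitmaps

def query_ge(bitmaps, bin_edges, threshold):
    """Rows with value >= threshold (exact when threshold is a bin edge;
    otherwise the boundary bin would need a raw-data check)."""
    first = sum(1 for e in bin_edges if threshold >= e)
    result = 0
    for bm in bitmaps[first:]:
        result |= bm   # OR together the bitmaps of qualifying bins
    return [i for i in range(result.bit_length()) if result >> i & 1]
```

Answering the query is a few bitwise ORs, but the index stores one bitmap per bin, which is why FastBit indexes cost extra space while the MinMax metadata is already in the file.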
Another important observation from Fig. 8 is that the time needed to read the
selected values (i.e., the difference between “FastBit + read” and “FastBit”) is
significantly longer than the time needed to resolve the query with either of the
indexing techniques. This is because the query results are typically randomly
scattered in the data file, and extracting these randomly scattered values takes
a long time. Often, reading a relatively small number of bounding boxes that
encompass the randomly scattered points, and then extracting the selected values
from those boxes, can reduce the overall time needed to extract these values.
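One way to form such bounding ranges is to coalesce the sorted hit positions while tolerating small gaps (a hypothetical helper; the gap threshold would be tuned to the relative cost of extra sequential reads versus additional seeks):

```python
def coalesce(hit_indices, gap=8):
    """Merge sorted hit positions into contiguous [begin, end) read ranges,
    tolerating up to `gap` unselected records between consecutive hits."""
    ranges = []
    for i in sorted(hit_indices):
        if ranges and i - ranges[-1][1] <= gap:
            ranges[-1][1] = i + 1          # extend the current range
        else:
            ranges.append([i, i + 1])      # start a new range
    return [tuple(r) for r in ranges]
```

Each resulting range becomes one sequential read; the selected values are then filtered out of the (slightly larger) buffers in memory.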
Fig. 8. Query processing time with a set of S3D data (total number of records is
1.67 × 109 , organized into a 3D array of 1100 × 1080 × 1408).
Fig. 9. Query processing time changes dramatically with the number of blocks.
When working with a large dataset, we typically employ multiple CPUs and
process each data block independently on a CPU core. However, the query
processing time can be dramatically affected by the number of blocks used
to generate the indexes. Figure 9 shows the query processing time of a small
number of queries when the FastBit indexes are generated on different numbers
of blocks. Clearly, the more blocks are used, the longer it takes to resolve a query.
This is largely because of the extra work needed to process each index block. On the
other hand, using more processors can significantly reduce the query response
time, as shown in Fig. 10. Additional studies are needed to further optimize these
and other parameters affecting the performance of the indexing and query
processing techniques [27].
Fig. 10. Using more processors reduces the query processing time.
7 Summary
This work reports our experience in designing and implementing a query inter-
face for ADIOS. We explored a number of different indexing data structures
for supporting such a query interface. We observe that for queries that select
a relatively small fraction of the total number of records in a dataset, answering a
query with these indexing methods can be dramatically faster than reading
the whole dataset and then filtering the data records in memory.
In addition to using external indexing libraries, ADIOS also implements a
block MinMax mechanism to take advantage of the built-in blocking structure.
Tests show that it has the potential to significantly accelerate the query answer-
ing process. One challenge we have noticed is that the number of blocks has a
strong impact on the overall system performance. We have started exploring pos-
sible options to select this and other parameters affecting the query processing
time [27].
This work also demonstrates two useful capabilities of ADIOS in improving
the IO pipeline of a simulation program called IMPACT: checkpointing and cus-
tomizing analysis. In reading and writing of checkpoint files, ADIOS allows the
user to manage the IO operations more efficiently. We rely on the in situ pro-
cessing capability of ADIOS to enable IMPACT users to customize the particle
statistics during the simulation process.
The in situ mechanism is also used to generate the indexes needed to accel-
erate the query processing algorithms, without increasing the elapsed time used
by the simulation programs. It allows the indexes to be generated when the data
file is generated, which means the indexes are available when the data is ready.
This is very convenient for the users.
In the future, we plan to more fully explore the two capabilities described
above. In addition, we plan to compare the query capability with well-known
systems such as RasdaMan [2] and SciDB [3].
References
1. Bauer, A.C., Abbasi, H., Ahrens, J., Childs, H., Geveci, B., Klasky, S., Moreland,
K., O’Leary, P., Vishwanath, V., Whitlock, B., Bethel, E.W.: In situ methods,
infrastructures, and applications on high performance computing platforms. Com-
put. Graph. Forum 35(3), 577–597 (2016)
2. Baumann, P.: rasdaman - raster data manager, January 2018. rasdaman.org
3. Brown, P.G.: Overview of SciDB: large scale array storage, processing and analysis.
In: Proceedings of the 2010 ACM SIGMOD International Conference on Manage-
ment of Data, SIGMOD 2010, pp. 963–968. ACM, New York (2010)
4. Chen, J.H., Choudhary, A., De Supinski, B., DeVries, M., Hawkes, E.R., Klasky,
S., Liao, W.-K., Ma, K.-L., Mellor-Crummey, J., Podhorszki, N., et al.: Terascale
direct numerical simulations of turbulent combustion using S3D. Comput. Sci.
Discov. 2(1), 015001 (2009)
5. Chou, J., Wu, K., Rübel, O., Howison, M., Qiang, J., Prabhat, Austin, B., Bethel,
E.W., Ryne, R.D., Shoshani, A.: Parallel index and query for large scale data
analysis. In: SC11 (2011)
6. Comer, D.: The ubiquitous B-tree. Comput. Surv. 11(2), 121–137 (1979)
7. Dong, B., Byna, S., Wu, K.: Expediting scientific data analysis with reorgani-
zation of data. In: 2013 IEEE International Conference on Cluster Computing
(CLUSTER), pp. 1–8, September 2013. http://ieeexplore.ieee.org/xpls/abs all.jsp?
arnumber=6702675
8. Dong, B., Byna, S., Wu, K.: SDS: a framework for scientific data services. In:
Proceedings of the 8th Parallel Data Storage Workshop (2013). http://www.pdsw.
org/pdsw13/papers/p27-pdsw13-dong.pdf
9. Dong, B., Byna, S., Wu, K., Prabhat, Johansen, H., Johnson, J.N., Keen, N.: Data
elevator: low-contention data movement in hierarchical storage system. In: 2016
IEEE 23rd International Conference on High Performance Computing (HiPC), pp.
152–161, December 2016
10. Folk, M., Heber, G., Koziol, Q., Pourmal, E., Robinson, D.: An overview of the
HDF5 technology suite and its applications. In: Proceedings of the EDBT/ICDT
2011 Workshop on Array Databases, pp. 36–47. ACM (2011). http://www.
hdfgroup.org/HDF5/
11. Gosink, L., Shalf, J., Stockinger, K., Wu, K., Bethel, W.: HDF5-FastQuery: acceler-
ating complex queries on HDF datasets using fast bitmap indices. In: SSDBM 2006,
Vienna, Austria, July 2006, pp. 149–158. IEEE Computer Society Press (2006)
12. Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Surv.
(CSUR) 25(2), 73–169 (1993)
13. Hey, T., Tansley, S., Tolle, K. (eds.): The Fourth Paradigm: Data-Intensive Scien-
tific Discovery. Microsoft, Redmond (2009)
14. Im, H.G., Chen, J.H., Law, C.K.: Ignition of hydrogen/air mixing layer in tur-
bulent flows. In: Twenty-Seventh Symposium (International) on Combustion, The
Combustion Institute, pp. 1047–1056 (1998)
15. Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing
with mapreduce: a survey. ACM SIGMOD Record 40(4), 11–20 (2012)
16. Liu, Q., Logan, J., Tian, Y., Abbasi, H., Podhorszki, N., Choi, J.Y., Klasky, S.,
Tchoua, R., Lofstead, J., Oldfield, R., Parashar, M., Samatova, N., Schwan, K.,
Shoshani, A., Wolf, M., Wu, K., Yu, W.: Hello ADIOS: the challenges and lessons
of developing leadership class I/O frameworks. Concurr. Comput. Pract. Exp. 26,
1453–1473 (2014). https://www.olcf.ornl.gov/center-projects/adios/
17. Lofstead, J.F., Klasky, S., Schwan, K., Podhorszki, N., Jin, C.: Flexible IO and
integration for scientific codes through the adaptable IO system (ADIOS). In:
CLADE 2008, pp. 15–24. ACM, New York (2008)
18. Ozsu, M.T.: Principles of Distributed Database Systems, 3rd edn. Prentice Hall
Press, Upper Saddle River (2007)
19. Qiang, J., Lidia, S., Ryne, R.D., Limborg-Deprey, C.: Three-dimensional qua-
sistatic model for high brightness beam dynamics simulation. Phys. Rev. Spec.
Topics-Accel. Beams 9(4), 044204 (2006)
20. Qiang, J., Ryne, R.D., Habib, S., Decyk, V.: An object-oriented parallel particle-
in-cell code for beam dynamics simulation in linear accelerators. J. Comput. Phys.
163(2), 434–451 (2000)
21. Qiang, J., Ryne, R.D., Venturini, M., Zholents, A.A., Pogorelov, I.V.: High resolu-
tion simulation of beam dynamics in electron linacs for X-ray free electron lasers.
Phys. Rev. ST Accel. Beams 12, 100702 (2009)
22. Rew, R., Davis, G.: NetCDF: an interface for scientific data access. IEEE Com-
put. Graphics Appl. 10(4), 76–82 (1990). http://www.unidata.ucar.edu/software/
netcdf/
23. Roman, E.: A survey of checkpoint/restart implementations. Technical report,
Lawrence Berkeley National Laboratory (2002)
24. Shoshani, A., Rotem, D. (eds.): Scientific Data Management: Challenges, Technol-
ogy, and Deployment. Chapman & Hall/CRC Press, Boca Raton (2010)
25. White, T.: Hadoop - The Definitive Guide: MapReduce for the Cloud. O’Reilly,
Sebastopol (2009)
26. Wu, K., Ahern, S., Bethel, E.W., Chen, J., Childs, H., Cormier-Michel, E., Geddes,
C., Gu, J., Hagen, H., Hamann, B., Koegler, W., Lauret, J., Meredith, J., Messmer,
P., Otoo, E., Perevoztchikov, V., Poskanzer, A., Prabhat, Rübel, O., Shoshani, A.,
Sim, A., Stockinger, K., Weber, G., Zhang, W.-M.: FastBit: interactively searching
massive data. In: SciDAC 2009. LBNL-2164E (2009)
27. Wu, T., Chou, J., Podhorszki, N., Gu, J., Tian, Y., Klasky, S., Wu, K.: Apply
block index technique to scientific data analysis and I/O systems. In: Proceedings
of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing, CCGrid 2017, pp. 865–871. IEEE Press, Piscataway, May 2017
28. Zhang, H., Wen, Y., Xie, H., Yu, N.: Distributed Hash Table: Theory, Platforms
and Applications. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-
9008-1
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
On the Performance of Spark on HPC
Systems: Towards a Complete Picture
1 Introduction
Data is a driving power in almost every aspect of our lives, and thus large amounts
of data are generated every day. For instance, the International Data Research report [6]
estimates that the global data volume subject to data analysis will grow by a
factor of 50 to reach 5.2 zettabytes in 2025. This huge growth in data volumes,
the deluge of Big Data, poses a big challenge in managing, processing, and
analyzing these gigantic data volumes.
To benefit from this huge amount of data, different data processing models
have emerged [13,20]. Among these models, MapReduce [13,23] has stood out as
the most powerful Big Data processing model, in particular for batch processing.
MapReduce, and its open-source implementation Hadoop [3], is adopted in both
industry and academia due to its simplicity, transparent fault tolerance and scala-
bility. For instance, Yahoo! claimed to have the world’s largest Hadoop cluster [7]
with more than 100000 CPUs in over 40000 machines running MapReduce jobs.
c The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 70–89, 2018.
https://doi.org/10.1007/978-3-319-69953-0_5
from the high performance offered by these systems due to their large sizes and
shared architecture [16,18,25,33].
In response, several efforts have been conducted to leverage Spark for fast
Big Data processing on HPC systems. These works have mainly tried to allevi-
ate the high latency problem by focusing on the intermediate data storage (i.e.,
map output for batch jobs and temporary output produced between stages for
iterative jobs) [12,21,28,32,34]. For example, Islam et al. [21] utilized NVRAM
as an intermediate storage layer (i.e., burst buffer) between compute nodes and
Lustre file system [14]. This brought a 24% improvement to the application
performance by reducing the latency when reading/writing the intermediate data.
However, Big Data applications mostly run in batches, and there is continuous
interaction with the parallel file system for reading the input data and writing
the output data; thus, it is important to study the impact of latency on the
performance of Big Data applications by considering their different phases:
input, intermediate, and output data. Moreover, none of these efforts considered
the contention problem, which can contribute to a significant performance
degradation of up to 2.5x [33]. Hence, as we argue in this
paper, current efforts and solutions to adopt Spark on HPC systems may fail in
practice to achieve the desired performance and this may hinder such adoption.
Our Contributions. In an effort to complement existing efforts on understand-
ing the performance of Big Data applications on HPC systems, in this paper, we
perform an experimental study characterizing the performance of Spark [36] on
HPC systems. We use representative Big Data workloads on the Grid’5000 [22]
testbed to evaluate how the latency, contention, and file system’s configuration
impact the application performance. We make the following contributions:
2 Methodology
We conducted a series of experiments in order to assess the impact of the poten-
tial issues regarding HPC systems (i.e., latency, contention, file system’s con-
figuration) on the performance of Big Data applications. We further describe
the experimental environment: the platform, deployment setup, and Big Data
workloads.
2.3 Workloads
For Sort and Wordcount workloads, we used 200 GB input data set generated
with RandomTextWriter in the HiBench suite. For the PageRank workload, we also
used the HiBench suite, which uses data generated from Web data with 25 million
edges as the input data set.
3 Experimental Results
First, we try to understand the impact of the data location on the application
performance. While storage resources are co-located with Spark tasks under the
data-centric paradigm (i.e., when using Spark with HDFS), Spark tasks need to
communicate with the parallel file system either to fetch the input data or to
write the output data under the compute-centric paradigm (i.e., when Spark is
using PVFS as the storage space). This remote data access results in a higher
latency compared to the data-centric paradigm which leverages data locality (i.e.,
executing tasks on the machines where the input data resides). Figure 2 shows
how latency can affect the application performance. Note that intermediate data
is stored locally in the aforementioned settings for Spark in order to focus on
the latency resulting from reading the input data in the map phase. We explore the
intermediate data storage separately in the next subsection.
Fig. 2. Performance of Big Data workloads on Spark under data-centric and compute-
centric paradigms.
Figure 2(a) displays the execution time of the Wordcount workload for both
paradigms, with a breakdown of the performance in the map and reduce phases. Overall, Wordcount
performs 1.9x worse under the compute-centric paradigm compared to the data-
centric one. When we look at the performance in each phase, we observe that
the performance degradation contributed by the map phase (2.3x) is higher
compared to the reduce phase. This stems from the fact that Wordcount has a
light reduce phase and generates only a small amount of output data.
Similarly, in Fig. 2(b) we observe that the data-centric configuration out-
performs the compute-centric one by 4.9x for the Sort workload. In contrast
to Wordcount, the reduce phase is the major contributor to the performance
degradation. For the Sort workload, the amount of the output data is equal to
the input data thus it suffers from a higher latency in the reduce phase as data
is written to the parallel file system. As a result, having a higher latency on
both input and output phases led to higher performance degradation for the
compute-centric paradigm.
Lastly, we ran the PageRank workload in both settings for Spark, and Fig. 2(c)
shows the results. Here, the performance degradation with the compute-centric
paradigm is only 26%. The reason behind this is that the I/O phases of the
PageRank workload (i.e., Stage 0 and Stage 5, denoted S0 and S5) account for a
small fraction of the PageRank execution time, and Spark computes the iterations
(i.e., Stages 1, 2, 3, and 4) locally.
The Impact of the Input Data Sizes. We also investigated the impact
of the input data size on the application performance. To do so, we ran the
Wordcount workload with three input sizes: 2 GB, 20 GB, and 200 GB.
Figure 3 displays the performance of the Wordcount workload in each phase for
data and compute-centric paradigms. Overall, we observe that the impact of I/O
latency is only visible in the map phase for the compute-centric paradigm with
increasing input sizes: there is a performance degradation for the map phase by
1.2x, 1.8x and 2.3x with 2 GB, 20 GB and 200 GB input sizes, respectively.
This is mainly due to the fact that Wordcount is a map-heavy workload which
generates a small amount of output data, and therefore the reduce phase results do
not vary significantly with respect to different data sizes. To further investigate
76 O. Yildiz and S. Ibrahim
[Fig. 3 plots: time (s) for the Total, Map, and Reduce phases, data-centric vs. compute-centric, at each input size]
[Fig. 4 plots: CDF of map task durations, data-centric vs. compute-centric]
[Fig. 5 plots: CDF of reduce task durations, data-centric vs. compute-centric]
these different behaviors in map and reduce phases, we display the CDF of map
and reduce task durations in Figs. 4 and 5.
Interestingly, Fig. 4(a) shows that some map task durations are smaller for
the compute-centric paradigm than for the data-centric one. This is due to
the fact that Spark employs delay scheduling [35] to increase the chances of a
map task being launched locally under the data-centric paradigm. This delay in
launching the map tasks, which results in a performance degradation for
jobs with small input data sizes, is due to the default Spark configuration for
the maximum waiting time (i.e., 3 s) when scheduling the map tasks. This applies
only to the data-centric paradigm, since there is no data locality objective when
scheduling tasks in the compute-centric paradigm, where all the machines
have an equivalent distance to the parallel file system. On the other hand, we
observe an increase in the map task durations with larger input sizes for the
compute-centric paradigm. This results from the higher latency of fetching
the input data from the parallel file system with larger input sizes.
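The delay-scheduling decision described above can be sketched as follows (illustrative pure Python; only the 3 s default wait is taken from the text, and Spark's real scheduler distinguishes several locality levels rather than the two shown here):

```python
LOCALITY_WAIT = 3.0  # seconds; default maximum wait before giving up locality

def choose_node(task_pref_nodes, free_nodes, wait_so_far):
    """Prefer a free node holding the task's input; fall back to any free
    node only once the task has waited longer than LOCALITY_WAIT."""
    local = [n for n in free_nodes if n in task_pref_nodes]
    if local:
        return local[0], "NODE_LOCAL"
    if wait_so_far >= LOCALITY_WAIT:
        return free_nodes[0], "ANY"       # give up on locality
    return None, "WAIT"                   # keep the task queued
```

Under the compute-centric paradigm there are no preferred nodes, so tasks launch immediately; under the data-centric paradigm short jobs can pay the waiting penalty.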
Another interesting trend we observe is that the maximum map task duration
also increases with increasing data sizes, especially with the 200 GB input data
size in Fig. 4(c). We believe that this behavior is due to the higher contention
with the increased number of concurrent map tasks. It is important to note
that there are 33, 594, and 800 concurrent map tasks with the 2 GB, 20 GB, and
200 GB input sizes, respectively. Moreover, we see that this increase is much higher with
the compute-centric paradigm, which highlights the severity of the contention
problem for this paradigm. We will further explain the impact of contention
on the application performance in Sect. 3.2.
In Fig. 5, we observe a similar trend in the reduce task durations for the
compute-centric paradigm. With larger data sizes, we observe an increase in
those durations too. This again stems from the increased amount of remote
data transfer when writing the reducer outputs to the parallel file system. Moreover,
we discover that there is high performance variability in the reduce phase,
and the maximum task duration is quite high even with the 2 GB data size. This
is due to the static Spark configuration, which employs 800 reducers regardless
of the input data size. This high number of reducers overloads the parallel file
system and results in this performance variability. Hence, we do not see the
impact of latency in Fig. 3 for the reduce phase. However, when the output data
size is large enough, as shown for the Sort workload in the previous experiment
(Fig. 2(b)), the impact of the I/O latency is quite clear, as it results in a significant
performance degradation.
For the data-centric paradigm, this time we see that the reduce task durations
are in line with the data sizes, unlike in the map phase. While for the map
phase there is an increase in the maximum task duration due to the increased
number of concurrent map tasks, for the reduce phase the number of reduce
tasks is fixed, and the increase in the reduce task durations is mainly due to the
increased amount of reducer output with larger input sizes.
Fig. 6. Impact of the location of intermediate data on the performance of the Sort
workload.
phase, we observe that there is an 8% increase in the completion time due to the
additional I/O latency when fetching the intermediate data from PVFS.
Findings. In all of the workloads, we observe that the remote data access to the
parallel file system leads to a significant performance degradation, especially for
the input and output data. We also confirm that the degree of this performance
degradation depends on the characteristics of the workloads and on the input
data size.
Measuring the contention when running concurrent Big Data applications. Since
the storage system is shared by all the nodes, this can create a serious contention
problem on the storage path, including the network, servers and storage devices. Here,
we ran two Wordcount workloads concurrently under the compute and data-centric
paradigms by employing the Fair scheduler in Spark. The Fair scheduler allows
these workloads to have an equal share of the resources in the Spark cluster (i.e.,
each workload employs 400 tasks, which is half of the cluster capacity).
Figure 7 displays the execution times of the Wordcount workload when it runs
alone and together with the other identical Wordcount workload for data and
compute-centric paradigms. As shown in Fig. 7(a), the performance degradation
when running in contention with the other Wordcount workload is negligible
with the data-centric paradigm. In contrast, we observe that there is a 41%
performance degradation with the compute-centric paradigm when the two work-
loads run concurrently. This stems from the two workloads sharing the same
parallel file system under the compute-centric paradigm, whereas they perform their
I/O operations on individual storage devices in the data-centric paradigm.
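Fair scheduling is enabled in Spark through the `spark.scheduler.mode` property. A pure-Python sketch of the settings and the resulting fair-share arithmetic (the executor counts are illustrative assumptions, not the exact cluster configuration used here):

```python
# Standard Spark property names; values chosen to model a cluster with
# 800 task slots shared fairly between two concurrent workloads.
fair_conf = {
    "spark.scheduler.mode": "FAIR",     # default is FIFO
    "spark.executor.instances": "100",  # assumed executor count
    "spark.executor.cores": "8",        # 100 * 8 = 800 task slots
}

slots = (int(fair_conf["spark.executor.instances"])
         * int(fair_conf["spark.executor.cores"]))
per_workload = slots // 2  # equal share for each of the two workloads
print(per_workload)        # -> 400
```

Under FIFO (the default), the first submitted workload would instead occupy all 800 slots until it finishes.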
On the Performance of Spark on HPC Systems: Towards a Complete Picture 79
[Fig. 7: execution times (s) of the Total, Map, and Reduce phases of the Wordcount workload, alone vs. interfering, for (a) the data-centric and (b) the compute-centric paradigm.]
[Bar chart: execution times (s) of the Total, Map, and Reduce phases of the Wordcount workload, alone vs. with IOR.]
Fig. 8. Performance of the Wordcount workload when running alone and together with
IOR workload.
Measuring the contention when co-locating HPC and Big Data applications. This
contention problem can become even more significant when we consider the ulti-
mate objective of HPC and Big Data convergence, which is co-locating sci-
entific and Big Data applications on the same platform. To emulate this objective,
we ran the Wordcount workload alone and together with the IOR workload.
IOR [29] is a popular I/O benchmark that allows users to specify different I/O
configurations and thus measure the I/O performance of HPC systems. For the IOR
workload, we employed 224 processes (on a set of nodes separate from
the ones running the Wordcount workload), where each process issues a 512 MB
write request in 32 MB chunks. Figure 8 shows the execution times of the
Wordcount workload for both cases. Due to resource sharing (file system and
network) with the IOR workload, there is a 1.4x performance degradation in the
total execution time of the Wordcount workload. When we look at the perfor-
mance in each phase, we observe that this performance degradation is mainly due
to the reduce phase. This stems from the fact that the reduce phase performs write
operations, as does the IOR workload, resulting in write/write contention.
80 O. Yildiz and S. Ibrahim
Findings. We demonstrate that contention appears as a limiting factor for Big
Data applications on HPC systems due to employing a shared storage system.
[Bar chart: execution times (s) of the Total, Map, and Reduce phases of the Sort workload with PVFS synchronization off (Sync OFF) vs. on (Sync ON).]
Table 1. Execution time of the Sort workload and its phases under different configu-
rations of PVFS.
Findings. Parallel file systems are equipped with several features that are
important for HPC applications (e.g., the synchronization feature in PVFS to provide
resiliency, and the distributed locking mechanism in Lustre to ensure file consistency).
However, as we demonstrated in our experiments and as reported earlier in
[12,32], these features may bring a significant performance degradation for Big
Data applications.
[Bar chart: execution times (s) of the Total, Map, and Reduce phases with a 40 GB vs. a 10 GB burst buffer memory capacity.]
Fig. 10. Impact of the memory capacity of the burst buffer on the performance of the
Wordcount workload.
Measuring the impact of the deployment location of the burst buffer. We ran the
Wordcount workload with the same configuration as in the previous experiment
and deployed the burst buffer in two scenarios: in the first, the burst buffer
is deployed on a disjoint set of nodes, and in the second, it is located on
a subset of the compute cluster. Figure 11 shows that Wordcount performs
better when the burst buffer is deployed as a separate set of nodes. We hypothesize
the following explanation: when the burst buffer uses a subset of the nodes
of the compute cluster, I/O and compute tasks on those nodes conflict with each
other, resulting in a significant performance degradation (38% slowdown).
This is in line with the observations reported in [9].
Findings. Our experiments show that the storage capacity and the location
of burst buffers can have a significant impact on the performance of Big Data
applications. With limited storage capacity, we demonstrate that burst buffers
cannot fully mitigate the latency problem, since compute nodes still need to fetch
most of the data from the parallel file system. For the deployment location, we
[Bar chart: execution times (s) of the Total, Map, and Reduce phases with the burst buffer deployed on a disjoint set of nodes vs. a subset of the compute nodes.]
Fig. 11. Impact of the location of the burst buffer on the performance of the Wordcount
workload.
observe that co-locating the burst buffer and compute resources on the same
node may be inappropriate due to the possible interference between them.
5 Related Work
Several research efforts have been conducted to evaluate the performance of
Big Data analytics frameworks on HPC systems. Wang et al. [32] performed an
experimental study in which they investigated the characteristics of Spark on an
HPC system, with a special focus on the impact of the storage architecture and
locality-oriented task scheduling. Tous et al. [31] evaluated the Spark perfor-
mance on the MareNostrum supercomputer. In particular, they studied the impact
of different Spark configurations on the performance of Sort and K-means appli-
cations. In [30], the authors compared the performance of MapReduce applica-
tions on the PVFS and HDFS file systems using the Hadoop framework and gave
insights into how to emulate HDFS behavior by using PVFS. Li and Shen [24]
compared the performance of MapReduce applications on scale-up and scale-out
clusters and proposed a hybrid scale-up/out Hadoop architecture based on their
findings.
The aforementioned studies provide useful findings toward leveraging HPC sys-
tems for Big Data processing. However, they do not offer a complete analysis
of the potential performance issues (e.g., latency and contention). For the latency
problem, most of the studies focus on the intermediate data storage and ignore
the latencies which can occur in other I/O phases. We provide a detailed analysis
of the impact of latency on the application performance by giving a breakdown
of the latency problem into its different phases (i.e., input, intermediate and
output data). Although these studies mention contention as a problem, none
of them investigates its impact on the application performance. Hence, we aim
to complement those studies by providing a detailed analysis of the impact of
latency and contention on the performance of Spark applications. Furthermore,
we show potential performance issues specific to different PVFS configurations.
Some works have proposed the adoption of burst buffers for efficient Big Data process-
ing on HPC systems. Chaimov et al. [12] employed a dedicated set of nodes with
NVRAM as the storage space for the intermediate data of Big Data applications.
This in turn improved the scalability of the Spark framework compared to the
scenario using the Lustre file system as the storage space. Islam et al. [21] pro-
posed a novel design for HDFS which uses NVRAM-based burst buffer nodes on
top of a parallel file system to improve the performance of Spark applications.
Yildiz et al. [34] presented Eley, a burst buffer solution that helps accelerate
the performance of Big Data applications while guaranteeing the performance of
HPC applications. Eley employs a prefetching technique that fetches the input
data of these applications and stores it close to the compute nodes, thus reduc-
ing the latency of reading data inputs. Moreover, Eley is equipped with a full
delay operator to guarantee the performance of HPC applications. Similarly,
our findings illustrate that there is a need for burst buffer solutions to alleviate
the latency problem.
Table 2. Our major findings on the characteristics of Big Data applications on HPC
systems.
In addition, we give insights into designing efficient burst
buffer solutions. Specifically, we claim that future burst buffer implementations
should be aware of the contention problem and also try to eliminate the latency
problem for the input phase and output phase.
References
1. Big Data and Extreme-scale Computing (BDEC) Workshop. http://www.exascale.
org/bdec/
2. HiBench Big Data microbenchmark suite. https://github.com/intel-hadoop/
HiBench
3. The Apache Hadoop Project. http://www.hadoop.org
4. Apache Storm (2012). https://storm.apache.org/
5. Apache Spark primer (2017). http://go.databricks.com/hubfs/pdfs/Apache
Spark Primer 170303.pdf
6. IDC’s Data Age 2025 study (2017). http://www.seagate.com/www-content/our-
story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
7. Powered by Hadoop (2017). http://wiki.apache.org/hadoop/PoweredBy/
8. Hadoop Workload Analysis. http://www.pdl.cmu.edu/HLA/index.shtml. Accessed
Jan 2017
9. Bent, J., Faibish, S., Ahrens, J., Grider, G., Patchett, J., Tzelnic, P., Woodring, J.:
Jitter-free co-processing on a prototype exascale storage stack. In: 2012 IEEE 28th
Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–5. IEEE
(2012)
10. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: efficient iterative data
processing on large clusters. Int. J. Very Large Databases 3(1–2), 285–296 (2010)
11. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.:
Apache flink: stream and batch processing in a single engine. Bull. IEEE Com-
put. Soc. Tech. Comm. Data Eng. 36(4), 28–38 (2015)
12. Chaimov, N., Malony, A., Canon, S., Iancu, C., Ibrahim, K.Z., Srinivasan, J.:
Scaling Spark on HPC systems. In: Proceedings of the 25th ACM International
Symposium on High-Performance Parallel and Distributed Computing, pp. 97–
110. ACM (2016)
13. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters.
Commun. ACM 51(1), 107–113 (2008)
14. Donovan, S., Huizenga, G., Hutton, A.J., Ross, C.C., Petersen, M.K., Schwan, P.:
Lustre: building a file system for 1000-node clusters (2003)
15. Dorier, M., Antoniu, G., Cappello, F., Snir, M., Sisneros, R., Yildiz, O., Ibrahim,
S., Peterka, T., Orf, L.: Damaris: addressing performance variability in data man-
agement for post-petascale simulations. ACM Trans. Parallel Comput. (TOPC)
3(3), 15 (2016)
16. Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating
I/O interference in HPC systems through cross-application coordination. In: Pro-
ceedings of the IEEE International Parallel and Distributed Processing Symposium
(IPDPS 2014), Phoenix, AZ, USA, May 2014. http://hal.inria.fr/hal-00916091
17. Fox, G., Qiu, J., Jha, S., Ekanayake, S., Kamburugamuve, S.: Big data, simulations
and HPC convergence. In: Rabl, T., Nambiar, R., Baru, C., Bhandarkar, M., Poess,
M., Pyne, S. (eds.) WBDB -2015. LNCS, vol. 10044, pp. 3–17. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-49748-8 1
18. Gainaru, A., Aupy, G., Benoit, A., Cappello, F., Robert, Y., Snir, M.: Schedul-
ing the I/O of HPC applications under congestion. In: International Parallel and
Distributed Processing Symposium, pp. 1013–1022. IEEE (2015)
19. Guo, Y., Bland, W., Balaji, P., Zhou, X.: Fault tolerant MapReduce-MPI for HPC
clusters. In: Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis, p. 34. ACM (2015)
20. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-
parallel programs from sequential building blocks. In: Special Interest Group on
Operating Systems Review, vol. 41, pp. 59–72. ACM (2007)
21. Islam, N.S., Wasi-ur Rahman, M., Lu, X., Panda, D.K.: High performance design
for HDFS with byte-addressability of NVM and RDMA. In: Proceedings of the
2016 International Conference on Supercomputing, p. 8. ACM (2016)
22. Jégou, Y., Lantéri, S., Leduc, J., Melab, N., Mornet, G., Namyst, R., Primet, P.,
Quetier, B., Richard, O., Talbi, E.G., Iréa, T.: Grid’5000: a large scale and highly
reconfigurable experimental grid testbed. Int. J. High Perform. Comput. Appl.
20(4), 481–494 (2006)
23. Jin, H., Ibrahim, S., Qi, L., Cao, H., Wu, S., Shi, X.: The MapReduce programming
model and implementations. In: Buyya, R., Broberg, J., Goscinski, A. (eds.) Cloud
Computing: Principles and Paradigms, pp. 373–390. Wiley, New York (2011)
24. Li, Z., Shen, H.: Designing a hybrid scale-up/out hadoop architecture based on
performance measurements for high application performance. In: 2015 44th Inter-
national Conference on Parallel Processing (ICPP), pp. 21–30. IEEE (2015)
25. Lofstead, J., Zheng, F., Liu, Q., Klasky, S., Oldfield, R., Kordenbrock, T., Schwan,
K., Wolf, M.: Managing variability in the I/O performance of petascale storage sys-
tems. In: International Conference for High Performance Computing, Networking,
Storage and Analysis, pp. 1–12. IEEE (2010)
26. Lopez, I.: IDC talks convergence in high performance data analysis (2013). https://
www.datanami.com/2013/06/19/idc talks convergence in high performance
data analysis/
27. Ross, R.B., Thakur, R., et al.: PVFS: a parallel file system for Linux clusters. In:
Annual Linux Showcase and Conference, pp. 391–430 (2000)
28. Sato, K., Mohror, K., Moody, A., Gamblin, T., de Supinski, B.R., Maruyama, N.,
Matsuoka, S.: A user-level infiniband-based file system and checkpoint strategy
for burst buffers. In: 2014 14th IEEE/ACM International Symposium on Cluster,
Cloud and Grid Computing (CCGrid), pp. 21–30. IEEE (2014)
29. Shan, H., Shalf, J.: Using IOR to analyze the I/O performance for HPC platforms.
In: Cray User Group Conference 2007, Seattle, WA, USA (2007)
30. Tantisiriroj, W., Patil, S., Gibson, G.: Data-intensive file systems for internet
services: a rose by any other name. Parallel Data Laboratory, Technical report
UCB/EECS-2008-99 (2008)
31. Tous, R., Gounaris, A., Tripiana, C., Torres, J., Girona, S., Ayguadé, E., Labarta,
J., Becerra, Y., Carrera, D., Valero, M.: Spark deployment and performance evalua-
tion on the MareNostrum supercomputer. In: 2015 IEEE International Conference
on Big Data (Big Data), pp. 299–306. IEEE (2015)
32. Wang, Y., Goldstone, R., Yu, W., Wang, T.: Characterization and optimization of
memory-resident MapReduce on HPC systems. In: 2014 IEEE 28th International
Parallel and Distributed Processing Symposium, pp. 799–808. IEEE (2014)
33. Yildiz, O., Dorier, M., Ibrahim, S., Ross, R., Antoniu, G.: On the root causes
of cross-application I/O interference in HPC storage systems. In: IPDPS-
International Parallel and Distributed Processing Symposium (2016)
34. Yildiz, O., Zhou, A.C., Ibrahim, S.: Eley: on the effectiveness of burst buffers for
big data processing in HPC systems. In: 2017 IEEE International Conference on
Cluster Computing (CLUSTER), pp. 87–91, September 2017
35. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.:
Delay scheduling: a simple technique for achieving locality and fairness in cluster
scheduling. In: Proceedings of the 5th European Conference on Computer Systems,
pp. 265–278. ACM (2010)
36. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster
computing with working sets. In: HotCloud 2010, p. 10 (2010)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Experiences of Converging Big Data Analytics
Frameworks with High Performance
Computing Systems
Abstract. With the rapid development of big data analytics frameworks, many
existing high performance computing (HPC) facilities are evolving new capa-
bilities to support big data analytics workloads. However, due to the different
workload characteristics and optimization objectives of system architectures,
migrating data-intensive applications to HPC systems that are geared for tra-
ditional compute-intensive applications presents a new challenge. In this paper,
we address a critical question of how to accelerate complex applications that
contain both data-intensive and compute-intensive workloads on the Tianhe-2
system by deploying an in-memory file system as data access middleware; we
characterize the impact of storage architecture on data-intensive MapReduce
workloads when using Lustre as the underlying file system. Based on our
characterization and findings of the performance behaviors, we propose a shared
map output shuffle strategy and a file metadata cache layer to alleviate the impact
of the metadata bottleneck. The evaluation of these optimization techniques shows
up to 17% performance benefit for data-intensive workloads.
1 Introduction
The strong need for increased computational performance has led to the rapid devel-
opment of high-performance computing (HPC) systems, including Sunway TaihuLight
[1], Tianhe-2 [2], Titan [3], etc. These HPC systems provide an indispensable com-
puting infrastructure for scientific and engineering modeling and simulations [4–6].
While HPC systems mostly focus on large computational workloads, the emerging big
data analytics frameworks target applications that need to handle very large and
complex data sets on commodity machines. Hadoop MapReduce [7] and Spark [8] are
the most commonly used frameworks for distributed large-scale data processing and
have gained wide success in many fields over the past few years.
Recently, many researchers have predicted the trend of converging HPC and big
data analytics frameworks to address the requirements of complex applications that
contain both compute-intensive and data-intensive workloads [9, 10]. The motivation
behind this converging trend is twofold. Firstly, traditional data analytics applications
need to process more data in a given time, but the dynamics of the network environment
and cloud services result in a performance bottleneck. Compared with commodity
machines, HPC systems equipped with better hardware and high-performance
networks can provide much higher capacity. Secondly, scientific applications are becoming
more complex in order to fully utilize the computing capacity of HPC systems and the
high-resolution data from advanced sensors. For example, in the NASA Center for Climate
Simulation (NCCS), climate and weather simulations can create a few terabytes of
simulation data [11]. To visualize the data of interesting events such as hurricane centers
and thunderstorms, part of these data needs to be processed by a visualization tool in a
Hadoop environment. These complex applications contain both compute-intensive and
data-intensive jobs and need to be processed in an HPC environment.
However, the converging trend presents a new challenge when data-intensive
applications migrate to HPC systems that are geared for traditional compute-intensive
applications, mainly due to the different workload characteristics and optimization
objectives.
System architectures are designed to best support the typical workloads running on
the clusters. Traditional HPC systems were built to solve compute-intensive work-
loads, such as scientific simulation, with the optimization goal of providing maximum
computational density and local bandwidth for a given power/cost constraint. In contrast,
big data analytics systems aim to solve data-intensive workloads with the optimization
goal of providing maximum data capacity and global bandwidth for a given power/cost
constraint. Consequently, big data analytics systems differ from HPC systems, as
shown in Fig. 1.
[Fig. 1: an HPC architecture with diskless compute nodes (CN) connected via an interconnect to a parallel file system, vs. a big data architecture with HDFS (Namenode) and MapReduce (Scheduler) running Executors co-located with storage.]
Typical HPC systems consist of a large collection of compute nodes (CN), which are
connected through high-speed, low-latency interconnects (such as InfiniBand [12]).
A parallel file system (e.g., Lustre [13]) on top of disk arrays is used for persistent data
storage. In most HPC systems, the compute nodes are diskless and perform well for
compute-intensive workloads with high ratios of compute to data access. Parallel file
systems simplify data sharing between compute nodes, but their performance is bottlenecked
by metadata operations and they fail to provide spatial data locality for computation tasks.
Big data processing frameworks like Hadoop and Spark utilize low-cost commodity
machines to solve data-intensive problems, where each machine co-locates processing
units and local disks. The Hadoop distributed file system (HDFS [14]) is built on
top of these local disks to provide persistent storage. Computation tasks
are launched on the physical machines where the data locality of the required data can be
leveraged maximally.
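The locality-driven task placement described here can be sketched as a toy scheduler; real schedulers (e.g., delay scheduling [35]) additionally consider rack locality and bounded waiting:

```python
def place_task(split_replicas, free_nodes):
    """Pick a node for a task, preferring nodes that already hold a
    replica of the task's input split (node-local execution)."""
    for node in free_nodes:
        if node in split_replicas:
            return node, "node-local"
    # No free node holds the data: run remotely and read over the network.
    return free_nodes[0], "remote"

replicas = {"n3", "n7", "n9"}  # HDFS replica locations of one split
print(place_task(replicas, ["n1", "n7", "n2"]))  # -> ('n7', 'node-local')
print(place_task(replicas, ["n1", "n2"]))        # -> ('n1', 'remote')
```

On a diskless HPC compute node, the first branch never applies, which is exactly why parallel file systems cannot offer this locality benefit.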
On the one hand, these distinctions between HPC and big data analytics systems
have significant performance implications for different types of application workloads.
On the other hand, the converging trend of HPC and big data is imperative and
provides many opportunities for researchers.
In this paper, we address three problems:
1. How can we accelerate a complex application that contains both simulation and ana-
lytics jobs? The output data of HPC workloads are stored in parallel file systems,
while traditional big data analytics frameworks rely on HDFS to read or write data.
Hence, it is highly desirable to utilize a middleware that allows applications to
access data stored in different data sources without redundant data movement.
2. What is the impact of using the Lustre parallel file system as the underlying file system
of big data analytics frameworks, given that compute nodes in most HPC systems are
diskless?
3. How can we reconcile and converge the architectural differences between the two
paradigms so that data-intensive MapReduce applications can be accelerated in
HPC environments?
Previous works [15, 16] have analyzed the performance differences when
deploying Hadoop and Spark on HPC systems, but they did not provide optimizations
for complex applications. Many efforts have explored directly deploying Hadoop on top
of existing parallel file systems, such as Ceph [17], PVFS [18] and GPFS [19], but
these works are limited to specific versions of Hadoop. Compared with these works,
we deploy an in-memory file system, Alluxio [20], as data access middleware to
accelerate complex applications. We analyze its performance with extensive experi-
ments on the Tianhe-2 system and introduce a shared map output file shuffle strategy and
a file metadata cache layer targeting compute-centric HPC systems to accelerate
data-intensive Hadoop applications.
Our contributions in this paper can be summarized as follows.
(1) We have utilized an in-memory file system on Tianhe-2 system to reconcile the
architectural differences and accelerate complex applications.
(2) We have evaluated the feasibility and performance impacts of in-memory file
system through different workloads.
(3) We proposed advanced acceleration techniques, including a shared map output file
shuffle strategy and a file metadata cache layer, to accelerate data-intensive
MapReduce applications on HPC systems.
Experiences of Converging Big Data Analytics Frameworks 93
Section 2 gives a brief background for this paper. Section 3 explores the feasibility of
the data access middleware and analyzes its performance impact. We propose our design of
the shared map output file shuffle strategy and file metadata cache layer in Sect. 4. Sec-
tion 5 reviews related studies currently existing in the literature. We conclude
the paper and discuss future work in Sect. 6.
2 Background
In this section, we give a direct comparison between HDFS and the Lustre file system
and review the design of Alluxio.
metadata, including file attributes, file permissions, and the layout of file objects in the
form of extended attributes (EA). With the EA, the client can communicate with
the corresponding OSSs to get the file data.
2.2 Alluxio
Due to the growing I/O bandwidth gap between main memory and disk, the storage
layer has become the bottleneck that limits application scalability. To alleviate the
storage pressure, a variety of in-memory file systems that act as a fast, distributed cache
have been developed to enhance I/O performance [21–23].
Alluxio is an open-source, memory-speed, virtual distributed storage system that sits
between the underlying storage systems and the processing framework layers. It has the same
architecture as HDFS, where a master is primarily responsible for managing the global
metadata of the system, while Alluxio workers store data as blocks using local resources.
These resources can be local memory, SSDs, or hard disks and are user configurable. Alluxio
provides a Hadoop API, a Java API and a POSIX interface for user applications and compu-
tation frameworks, and it can connect to different underlying storage systems such as
Amazon S3 and Apache HDFS through encapsulated adapters. Because of its memory-
centric design and outstanding compatibility with different computation frameworks
and storage systems, Alluxio plays an important role in the big data ecosystem.
We choose Alluxio as the data access middleware to converge big data analytics
frameworks with HPC systems because of its compatibility with different kinds of
underlying file systems and the POSIX-compliant interface it provides.
We used 32 nodes to simulate a typical data analytics environment where HDFS
is set up on top of the local disks. Spark 2.1.0, Hadoop 2.7.3 and Oracle Java 1.8.0 are
used. The HDFS block size is set to 128 MB.
Another 32 nodes are used to simulate a typical HPC environment. We did not
utilize the local disks as underlying storage but instead mounted the Lustre file system on
these nodes. In our test environment, the mounted Lustre file system contains 48 OSTs. To
enable Hadoop and Spark to read/write data from/to Lustre, we deployed Alluxio 1.4.0
on these nodes.
As mentioned, we evaluate IOzone, Mdtest, the HiBench benchmark and a simulated
complex application in the HPC and data analytics environments, where the underlying file
system is Lustre and HDFS, respectively. Each benchmark was executed at least
three times and we report the mean performance.
We compare the performance of HAL and Alluxio via the HiBench benchmark. All
input data are stored in Lustre and we run each workload in the data analytics environment,
where each node can write its intermediate data to local disk. We did not use the
shuffle strategy implemented in HAL because of the presence of local disks. Figure 2
shows the performance comparison of the different data access strategies, where hdfs-put
means copying data from Lustre to HDFS before running the workload, hadoop-hdfs
represents a Hadoop workload reading input data from HDFS, hadoop-alluxio represents
a Hadoop workload reading input data from Lustre directly via Alluxio, and so on. All
the intermediate data are stored on the local disk. Clearly, both Alluxio and HAL
provide an efficient way for Hadoop and Spark workloads to read data stored in Lustre
without redundant data movement.
[Fig. 2: execution time (s) of HiBench workloads under different data access strategies: spark-hal, spark-alluxio, spark-hdfs, hadoop-hal, hadoop-alluxio, hadoop-hdfs, and hdfs-put.]
It is worth mentioning that Hadoop/Spark over HAL runs faster than over Alluxio.
The reason behind this phenomenon is that Alluxio provides an HDFS view for user
applications, which in turn makes the data access process more complicated than in HAL. In
Fig. 3, to access a file stored in the underlying file system, a client requests the Alluxio
master to get the file metadata. After receiving the metadata request, the master creates
an inode object, generates file metadata in HDFS metadata format according to the
information from the underlying file system, and sends it back to the client. According to the
metadata, the client constructs the underlying file system info and sends it to an Alluxio
worker if the file is not stored in Alluxio and the block locations info is null. After
receiving the data requests from the client, the Alluxio worker creates a packet reader
and acts as a client of the underlying file system to read the file data and send it back. Data
access in HAL is much simpler because each client sends data access requests to the
underlying Lustre file system directly, without providing HDFS functionality for user
applications.
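The read path just described can be condensed into a schematic model; the class and method names below are illustrative stand-ins, not the real Alluxio API:

```python
class AlluxioMasterModel:
    """Step 2: on a metadata request, load file info from the under file
    system (UFS), create an inode, and return HDFS-style metadata."""
    def get_metadata(self, path, ufs):
        return {"path": path,
                "length": ufs[path]["length"],
                "block_locations": None}  # file not yet cached in Alluxio

class AlluxioWorkerModel:
    """Act as a UFS client, read the file data, and send it back."""
    def read(self, path, ufs):
        return ufs[path]["data"]

# A file that exists only in the underlying (e.g. Lustre) file system.
ufs = {"/lustre/out.dat": {"length": 4, "data": b"abcd"}}
master, worker = AlluxioMasterModel(), AlluxioWorkerModel()

meta = master.get_metadata("/lustre/out.dat", ufs)  # steps 1-3
if meta["block_locations"] is None:                 # step 4: go via a worker
    data = worker.read("/lustre/out.dat", ufs)      # step 7
print(data)  # -> b'abcd'
```

HAL skips the master and worker hops entirely, which is the extra indirection the measurements above attribute Alluxio's overhead to.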
[Fig. 3: Alluxio data access flow — (1) metadata request from the Alluxio client to the Alluxio master; (2) the master loads metadata from the UFS and creates an inode; (3) metadata returned to the client; (4) data request sent to the Alluxio worker; (7) data returned to the client.]
Although the data access performance of HAL is better, Alluxio can provide a
memory-centric distributed storage space while being compatible with different underlying
file systems simultaneously. We believe that Alluxio is a better choice for complicated
applications. To validate our assumption, we define a complex application that consists
of RandomWriter and Sorter. RandomWriter simulates an HPC workload, which is exe-
cuted in the HPC environment and generates 10 GB of data per node. Sorter simulates a data
analytics workload that analyzes the output data of RandomWriter in the Hadoop envi-
ronment. The results are shown in Fig. 4. We scale the number of nodes from 1 to 32.
RandomWriter-HAL indicates that RandomWriter writes data into Lustre and
Sort-HAL reads data from Lustre via HAL. RandomWriter-Alluxio means that the
output data of RandomWriter are stored in Alluxio, and Sort-Alluxio can read data from
Alluxio directly. Clearly, Sort-Alluxio takes less time than Sort-HAL, since the
output data are stored in memory and can be accessed directly, while Sort-HAL needs to
read data from the underlying Lustre file system.
[Fig. 4: execution time (s) of Sort-Alluxio, RandomWriter-Alluxio, Sort-HAL, and RandomWriter-HAL as the number of nodes scales from 1 to 32.]
[Bar chart: IOzone write/rewrite/read/reread bandwidth (MB/s) on Lustre vs. local disk.]
[Bar chart (log scale): Mdtest metadata operation rates for directory creation/stat/removal and file creation/read/removal.]
[Line chart: execution time (s) of HAL-Local disk, Alluxio-Lustre, and HAL-Lustre with intermediate data sizes of 30, 150 and 300 GB.]
The shuffle phase of MapReduce jobs is shown in Fig. 8: (1) each map task is
assigned a portion of the input file and applies the user-defined map function to each
key-value pair. The processed key-value data are first stored in a memory buffer named
kvbuffer and are spilled to the intermediate data directory whenever the available
space of the kvbuffer drops below 20%. These spill files are sorted and merged into one
output file before each map task finishes. (2) After all map tasks finish, reduce tasks
start fetching the intermediate data stored on the local disk of each node. The
data from the different map output files are sorted and merged again to generate one final
input for each reduce task. Overall, the shuffle phase involves a huge number of file creation
and read/write operations and is sensitive to network latency and disk bandwidth.
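A minimal model of this spill-and-merge behavior can be sketched in a few lines of Python (an illustrative toy, not Hadoop's actual implementation; the buffer capacity and record sizes below are made up):

```python
# Toy model of map-side spilling: records accumulate in a fixed-size
# kvbuffer and are spilled to an intermediate file whenever free space
# drops below 20% of capacity; spills are sorted on write and merged
# into one sorted output file when the map task ends.

def run_map_task(record_sizes, kvbuffer_capacity):
    spills = []          # each spill is one (sorted) intermediate file
    buffered, used = [], 0
    for size in record_sizes:
        buffered.append(size)
        used += size
        if kvbuffer_capacity - used < 0.2 * kvbuffer_capacity:
            spills.append(sorted(buffered))  # spill threshold reached
            buffered, used = [], 0
    if buffered:                             # final spill at task end
        spills.append(sorted(buffered))
    # merge phase: all spill files become one sorted map output file
    merged = sorted(s for spill in spills for s in spill)
    return spills, merged

spills, output = run_map_task([30] * 10, kvbuffer_capacity=100)
print(len(spills), len(output))  # -> 4 10
```

Each spill here costs one file creation, which is cheap on a local file system but, as discussed below, expensive on a parallel file system.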
[Fig. 8: The shuffle phase. HDFS input splits (Split-1 … Split-M) feed map tasks (MapTask-1 … MapTask-M), each with a KV buffer, whose outputs are fetched by reduce tasks (ReduceTask-1 … ReduceTask-R).]
During the shuffle phase in a data analytics environment with local disks, each map/reduce
task sends file create/open requests to the local file system to write/read intermediate data,
so its performance is subject to network latency and disk latency. In an HPC environment,
however, these requests are sent to the underlying parallel file system, so
performance is subject to metadata latency in addition to network latency and disk latency.
The reason HAL-Lustre performs worse than Alluxio-Lustre is that the shuffle
strategy in HAL generates more intermediate files. To prevent repetitive data
movement, HAL reimplements the shuffle phase to allow each reduce task to retrieve
data from Lustre directly. In the default shuffle strategy, each map task generates one
intermediate file covering all reduce tasks every time the kvbuffer spills its data to local disk.
The intermediate files generated by the same map task are merged into one
final output file, which is then fetched by the reduce tasks. The total number of intermediate files
is n * M, where n is the number of spill operations and M is the
number of map tasks. In the HAL shuffle strategy, however, each map task generates
one intermediate file per reduce task, and all the intermediate files that belong to
one reduce task are stored in one directory. The total number of intermediate files is
therefore n * M * R, where R is the number of reduce tasks. The HAL shuffle strategy
avoids the merge phase of map tasks and prevents the cost of repetitive data movement,
but the cost of metadata operations on this huge number of intermediate files results in the
performance loss.
Experiences of Converging Big Data Analytics Frameworks 101
In summary, when using Lustre as the underlying file system of big data analytics
frameworks, Lustre can provide higher aggregate bandwidth than a traditional HDFS
built on top of local disks, but its costly metadata operations may result in serious
performance loss if massive amounts of intermediate data are stored in Lustre.
[Fig. 9: The proposed shared map output shuffle strategy. Mapper A applies Map to (key, value) pairs and writes spill files containing partitions 1 … n to Lustre; these are merged into per-reduce-task intermediate files (e.g., Map A: partition 1 → Output 1), which Reducer 1 reads and reduces.]
First, each map task generates one intermediate file every time the kvbuffer spills
its data to the underlying Lustre file system. Each spill file contains multiple partitions, and
every partition stores the data corresponding to one reduce task. Second, all the
spill files generated by each map task are retrieved and sorted. Thanks to the large
memory of a compute node, those spill files are likely to still reside in local
memory and can be retrieved quickly. Finally, the sorted map output data are stored in
multiple intermediate files, where each intermediate file contains the data belonging to
one reduce task. In other words, each map task generates R intermediate files no matter
how many spill operations it goes through, where R is the number of reduce tasks. The
intermediate files that will be processed by the same reduce task are stored in the same
directory.
102 P. Cheng et al.
The proposed shuffle strategy has several advantages: (1) It can exploit the
large memory of a compute node and reduce the time needed to retrieve spill files.
(2) Compared with the HAL shuffle strategy described in Sect. 3.3, the proposed
shuffle strategy reduces the total number of intermediate files from n * M * R to
M * R, where n is the number of spill operations, M the number of
map tasks and R the number of reduce tasks. The number of costly
metadata operations is therefore reduced. (3) Compared with the default shuffle strategy, each
reduce task can fetch its map output files from the corresponding directory via Lustre
directly, without the cost of repetitive data movement.
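The file-count formulas above can be checked directly (a sketch in the paper's n, M, R notation; the job sizes chosen below are hypothetical):

```python
# Number of intermediate files produced under each shuffle strategy,
# following the formulas in the text: n spill operations per map task,
# M map tasks, R reduce tasks.

def intermediate_files(strategy, n, M, R):
    if strategy == "default":
        return n * M          # one file per spill, merged per map task
    if strategy == "hal":
        return n * M * R      # one file per reduce task on every spill
    if strategy == "shared":
        return M * R          # R files per map task, independent of n
    raise ValueError(strategy)

# hypothetical job: 4 spills per map task, 100 map tasks, 50 reduce tasks
n, M, R = 4, 100, 50
print(intermediate_files("default", n, M, R))  # -> 400
print(intermediate_files("hal", n, M, R))      # -> 20000
print(intermediate_files("shared", n, M, R))   # -> 5000
```

The shared strategy removes the factor n, which is what cuts down the number of costly Lustre metadata operations while still avoiding repetitive data movement.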
4.3 Evaluation
To validate the effectiveness of our proposed optimizations, we run the terasort
workload of the HiBench benchmark in the HPC environment; the results are shown in
Fig. 10. Default strategy denotes the default shuffle strategy of Hadoop. HAL denotes
the original HAL shuffle strategy without optimizations. Shared shuffle denotes
the proposed shared map output shuffle strategy, and file metadata cache
denotes the file metadata cache layer used together with the shared map output shuffle
strategy. We vary the data size from 300 GB to 1500 GB, and all the intermediate data
are stored in the Lustre file system.
[Fig. 10: Execution time (s) of the default, HAL, shared shuffle and file metadata cache strategies for intermediate data sizes of 300 to 1500 GB.]
When the intermediate data size is less than 1200 GB, the default shuffle strategy
performs best, since the data server that serves the requests of reduce tasks can fetch
the intermediate data from local memory thanks to the large buffer cache of a
compute node. As the intermediate data size grows, the memory of a compute node
becomes insufficient to hold all the intermediate data, and the data servers need to fetch
data from Lustre and send it back to the reduce tasks. The cost of this repetitive data
movement in the default shuffle strategy results in the performance loss.
In contrast, our proposed optimizations allow reduce tasks to fetch data from the
underlying Lustre file system directly, without the cost of repetitive data movement.
Moreover, they reduce the total number of intermediate files and show an obvious
performance benefit compared with the original HAL shuffle strategy. For the 1500 GB
data size, the shared map output shuffle strategy provides an 11% performance benefit
over HAL, rising to 17% when the file metadata cache layer is used as well.
5 Related Work
In recent years, big data analytics has achieved great success in many fields. Previous
work [9, 28–30] has identified and analyzed the characteristics of big data and discussed
the big data challenges, including data storage and transfer, data security, the scalability
of data analytics systems, etc.
Many efforts have been made to integrate big data analytics frameworks with
HPC infrastructure. Chaimov et al. [15] ported Spark to Cray XC systems and evaluated
a configuration with SSDs attached closer to compute nodes for I/O acceleration. Wang
et al. [16] characterized the performance impact of key differences between
compute-centric and data-centric paradigms and provided optimizations to enable
a dual-purpose HPC system that can efficiently support both conventional HPC applications
and new data analytics applications. Wasi-ur-Rahman et al. [31] proposed a
high-performance design for running YARN MapReduce on HPC clusters that utilizes
Lustre as the storage provider for intermediate data and introduced an RDMA-based
shuffle approach. These works analyzed the performance differences when deploying
Hadoop and Spark on HPC systems, but they did not provide optimizations
for complex applications.
In-memory file systems such as MemFS [21], FusionFS [22] and AMFS [23] have been
developed to alleviate the storage pressure, but they provide limited compatibility with
underlying file systems. Two-level storage [32] is the closest related work: it
integrates an upper-level in-memory file system with a lower-level parallel file system
to accelerate Hadoop/Spark workloads on HPC clusters. However, it lacks an
in-depth discussion of the performance impact on data-intensive analytics workloads
when using Lustre as the underlying file system. In this paper, we make a detailed
comparison of system architectures and provide two optimizations to alleviate the metadata
bottleneck of Lustre.
Many research works have also deployed big data analytics frameworks directly
atop existing parallel file systems. Maltzahn et al. [17] describe Ceph
and its elements and provide instructions for installing a demonstration system that can
be used with Hadoop. Yang et al. [18] propose PortHadoop, an enhanced Hadoop
architecture that enables MapReduce applications to read data directly from HPC
parallel file systems. Xuan et al. [32] present a two-level storage system that integrates
an upper-level in-memory file system with a lower-level parallel file system. Compared
with previous work, this paper presents our experiences of converging big data
analytics frameworks with the Tianhe-2 system. We target the growing need for
complex applications and provide a feasible solution to accelerate them by utilizing an
in-memory file system.
6 Conclusion
Acknowledgment. This work was supported by the National Natural Science Foundation of China
under Grants No. U1611261 and No. 61433019, the National Key R&D Program of China
under Grant No. 2017YFB0202201, and the Program for Guangdong Introducing Innovative and Entrepreneurial
Teams under Grant No. 2016ZT06D211.
References
1. Fu, H.H., Liao, J.F., Yang, J.Z., Wang, L.N., Song, Z.Y., Huang, X.M., et al.: The Sunway
TaihuLight supercomputer: system and applications. Sci. China Inf. Sci. 59(7), 1–16 (2016)
2. Liao, X.K., Xiao, L.Q., Yang, C.Q., Lu, Y.T.: Milkyway-2 supercomputer: system and
application. Front. Comput. Sci. 8(3), 345–356 (2014)
3. Titan - Cray XK7 (2017). https://www.olcf.ornl.gov/titan/
4. Wang, F., Yang, C.Q., Du, Y.F., Chen, J., Yi, H.Z., Xu, W.X.: Optimizing Linpack
benchmark on GPU-accelerated petascale supercomputer. J. Comput. Sci. Technol. 26(5),
854–865 (2011)
5. Yang, C., Wu, Q., Tang, T., Wang, F., Xue, J.: Programming for scientific computing on
peta-scale heterogeneous parallel systems. J. Cent. South Univ. 20(5), 1189–1203 (2013)
6. French, S., Zheng, Y., Romanowicz, B., Yelick, K.: Parallel Hessian assembly for seismic
waveform inversion using global updates. In: IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pp. 753–762. IEEE (2015)
7. Bhandarkar, M.: MapReduce programming with apache Hadoop. In: IEEE International
Symposium on Parallel and Distributed Processing (IPDPS), p. 1 (2010)
8. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M.: Resilient distributed
datasets: a fault-tolerant abstraction for in-memory cluster computing. In: USENIX
Conference on Networked Systems Design and Implementation, p. 2 (2012)
9. Kambatla, K., Kollias, G., Kumar, V., Grama, A.: Trends in big data analytics. J. Parallel
Distrib. Comput. 74(7), 2561–2573 (2014)
10. Reed, D.A., Dongarra, J.: Exascale computing and big data. Commun. ACM 58(7), 56–68
(2015)
11. NASA Center for Climate Simulation (2017). http://www.nasa.gov/topics/earth/features/
climate-sim-center.html
12. InfiniBand Homepage (2017). http://www.infinibandta.org/
13. Donovan, S., Kleen, A., Wilcox, M., Huizenga, G., Hutton, A.J.: Lustre: building a file
system for 1,000-node clusters. In: Proceedings of the Linux Symposium, p. 9 (2003)
14. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In:
Mass Storage Systems and Technologies, pp. 1–10 (2010)
15. Chaimov, N., Malony, A., Canon, S., Iancu, C., Ibrahim, K.Z., Srinivasan, J.: Scaling Spark
on HPC systems. In: Proceedings of the 25th ACM International Symposium on
High-Performance Parallel and Distributed Computing (HPDC), pp. 97–110 (2016)
16. Wang, Y., Goldstone, R., Yu, W., Wang, T.: Characterization and optimization of
memory-resident MapReduce on HPC systems. In: IEEE International Symposium on
Parallel and Distributed Processing (IPDPS), pp. 799–808 (2014)
17. Maltzahn, C., Molinaestolano, E., Khurana, A., Nelson, A.J., Brandt, S.A., Weil, S.: Ceph as
a scalable alternative to the Hadoop distributed file system. The Magazine of USENIX and
SAGE, pp. 38–49 (2010)
18. Yang, X., Liu, N., Feng, B., Sun, X.H., Zhou, S.: PortHadoop: support direct HPC data
processing in Hadoop. In: IEEE International Conference on Big Data, pp. 223–232 (2015)
19. Fadika, Z., Dede, E., Govindaraju, M., Ramakrishnan, L.: MARIANE: MApReduce
implementation adapted for HPC environments. In: International Conference on Grid
Computing, pp. 82–89 (2011)
20. Li, H., Ghodsi, A., Zaharia, M., Shenker, S., Stoica, I.: Tachyon: reliable, memory speed
storage for cluster computing frameworks. In: Proceedings of the ACM Symposium on
Cloud Computing, pp. 1–15. (2014)
21. Uta, A., Sandu, A., Costache, S., Kielmann, T.: Scalable in-memory computing. In:
International Symposium on Cluster, Cloud and Grid Computing, pp. 805–810 (2015)
22. Zhao, D., Zhang, Z., Zhou, X., Li, T.: FusionFS: toward supporting data-intensive scientific
applications on extreme-scale high-performance computing systems. In: IEEE International
Conference on Big Data, pp. 61–70 (2014)
23. Zhang, Z., Katz, D.S., Wozniak, J.M., Espinosa, A.: Design and analysis of data
management in scalable parallel scripting. In: International Conference on
High PERFORMANCE Computing, Networking, Storage and Analysis, pp. 1–11 (2012)
24. IOzone Filesystem Benchmark (2017). http://www.iozone.org/
25. MDTest Metadata Benchmark (2017). https://github.com/MDTEST-LANL/mdtest
26. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite:
characterization of the MapReduce-based data analysis. In: International Conference on Data
Engineering Workshops, pp. 41–51 (2010)
27. Hadoop Adapter for Lustre (HAL) (2017). https://github.com/intel-hpdd/lustre-connector-
for-hadoop
28. Hu, H., Wen, Y., Chua, T.S., Li, X.: Toward scalable systems for big data analytics: a
technology tutorial. IEEE Access 2(1), 652–687 (2017)
29. Brohi, S.N., Bamiah, M.A., Brohi, M.N.: Identifying and analyzing the transient and
permanent barriers for big data. J. Eng. Sci. Technol. 11(12), 1793–1807 (2016)
30. Tolle, K.M., Tansley, D.S.W., Hey, A.J.G.: The fourth paradigm: data-intensive scientific
discovery [point of view]. Proc. IEEE 99(8), 1334–1337 (2011)
31. Wasi-ur-Rahman, M., Lu, X., Islam, N.S., Rajachandrasekar, R., Panda, D.K.:
High-performance design of YARN MapReduce on modern HPC clusters with Lustre and
RDMA. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS),
pp. 291–300 (2015)
32. Xuan, P., Ligon, W.B., Srimani, P.K., Ge, R., Luo, F.: Accelerating big data analytics on
HPC clusters using two-level storage. Parallel Comput. 61, 18–34 (2016)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appro-
priate credit to the original author(s) and the source, provide a link to the Creative Commons
license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
GPU/FPGA
MACC: An OpenACC Transpiler
for Automatic Multi-GPU Use
1 Introduction
Graphics Processing Units (GPUs) are the workhorse of modern, state-of-the-art
supercomputers. Each node in a supercomputer often consists of several
GPUs, each carrying its own distributed memory and each capable
of executing asynchronously from the others. Owing to their high performance
and compute-to-power ratio (FLOPS/Watt), modern supercomputers such as
TSUBAME3.0 [1], DGX SATURNV [2] and the upcoming Summit [3] include
multiple GPUs per node.
Programming GPUs has historically been done through low-level programming
languages (often derivatives or dialects of C) such as CUDA [4] and
c The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 109–127, 2018.
https://doi.org/10.1007/978-3-319-69953-0_7
110 K. Matsumura et al.
OpenCL [5]. Here the programmer is responsible both for creating the program
code and, when multiple GPUs are involved, for orchestrating their
concurrent execution; an often non-trivial task.
A better (and arguably more portable) way is to use compiler directives to
indicate sources of potential parallelism in the application. A compiler can then
use these directives to abstract the complex architectural details away from the
programmer and automatically generate device-specific program code.
Examples of such directive-driven approaches are OpenACC [6] and OpenMP [7].
While models such as OpenACC increase productivity through a raised programming
abstraction, they are currently limited to targeting a single GPU
device. The user is still responsible for manually orchestrating multi-GPU
execution.
We propose a method that enables OpenACC-annotated applications to exploit
multiple GPUs. We implemented a source-to-source compiler (transpiler) that
analyzes and optimizes OpenACC applications. Our transpiler is transparent
to the user: kernel scheduling, data movement and inter-GPU communication
(including over the recent GPU-to-GPU links) are handled automatically.
Our contributions in short:
(1) A transpiler that extends the OpenACC programming model to allow applications
to seamlessly use multiple GPUs.
(2) A novel communication algorithm that preserves data coherency across
GPUs by extracting source-code information.
(3) An empirical evaluation of the above contributions using well-known HPC
benchmarks, positioning the performance against hand-written MPI code and
the recent Unified Memory abstraction layer.
The remainder of this paper is organized as follows. Section 2 discusses related
work. Section 3 provides an overview of OpenACC. Section 4 describes our proposed
method. Section 5 describes our experimental methodology. In Sect. 6, we
evaluate our transpiler. Finally, Sect. 7 concludes this paper.
2 Related Work
NVIDIA provides Unified Memory [8], which allows multiple NVIDIA GPUs
to share a global address space. Unified Memory recently gained coherence support
through the NVLink interconnect [9], allowing GPUs to communicate with
each other effortlessly. Unlike Unified Memory, which is
very architecture dependent, our approach is more general and oblivious to which
accelerator is being targeted, as long as the compiler's OpenACC backend supports
it. Moreover, our method is able to accelerate GPU-to-GPU communication
using GPU interconnects. We also show performance benefits of our method
compared to Unified Memory in Sect. 6.
Komoda et al. [10] used a compiler to distribute OpenACC kernels fairly across GPUs
and execute them in parallel. Their approach is to divide loop iterations into
chunks of equal size and keep these chunks coherent across different GPUs.
Their coherence mechanism is similar to that of Unified Memory, except that the
chunk size can be changed manually by the user and the chunks are maintained per
array. Unlike Komoda et al., we focus on identifying where communication
needs to happen between GPUs through data-flow analysis inside the transpiler.
Ramashekar et al. [11] propose to execute parts of an application (written in
C) on multiple GPUs by analyzing loops using the polyhedral model. The polyhedral
compilation precisely detects the necessary communication between GPUs
using overlaps of fine-grained regions and buffer management. However, their
approach is only applicable to loops with affine iteration spaces and array accesses. We
(unlike Ramashekar et al.) build and extend upon OpenACC, which allows us to
have more information regarding the sources of parallelism, increasing generality
as long as the application uses OpenACC. We also complement their study
and show how the polyhedral compilation can enhance our method in certain
cases.
HYDRA [12] is a compiler system for distributed environments that use a
single GPU per node. We share a similar system for determining communication
patterns between GPUs. Unlike HYDRA, which takes simple
directives as input and generates a distributed application, our method leverages
OpenACC and OpenMP and hence focuses on parallelism within a single "shared" node.
The output of our transpiler uses both OpenACC and OpenMP, so existing
OpenACC and OpenMP profiling tools can be used to further improve
performance. In our evaluation, we compare hand-written MPI code with our
transpiled code.
Scogland et al. [13] combine a well-designed task-based runtime with a
directive-driven model to facilitate efficient work-sharing in heterogeneous systems.
They provide new directives that help identify data dependencies. We
are considering extending our work to leverage existing task-based runtime systems to
perform load balancing.
Xu et al. [14] present new directives that extend OpenACC to support multiple
accelerators. Their proposal is based on an evaluation using a hybrid
OpenACC/OpenMP model.
Accelerate [15] is a purely functional domain-specific language for array processing,
and has the potential to utilize multiple GPUs [16].
Programming models targeting accelerator clusters have also been proposed [17,18].
These models provide explicit functions to distribute computations over multiple
accelerators.
3 Overview of OpenACC
Fig. 1. Two examples illustrating the difference in OpenACC code that targets (a)
single-GPU use, and (b) multi-GPU use through mixed OpenMP/OpenACC
kernel are affine with respect to the loop counters and the write-section does
not intersect with the write-sections of the other GPUs; we fall back on single-GPU
execution if the stated condition does not hold. Switching between single- and
multi-GPU execution occurs at runtime. Also, the number of GPUs used can
be decided and changed dynamically, leaving room for autotuners.
Identifying the data regions needed by the GPUs is difficult because the
order of kernel executions is decided dynamically. Therefore, we replicate host-to-GPU
communications according to data constructs (copyin) for all GPUs
(Algorithm 1).
When multiple GPUs are used, it is important to resolve the data dependencies
between the GPUs, because each GPU is (often) a discrete device with
its own distributed memory. We have adopted the method of Kwon et al. [19] (from
distributed-memory programming) to identify the necessary communication across
GPUs. Our implementation calculates the section of the reads (called USE)
and of the writes (called DEF) for each combination of parallel region, GPU and
data (array). We apply data-flow analysis (described in Subsect. 4.3) to derive the
necessary information.
Before each execution of a parallel region, we compute the necessary communication
among GPUs based on the overlaps of the calculated sections;
after that, we update the section (called DIRTY) for each combination of GPU
and data/array. A section whose master copy resides on a GPU is called DIRTY.
Algorithm 2 describes this process.
All sections contain an upper and a lower bound. Communication between
GPUs is performed either through host memory (CPU-to-GPU) or, if supported,
using the interconnect (GPU-to-GPU or P2P). MACC also removes
any duplicated transfers in order to reduce the amount of communication needed.
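As a rough sketch of this bookkeeping, the following Python toy models each section as an inclusive [lo, hi] interval and derives the transfers needed before a parallel region from USE/DIRTY overlaps. It is a simplified reconstruction of the idea, not MACC's implementation:

```python
# Simplified model of the coherence bookkeeping: every section is an
# inclusive (lo, hi) index interval per GPU for one array. Before a
# parallel region, a GPU must fetch any part of its USE section that is
# DIRTY on another GPU; afterwards its DEF section becomes its DIRTY
# section (the update step is omitted here for brevity).

def overlap(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def plan_transfers(use, dirty):
    """use: {gpu: (lo, hi)} read sections; dirty: {gpu: (lo, hi)}."""
    transfers = []
    for reader, section in use.items():
        for owner, owned in dirty.items():
            if owner == reader:
                continue                      # already coherent locally
            ov = overlap(section, owned)
            if ov is not None:
                transfers.append((owner, reader, ov))  # owner -> reader
    return transfers

# GPU0 will read indices [0, 99], but GPU1 last wrote [50, 149]:
print(plan_transfers({0: (0, 99)}, {1: (50, 149)}))
# -> [(1, 0, (50, 99))]
```

Each planned transfer would then go through host memory or, where available, directly over the GPU-to-GPU interconnect.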
MACC uses data-flow analysis to identify the USE and DEF sections of parallel
regions. Data-flow analysis is invoked on every parallel region to extract array
references/indices for read and write accesses. Array references are composed
of constants and loop iterator variables, as well as variables defined outside the
parallel region. Note that MACC only synthesizes the code for automatically
analyzing the USE and DEF sections; the actual analysis is performed at runtime,
before every execution of a parallel region.
During data-flow analysis, we collect array references/indices and
extract variables that are defined or overwritten in the parallel region. We
iteratively analyze the parallel region to account for all paths of the control-flow
graph, as long as the collected array references/indices change (so-called iterative
data-flow analysis [20]).
In MACC, we do this through the following two steps:
(1) We transform the source code into static single assignment form (SSA [20]).
(2) Array indices are collected and all variables (except loop counters) are
extracted.
(1) all write accesses to each array are affine and definite,
(2) the outermost loop of the kernel in the parallel region is dividable (i.e., it
statically or dynamically has an affine range and statically has no loop-carried
dependency), and
(3) the write-sections are not duplicated among GPUs.
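Condition (3) can be illustrated with a toy check over the per-GPU write sections (our own sketch; MACC performs the corresponding test at runtime before choosing between single- and multi-GPU execution):

```python
# Toy version of the fallback test implied by condition (3): given the
# DEF (write) section computed for each GPU as an inclusive (lo, hi)
# interval, multi-GPU execution is allowed only if no two sections
# intersect; otherwise execution falls back to a single GPU.

def can_run_multi_gpu(write_sections):
    ordered = sorted(write_sections)
    for (lo1, hi1), (lo2, hi2) in zip(ordered, ordered[1:]):
        if hi1 >= lo2:        # adjacent sorted sections intersect
            return False      # duplicated writes: fall back to one GPU
    return True

print(can_run_multi_gpu([(0, 249), (250, 499), (500, 749), (750, 999)]))
# -> True
print(can_run_multi_gpu([(0, 260), (250, 499)]))
# -> False
```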
The data construct and the update directive are converted into corresponding
concurrent versions using OpenMP's parallel construct, as shown in Fig. 2(a) and
(b) respectively. When transforming data constructs, MACC always appends an
appropriate present clause to parallel sections within the data construct, in order
to specify that the data are already on the GPUs.
Figure 2(c) shows how we transpile OpenACC's parallel constructs. We
start by identifying the loop ranges by calculating the USE and DEF sections. Once
we know the loop ranges, we spawn one OpenMP thread for each GPU device.
Each thread then generates the needed communication based on
the algorithm described in Subsect. 4.2; a barrier is inserted to synchronize all
threads before entering the compute part. Finally, the parallel region is executed
by all the threads and the GPUs they orchestrate.
If the parallel region satisfies the conditions described in Subsect. 4.3, the
outermost loop is divided and the execution is distributed across the GPUs. An
actual example of the transpilation is shown in Fig. 3. At section calculations,
Fig. 2. Mapping illustrating how MACC transpiles each directive and construct (left)
into combined OpenMP/OpenACC code (right)
Fig. 3. An actual example showing how an OpenACC kernel is transpiled and where
MACC inserts the section calculations, communication and parallel region
the last result is reused as long as none of the component values have changed since
the last calculation. Since variables defined outside the parallel region are shared
among threads, our multi-GPU execution can overwrite them. As an exception,
variables used as loop counters are duplicated on every thread via OpenMP private
clauses. Reductions are first computed for each GPU by OpenACC, and
the overall results are then combined across threads by OpenMP's reduction clause.
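The two-level reduction scheme can be modeled as follows (a Python sketch of the idea; the chunking policy and names are ours, not MACC's generated code):

```python
# Two-level sum reduction mirroring the scheme above: each GPU (modeled
# as a chunk of the iteration space) computes a partial result, standing
# in for OpenACC's per-device reduction; the partial results are then
# combined across threads, standing in for OpenMP's reduction clause.

def two_level_sum(data, n_gpus):
    chunk = (len(data) + n_gpus - 1) // n_gpus
    partials = [sum(data[g * chunk:(g + 1) * chunk])  # per-GPU (OpenACC)
                for g in range(n_gpus)]
    return sum(partials)                              # across threads (OpenMP)

values = list(range(1000))
print(two_level_sum(values, 4))  # -> 499500, same as sum(values)
```

The same structure applies to any associative reduction operator (max, product, etc.), which is why splitting it across the two runtimes is safe.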
5 Experimental Methodology
5.1 Implementation
MACC was implemented as a prototype coupled with the Omni [22] compiler's
C frontend/backend using XcodeML [23]. Currently, MACC only supports
OpenACC applications written in the C language (and not, for example, Fortran).
This is a minor limitation (resolving it is mostly an engineering
effort), since the methods and techniques introduced in this paper are general
enough not to be tied to any specific programming language.
MACC also requires that arrays copied to GPU devices are contiguous, as
multidimensional arrays are converted into single-dimensional arrays. The gang
(coarse-grain) parallelism specified in the input is divided equally among the GPUs,
and the other levels of parallelism (worker, vector) are kept unchanged.
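The even division of the gang parallelism and of the outermost loop range can be sketched as follows (an illustrative policy; the function name is ours):

```python
# Split an outermost loop range [lower, upper) evenly across n_gpus,
# giving earlier GPUs one extra iteration when the range does not divide
# evenly. MACC divides the specified gang count among GPUs in the same
# equal-share spirit.

def split_iterations(lower, upper, n_gpus):
    total = upper - lower
    base, extra = divmod(total, n_gpus)
    chunks, start = [], lower
    for g in range(n_gpus):
        size = base + (1 if g < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

print(split_iterations(0, 10, 4))  # -> [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each resulting chunk would be executed by one OpenMP thread driving its own GPU, with the per-chunk bounds substituted into the transpiled loop.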
We evaluated three versions of MACC: a baseline that conducts GPU-to-GPU
communication through shared host memory, MACC with NVIDIA
Unified Memory (UM), which entrusts data coherency to UM, and MACC with
P2P. We also leveraged PLUTO together with MACC where applicable (for one
of the benchmarks). We compared the transpiled code against the original version
of each benchmark, and also against MPI + ACC versions that we prepared by
appending OpenACC directives to the official MPI code.
Each benchmark was executed 10 times and we use the average to represent
the performance. We report performance in terms of computational throughput
or execution time (op/s, FLOP/s or seconds, depending on the benchmark)
as well as speedup over the original version (speedup = t_original / t_multi-GPU).
5.3 Environment
We evaluated the performance using a single node of the new TSUBAME3.0
supercomputer at the Global Scientific Information and Computing Center
(GSIC), Tokyo Institute of Technology. A node of the TSUBAME3.0 supercomputer
contains 4 NVIDIA P100 GPUs [24]. The GPUs are interconnected in
an all-to-all fashion using NVLink technology; note, however, that the links are
heterogeneous: two of the links (GPU0 ↔ GPU2, GPU1 ↔ GPU3) have
80 GB/s bidirectional bandwidth and the remaining links have 40 GB/s bidirectional
bandwidth. Each TSUBAME3.0 node also contains two CPUs (Intel Xeon
E5-2680 v4) with a total of 28 general-purpose x86-64 cores. Table 1 provides
more detailed system information.
For all experiments, we used PGI Compiler version 17.10 and NVIDIA CUDA
version 9.0. Inside MACC, each OpenMP thread orchestrates an individual
GPU; more specifically, the mapping is as follows: {thread0, GPU0}, {thread1,
GPU2}, {thread2, GPU1}, {thread3, GPU3}. Our mapping follows the heterogeneous
links of the GPU interconnect.
The PGI Compiler supports UM only for dynamically allocated memory. We
implemented an extension in MACC to force static allocations to be dynamically
allocated.
5.4 Benchmarks
6 Results
The performance with respect to the number of GPUs is shown in Fig. 4. Overall,
we see that our transpiler does provide the means to increase application
performance through multiple GPUs. However, depending on the application
characteristics, different behaviors are observed.
Using MACC, we measured the performance of two data coherence implementations:
our own, described in Sect. 4 (with and without P2P support), and
NVIDIA Unified Memory (UM). Despite the fact that UM internally uses
P2P, we find that our implementation without P2P outperforms it in all but one
case; enabling P2P in our implementation always executes faster than UM.
We also find that for applications whose data patterns require plenty of inter-GPU
communication (e.g. NPB-CG and, in parts, the Himeno benchmark),
enabling the P2P acceleration inside MACC can bring significant performance
increases. For applications for which MACC's transformation is inadequate, we show that
we can leverage other optimization techniques (polyhedral compilation, in the
case of this evaluation) to overcome bottlenecks that are otherwise hard to deal with.
Finally, we also find that MACC can automatically generate multi-GPU code
whose performance is comparable to hand-written MPI+OpenACC code.
The remainder of this section provides an in-depth analysis on a per-benchmark
basis.
NPB-CG. Performance results for NPB-CG are shown in Fig. 4(b). We see
that MACC (with and without P2P) scales with the given GPUs, yielding
2.16× and 1.54× performance speedups respectively. MACC with P2P enabled
scales stably better (by 19.9%, 34.9% and 40.9% when using 2, 3 and 4 GPUs
respectively). Direct data transfer between MPI processes incurs a large 72.9%
overhead when using 4 GPUs, which limits scalability; the average increase in
performance experienced by the MPI version is 1.09×. Note that our version that uses
UM experiences a loss of application performance (negative scaling) when increasing
the number of GPUs. We found that UM thrashes the memory (by thrashing
we mean that it frequently causes page faults and page migrations), which leads
to large performance losses.
124 K. Matsumura et al.
[Figure: four panels plot computational performance (GFLOPS, Gop/s, TFLOPS, or
execution time) and speedup against the number of GPUs (1–4) for the ORIGINAL,
MACC (UM), MACC, MACC (+P2P), and MPI+ACC versions; the recoverable panel
titles are (b) NPB-CG and (c) SHOC-MD.]
Fig. 4. Results with respect to the number of GPUs, displaying computational
performance or execution time as well as speedup against the original version
MACC: An OpenACC Transpiler for Automatic Multi-GPU Use 125
7 Conclusion
Acceleration of Wind Simulation Using Locally
Mesh-Refined Lattice Boltzmann Method
on GPU-Rich Supercomputers
1 Introduction
converge efficiently, because the problem becomes ill-conditioned with
increasing problem size and the overhead of inter-node communication increases
with the number of nodes.
The Lattice Boltzmann Method (LBM) [2–5] is a class of CFD methods that solve
the discrete-velocity Boltzmann equation. Since the LBM is based on a weakly
compressible formulation, the time integration is explicit and we do not need
to solve the pressure Poisson equation. This makes the LBM scalable, and thus
suitable for large-scale computation. For example, large-scale calculations
using the LBM were nominated for the Gordon Bell prize at SC10 [6] and SC15 [7].
However, it is difficult to perform multi-scale analyses with a uniform grid
from the viewpoint of computational resources and calculation time. In this
work, we address this issue with two approaches: one is the development of an
adaptive mesh refinement (AMR) method for the LBM, and the other is the
optimization of the AMR-LBM on the latest Pascal GPU architecture.
The AMR method was proposed to overcome this kind of problem [8, 9]. Since the
AMR method arranges fine grids only in the necessary regions, we can realize
high-resolution multi-scale analyses covering global simulation areas. AMR
algorithms for the LBM have been proposed and have achieved successful results
[10, 11]. Recently, GPU-based simulations have emerged as an effective
technique to accelerate many important classes of scientific applications,
including CFD applications [12–14]. Studies on implementing the LBM on GPUs
have also been reported [15, 16]. Since there are not many examples of
AMR-based applications on the latest GPU architectures, there is room for
research and development of such advanced applications. In this work, we
implement an AMR-based LBM code to solve multi-scale air flows. The code is
developed on the GPU-rich supercomputer TSUBAME3.0 at the Tokyo Institute of
Technology, and the GPU kernel functions are tuned to realize a real-time
simulation of the environmental dynamics of radioactive substances.
This paper reports implementation strategies of the AMR-LBM on the latest
Pascal GPU architecture and its performance results. The code is written in
CUDA 8.0 and CUDA-aware MPI. The host/device memory is managed by using
Unified Memory, and the GPU/CPU buffers are directly passed to MPI functions.
We demonstrate the performance of both CPUs and GPUs on TSUBAME3.0. A single
GPU process (a single NVIDIA TESLA P100 processor) achieves 383.3 mega-lattice
updates per second (MLUPS) with a leaf size of 4³ in single precision. The
performance is about 16 times higher than that of a single CPU process (two
Broadwell-EP processors, 14 × 2 cores, 2.4 GHz). Regarding the weak scalability
results, the AMR-LBM code achieves 22535 MLUPS using 36 GPU nodes, which is
85% efficiency compared with the performance on a single GPU node.
130 N. Onodera and Y. Idomura
The LBM solves the discrete Boltzmann equation to simulate the flow of a weakly
compressible fluid. The flow field is expressed by a limited number of pseudo
particles, which evolve through streaming and collision processes. The
configuration space is discretized by uniform grids. Since pseudo particles
move onto the neighboring lattice points after one time step in the streaming
process, this process is completed without any error. The macroscopic diffusion
and the pressure gradient are expressed by the local collision process. The
time evolution of the discretized velocity distribution function is

f_i(x + c_i Δt, t + Δt) = f_i(x, t) + Ω_i(x, t).    (1)

Here, Δt is the time interval, c_i are the lattice vectors of the pseudo
particles, and Ω_i is the collision operator.
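The streaming-collision update of Eq. (1) can be sketched in a few lines; the
snippet below is an illustrative 1D D1Q3 reduction (the paper itself uses the
3D D3Q27 model on GPUs), with a placeholder equilibrium function:

```python
import numpy as np

def lbm_step(f, c, tau, feq):
    """One LBM time step per Eq. (1): relax each population toward its
    equilibrium (SRT collision), then stream population i by its lattice
    vector c_i on a periodic 1D domain."""
    f_post = f - (f - feq(f)) / tau          # local collision
    for i, ci in enumerate(c):               # streaming: an exact array shift
        f_post[i] = np.roll(f_post[i], ci)
    return f_post
```

Because streaming is an exact shift of array entries, it introduces no error,
which is the property the text highlights.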
It is important to choose a proper lattice velocity (vector) model by taking
account of the tradeoff between efficiency and accuracy. Owing to their low
computational cost and high efficiency, the D3Q15 and D3Q19 models are popular.
Recently, however, it was pointed out that these velocity models do not have
enough accuracy at high Reynolds numbers with complex geometries [17]. On the
other hand, the D3Q27 model is a suitable model for a weakly compressible flow
at high Reynolds numbers.
Figure 1 shows schematic figures of the above velocity vector models. Since
airflows in urban cities are turbulent with high Reynolds numbers, we adopt the
D3Q27 model. The components of the velocity vectors are defined as

  c_i = (0, 0, 0)                               i = 0
        (±c, 0, 0), (0, ±c, 0), (0, 0, ±c)      i = 1–6
        (±c, ±c, 0), (0, ±c, ±c), (±c, 0, ±c)   i = 7–18
        (±c, ±c, ±c)                            i = 19–26    (2)
Fig. 1. Components of the velocity vector of (a) D3Q15, (b) D3Q19, and (c) D3Q27 models.
Here, c is the sound speed, normalized as c = 1. Each velocity refers to the
predetermined upwind quantity. Since the memory accesses are simple and
contiguous, the streaming process is suitable for high-performance computing.
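The 27 lattice vectors of Eq. (2) can be enumerated directly as all sign
combinations over the three axes; a minimal sketch:

```python
from itertools import product

def d3q27_velocities(c=1):
    """All D3Q27 lattice vectors: every combination of {-c, 0, +c} per axis."""
    return [(c * a, c * b, c * d) for a, b, d in product((-1, 0, 1), repeat=3)]

# Group by the number of nonzero components: 1 rest vector (i = 0),
# 6 axis vectors (i = 1-6), 12 planar diagonals (i = 7-18),
# and 8 full diagonals (i = 19-26).
counts = {}
for v in d3q27_velocities():
    k = sum(x != 0 for x in v)
    counts[k] = counts.get(k, 0) + 1
```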
Ω_i(x, t) = −(1/τ) (f_i(x, t) − f_i^eq(x, t)),    (3)

where τ is the relaxation time, and f_i^eq is a local equilibrium distribution
function. The relaxation time in the collision process is determined using the
dynamic viscosity and the sound speed:

τ = 1/2 + 3ν / (c² Δt).    (4)
In this wind simulation, since the Mach number is less than 0.3, the flow can
be regarded as incompressible. The equilibrium distribution function f_i^eq of
the incompressible model is given as

f_i^eq(x, t) = ω_i (1 + 3(c_i · u)/c² + 9(c_i · u)²/(2c⁴) − 3u²/(2c²)).    (5)
Since the SRT model is unstable at high Reynolds numbers, a Large-Eddy
Simulation (LES) model has to be used to solve the LBM equation. The dynamic
Smagorinsky model [19, 20] is often used, but it requires an averaging process
over a wide area to determine the model constant. This is a huge overhead for
large-scale computations, and it would negate the simplicity of the SRT model.
The cumulants of the distribution function are obtained from the generating
function F as

c_αβγ = c^(−α−β−γ) ∂^α ∂^β ∂^γ / (∂Ξ^α ∂Υ^β ∂Z^γ) ln(F(Ξ, Υ, Z)).    (8)

Here, the subscripts α, β, and γ are the indices of the cumulant. All decay
processes are computed by

c*_αβγ = ω_αβγ c^eq_αβγ + (1 − ω_αβγ) c_αβγ.    (9)

The asterisk denotes the post-collision cumulant, and ω_αβγ is the relaxation
frequency. The Maxwellian equilibrium is expressed as a finite Taylor
expansion:

ln(F^eq(Ξ, Υ, Z)) = ln(ρ/ρ₀) + Ξu + Υv + Zw + (c²θ/2)(Ξ² + Υ² + Z²).    (10)
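The relaxation step of Eq. (9) is a per-cumulant linear blend toward
equilibrium; a minimal sketch:

```python
def relax_cumulant(c_abc, c_eq, omega):
    """Post-collision cumulant per Eq. (9): omega = 1 jumps straight to the
    equilibrium value, while omega = 0 leaves the cumulant unchanged."""
    return omega * c_eq + (1.0 - omega) * c_abc
```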
Fig. 2. Interpolated bounce-back boundary conditions for (a) Δ < 1/2 and
(b) Δ ≥ 1/2. The velocity distribution function f is computed by a linear
interpolation in the upwind cell.
where the subscript (−1) denotes the reversed direction of each velocity
component, and F_i is the force on the solid boundary, given as

F_i,(−1) = 3ω_i ρ (c_i · u_b) / c².    (12)

Here, u_b is the velocity vector of the boundary. Since each velocity
distribution function refers to predetermined neighboring upwind and downwind
quantities, this scheme is more suitable for high-performance computing than
the IBM [23, 24].
Fig. 3. Schematic figures of computational leaves: (a) Interpolating operations of (red) linear
interpolation, (green) exchange values between coarse and fine grids, (blue) copy values from
fine to coarse grid in 1D case. (b) An example of leaf arrangement in 2D case. Calculation region
is surrounded by the halo (boundary) region of the same refined level. (Color figure online)
The AMR method is applied to resolve the boundary layer near the buildings. The
octree is initialized at the beginning of the simulation, and the mesh is not
changed dynamically during the time stepping.
To keep a constant viscosity on the coarse and fine grids, the relaxation time
τ satisfies the following expression:

τ_f = m (τ_c − 1/2) + 1/2.    (14)

Here, the super- and subscripts c and f denote values on the coarse and fine
grids, respectively, and the coefficient m is the refinement factor. The time
step is also redefined for each resolution as Δt_f = Δt_c / m. To take account
of the continuity of the hydrodynamic variables and their derivatives on the
interface between two resolutions, the distribution functions satisfy the
following equations:

f_i^c = f_i^eq,f + m ((τ_c − 1)/(τ_f − 1)) (f_i^f − f_i^eq,f),    (15)

f_i^f = f_i^eq,c + (1/m) ((τ_f − 1)/(τ_c − 1)) (f_i^c − f_i^eq,c).    (16)
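Equations (4) and (14) together guarantee that the fine and coarse grids model
the same viscosity. The check below works in lattice units with c = 1 and is a
sketch, not the paper's code:

```python
def viscosity(tau, dt, c=1.0):
    """Invert Eq. (4): nu = c^2 * dt * (tau - 1/2) / 3."""
    return c**2 * dt * (tau - 0.5) / 3.0

def refine(tau_c, dt_c, m):
    """Fine-grid relaxation time (Eq. (14)) and time step dt_f = dt_c / m."""
    return 0.5 + m * (tau_c - 0.5), dt_c / m
```

Shrinking the time step by m while stretching (τ − 1/2) by m leaves the
physical viscosity invariant across refinement levels.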
Fig. 4. Flowchart of the computational procedure on the coarse grid (Lv. 0)
and the fine grid (Lv. 1): ① streaming and collision on each grid, ② time and
space interpolation, ③ streaming and collision on the fine grid, and ④ space
interpolation on each level. Processes ② and ④ are executed in the halo region.
Fig. 5. Pseudocodes for stencil computation as (top) function to call CPU or GPU instruction,
(middle left) function executed on the CPU, (middle right) function executed on the GPU, and
(bottom) common function of both CPU and GPU.
boundary objects are fixed, optimal kernel functions are created at the
beginning of the calculation. We show the PTX information generated by the
NVIDIA CUDA Compiler 8.0.61 in single precision.
As described above, the function without boundary conditions (Func1) requires
fewer registers than the original function (Func2). By executing the two
functions asynchronously, it is possible to use more threads than in the
original calculation. Details of the computational performance are discussed in
Sect. 6.1 below.
Figure 6 shows profiles of the velocities along a vertical line and a
horizontal line passing through the center of the cavity at (a) Re = 1000,
(b) Re = 3200, (c) Re = 5000, and (d) Re = 10000. The calculation results are
in good agreement with the reference results. When we used the SRT model, the
calculation diverged at high Reynolds numbers such as 3200. We conclude that
our simulation is robust at high Reynolds numbers, and that the physical
phenomena can be reproduced with few grid points.
Fig. 6. Velocity profiles of u along a vertical line (green solid line) and v along a horizontal line
(orange solid line) passing through the center of the cavity at (a) Re = 1000, (b) Re = 3200,
(c) Re = 5000, and (d) Re = 10000. Each axis is normalized by the half-length
of the computational domain and the velocity of the moving wall. (Color figure
online)
Fig. 7. Schematic figures of the wind tunnel test: (a) top view and (b) side view. A cube is
placed on the center of the floor.
Fig. 8. Mean velocity profiles (m/s) in the streamwise direction: (a) in the
horizontal plane at the center of the cube (z = 1/2H), and (b) in the vertical
plane at the center of the cube (y = 0). Red solid lines show calculation
results and blue dots show experimental data, plotted as
u_plot = 0.02 u_mean + x_line. Simulation and experimental data were measured
along the lines x_line = (−50, 0, 65, 100, 150, 200, 250 mm). (Color figure
online)
Table 4 shows the benchmark parameters and the single-process performance on
TSUBAME 3.0. Here, the single-process performance is estimated by subtracting
the communication cost from the total cost. We scan the number of grid points
in a leaf (N_leaf), while the total number of grid points is kept equal. The
performances in mega-lattice updates per second (MLUPS) are measured in single
precision. Table 4 shows that the performances of the GPU version are about
10 times higher than those of the CPU version for various leaf sizes. It is
unclear why the GPU performance is much higher than the ratio of the GPU and
CPU memory bandwidths. We estimate that the main kernel is compute intensive,
and the NVIDIA CUDA compiler may not generate SIMD-optimized CPU code. There is
a possibility that the Intel compiler can generate faster CPU code.
The performances of the Optimal GPU version are about 1.5 times higher than
those of the GPU version under the conditions N_leaf = (4³, 8³, 16³). Since the
benchmark is executed including all the AMR leaves, the boundary-separation
technique works well for small leaf sizes.
(Note: OpenMPI 2.1.2 supports GPUDirect RDMA, which enables direct P2P
(peer-to-peer) data transfer between GPUs. However, we did not succeed in
performing MPI communications using GPUDirect RDMA on TSUBAME 3.0.)
Fig. 9. Weak scaling results of the LBM simulation on (a) GPUs and (b) CPUs. 4 MPI
processes are executed in each node.
In the weak scaling tests, the parallel efficiencies from 1 node to 36 nodes of
CPUs and GPUs are 98% and 85%, respectively. Although the CPUs show better
scalability, the performance on a single GPU node (733 MLUPS) is comparable to
that on 36 CPU nodes (767 MLUPS).
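The quoted parallel efficiencies follow from dividing the measured aggregate
performance by ideal linear scaling of the single-node figure:

```python
def weak_scaling_efficiency(perf_n, perf_1, n_nodes):
    """Measured performance on n nodes divided by n times the 1-node figure."""
    return perf_n / (perf_1 * n_nodes)

# Reported GPU numbers: 733 MLUPS on 1 node, 22535 MLUPS on 36 nodes.
gpu_eff = weak_scaling_efficiency(22535.0, 733.0, 36)
```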
Δx_realtime = (U_target / CFL_target) Δt_cal.    (18)

Here, U_target is the wind velocity, CFL_target is the CFL number at U_target,
and Δt_cal is the elapsed time per step.
We estimate the mesh resolution under the condition (U_target, CFL_target) =
(5.0 m/s, 0.2). The computational condition is based on the single-GPU-node
case in Subsect. 6.3 above. The fine leaves are placed near the ground surface,
and the resolution changes in the height direction. The leaves are arranged as
24 × 24 × 17 at Lv. 0, 48 × 48 × 16 at Lv. 1, and 96 × 96 × 16 at Lv. 2. The
achieved computational performance is 733 MLUPS using a single GPU node. The
resulting minimum mesh resolution Δx_realtime corresponds to a whole
computational domain size of (L_x, L_y, L_z) = (2.8 km, 2.8 km, 3.3 km). The
above estimation shows that a detailed real-time wind simulation can be
realized by GPU computing.
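Eq. (18) can be evaluated directly; the elapsed times per step used below are
illustrative placeholders, not measured values from the paper:

```python
def realtime_resolution(u_target, cfl_target, dt_cal):
    """Eq. (18): the coarsest mesh spacing for which a step of wall-clock cost
    dt_cal keeps up with real time at the target wind speed and CFL number."""
    return (u_target / cfl_target) * dt_cal
```

Halving the cost per step halves the mesh spacing that can be simulated in
real time, which is why single-step GPU performance matters here.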
This paper presented the GPU implementation of air flow simulations for the
environmental dynamics of radioactive substances. We have successfully
implemented the AMR-based LBM with a state-of-the-art cumulant collision
operator. Our code is written in CUDA 8.0 and is executed both on CPUs and
GPUs by using the CUDA runtime API "cudaMallocManaged". Since the LBM kernel
needs many registers on GPUs, the number of threads executed is limited by the
lack of registers. We proposed an effective optimization that creates a kernel
function for each conditional branch. This technique reduces the number of
registers compared to the original function, and the single-GPU performance is
accelerated by about 1.5 times. A single GPU process (NVIDIA TESLA P100)
achieved 383.3 mega-lattice updates per second (MLUPS) with a leaf size of 4³
in single precision. This performance is about 16 times higher than that of a
single CPU process (two 14-core Broadwell-EP processors, 2.4 GHz).
We have also discussed the weak scalability results: 36 GPU nodes achieved
22535 MLUPS with a parallel efficiency of 85% compared with a single GPU node.
The present scaling studies revealed a severe performance bottleneck due to MPI
communication, which will be addressed via GPUDirect RDMA or NVLink in future
work.
Finally, we estimated the minimum mesh resolution Δx_realtime at which air flow
simulations can be executed in real time. The estimation shows that a detailed
real-time wind simulation can be realized by GPU computing. We conclude that
the present scheme is an efficient approach to realizing a real-time simulation
of the environmental dynamics of radioactive substances.
Acknowledgements. This research was supported in part by the Japan Society for
the Promotion of Science (KAKENHI), a Grant-in-Aid for Scientific Research (C)
17K06570 and a Grant-in-Aid for Scientific Research (B) 17H03493 from the
Ministry of Education, and the "Joint Usage/Research Center for
Interdisciplinary Large-scale Information Infrastructures" in Japan (Project
ID: jh170031-NAH). Computations were performed on the TSUBAME 3.0 at the Tokyo
Institute of Technology and the ICEX at the Japan Atomic Energy Agency.
References
1. Nakayama, H., Takemi, T., Nagai, H.: Adv. Sci. Res. 12, 127–133
2. Rothman, D.H., Zaleski, S.: J. Fluid Mech. 382(01), 374–378 (1997)
3. Inamuro, T.: Fluid Dyn. Res. 44, 024001 (2012). 21 pp.
4. Inagaki, A., Kanda, M., et al.: Boundary-Layer Meteorology, pp. 1–21 (2017)
5. Kuwata, Y., Suga, K.: J. Comp. Phys. 311 (2016)
6. Rahimian, A., Lashuk, I., et al.: In: Proceedings of the 2010 ACM/IEEE International
Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11.
IEEE Computer Society (2010)
7. Rossinelli, D., Tang, Y.H., et al.: In: Proceedings of the 2015 ACM/IEEE International
Conference on High Performance Computing, Networking, Storage and Analysis, vol. 2.
IEEE Computer Society (2015)
8. Berger, M.J., Oliger, J.: J. Comp. Phys. 53(3), 484–512 (1984)
9. Zhao, Y., Liang-Shih, F.: J. Comp. Phys. 228(17), 6456–6478 (2009)
10. Zhao, Y., Qiu, F., et al.: Proceedings of 2007 Symposium on Interactive 3D Graphics,
pp. 181–188 (2007)
11. Yu, Z., Fan, L.S.: J. Comput. Phys. 228(17), 6456–6478 (2009)
12. Wang, X., Aoki, T.: Parallel Comput. 37(9), 521–535 (2011)
13. Shimokawabe, T., Aoki, T., et al.: In: Proceedings of the 2010 ACM/IEEE International
Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11.
IEEE Computer Society (2010)
14. Shimokawabe, T., Aoki, T., et al.: In: Proceedings of the 2011 ACM/IEEE International
Conference on High Performance Computing, Networking, Storage and Analysis, vol. 3.
IEEE Computer Society (2011)
15. Feichtinger, C., Habich, J., et al.: Parallel Comput. 37(9), 536–549 (2011)
16. Zabelock, S., et al.: J. Comput. Phys. 303(15), 455–469 (2015)
17. Kang, S.K., Hassan, Y.A.: J. Comput. Phys. 232(1), 100–117 (2013)
18. Zou, Q., He, X., et al.: Phys. Fluid 9(6), 1591–1598 (1996)
19. Germano, M., Piomelli, U., Moin, P., Cabot, W.H.: Phys. Fluids A 3(7),
1760–1765 (1991)
20. Lilly, D.K.: Phys. Fluids A 4(3), 633–635 (1992)
21. Geier, M., Schonherr, M., et al.: Comput. Math. Appl. 70(4), 507–547 (2015)
22. Geier, M., Pasquali, A., et al.: J. Comput. Phys. 348, 889–898 (2017)
23. Kim, J., Kim, D., Choi, H.: J. Comput. Phys. 171(20), 132–150 (2001)
24. Peng, Y., Shu, C., et al.: J. Comput. Phys. 218(2), 460–478 (2006)
25. Chun, B., Ladd, A.J.C.: Phys. Rev. E 75(6), 066705 (2007)
26. Yin, X., Zhang, J.: J. Comput. Phys. 231(11), 4296–4303 (2012)
27. Guzik, S.M., Weisgraber, T.H., et al.: J. Comput. Phys. 259(15), 461–487 (2014)
28. Laurmaa, V., Picasso, M., Steiner, G.: Comput. Fluids 131(5), 190–204 (2016)
29. Zuzio, D., Estivalezes, J.L.: Comput. Fluids 44(1), 339–357 (2011)
30. Usui, H., Nagara, A., et al.: Proc. Comput. Sci. 29, 2351–2359 (2014)
Architecture of an FPGA-Based
Heterogeneous System for Code-Search
Problems
1 Introduction
Fields such as cryptography, data encoding, error correction, etc. often
require bit patterns that satisfy particular conditions. However, finding such
bit patterns is a very time-consuming problem. For example, in order to find a
64-bit code that satisfies particular conditions, we have to search 2^64 bit
patterns. Even if we can search one bit pattern per clock cycle using a 4 GHz
CPU, it requires over 146 years to search all combinations. For a 128-bit code
search problem, the required processing time exceeds the age of the universe.
How, then, can we solve such code-search problems? Mathematicians have proposed
many algorithms to generate a particular code that satisfies the conditions,
instead of searching all possible bit patterns. For example, to find all 64-bit
numbers that are divisible by four, we can fix the least significant two bits
to zero and generate all combinations of bit patterns of the other 62 bits.
This increases the processing speed by four times. For more complex problems,
many different methods are available to reduce the amount of searching. Most of
these methods use bit operations or fixed-point computations. On the other
hand, CPUs and GPUs are specialized for floating-point computations, and using
them for such simple bit operations is not efficient.
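The divisibility-by-four example above can be made concrete: fixing the two
least-significant bits to zero shrinks the enumeration by a factor of four. A
small-width sketch (6 bits instead of 64, for illustration):

```python
def multiples_of_four(width):
    """Enumerate every multiple of 4 below 2**width by generating only the
    upper width-2 bits and appending two fixed zero bits."""
    return [upper << 2 for upper in range(1 << (width - 2))]
```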
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 146–155, 2018.
https://doi.org/10.1007/978-3-319-69953-0_9
2 Code-Search Problems
In this paper, we consider the acceleration of the extremal doubly even
self-dual code search [8] as an example to show the efficiency of the
FPGA-based heterogeneous system for such problems. Self-dual codes are an
important class of linear codes with both theoretical importance and practical
applications [7]. They are important in fields such as cryptography, error
correction, etc. In this section, we briefly explain the extremal doubly even
self-dual code search algorithm. Note that we omit the details of the
mathematical background, since it is not in the scope of this paper; readers
can refer to [8] for the details. We focus on the types of computations
required in such code-search problems, and on how to accelerate those
computations using an FPGA-based heterogeneous system.
In [8], the extremal doubly even self-dual code is described as follows.
"A binary self-dual code C of length n is a code over F_2 satisfying C = C⊥,
where the dual code C⊥ of C is defined as C⊥ = {x ∈ F_2^n | x · y = 0 for all
y ∈ C} under the standard inner product x · y. A self-dual code C is doubly
even if all codewords of C have hamming weight divisible by four, and singly
even if there is at least one codeword of hamming weight ≡ 2 (mod 4). Note that
a doubly even self-dual code of length n exists if and only if n is divisible
by eight. It was shown in [9] that the minimum hamming weight d of a doubly
even self-dual code of length n is bounded by d ≤ 4⌊n/24⌋ + 4. A doubly even
self-dual code meeting this upper bound is called extremal."
148 Y. Hiradate et al.
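The bound quoted above is easy to evaluate; for instance, it yields d ≤ 24 for
the length-128 codes targeted in this paper:

```python
def extremal_weight_bound(n):
    """Upper bound d <= 4 * floor(n / 24) + 4 on the minimum hamming weight of
    a doubly even self-dual code of length n (n must be divisible by 8)."""
    if n % 8 != 0:
        raise ValueError("doubly even self-dual codes require 8 | n")
    return 4 * (n // 24) + 4
```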
For example, an extremal doubly even self-dual code C of length 128 satisfies
the following three conditions.
To find such a code, the work in [8] proposes the following algorithm, which
contains four steps.
Step 1: Generate x ∈ F_2^64 with wt(x) ≡ 3 (mod 4).
Step 2: If A·Aᵀ + B·Bᵀ ≠ I_32, go to step 1. A and B are the circulant
matrices given by Eq. (1):

    A = [ x_1   x_2   ...  x_32 ]      B = [ x_33  x_34  ...  x_64 ]
        [ x_32  x_1   ...  x_31 ]          [ x_64  x_33  ...  x_63 ]
        [  ...                  ]          [  ...                  ]
        [ x_2   x_3   ...  x_1  ]          [ x_34  x_35  ...  x_33 ]    (1)
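Steps 1–2 reduce to matrix arithmetic over F_2. The sketch below builds the
circulant matrices of Eq. (1) and tests the condition of step 2, shown here for
a reduced length-8 code rather than length 64:

```python
def circulant(row):
    """Circulant matrix as in Eq. (1): row i is the first row rotated right
    by i positions."""
    n = len(row)
    return [[row[(j - i) % n] for j in range(n)] for i in range(n)]

def step2_holds(x):
    """Check A*A^T + B*B^T == I over F_2 for an even-length bit vector x."""
    h = len(x) // 2
    A, B = circulant(x[:h]), circulant(x[h:])
    for i in range(h):
        for j in range(h):
            s = sum(A[i][k] * A[j][k] + B[i][k] * B[j][k] for k in range(h)) % 2
            if s != (1 if i == j else 0):
                return False
    return True
```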
Step 3: The matrices G and H in Eq. (2) are the generator matrices of C and
C⊥, respectively. If the hamming weight of the sum up to the 10th row of G is
less than 20, go to step 1. If the hamming weight of the sum up to the 10th row
of H is less than 20, go to step 1.

    M = [ A    B  ]      G = (I_64, M),  H = (Mᵀ, I_64)    (2)
        [ Bᵀ   Aᵀ ]
In order to satisfy step 3 of the code-search algorithm, the hamming weight of
{x_1 ... x_64} must be equal to or larger than 19. That is, at least 19 bits of
the 64 bits in the code must be ones. Therefore, we have to search k-out-of-64
codes where 19 ≤ k ≤ 64. Searching for such codes is a very time-consuming
problem.
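The weight precondition derived above is a single popcount per candidate; a
sketch:

```python
def passes_weight_precondition(x, k_min=19):
    """A 64-bit candidate can survive step 3 only if it has at least k_min
    one-bits, so low-weight patterns are rejected immediately."""
    return bin(x).count("1") >= k_min
```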
Fig. 1. The amount of data transferred between the steps of the code-search
method.
There are a few methods, such as [10, 11], for k-out-of-n code generation.
However, these methods have a data dependency between code searches. That is,
the search for a new bit pattern can be started only after the search for the
previous bit pattern has finished. As a result, it is extremely difficult to
accelerate such methods using parallel processing. Therefore, we use the
"circular permutation generation algorithm" proposed in [12] to accelerate
k-out-of-n code generation. A p-ary circular permutation of length n is an
n-character string over an alphabet of size p, where all rotations of the
string are considered equivalent. Therefore, we can regard a circular
permutation code as a seed and generate the other bit patterns by rotating its
bits. Figure 3 shows two seeds and the generated bit patterns of 2-out-of-4
codes. The rotation of bits can be done in parallel using bit-shift operations.
Therefore, even if we generate the seeds serially, we can still have a large
amount of parallel operations.
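The seed-plus-rotation idea is easy to verify for the 2-out-of-4 example: the
two seeds 0011 and 0101 together generate all six 2-out-of-4 codes. A sketch of
the rotation step (on the FPGA these shifts are performed in parallel):

```python
def rotations(code, n):
    """All distinct n-bit rotations of a seed code."""
    mask = (1 << n) - 1
    return {((code << r) | (code >> (n - r))) & mask for r in range(n)}
```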
The algorithm to generate circular permutations [12] is a serial one.
Therefore, the only way to increase the processing speed of the circular
permutation generation is to increase the clock frequency. Unfortunately, the
clock frequency of an FPGA is usually less than 300 MHz. Therefore, we use a
CPU for the permutation generation, which has a more than 10 times higher clock
frequency compared to that of an FPGA. Once the permutations are generated,
they are transferred to the external memory (DRAM) of the FPGA board. The FPGA
accelerator accesses those data, performs 63 shift operations in parallel, and
selects the codes that satisfy step 1 of the algorithm explained in Sect. 2.1.
The permutation generation and the bit shifts can be done in parallel as shown
in Fig. 4. The times required for a circular permutation generation and for the
parallel bit-shift operations are nearly equal. In this way, the CPU and the
FPGA are used in a parallel manner.
Architecture of an FPGA-Based Heterogeneous System 151
Fig. 4. Parallel processing of k-out-of-n code generation using a CPU and FPGA.
step 2 is reduced without affecting the total processing time. A part of the
matrix calculation program code is shown in Fig. 5.
In step 3, the Hamming weight up to the 10th row is calculated. However, most
codes can be rejected by computing the Hamming weight of only the first few rows.
Therefore, we divide step 3 into two stages. In the first stage, the Hamming
weight of the first 5 rows is computed. The codes that satisfy this condition go
to the second stage. We use only 2 modules in the second stage since a smaller
degree of parallelism is required. A part of the step 3 program code is shown in
Fig. 6.
4 Evaluation
We used two systems for the evaluation, where one contains only one CPU and
the other contains one CPU and one FPGA. In the CPU-only system, the CPU is an
Intel Xeon E5-1650 v3 (3.50 GHz). In the heterogeneous system, the CPU is an Intel
Xeon E5-2643, and the FPGA is a Terasic DE5a-Net FPGA board [13] with an Intel
Arria 10 FPGA. The FPGA is configured using Quartus Prime Pro 16.1 and the Intel
FPGA SDK for OpenCL [14]. The CPU codes are compiled using Intel compiler 17.0
with OpenMP directives for parallel computation.
Table 1 shows the comparison of the processing times of k-out-of-n code
generation using different methods. In this evaluation, n is 64 and k is 8. The fastest
CPU implementation is a nested-loop implementation that searches all bit patterns
to find the desired code. Some parts of the loop can be processed in parallel so
that the processing time is reduced. The proposed heterogeneous implementation
achieves an over 2.4 times speed-up compared to this nested-loop
implementation.
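For reference, the nested-loop baseline amounts to scanning the whole 2^n search space and filtering by population count. This serial sketch is illustrative only and feasible only for small n; at n = 64 the 2^64 iteration space is exactly what makes the CPU search so expensive.

```cpp
#include <cstdint>
#include <vector>

// Exhaustive k-out-of-n baseline: scan every n-bit pattern (n < 64 here,
// to keep the shift well defined) and keep those whose population count
// is exactly k. Each iteration is independent, so the loop body is easy
// to parallelize, but the iteration count grows as 2^n.
std::vector<uint64_t> k_out_of_n_exhaustive(unsigned n, unsigned k) {
    std::vector<uint64_t> codes;
    for (uint64_t x = 0; x < (1ULL << n); ++x)
        if (static_cast<unsigned>(__builtin_popcountll(x)) == k)
            codes.push_back(x);
    return codes;
}
```

For n = 16 and k = 4 this yields C(16,4) = 1820 codes; the seed-and-rotate method reaches the same set while enumerating only one representative per rotation class.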
Table 2 shows the comparison of the total processing times of the extremal doubly
even self-dual code search. Note that the clock frequency of the FPGA is reduced
to 207 MHz from the 309 MHz of Table 1 due to the increased computation. Even with
such a low clock frequency, the speed-up of the proposed implementation is 86.9
times compared to the CPU-only implementation. This shows that FPGAs are
very efficient for bit operations. Moreover, nearly 64 codes can be checked per
clock cycle in the FPGA due to its massively parallel computations.
Table 3 shows the resource usage of the FPGA. Since only 37% of the logic
resources are used, there is potential to increase the processing speed further
by performing more computations in parallel. If we increase the degree of
parallelism, the bottleneck would be the circular permutation generation on the CPU.
Table 2. Results

                                 Conventional   Proposed
  Device                         CPU only       CPU & FPGA
  Clock frequency (MHz)          3500           207
  Processing time (s)            29.13          0.33
  Number of clock cycles (10^9)  10.21          0.07
  Codes checked per clock cycle  0.04           63.5
154 Y. Hiradate et al.
5 Conclusion
In this paper, we propose an FPGA-based heterogeneous system for extremal
doubly even self-dual code search. Although we are yet to solve the problem,
there is a great potential to find a solution in near future due to over 86 times
of speed-up of the proposed system compared to a conventional one with only
a CPU. Moreover, we used only 34% of the FPGA resources, so that further
increase of speed is possible. It is very important to exploit the possibility of
accelerating other code search problems using FPGAs in future.
References
1. Marchal, P.: Field-programmable gate arrays. Commun. ACM 42(4), 57–59 (1999)
2. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/arria-10/arria_10_aib.pdf
3. Czajkowski, T.S., Neto, D., Kinsner, M., Aydonat, U., Wong, J., Denisenko, D.,
Yiannacouras, P., Freeman, J., Singh, D.P., Brown, S.D.: OpenCL for FPGAs:
prototyping a compiler. In: Proceedings of the International Conference on Engi-
neering of Reconfigurable Systems and Algorithms (ERSA), p. 1 (2012)
4. Waidyasooriya, H.M., Hariyama, M., Uchiyama, K.: Design of FPGA-Based Com-
puting Systems with OpenCL (2017)
5. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. North-
Holland, Amsterdam (1977)
6. Pasquier, G.: A binary extremal doubly even self-dual code (64, 32, 12) obtained
from an extended Reed-Solomon code over F16. IEEE Trans. Inform. Theory 27,
807–808 (1981)
7. Rains, E., Sloane, N.J.A.: Self-dual codes. In: Pless, V.S., Huffman, W.C. (eds.)
Handbook of Coding Theory, pp. 177–294. Elsevier, Amsterdam (1998)
8. Harada, M.: An extremal doubly even self-dual code of length 112. Electron. J.
Comb. 15, 1–5 (2008)
9. Mallows, C.L., Sloane, N.J.A.: An upper bound for self-dual codes. Inform. Control
22, 188–200 (1973)
10. Harbison, S.P., Steele Jr., G.L.: C: A Reference Manual. Prentice Hall, Englewood
Cliffs (1987)
11. https://docs.python.org/2/library/itertools.html
12. Sawada, J.: A fast algorithm to generate necklaces with fixed content. Theoret.
Comput. Sci. 301, 477–489 (2003)
13. Terasic, DE5-Net FPGA Development Kit. http://www.terasic.com.tw/cgi-bin/
page/archive.pl?Language=English&CategoryNo=158&No=526
14. Intel FPGA SDK for OpenCL, Programming Guide. https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Performance Tools
TINS: A Task-Based Dynamic Helper
Core Strategy for In Situ Analytics
1 Introduction
The exascale era will bring more computational capabilities, enabling the
simulation of more complex phenomena with higher precision. This will generate
a growing amount of data. Traditionally, simulation codes output data to the
filesystem, and these data are later read back for postmortem analytics. However,
the growing gap between computational capabilities and IO bandwidth calls for
new data processing methods.
The in situ paradigm proposes to reduce data movement and to analyze data
while they are still resident in the memory of the compute node, by co-locating
simulation and analytics on the same compute node [1]. The simplest approach
consists in modifying the simulation timeloop to directly call analytics routines.
However, several works have shown that an asynchronous approach, where analytics
and simulation run concurrently, can lead to significantly better performance [2–4].
Today, the most efficient approach consists in running the analytics processes on
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 159–178, 2018.
https://doi.org/10.1007/978-3-319-69953-0_10
160 E. Dirand et al.
a set of dedicated cores, called helper cores, to isolate them from the simulation
processes [3]. Simulation and analytics thus run concurrently on different cores,
but this static isolation can lead to underused resources if the simulation or the
analytics do not fully use all the assigned cores.
In this paper, we introduce TINS, a task-based in situ framework that implements
a novel dynamic helper core strategy. TINS relies on a work stealing
scheduler and on task-based programming. Simulation and analytics tasks are
created concurrently and scheduled on a set of worker threads created by a single
instance of the work stealing scheduler. Helper cores are assigned dynamically:
some worker threads are dedicated to analytics when analytics tasks are available,
while they join the other threads to process simulation tasks otherwise,
leading to a better resource usage. We leverage the good compositionality
properties of task-based programming to keep the analytics and simulation
codes well separated, and a plugin system makes it possible to develop parallel
analytics codes outside of the simulation code.
TINS is implemented with the Intel® Threading Building Blocks (TBB)
library, which provides a task-based programming model and a work stealing
scheduler. The experiments are conducted with the hybrid MPI+TBB ExaStamp
molecular dynamics code [5], which we associate with a set of analytics
representative of computational physics algorithms. We show up to 40% performance
improvement over various other approaches, including the standard helper core
approach, in experiments on up to 14,336 Broadwell cores.
The paper is organized as follows. After an overview of related work (Sect. 2),
we present the TINS task-based in situ method (Sect. 3) and we compare the
dynamic helper core method with state-of-the-art approaches (Sect. 4).
2 Related Work
The most direct way to perform in situ processing is called synchronous and
consists in in-lining the analytics code in the simulation code. The total execution
time is the sum of the simulation and analytics times, plus some possible
overheads due to cache thrashing. The analytics can directly access the simulation
data structures, but more often a copy is performed to build a data structure
adapted to the analytics needs [6]. ParaView/Catalyst [7] and VisIt/Libsim [8]
both rely on this approach to enable in situ visualization. They recently
worked on a unified in situ API for simulation codes, called SENSEI [9], to
switch between Catalyst, Libsim and the IO framework ADIOS [10] with very
limited code modifications.
Parallel simulations are almost never 100% efficient, some cores being idle
during communication phases, for instance, or because some code sections do not
provide enough parallelism to feed all the cores. One idea is to harvest these
CPU cycles to execute analytics, leading to execution times shorter than with
the synchronous execution. This is called asynchronous in situ. A simple
approach consists in relying on the OS scheduler capabilities to allocate resources.
The analytics runs its own processes or threads concurrently with the ones of the
simulation. The simulation only needs to give a copy of the relevant data to the
local in situ analytics processes. The analytics can then proceed concurrently
with the simulation. However, works [11,12] show that relying on the OS
scheduler does not prove efficient, because the presence of analytics processes
tends to disturb the simulation.
To circumvent this problem, a common approach consists in dedicating one
or more cores, called helper cores, to the analytics. The simulation runs on fewer
cores but, because it is usually not 100% efficient, its performance decreases
by less than the ratio of confiscated cores. Damaris [3], FlowVR [2], Functional
Partitioning [13], GePSeA [14], Active Buffer [15] and FlexIO [4] support this
approach and have demonstrated its benefit in different contexts. Performance
gains are usually significant compared to a synchronous approach. However,
because the analytics and simulation are both isolated on distinct subsets of
cores, this helper core strategy does not allow the analytics to harvest unused
cycles of the simulation cores, and vice versa.
GoldRush [11] takes a different approach. It implements a custom time-sharing
scheduling policy to interleave simulation and analytics while limiting the
interference on the simulation. GoldRush detects sequential sections in the OpenMP
code of the simulation and schedules the analytics processes there. The simulation
sends resume signals to the analytics during these sections, while the analytics
are suspended otherwise. Experiments show that the simulation performance is
improved compared to OS-controlled scheduling or a synchronous approach. However,
GoldRush does not enable overlapping simulation and analytics during short
simulation sequential sections and weakly scalable parallel sections.
All the previously mentioned approaches apply to MPI or MPI+OpenMP
simulations. New programming models are also being developed as alternatives to
message passing. StarPU [16], PaRSEC [17], Legion [18] and HPX [19] propose
task-based runtime systems for distributed heterogeneous architectures. The program
defines a directed acyclic graph where vertices are tasks and edges are data
dependencies between tasks. The runtime is in charge of mapping tasks to resources,
triggering task execution and performing the necessary data movements when data
dependencies are resolved. Early experiments have been reported using Legion
for in situ analytics [20,21]. They show that the Legion runtime is able to overlap
analytics and simulation tasks, but globally the performance is not yet
competitive with MPI approaches.
In a more general context the shortcomings of standard OS for scheduling
concurrent parallel applications on one multi-core node motivated the develop-
ment of specific co-scheduling strategies. Space-sharing is often favored compared
to time-sharing as it usually leads to better performance. But these solutions
require a specific OS scheduler or modifications to the parallel runtimes [12,22].
[Figure 1 diagram: inside one MPI process, the simulation master thread computes timesteps, copies data at each analytics breakpoint, notifies dataReady and waits for analyticsDone; the analytics master thread waits for dataReady, runs the analytics and notifies analyticsDone.]
Fig. 1. Timeloops of the simulation (left) and analytics (right) master threads inside
one MPI process. The green-framed blocks contain sequential regions (MPI
communications for example) and parallel regions where simulation or analytics tasks are
scheduled on the worker threads spawned by TBB inside the MPI process. The red
arrows depict the synchronization between the master threads. (Color figure online)
The user defines an analytics breakpoint frequency that sets the frequency of
data processing. Every time the simulation reaches such an analytics breakpoint,
data are copied into a temporary buffer. When the data are copied, the simulation
master thread notifies the analytics master thread that data are ready to be
processed with the dataReady signal and resumes the simulation execution.
On the other side, the analytics master thread waits for the simulation master
thread's dataReady signal to launch the analytics on the data written into
the temporary buffer. It creates analytics tasks while the simulation master
thread creates simulation tasks in its own timeloop, leading to an asynchronous
in situ pattern. Once the analytics are executed, the analytics master thread
notifies the simulation master thread with the analyticsDone signal. This second
synchronization is added to avoid having to store more than one temporary
buffer; it could be delayed if enough memory were available to store several
buffers. The simulation master thread therefore has to wait for the
analyticsDone signal before writing data into the temporary buffer. This signal
is disabled for the first analytics breakpoint to avoid a deadlock.
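The dataReady/analyticsDone handshake with a single temporary buffer can be sketched with standard C++ primitives in place of the TBB machinery; run_in_situ and its flags are illustrative names, not the TINS API.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal sketch of the two-signal handshake: the simulation thread
// copies data at each breakpoint and signals dataReady; the analytics
// thread processes the copy and signals analyticsDone, which protects
// the single temporary buffer from being overwritten too early.
// Returns how many snapshots the analytics thread processed.
int run_in_situ(int timesteps) {
    std::mutex m;
    std::condition_variable cv;
    bool data_ready = false;
    bool analytics_done = true;   // disabled for the first breakpoint
    bool finished = false;
    std::vector<double> buffer;   // the single temporary buffer
    int processed = 0;

    std::thread analytics([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return data_ready || finished; });
            if (!data_ready && finished) return;
            ++processed;               // "run analytics" on the buffer
            data_ready = false;
            analytics_done = true;     // notify analyticsDone
            cv.notify_all();
        }
    });

    for (int t = 0; t < timesteps; ++t) {
        // ... compute one simulation timestep ...
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return analytics_done; }); // wait analyticsDone
        buffer.assign(3, double(t));   // copy data at the breakpoint
        analytics_done = false;
        data_ready = true;             // notify dataReady
        cv.notify_all();
    }
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return analytics_done; }); // drain last snapshot
        finished = true;
    }
    cv.notify_all();
    analytics.join();
    return processed;
}
```

Between the two synchronization points, simulation and analytics genuinely overlap: the simulation computes its next timestep while the analytics processes the previous copy.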
Fig. 2. Gantt diagram of the execution of simulation and analytics tasks on 6 threads
(T0 to T5) for a static (a) or dynamic (b) helper core strategy. T0 and T5 are
respectively the simulation and analytics master threads, T1, T2 and T3 are worker threads
assigned to simulation and T4 is a worker thread assigned to analytics. The diagram
represents two iterations of a simulation, both being the alternation of four sequential
regions (grey areas) and three parallel regions (blue areas). The analytics is composed
of one parallel region (orange areas). The purple areas highlight the periods when the
threads are idle. The dynamic helper core strategy enables worker threads to switch
to simulation (resp. analytics) tasks when there is no analytics (resp. simulation) work
left, while this is not possible with static helper cores. (Color figure online)
tasks execution while the remaining ones execute simulation tasks. The main
difference with the static approach is that the isolation is temporary: when the
execution of a simulation (resp. analytics) parallel region is completed, the worker
threads involved in the computation can enter the analytics (resp. simulation)
arena if its concurrency level permits it. This method aims at reducing the thread
idleness periods induced by the static helper core approach. We set ns = N − 1
so that all the worker threads and the simulation master thread can work on
simulation tasks if available. Note that the analytics master thread cannot execute
simulation tasks because it is not allowed to enter the simulation arena.
To restrict the number of threads in the analytics arena, we can choose different
values for na. Setting na = ns means that half of the threads will execute analytics
tasks when both arenas are active, while na < ns gives a higher priority to the
simulation. We tested various binding strategies for the worker threads but, because
they can execute tasks from both arenas, we did not observe any binding
strategy outperforming the others. We therefore adopted the least constraining
one by not binding the worker threads.
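The core of the dynamic strategy — a worker prefers analytics work and falls back to simulation work, so no thread idles while tasks of either kind remain — can be sketched with two atomic task counters standing in for the real TBB arenas. The arena-size bound na and the TBB scheduler itself are deliberately omitted; all names are illustrative.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Two pools of remaining tasks, claimed atomically by the workers.
struct TaskPools {
    std::atomic<int> sim_tasks;
    std::atomic<int> ana_tasks;
    TaskPools(int s, int a) : sim_tasks(s), ana_tasks(a) {}
};

// Each worker first tries to claim an analytics task and otherwise
// falls back to a simulation task; it exits only when both pools are
// empty. Returns the total number of tasks executed across all workers.
int run_workers(int sim_tasks, int ana_tasks, int nthreads) {
    TaskPools pools(sim_tasks, ana_tasks);
    std::atomic<int> executed{0};
    auto worker = [&] {
        for (;;) {
            if (pools.ana_tasks.fetch_sub(1) > 0) { ++executed; continue; }
            pools.ana_tasks.fetch_add(1);            // undo failed claim
            if (pools.sim_tasks.fetch_sub(1) > 0) { ++executed; continue; }
            pools.sim_tasks.fetch_add(1);            // undo failed claim
            break;                                   // no work of either kind
        }
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return executed;
}
```

Every task is executed exactly once, and a thread that exhausts the analytics pool immediately switches to simulation work rather than idling, which is the behavior contrasted with static helper cores in Fig. 2.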
4 Experimental Evaluation
We compare the dynamic helper core strategy implemented with TINS with
several other approaches on a molecular dynamics simulation, using Intel® Xeon
processors available in the CCRT French Computing Center.
4.2 Analytics

Analytics       Description
write_dat       Write the positions of the particles inside each MPI process to a file (one file per MPI process)
statistics_seq  Compute sequentially the mean of the positions for the particles inside each MPI process
statistics_par  Compute in parallel the mean of the positions for the particles inside each MPI process (with 1 TBB parallel reduction)
radial          Compute in parallel a local radial distribution function for the particles inside each MPI process (with 2 nested TBB parallel for)
histogram       Compute in parallel a global histogram of rx positions (locally with 2 TBB parallel reductions, and globally with 2 MPI_REDUCE)
In the write_dat routine, each MPI process writes a file with the positions
of each particle at each analytics breakpoint. This analytics mimics a native
file writing pattern commonly used in ExaStamp to write particles in an XYZ
format suitable for post-processing tools. This analytics plugin generates
neither TBB tasks nor MPI communications.
The two statistics routines trigger local computations and do not perform
any MPI communication. They both compute the mean of the positions of the
particles from the data copied in each MPI process. We implemented a sequential
version (statistics_seq) and a parallel version (statistics_par) where the
mean is computed through one TBB parallel reduction. Each task consists of a
few summations but is very memory intensive. When simulating the behavior
of 4,000,000 particles per MPI process, the positions represent approximately
96 MB of data per MPI process, significantly more than the caches available
on a Broadwell processor (see below for the processor specifications). Reading
these data therefore evicts simulation data from the caches. Moreover, these
analytics highlight NUMA effects because the data are split between the caches of
the different NUMA nodes. To further stress memory accesses in the experiments,
the statistics routines can be executed several times at each analytics breakpoint.
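The parallel statistics can be sketched as a chunked reduction over the position array; parallel_mean and its plain std::thread chunks are an illustrative stand-in for the single TBB parallel reduction used by statistics_par.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Mean of the positions via per-thread partial sums, then a final
// combine. Each thread streams through its own contiguous chunk, which
// is what makes this analytics memory-bound on large particle counts.
double parallel_mean(const std::vector<double>& x, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> threads;
    std::size_t chunk = (x.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(x.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) partial[t] += x[i];
        });
    for (auto& th : threads) th.join();
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    return x.empty() ? 0.0 : total / x.size();
}
```

The arithmetic per element is trivial, so with ~96 MB of positions per process the reduction is dominated by cache and NUMA traffic rather than compute, as observed in the experiments.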
The histogram algorithm (histogram) mixes TBB task creation and MPI
communications. This routine counts how many particles have a position in
intervals of the form [rxi, rxi + Δx]. A first collective communication is
necessary to determine the bounds of the system: each MPI process computes its own
minimum and maximum positions with a TBB parallel reduction, and the global
bounds are found thanks to an MPI_REDUCE operation. The global domain is then
split into smaller intervals of the form [rxi, rxi + Δx]. The number of particles
in each interval is computed inside each MPI process thanks to a TBB parallel
reduction, and the global histogram is then computed with an MPI_REDUCE. The
histogram is computed on 1,000 intervals. For experimenting with analytics having
different MPI communication loads, we can increase the size of the arrays
communicated in the second MPI_REDUCE. This way, we can see the influence of
an analytics that spends most of its time in MPI communications.
The local radial distribution function (radial) is a common algorithm in
computational physics and consists in a local histogram over the distances
between the particles. For each particle, we compute the distance to all the
other particles and store them in a local histogram of 1,000 bins. This analytics
requires two nested for loops and is parallelized with TBB thanks to the
tbb::blocked_range2d feature. This algorithm is used because it demonstrates
the effect of a compute-intensive analytics.
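A serial sketch of the radial analytics, assuming simple 3D points; the pair loop below is the part TINS parallelizes over tbb::blocked_range2d, and the names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct P3 { double x, y, z; };

// Histogram the pairwise distances into `bins` fixed-width bins over
// [0, r_max). The O(n^2) double loop over particle pairs is what makes
// this analytics compute intensive compared to the statistics routines.
std::vector<int> radial_histogram(const std::vector<P3>& p,
                                  double r_max, int bins) {
    std::vector<int> hist(bins, 0);
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            int b = static_cast<int>(r / r_max * bins);
            if (b < bins) ++hist[b];   // drop pairs beyond r_max
        }
    return hist;
}
```

Because each (i, j) pair is independent, the iteration space tiles naturally into 2D blocks, which is exactly what a blocked 2D range hands out to worker threads.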
the analytics process and a SIGSTOP signal to suspend it at the end of this
sequential region. We instrumented ExaStamp with the Goldrush API delimiting
the sequential regions where no TBB task is created. We ported the sequential
statistics and the parallel one with its TBB parallelization, which can run tasks
on all cores when resumed by Goldrush.
Experiments run on the Cobalt supercomputer of the CCRT high performance
computing center. Each node has two Intel® Broadwell CPUs running at
2.40 GHz and 128 GB of memory. Each CPU has 2 NUMA nodes with 7 cores
each and a shared L3 cache of 17,920 KB. Hyperthreading is not activated. The
nodes are connected through an EDR InfiniBand network. The codes use Intel®
TBB 16.0.3.210, are compiled with the icpc compiler (version 17.0.4.196) and are
launched with Intel® MPI (version 2017.0.4.196).
Experiments are conducted by timing 32 consecutive iterations of ExaStamp,
with the analytics performed after each timestep. In production codes, outputs
are usually not produced at each timestep to avoid slowing down the execution
too much. Here we stress the system by analyzing data at each iteration to make
the overheads more visible.
Tests are performed on simulations with 4,000,000 particles per MPI process
and one MPI process per node. Simulation codes usually run several MPI
processes per node, but we run only one MPI process per node to probe the TBB
scheduler with a larger pool of cores. We compared the performance of running
ExaStamp with 1 process per node and 4 processes per node and measured only
a 2% performance drop.
4.5 Results
Table 2. Total execution times in seconds of ExaStamp co-located with three analytics
executed with different TINS configurations and with Goldrush for a simulation of
4,000,000 atoms on 28 Broadwell cores (1 MPI process)
Goldrush approach. For longer analytics, like the sequential statistics, the TINS
approach with the dynamic helper core strategy can be up to 34.74% faster than
the Goldrush approach. The long execution time of the stat_seq_1000 analytics
reflects that Goldrush only manages to overlap a small portion of the analytics
with the simulation, because it executes analytics only during long enough
sequential periods. The remainder of the analytics computations that Goldrush
does not manage to execute during the simulation sequential sections is thus
completed after the end of the simulation. The TINS task interleaving strategy
prevents this issue by using both the simulation sequential periods and the
periods when the simulation is not efficient enough to schedule analytics tasks.
Static versus Dynamic Helper Cores
In order to compare the different in situ strategies, we run a simulation of
256,000,000 particles with 64 MPI processes on 1,792 cores (Figs. 3 and 4). We
tested various configurations to stress the memory accesses or the MPI
communications for the statistics and the histogram routines. The statistics
routines were executed 1 to 1,000 times at each analytics breakpoint. We present
here only the results with 100 and 1,000 executions, representative of the two main
behaviors that emerged from these tests. The histogram was tested with global
reductions applied on arrays of 1,000 to 1,000,000,000 integers. We include here
the results for the intermediate size of 100,000,000 integers. For large arrays,
execution times are similar for all strategies, dominated by the MPI
communication. The analytics cost is too low with small array sizes to exhibit
significant performance differences.
For each analytics, we tested various numbers of helper cores and arena sizes.
damaris-a-s corresponds to Damaris running the analytics on a helper cores and
the simulation on the remaining s cores. SHC-a-s (resp. DHC-a-s) corresponds to
the TINS approach running the static (resp. dynamic) helper core strategy with an
analytics arena of size a and a simulation arena of size s. Each histogram bar gives
the total execution time of one strategy. A bar is divided into four areas: the left part
is the simulation master thread idle (no pattern) and active times (cross pattern);
the right part is the analytics master thread execution time, split into idle (no
pattern) and active times (dashed pattern).
the total execution time is dominated by the analytics execution time. If we use
more threads for the analytics, the simulation runs on fewer threads and the
total execution time is dominated by the simulation execution time.
The dynamic helper core strategy is in general less sensitive to the
configuration. For the small analytics in Fig. 3, there is less than 1% difference in the
total execution time from one configuration to another. The different dynamic
helper core configurations are therefore equivalent to a static helper core
approach where one helper core is used. The analytics can be performed with an
overhead of less than 5% with respect to ExaStamp alone, and the dynamic helper
core strategy can be up to 3% faster than the pure asynchronous approach and
28% faster than the synchronous approach, which suffers from NUMA issues.
The results are similar with the sequential statistics performed 1,000 times
(Fig. 4), with approximately 1% difference in the total execution time from one
configuration to another. In the case of the parallel statistics performed 1,000
times (Fig. 4), setting an analytics arena of size 1 is too restrictive because the
analytics cannot benefit from its parallelization. It therefore presents a total
execution time 10% longer than the simulation alone, while the other dynamic
helper core configurations reduce this overhead to 6%. For these analytics, the
dynamic helper core strategy is up to 40% better than the Damaris approach
set with the appropriate number of static helper cores.
The radial analytics shows a slightly different behavior for the dynamic
helper core strategy: increasing the concurrency level of the analytics arena
also increases the total execution time. An analytics arena of size 1 induces
an overhead of 6% with respect to ExaStamp alone, this overhead growing up to
39% with an analytics arena of size 27. This analytics differs from the others
because it executes two nested parallel loops. TBB does not support task switching
on nested parallel loops. When a thread enters the analytics arena during
simulation sequential periods, it cannot move back to the simulation arena before
all the analytics tasks have been executed. In particular, it cannot switch back
to support the simulation when the sequential region is over, slowing down the
progress of the simulation. This effect is all the more visible as the analytics
arena size increases. It is therefore necessary to reduce the size of the analytics
arena in the dynamic helper core strategy, sizes of 4 and 7 being good tradeoffs
in this situation.
Experiments show that TINS implemented with the dynamic helper core
strategy gives generally better performance than the static helper core strategy
implemented by Damaris. In addition, our system shows greater flexibility for the
choice of the number of helper cores, the execution times between the different
dynamic configurations being relatively close.
Task versus Analytics Master Thread
TBB constrains master threads to execute only the tasks of the arena they
created. Thus the analytics master thread never executes simulation tasks. As
we spawn only N − 2 worker threads, there is always one core that cannot
execute simulation tasks, potentially leading to underuse of this core. We tried
oversubscription by creating N − 1 worker threads, but the performance degrades
Fig. 3. Comparison of the different strategies on 1,792 Broadwell cores (64 MPI pro-
cesses) for three analytics quicker than the simulation timestep: file writing (a), sequen-
tial statistics performed 100 times (b) and histogram with an array of 100,000,000
integers for the MPI collective communication (c).
Fig. 4. Comparison of the different strategies on 1,792 Broadwell cores (64 MPI pro-
cesses) for three analytics equivalent to or larger than the simulation timestep: sequen-
tial statistics performed 1,000 times (a), parallel statistics performed 1,000 times and
radial.
statistics_seq analytics: the total execution time is up to 74% higher with an
analytics arena of size 27. Performance measurements with VTune show that the
percentage of remote DRAM accesses is 18.5% with an analytics arena of size 7
and increases to 67.5% with an analytics arena of size 27, while it remains around
15% for TINS. In the thread approach, the sequential analytics is always
executed by the analytics master thread, guaranteeing data locality. In the task
approach, we can bind the analytics threads on a set of cores, but we cannot
guarantee that the task will be executed on a particular thread. The task
approach is also more intrusive in the simulation, because the simulation needs to
enqueue the task, while this is left to a separate thread in the TINS approach. The
TINS approach therefore shows better performance than a task approach and is
less intrusive in the simulation.
Fig. 5. Comparison of the different strategies on 14,336 Broadwell cores for an analytics
scheme where the executed analytics depends on the iteration number.
176 E. Dirand et al.
5 Conclusion
Many previous works investigated how to perform asynchronous in situ processing at the process level for MPI applications. The helper core strategy emerged as the best approach to share the resources. In this paper, we propose the TINS approach that goes one step further by proposing a dynamic helper core strategy with temporary thread isolation in a task-based programming model. The helper cores are assigned to analytics only when analytics tasks are available; otherwise they join the other threads for simulation processing. The TINS approach is a minimally intrusive method that makes it easy to switch between static and dynamic helper core strategies without code recompilation and that is easy to use by the end-user. It exploits both the sequential regions of the simulation and the parts of the simulation that are not well parallelized. The experiments conducted on up to 14,336 Broadwell cores with representative analytics codes show that the TINS framework, implemented with the Intel TBB library, can be up to 40% faster than the Damaris and GoldRush approaches on the ExaStamp molecular dynamics code, which shows good MPI and TBB efficiency. In particular, when the analytics workload varies from one iteration to another, no fixed number of static helper cores is capable of ensuring the best performance, while the dynamic helper core strategy proves more flexible. Experiments also show that the obtained performance is close to that of the raw simulation, demonstrating that our approach enables analytics to be performed at a high frequency. Future work will investigate the behavior of TINS on real analytics use cases. We also plan to study how to port TINS to other task-based runtimes, OpenMP in particular.
Acknowledgments. This work was partly funded by the French Programme d'Investissements d'Avenir (PIA) project SMICE. We thank Fang Zheng for providing the GoldRush code and Matthieu Dorier for his help with Damaris.
References
1. Bennett, J.C., Abbasi, H., Bremer, P.-T., Grout, R., Gyulassy, A., Jin, T., Klasky,
S., Kolla, H., Parashar, M., Pascucci, V., Pebay, P., Thompson, D., Yu, H., Zhang,
F., Chen, J.: Combining in-situ and in-transit processing to enable extreme-scale
scientific analysis. In: International Conference on High Performance Computing,
Networking, Storage and Analysis, pp. 49:1–49:9. IEEE Computer Society Press
(2012)
2. Dreher, M., Raffin, B.: A flexible framework for asynchronous in situ and in transit
analytics for scientific simulations. In: 14th IEEE/ACM International Symposium
on Cluster, Cloud and Grid Computing (CCGRID 2014) (2014)
3. Dorier, M., Antoniu, G., Cappello, F., Snir, M., Orf, L.: Damaris: how to effi-
ciently leverage multicore parallelism to achieve scalable, jitter-free I/O. In: IEEE
International Conference on Cluster Computing (2012)
4. Zheng, F., Zou, H., Eisenhauer, G., Schwan, K., Wolf, M., Dayal, J., Nguyen, T.A.,
Cao, J., Abbasi, H., Klasky, S., Podhorszki, N., Yu, H.: FlexIO: I/O middleware
for location-flexible scientific data analytics. In: IPDPS 2013 (2013)
5. Cieren, E., Colombet, L., Pitoiset, S., Namyst, R.: ExaStamp: a parallel framework for molecular dynamics on heterogeneous clusters. In: Lopes, L., et al. (eds.) Euro-Par 2014. LNCS, vol. 8806, pp. 121–132. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-14313-2_11
6. Lorendeau, B., Fournier, Y., Ribes, A.: In situ visualization in fluid mechanics
using Catalyst: a case study for Code Saturne. In: IEEE Symposium on Large
Data Analysis and Visualization (LDAV) (2013)
7. Fabian, N., Moreland, K., Thompson, D., Bauer, A., Marion, P., Geveci, B.,
Rasquin, M., Jansen, K.: The ParaView coprocessing library: a scalable, general
purpose in situ visualization library. In: Large Data Analysis and Visualization
Workshop (LDAV 2011), pp. 89–96 (2011)
8. Whitlock, B., Favre, J.M., Meredith, J.S.: Parallel in situ coupling of simulation
with a fully featured visualization system. In: 11th Eurographics Conference on
Parallel Graphics and Visualization, pp. 101–109 (2011)
9. Ayachit, U., Whitlock, B., Wolf, M., Loring, B., Geveci, B., Lonie, D., Bethel, E.:
The SENSEI generic in situ interface. In: 2nd Workshop on In Situ Infrastructures
for Enabling Extreme-Scale Analysis and Visualization (ISAV 2016), pp. 40–44
(2016)
10. Lofstead, J.F., Klasky, S., Schwan, K., Podhorszki, N., Jin, C.: Flexible IO and
integration for scientific codes through the adaptable IO system (ADIOS). In:
6th International Workshop on Challenges of Large Applications in Distributed
Environments, pp. 15–24 (2008)
11. Zheng, F., Yu, H., Hantas, C., Wolf, M., Eisenhauer, G., Schwan, K., Abbasi,
H., Klasky, S.: GoldRush: resource efficient in situ scientific data analytics using
fine-grained interference aware execution. In: International Conference on High
Performance Computing, Networking, Storage and Analysis (SC 2013), pp. 78:1–
78:12 (2013)
12. Harris, T., Maas, M., Marathe, V.J.: Callisto: co-scheduling parallel runtime sys-
tems. In: Proceedings of the Ninth European Conference on Computer Systems
(EuroSys 2014), pp. 24:1–24:14 (2014)
13. Li, M., Vazhkudai, S.S., Butt, A.R., Meng, F., Ma, X., Kim, Y., Engelmann, C.,
Shipman, G.: Functional partitioning to optimize end-to-end performance on many-
core architectures. In: International Conference for High Performance Computing,
Networking, Storage and Analysis, pp. 1–12 (2010)
14. Singh, A., Balaji, P., Feng, W.: GePSeA: a general-purpose software acceleration
framework for lightweight task offloading. In: International Conference on Parallel
Processing, pp. 261–268 (2009)
15. Ma, X., Lee, J., Winslett, M.: High-level buffering for hiding periodic output cost
in scientific simulations. IEEE Trans. Parallel Distrib. Syst. 17(3), 193–204 (2006)
16. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.: StarPU: a unified platform
for task scheduling on heterogeneous multicore architectures. Concurr. Comput.
Pract. Exper. 23, 187–198 (2011)
17. Hoque, R., Herault, T., Bosilca, G., Dongarra, J.: Dynamic task discovery in PaR-
SEC: a data-flow task-based runtime. In: Proceedings of the 8th Workshop on
Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2017, pp.
6:1–6:8. ACM, New York (2017). http://doi.acm.org/10.1145/3148226.3148233
18. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and
independence with logical regions. In: Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis (SC 2012)
(2012)
19. Kaiser, H., Heller, T., Adelstein-Lelbach, B., Serio, A., Fey, D.: HPX: a task based
programming model in a global address space. In: Proceedings of the 8th Inter-
national Conference on Partitioned Global Address Space Programming Models
(PGAS 2014) (2014)
20. Pébaÿ, P., Bennett, J.C., Hollman, D., Treichler, S., McCormick, P.S., Sweeney,
C.M., Kolla, H., Aiken, A.: Towards asynchronous many-task in situ data analy-
sis using legion. In: 2016 IEEE International Parallel and Distributed Processing
Symposium Workshops (IPDPSW), pp. 1033–1037, May 2016
21. Heirich, A., Slaughter, E., Papadakis, M., Lee, W., Biedert, T., Aiken, A.: In situ
visualization with task-based parallelism. In: Workshop on In Situ Infrastructures
on Enabling Extreme-Scale Analysis and Visualization (ISAV 2017) (2017)
22. Cho, Y., Oh, S., Egger, B.: Adaptive space-shared scheduling for shared-memory parallel programs. In: Desai, N., Cirne, W. (eds.) JSSPP 2015-2016. LNCS, vol. 10353, pp. 158–177. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61756-5_9
23. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work
stealing. J. ACM 46(5), 720–748 (1999)
24. Dorier, M., Sisneros, R., Peterka, T., Antoniu, G., Semeraro, D.: Damaris/Viz: a
nonintrusive, adaptable and user-friendly in situ visualization framework. In: IEEE
Symposium on Large Data Analysis and Visualization (LDAV) (2013)
Machine Learning Predictions
for Underestimation of Job Runtime
on HPC System
The original version of this chapter was revised: The affiliation of the second author
has been corrected. The erratum to this chapter is available at https://doi.org/10.1007/978-3-319-69953-0_17
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 179–198, 2018.
https://doi.org/10.1007/978-3-319-69953-0_11
180 J. Guo et al.
1 Introduction
Modern high-performance computing (HPC) systems are built with an increasing number of CPU/GPU cores, memory, and storage space. Meanwhile, scientific applications have been growing in complexity. However, not all users have enough experience to work effectively with supercomputing resources. Writing and executing programs on an HPC system requires more experience and technique than on a PC. First, HPC users need relevant knowledge of system-specific information, such as parallel programming across multiple cores or nodes in HPC environments, and how many compute nodes or cores are appropriate for a specific application job. Furthermore, when submitting a job to an HPC system, users are usually requested to estimate the runtime of said job for system scheduling. In general, an underestimated runtime will lead to the HPC system terminating the job before its completion. On the other hand, an overestimated runtime usually results in a longer queuing time. In both cases, the productivity of HPC users is hindered [1]. Especially in the case of underestimation, the system will terminate the running job as soon as its estimated runtime expires. Users will lose their processing data and can no longer obtain the final results they need. Therefore, most users have to resubmit their jobs and run them again from the beginning, which is costly for users and systems alike since it wastes time and system resources.
Predicting job outcomes, especially for jobs that may not finish before their allocated time expires, can mitigate this waste of time and system resources by enabling early actions for those jobs. For instance, if an ongoing task execution of a job is predicted to be runtime-underestimated based on its characteristic patterns, system administrators or an automated agent can explicitly send a notification to the user who submitted the job. The user can fix the problem by killing the job and resubmitting it after parameter adjustment.
In this study, we propose a data-driven approach for predicting job statuses on HPC systems. Here, "data-driven" means that our approach observes and analyzes job logs collected on TSUBAME, a large-scale supercomputer deployed at the Tokyo Institute of Technology. Supervised machine learning algorithms (i.e., XGBoost and Random Forest) are applied to address this binary classification problem (runtime-underestimated or not).
Our experimental results show that our approach predicts underestimated jobs with 80% precision, 70% recall, and a 74% F1-score on the entire dataset. We then split the entire job dataset into subsets categorized by scientific application name. The best precision, recall, and F1-score of the subsets on job runtime-underestimation prediction reach 90%, 95%, and 92%, respectively. This means that, for some scientific applications on HPC systems, our model can accurately predict whether a job will complete before its estimated runtime expires.
Our specific contributions are:
The rest of the paper is organized as follows: Related work on gathering and analyzing job logs in HPC systems is introduced in Sect. 2, followed by an overview of the dataset and the feature engineering used for preprocessing it in Sect. 3. The design and implementation of our machine learning-based prediction methods and the evaluation of our approach are described in Sect. 4. In Sect. 5, we present a detailed analysis based on the experimental results and discussion. Finally, we give our future work and conclude the paper in Sect. 6.
2 Related Work
Gathering and analyzing job logs in HPC systems is a widely studied topic in
computer science literature. In recent years, there have been many studies on
analyzing job logs focusing on anomaly detection, failure prediction, runtime
prediction, and so on.
Klinkenberg et al. [2] proposed and evaluated a method for predicting failures from framed cluster monitoring data and extracted features describing the characteristics of the signals. The authors of [3] presented a machine learning-based Random Forest (RF) classification model for predicting unsuccessful job executions. In modern supercomputing centers, successful or healthy jobs occupy a very large part of job databases. However, the authors used overall accuracy as the evaluation metric in those works, which cannot truly reflect performance on unsuccessful executions. Tuncer et al. [4] presented a method to detect anomalies and performance variations in HPC and cloud environments. However, they ran kernels representing common HPC workloads and infused synthetic anomalies to mimic anomalies observed in HPC systems, which may deviate from real anomaly situations.
There exists research focusing on predicting other job features, such as I/O,
CPU, GPU, memory usage and runtime in clusters. McKenna et al. [5] utilized
several machine learning methods (kNN, Decision Tree, and RF) for predict-
ing runtime and I/O usage for HPC jobs with training data from job scripts.
Rodrigues et al. [6] predicted job execution, wait time, and memory usage with
job logs and batch schedulers by an ensemble of machine learning algorithms
such as RF and kNN. Fan et al. [7] proposed an online runtime adjustment
framework for trade-off between prediction accuracy and underestimation rate
in job runtime estimates.
Additionally, others have worked on log file analysis with machine learning, anomaly detection, and so on. In this work, we use system log data collected by Ganglia to predict whether a job's runtime was underestimated. We found that our method gives particularly good results at predicting underestimated runtimes for some applications after splitting the entire job dataset into subsets categorized by scientific application name.
To the best of our knowledge, our work is the first to analyze job data and build models according to different HPC scientific applications using machine learning.
[Figure: data preprocessing and tuning pipeline over the entire dataset — Ganglia + PBS logs, dropping NaN values, feature selection, and RandomizedSearchCV hyperparameter tuning.]
Table 1. List of computing resource usage features based on normalized time series data and job request information
help when solving the problem [11]. Features describe the structures inherent in the data, they are very important to predictive models, and they influence the result. The quality and quantity of features have a direct impact on whether the model is good or not. Therefore, extracting enough useful features from the raw data is the first step in building good models for solving our problem.
Feature Selection. From the previous sections, we know that the raw data about compute resource usage was time series data of extreme size. Directly using raw time series data would produce unacceptable compute overhead, which may lead to serious time gaps between data collection and analysis as well as wasted computational resources. Instead of using raw time series data, we selected a set of relevant features from the raw job log data for use in model construction by normalizing them and converting them into a MySQL database. In machine learning tasks, this is an essential step to make results easier for researchers to interpret. Additionally, one can enjoy shorter training times, avoid the curse of dimensionality, and enhance generalization by reducing overfitting [12].
In this research, our purpose is to build a machine learning-based model that can predict whether a job's runtime is underestimated. Therefore, we selected features as the training set X by removing redundant or irrelevant features such as used cputime, used nodesec, used walltime, queue time, start time, and end time without incurring much loss of information. This is a preliminary study in which we try to reveal complex patterns hidden in the utilization of computing resources, user behaviors, and different applications on an HPC system. Those removed features are redundant and would have an outsized impact on the prediction of job runtime.
Additionally, we needed to create the target variable as the test set y, which is then compared with the results produced with the training set X. We label the test set by the following formula:

y = j.used walltime − j.req walltime

where j.used walltime is the actual runtime of a job and j.req walltime is the user-estimated runtime of the job. If y < 0, we label this job as 0 in the test set y, which means that the actual runtime of this job did not exceed the estimate its user gave at submission time. Conversely, if y >= 0, the job is labeled as 1 in the test set, which indicates runtime-underestimation. In this case, the job will be terminated by the HPC system before its completion. The purpose of our work is to predict whether a job is runtime-underestimated after job submission.
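The labeling rule above can be sketched in a few lines (a stdlib sketch; in the real pipeline the used and requested walltimes come from the job database, so the variable names here are illustrative):

```python
def label_job(used_walltime: float, req_walltime: float) -> int:
    """Label a job 1 (runtime-underestimated) when y = used - requested
    is >= 0, i.e. the job hit its requested walltime; otherwise 0."""
    return 1 if used_walltime - req_walltime >= 0 else 0

# Jobs given as (used, requested) walltime pairs in seconds.
jobs = [(3600, 7200), (7200, 7200), (100, 50)]
print([label_job(u, r) for u, r in jobs])  # [0, 1, 1]
```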
Table 2. Five instances with the 18 selected features as training set and test set
the column queue is a list including [G, H, L128F, S, S96, X], which represents the various queues in the TSUBAME 2.5 HPC system. In addition, the columns userhash and grouphash keep hash values from 1,100 users and 421 user groups. LabelEncoder can also be used to transform non-numerical variables (as long as they are hashable and comparable) into numerical variables. For example, label encoding can turn [G, S, G, H, S] into [1, 2, 1, 3, 2], but the imposed ordinality then implies that the average of G and H is S. In this work, we used LabelEncoder to transform the feature variables in the columns userhash, grouphash, and queue from categorical variables to numerical variables.
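The transformation just described can be sketched with a minimal stand-in for scikit-learn's LabelEncoder, which maps each category to its index in the sorted list of distinct classes (a sketch for illustration, not the library implementation):

```python
def label_encode(values):
    """Map each category to its index among the sorted distinct classes,
    mimicking what scikit-learn's LabelEncoder.fit_transform returns."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

print(label_encode(["G", "S", "G", "H", "S"]))  # [0, 2, 0, 1, 2]
```

The concrete integer codes differ from the 1-based example in the text; only the category-to-integer mapping matters, and the ordinality caveat discussed above applies either way.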
Second is feature standardization. Based on Table 1, we can see that the ranges of values across columns vary widely. For instance, in the column used mem, values range from single units to millions of units. Meanwhile, in the column used cpupercent, the values range from 0 to hundreds of thousands. In contrast with these two columns, the column is array is of bool type (0 or 1). Given this wide variation in training set values, the objective functions of some machine learning algorithms will not work properly without normalization. For example, most classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, all features should be scaled so that each feature contributes approximately proportionately to the final distance.
Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., SVM, logistic regression, and neural networks). The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next, we subtract the mean from each feature. Then we divide the values of each feature by its standard deviation (since the mean is already subtracted) [13].
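The per-feature z-score computation just described can be sketched as follows (stdlib only; scikit-learn's StandardScaler performs the equivalent column-wise):

```python
from statistics import mean, pstdev

def standardize(column):
    """Shift a feature column to zero mean and scale it to unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(x - mu) / sigma for x in column]

print(standardize([2.0, 4.0, 6.0]))  # ≈ [-1.2247, 0.0, 1.2247]
```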
data set. The first is to collect more minority-class data or to re-sample the imbalanced dataset by over-sampling (e.g., adding copies of instances from the minority class) or by under-sampling (e.g., deleting instances from the majority class). We cannot do either of these, because over-sampling would increase the size of the dataset, thereby greatly extending training time, and under-sampling may lose
In machine learning tasks with extremely imbalanced datasets, we use a set of alter-
native metrics such as false positive rate (FPR), true positive rate (TPR), receiver
operating characteristic (ROC), Area under the Curve of ROC (AUC), precision,
recall, and F1-score to evaluate the performance of our model on imbalanced data:
True Positives (TP): the cases in which the actual class of the target label was 1 (True) and the prediction is also 1 (True). In this research, a case where a job is actually runtime-underestimated (1) and the model classifies it as runtime-underestimated (1) falls under True Positives.
True Negatives (TN): the cases in which the actual class of the target label was 0 (False) and the prediction is also 0 (False). In this research, a case where a job is NOT runtime-underestimated and the model classifies it as NOT runtime-underestimated falls under True Negatives.
False Positives (FP): the cases in which the actual class of the target label was 0 (False) and the prediction is 1 (True). In this research, a case where a job is NOT runtime-underestimated and the model classifies it as runtime-underestimated falls under False Positives.
False Negatives (FN): the cases in which the actual class of the target label was 1 (True) and the prediction is 0 (False). In this research, a case where a job is runtime-underestimated and the model classifies it as NOT runtime-underestimated falls under False Negatives.
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (1)

Precision = TP / (TP + FP)   (2)

Recall = TP / (TP + FN)   (3)

F1-score = (2 × Precision × Recall) / (Precision + Recall)   (4)

FPR = FP / (FP + TN)   (5)

TPR = TP / (TP + FN)   (6)

SPC = TN / (TN + FP)   (7)
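Equations (1)–(7) can be computed directly from the four confusion-matrix counts; the counts in this sketch are made up for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Compute Eqs. (1)-(7) from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall equals TPR, Eqs. (3) and (6)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "spc": tn / (tn + fp),  # specificity (SPC) = 1 - FPR
    }

m = metrics(tp=70, tn=900, fp=20, fn=30)
print(round(m["precision"], 3), round(m["recall"], 3), round(m["f1"], 3))
# 0.778 0.7 0.737
```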
The ROC is a curve that represents the diagnostic ability of a binary classifier over all possible threshold values. The ROC is drawn with the FPR on the x axis and the TPR on the y axis; adjusting the threshold changes both the FPR and the TPR. In a binary classification problem, the prediction for each sample is usually made based on a continuous random variable X, a "score" computed for that sample. Given a threshold T, the sample is classified as "positive" if X > T, and "negative" otherwise.
The AUC indicates the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative') [15]. The AUC is a single metric which can be used for an overall performance summary of a classifier, calculated by the following formula:

AUC = ∫_{−∞}^{+∞} TPR(T) (−FPR′(T)) dT = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} I(T′ > T) f1(T′) f0(T) dT dT′ = P(X1 > X0)   (8)

where X1 is the score for a positive instance, X0 is the score for a negative instance, f1 is the probability density of the score when the sample actually belongs to the class "positive", and f0 when it belongs to the class "negative" [16].
Due to space limitations, we will not describe it in detail here. What we need to know about the AUC is as follows: the value of the AUC ranges between 0 and 1, and the higher the better. When the AUC is 1, the classifier is perfect, and there is at least one threshold that leads to a perfect prediction (no FP and no FN). However, there is no perfect classifier in most real-world cases. 0.5 < AUC < 1 means that the model performs better than a random guess. If the AUC is around 0.5, the performance of the model is essentially the same as a random guess.
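The rank interpretation AUC = P(X1 > X0) from Eq. (8) can be sketched directly: compare every positive score with every negative score, counting ties as one half (the scores below are made up):

```python
def auc(pos_scores, neg_scores):
    """Estimate AUC as P(X1 > X0) over all positive/negative score pairs."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9 ≈ 0.889
```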
The AUC was the first metric used to evaluate the overall performance of a classifier in the evaluation stage. After the best classifiers were chosen with the AUC, we used the ROC to trade off precision versus recall in the minority class, because the majority class always scores very high on all metrics in extremely imbalanced datasets. The F1-score was a useful metric as we desired the harmonic mean of precision and recall.
In all classifiers, a trade-off always occurs between the true negative rate (SPC, specificity) and the true positive rate (TPR), and likewise between precision and recall. In our study, we hope to train a classifier that gives high precision on the minority class (label 1, a job having runtime-underestimation), while maintaining reasonable precision and recall for the majority class. When modeling on an extremely imbalanced dataset, quite often the minority class is of the greatest significance. For our imbalanced binary classification problem, we take advantage of a combination of the above-mentioned evaluation metrics to diagnose our model.
Since we split the entire job dataset into subsets, there are some subsets in which the absolute number of minority-class samples is too small. Therefore, we use leave-one-out cross-validation (LOOCV) in our work [22]. The LOOCV method keeps a certain percentage of the full dataset as a test set; the rest of the data is used to perform k-fold cross-validation (k-fold CV). Next, it records the k scores and calculates their standard deviation (std) as a reference for choosing the best classifier among them; at the same time, this evaluates the robustness of the model. The final performance score of the model is obtained by using the best-chosen classifier to predict the test set.
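The evaluation scheme described above (hold out a test share, then run k-fold CV on the remainder) can be sketched with stdlib code; the 30% test share and k = 5 follow the paper, while the index data is made up:

```python
import random

def holdout_then_kfold(indices, test_share=0.3, k=5, seed=0):
    """Split indices into a held-out test set and k round-robin CV folds."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_test = int(len(idx) * test_share)
    test, rest = idx[:n_test], idx[n_test:]
    folds = [rest[i::k] for i in range(k)]  # round-robin k-fold split
    return test, folds

test, folds = holdout_then_kfold(range(100))
print(len(test), [len(f) for f in folds])  # 30 [14, 14, 14, 14, 14]
```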
Meanwhile, most machine learning algorithms have several hyperparameters that affect a model's performance. Tuning hyperparameters is an indispensable step to improve a model's performance; it often improves accuracy or other metrics, like precision and recall, by 3–5%, depending on the algorithm and dataset. In some cases, parameter tuning may improve accuracy by around 50% [21]. In this study, we train our model and tune hyperparameters via LOOCV with the RandomizedSearchCV function from scikit-learn [23]. RandomizedSearchCV is an estimator used to optimize hyperparameters over parameter settings. In contrast to GridSearchCV, not all parameter values are attempted; rather, a fixed number of parameter settings is sampled from the specified distributions. We set 30% of each dataset as the test set with a fixed random state, set n_iter to 50, and set the AUC as the scoring metric in RandomizedSearchCV. The parameter settings and optimized parameters are presented in Table 4.
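Conceptually, RandomizedSearchCV samples a fixed number of parameter settings and keeps the best-scoring one; a stdlib sketch of that loop follows (the parameter space and toy scoring function are made up, and the real pipeline scores each setting with cross-validated AUC):

```python
import random

def random_search(param_space, score_fn, n_iter=50, seed=0):
    """Sample n_iter settings from param_space and keep the best by score_fn."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)  # stands in for cross-validated AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"n_estimators": [100, 300, 500], "max_depth": [3, 6, 10]}

def toy_score(p):  # made-up scorer peaking at n_estimators=500, max_depth=6
    return p["n_estimators"] / 500 - abs(p["max_depth"] - 6) / 10

best, best_score = random_search(space, toy_score)
print(best, round(best_score, 2))
```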
We trained and tuned classifiers with XGBoost and RF on the entire dataset. We used the best classifiers chosen by 5-fold CV on the training set (70% of the entire dataset) to predict the test set (30% of the entire dataset). Tables 5 and 6 show that XGBoost and RF have extremely similar overall performance. The results consist of similar values of runtime-underestimation prediction
Table 4. Hyperparameter settings of Random Forest and XGBoost, and the best parameters after tuning for our study
(in terms of overall precision, recall, and F1-score) on the entire dataset. As we estimated, the precision, recall, and F1-score of the majority class are very high for both algorithms (0.98, and as high as 0.99). In contrast, all metrics on the minority class are lower than those on the majority class (e.g., F1-score: 0.74 vs. 0.99). However, the overall averages of precision, recall, and F1-score achieve very high scores for both algorithms (all around 0.97), due to combining subsets of different absolute and relative sizes into one imbalanced dataset. There is a slight difference in precision and recall between the two algorithms: XGBoost outperforms RF in precision by 0.02, while it trails RF's recall by 0.01. Thus, the precision, recall, and F1-score on the minority class are fairer metrics than those of the majority class when evaluating model performance.
In most HPC systems, a huge number of jobs are submitted by thousands of users who are potentially grouped into hundreds of user groups. In relevant research on job log analysis, researchers usually divide logs into subsets with different rules or purposes to seek hidden patterns in those logs [1–3].
In this research, our main purpose is predicting whether a job will finish before the runtime estimated by its user. The runtime is mainly affected by many factors, such as user behaviors and computing resource usage in the HPC environment (in this study, we consider neither human intervention from users or administrators nor random hardware failures). The entire job dataset was split into subsets categorized by scientific application name in order to mine potential patterns that may affect the runtime of HPC applications. According to Table 3, there are almost one million job logs based on 27 pre-installed HPC applications in TSUBAME 2.5 (excluding those in the unlabeled "others" class). We used XGBoost and RF to build prediction models with the optimized hyperparameters presented in Table 4 and ran them on each subset with 5-fold LOOCV. The performance evaluation results, including AUC, precision, and recall on the minority class, are plotted in Figs. 2 and 3.
Figure 2 shows the AUC and the standard deviation (std) of the AUC under 5-fold LOOCV for 26 subsets, after taking "others" as a subset and removing "RISM" and "CAE: MSC" from the training data because their subsets contain no instance of runtime-underestimation (labeled 1, the minority class). The AUC (XGBoost) was chosen as the indicator to sort the results in descending order for observation and analysis purposes. We can see that XGBoost outperforms or ties
[Figure: per-application chart of AUC (XGBoost), std of AUC (XGBoost), AUC (RF), std of AUC (RF), and the relative percentage of the minority class (%).]
Fig. 2. The AUC and its STD after running through subsets with 5-fold LOOCV
with RF slightly in most application subsets with the AUC as the indicator, except for the application named "CAE: LS-DYNA". The std of the AUC shows the model's stability; the smaller the std, the more stable the model's performance. The percentage of the minority class in each application was also plotted in Fig. 2. We can see that, for most cases in this study, the percentage of the minority class has almost no impact on the AUC and its std. However, we found that the higher the absolute number of minority-class samples, the more stable the model. We believe that the high std of the AUC in some subsets is due to the low absolute number of minority-class samples. The AUC shows the overall performance of the models. Both algorithms achieved very good AUC on 5 subsets: "CAE: Abaqus", "Vis: POV-RAY", "MD: Tinker", "MD: NAMD" and "MD: GROMACS". Except for "MD: NAMD" with RF, the AUC of the other 4 subsets is greater than or equal to 0.9, which means that both algorithms provide very good predictions of runtime-underestimation for those 5 applications in the HPC environment. In contrast to "CAE: Abaqus" and "Vis: POV-RAY", the results for "Bio: BLAST" with both algorithms are the worst among all subsets. Since in the "Bio: BLAST" subset the absolute number (15) and the relative percentage (0.45%) of the minority class are much lower than in other subsets, our models cannot handle this kind of problem. "CAE: FLUENT" has a similar result to "Bio: BLAST" because its absolute number of minority-class samples (33) is also very low, but its std of AUC is better than that of "Bio: BLAST" because its relative percentage of the minority class is higher.
In Fig. 3, we used the best-chosen classifier from 5-fold LOOCV to plot the precision
and recall on the minority class for all subsets, following the sorting in Fig. 2. Taking
stability, precision, recall and F1-score into consideration together, we think that
“Vis: POV-RAY” achieves the best result on the minority class by XGBoost (90% pre-
cision, 95% recall, 92% F1-score). This figure helps to find out which algorithm
194 J. Guo et al.
Fig. 3. Precision and recall on the minority class after running through XGBoost and
Random Forest (y-axis: score from 0 to 1; x-axis: application)
is good at which metric. For example, if we need the best recall on the subset “Vis:
POV-RAY”, XGBoost is the best choice for building the model.
If we want our model to provide the best precision for “CAE: LS-DYNA”, RF
should be chosen to build the model. In this research, from the user’s point of view,
precision is more important than recall, because a false positive (FP) matters more
than a false negative (FN) in job runtime-underestimation prediction: an FP can be
much more costly than an FN. On the contrary, from the angle of HPC system
administrators, who want to save system resources as much as possible, recall is
more critical than precision.
Figure 4 presents the ROC, the AUC, and the std of the AUC after 5-fold CV on
“CAE: Abaqus” and “Bio: BLAST” by XGBoost. Adjusting the threshold will
change the FPR. For instance, increasing the threshold will decrease FP (and
increase FN), which corresponds to moving in the left direction on the curve. The
more the curve inclines toward the upper left corner (0, 1), the better the model is
at distinguishing the positive and negative classes. Adjusting the threshold on the
ROC will be the last step in improving the performance of a model.

Fig. 4. ROC, AUC and standard deviation after 5-fold CV on subsets “CAE: Abaqus”
(left) and “Bio: BLAST” (right) by XGBoost
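As a concrete illustration of these definitions, the rank-based formulation of the AUC and the effect of raising the threshold on the FP count can be sketched in plain Python. The labels and scores below are illustrative toy values, not drawn from the paper's dataset:

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the probability
    that a randomly chosen positive scores higher than a randomly chosen
    negative, counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def false_positives(labels, scores, threshold):
    """Count negatives predicted positive at a given decision threshold."""
    return sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)

# Illustrative scores (1 = runtime-underestimated minority class)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

print(roc_auc(labels, scores))  # about 0.933 for this toy data
# Raising the threshold lowers FP (moves left on the ROC curve):
print(false_positives(labels, scores, 0.25))  # 2
print(false_positives(labels, scores, 0.75))  # 0
```

The two `false_positives` calls show the left-moving behavior described above: the stricter threshold eliminates both false positives at the cost of more false negatives.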
Feature importance gives a score (F score) indicating how valuable or useful each
feature was when building boosted-decision-tree-based models. With the features
sorted according to how many times they appear, the more a feature was used to
make key decisions within the decision trees, the higher its relative importance
to the model.

Fig. 5. Important features for different applications; features are automatically named
according to their index, from f0 to f16: f0: used_cpupercent, f1: used_mem, f2:
used_ncpus, f3: used_vmem, f4: req_mem, f5: req_ncpus, f6: req_walltime, f7:
req_gpus, f8: req_pl, f9: req_et, f10: nhosts, f11: is_array, f12: gpu_utilization,
f13: num_gpu_used, f14: group, f15: queue and f16: user
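The split-counting behind this F score can be sketched as follows. The nested-dict tree representation and the sample trees below are simplifications for illustration, not XGBoost's actual dump format:

```python
from collections import Counter

def f_scores(trees):
    """F score as used by boosted-tree feature importance: count how many
    times each feature is used in a split across all trees."""
    counts = Counter()

    def walk(node):
        if node is None or "feature" not in node:
            return  # leaf node: no split to count
        counts[node["feature"]] += 1
        walk(node.get("left"))
        walk(node.get("right"))

    for tree in trees:
        walk(tree)
    return counts

# Two tiny hypothetical trees splitting on used_mem and req_walltime
trees = [
    {"feature": "used_mem",
     "left": {"feature": "req_walltime", "left": None, "right": None},
     "right": None},
    {"feature": "used_mem", "left": None, "right": None},
]
print(f_scores(trees).most_common(1))  # [('used_mem', 2)]
```

A feature used in more splits, like `used_mem` here, receives a higher count and therefore a higher relative importance.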
In our study, we plotted the feature importance for the top-5 AUC subsets, with
the features ordered by their scores (how important they are), in Fig. 5. We can
see that used_mem, used_vmem, used_cpupercent, req_walltime and
gpu_utilization are the most important features for those applications. However,
applications place different weights (for the prediction of runtime-underestimation)
on different features (namely, computing resource usage), which both affect job
runtime. Our method recognized these patterns and used them to predict job
runtime-underestimation in HPC systems.
5.4 Discussion
Papers [2–5] demonstrate related research, such as job status prediction, failure
prediction and anomaly detection, based on log file analysis with machine learning,
with good results. Whether for anomaly detection or job status prediction, the
number of correct instances (majority class) is much larger than the number of
incorrect instances (minority class) in a dataset, which leads to an imbalanced
dataset just like the one presented here. However, in those works, the authors used
the overall accuracy, precision, recall, and F1-score to evaluate model performance
without considering these metrics on the minority class. As we explained in this
paper, because of the imbalanced absolute numbers and relative percentages of the
majority and minority classes (there can be more than one minority class in
multi-class problems), the overall metrics cannot accurately reflect the predictions
on the minority class. Predictions on minority classes are more important than
predictions on majority classes in classification problems with imbalanced datasets.
Therefore, we propose taking the precision, recall, and F1-score on the minority
classes, rather than overall, as metrics for future work.
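A toy example (with made-up numbers, not this paper's data) makes the point concrete: on a heavily imbalanced dataset, a degenerate model that always predicts the majority class earns an excellent overall accuracy while having zero recall on the minority class:

```python
# Toy imbalanced dataset: 990 majority instances (0) vs 10 minority (1)
labels = [0] * 990 + [1] * 10
preds = [0] * 1000  # degenerate model: always predicts the majority class

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)

tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
minority_recall = tp / (tp + fn)

print(accuracy)         # 0.99 -- looks excellent overall
print(minority_recall)  # 0.0  -- yet no underestimated job is ever caught
```

This is exactly why minority-class precision, recall, and F1-score are the more honest metrics for datasets like the one in this study.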
References
1. Zhang, H., You, H., Hadri, B., Fahey, M.: HPC usage behavior analysis and perfor-
mance estimation with machine learning techniques. In: Proceedings of the Interna-
tional Conference on Parallel and Distributed Processing Techniques and Applica-
tions (PDPTA), the Steering Committee of The World Congress in Computer Sci-
ence, Computer Engineering and Applied Computing (WorldComp), p. 1 (2012)
2. Klinkenberg, J., Terboven, C., Lankes, S., Müller, M.S.: Data mining-based analysis
of HPC center operations. In: 2017 IEEE International Conference on Cluster Com-
puting (CLUSTER), pp. 766–773. IEEE (2017)
3. Yoo, W., Sim, A., Wu, K.: Machine learning based job status prediction in scientific
clusters. In: SAI Computing Conference (SAI), pp. 44–53. IEEE (2016)
4. Tuncer, O., Ates, E., Zhang, Y., Turk, A., Brandt, J., Leung, V.J., Egele, M., Coskun,
A.K.: Diagnosing performance variations in HPC applications using machine learn-
ing. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds.) ISC 2017. LNCS,
vol. 10266, pp. 355–373. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-
58667-0 19
5. McKenna, R., Herbein, S., Moody, A., Gamblin, T., Taufer, M.: Machine learning
predictions of runtime and IO traffic on high-end clusters. In: 2016 IEEE Interna-
tional Conference on Cluster Computing (CLUSTER), pp. 255–258. IEEE (2016)
6. Rodrigues, E.R., Cunha, R.L., Netto, M.A., Spriggs, M.: Helping HPC users specify
job memory requirements via machine learning. In: Proceedings of the Third Inter-
national Workshop on HPC User Support Tools, pp. 6–13. IEEE Press (2016)
7. Fan, Y., Rich, P., Allcock, W.E., Papka, M.E., Lan, Z.: Trade-off between prediction
accuracy and underestimation rate in job runtime estimates. In: 2017 IEEE Interna-
tional Conference on Cluster Computing (CLUSTER), pp. 530–540. IEEE (2017)
8. Matsuoka, S.: The TSUBAME 2.5 evolution. TSUBAME e-Sci. J. 10, 2–8 (2013)
9. Feng, W., Cameron, K.: The Green500 list: encouraging sustainable supercomputing.
Computer 40(12), 50–55 (2007)
10. Massie, M.L., Chun, B.N., Culler, D.E.: The ganglia distributed monitoring system:
design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)
11. Brownlee, J.: Discover feature engineering, how to engineer features and how to get
good at it. Machine Learning Process (2014)
12. Bermingham, M.L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I.,
Campbell, H., Wright, A.F., Wilson, J.F., Agakov, F., Navarro, P., et al.: Application
of high-dimensional feature selection: evaluation for genomic prediction in man. Sci.
Rep. 5, 1–12 (2015)
13. Grus, J.: Data Science from Scratch: First Principles with Python. O’Reilly Media,
Inc., Sebastopol (2015)
14. Chen, C., Liaw, A., Breiman, L.: Using random forest to learn imbalanced data, vol.
110. University of California, Berkeley (2004)
15. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874
(2006)
16. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC, informed-
ness, markedness and correlation (2011)
17. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News
2(3), 18–22 (2002)
18. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 785–794. ACM (2016)
19. Song, R., Chen, S., Deng, B., Li, L.: eXtreme gradient boosting for identifying individ-
ual users across different digital devices. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu,
D. (eds.) WAIM 2016. LNCS, vol. 9658, pp. 43–54. Springer, Cham (2016). https://
doi.org/10.1007/978-3-319-39937-9 4
20. Nielsen, D.: Tree boosting with XGBoost-why does XGBoost win “every” machine
learning competition? Master’s thesis, NTNU (2016)
21. Olson, R.S., La Cava, W., Mustahsan, Z., Varik, A., Moore, J.H.: Data-driven
advice for applying machine learning to bioinformatics problems. arXiv preprint
arXiv:1708.05070 (2017)
22. Cawley, G.C., Talbot, N.L.: Efficient leave-one-out cross-validation of kernel fisher
discriminant classifiers. Pattern Recogn. 36(11), 2585–2592 (2003)
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blon-
del, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
napeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning
in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
A Power Management Framework
with Simple DSL for Automatic
Power-Performance Optimization
on Power-Constrained HPC Systems
1 Introduction
The need for high-performance computing (HPC) in modern society never
recedes, as HPC applications are increasingly involved in every aspect of our daily
life. To achieve exascale performance, many technical challenges remain to be
addressed, ranging from the underlying device technology to exploiting parallelism
in application codes. Numerous reports, including the Exascale
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 199–218, 2018.
https://doi.org/10.1007/978-3-319-69953-0_12
200 Y. Wada et al.
Study from DARPA [2] and Top Ten Exascale Research Challenges from DOE
[12] have identified power consumption as one of the major constraints in scaling
the performance of HPC systems. In order to bridge the gap between required
and current power/energy efficiency, one of the most important research issues
is developing a power management framework which allows power to be more
efficiently consumed and distributed.
Power management is a complicated process involving the collection and
analysis of statistics from both hardware and software, power allocation and
control by available power-knobs, code instrumentation/optimization, and so
on. For large-scale HPC systems, this is even more complex, since handling these
tasks at scale is not easy. So far, these tasks have mostly been carried out in a discrete,
hand-tuned way for specific hardware or software components. This causes
several problems and limitations.
First, the lack of cooperation/automation makes power management very dif-
ficult and time-consuming. It is desirable that the power management process can
be carried out under a common convention with little effort from users and
system administrators. Second, though many existing tools can potentially be
used for power optimization, each of them is usually designed for a different
purpose. For example, PDT and TAU are used for application analysis, RAPL
is dedicated to power monitoring and capping, while cpufreq mainly targets
performance/power tuning [10,18,23]. It is necessary to have a holistic way to
use them together. Third, different HPC systems have different capabilities for
power management, resulting in system- or hardware-specific ad-hoc solutions.
Of course, these are not portable. Therefore, it is crucial to provide a generalized
interface for portability and extensibility. Fourth, power management sometimes
needs a collaborative effort from both users and administrators. Many existing
solutions fail to draw a clear border between them.
To address these problems, some power-management-related tools and
APIs, such as GeoPM [5] and PowerAPI [8], have been developed. These
tools/APIs are user-centric and need hand-tuning effort for efficient power
management. Hence, we design and implement a versatile power management
framework targeting power-constrained large-scale HPC systems. We try to
provide a standard utility for people in different roles when managing/using such
systems. Through this framework, we provide an extensible hardware/software
interface between existing/future power-knobs and related tools/APIs so that
system administrators can help supercomputer users make full use of the valuable
power budget. Since the framework defines clear roles for the participating people
and software components, power management and control can be carried out
securely. The framework contains a very simple domain-specific language (DSL)
which serves as a front-end to many other utilities and tools. This enables users
to create, automate, port and re-use power management solutions.
This paper makes the following contributions:
– By showing three case studies, we demonstrate how this framework works and
how it is used to carry out certain power management tasks under different
power control strategies.
– We also prove the effectiveness of this framework through these three case
studies.
The rest of this paper is organized as follows. Section 2 covers the background
and related work, while Sect. 3 presents details of the power management frame-
work and its components. Section 4 is about the DSL. In Sect. 5, we demonstrate
this framework through a few case studies and discuss the results.
Finally, Sect. 6 concludes this paper.
2 Related Work
optimization process much harder and requires users to spend significant cost and
effort. To alleviate these difficulties, it is desirable to develop a simple and easy-
to-use interface to control and monitor power-knobs. This simple interface should
play an important role in the unified framework.
In this paper, we aim to provide a simple and easy way to realize power-
performance-managed/optimized applications by using a versatile power manage-
ment framework which has access to all kinds of information from the systems,
users, and applications. This framework applies power-performance optimization
to the application automatically, and can drastically reduce users’ burden.
The main objective of the framework proposed in this paper is to make the
power management and power-performance optimization processes easier and
more flexible for both users and system administrators. In this framework,
we assume a standard HPC system with its hierarchical structure as shown
in Fig. 1. The target system consists of multiple compute nodes, interconnec-
tion network, and the storage subsystem. Each node consists of multiple proces-
sors, DRAM modules, and accelerators like GPUs. Each processor has multiple
cores. We assume some of the hardware components have power-knobs to con-
trol power consumption of the executing programs, but their availability to the
users depends on the control permission or the operational policy specified by
the system administrator.
instrumentation and profiling. However, these are extensible, and other toolsets
are easily supported.
This framework requires only two sets of inputs: the DSL code and the user
application source code. Based on them, our framework offers a semi-automatic
way of power-performance optimization. Meanwhile, administrators and users
are freed from the effort of understanding the inside of the optimization
workflow. Once the DSL source code is prepared, the proposed framework
provides an easy way to realize optimized execution of user applications.
Note that our framework supports two types of users. One is the ordinary super-
computer “user” and the other is the “administrator”. An administrator is able to
specify machine configurations, enable/disable power-knobs and calibrate the
hardware, while a user is not allowed to do so. Switching between the two roles
is carried out with the DSL.
Fig. 3. An example of the automatic instrumentation for both profiling and power-knob
control
Source code written in our DSL is composed of basic elements called
“statements”. Each statement has a command, which is used to specify an action.
A Power Management Framework with Simple DSL 209
Listing 3. DSL Code Snippets to Submit a Job with a Specified Power Cap
CREATE JOB EP_G
ADD EP_G EXEC_PATH <absolute path to the executable>
ADD EP_G JOB_TYPE GENERAL
ADD EP_G MODULE_POWER 70
ADD EP_G CONTROL_MODE RAPL
SUBMIT EP_G
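The command-plus-arguments statement structure above can be sketched with a minimal hand-rolled interpreter. The real framework uses an ANTLR-based interpreter; the in-memory `instances` store and the dispatch below are illustrative assumptions, not the actual implementation:

```python
def parse_statement(line):
    """Split one DSL statement into its command and arguments.
    Each statement begins with a command (CREATE, ADD, SUBMIT, ...)."""
    command, *args = line.split()
    return command, args

# Hypothetical in-memory store of created instances and their attributes
instances = {}

def interpret(statement):
    command, args = parse_statement(statement)
    if command == "CREATE":       # CREATE <TYPE> <NAME>
        type_, name = args
        instances[name] = {"TYPE": type_}
    elif command == "ADD":        # ADD <NAME> <ATTRIBUTE> <VALUE...>
        name, attr, *value = args
        instances[name][attr] = " ".join(value)
    elif command == "SUBMIT":     # SUBMIT <NAME>: would emit a job script
        pass

for line in ["CREATE JOB EP_G",
             "ADD EP_G MODULE_POWER 70",
             "ADD EP_G CONTROL_MODE RAPL"]:
    interpret(line)

print(instances["EP_G"]["MODULE_POWER"])  # 70
```

Running the three statements builds up the attribute set that a `SUBMIT` would then translate into a job script with the 70 W module power cap applied through RAPL.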
The DSL interpreter is designed and implemented with ANTLR v4. During inter-
pretation, various DSL statements are translated into shell scripts and applica-
tion instrumentation for different purposes such as hardware calibration, appli-
cation profiling, job submission, specifying power control in the application,
interfaces to other tools and so on. This interpretation process is uniform for
different systems but different hardware configurations may lead to variations in
the results.
Along with any instance of a defined type created in this DSL, an XML file
is also created to store its attributes. For example, an instance of the
type “job” will have an accompanying XML file which stores attributes such
as its name, path, executable, power caps and so on.
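A minimal sketch of this attribute-to-XML mapping, using Python's standard `xml.etree` module. The element and attribute names here are assumptions for illustration, not the framework's actual schema:

```python
import xml.etree.ElementTree as ET

def job_to_xml(name, attributes):
    """Serialize a job instance's attributes into an XML string.
    Element/attribute names are assumed, not the framework's schema."""
    job = ET.Element("job", name=name)
    for attr, value in attributes.items():
        ET.SubElement(job, "attribute", key=attr).text = str(value)
    return ET.tostring(job, encoding="unicode")

xml_text = job_to_xml("EP_G", {"MODULE_POWER": 70, "CONTROL_MODE": "RAPL"})
print(xml_text)
# e.g. <job name="EP_G"><attribute key="MODULE_POWER">70</attribute>...
```

Persisting attributes this way lets the interpreter and the generated job scripts share one source of truth for each instance's configuration.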
In this section, we provide three case studies to demonstrate some of the func-
tionalities of our framework. All these case studies are first programmed in
the DSL and then interpreted on a gateway node of an HPC system whose
specifications are shown in Table 2.
A Power Management Framework with Simple DSL 211
Table 2. Specifications of the evaluation environment

Number of nodes: 16
Processor: Intel Xeon E5-2680
Number of sockets per node: 2
Number of cores per socket: 8
Memory size per node: 128 GB
Interconnect: InfiniBand FDR
OS: Red Hat Enterprise Linux with kernel 2.6.32
Compiler: FUJITSU Software Technical Computing Suite
MPI: Open MPI 1.6.3
Applications: EP and IS (Class D) from the NPB suite
In these case studies, we employed the RAPL interface [10] as the available power-
knob, and considered only CPU power to be controlled through RAPL, under
the assumption that DRAM power consumption correlates strongly with
CPU performance and power. We used two applications (EP and IS) from the
NPB benchmark suite [14] to carry out these case studies. To understand their
performance and power characteristics, profiling is necessary, and the results
are shown in Fig. 4 with a sampling interval of 100 ms. The profiling processes
are also specified with our DSL, as in Listing 4. How power capping should be
applied during the profiling process depends on the system and the power-performance
model to be used, and our framework can easily be extended to follow them.
The first case study shows how the maximum power demand of an application is
used as the power cap for that application. This case study requires our framework
to insert power-knob control API calls into the user application. Such API calls
are inserted into the application source file, and help both the profiling process
and the capping tasks.
Listing 5. DSL Source for Case Study 1 (with Peak Power Demand)
CREATE MODEL MAX
ADD MAX MODEL_PATH <absolute path to the model script to find out max power>

CREATE JOB EP_MAX
ADD EP_MAX EXEC_PATH <absolute path to the executable>
ADD EP_MAX JOB_TYPE OPTIMIZATION
ADD EP_MAX CONTROL_MODE RAPL
ADD EP_MAX PROFILE_NAME EP
ADD EP_MAX MODEL_TO_USE MAX
SUBMIT EP_MAX
Fig. 5. Power performance optimization results under case study 1 (with peak power
demand)
A Power Management Framework with Simple DSL 213
For this case study, we first profile the target application without any power cap
to get its general power profile, and then search for its peak power consumption
in the profile. After finding the peak power consumption, we launch a
production run with it as the power cap, so that the application does not suffer
any performance loss, under the guarantee that the power consumption will
not exceed the given power budget. The DSL source (for EP) for this case study is
shown in Listing 5.
Figure 5 presents the power profiles of the two applications under the peak power
demand. As expected, there is no performance loss observed in this case study.
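The role of the MAX model script can be sketched as a one-line reduction over the profile. The sample values below are hypothetical, not measured data, and the function name is an illustrative assumption:

```python
def peak_power_cap(samples_watts):
    """Return the peak power draw observed in a profile; Case Study 1
    uses this value directly as the power cap, so the production run
    sees no performance loss. (A sketch of the MAX model script's role;
    the real script's interface is not shown in the paper.)"""
    return max(samples_watts)

# Hypothetical CPU-power samples at a 100 ms interval (watts/socket)
profile = [55.2, 68.9, 70.1, 69.4, 62.3]
print(peak_power_cap(profile))  # 70.1
```

Capping at the observed peak guarantees the budget is never exceeded while leaving the application's power demand entirely unconstrained in practice.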
The second case study demonstrates how the average power demand of an
application is obtained through profiling and used as the power cap for the
application. Such power caps can help save power and may lead to more
energy-efficient runs. The saved power can be distributed by system software,
such as a job scheduler, to other jobs simultaneously running on the same system.
The DSL source (for EP) for this case study is shown in Listing 6.
In this case study, we first run the target application without any power
capping to get its general power profile, as in Case Study 1, and then obtain the
average power consumption through a simple calculation. We set the power caps
to this average value for a power-optimized run.
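Analogously to the peak case, the AVERAGE model reduces the profile to its mean; with uniform 100 ms sampling, the mean of the samples equals total energy divided by total time. The values and function name below are illustrative assumptions:

```python
def average_power_cap(samples_watts):
    """Average power over a profile sampled at a fixed interval (100 ms
    in these case studies), used as the power cap in Case Study 2.
    With uniform sampling, the sample mean equals total energy over
    total elapsed time."""
    return sum(samples_watts) / len(samples_watts)

# Hypothetical samples (watts/socket, 100 ms apart), not measured data
profile = [55.0, 69.0, 70.0, 70.0, 61.0]
print(average_power_cap(profile))  # 65.0
```

Because the cap now sits below the application's peak demand, some phases are throttled (the performance loss seen in Fig. 6), but the power headroom freed up can be reallocated by the scheduler.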
Figure 6 presents the power profiles of the two applications under the average
power demand. For each application, there is a performance loss compared to
Case Study 1, but the power consumption is much lower.
Listing 6. DSL Source for Case Study 2 (with Average Power Demand)
CREATE MODEL AVERAGE
ADD AVERAGE MODEL_PATH <absolute path to the model script to find out the average power>

CREATE JOB EP_AVE
ADD EP_AVE EXEC_PATH <absolute path to the executable>
ADD EP_AVE JOB_TYPE OPTIMIZATION
ADD EP_AVE CONTROL_MODE RAPL
ADD EP_AVE PROFILE_NAME EP
ADD EP_AVE MODEL_TO_USE AVERAGE
SUBMIT EP_AVE
The third case study shows how a linear performance/power model of
the application is constructed through profiling, and how we use this model to
derive the power cap according to the user's performance demand for the application.
214 Y. Wada et al.
Fig. 6. Power performance optimization results under case study 2 (with average power
demand)
In addition to the power profiles shown in Fig. 4, four extra rounds of profiling
are required for this case study. The first two extra rounds are launched with
the peak power demand to find the shortest runtime. Then, through the third
extra round of profiling, where we set the power caps to a very small value
(10 W/socket), we found that the minimum amount of power needed to run both
applications properly is around 30 W. We then set the power cap to 30 W per
socket and profile the fourth extra round to find the runtime of the applications.
Using these profiled data, we can construct a linear performance/power model
for each application, as shown in Fig. 7.
Using the models shown in Fig. 7, a performance demand can be set by the
users when they submit their jobs through the DSL code. For example, if the
user allows the runtime to be doubled, the corresponding power caps can be
found from these two models as 59 W and 34 W for EP and IS, respectively. The
DSL source (for EP) for this case study is shown in Listing 7.
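The model lookup can be sketched as simple linear interpolation between the two profiled points. The numbers below are illustrative, not the paper's measurements (which yielded 59 W for EP and 34 W for IS at a 2x slowdown):

```python
def power_cap_for_slowdown(t_peak, p_peak, t_min, p_min, slowdown):
    """Linear runtime-vs-power model through two profiled points:
    (p_peak, t_peak) at the peak-power cap and (p_min, t_min) at the
    minimum viable cap. Solve runtime(P) = slowdown * t_peak for P."""
    slope = (t_min - t_peak) / (p_min - p_peak)  # seconds per watt (< 0)
    target_runtime = slowdown * t_peak
    return p_peak + (target_runtime - t_peak) / slope

# Illustrative profile: 100 s at 70 W/socket, 250 s at the 30 W minimum
cap = power_cap_for_slowdown(t_peak=100, p_peak=70,
                             t_min=250, p_min=30, slowdown=2)
print(round(cap, 1))  # 43.3 (W/socket) under these assumed numbers
```

Since the linear model is only an approximation of the true power-performance curve, the derived cap is conservative in the sense seen in Fig. 8: the actual slowdown stays at or below the requested factor.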
Fig. 8. Power performance optimization results under case study 3 (within a slowdown
of 2)
Figure 8 presents the power profiles of the two applications under power caps
obtained from the models, chosen so as to keep the elapsed time shorter than twice
the runtime under the peak power demand. These two models are not perfectly
accurate, so both applications are slowed down by less than a factor of two
(1.20x and 1.53x, respectively). Regardless of the accuracy of such models,
the user's performance demand is satisfied while the allocated power is
dramatically cut.
6 Conclusions
We have demonstrated a versatile power management framework for power-
constrained HPC systems to tackle the problem of power limitation. With this
framework, HPC system administrators can easily specify and calibrate their
system hardware. Meanwhile, it is also helpful for tasks such as tuning user
applications to maximize performance or to cut power demand.
To verify the validity and usefulness of our framework, we tested it with sev-
eral case studies. In these case studies, we applied power management to two
selected applications and showed how a simple power model, with a linear relation-
ship between CPU performance and power consumption, can be constructed
and used to derive the power cap. These case studies proved that our
framework provides users an easy way to apply power optimization and
management to their applications.
In our future work, we plan to evaluate the proposed framework with other
power and performance optimization policies/algorithms, and to enrich its
functionality through cooperation with system software such as job schedulers
and other external tools.
References
1. ANTLR. http://www.antlr.org/
2. Bergman, K., et al.: Exascale computing study: Technology challenges in achieving
exascale systems (2008)
3. Cao, T., Huang, W., He, Y., Kondo, M.: Cooling-aware job scheduling and node
allocation for overprovisioned HPC systems. In: Proceedings of the 31st IEEE
International Parallel and Distributed Processing Symposium (IPDPS 2017), pp.
728–737, May 2017
4. Chasapis, D., Casas, M., Moretó, M., Schulz, M., Ayguadé, E., Labarta, J., Valero,
M.: Runtime-guided mitigation of manufacturing variability in power-constrained
multi-socket NUMA nodes. In: Proceedings of the 2016 International Conference
on Supercomputing (ICS 2016), pp. 5:1–5:12, June 2016
5. GeoPM. https://github.com/geopm
6. Gholkar, N., Mueller, F., Rountree, B.: Power tuning HPC jobs on power-
constrained systems. In: Proceedings of the 2016 International Conference on Par-
allel Architectures and Compilation (PACT 2016), pp. 179–191, September 2016
7. IEEE Std 802.3az-2010 (2010). https://standards.ieee.org/findstds/standard/802.
3az-2010.html
8. Laros III, J.H., DeBonis, D., Grant, R., Kelly, S.M., Levenhagen, M., Olivier,
S., Pedretti, K.: High performance computing - power application programming
interface specification version 1.3, May 2016
9. Inadomi, Y., Patki, T., Inoue, K., Aoyagi, M., Rountree, B., Schulz, M.,
Lowenthal, D., Wada, Y., Fukazawa, K., Ueda, M., Kondo, M., Miyoshi, I.: Analyz-
ing and mitigating the impact of manufacturing variability in power-constrained
supercomputing. In: Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis (SC15), pp. 78:1–78:12,
November 2015
10. Intel Corporation: Intel® 64 and IA-32 architectures software developer’s manual,
September 2016
11. Lange, K.D.: Identifying shades of green: the SPECpower benchmarks. Computer
42(3), 95–97 (2009)
12. Lucas, R., et al.: Top ten exascale research challenges (2014)
13. Miwa, S., Nakamura, H.: Profile-based power shifting in interconnection networks
with on/off links. In: Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis (SC15), pp. 37:1–37:11,
November 2015
14. NAS parallel benchmarks 3.3. http://www.nas.nasa.gov/
15. NVIDIA Corporation: NVML reference manual (2015)
16. Parr, T.: The Definitive ANTLR 4 Reference, 2nd edn. Pragmatic Bookshelf, Dallas
(2013)
17. Patki, T., Lowenthal, D.K., Sasidharan, A., Maiterth, M., Rountree, B.L., Schulz,
M., de Supinski, B.R.: Practical resource management in power-constrained, high
performance computing. In: Proceedings of the 24th International Symposium on
High-Performance Parallel and Distributed Computing, pp. 121–132, June 2015
18. PDT. https://www.cs.uoregon.edu/research/pdt/home.php
19. PomPP Library and Tools. https://github.com/pompp/pompp_tools
20. Sakamoto, R., Cao, T., Kondo, M., Inoue, K., Ueda, M., Patki, T., Ellsworth,
D.A., Rountree, B., Schulz, M.: Production hardware overprovisioning: real-world
performance optimization using an extensible power-aware resource management
framework. In: Proceedings of the 31st IEEE International Parallel and Distributed
Processing Symposium (IPDPS 2017), pp. 957–966, May 2017
21. Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High
Perform. Comput. Appl. 20(2), 287–331 (2006)
22. Spafford, K.L., Vetter, J.S.: Automated design space exploration with Aspen. Sci.
Program. 7:1–7:10 (2015)
23. TAU. https://www.cs.uoregon.edu/research/tau/home.php
24. Cao, T., Thang, C., He, Y., Kondo, M.: Demand-aware power management for
power-constrained HPC systems. In: Proceedings of 16th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (CCGrid 2016), pp. 21–31,
May 2016
25. Wada, Y., He, Y., Thang, C., Kondo, M.: The PomPP framework: from simple
DSL to sophisticated power management for HPC systems. In: HPC Asia 2018
Poster Session, January 2018
26. Wallace, S., Yang, X., Vishwanath, V., Allcock, W.E., Coghlan, S., Papka, M.E.,
Lan, Z.: A data driven scheduling approach for power management on HPC sys-
tems. In: Proceedings of the International Conference for High Performance Com-
puting, Networking, Storage and Analysis (SC16), pp. 56:1–56:11, November 2016
27. Weaver, V.M., Johnson, M., Kasichayanula, K., Ralph, J., Luszczek, P., Terpstra,
D., Moore, S.: Measuring energy and power with PAPI. In: Proceedings of the
41st International Conference on Parallel Processing Workshops (ICPPW-2012),
pp. 262–268, September 2012
Scalable Data Management of the Uintah
Simulation Framework
for Next-Generation Engineering
Problems with Radiation
1 Introduction
The exponential growth in high-performance computing (HPC) over the past
20 years has fueled a wave of scientific insights and discoveries, many of which
S. Kumar and A. Humphrey—Authors contributed equally.
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 219–240, 2018.
https://doi.org/10.1007/978-3-319-69953-0_13
220 S. Kumar et al.
would not be possible without the integration of HPC capabilities. This trend
is continuing, for example, the DOE Exascale Computing Project [18] lists 25
major application focus areas [20] in energy, science, and national security. The
primary challenge in moving codes to new architectures at exascale is that,
although present codes may have good scaling characteristics on some current
architectures, those codes may likely have components that are not suited to the
high level of parallelism on these new computer architectures, or to the com-
plexity of real-world applications at exascale. One of the major challenges faced
by modern scalable scientific codes is with regard to data management. As the
gap between computing power and available disk bandwidth continues to grow,
the cost of parallel I/O becomes an important concern, especially for simula-
tions at the largest scales. Large-scale simulation I/O can be roughly split into
two use cases: checkpoint restarts in which the entire state of a simulation must
be preserved exactly, and analysis dumps in which a subset of information is
saved. Both checkpointing and analysis dumps are important, yet due to poor
I/O scaling and little available disk bandwidth, the trend of large-scale simula-
tion runs is to save fewer and fewer results. This not only increases the cost of
faults, since checkpoints are saved less frequently, but ultimately may affect the
scientific integrity of the analysis, due to the reduced temporal sampling of the
simulation. This paper presents a simple-to-implement method to enable parallel I/O, which we demonstrate to scale efficiently up to 260K processes.
For most applications, the layout of data distributed across compute cores does not translate to an efficient network and storage access pattern for I/O. Consequently, performing naive I/O leads to significant underutilization of the system. For instance, the patch or block size of simulations is typically on the order of 12³ to 20³ voxels (cells), mainly because a scientist typically works under a restricted compute budget, and smaller patch sizes lead to faster execution of individual computational timesteps (see Fig. 1), which is critical in completing the entire simulation. Small patch sizes do not bode well for parallel I/O, with either file-per-process I/O or shared file I/O.
Fig. 1. Time taken for execution of a timestep for different patch sizes. Execution time starts to increase rapidly after a patch size of 12³.
We find a middle ground by introducing a restructuring-based parallel I/O tech-
nique. We virtually regrid the data by imposing a restructuring phase that alters
the distribution of data among processes in a way such that only a few processes
end up holding larger patches/blocks, which are then written to a file indepen-
dently. The efficacy and scalability of this approach is shown in Sect. 3.
In order to gain scientific insight from such large-scale simulations, the visu-
alization software used must also scale well to large core counts and datasets,
introducing additional challenges in performing scientific simulations at scale for
domain scientists. To address I/O challenges on the read side of the scientific
pipeline, we also use our scalable parallel I/O library in combination with the
ray tracing library OSPRay [24] to create a lightweight remote viewer and movie
rendering tool for visualization of such large-scale data (Sect. 4).
Finally, we introduce a new, efficient radiation solve method into Uintah
based on spatial transport sweeps [2,4]. The radiation calculation is central to the
commercial 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler
being simulated in this work, as radiation is the dominant mode of heat transfer
within the boiler itself. To improve parallelism within these spatial sweeps, the
computation is split into multiple stages, which then expose spatial dependencies
to the Uintah task scheduler. Using the provided information about the stage’s
dependencies, the scheduler can efficiently distribute the computation, increasing
utilization. For the target boiler problem discussed in this paper, we find this
method up to 10× faster than previous reverse Monte Carlo ray tracing methods
(Sect. 5) due to this increased utilization.
This work demonstrates the efficacy of our approach by adapting the Uin-
tah computational framework [8], a highly scalable asynchronous many-task
(AMT) [7] runtime system, to use our I/O system and spatial transport sweeps
within a large-eddy simulation (LES). This work is aimed at predicting the per-
formance of a commercial 1000 MWe USC coal boiler, and has been considered
as an ideal exascale candidate given that the spatial and temporal resolution
requirements on physical grounds give rise to problems between 50 to 1000 times
larger than those we can solve today.
The principal contributions of this paper are:
1. A restructuring-based parallel I/O scheme.
2. A data parallel visualization system using OSPRay.
3. A faster approach to radiation using a spatial transport sweeps method.
2 Background
2.1 Uintah Simulation Framework
Uintah [22] is a software framework consisting of a set of parallel software compo-
nents and libraries that facilitate the solution of partial differential equations on
structured adaptive mesh refinement (AMR) grids. Uintah currently contains
four main simulation components: (1) the multi-material ICE code for both
low- and high-speed compressible flows; (2) the multi-material, particle-based
code MPM for structural mechanics; (3) the combined fluid-structure interac-
tion (FSI) algorithm MPM-ICE; and (4) the Arches turbulent reacting CFD
component that was designed for simulating turbulent reacting flows with par-
ticipating media radiation. Separate from these components is an AMT runtime,
considered as a possible leading alternative to mitigate exascale challenges at the
runtime system-level, which shelters the application developer from the complex-
ities introduced by future architectures [7]. Uintah’s clear separation between the
application layer and the underlying runtime system both keeps the application
developer insulated from complexities of the underlying parallelism Uintah pro-
vides, and makes it possible to achieve great increases in scalability through
changes to the runtime system that executes the taskgraph, without requiring
changes to the applications themselves [17].
Uintah decomposes the computational domain into a structured grid of rect-
angular cuboid cells. The basic unit of a Uintah simulation’s Cartesian mesh
(composed of cells) is termed a patch, and simulation variables that reside in Uin-
tah’s patches are termed grid or particle variables. The Uintah runtime system
manages the complexity of inter-nodal data dependencies, node-level parallelism,
data movement between the CPU and GPU, and ultimately task scheduling and
execution that make up a computational algorithm [19], including I/O tasks. The
core idea is to use a directed acyclic graph (DAG) representation of the com-
putation to schedule work, as opposed to, say, a bulk synchronous approach in
which blocks of communication follow blocks of computation. This graph-based
approach allows tasks to execute in a manner that efficiently overlaps commu-
nication and computation, and includes out-of-order execution of tasks (with
respect to a topological sort) where possible. Using this task-based approach
also allows for improved load balancing, as only nodes need to be considered,
not individual cores [8].
for interactive visualization and movie rendering, allowing for better rendering
performance and enabling us to take better advantage of parallel I/O for fast
data loading.
The 1000 MWe USC coal boiler being modeled by Uintah in this work has
thermal radiation as a dominant heat transfer mode, and involves solving the
conservation of energy equation and radiative heat transfer equation (RTE)
simultaneously. This radiation calculation, in which the radiative-flux divergence
at each cell of the discretized domain is calculated, can take up to 50% of the over-
all CPU time per timestep [11] using the discrete ordinates method (DOM) [6],
one of the standard approaches to computing radiative heat transfer. Using a
reverse Monte Carlo ray tracing approach combined with a novel use of Uintah’s
adaptive mesh refinement infrastructure, this calculation has been made to scale
to 262K cores [11], and further adapted to run on up to 16K GPUs [12]. The
spatial transport sweeps method discussed in Sect. 5 shows great promise for
future large-scale simulations.
Fig. 2. Strong and weak scalability of the coal boiler simulation on Mira using the dis-
crete ordinates solver. In these initial studies, we found scaling issues with the I/O and
radiation solve components of Uintah that needed to be addressed for the production
runs. Note the radiation solve is not executed each timestep, and not included in the
total time for a timestep.
Fig. 3. CAD rendering of GE Power’s 1000 MWe USC two-cell pulverized coal boiler.
Fig. 4. Schematic diagram of restructuring-based parallel I/O. (A) The initial simu-
lation patch size is 4 × 4. (B) A new grid of patch size 8 × 8 is imposed. (C) The
restructuring phase is executed using MPI point-to-point communication. (D) Finally,
using the restructured grid, every patch is written to a separate file.
Starting from the original simulation grid (Fig. 4(A)), we begin restructuring
by imposing a new grid on the simulation domain (Fig. 4(B)). The patch size of
the imposed grid is larger than the initial patch size assigned by the simulation.
As shown in Fig. 1, the patch size assigned by the simulation is on the order of 12³ to 20³, while the patch size of the new restructured grid is typically twice that in each dimension. The simulation data is then restructured based on the new grid/patch configuration (Fig. 4(C)). During the restructuring, MPI point-to-point communication is used to move data between processes [13]. Note that
the communication is distributed in nature and confined to small subsets of
processes, which is crucial for the scalability of the restructuring phase. At the
end of the restructuring phase, we end up with large-sized patches on a subset
of processes (Fig. 4(D)). Given that the new restructured patch size is always
bigger than the patch size assigned by the simulation (or equal in size at worst),
we end up with fewer patches held on a subset of the simulation processes.
Throughout the restructuring phase the data remains in the application layout. Once the restructuring phase concludes, each process holding restructured patches creates a file and writes its data to it. This scheme of parallel I/O
finds a middle-ground between file-per-process-based parallel I/O and shared
file I/O, both in terms of the total number of files generated and the extent of
communication required during the data aggregation phase. With our approach,
the total number of files generated is given by the following formula:
number of files = (bounds_x / nrst_x) × (bounds_y / nrst_y) × (bounds_z / nrst_z)
Based on the restructuring box size (nrst_x × nrst_y × nrst_z), we can have a range of total number of outputted files. The number of files will be equal to the number of processes (i.e., file-per-process I/O) when the restructuring patch size is equal to the simulation patch size. The number of files will be one (i.e., shared-file I/O) when the restructuring patch size is equal to the entire simulation domain (bounds_x × bounds_y × bounds_z). For most practical scenarios, the latter situation is not feasible due to limitations on the available
memory on a single core. With file-per-process I/O, there is no communication
among processes, whereas with collective I/O associated with shared file I/O, the
communication is global in nature. With the restructuring approach, all commu-
nication is localized. The restructuring approach not only helps tune the total
number of outputted files, but also increases the file I/O burst size, which in
general is a requirement to obtain high I/O bandwidth. Our approach exhibits
good scaling characteristics, as shown in the following two sections.
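The file-count formula above can be sketched in a few lines of code (a minimal illustration, not Uintah's implementation; the function name and the use of a ceiling for non-divisible domains are assumptions):

```python
import math

def num_output_files(bounds, rst_box):
    """File count for restructuring-based I/O: one file per restructured
    patch, where the domain `bounds` is tiled by boxes of size `rst_box`.
    A ceiling handles domains that do not divide evenly."""
    return math.prod(math.ceil(b / r) for b, r in zip(bounds, rst_box))
```

When `rst_box` equals the simulation patch size this reduces to file-per-process I/O; when it equals the entire domain it reduces to a single shared file.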
Fig. 6. (Left) Performance of restructuring-based parallel I/O with varying box sizes.
(Right) Time breakdown between restructuring (communication) and file I/O for dif-
ferent box sizes. (Color figure online)
per node). This form of I/O is an extension to the file-per-process style of I/O
commonly adopted by many simulations. An XML-based meta-data file is also
associated with every data file that stores type, extents, bounds, and other rele-
vant information for each of the different fields. For relatively small core counts
this I/O approach works well. However, I/O performance degrades significantly
for simulations with several hundreds of thousands of patches/processors. The
cost of both reads and writes for large numbers of small files becomes untenable.
We extended the Uintah simulation framework to use the restructuring-based
I/O scheme, and evaluated the weak scaling performance of the I/O system when
writing data for a representative Uintah simulation on Mira. In each run, Uin-
tah wrote out 20 timesteps consisting of 72 fields (grid variables). The patch
size for the simulation was 123 . The number of cores was varied from 7,920 to
262,890. Looking at the performance results in Fig. 7, our I/O system scales well
for all core counts and performs better than the original Uintah UDA I/O sys-
tem. The restructuring-based I/O system demonstrates almost linear scaling up to
262,890 cores whereas the performance of file-per-node I/O starts to decline after
16,200 cores. At 262,890 cores, our I/O system achieves an approximate speed-up
of 10× over Uintah’s default file-per-node I/O.
The restructuring-based I/O system was then used in production boiler sim-
ulations, carried out at 260K cores on Mira. Due to the improved performance
of the I/O system, scientists were able to save data at a much higher temporal
frequency. In terms of outputs, close to 200 terabytes of data was written which,
using our new restructuring I/O strategy, required only 2% of the entire simulation time. If the simulation were run using Uintah's default file-per-node output format, nearly 50% of the time of the computation would be spent
on I/O, reducing the number of timesteps that could be saved, or increasing the
total computation time significantly.
When visualizing the data produced on Mira using the Cooley visualization cluster at ANL, VisIt rendered at interactive framerates for smaller datasets; however, for the recent large simulations, using all the cores on each node consumed too much memory, resulting in crashes or significantly reduced performance due to swapping.
for quick, interactive visualization and high-quality offline movie rendering, we
wrote a lightweight renderer using OSPRay [24] which uses the restructuring-
based parallel I/O to read the data. OSPRay is a CPU-based open-source ray
tracing library for scientific visualization, and includes high-quality and high-
performance volume rendering, along with support for rendering distributed data
with MPI.
OSPRay includes support for two modes of MPI-parallel rendering: an offload
mode, where data is replicated across nodes, and subregions of the image are
distributed; and a distributed mode, where different ranks make OSPRay API
calls independently to setup their local data, and then work collectively to render
Fig. 8. Frames from the movie showing the O2 field over time. Using restructuring-
based parallel I/O backend and OSPRay we were able to render an animation of the
full 1030 timesteps in two hours using 128 KNL nodes on Theta.
the distributed data using sort-last compositing [24]. To leverage the benefits of
the restructuring-based parallel I/O in the viewer, we implemented our renderer
using the distributed mode of OSPRay, with each rank responsible for loading
and rendering a subregion of the dataset. To properly composite the distributed
data OSPRay requires the application to specify a set of regions on each rank,
which bound the data owned by that rank. In our case this is trivially the bounds
of the single subregion owned by the rank. The renderer supports two usage
modes, allowing for interactive remote visualization and offline movie rendering
for creating production animations of the evolution of the boiler state.
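Because each rank owns exactly one brick of the domain, the region bounds it must report for compositing can be derived directly from its rank. The sketch below assumes an x-fastest rank layout and an evenly divisible domain; it is illustrative only and is not OSPRay's API:

```python
def rank_subregion(rank, grid, domain):
    """Axis-aligned bounds (lo, hi) of the brick owned by `rank` when the
    domain is split into grid = (gx, gy, gz) equal bricks.  Assumes an
    x-fastest rank layout and an evenly divisible domain (illustrative)."""
    gx, gy, gz = grid
    ix, iy, iz = rank % gx, (rank // gx) % gy, rank // (gx * gy)
    bx, by, bz = (d / g for d, g in zip(domain, grid))
    lo = (ix * bx, iy * by, iz * bz)
    hi = (lo[0] + bx, lo[1] + by, lo[2] + bz)
    return lo, hi
```

Each rank would pass its `(lo, hi)` pair to the renderer as the single region bounding its local data.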
The interactive viewer runs a set of render worker processes on the compute nodes, with one per node, as OSPRay uses threads for on-node parallelism. The user then connects over the network with a remote client and receives back rendered JPG images, while sending back over the network camera and transfer function changes to interact with the dataset. To decouple the interface from network latency effects and the rendering framerate, we send and receive to the render workers asynchronously, and always take the latest frame and send the latest application state. With this application, users can explore the different timesteps of the simulation and different fields of data interactively on their laptop, with the rendering itself performed on Theta or Cooley. When rendering on 16 nodes of Theta with a 1080 × 1920 framebuffer (oriented to match the vertical layout of the boiler), the viewer was able to render at 11 FPS, allowing for interactive exploration.
Fig. 9. Strong scaling of movie rendering on Theta.
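The "always take the latest frame" decoupling described above can be sketched as a single-slot mailbox shared between the network thread and the display loop (a hypothetical helper for illustration, not code from the viewer):

```python
import threading

class LatestOnly:
    """Single-slot mailbox: writers overwrite the slot, and the reader
    always gets the most recent value (or None if nothing new arrived).
    This decouples the UI update rate from the render/network rate."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def put(self, value):
        # Called by the receive thread: drop any stale, unread value.
        with self._lock:
            self._value = value

    def take(self):
        # Called by the display loop: consume the newest value, if any.
        with self._lock:
            value, self._value = self._value, None
            return value
```

The same pattern applies in the other direction for camera and transfer function changes sent to the render workers.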
The offline movie renderer is run as a batch application and will render the
data using a preset camera path. The movies produced allow for viewing the
evolution of the boiler state smoothly over time, as the timesteps can be played
through at a constant rate, instead of waiting for new timesteps or fields to load.
A subset of frames from this animation is shown in Fig. 8, which was rendered
using 128 nodes on Theta. The majority of the time spent in the movie rendering
is in loading the data, which scales well with the presented I/O scheme (Fig. 9).
The animation is rendered at 1080 × 1920 with a high number of samples per
pixel to improve image quality.
While our lightweight viewer is valuable for visual exploration, it is missing
the large range of additional analysis tools provided by production visualiza-
tion and analysis packages like VisIt. To this end, we are working on integrating
OSPRay into VisIt as a rendering backend, enabling scalable interactive visual-
ization for end users.
where S is the local source term for radiative intensity, and IΩ is computed using
the RTE for grey non-scattering media requiring a global solve via:
dIΩ/ds = k (S − IΩ)    (2)
Here, s is the 1-D spatial coordinate oriented in the direction in which inten-
sity IΩ is being followed, and k is the absorption or attenuation coefficient. The
lack of time in the RTE implies instantaneous transport of the intensity, appro-
priate for most applications. The methods for solving the RTE discussed here
aim to solve for IΩ using Eq. 2, which can then be integrated to compute the
radiative flux and divergence.
depending on the order of the first derivative. Instabilities arise when using the
higher order method, so often the 7-point stencil is avoided, or a combination of
the two stencils is used. The 4-point stencil results in numerical diffusion that
impacts the fidelity of the solve, but for low ordinate counts can improve solution
accuracy. As shown by Fig. 2, this method has been demonstrated to scale, but
it is computationally expensive, due to the numerous global sparse linear solves.
In the case shown, as many as 30–40 backsolves were required per radiation
step, with up to an order of magnitude more solves required in other cases. It
should be noted that, due to their computational cost, the radiation solves are
computed roughly once every 10 timesteps, as the radiation solution does not
change quickly enough to warrant a more frequent radiation calculation.
are defined below. Although this staging process is serialized by its reliance on corner-to-corner dependencies, it can show good performance when sweeping a large quantity of independent solves. For a non-scattering medium, the
angular and spectral intensities are all independent of each other, allowing for
parallelization of the solve.
On large, distributed memory systems, the intensities are stored on multiple
compute nodes, making communication between them expensive and inefficient.
To address this problem, a processor (or node) must operate only on intensities whose spatial dependencies have been satisfied. The method shown here is based
on the algorithm for a simple rectangular domain; however, it further supports
identification of these dependencies for complex domains with non-rectangular
shapes. To most easily convey the methodology used, we start by describing the
algorithm on a rectangular domain.
Consider a domain with 3 × 3 sub-units. Within Uintah, these sub-units
are referred to as patches. A diagram showing how these patches are divided is
shown in Fig. 10. The number labeling each patch (Fig. 10a) designates the phase
in which a sweep is relevant for a single intensity, from the x+, y+, z+ octant,
with a single wave number. Note that these phases are defined as:
P = xi + yi + zi (3)
where xi , yi , and zi are the patch indices in the x, y, and z directions. The patch
indices are defined as the number of patches away from the origin patch. Hence,
the total number of phases required to complete a single complete full-domain
sweep is:
Pmax = xmax + ymax + zmax (4)
where xmax, ymax, and zmax are the maximum patch indices. The numbers in Fig. 10a designate the phase indices of the patches within the domain. We determine the patch indices within the sub-domain using the patch ID provided by Uintah.
Uintah numbers its patches in the order of z, y, x (Fig. 10b). From this we can determine the point in space in which the sweep is currently located using modulo operators, the patch dimensions, and the patch ID. The patch ID is then converted to the patch indices xi, yi, zi for each patch. Using the Uintah task scheduler, we can indicate to a task what this phase is. This process is more complicated when conducting sweeps with multiple intensity directions. First, consider additional intensities that are in the x+, y+, z+ directions. To keep as many processors busy as possible in the computation, we create stages.
Fig. 10. A rectangular domain divided into 27 sub-domains, labeled by the designated phase (a) and Uintah patch ID (b).
A stage S is defined as S = I + P , where I is the intensity index relevant
to a single octant. We know the maximum number of stages via the equation
Smax = Imax + Pmax . Now we have an algorithm that describes the sweeping
in a single direction, for intensities of the same octant; next, we will discuss how to extend this to all octants. The phase equation for the x−, y−, z− octant results in:
P = xmax − xi + ymax − yi + zmax − zi (5)
Hence, a total of eight phase equations are possible, depending on the com-
bination of directions. We discuss two equations in detail in this paper. The task
designates the stage and intensity, and then computes a function mapping its
patch ID to its spatial patch index using a series of modulos. If the patch and
intensity are relevant to the local processor, then it executes, otherwise it exits
the task.
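The patch-ID decoding and phase/stage bookkeeping described above can be sketched as follows (a minimal illustration using 0-based indices, so xmax = nx − 1; the function names are hypothetical and not Uintah's task API):

```python
def patch_indices(patch_id, nx, ny, nz):
    """Decode a linear patch ID into 0-based indices (xi, yi, zi),
    following the z-fastest, then y, then x ordering described in the text."""
    zi = patch_id % nz
    yi = (patch_id // nz) % ny
    xi = patch_id // (nz * ny)
    return xi, yi, zi

def phase(xi, yi, zi, nx, ny, nz, octant=(1, 1, 1)):
    """Phase of a patch for a sweep travelling in `octant`: Eq. 3 for the
    x+, y+, z+ octant, Eq. 5 for x-, y-, z- (with xmax = nx - 1, etc.)."""
    px = xi if octant[0] > 0 else (nx - 1) - xi
    py = yi if octant[1] > 0 else (ny - 1) - yi
    pz = zi if octant[2] > 0 else (nz - 1) - zi
    return px + py + pz

def stage(intensity_index, patch_phase):
    """Stage S = I + P: lets independent intensities pipeline behind one
    another so that more processors stay busy during the sweep."""
    return intensity_index + patch_phase
```

A task can compare its own (stage, intensity, patch) triple against the current stage and exit immediately when its dependencies are not yet satisfied, mirroring the scheduling described in the text.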
solve can also be used as an initial guess, thus accelerating convergence and
making it possible for DOM to use as few as 30–40 iterations as compared to
the much larger number of backsolves shown in Table 1. For a static problem, in contrast, no initial guess is available, and significantly more iterations are used with DOM than would be the case in a full boiler simulation: as shown in Table 1, as many as 1500 iterations are used. However, we note that for this
problem each DOM iteration takes about a second. Hence, the best that DOM
could achieve would be about 40 s, even if a good initial guess is available. In
this way sweeps outperforms both the actual observed cost and the optimistic
estimated cost of DOM with its linear solve using 40 iterations by a factor of
between 4 and 10. However, the sweeping algorithm has not currently scaled
beyond 128K cores due to its large memory footprint and, additionally, it can
be slower than the linear solver for systems with very high attenuation. This
is because the sparse linear solvers are iterative, but they converge quickly for
systems with large attenuation, as the impact of radiation can be isolated to a
subset of the domain for these systems.
For systems with scattering, DOM typically lags the scattering term and
then resolves until the intensities converge to within a certain tolerance. The
convergence can be costly for systems where the scattering coefficients are sig-
nificantly larger than the absorption coefficients. Given these very encouraging
results, applying sweeps to the full problem and improving its memory use are
clearly the next steps.
pressure solve due to work being done in Uintah/Arches. The most significant
performance improvement was the switch of the I/O library, with the presented
restructuring-based I/O, which resulted in 33 s write times, compared to the
5.5 min required on the Original Inlets case which used the legacy Uintah I/O
system. Ultimately, the Modified Inlets case wrote 1030 datasets allowing for the
creation of 3D rendered movies of the simulation.
Though validation of the simulation data against experimental data was per-
formed, the proprietary nature of both the simulation and experimental data
makes publication of these comparisons problematic. However, working closely
with the GE Power engineers made it possible to validate the results of these
simulations against their previous results. Figure 11 depicts the heat absorption
profile (x-axis) as a function of the elevation in the boiler (y-axis), and shows the
average absorption profile predicted in the unmodified inlet configuration (Original Inlets) is different from the tentative estimates due to the higher fidelity
modeling performed with Arches, but it is in relatively good agreement with the
actual absorption profile based on discussions with GE Power engineers and the
existing proprietary data provided. The second case was run with changes to
the inlet geometry parameters to optimize gas-side energy imbalance (GSEI) by
changing the flow pattern in the wind-box as well as the SOFA inlets.
The key result from this work is the confidence that has been established with
GE Power to demonstrate that high resolution LES simulations are a useful tool
for exploring a range of operating conditions, with the potential to be used for
future designs. This is the first time that computational design at this scale has
Fig. 11. Heat absorption profile as a function of the elevation. The solid green line
shows GE Power’s wall-averaged absorption profile tentative estimates for the expected
operating conditions in the unit. The blue dots show the average absorption profile
computed from the unmodified inlet case. (Color figure online)
been used for such a complex combustion problem with petascale simulations.
Future studies of the unit will investigate design and operation adjustments to
achieve incremental improvements in gas-side energy imbalance. GE will consider
testing the new conditions in the existing unit when significant improvements
are discovered.
7 Conclusions
This work has introduced an excellent exascale candidate problem through the
successful simulation of a commercial, 1000 megawatt electric (MWe) ultra-
supercritical (USC) boiler, the largest currently in production worldwide, using
Large-Eddy Simulation (LES) predictions, requiring 280 million CPU hours on Mira. The overall objective of this work was to understand how we can solve such a problem through the use of an AMT runtime to efficiently schedule and
execute computational tasks, including I/O, and to leverage scalability improve-
ments in the runtime itself, linear solvers, I/O, and computational algorithms.
To achieve the results shown in this work for production-level petascale computations, significant code and algorithmic innovations were required, including novel adaptations of the I/O system that achieved nearly an order of magnitude improvement in I/O performance.
Through this work, we have exposed areas even within an advanced, scalable
runtime system that need careful design consideration for post-petascale and
eventually exascale platforms, particularly when globally coupled problems such
as radiation are considered. For example, while existing radiation methods used
in Uintah scale, it is clear from the results presented that the use of the sweeps
method for problems of this scale and size needs to be investigated further,
to see if it is possible to reduce the overall simulation time significantly. A key
lesson this work conveys is that the success of large, production-scale simulations
depends upon scalability at every level of the code. If any single component
within the simulation pipeline does not scale, the problem cannot be solved. It
is through the integration of these scalable components and subsystems that
the next generation of problems may be solved on exascale systems. Finally, our results have demonstrated the potential role that LES simulations can play in the analysis and design of an operational commercial boiler, that simulations can be used as a design tool for future systems, and that choosing fast, scalable, and hardware-appropriate algorithms for key areas such as radiation is important in achieving scalable results.
References
1. Mira home page. https://www.alcf.anl.gov/mira
2. Adams, M.P., Adams, M.L., Hawkins, W.D., Smith, T., Rauchwerger, L., Amato,
N.M., Bailey, T.S., Falgout, R.D.: Provably optimal parallel transport sweeps on
regular grids. Technical report, Lawrence Livermore National Laboratory (LLNL),
Livermore, CA (2013)
3. Ayachit, U.: The ParaView Guide: A Parallel Visualization Application. Kitware
Inc., New York (2015)
4. Bailey, T., Hawkins, W.D., Adams, M.L., Brown, P.N., Kunen, A.J., Adams, M.P.,
Smith, T., Amato, N., Rauchwerger, L.: Validation of full-domain massively par-
allel transport sweep algorithms. Technical report, Lawrence Livermore National
Laboratory (LLNL), Livermore, CA (2014)
5. Balaji, P., Chan, A., Thakur, R., Gropp, W., Lusk, E.: Toward message passing for
a million processes: characterizing MPI on a massive scale Blue Gene/P. Comput.
Sci. Res. Dev. 24(1), 11–19 (2009)
6. Balsara, D.: Fast and accurate discrete ordinates methods for multidimensional
radiative transfer. Part I, basic methods. J. Quant. Spectrosc. Radiat. Transf.
69(6), 671–707 (2001)
7. Bennett, J., Clay, R., Baker, G., Gamell, M., Hollman, D., Knight, S., Kolla, H.,
Sjaardema, G., Slattengren, N., Teranishi, K., Wilke, J., Bettencourt, M., Bova, S.,
Franko, K., Lin, P., Grant, R., Hammond, S., Olivier, S., Kale, L., Jain, N., Mikida,
E., Aiken, A., Bauer, M., Lee, W., Slaughter, E., Treichler, S., Berzins, M., Har-
man, T., Humphrey, A., Schmidt, J., Sunderland, D., McCormick, P., Gutierrez, S.,
Schulz, M., Bhatele, A., Boehme, D., Bremer, P., Gamblin, T.: ASC ATDM level
2 milestone #5325: asynchronous many-task runtime system analysis and assess-
ment for next generation platforms. Technical report, Sandia National Laboratories
(2015)
Scalable Data Management of the Uintah Simulation Framework 239
8. Berzins, M., Beckvermit, J., Harman, T., Bezdjian, A., Humphrey, A., Meng, Q.,
Schmidt, J., Wight, C.: Extending the Uintah framework through the petascale
modeling of detonation in arrays of high explosive devices. SIAM J. Sci. Comput.
38(5), 101–122 (2016)
9. Childs, H., Brugger, E., Whitlock, B., Meredith, J., Ahern, S., Pugmire, D., Biagas,
K., Miller, M., Harrison, C., Weber, G.H., Krishnan, H., Fogal, T., Sanderson, A.,
Garth, C., Bethel, E.W., Camp, D., Rübel, O., Durant, M., Favre, J.M., Navrátil,
P.: VisIt: an end-user tool for visualizing and analyzing very large data. In: High
Performance Visualization-Enabling Extreme-Scale Scientific Insight, pp. 357–372
(2012)
10. HDF5 home page. http://www.hdfgroup.org/HDF5/
11. Humphrey, A., Harman, T., Berzins, M., Smith, P.: A scalable algorithm for radia-
tive heat transfer using reverse Monte Carlo ray tracing. In: Kunkel, J.M., Ludwig,
T. (eds.) ISC High Performance 2015. LNCS, vol. 9137, pp. 212–230. Springer,
Cham (2015). https://doi.org/10.1007/978-3-319-20119-1_16
12. Humphrey, A., Sunderland, D., Harman, T., Berzins, M.: Radiative heat transfer
calculation on 16384 GPUs using a reverse Monte Carlo ray tracing approach with
adaptive mesh refinement. In: 2016 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW), pp. 1222–1231, May 2016
13. Kumar, S., Vishwanath, V., Carns, P., Levine, J., Latham, R., Scorzelli, G., Kolla,
H., Grout, R., Ross, R., Papka, M., Chen, J., Pascucci, V.: Efficient data restruc-
turing and aggregation for I/O acceleration in PIDX. In: 2012 International Con-
ference for High Performance Computing, Networking, Storage and Analysis (SC),
pp. 1–11, November 2012
14. Li, J., Liao, W.-K., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham,
R., Siegel, A., Gallagher, B., Zingale, M.: Parallel netCDF: a high-performance
scientific I/O interface. In: Proceedings of SC 2003: High Performance Networking
and Computing, Phoenix, AZ. IEEE Computer Society Press, November 2003
15. Lofstead, J., Klasky, S., Schwan, K., Podhorszki, N., Jin, C.: Flexible IO and
integration for scientific codes through the adaptable IO system (ADIOS). In:
Proceedings of the 6th International Workshop on Challenges of Large Applications
in Distributed Environments, CLADE 2008, pp. 15–24. ACM, New York, June 2008
16. Lustre home page. http://lustre.org
17. Meng, Q., Humphrey, A., Schmidt, J., Berzins, M.: Investigating applications
portability with the Uintah DAG-based runtime system on petascale supercomput-
ers. In: Proceedings of the International Conference on High Performance Comput-
ing, Networking, Storage and Analysis, SC 2013, pp. 96:1–96:12. ACM, New York
(2013)
18. U.S. Department of Energy: Exascale Computing Project (2017). https://exascaleproject.org/
19. Peterson, B., Humphrey, A., Schmidt, J., Berzins, M.: Addressing global data
dependencies in heterogeneous asynchronous runtime systems on GPUs. In: Third
International Workshop on Extreme Scale Programming Models and Middleware,
ESPM2. IEEE Press (2017, submitted)
20. Russell, J.: Doug Kothe on the race to build exascale applications (2017). https://www.hpcwire.com/2017/05/29/doug-kothe-race-build-exascale-applications/
21. Schmuck, F., Haskin, R.: GPFS: a shared-disk file system for large computing
clusters. In: Proceedings of the 2002 Conference on File and Storage Technologies
(FAST), pp. 231–244 (2002)
22. Scientific Computing and Imaging Institute: Uintah web page (2015). http://www.uintah.utah.edu/
240 S. Kumar et al.
23. Shan, H., Antypas, K., Shalf, J.: Characterizing and predicting the I/O perfor-
mance of HPC applications using a parameterized synthetic benchmark. In: Pro-
ceedings of Supercomputing, November 2008
24. Wald, I., Johnson, G., Amstutz, J., Brownlee, C., Knoll, A., Jeffers, J., Günther,
J., Navratil, P.: OSPRay - a CPU ray tracing framework for scientific visualization.
IEEE Trans. Visual. Comput. Graph. 23(1), 931–940 (2017)
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Linear Algebra
High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner
1 Introduction
Since the high-Tc superconductor was discovered, many physicists have tried
to understand the mechanism behind the superconductivity. It is believed that
strong electron correlations underlie the phenomenon; however, the exact mechanism
is not yet fully understood. One of the numerical approaches to the problem
is the exact diagonalization method. In this method the eigenvalue problem
is solved for the Hamiltonian derived exactly from the Hubbard model [1,2],
which is a model of a strongly-correlated electron system. When we solve the
ground state (the smallest eigenvalue and its corresponding eigenvector) of the
c The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 243–256, 2018.
https://doi.org/10.1007/978-3-319-69953-0_14
244 S. Yamada et al.
Fig. 1. Algorithm of LOBPCG method for matrix A. Here the matrix T is a precon-
ditioner.
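The LOBPCG iteration of Fig. 1 combines a Rayleigh-Ritz step over the span of the current iterate x, the preconditioned residual w = T(Ax − λx), and the previous search direction p. The following single-vector (m = 1) sketch in Python illustrates this structure; it is a simplified illustration built from these standard definitions, not the authors' implementation:

```python
import numpy as np

def lobpcg_smallest(A, x0, T=None, iters=100):
    """Single-vector LOBPCG sketch: smallest eigenpair of a symmetric matrix A.

    T is the preconditioner applied to the residual (identity if None).
    """
    apply_T = (lambda v: v) if T is None else T
    x = x0 / np.linalg.norm(x0)
    p = np.zeros_like(x)
    for _ in range(iters):
        lam = x @ (A @ x)                      # Rayleigh quotient
        w = apply_T(A @ x - lam * x)           # preconditioned residual
        cols = [x, w] if not p.any() else [x, w, p]
        S, _ = np.linalg.qr(np.column_stack(cols))   # orthonormal search space
        evals, evecs = np.linalg.eigh(S.T @ A @ S)   # Rayleigh-Ritz step
        y = evecs[:, 0]                        # Ritz vector for the smallest value
        x_new = S @ y
        p = S[:, 1:] @ y[1:]                   # next direction: x_new minus its x part
        x = x_new / np.linalg.norm(x_new)
    return x @ (A @ x), x

# Illustrative test problem: a diagonal matrix with smallest eigenvalue 1
A = np.diag(np.arange(1.0, 31.0))
lam, v = lobpcg_smallest(A, np.ones(30))
```

With T chosen as an approximate inverse of A, the iteration count drops; that is the role played by the preconditioners discussed later.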
2 Related Work
When solving the ground state of a symmetric matrix using the LOBPCG
method, the most time-consuming operation is the matrix-vector multiplication.
The Hamiltonian derived from the Hubbard model (see Fig. 2) is
\[ H = -t \sum_{\langle i,j \rangle, \sigma} c_{j\sigma}^{\dagger} c_{i\sigma} + \sum_{i} U_i \, n_{i\uparrow} n_{i\downarrow}, \qquad (1) \]
where t is the hopping parameter from one site to another, and U_i is the repulsive
energy for double occupation of the i-th site by two electrons [1,2,7]. Quantities
c_{iσ}, c_{iσ}^†, and n_{iσ} are the annihilation, creation, and number operators of an
electron with pseudo-spin σ on the i-th site, respectively. The indices in formula
(1) for the Hamiltonian denote the possible states for electrons in the model.
The dimension of the Hamiltonian for the n_s-site Hubbard model is
\[ \binom{n_s}{n_\uparrow} \times \binom{n_s}{n_\downarrow}, \]
where n_↑ and n_↓ are the numbers of up-spin and down-spin electrons,
respectively.
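For a concrete check of this dimension formula, the 4 × 5-site model with 5 up-spin and 5 down-spin electrons used in the numerical tests below can be evaluated directly (an illustrative sketch):

```python
from math import comb

def hubbard_dimension(n_sites, n_up, n_down):
    """Dimension of the Hubbard Hamiltonian: C(n_s, n_up) * C(n_s, n_down)."""
    return comb(n_sites, n_up) * comb(n_sites, n_down)

# The 2-D 4 x 5-site model with 5 up-spin and 5 down-spin electrons
dim = hubbard_dimension(20, 5, 5)
print(dim)  # 240374016, i.e. about 240 million, matching the text
```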
The diagonal element in formula (1) is derived from the repulsive energy Ui in
the corresponding state. The hopping parameter t affects non-zero elements with
Fig. 2. A schematic figure of the 2-dimensional Hubbard model, where t is the hopping
parameter and U is the repulsive energy for double occupation of a site. Up arrows and
down arrows correspond to up-spin and down-spin electrons, respectively.
246 S. Yamada et al.
where I↑(↓) , A↑(↓) and D are the identity matrix, a sparse symmetric matrix
derived from the hopping of an up-spin electron (a down-spin electron), and
a diagonal matrix from the repulsive energy, respectively [7]. Since there is no
regularity in the state change by electron hopping, the distribution of non-zero
elements in matrix A↑(↓) is irregular.
Next, a matrix V is constructed from a vector v by the following procedure.
First, decompose the vector v into m_↓ blocks, and order it in a two-dimensional
manner as follows:
\[ v = (\underbrace{v_{1,1}, \ldots, v_{m_\uparrow,1}}_{\text{first block}},\; \underbrace{v_{1,2}, \ldots, v_{m_\uparrow,2}}_{\text{second block}},\; \ldots,\; \underbrace{v_{1,m_\downarrow}, \ldots, v_{m_\uparrow,m_\downarrow}}_{m_\downarrow\text{-th block}})^{T}, \]
where m_↑ and m_↓ are the dimensions of the Hamiltonian for up-spin and down-spin
electrons, i.e.,
\[ m_\uparrow = \binom{n_s}{n_\uparrow}, \qquad m_\downarrow = \binom{n_s}{n_\downarrow}. \]
The subscripts on each element of v formally indicate the row and column within
the matrix V. Therefore V is a dense matrix. The k-th elements of the matrix D,
d_k, are used in the same manner to define a new matrix D̄. The multiplication
in Eq. (2) can then be written as
\[ V^{\mathrm{new}}_{i,j} = \bar{D}_{i,j} V_{i,j} + \sum_{k} A^{\uparrow}_{i,k} V_{k,j} + \sum_{k} V_{i,k} A^{\downarrow}_{j,k}, \qquad (3) \]
where the subscript (i, j) denotes the (i, j)-th element of the matrices V and D̄.
Accordingly, we can parallelize the multiplication Y = HV (≡ Hv) as follows:
CAL 1: Y^c = D̄^c ∘ V^c + A↑ V^c,
COM 1: all-to-all communication from V^c to V^r,
CAL 2: W^r = V^r A↓^T,
COM 2: all-to-all communication from W^r to W^c,
CAL 3: Y^c = Y^c + W^c,
where the superscripts c and r denote column-wise and row-wise partitioning of the
matrix data for the parallel calculation, and the operator ∘ denotes element-wise
multiplication. This parallelization strategy requires two all-to-all communication
operations per multiplication.
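Ignoring the distributed-memory partitioning, the per-element recurrence in Eq. (3) amounts to one element-wise product and two dense matrix products, as the following NumPy sketch verifies on random data (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m_up, m_down = 6, 4                              # V is an m_up x m_down matrix
A_up = rng.standard_normal((m_up, m_up))         # hopping matrix for up-spin electrons
A_down = rng.standard_normal((m_down, m_down))   # hopping matrix for down-spin electrons
D_bar = rng.standard_normal((m_up, m_down))      # diagonal terms reordered like V
V = rng.standard_normal((m_up, m_down))          # vector v reordered as a matrix

# Eq. (3): V_new[i,j] = D_bar[i,j]*V[i,j] + sum_k A_up[i,k]*V[k,j] + sum_k V[i,k]*A_down[j,k]
V_new = D_bar * V + A_up @ V + V @ A_down.T
```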
High Performance LOBPCG Method for Solving Multiple Eigenvalues 247
In the communication-avoiding variant, two successive multiplications, Y_1 = Hv
and Y_2 = H²v, are computed together, reducing the number of all-to-all
communication operations from four to three:
CAL 1: Y^c = D̄^c ∘ V^c + A↑ V^c,
COM 1: all-to-all communication from V^c to V^r,
CAL 2: W^r = V^r A↓^T,
COM 2: all-to-all communication from W^r to W^c,
CAL 3: Y_1^c = Y^c + W^c,
CAL 4: Y^c = Y_1^c + W^c,
CAL 5: Y^c = D̄^c ∘ Y_1^c + A↑ Y^c,
CAL 6: W^r = D̄^r ∘ V^r + W^r,
CAL 7: W^r = W^r A↓^T,
COM 3: all-to-all communication from W^r to W^c,
CAL 8: Y_2^c = Y^c + W^c.
The LOBPCG method for solving the m smallest eigenvalues and corresponding
eigenvectors carries out recurrence with m vectors simultaneously (see Fig. 3). In
this algorithm, the generalized eigenvalue problem has to be solved. We can solve
the problem using the LAPACK function dsyev if the matrix S_B is positive
definite. Although theoretically S_B is always a positive definite matrix,
numerically this is not always the case. The reason is that the norms of the
vectors w_k^{(i)} and p_k^{(i)} (i = 1, 2, ..., m) become small as the LOBPCG iteration
converges, and it is possible that trailing digits are lost in the calculation of
S_B. Therefore we set the matrix S_B to the identity matrix by orthogonalizing
the vectors in each iteration. In the following numerical tests, we utilize the TSQR
method for the orthogonalization [10,11].
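The TSQR orthogonalization [10,11] can be sketched as a two-level reduction: each row block is factored locally, the stacked small R factors are factored once more, and the local Q factors are updated. In the parallel setting the second factorization corresponds to a reduction across processes. A serial NumPy sketch under these assumptions (block partitioning and names are illustrative):

```python
import numpy as np

def tsqr(blocks):
    """Two-level TSQR: orthogonalize a tall-skinny matrix given as row blocks."""
    locals_ = [np.linalg.qr(b) for b in blocks]      # local QR per block
    R_stack = np.vstack([r for _, r in locals_])     # gather the small R factors
    Q2, R = np.linalg.qr(R_stack)                    # QR of the stacked R factors
    n = blocks[0].shape[1]
    # update each local Q with the corresponding slice of Q2
    Q = np.vstack([q @ Q2[i * n:(i + 1) * n] for i, (q, _) in enumerate(locals_)])
    return Q, R

rng = np.random.default_rng(1)
W = rng.standard_normal((40, 5))                     # tall-skinny block of basis vectors
Q, R = tsqr(np.array_split(W, 4))
```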
Fig. 3. LOBPCG method for solving the m smallest eigenvalues and corresponding
eigenvectors. T^{(i)} is a preconditioner for the i-th smallest eigenvalue. This algorithm
requires m matrix-vector multiplication operations and m preconditioned ones per
iteration.
Formula (5) can approximately remove from the preconditioned vectors the
components of the eigenvectors corresponding to eigenvalues whose absolute
values are greater than or equal to 1. Therefore we expect that the Neumann
expansion using M_i becomes an appropriate preconditioner for solving for
multiple eigenvalues.
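For reference, a truncated Neumann series approximates (I − N)⁻¹ by the partial sum Σ_{k=0}^{s} N^k, which converges when the spectral radius of N is below 1; this is why the norm of the matrix used in the expansion must be less than 1. A generic sketch follows (the exact shifted matrix M_i of formula (5) is not reproduced in this excerpt, so N below is a stand-in):

```python
import numpy as np

def neumann_apply(N, r, s):
    """Approximate (I - N)^{-1} r by the truncated Neumann series sum_{k=0}^{s} N^k r.

    Valid when the spectral radius of N is below 1; each additional term costs
    one more matrix-vector multiplication, mirroring the series number s.
    """
    z = r.copy()
    term = r.copy()
    for _ in range(s):
        term = N @ term          # accumulate N^k r
        z += term
    return z

rng = np.random.default_rng(2)
n = 50
N = 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)   # small spectral radius
r = rng.standard_normal(n)
z = neumann_apply(N, r, s=3)
```

Increasing s trades extra matrix-vector multiplications for a better approximation, which mirrors the behaviour reported in the tables below.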
4 Performance Result
4.1 Computational Performance and Convergence Property
We examined the computational performance and convergence properties of the
LOBPCG method. We solved the 2-D 4 × 5-site Hubbard model with 5 up-spin
electrons and 5 down-spin ones. The dimension of the Hamiltonian derived from
the model is about 240 million. The number of non-zero off-diagonal elements is
about 1.6 billion. We solved for one, five, and ten eigenvalues (and corresponding
eigenvectors) of the Hamiltonian on 768 cores (64 MPI processes × 12 OpenMP
threads) of the SGI ICE X supercomputer (see Table 1) in Japan Atomic Energy
Agency (JAEA). Table 2 shows the results for a weak interaction case (U/t = 1)
and a strong one (U/t = 10). Table 3 shows the elapsed times of some represen-
tative operations.
The results for U/t = 1 indicate that point Jacobi (PJ) and zero-shift point
Jacobi (ZSPJ) preconditioners hardly improve the convergence compared to
using no preconditioner at all. When we solve for many eigenvalues, the
PJ and ZSPJ preconditioners have little effect on the speed of the calculation.
On the other hand, the Neumann expansion preconditioner can decrease the
number of iterations required for convergence. Moreover, the larger the Neu-
mann expansion series s, the fewer iterations required. When we solve for only
Table 2. Elapsed time and number of iterations for convergence of LOBPCG method
using zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), or communication
avoiding Neumann expansion (CANE) preconditioner. Here, s is the number of the
Neumann expansion series.
the smallest eigenvalue, the total elapsed time increases as s increases. The rea-
son is that the elapsed time of the Hamiltonian-vector multiplication operation
is dominant over the whole calculation when solving only the smallest eigenvalue
(see Table 3). When we solve multiple eigenvalues, the TSQR operation becomes
dominant. Therefore when the series number s becomes large, it is possible to
achieve speedup of the computation.
Next, we discuss the results for U/t = 10. The results indicate that the PJ
preconditioner improves the convergence properties. On the other hand, ZSPJ
improves convergence for small m; however, its convergence properties when solving
for multiple eigenvalues are almost the same as those for the PJ preconditioner.
When we solve for multiple eigenvalues using the Neumann expansion precondi-
tioner, the solution is obtained faster than using the PJ or ZSPJ preconditioners.
Moreover, as the Neumann expansion series s increases, the Neumann expansion
Table 3. Elapsed time for operations per iteration. This table shows the results using
the zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), and communication
avoiding Neumann expansion (CANE). Here, the Neumann expansion series s is equal
to 1. For m = 1, instead of executing TSQR, we calculate S_B; moreover, the ZSPJ
preconditioner is calculated together with x, p, X, P.
Table 4. Speedup ratio for the elapsed time per iteration using the Neumann expansion
preconditioner and communication avoiding strategy.
Speedup ratio
m=1 m=5 m = 10
s=1 s=2 s=3 s=1 s=2 s=3 s=1 s=2 s=3
U/t = 1 1.19 1.20 1.21 1.08 1.06 1.11 1.13 1.13 1.20
U/t = 10 1.19 1.16 1.20 1.08 1.06 1.11 1.08 1.12 1.16
preconditioner improves the convergence properties and the total elapsed time
decreases, especially when m is large.
Finally, we discuss the effect of the communication avoiding strategy.
Table 4 shows the speedup ratio for the elapsed time using the Neumann expan-
sion preconditioner per iteration and the communication avoiding strategy. In
all cases the communication avoiding strategy realizes speedup. When we solve
for only the smallest eigenvalue (and its corresponding eigenvector), the speedup
ratio is almost the same as that for the matrix-vector multiplication, because the
multiplication cost is dominant. On the other hand, when we solve for multiple
eigenvalues, the calculation cost other than the multiplication becomes dominant.
Therefore the speedup ratio is a little smaller than that for the multiplication
alone. Furthermore, when the Neumann expansion series s is equal to 3, we
confirm that the ratio improves. In this case, since four multiplications (Hw,
H 2 w, H 3 w and H 4 w) are executed per iteration, the ratio of the multiplication
cost increases. Moreover, we can execute four multiplication operations by two
communication avoiding multiplications. Therefore, the ratio for s = 3 is better
than that for s = 1.
In order to examine the parallel performance of the LOBPCG method using the
Neumann expansion preconditioner, we solved for the 10 smallest eigenvalues
and corresponding eigenvectors of the Hamiltonian derived from the 4 × 5-site
Hubbard model for U/t = 1 with 6 up-spin and 6 down-spin electrons. We
used the LOBPCG method with ZSPJ, NE, and CANE preconditioners using
hybrid parallelization on SGI ICEX in JAEA and the K computer in RIKEN
(see Table 5). The results are shown in Table 6. The results indicate that all
preconditioners achieve excellent parallel efficiency. The communication avoiding
strategy on SGI ICEX decreases the elapsed time per iteration by about 15%.
On the other hand, the communication avoiding strategy on the K computer
did not realize speedup when using a small number of cores. The ratio of the
network bandwidth to FLOPS per node of the K computer is larger than that
of SGI ICEX, so it is possible that the cost of the extra calculations (CAL 4 &
CAL 6) is larger than that of the all-to-all communication operation. However
since the cost of the all-to-all communication operation increases as the number
of the cores increases, the strategy realizes speedup on 4096 cores. Therefore, the
strategy has a potential of speedup for parallel computing using a sufficiently
large number of cores, even if the ratio of the network bandwidth to FLOPS is
large.
Although the LOBPCG method using NE has four times more Hamiltonian-
vector multiplications per iteration than the method with ZSPJ, the former takes
about twice the elapsed time of the latter. The reason is that the calculation
operations other than the multiplication are dominant in this case. Therefore, we
conclude that in order to solve for multiple eigenvalues of the Hamiltonian derived
from the Hubbard model using the LOBPCG method in a short computation time, it
is crucial to reduce the number of iterations for convergence even if the
calculation cost of the preconditioner is large.
5 Conclusions
In this paper we applied the Neumann expansion preconditioner to the LOBPCG
method to solve for multiple eigenvalues and corresponding eigenvectors of the
Hamiltonian derived from the Hubbard model. We examined the convergence
properties and parallel performance of the algorithms. Since the norm of the
matrix used in the Neumann expansion should be less than 1, we transform
References
1. Rasetti, M. (ed.): The Hubbard Model: Recent Results. World Scientific, Singapore
(1991)
2. Montorsi, A. (ed.): The Hubbard Model. World Scientific, Singapore (1992)
3. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigen-
value Computations, vol. 1: Theory. SIAM, Philadelphia (2002)
4. Knyazev, A.V.: Preconditioned eigensolvers - an oxymoron? Electron. Trans.
Numer. Anal. 7, 104–123 (1998)
5. Knyazev, A.V.: Toward the optimal eigensolver: locally optimal block precondi-
tioned conjugate gradient method. SIAM J. Sci. Comput. 23, 517–541 (2001)
6. Saad, Y.: Numerical Methods for Large Eigenvalue Problems: Revised Edition.
SIAM (2011)
7. Yamada, S., Imamura, T., Machida, M.: 16.447 TFlops and 159-Billion-dimensional
exact-diagonalization for trapped Fermion-Hubbard Model on the Earth Simulator.
In: Proceedings of SC 2005 (2005)
8. Yamada, S., Imamura, T., Machida, M.: Communication avoiding Neumann expan-
sion preconditioner for LOBPCG method: convergence property of exact diagonal-
ization method for Hubbard model. In: Proceedings of ParCo 2017 (2017, accepted)
9. Barrett, R., et al.: Templates for the Solution of Linear Systems: Building Blocks
for Iterative Methods. SIAM, Philadelphia (1994)
10. Langou, J.: AllReduce algorithms: application to Householder QR factorization. In:
Proceedings of the 2007 International Conference on Preconditioning Techniques
for Large Sparse Matrix Problems in Scientific and Industrial Applications, pp.
103–106 (2007)
11. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-avoiding
parallel and sequential QR factorizations. Technical report, Electrical Engineering
and Computer Sciences, University of California, Berkeley (2008)
Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code
1 Introduction
Krylov subspace methods are widely used for solving linear systems given by
extreme-scale sparse matrices, and thus their scalability is one of the critical issues
towards exascale computing. In nuclear engineering, exascale computing is needed
for Computational Fluid Dynamics (CFD) simulations of turbulent flows such as
multiphase thermal-hydraulic simulations of nuclear reactors and fusion plasma
simulations. In these CFD simulations, implicit solvers based on Krylov subspace
methods occupy dominant computational costs, and the scalability of such CFD
simulations largely depends on the performance of Krylov solvers.
The current peta-scale machines are characterized by extreme concurrency,
reaching ∼100k computing nodes. In addition, on future exascale machines,
which may be based on many-core processors or accelerators, significant
acceleration of computation is expected. In Ref. [1], we optimized stencil
computation kernels from CFD simulations on the latest many-core
c The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 257–273, 2018.
https://doi.org/10.1007/978-3-319-69953-0_15
258 Y. Idomura et al.
processors and GPUs, and significant performance gains were achieved. However,
the accelerated computation revealed severe bottlenecks of communication.
Krylov solvers involve local halo data communications for stencil computations
or sparse matrix-vector operations (SpMVs), and global data reduction
communications for inner product operations in orthogonalization procedures
for basis vectors. Although communication overlap techniques [2] may reduce
the former latency, they cannot be applied to the latter.
issue at mathematics or algorithm levels, in Refs. [3,4], we have introduced
communication-avoiding (CA) Krylov methods to a fusion plasma turbulence
code GT5D [5] and a multiphase thermal-hydraulic CFD code JUPITER [6].
The implicit solver in the GT5D is well-conditioned, and the communication-
avoiding general minimum residual (CA-GMRES) method [7] was stable for large
CA-steps s > 10. On the other hand, the Poisson solver in the JUPITER is
ill-conditioned, and the convergence of the left-preconditioned communication-
avoiding conjugate gradient (P-CACG) method [7] was limited for s ≤ 3. Even
with s = 3, the strong scaling of the JUPITER on the K-computer [8] was
dramatically improved by reducing the number of global data reduction commu-
nications to 1/s. However, for practical use, it is difficult to operate CA Krylov
solvers at the upper limit of CA-steps, because the Poisson operator is time
dependent and its condition number may increase in time. Therefore, we need
to use more robust CA Krylov methods at CA-steps well below the upper limit,
beyond which they become numerically unstable. In order to resolve this issue,
in this work, we introduce the preconditioned Chebyshev basis communication-
avoiding conjugate gradient (P-CBCG) method to the JUPITER, and examine
its robustness and computational performance on the Oakforest-PACS, which
consists of 8,208 KNLs.
The remainder of this paper is organized as follows. Related works are reviewed
in Sect. 2. In Sect. 3, we explain CA Krylov subspace methods used in this work.
In Sect. 4, we discuss numerical properties and kernel performances of CA Krylov
solvers. In Sect. 5, we present the convergence property of CA Krylov methods
and the computational performances of CA Krylov solvers on the JAEA ICEX
and the Oakforest-PACS. Finally, a summary is given in Sect. 6.
2 Related Works
The CACG method is based on the so-called s-step CG method, in which the
data dependency between SpMV and inner product operations in the standard
CG method is removed. Van Rosendale [9] first developed an s-step version of
the CG method. Chronopoulos and Gear [10] called their own variant of the CG
method the s-step CG method. However, the above works did not change
SpMV operations for generating the s-step basis. Toledo optimized the compu-
tation of the s-step basis in the s-step CG method [11], in which the number
of words transferred between levels of the memory hierarchy is reduced. The
CACG method by Hoemmen [7] reduced communications between levels of the
memory hierarchy and between processors by a matrix power kernel (MPK) [12].
Application of a Preconditioned Chebyshev Basis 259
Carson [13] showed the performance of the CACG method on the Hopper super-
computer using a simple Poisson model problem.
CA-preconditioning is based on sparse approximate inverses with the same
sparsity pattern as the matrix A, or block Jacobi (BJ) and polynomial precon-
ditioners [9,11,14]. For instance, in BJ preconditioning, each processor indepen-
dently solves its local problem. However, when the local preconditioner has data
dependency over the whole local problem as in ILU factorization, it is difficult to
construct an MPK without additional communications, because each local SpMV
requires preconditioned input vector elements from neighboring processors. To
avoid the additional communications, Yamazaki et al. [15] proposed an under-
lap approach, in which each subdomain is divided into an inner part and the
remaining surface part, and preconditioning for the latter is approximated by
point Jacobi preconditioning. However, in our previous work [3], it was shown
that for ill-conditioned problems given by the JUPITER, the underlap approach
leads to significant convergence degradation, and a hybrid CA approach, in which
SpMVs and BJ preconditioning are unchanged and CA is applied only to inner
product operations, was proposed.
In most performance studies [4,13,15], CA Krylov methods were applied
to well-conditioned problems, where CA-steps are extended for s > 10. How-
ever, in Ref. [3], it was shown that for ill-conditioned problems given by the
JUPITER, the P-CACG method is numerically stable only within a few CA-
steps even with the original BJ preconditioning. This issue is attributed to the
monomial basis vectors, which become aligned with the eigenvector of the maximum
eigenvalue as s increases, while the other eigen-components become relatively
smaller and are hidden by round-off errors. This violates the linear independence
of the monomial basis vectors and makes them ill-conditioned when each basis
vector is not orthogonalized after it is created. To resolve this issue,
Hoemmen [7] proposed to use the Newton basis vectors and the Chebyshev basis
vectors. Suda et al. [16] proposed the P-CBCG method, which was tested with
point Jacobi preconditioning on the K-computer [17]. In this work, we apply
the P-CBCG method with BJ preconditioning to the JUPITER, compare its
convergence property and numerical stability against the P-CACG method, and
demonstrate its computational performance on the Oakforest-PACS.
has extreme contrast ∼10^7 between gas and solid phases, and is ill-conditioned.
The Poisson equation is discretized by the second order accurate centered finite
difference scheme (7 stencils) in the Cartesian grid system (x, y, z). The linear
system of the pressure Poisson equation, which is a symmetric block diago-
nal sparse matrix, is solved using Krylov subspace methods explained in the
following subsections. These Krylov solvers use the compressed diagonal storage
(CDS) format, which enables more efficient direct memory access for the
block diagonal sparse matrix than the compressed sparse row (CSR) format
commonly used in many matrix libraries, and are parallelized using a
MPI+OpenMP hybrid parallelization model, in which MPI is used for coarse
3D domain decomposition in (x, y, z) and fine 1D domain decomposition in z is
applied to each domain via OpenMP. BJ preconditioning is applied to each fine
subdomain so that it is computed in thread parallel.
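The advantage of the CDS layout for the 7-point stencil matrix is that each stored diagonal contributes a unit-stride vector operation to the SpMV. A serial sketch of this idea (the array layout and boundary handling are illustrative, not JUPITER's actual implementation):

```python
import numpy as np

def spmv_cds(diags, offsets, x):
    """y = A x for a matrix stored by non-zero diagonals (CDS-like layout).

    diags[k][i] holds A[i, i + offsets[k]]; out-of-range entries are ignored.
    Each diagonal contributes a unit-stride vector operation, which is what
    makes this layout attractive for stencil matrices.
    """
    n = x.size
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        if off >= 0:
            y[:n - off] += d[:n - off] * x[off:]
        else:
            y[-off:] += d[-off:] * x[:n + off]
    return y

# Minimal check with the 1-D Poisson operator (3-point stencil); the 7-point
# stencil of the text adds offsets +-nx and +-nx*ny for y- and z-neighbours.
n = 8
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n)
y = spmv_cds([main, off, off], [0, 1, -1], np.arange(1.0, n + 1))
```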
CA Krylov methods based on the monomial basis vectors. In this work, λmax is
computed by a power method, while λmin is approximated as zero.
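The λmax estimate mentioned here can be obtained with a few power iterations; a minimal sketch (the diagonal test matrix is illustrative):

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)          # renormalize toward the dominant eigenvector
    return v @ (A @ v)                     # Rayleigh quotient at the converged vector

# SPD stand-in with a well-separated largest eigenvalue of 2.0
A = np.diag(np.concatenate([np.linspace(0.1, 1.0, 39), [2.0]]))
lmax = power_method(A)
```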
In the P-CBCG method, dominant computational costs come from the preconditioned
Chebyshev basis vector generation involving the SpMVs and the BJ
preconditioning (line 10) and the remaining matrix computations. The SpMVs
at line 10 require s local halo data communications, while the matrix computations
at lines 5, 11 need global data reduction communications. Therefore, the
P-CBCG method requires two All reduces per s steps. One All reduce at lines 5,
6 transfers the s(s + 1)/2 upper-triangular elements of Q_k^* A Q_k, the s elements
of Q_k^* r_{sk}, and one element for the norm of the residual vector, while the other
All reduce sends the s² elements of Q_k^* A S_{k+1}.
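The Chebyshev basis generation at line 10 follows the standard three-term recurrence applied to the operator shifted and scaled into [−1, 1] using the bounds λmin and λmax. A sketch without the block preconditioning (the exact scaling used in P-CBCG may differ; in the method, M⁻¹A would replace A):

```python
import numpy as np

def chebyshev_basis(A, r, s, lmin, lmax):
    """Chebyshev basis [T_0(Z)r, ..., T_{s-1}(Z)r] with Z = (2A - (lmax+lmin)I)/(lmax-lmin)."""
    def z_apply(v):
        return (2.0 * (A @ v) - (lmax + lmin) * v) / (lmax - lmin)
    S = [r, z_apply(r)]
    for _ in range(2, s):
        S.append(2.0 * z_apply(S[-1]) - S[-2])   # T_{j+1} = 2 Z T_j - T_{j-1}
    return np.column_stack(S[:s])

rng = np.random.default_rng(3)
n = 30
A = np.diag(rng.uniform(0.1, 2.0, n))            # SPD stand-in for the Poisson operator
r = rng.standard_normal(n)
lmax = np.max(np.diag(A))                        # in practice: a power method estimate
S = chebyshev_basis(A, r, s=6, lmin=0.0, lmax=lmax)
```

Unlike monomial basis vectors, the columns of S stay well separated in direction, which is what extends the usable range of CA-steps.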
ICEX KNL
Number of nodes 2,510 8,208
Total performance [PFlops] 2.41 25.00
Number of cores per node 12 × 2 68
Peak performance F [GFlops/processor] 480 3046
STREAM bandwidth B [GByte/s/processor] 58 480(MCDRAM)
B/F 0.12 0.16
Cache [MB/cores] 30/12 1/2
Memory per node [GByte] 64 16
Interconnect bandwidth [GByte/s] 13.6 12.5
peak performance and the STREAM memory bandwidth of the processor. The
performance ratios of F and B between ICEX and KNL are 6.3× and 8.6×,
respectively.
5 Numerical Experiment
with ∼6,000 iterations (see Fig. 2). Here, the convergence condition is given by
the relative residual error |b − Ax|/|b| < 10^{-8}.
The convergence properties of the P-CG, P-CACG, P-CBCG, and P-MBCG
solvers are summarized in Fig. 2. Here, the P-MBCG method is a variant of the P-
CBCG method, in which the Chebyshev basis vectors at lines 2, 10 are replaced by
the monomial basis vectors S_k = (r_{sk}, (AM^{-1}) r_{sk}, (AM^{-1})^2 r_{sk}, ..., (AM^{-1})^{s-1} r_{sk}).
Although the P-MBCG method is mathematically similar to the P-CACG
method, the former uses the two term recurrence formulae, while the latter is
based on the CG3 method or the three term recurrence formulae. In Ref. [20],
it was shown that Krylov subspace methods based on three term recurrences
give significantly less accurate residuals than those with two term recurrences.
In this work, we examine this point for CA Krylov subspace methods. As shown
in Ref. [3], the convergence of the P-CACG solver is limited to s = 3, while in the
P-MBCG solver, the convergence is somewhat extended to s = 5. On the other
hand, in the P-CBCG method, the convergence property is dramatically extended
to s = 40. These observations show that the main cause of the convergence degra-
dation is not the three term recurrence formulae, but the ill-conditioned monomial
basis vectors. Another important property is that in the P-CBCG solver, the
convergence degrades gradually above the upper limit of CA-steps, while
the P-CACG and P-MBCG solvers break down immediately above the upper
limit. This property is important for practical use in extreme-scale CFD
simulations.
In the P-CACG solver, we use s = 3, which is the upper limit of CA-steps from
the viewpoint of numerical stability. On the other hand, the choice of s in the
Fig. 3. Strong scaling of the P-CG, P-CACG(s = 3), and P-CBCG(s = 12) solvers
using 500, 1,000, and 2,000 processors (MPI processes) on ICEX and KNL. The cost
distribution in a single time step is shown.
P-CBCG solver is rather flexible, and the optimum s depends on the following
factors. Firstly, the number of Allreduce operations is reduced to 1/s compared
with the P-CG method, while the communication data size scales as ∼s^2.
Secondly, the numbers of floating point operations and memory accesses per
iteration in the Matrix kernel scale as f ∼ s and b ∼ const., respectively, so the
arithmetic intensity of Matrix scales as f/b ∼ s. Thirdly, the cache efficiency of
CB is affected by the number of basis vectors. Therefore, the computational
performance of each kernel varies depending
on s. Finally, the communication performance is also affected when the data size
changes from a latency-bound regime to a bandwidth-bound regime. Although
a simple performance model was presented in Refs. [13,17], more detailed
performance models are needed to predict these complex behaviors. In this work,
we chose s = 12 based on s-scan numerical experiments.
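The trade-off between fewer Allreduce calls (∼1/s) and larger payloads (∼s^2) can be sketched with a toy latency-bandwidth model. The constants below are purely hypothetical, chosen only to illustrate the shape of the trade-off; they are not measurements from ICEX or KNL.

```python
# Toy model of the Allreduce communication cost per iteration as a
# function of the CA-step count s.  All constants are hypothetical and
# only illustrate the latency/bandwidth trade-off described above.
alpha = 20e-6   # Allreduce latency per call [s] (assumed)
beta = 1e9      # effective reduction bandwidth [byte/s] (assumed)
word = 8        # double precision [byte]

def allreduce_cost_per_iteration(s):
    # One Allreduce every s iterations, but its payload grows as ~s^2 words.
    payload = word * s * s
    return (alpha + payload / beta) / s

costs = {s: allreduce_cost_per_iteration(s) for s in (1, 3, 12, 40, 200)}
for s, c in sorted(costs.items()):
    print(f"s = {s:3d}: {c * 1e6:8.3f} us/iteration")
```

Under such a model the per-iteration cost first falls with s (latency amortization) and eventually rises again once the s^2 payload dominates, which is why an intermediate s must be chosen by experiment.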
The strong scaling of the P-CG, P-CACG, and P-CBCG solvers is summarized
in Fig. 3. In the strong scaling test, we use 500, 1,000, and 2,000 processors
on both ICEX and KNL. On ICEX, all Krylov solvers show good strong scaling,
because the computation part is dominant in all cases and the communication
part is suppressed below ∼10 s. Consequently, the P-CACG and P-CBCG solvers
are slower than the P-CG solver because of the additional computation in CA
Krylov methods. On KNL, in contrast, the computation part is significantly
accelerated (3.5×–5.1×) while the communication part is comparable or slower
(0.3×–1.1×) compared with ICEX. The cause of the slower communication
performance on KNL is still under investigation. As a result, the remaining
communication part, in particular Allreduce, becomes a severe bottleneck. On KNL,
the cost of Allreduce in the P-CG solver increases with the number of processors.
This tendency is observed even in the P-CACG solver. In the P-CBCG solver,
however, the cost increase of Allreduce is suppressed, and at 2,000 processors, it
is reduced to ∼1/3 and ∼1/2 compared with the P-CG and P-CACG solvers,
respectively. Because of this CA feature, the best performance on KNL is obtained
by the P-CBCG solver, which is 1.38× and 1.17× faster than the P-CG and
P-CACG solvers at 2,000 processors, respectively.
It is noted that in Ref. [3], the P-CACG solver on the K computer showed an ideal
cost reduction of Allreduce by 1/s. However, in the present numerical experiment,
the cost reduction of Allreduce relative to the P-CG solver is limited to ∼2/3 and
∼1/3 in the P-CACG and P-CBCG solvers, respectively. These ratios are far above
the ideal one of 1/s, and a more detailed performance analysis of Allreduce is
needed to understand this issue. Another issue is that the cost of halo data
communications increases in the P-CBCG solver, although the number of SpMVs
is almost the same as in the other solvers. We confirmed that this cost becomes
comparable to that of the P-CACG solver when the number of CA-steps is reduced
to s = 3. Therefore, the performance degradation of halo data communications
seems to depend on the memory usage, which increases with s. These issues will
be addressed in future work.
6 Summary
In this work, we applied the P-CBCG method to the pressure Poisson equation in
the JUPITER code. We analyzed the numerical properties of the P-CACG and
P-CBCG methods in detail, and compared them against the P-CG method, which
was used in the original code. The P-CACG and P-CBCG methods reduce data
reduction communications to 1/s, but additional computation is needed for the
CA procedures. The P-CACG (s = 3) and P-CBCG (s = 12) methods have ∼2×
and ∼3× larger f, while the increase in b is only ∼1.25×. Because of the improved
arithmetic intensity f/b, the resulting computational costs of the P-CACG and
P-CBCG solvers
Acknowledgement. The authors would like to thank Dr. S. Yamashita for provid-
ing the JUPITER for the present benchmark, and Dr. T. Kawamura for the visual-
ization image. This work is supported by the MEXT (Grant for Post-K priority issue
No.6: Development of Innovative Clean Energy). Computations were performed on the
Oakforest-PACS (Univ. Tokyo/Univ. Tsukuba) and the ICEX (JAEA).
References
1. Asahi, Y., et al.: Optimization of fusion kernels on accelerators with indirect or
strided memory access patterns. IEEE Trans. Parallel Distrib. Syst. 28(7), 1974–
1988 (2017)
2. Idomura, Y., et al.: Communication-overlap techniques for improved strong scaling
of Gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int. J. High
Perform. Comput. Appl. 28(1), 73–86 (2014)
3. Mayumi, A., et al.: Left-preconditioned communication-avoiding conjugate gradient
methods for multiphase CFD simulations on the K computer. In: Proceedings of the
7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems,
ScalA 2016, Piscataway, NJ, USA, pp. 17–24. IEEE Press (2016)
4. Idomura, Y., Ina, T., Mayumi, A., Yamada, S., Matsumoto, K., Asahi, Y., Ima-
mura, T.: Application of a communication-avoiding generalized minimal residual
method to a gyrokinetic five dimensional Eulerian code on many core platforms. In:
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for
Large-Scale Systems, ScalA 2017, New York, NY, USA, pp. 7:1–7:8. ACM (2017)
5. Idomura, Y., et al.: Study of ion turbulent transport and profile formations using
global gyrokinetic full-f Vlasov simulation. Nucl. Fusion 49, 065029 (2009)
6. Yamashita, S., Ina, T., Idomura, Y., Yoshida, H.: A numerical simulation method
for molten material behavior in nuclear reactors. Nucl. Eng. Des. 322(Suppl. C),
301–312 (2017)
7. Hoemmen, M.: Communication-avoiding Krylov subspace methods. Ph.D. thesis,
University of California, Berkeley (2010)
8. Fujitsu Global: K computer. http://www.fujitsu.com/global/about/businesspolicy/
tech/k/
9. Van Rosendale, J.: Minimizing inner product data dependencies in conjugate gra-
dient iteration. NASA contractor report (1983)
10. Chronopoulos, A., Gear, C.: s-step iterative methods for symmetric linear systems.
J. Comput. Appl. Math. 25(2), 153–168 (1989)
11. Toledo, S.A.: Quantitative performance modeling of scientific computations and
creating locality in numerical algorithms. Ph.D. thesis, Massachusetts Institute of
Technology (1995)
12. Demmel, J., Hoemmen, M., Mohiyuddin, M., Yelick, K.: Avoiding communication
in sparse matrix computations. In: 2008 IEEE International Symposium on Parallel
and Distributed Processing, pp. 1–12, April 2008
13. Carson, E.C.: Communication-avoiding Krylov subspace methods in theory and
practice. Ph.D. thesis, University of California, Berkeley (2015)
14. Chronopoulos, A., Gear, C.W.: Implementation of preconditioned s-step conjugate
gradient methods on a multiprocessor system with memory hierarchy. Technical
report, Department of Computer Science, Illinois University, Urbana, USA (1987)
15. Yamazaki, I., Anzt, H., Tomov, S., Hoemmen, M., Dongarra, J.: Improving the per-
formance of CA-GMRES on multicores with multiple GPUs. In: 2014 IEEE 28th
International Parallel and Distributed Processing Symposium, pp. 382–391, May
2014
16. Suda, R., Cong, L., Watanabe, D., Kumagai, Y., Fujii, A., Tanaka, T.:
Communication-avoiding CG method: new direction of Krylov subspace methods
towards exa-scale computing. RIMS Kôkyûroku 1995, 102–111 (2016)
17. Kumagai, Y., Fujii, A., Tanaka, T., Hirota, Y., Fukaya, T., Imamura, T., Suda,
R.: Performance analysis of the Chebyshev basis conjugate gradient method on
the K computer. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K.,
Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9573, pp. 74–85. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-32149-3_8
18. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Indus-
trial and Applied Mathematics, Philadelphia (2003)
19. Shimokawabe, T., et al.: An 80-fold speedup, 15.0 TFlops full GPU acceleration
of non-hydrostatic weather model ASUCA production code. In: 2010 ACM/IEEE
International Conference for High Performance Computing, Networking, Storage
and Analysis, pp. 1–11, November 2010
20. Gutknecht, M.H., Strakos, Z.: Accuracy of two three-term and three two-term recur-
rences for Krylov space solvers. SIAM J. Matrix Anal. Appl. 22(1), 213–229 (2000)
Optimization of Hierarchical Matrix
Computation on GPU
Abstract. The demand for dense matrix computation in large-scale and
complex simulations is increasing; however, the memory capacity of
current computer systems is insufficient for such simulations. The
hierarchical matrix method (H-matrices) is attracting attention as a
computational method that can reduce the memory requirements of dense
matrix computations. However, the computation of H-matrices is more
complex than that of dense and sparse matrices; thus, acceleration of
H-matrix computation is required. We focus on H-matrix-vector
multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement
five GPU kernels and compare their execution times with OpenMP
implementations on various processors (Broadwell-EP, Skylake-SP, and
Knights Landing). The results show that, although an HMVM involves many
small GEMV computations, merging them into a single GPU kernel was the
most effective implementation. Moreover, the performance of BATCHED
BLAS in the MAGMA library was comparable to that of the manually tuned
GPU kernel.
1 Introduction
The scale of computer simulations continues to increase as hardware capability
advances from post-Peta to Exascale. At such scales, the asymptotic complexity
of both computation and memory is a serious bottleneck if they are not (near)
linear. In addition, the deep memory hierarchy and heterogeneity of such systems
are a challenge for existing algorithms. A fundamental change in the underlying
algorithms for scientific computing is required to facilitate exascale simulations,
i.e., (near) linear scaling algorithms with high data locality and asynchronicity
are required.
In scientific computing, the most common algorithmic components are linear
algebra routines, e.g., matrix-vector multiplication, matrix-matrix multiplication,
factorization, and eigenvalue problems. The performance of these components
has been used as a proxy to measure the performance of large-scale systems.
© The Author(s) 2018
R. Yokota and W. Wu (Eds.): SCFA 2018, LNCS 10776, pp. 274–292, 2018.
https://doi.org/10.1007/978-3-319-69953-0_16
Note that the general usefulness of the high performance LINPACK benchmark for
supercomputers has long been disputed, and recent advancements of dense linear
algebra methods with near linear complexity could be the final nail in the coffin.
Dense matrices require O(N^2) storage and have a multiplication/factorization
cost of O(N^3). Hierarchical low-rank approximation methods, such as H-matrices
[1], hierarchical semi-separable matrices [2], hierarchical off-diagonal low-rank
matrices [3], and hierarchical interpolative factorization methods [4], reduce the
storage requirement to O(N log N) and the multiplication/factorization cost to
O(N log^q N), where q denotes a positive number. With such methods, there is
little point in performing large-scale dense linear algebra operations directly. Note
that we refer to all hierarchical low-rank approximation methods as H-matrices
in this paper for simplicity.
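The storage and cost reduction of a single low-rank block can be sketched as follows; the block sizes and rank below are arbitrary illustrative values, not from the paper.

```python
import numpy as np

# Minimal sketch: a rank-k block stored as factors U (m x k) and V (n x k)
# replaces the dense m x n block B = U @ V.T.
rng = np.random.default_rng(0)
m, n, k = 600, 400, 8
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))
B = U @ V.T                      # dense block (built only for verification)

dense_storage = m * n            # entries needed for the dense block
lowrank_storage = k * (m + n)    # entries needed for the factors

x = rng.standard_normal(n)
y_dense = B @ x                  # O(m*n) work
y_lowrank = U @ (V.T @ x)        # O(k*(m+n)) work, same result
print(dense_storage, lowrank_storage, np.allclose(y_dense, y_lowrank))
```

Applying the factors directly (two small GEMVs instead of one large one) is exactly what makes the O(N log N) storage and O(N log^q N) arithmetic complexity attainable.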
H-matrices subdivide a dense matrix recursively, i.e., off-diagonal block division
terminates at a coarse level, whereas diagonal blocks are divided until a constant
block size is obtained regardless of the problem size. The off-diagonal blocks are
compressed using low-rank approximation, which is critical to achieving
O(N log N) storage and O(N log^q N) arithmetic complexity. H-matrices have
recently attracted increasing attention; however, most efforts have had a
mathematical and algebraic focus. As a result, few parallel implementations of
H-matrix code have been proposed.
In this paper, we focus on a parallel implementation. Specifically, we target
matrix-vector multiplications on GPUs. Of the many scientific applications that
involve solving large dense matrices, we selected electric field analysis based on
a boundary integral formulation. Our results demonstrate that orders-of-magnitude
speedup can be obtained by merging many matrix-vector computations into a
single GPU kernel and by proper use of the batched BLAS operations in the
MAGMA library [5–7].
The remainder of this paper is organized as follows. An overview of H-matrices
and their basic computation is presented in Sect. 2. In Sect. 3, we focus on
H-matrix-vector multiplication (HMVM) and propose various single-GPU
implementations. Performance evaluation results are presented and discussed in
Sect. 4, and conclusions and suggestions for future work are given in Sect. 5.
Aφ = B. (2)
Here, assume that we have two subsets (i.e., clusters) s, t ⊂ I, where the
corresponding domains are defined as follows:

    Ω_s^h := ∪_{i∈s} supp φ_i,   Ω_t^h := ∪_{i∈t} supp φ_i.   (3)
A cluster pair (s, t) is 'admissible' if the Euclidean distance between Ω_s^h and
Ω_t^h is sufficiently large compared with their diameters. On such a pair, the
kernel function can be approximated by a degenerate (separable) expansion:

    g(x, y) ≅ Σ_{ν=1}^{k} g_1^ν(x) g_2^ν(y),   (5)
where k is a positive number. Such kernel functions are employed in various
scientific applications, e.g., electric field analysis, mechanical analysis, and
earthquake cycle simulations. The kernel functions in such applications can be
written as follows:

    g(x, y) ∈ span({|x − y|^−p, p > 0}).   (6)
When we consider static electric field analysis as a practical example, the
kernel function is given by

    g(x, y) = (1 / (4πε)) |x − y|^−1.   (7)

Here, ε denotes the electric permittivity. Figure 1 shows the calculation result
when a surface charge method is used to calculate the electrical charge on the
surface of the conductors. We divided the surface of the conductor into triangular
elements and used step functions as the base functions φ_i of the BEM.
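That admissible blocks of such kernels are numerically low rank can be checked directly. The sketch below is an illustration with an assumed toy geometry, not the paper's benchmark surfaces: it assembles a block of the |x − y|^−1 kernel for two well-separated point clusters and inspects its singular values.

```python
import numpy as np

# Sketch: an 'admissible' block of the kernel g(x, y) ~ |x - y|^-1 is
# numerically low rank.  The two point clusters below are an assumed toy
# geometry (regular grids), not the paper's benchmark surfaces.
def cluster(offset, n_side=5, h=0.25):
    ax = offset + h * np.arange(n_side)
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    return np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])

src = cluster(0.0)   # 125 points in [0, 1]^3
trg = cluster(4.0)   # 125 points in [4, 5]^3, well separated from src

diff = trg[:, None, :] - src[None, :, :]
A = 1.0 / np.linalg.norm(diff, axis=2)       # 125 x 125 kernel block
sv = np.linalg.svd(A, compute_uv=False)
print("sigma_25 / sigma_0 =", sv[25] / sv[0])  # rapid decay expected
```

Because the singular values decay rapidly for well-separated clusters, truncating the expansion at a small rank loses very little accuracy, which is the basis of the admissibility condition above.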
Each low-rank sub-matrix (leaf) h of the H-matrix Ã_H^K is stored in the
factorized form

    Ã_H^K |_h := Σ_{ν=1}^{k_h} v^ν (w^ν)^T,   (8)

where k_h is the rank of leaf h, and the linear system to be solved becomes

    Ã_H^K x = b.   (9)
To solve (9), we use a Krylov subspace method, such as the BiCGSTAB method.
The HACApK library [8] and ppOpen-APPL/BEM [9,10] implement these
computations for parallel and distributed computing environments using MPI
and OpenMP.
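The structure of an HMVM can be modeled schematically as a loop over leaves, where a dense leaf performs one GEMV and a low-rank leaf performs two. The leaf layout below is an illustrative stand-in for the HACApK data structure, not its actual interface.

```python
import numpy as np

# Schematic HMVM: the H-matrix is a list of leaves.  A dense leaf stores
# its block directly; a low-rank leaf stores factors (V, W) with
# block = V @ W.T.  Offsets (rs, cs) locate each leaf in the global matrix.
rng = np.random.default_rng(1)
n = 8
D1, D2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
V12, W12 = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
V21, W21 = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))

leaves = [
    ("dense", 0, 0, D1),
    ("dense", 4, 4, D2),
    ("lowrank", 0, 4, (V12, W12)),
    ("lowrank", 4, 0, (V21, W21)),
]

def hmvm(leaves, x, n):
    y = np.zeros(n)
    for kind, rs, cs, data in leaves:
        if kind == "dense":                 # one GEMV per dense leaf
            y[rs:rs + data.shape[0]] += data @ x[cs:cs + data.shape[1]]
        else:                               # two small GEMVs per low-rank leaf
            V, W = data
            y[rs:rs + V.shape[0]] += V @ (W.T @ x[cs:cs + W.shape[0]])
    return y

# Verify against the assembled dense matrix.
A = np.zeros((n, n))
A[0:4, 0:4], A[4:8, 4:8] = D1, D2
A[0:4, 4:8], A[4:8, 0:4] = V12 @ W12.T, V21 @ W21.T
x = rng.standard_normal(n)
print(np.allclose(hmvm(leaves, x, n), A @ x))
```

Every leaf thus reduces to small GEMV operations of varying sizes, which is why the GPU implementations discussed next revolve around batching many small GEMVs.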
Fig. 4. Pseudo code of the HMVM kernel (OMP kernel); the range of the loops in each
sub-matrix-vector multiplication depends on the target leaves.
Fig. 5. Pseudo code of the HMVM kernel with CUBLAS (CUBLAS kernel ); red text
indicates functions executed on the GPU. (Color figure online)
Fig. 6. Pseudo code of the HMVM kernel with MKL (MKL kernel ); red text indicates
MKL functions. (Color figure online)
Fig. 7. Pseudo code of the HMVM kernel with CUDA (SIMPLE kernel ); the entire
GPU kernel calculates a single GEMV, and each thread block calculates one GEMV
row.
In contrast, HMVM involves many small GEMV calculations. With GPUs, if the
CUBLAS GEMV function is used in HMVM, performance will be low because
of the lack of parallelism. Moreover, launching GPU kernels requires significant
time. In addition, the CUBLAS kernel launches a GEMV kernel for each leaf;
thus, the incurred overhead will increase execution time.
To evaluate and reduce this overhead, we implemented two HMVM kernels
using CUDA.
The first is a GEMV kernel that performs a single GEMV calculation using
the entire GPU, and each thread block calculates one GEMV row. Threads in
the thread block multiply the matrix and vector elements and calculate the
total value using a reduction operation. The reduction algorithm is based on an
optimized example code in the CUDA toolkit, which we refer to as the SIMPLE
kernel. Figure 7 shows the pseudo code of an HMVM kernel using the SIMPLE
kernel. The execution form (i.e., the number of thread block and threads per
block) is an optimization parameter.
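The per-row reduction used by the SIMPLE kernel can be modeled on the host: each "thread" forms a strided partial product, and the partials are then summed pairwise in log2(nthreads) steps, mirroring the shared-memory tree reduction of the CUDA toolkit example. This is a sketch of the idea, not the GPU code itself.

```python
# Host-side model of the block-wide reduction that forms one GEMV row:
# each 'thread' accumulates a strided partial product, then pairs of
# partials are summed in log2(nthreads) steps (tree reduction).
def tree_reduce_dot(row, x, nthreads=8):
    # Strided partial sums, one per simulated thread.
    partial = [sum(row[t] * x[t] for t in range(t0, len(row), nthreads))
               for t0 in range(nthreads)]
    stride = nthreads // 2
    while stride > 0:             # pairwise tree reduction
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]

row = [0.5, -1.0, 2.0, 3.5, 0.25, -2.5, 1.0, 4.0, -0.75, 0.125]
x = [1.0, 2.0, -1.0, 0.5, 4.0, -2.0, 0.0, 1.0, 2.0, 8.0]
print(tree_reduce_dot(row, x))
```

On the GPU the same pattern runs in shared memory with one synchronization per stride halving, which is why short rows leave most of the block idle and motivate the warp-per-row variant below.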
Note that many of the GEMV calculations in the HMVM are small; thus, it
is difficult for the SIMPLE kernel to obtain sufficient performance. To improve
performance, each GEMV should occupy only a part of the GPU so that several
GEMVs can be calculated in parallel. Thus, we developed an advanced kernel in
which a single GEMV is calculated by one thread block, and each row of the
GEMV is calculated by a single warp. Moreover, to eliminate data transfer
between the CPU and GPU, the two GEMV calculations in a low-rank
sub-matrix-vector multiplication are merged into a single GPU kernel, and
shared memory is used rather
Fig. 8. Pseudo code of the HMVM kernel with CUDA (ASYNC kernel); one thread
block calculates one GEMV, each warp in the thread block calculates a single row,
the two GEMV calculations of a low-rank sub-matrix-vector multiplication are merged
into a single GPU kernel, and multiple GPU kernels are launched asynchronously.
than global memory. We refer to this kernel as the ASYNC kernel. Figure 8
shows the pseudo code of an HMVM kernel with the ASYNC kernel. Here, the
execution form is also an optimization parameter, similar to the SIMPLE kernel;
however, the number of thread blocks is always one, and multiple GPU kernels
are launched concurrently using CUDA streams. Moreover, an atomic function
is used to merge the partial results, because atomic addition on the P100 is
sufficiently fast and this implementation simplifies memory management.
Fig. 9. Pseudo code of the HMVM kernel with CUDA (A1 kernel ); the entire HMVM
calculation is executed by a single GPU kernel.
this issue, we have created a new GPU kernel that calculates all sub-matrix-vector
multiplications in a single GPU kernel, which we refer to as the A1 kernel.
Figure 9 shows the pseudo code of an HMVM kernel with the A1 kernel. In
this kernel, each leaf is calculated by a single warp, and the basic algorithm
for each leaf is similar to that of the ASYNC kernel. Whereas the loop over the
leaves is executed on the CPU in the ASYNC kernel, this loop is executed on
the GPU in the A1 kernel. As with the ASYNC kernel, the execution form is an
optimization parameter.
Fig. 11. Pseudo code of the HMVM kernel with BATCHED MAGMA BLAS
(BATCHED kernel ).
accelerate many small BLAS calculations, batched BLAS has been proposed by
several BLAS library developers; for example, MKL, MAGMA, and CUBLAS
provide batched BLAS functions. Although gemm is the main target function
of batched BLAS, MAGMA also provides batched gemv functions for GPUs [13].
Figure 10 shows one of the interfaces of the batched gemv function in MAGMA.
We implemented an HMVM kernel using the batched gemv function of
MAGMA [14]. Figure 11 shows the pseudo code of our HMVM kernel with
BATCHED MAGMA BLAS, which we refer to as the BATCHED kernel. In this
kernel, the calculation information is constructed in the loop over leaves on the
CPU, and the GPU executes the entire HMVM calculation using the
magmablas_dgemv_vbatched_atomic function. Note that
magmablas_dgemv_vbatched_atomic is not an original BATCHED MAGMA
function; it is a function that we modified to use atomic addition to accumulate
the results.
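The semantics of a variable-size batched GEMV with atomic accumulation can be emulated on the host. The sketch below assumes a toy batch in which the output ranges overlap, so that the `+=` accumulation plays the role of atomicAdd; the data layout is illustrative and is not MAGMA's actual interface.

```python
import numpy as np

# Host-side emulation of a variable-size batched GEMV with atomic-style
# accumulation: batch entry i computes y[off_i : off_i + m_i] += A_i @ x_i.
# This mimics the semantics described for the modified
# magmablas_dgemv_vbatched_atomic call; the data layout here is assumed.
rng = np.random.default_rng(2)
shapes = [(3, 5), (2, 4), (4, 3)]            # (m_i, n_i) varies per entry
As = [rng.standard_normal(s) for s in shapes]
xs = [rng.standard_normal(s[1]) for s in shapes]
offs = [0, 1, 2]                             # overlapping outputs on purpose

y = np.zeros(6)
for A, x, off in zip(As, xs, offs):
    y[off:off + A.shape[0]] += A @ x         # '+=' models atomicAdd

# Reference: accumulate entry by entry with explicit scalar updates.
ref = np.zeros(6)
for A, x, off in zip(As, xs, offs):
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            ref[off + i] += A[i, j] * x[j]
print(np.allclose(y, ref))
```

Because all entries accumulate into a shared output vector, the order of batch entries does not matter, which is what makes the atomic variant convenient for HMVM.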
4 Performance Evaluation
arch=compute_60,code="sm_60,compute_60" for CUDA (nvcc). The MKL kernel
is called in a multi-threaded region; thus, the sequential MKL is linked. Note
that threaded MKL obtained nearly the same performance in all cases. Here,
we used MAGMA BLAS 2.2.
Moreover, to compare performance with other current processors, we measured
the performance on a Skylake-SP CPU and a Knights Landing processor. The
Skylake-SP processor is installed in the ITO supercomputer system (test
operation) at Kyushu University [16], and Intel compiler 17.0.4 was used with
the -qopenmp -O3 -xCORE-AVX512 -mkl=sequential compiler options. The
Knights Landing processor is installed in the Oakforest-PACS at JCAHPC [17],
and Intel compiler 17.0.4 was used with the -qopenmp -O3 -xMIC-AVX512
-mkl=sequential compiler options.
Table 1 shows the specifications of all target hardware. Note that we focus on
the performance of a single socket in this paper. The execution times of the
Broadwell-EP (BDW) and Skylake-SP (SKX) were measured using all 18 CPU
cores. The cluster mode of Knights Landing (KNL) was the quadrant mode, and
the memory mode was flat (i.e., only MCDRAM was used). The KNL execution
times were measured using 64 threads with scatter affinity, because
hyper-threading degrades performance.
The four matrices in Table 2 are the targets of this evaluation. These matrices
were generated from electric field analysis problems. The 10ts and 100ts matrices
were generated from a problem with a single spherical object, and the 216h
matrix was generated from a problem with two spherical objects. In addition, the
human_1x1 matrix was generated from a problem with a single human-shaped
object.
The sizes of the low-rank sub-matrices and small dense sub-matrices of each
target matrix are shown in Fig. 12, where the two left graphs for each matrix
show the sizes of the low-rank sub-matrices and the right graph shows the sizes
of the small dense sub-matrices.
With the 10ts and 100ts matrices, the sizes of the approximate matrices ndt
and ndl were less than approximately 200 (some were close to 700). Note that all
ranks kt were very small (the largest was 23). For the small dense matrices,
all matrix lengths were less than 100, and many were less than 30.
With the 216h and human_1x1 matrices, the aspect ratio of the small dense
matrices was similar to that of the 10ts and 100ts matrices. For the approximate
matrices, although kt was greater than for the 10ts and 100ts matrices, the
aspect ratio was similar. However, although nearly all ndt and ndl lengths were
less than 1,000, a few matrices had ndt and ndl lengths greater than 5,000.
Note that the sizes of these sub-matrices depend on the target matrix. Moreover,
the sizes are controlled by the matrix assembly algorithm and the HACApK
parameters; the above sizes were generated using the usual HACApK parameter
settings. Optimizing the matrix sizes is expected to affect HMVM performance,
and this will be the focus of future work.
In this subsection, we discuss execution time and performance. The dominant
part of the BiCGSTAB method is the HMVM; therefore, we focus on the
execution time of the HMVM. Moreover, the BiCGSTAB method does not modify
the matrix data in its own kernels; thus, the execution times do not include the
time required to transfer data between the main memory and the GPU in the
main iteration of the BiCGSTAB method. Figures 13 and 14 show the execution
times for the target matrices. All times are the average execution time of 100
HMVM calculations in 50 BiCGSTAB iterations. As mentioned in the previous
section, although the execution forms (i.e., grid layouts) of the SIMPLE, ASYNC,
and A1 kernels are optimization parameters, only
286 S. Ohshima et al.
the fastest cases are shown, and the chosen forms are listed in Table 3. Note that
the ASYNC kernel launches many GEMV kernels asynchronously, each with a
single thread block. The "#leaves" grids of the A1 kernel indicate that the number
of thread blocks is equal to the number of leaves, in which case the outermost
GPU kernel loop is eliminated.
Figure 13(a) shows the execution times of all measurements on Reedbush-H.
As can be seen, the CUBLAS, SIMPLE, and ASYNC kernels were too slow
for a performance comparison with the fast kernels. Figure 13(b) shows the same
graphs with a limited Y-axis. Regarding the CPU execution times, the OMP and
MKL kernels obtained nearly the same performance for all target matrices.
Focusing on the GPU execution times, it is clear that the execution times of the
CUBLAS, SIMPLE, and ASYNC kernels were much greater than those of the
A1 and BATCHED kernels. The major difference between these two groups is
the number of launched GPU kernels. As mentioned in the previous section,
launching a GPU kernel takes more time than calling a function on the CPU,
which causes the long execution times of the three slower kernels.
Although the ASYNC kernel improves performance compared with the CUBLAS
and SIMPLE kernels, it remains much slower than the A1 and BATCHED
kernels. In contrast, the A1 and BATCHED kernels obtained much higher
performance than the other kernels. The A1 kernel showed better performance
than the BATCHED kernel because the batched functions in MAGMA BLAS
include computations that are unnecessary for the HMVM calculation, or because
their execution form is unoptimized.
The execution time ratio of the A1 kernel to the OMP kernel (BDW) was
17.37% with the 10ts matrix, 24.22% with the 216h matrix, 18.18% with the
human_1x1 matrix, and 14.45% with the 100ts matrix; the corresponding ratios
for the BATCHED kernel were 34.39%, 32.07%, 31.43%, and 21.67%.
Considering that the calculation performance ratio of the CPU to the GPU was
11.4% and the memory performance ratio was 10.5%, there might be room to
improve the GPU implementation.
Figure 14 shows the execution times of the A1 kernel, BATCHED kernel, and
CPU (i.e., the OMP and MKL kernels) on the Broadwell-EP (BDW), Skylake-SP
(SKX), and Knights Landing (KNL). All KNL times are the average execution
time of 100 HMVM calculations in 50 BiCGSTAB iterations, whereas the SKX
times are averaged over more than 10 iterations because of the resource
limitations of the test operation.
On the SKX, both the OMP and MKL kernels required nearly 30% less execution
time than the OMP kernel on the BDW. Considering the specification gap
between the BDW and SKX, i.e., the SKX has 45% higher memory bandwidth
and more than 200% higher calculation performance than the BDW, a larger
improvement than 30% could be expected. However, the HMVM calculation
involves various loop lengths and is not well suited to AVX-512; therefore, the
obtained performance is not unexpected. On the KNL, in contrast, there are
large differences between the OMP and MKL kernels. It is difficult to determine
why the performance of the MKL kernel was unstable because the MKL
implementation is undisclosed; there might be room to improve the KNL
implementation. In terms of specification, the KNL has 7.6 times higher memory
bandwidth and 5.0 times higher calculation performance than the BDW, but the
OMP kernel on the KNL obtained only 34% to 57% better performance than the
OMP kernel on the BDW. As with the SKX, the KNL has much higher peak
performance than the BDW; thus, the performance improvement on the KNL is
insufficient relative to the specification gap.
Figure 15 shows the overall execution time of the BiCGSTAB method in
all target environments. Here, although the iteration counts were not exactly
the same, only the total computation times are compared. Nearly all vector
and matrix calculations of the BiCGSTAB method were executed on the GPU
with the A1 kernel. Similarly, nearly all vector and matrix calculations of the
Table 3. Best execution form of each GPU kernel: number of thread blocks and threads
per thread block
Fig. 15. BiCGSTAB execution times; for the BDW, SKX, and KNL, the fastest of the
OMP and MKL kernels is shown
BiCGSTAB method were executed on the GPU using MAGMA BLAS with the
BATCHED kernel. To simplify the evaluation, only the shortest times of the
OMP and MKL kernels for each hardware configuration are shown. The execution
times of the A1 and BATCHED kernels were shorter than those on the other
processors, and the A1 kernel demonstrated the fastest performance with all
target matrices. The SKX was faster than the BDW for all matrices and was
faster than the KNL with the 10ts, 216h, and human_1x1 matrices. However,
the KNL showed a shorter execution time than the BDW and SKX with the
100ts matrix. The reason may be that the 100ts matrix contains more large
sub-matrices than the other matrices; the larger the target matrix, the better
the relative performance the KNL obtained.
5 Conclusion
kernels into a single kernel (i.e., the A1 kernel) was the most effective
implementation, and it obtained much better performance than the compared
processors. Moreover, the BATCHED BLAS function of MAGMA, which
executes many BLAS computations in a single GPU kernel (the BATCHED
kernel), also obtained good performance. Although the performance of the
BATCHED kernel was lower than that of the A1 kernel for all matrices,
developing the A1 kernel requires much more time and labor than the BATCHED
kernel. Therefore, it would be beneficial to implement an A1 kernel-based HMVM
library in HACApK. In the best case, the execution time ratio of the A1 kernel
to the OMP kernel on the Broadwell-EP was 14.45%, obtained with the 100ts
matrix. Owing to the higher HMVM performance, the BiCGSTAB method with
the A1 kernel demonstrated better overall performance than the other kernels
on the GPU (i.e., the NVIDIA Tesla P100), as well as on the Skylake-SP and
Knights Landing hardware.
Various opportunities for future work remain. For example, we are currently
implementing and evaluating multi-GPU and multi-node environments. In such
environments, load balancing and data transfer optimization are very important,
and to accelerate data transfer between GPUs, the data layout in GPU memory
may have a significant impact on performance. Simplification of the partition
structure of H-matrices, as used in lattice H-matrices, would be required to
improve the load balance and communication pattern [18]. Currently, it is
uncertain whether the A1 and BATCHED kernels have good data layouts. The
data layouts of the approximate and small dense matrices can be modified by
configuring the parameters of the matrix assembly process in the HACApK
library, and the relationship between the data layout of the matrices and
performance is an interesting topic. Moreover, optimizing the execution forms
of the GPU kernel in the A1 kernel for various target matrices is an important
issue; thus, evaluating the performance on various matrices is required. In
addition, we are considering providing an implementation of our HMVM kernel
in HACApK.
References
1. Hackbusch, W.: A sparse matrix arithmetic based on H-matrices, Part I: intro-
duction to H-matrices. Computing 62, 89–108 (1999)
2. Chandrasekaran, S., Dewilde, P., Gu, M., Lyons, W., Pals, T.: A fast solver for
HSS representations via sparse matrices. SIAM J. Matrix Anal. Appl. 29(1), 67–81
(2006)
3. Ambikasaran, S.: Fast Algorithms for Dense Numerical Linear Algebra and Appli-
cations. Ph.D thesis, Stanford University (2013)
Optimization of Hierarchical Matrix Computation on GPU 291
4. Ho, K.L., Ying, L.: Hierarchical interpolative factorization for elliptic operators:
differential equations. Commun. Pure Appl. Math. 69(8), 1415–1451 (2016)
5. MAGMA: MAGMA (2017). http://icl.cs.utk.edu/magma/. Accessed 11 Aug 2017
6. Dongarra, J., Duff, I., Gates, M., Haidar, A., Hammarling, S., Higham, N.J., Hogg,
J., Lara, P.V., Zounon, M., Relton, S.D., Tomov, S.: A Proposed API for Batched
Basic Linear Algebra Subprograms. Draft Report, May 2016 (2016)
7. Batched BLAS: Batched BLAS (2017). http://icl.utk.edu/bblas/. Accessed 23 Dec
2017
8. Ida, A., Iwashita, T., Mifune, T., Takahashi, Y.: Parallel hierarchical matrices
with adaptive cross approximation on symmetric multiprocessing clusters. J. Inf.
Process. 22(4), 642–650 (2014)
9. Iwashita, T., Ida, A., Mifune, T., Takahashi, Y.: Software framework for parallel
BEM analyses with H-matrices using MPI and OpenMP. Procedia Comput. Sci.
108, 2200–2209 (2017). International Conference on Computational Science, ICCS
2017, Zurich, Switzerland, 12–14 June 2017
10. ppOpen-HPC: Open Source Infrastructure for Development and Execution of
Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with
Automatic Tuning (AT) (2017). http://ppopenhpc.cc.u-tokyo.ac.jp/ppopenhpc/.
Accessed 11 Aug 2017
11. NVIDIA: Tesla P100 Most Advanced Data Center Accelerator (2017). http://www.
nvidia.com/object/tesla-p100.html. Accessed 11 Aug 2017
12. NVIDIA: cuBLAS: CUDA Toolkit Documentation (2017). http://docs.nvidia.com/
cuda/cublas/. Accessed 11 Aug 2017
13. Dong, T., Haidar, A., Tomov, S., Dongarra, J.: Optimizing the SVD bidiagonaliza-
tion process for a batch of small matrices. Procedia Comput. Sci. 108, 1008–1018
(2017). International Conference on Computational Science, ICCS 2017, Zurich,
Switzerland, 12–14 June 2017
14. Yamazaki, I., Abdelfattah, A., Ida, A., Ohshima, S., Tomov, S., Yokota, R.,
Dongarra, J.: Analyzing Performance of BiCGStab with Hierarchical Matrix on
GPU cluster. In: 2018 IEEE International Parallel and Distributed Processing
Symposium (IPDPS) (2018, in press)
15. Information Technology Center, The University of Tokyo: Reedbush Super-
computer System (2017). http://www.cc.u-tokyo.ac.jp/system/reedbush/index-e.
html. Accessed 08 Aug 2017
16. Research Institute for Information Technology, Kyushu University: Supercomputer
system ITO (2018). https://www.cc.kyushu-u.ac.jp/scp/system/ITO/. Accessed
09 Feb 2018 (in Japanese)
17. JCAHPC (Joint Center for Advanced HPC): Oakforest-PACS (2018). http://
jcahpc.jp/eng/ofp_intro.html. Accessed 09 Feb 2018
18. Ida, A.: Lattice H-matrices on distributed-memory systems. In: 2018 IEEE Inter-
national Parallel and Distributed Processing Symposium (IPDPS) (2018, in press)
292 S. Ohshima et al.
Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
Erratum to: Machine Learning Predictions
for Underestimation of Job Runtime
on HPC System
Erratum to:
Chapter “Machine Learning Predictions for Underestimation
of Job Runtime on HPC System” in: R. Yokota and
W. Wu (Eds.): Supercomputing Frontiers, LNCS 10776,
https://doi.org/10.1007/978-3-319-69953-0_11
The original version of this chapter contained an error. The affiliation of the second
author was incorrect. The original chapter has been corrected.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appro-
priate credit to the original author(s) and the source, provide a link to the Creative Commons
license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
Author Index