Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC

Yang, Shuo; Wu, Kai; Qiao, Yifan; Li, Dong; Zhai, Jidong

arXiv:1705.05541 (cs)

[Submitted on 16 May 2017]

Abstract: Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memory (NVM) offers a way to build fault-tolerant HPC systems: because NVM is non-volatile, data in NVM-based main memory are not lost when the system crashes. However, because caches are volatile, data must be logged and explicitly flushed from the caches into NVM before a crash to ensure consistence and correctness, which can cause large runtime overhead.
In this paper, we introduce an algorithm-based method to establish crash consistence in NVM for HPC applications. We slightly extend application data structures or sparsely flush cache blocks, either of which introduces negligible runtime overhead. These extensions and flushes allow us to use algorithm knowledge to reason about data consistence, or to correct inconsistent data, when the application crashes. We demonstrate the effectiveness of our method on three algorithms: an iterative solver, dense matrix multiplication, and Monte-Carlo simulation. Based on a comprehensive performance evaluation across a variety of test environments, we show that our approach has very small runtime overhead (at most 8.2% and less than 3% in most cases), much smaller than that of traditional checkpointing, while having the same or lower recomputation cost.
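To make the idea of using algorithm knowledge to reason about consistence concrete, here is a minimal sketch for dense matrix multiplication. It is not the paper's implementation: the checksum scheme, the function names (`inconsistent_rows`, `repair`), and the crash model (one row of the result never reached NVM) are all assumptions chosen for illustration. The sketch relies on the identity that each row sum of C = A * B must equal the corresponding row of A times the row-sum vector of B, so stale rows can be detected and recomputed individually.

```python
# Hypothetical sketch: checksum-based consistency check for C = A * B.
# Not the authors' code; names and the crash model are assumptions.

def matmul(a, b):
    """Plain dense matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def row_sums(m):
    """Sum of each row, i.e. the matrix times the all-ones vector."""
    return [sum(row) for row in m]

def inconsistent_rows(a, b, c, tol=1e-9):
    """Return indices of rows of c whose sum disagrees with a[i] . (b * 1)."""
    b_ones = row_sums(b)
    bad = []
    for i, row in enumerate(c):
        expected = sum(a[i][k] * b_ones[k] for k in range(len(b_ones)))
        if abs(sum(row) - expected) > tol:
            bad.append(i)
    return bad

def repair(a, b, c):
    """Recompute only the rows flagged as inconsistent; return their indices."""
    bad = inconsistent_rows(a, b, c)
    for i in bad:
        c[i] = [sum(a[i][k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))]
    return bad

def demo():
    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
    c = matmul(a, b)
    c[1] = [0.0, 0.0, 0.0]  # simulate a row whose updates never reached NVM
    bad = repair(a, b, c)
    return bad, c == matmul(a, b)
```

Calling `demo()` flags and recomputes only the stale row, leaving the other rows untouched. A single row-sum checksum can miss errors that happen to cancel within a row, so a real scheme would likely need a stronger check.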

Comments:	12 pages
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1705.05541 [cs.DC]
	(or arXiv:1705.05541v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1705.05541

Submission history

From: Kai Wu
[v1] Tue, 16 May 2017 06:01:39 UTC (2,080 KB)