Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

Luo, Yixin; Govindan, Sriram; Sharma, Bikash; Santaniello, Mark; Meza, Justin; Kansal, Aman; Liu, Jie; Khessib, Badriddine; Vaid, Kushagra; Mutlu, Onur

arXiv:1602.00729 (cs)

[Submitted on 1 Feb 2016 (v1), last revised 10 May 2018 (this version, v2)]

Abstract: This paper summarizes our work on characterizing application memory error vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory (HRM), which was published in DSN 2014, and examines the work's significance and future potential. Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory reliability for different applications.
Toward this end, in our DSN 2014 paper, we make three main contributions to enable highly reliable servers at low datacenter cost. First, we develop a new methodology to quantify the tolerance of applications to memory errors. Second, using our methodology, we perform a case study of three new data-intensive workloads (an interactive web search application, an in-memory key-value store, and a graph mining framework) to identify new insights into the nature of application memory error vulnerability. Third, based on our insights, we propose several new hardware/software heterogeneous-reliability memory system designs to lower datacenter cost while achieving high reliability and discuss their trade-offs. We show that our new techniques can reduce server hardware cost by 4.7% while achieving 99.90% single-server availability.
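
To put the 99.90% single-server availability figure in perspective, the short sketch below converts an availability percentage into expected downtime per year. It is plain arithmetic and not the paper's methodology; the function name `annual_downtime_hours` and the 365-day year are assumptions made for illustration.

```python
# Assumption for illustration: a non-leap, 365-day year.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours per year a server is unavailable.

    Hypothetical helper, not taken from the paper: it only restates an
    availability percentage as the complementary fraction of a year.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage in [0, 100]")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.90% availability leaves 0.1% of the year as downtime.
    print(f"{annual_downtime_hours(99.90):.2f} hours/year")  # prints 8.76 hours/year
```

Under these assumptions, a 99.90% availability target corresponds to roughly 8.76 hours of downtime per server per year.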

Comments:	4 pages, 4 figures, summary report for DSN 2014 paper: "Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory"
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1602.00729 [cs.DC]
	(or arXiv:1602.00729v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1602.00729

Submission history

From: Yixin Luo
[v1] Mon, 1 Feb 2016 22:23:18 UTC (702 KB)
[v2] Thu, 10 May 2018 05:27:22 UTC (892 KB)
