On data skewness, stragglers, and MapReduce progress indicators

Coppa, Emilio; Finocchi, Irene

arXiv:1503.09062 (cs)

[Submitted on 31 Mar 2015 (v1), last revised 2 Apr 2015 (this version, v2)]

Abstract: We tackle the problem of predicting the performance of MapReduce applications by designing accurate progress indicators that keep programmers informed of the percentage of computation time completed during the execution of a job. Through extensive experiments, we show that state-of-the-art progress indicators (including the one provided by Hadoop) can be seriously harmed by data skewness, load imbalance, and straggling tasks. This is mainly due to their implicit assumption that running time depends linearly on input size. We thus design a novel profile-guided progress indicator, called NearestFit, that does not rely on the linearity assumption and exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Our theoretical progress model requires fine-grained profile data, which can be very difficult to manage in practice. To overcome this issue, we compute accurate approximations of some of the quantities used in our model through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment on the Amazon EC2 platform over a variety of real-world benchmarks shows that NearestFit is practical with respect to space and time overheads and that its accuracy is generally very good, even in scenarios where competitors incur non-negligible errors and wide prediction fluctuations. Overall, NearestFit significantly improves the current state of the art in progress analysis for MapReduce.
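The abstract's central point, that a progress estimate proportional to input size can mislead under skew, can be illustrated with a small sketch. This is not the authors' NearestFit implementation: the function names, the `tolerance` parameter, the power-law model, and the profile numbers are all assumptions chosen for illustration. It predicts a task's running time from (input size, time) profiles of completed tasks, using the average of nearby profiles when any exist and a least-squares fitted curve otherwise, and then compares a size-proportional progress estimate with a time-based one.

```python
import math

def fit_power_law(samples):
    """Least-squares fit of time = a * size**b in log-log space (assumed model)."""
    xs = [math.log(s) for s, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def predict_time(samples, size, tolerance=0.1):
    """Average the times of profiles within `tolerance` (relative) of `size`;
    fall back to the fitted curve when no such neighbor exists."""
    near = [t for s, t in samples if abs(s - size) <= tolerance * size]
    if near:
        return sum(near) / len(near)
    a, b = fit_power_law(samples)
    return a * size ** b

def linear_progress(done_sizes, all_sizes):
    """Progress under the assumption that time is proportional to input size."""
    return sum(done_sizes) / sum(all_sizes)

def time_based_progress(samples, done_sizes, all_sizes):
    """Progress as predicted completed time over predicted total time."""
    done = sum(predict_time(samples, s) for s in done_sizes)
    total = sum(predict_time(samples, s) for s in all_sizes)
    return done / total

# Hypothetical profiles of a task whose time grows quadratically with input size.
profiles = [(10, 100.0), (20, 400.0), (40, 1600.0), (80, 6400.0)]
all_tasks = [10, 10, 10, 100]   # skewed job: one task holds most of the data
finished = [10, 10, 10]         # the three small tasks have completed

print(f"size-proportional estimate: {linear_progress(finished, all_tasks):.1%}")
print(f"time-based estimate:        {time_based_progress(profiles, finished, all_tasks):.1%}")
```

With these made-up numbers the size-proportional estimate is far more optimistic than the time-based one, because the single large task dominates the remaining running time.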

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Software Engineering (cs.SE)
Cite as:	arXiv:1503.09062 [cs.DC]
	(or arXiv:1503.09062v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1503.09062

Submission history

From: Emilio Coppa
[v1] Tue, 31 Mar 2015 14:29:13 UTC (795 KB)
[v2] Thu, 2 Apr 2015 15:55:15 UTC (1,111 KB)
