On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Kwasniewski, Grzegorz; Kabić, Marko; Ben-Nun, Tal; Ziogas, Alexandros Nikolaos; Saethre, Jens Eirik; Gaillard, André; Schneider, Timo; Besta, Maciej; Kozhevnikov, Anton; VandeVondele, Joost; Hoefler, Torsten

doi:10.1145/3458817.3476167


arXiv:2108.09337 (cs)

[Submitted on 20 Aug 2021 (v1), last revised 25 Apr 2023 (this version, v2)]


Authors: Grzegorz Kwasniewski, Marko Kabić, Tal Ben-Nun, Alexandros Nikolaos Ziogas, Jens Eirik Saethre, André Gaillard, Timo Schneider, Maciej Besta, Anton Kozhevnikov, Joost VandeVondele, Torsten Hoefler


Abstract: Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
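To make the communication figure in the abstract concrete, the following is a minimal sketch that evaluates the leading-order term N^3/(P*sqrt(M)) for a few memory sizes. It is an illustration under stated assumptions, not the authors' code: the function name, the sample values of N and P, and the choice of M as a multiple of N^2/P (just enough memory per processor to hold one share of a single copy of the matrix) are all hypothetical, and constant factors and lower-order terms are omitted.

```python
import math


def comm_volume_per_processor(n: int, p: int, m: float) -> float:
    """Leading-order term N^3 / (P * sqrt(M)) from the abstract.

    n: matrix dimension N, p: number of processors P,
    m: local memory size M in elements.
    Constant factors and lower-order terms are omitted (assumption).
    """
    if n <= 0 or p <= 0 or m <= 0:
        raise ValueError("n, p and m must all be positive")
    return n**3 / (p * math.sqrt(m))


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only for illustration.
    n, p = 16384, 1024
    m_min = n * n / p  # memory for one processor's share of one matrix copy
    for copies in (1, 2, 4):
        m = copies * m_min
        volume = comm_volume_per_processor(n, p, m)
        print(f"M = {copies} x N^2/P -> about {volume:.3e} elements per processor")
```

With M = N^2/P the expression simplifies to N^2/sqrt(P), and each fourfold increase in local memory halves the term, which is the trade-off a 2.5D decomposition exploits by using extra memory to reduce communication.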

Comments:	15 pages (including references), 11 figures. arXiv admin note: substantial text overlap with arXiv:2010.05975
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Computational Complexity (cs.CC); Performance (cs.PF)
Cite as:	arXiv:2108.09337 [cs.DC]
	(or arXiv:2108.09337v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2108.09337
Journal reference:	Published at Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2021 (SC'21)
Related DOI:	https://doi.org/10.1145/3458817.3476167

Submission history

From: Grzegorz Kwasniewski
[v1] Fri, 20 Aug 2021 19:24:34 UTC (2,313 KB)
[v2] Tue, 25 Apr 2023 10:58:53 UTC (2,313 KB)
