A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs

Qararyah, Fareed; Wahib, Mohamed; Dikbayır, Doğa; Belviranli, Mehmet Esat; Unat, Didem

arXiv:2008.08636v1 (cs)

[Submitted on 19 Aug 2020 (this version), latest version 5 May 2021 (v2)]

Abstract: We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into single device memory. ParDNN decides a placement of the DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN and requires no modification at either the model level or the system-level implementation of operation kernels. It partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving super-linear scaling for both the batch size and training throughput. In comparison to related work (Mesh-TensorFlow and gradient checkpointing), ParDNN either outperforms or qualitatively improves upon them.
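
To make the problem statement concrete, the sketch below shows a toy version of memory-constrained operation placement. It is not ParDNN's algorithm, which this page does not describe; it is a simple greedy heuristic written only for illustration. The function names (`topological_order`, `greedy_place`), the uniform `comm_cost` charged per cross-device edge, and the single shared `capacity` per device are all assumptions introduced here.

```python
from collections import deque


def topological_order(succ):
    """Return the nodes of a DAG (node -> list of successors) in topological order."""
    indeg = {n: 0 for n in succ}
    for n in succ:
        for m in succ[n]:
            indeg[m] += 1
    ready = deque(sorted(n for n, d in indeg.items() if d == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(succ):
        raise ValueError("graph contains a cycle")
    return order


def greedy_place(succ, cost, mem, num_devices, capacity, comm_cost=1.0):
    """Toy heuristic (hypothetical, not ParDNN): place each operation on the
    device where it would finish earliest, skipping devices whose memory
    budget would be exceeded. Returns (placement, estimated makespan)."""
    preds = {n: [] for n in succ}
    for n in succ:
        for m in succ[n]:
            preds[m].append(n)

    free_at = [0.0] * num_devices   # time at which each device becomes idle
    used = [0.0] * num_devices      # memory already committed on each device
    finish = {}                     # finish time of each placed operation
    placement = {}                  # operation -> device index

    for n in topological_order(succ):
        best = None
        for d in range(num_devices):
            if used[d] + mem[n] > capacity:
                continue  # memory constraint would be violated on device d
            # Inputs from other devices arrive comm_cost later (assumed model).
            inputs_ready = max(
                (finish[p] + (0.0 if placement[p] == d else comm_cost)
                 for p in preds[n]),
                default=0.0,
            )
            end = max(inputs_ready, free_at[d]) + cost[n]
            if best is None or end < best[0]:
                best = (end, d)
        if best is None:
            raise ValueError(f"operation {n!r} does not fit on any device")
        end, d = best
        placement[n] = d
        finish[n] = end
        free_at[d] = end
        used[d] += mem[n]

    return placement, max(finish.values())


# A diamond-shaped graph: a feeds b and c, which both feed d.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
unit = {n: 1.0 for n in graph}
placement, makespan = greedy_place(graph, cost=unit, mem=unit,
                                   num_devices=2, capacity=2.0, comm_cost=0.5)
print(placement, makespan)
```

With a per-device capacity of 2 memory units, the four unit-sized operations cannot all share one device, so the heuristic is forced to split the graph and pay the cross-device transfer cost. A real partitioner has to weigh that trade-off far more carefully than this greedy pass does.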

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2008.08636 [cs.DC]
	(or arXiv:2008.08636v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2008.08636

Submission history

From: Fareed Qararyah
[v1] Wed, 19 Aug 2020 19:09:04 UTC (1,118 KB)
[v2] Wed, 5 May 2021 11:26:25 UTC (2,529 KB)