On Optimizing the Communication of Model Parallelism

Zhuang, Yonghao; Zhao, Hexu; Zheng, Lianmin; Li, Zhuohan; Xing, Eric P.; Ho, Qirong; Gonzalez, Joseph E.; Stoica, Ion; Zhang, Hao

arXiv:2211.05322 (cs)

[Submitted on 10 Nov 2022 (v1), last revised 18 Aug 2024 (this version, v2)]

Abstract: We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
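
To make the resharding problem described in the abstract concrete, the following is a minimal sketch, not the paper's system. It assumes a 1-D tensor split into equal contiguous shards on both meshes, and the names `shard_ranges` and `resharding_plan` are hypothetical. It only computes which slices each destination device must receive from which source devices; it does not model replication, network topology, or the broadcast-based scheduling that the paper proposes.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]          # half-open index range [start, stop)
Piece = Tuple[int, int, int]     # (source_device, start, stop)


def shard_ranges(length: int, num_devices: int) -> List[Range]:
    """Split [0, length) into num_devices equal, contiguous shards."""
    if length % num_devices != 0:
        raise ValueError("this sketch assumes the tensor divides evenly")
    size = length // num_devices
    return [(i * size, (i + 1) * size) for i in range(num_devices)]


def resharding_plan(length: int, num_src: int, num_dst: int) -> Dict[int, List[Piece]]:
    """For every destination device, list the pieces it needs and their owners.

    A piece is the overlap between one source shard and one destination shard.
    """
    src = shard_ranges(length, num_src)
    dst = shard_ranges(length, num_dst)
    plan: Dict[int, List[Piece]] = {}
    for d, (d0, d1) in enumerate(dst):
        pieces: List[Piece] = []
        for s, (s0, s1) in enumerate(src):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # the two shards overlap
                pieces.append((s, lo, hi))
        plan[d] = pieces
    return plan


if __name__ == "__main__":
    # A tensor of 8 elements moves from a 4-device mesh to a 2-device mesh.
    for dst_device, pieces in resharding_plan(8, 4, 2).items():
        print(dst_device, pieces)
```

In this toy setting each destination device receives from several source devices when the source mesh is more finely sharded, and one source device feeds several destination devices in the opposite case. When the destination layout also replicates data, the same slice must reach multiple receivers, which is where the multicast formulation in the abstract comes in.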

Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2211.05322 [cs.LG]
	(or arXiv:2211.05322v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.05322

Submission history

From: Yonghao Zhuang
[v1] Thu, 10 Nov 2022 03:56:48 UTC (986 KB)
[v2] Sun, 18 Aug 2024 16:24:54 UTC (1,278 KB)
