S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Hu, Yuezhou; Zhu, Jun; Chen, Jianfei

arXiv:2409.09099 (cs)

[Submitted on 13 Sep 2024 (v1), last revised 27 Dec 2024 (this version, v3)]

Abstract: Training deep neural networks (DNNs) is costly. Fortunately, NVIDIA Ampere and Hopper GPUs can run matrix multiplications twice as fast as their dense equivalents by exploiting 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because their pruning function is discontinuous. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and identify three drawbacks caused by this discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method that consists of two parts: continuously projecting weights to be 2:4 sparse, and rescaling the sparse weights with a fixed per-tensor scaling factor. In addition, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even to full-parameter models. Our toolkit is available at this https URL.
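
The contrast between a discontinuous and a continuous 2:4 pruning function can be illustrated with a minimal sketch in plain Python. The abstract does not give the form of the continuous projection or of the scaling factor, so the soft-thresholding rule and the least-squares scale below are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import List


def hard_threshold_2_4(w: List[float]) -> List[float]:
    """Keep the 2 largest-magnitude entries in each group of 4, zero the rest."""
    assert len(w) % 4 == 0, "length must be a multiple of 4"
    out: List[float] = []
    for i in range(0, len(w), 4):
        group = w[i:i + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(x if j in keep else 0.0 for j, x in enumerate(group))
    return out


def soft_threshold_2_4(w: List[float]) -> List[float]:
    """Assumed continuous 2:4 projection: shrink every magnitude in a group
    by the group's second-smallest magnitude and clamp at zero."""
    assert len(w) % 4 == 0, "length must be a multiple of 4"
    out: List[float] = []
    for i in range(0, len(w), 4):
        group = w[i:i + 4]
        # Threshold: second-smallest magnitude, so at least 2 entries become 0.
        t = sorted(abs(x) for x in group)[1]
        for x in group:
            shrunk = max(abs(x) - t, 0.0)
            out.append(shrunk if x >= 0 else -shrunk)
    return out


def least_squares_scale(w: List[float], s: List[float]) -> float:
    """Assumed per-tensor scale: the beta minimising sum((w - beta * s)^2).
    In this sketch it would be computed once and then kept fixed."""
    denom = sum(y * y for y in s)
    return sum(x * y for x, y in zip(w, s)) / denom if denom else 1.0


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest elementwise gap between two equally long lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Two weight groups that differ by a tiny perturbation of the middle entries.
w_a = [1.0, 0.5 + 1e-6, 0.5 - 1e-6, 0.1]
w_b = [1.0, 0.5 - 1e-6, 0.5 + 1e-6, 0.1]

# Hard thresholding flips which entry survives, so the output jumps.
hard_gap = max_abs_diff(hard_threshold_2_4(w_a), hard_threshold_2_4(w_b))
# Soft thresholding changes only by an amount comparable to the perturbation.
soft_gap = max_abs_diff(soft_threshold_2_4(w_a), soft_threshold_2_4(w_b))

w = [0.9, -0.5, 0.4, 0.1]
sparse = soft_threshold_2_4(w)
beta = least_squares_scale(w, sparse)
rescaled = [beta * y for y in sparse]
```

In this sketch, `hard_gap` is about 0.5 while `soft_gap` stays on the order of the perturbation, which is the kind of discontinuity the abstract attributes to hard-thresholding methods. Because soft thresholding shrinks the surviving weights, a scale factor such as `beta` is one way to compensate for the lost magnitude.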

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2409.09099 [cs.LG]
	(or arXiv:2409.09099v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2409.09099
Journal reference:	38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Submission history

From: Yuezhou Hu
[v1] Fri, 13 Sep 2024 08:29:36 UTC (2,522 KB)
[v2] Sun, 27 Oct 2024 14:15:32 UTC (2,160 KB)
[v3] Fri, 27 Dec 2024 09:30:18 UTC (2,160 KB)
