BB-ML: Basic Block Performance Prediction using Machine Learning Techniques

Abdelkhalik, Hamdy; Aktar, Shamminuj; Arafa, Yehia; Barai, Atanu; Chennupati, Gopinath; Santhi, Nandakishore; Panda, Nishant; Prajapati, Nirmal; Turja, Nazmul Haque; Eidenbenz, Stephan; Badawy, Abdel-Hameed

arXiv:2202.07798 (cs)

[Submitted on 16 Feb 2022 (v1), last revised 12 Nov 2023 (this version, v3)]

Abstract: Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs): single-entry, single-exit code blocks that compilers use to break a large program into manageable pieces for analysis. We extrapolate the basic block execution counts of GPU applications, using the counts obtained at smaller input sizes to predict performance at large input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units, respectively, while the maximum observed error across all tested applications and units reaches 18.5%.
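The idea of learning a mapping from input size to basic block execution count, then extrapolating to larger inputs, can be illustrated with a much simpler stand-in than the PNN described above. The sketch below is not the authors' model: it fits a single-feature Poisson regression with a log link by iteratively reweighted least squares. The function names, the synthetic counts (a hypothetical block executed 3·n² times), and the chosen input sizes are all assumptions made for illustration.

```python
import math


def fit_poisson_loglinear(sizes, counts, iters=25):
    """Fit log(mu) = b0 + b1 * log(n) by IRLS (Poisson GLM, log link).

    Illustrative stand-in only; the paper trains a Poisson Neural Network.
    """
    xs = [math.log(n) for n in sizes]
    # Common GLM starting point: set the initial mean to the observed counts.
    etas = [math.log(max(c, 0.5)) for c in counts]
    b0 = b1 = 0.0
    for _ in range(iters):
        mus = [math.exp(e) for e in etas]
        # Working response for the log link; the IRLS weights equal mu.
        zs = [e + (c - m) / m for e, c, m in zip(etas, counts, mus)]
        sw = sum(mus)
        swx = sum(m * x for m, x in zip(mus, xs))
        swxx = sum(m * x * x for m, x in zip(mus, xs))
        swz = sum(m * z for m, z in zip(mus, zs))
        swxz = sum(m * x * z for m, x, z in zip(mus, xs, zs))
        # Closed-form solution of the 2x2 weighted least-squares system.
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
        etas = [b0 + b1 * x for x in xs]
    return b0, b1


def predict_count(b0, b1, n):
    """Predicted execution count of the block for input size n."""
    return math.exp(b0 + b1 * math.log(n))


# Synthetic training data (assumption): counts of one block at small inputs.
train_sizes = [16, 32, 64, 128, 256]
train_counts = [3 * n * n for n in train_sizes]

b0, b1 = fit_poisson_loglinear(train_sizes, train_counts)

# Extrapolate to an input size well outside the training range.
large_n = 4096
predicted = predict_count(b0, b1, large_n)
actual = 3 * large_n * large_n
rel_error = abs(predicted - actual) / actual
print(f"exponent={b1:.4f} predicted={predicted:.0f} actual={actual}")
```

Because the synthetic counts follow an exact power law, this toy fit recovers the scaling exponent and extrapolates almost perfectly. Real basic block counts depend on several inputs and on control flow, which is presumably why the paper uses a neural network rather than a single log-linear term.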

Comments:	Accepted at the 29th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2023)
Subjects:	Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2202.07798 [cs.LG]
	(or arXiv:2202.07798v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2202.07798

Submission history

From: Shamminuj Aktar
[v1] Wed, 16 Feb 2022 00:19:15 UTC (2,471 KB)
[v2] Fri, 18 Feb 2022 04:47:11 UTC (2,471 KB)
[v3] Sun, 12 Nov 2023 04:13:50 UTC (2,909 KB)
