
moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators

Zewei Mo (Sun Yat-Sen University, Guangzhou, China)
Zejia Lin∗ (Northwestern Polytechnical University, Xi'an, China)
Xianwei Zhang (Sun Yat-Sen University, Guangzhou, China)
Yutong Lu (Sun Yat-Sen University, Guangzhou, China)
ABSTRACT
Arithmetic operators are now used in a wide spectrum of domains, including artificial intelligence, data analytics and scientific computing. Meanwhile, specialized hardware components that enable low-precision computing are increasingly deployed in GPUs and accelerators. While promising to boost performance, accelerating the operators on such hardware necessitates manually tuning the mixed-precision knobs to balance the performance and accuracy, which can be extremely challenging in real practice.

To address the issue, we present moTuner, an automatic framework for efficiently tuning mixed-precision operators. moTuner works at the compiler level to automatically enable mixed-precision computation, without involving any manual modifications of the source code and/or the operator library, thus significantly alleviating the programming burden. Owing to being implemented in the compilation phase, moTuner is widely applicable with lessened effort on the libraries. Further, moTuner adopts an optimized search strategy in tuning to effectively narrow down the configuration space. Evaluations on GEMM operators and real applications demonstrate that moTuner achieves performance improvements of up to 3.13x and 1.15x respectively, while guaranteeing considerably high accuracy.

CCS CONCEPTS
• Software and its engineering → Compilers; • Computer systems organization → Heterogeneous (hybrid) systems.

KEYWORDS
mixed-precision operator, auto-tuning, compiler, performance and accuracy, GPUs

ACM Reference Format:
Zewei Mo, Zejia Lin, Xianwei Zhang, and Yutong Lu. 2022. moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators. In 19th ACM International Conference on Computing Frontiers (CF'22), May 17-19, 2022, Torino, Italy. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3528416.3530231

∗ Work done while interning at Sun Yat-sen University.

1 INTRODUCTION
The past decade witnesses the ubiquitous development of compute-intensive machine learning, data processing and scientific computing applications, and the popularity of hardware computing devices including GPUs and domain-specific accelerators. The continuous demand for compute horsepower drives the emergence of dedicated operators and low-precision computing to raise the performance levels of resource-constrained systems. Accordingly, dense arithmetic operations like convolution and general matrix multiplication (GEMM) have been heavily invested in, leading to various pre-built libraries, such as cuBLAS [33], rocBLAS [2] and MKL [19]. As the operators are overwhelmingly adopted, different types of hardware devices with specialized units have been introduced to efficiently support the operator computations. Recent GPUs are adding extra units like Tensor Cores [34] and Matrix Cores [1], and accelerators including Google TPUs [22] and Cambricon's MLUs [6] are used in multiple domains. To further improve performance, low-precision computing is being actively explored from both hardware and software perspectives. On hardware, relaxed data types like 16-bit floating point (FP16) and 8-bit integer (INT8) are well supported in almost all recent GPUs and accelerators; on software, quantization and precision refinement [28, 41] were proposed and utilized to support mixed-precision computations.

Naturally, the software and hardware techniques should be combined to unleash the full potential of mixed-precision computing. As such, application programmers are required to scrutinize the source code to refactor the operator calls, which typically involves multiple parameters to tune for optimal mixed-precision settings with acceptable errors and higher performance. Besides, the aforementioned quantization and precision refinement are usually implemented inside the underlying libraries, thus necessitating the application programmers to revise the library and/or cooperate with the library providers. However, the low-level libraries are often highly optimized and pre-defined by the hardware vendors, and can even be closed-source for proprietary reasons. Alternatively, the quantization and precision refinement steps can be migrated to application code instead, thus leaving all changes to the end programmers, which inevitably increases the programming burden and further impedes mixed-precision usage. For either option, programmers have to tune the knobs to trade off the speedup and output accuracy, which commonly incurs non-trivial manual effort and expertise. Even worse, the prevalent use of heterogeneous computing platforms composed of CPUs, GPUs and accelerators frequently demands portability of the mixed-precision procedures, rendering recurring engineering efforts.


To reduce the programming burden of exploiting mixed-precision computation in large programs, and to decouple applications from the underlying operator libraries, we devise moTuner to efficiently tune mixed-precision settings for operators in programs. It uses an optimized tuning strategy and user-defined error thresholds to tune the mixed-precision setting of each operator. When tuning, it applies the compiler to automatically analyze dependencies among operators and collects profiling information to help narrow down the search space of mixed-precision settings. In the end, it generates a program with higher performance and acceptable error. moTuner targets operator libraries for GPUs and allows self-defined approaches for quantization and precision refinement to be embedded into the compiler, which eases the effort of using various libraries when programming on heterogeneous systems.

In summary, the contributions of this paper are:
• we highlight the importance of removing the obstacles to mixed-precision uses, and then propose to consolidate the application- and library-level changes into the compiler, which makes the whole procedure transparent to end users.
• our compiler-based framework moTuner automatically enables mixed-precision by identifying and wrapping up the operator calls, and meanwhile adaptively and effectively selects the appropriate parameters to well support the desired mixed-precision computations.
• experimental evaluations on multiple GEMMs and real-world applications show that the proposed framework effectively speeds up the execution, while restraining the accuracy loss to low levels.

The rest of the paper is organized as follows. Section 2 introduces the background and motivation. Section 3 elaborates the proposed designs. Section 4 presents the experimental methodology, and Section 5 analyzes our experimental results. Section 6 discusses related work. Section 7 concludes the paper.

2 BACKGROUND AND MOTIVATION
In this section, we first introduce representative operators together with their mixed-precision computations, and then the basic compilation process. Next, we present the motivation for compiler-based automatic tuning to support mixed-precision operators.

2.1 Operators and Mixed-precision
As the requirements of machine learning and scientific computing applications continuously grow, the high computation demands drive GPUs to become the de facto computing platforms. Increasing problem complexity and data volume further motivate research into optimizing compute resources from both hardware and software perspectives, with examples of dedicated arithmetic units and operator libraries. In recent years, operators have come to dominate the whole execution of machine learning applications, and are even frequently involved in scientific computing workloads. Among the diverse operators, GEMM, D = αAB + βC, is a representative one used in many neural networks like CNNs [24] and ResNet [15], and in linear algebra applications like HPL [31], NTChem [35] and Laghos [11].

The operator in these applications usually runs in high-precision floating-point arithmetic (e.g., FP64, FP32). However, as quantization methods and error-control analysis mature, various efforts focus on taking advantage of low-precision (e.g., FP16, FP21 and INT8) arithmetic to speed up programs [4, 16, 21, 39], and many operators have mixed-precision implementations, like ReLU [40], GEMM [28] and Winograd convolution [3]. These low-precision representations occupy fewer GPU resources than high-precision ones, trading a more limited value range and coarser-grained precision for higher performance. To further boost operator efficiency, different levels of floating-point precision are now supported in recent GPUs (e.g., MI100 [1] and A100 [34]) to operate on data with reduced sizes. The precision of a mixed-precision operator's input and output can be represented as Pl/Ph, where the former is the low precision of the input and the latter is the high precision of the output.

To achieve this, developers first transform high-precision data to low-precision, which is called quantization [12]. Then they utilize hardware and software capable of exploiting low-precision calculation, like Tensor Cores and VNNI [18], to produce results. In the end, de-quantization is performed to turn the produced low-precision data back to high-precision. This method has already been widely employed in AI applications due to their insensitivity to precision [29], greatly reducing resource pressure and improving performance. But it is difficult to exploit in scientific computing applications because they are strict about precision loss.
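To make this quantize-compute-de-quantize flow concrete, the following is a minimal C++ sketch. The gemm_int8 routine stands in for a vendor low-precision kernel, and the symmetric per-matrix scaling is an illustrative assumption, not the exact scheme of any particular library.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-matrix quantization: map FP32 values onto INT8 with
// one scale factor derived from the largest magnitude in the matrix.
float quantize(const std::vector<float>& src, std::vector<int8_t>& dst) {
    float amax = 0.0f;
    for (float v : src) amax = std::max(amax, std::fabs(v));
    const float scale = (amax > 0.0f) ? 127.0f / amax : 1.0f;
    dst.resize(src.size());
    for (size_t i = 0; i < src.size(); ++i)
        dst[i] = static_cast<int8_t>(std::lround(src[i] * scale));
    return scale;
}

// Placeholder for a vendor kernel exploiting low-precision units
// (e.g., Matrix Cores); accumulates INT8 products into INT32.
void gemm_int8(const int8_t* A, const int8_t* B, int32_t* C,
               int m, int n, int k);

// Pl/Ph = INT8/FP32: quantize, run the low-precision GEMM, de-quantize.
void mixed_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int m, int n, int k) {
    std::vector<int8_t> qA, qB;
    const float sA = quantize(A, qA);   // quantization [12]
    const float sB = quantize(B, qB);
    std::vector<int32_t> acc(static_cast<size_t>(m) * n);
    gemm_int8(qA.data(), qB.data(), acc.data(), m, n, k);
    for (size_t i = 0; i < acc.size(); ++i)  // de-quantization to FP32
        C[i] = static_cast<float>(acc[i]) / (sA * sB);
}
```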
2.2 Compilation Procedure
[Figure 1: General compilation procedure. Source (*.cpp) is pre-processed (*.i), compiled (front end → *.bc IR → optimization passes → code generation → *.s), assembled (*.o) and linked with the operator library into an executable (*.out).]

Operator usage requires assistance from compilers, which bridge the high-level code to the underlying hardware. Figure 1 shows the compilation procedure by which C++ code using operators is turned into an executable file through LLVM [27]. In the beginning, the source code is handled by the pre-processor to process the included files and macro definitions. Then the pre-processed file is delivered to the compiler. The compiler is typically partitioned into three phases: front end, middle end and back end. The front end checks syntax correctness and turns the pre-processed code into an intermediate representation (IR) file (i.e., a bitcode file). IR is the data structure and code used internally by the compiler to represent source code, which eases the application of numerous general optimizations. The middle end uses multiple passes to analyze or optimize the IR, each performing one specific operation on it. Next, the back end uses the code generator to parse the optimized IR file and generate assembly for the target hardware platform. Later, the assembly file is turned into a relocatable object file by the assembler. In the end, the linker links the object file with shared libraries, including the operator libraries provided by hardware vendors, to produce the executable file. A shared library is produced in advance by compiling with the -shared flag and is linked with the target file to provide implementations for the invoked operator APIs. Thus, the executable file is able to unleash the computing power of the hardware by using operator libraries.


2.3 Motivation
2.3.1 Automate Mixed-precision. To bring mixed-precision operators into play when optimizing programs, developers need to manually modify several parts of the pipeline in Figure 1. In the source code, developers are required to write the quantization kernel functions, choose which operators should run in mixed-precision, and then replace them with mixed-precision versions. This can be particularly time-consuming when working on large programs with thousands of lines of code, like HPL [31] and CFD [32]. Further, to better trade off performance and precision in each operator without sacrificing much generality and convenience, developers implement their own mixed-precision refinement approach as dynamic shared objects based on existing operator libraries. The precision refinement approach utilizes the mixed-precision operator from vendors to calculate a result and applies additional calculations on it to accomplish the refinement. But the shared objects may need to be rebuilt whenever the dependent operator libraries are updated, which costs considerable time and effort to maintain.

[Figure 2: An example use of the mixed-precision GEMM operator. The original FP32/FP32 program calling GEMM(A,B,C,D), GEMM(D,G,A,F) and GEMM(D,F,G,G) is rewritten to call M_GEMM with per-call refinement parameters (p1,...,p9), and is compiled through the LLVM system against a quantization implementation and the shared operator libraries.]

What's more, using such a framework requires manual parameter tuning to gain the best trade-off between performance and correctness. Figure 2 illustrates the procedure of applying mixed-precision operators in an application. Here M_GEMM is a mixed-precision GEMM framework with precision refinement, providing parameters (e.g., three per call [28]) for developers to tune to strike a balance between performance and accuracy. In order to employ it, developers need to replace the original code with the one using M_GEMM and set the parameters for each call. Assuming that each M_GEMM has M setting combinations, the tuning space is O(M^3) in this sample, but it can be as huge as O(M^N) when the program has N GEMMs. So, to gain a mixed-precision program with acceptable error and the highest performance, developers have to expend huge effort carefully tuning the setting for each operator. Developers are then also required to build their own operator libraries based on the ones provided by vendors and link them with the compiled file of the modified code, which further increases the complexity of usage.
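The code-level change that Figure 2 depicts looks roughly like the sketch below. M_GEMM and the per-call parameter triples follow the figure; the exact semantics of each triple come from the refinement framework [28].

```cpp
// Original program: every GEMM runs in FP32/FP32.
GEMM(A, B, C, D);                 // D = alpha*A*B + beta*C
GEMM(D, G, A, F);
GEMM(D, F, G, G);

// Hand-converted program: each call is replaced by the mixed-precision
// M_GEMM, and every call gets its own refinement triple to tune.
M_GEMM(A, B, C, D, p1, p2, p3);
M_GEMM(D, G, A, F, p4, p5, p6);
M_GEMM(D, F, G, G, p7, p8, p9);
```

With M candidate values per call, these three calls already span O(M^3) combinations; choosing the triples by hand is exactly the burden moTuner is designed to remove.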
2.3.2 Control Error. Because low-precision data has a more limited range and coarser-grained precision than high-precision data, there is an urgent need to re-design existing algorithms to harness high-performance mixed-precision hardware within acceptable error in multiple domains. This requires a lot of domain-specific knowledge and engineering effort from developers. For instance, in linear algebra applications running on GPUs, like Cholesky factorization [20] and HPL-AI [17], GMRES [38] is applied to accomplish the precision refinement. Operators provided by GPU vendors capture most of the computation in them. So, unlike the focus of prior works [7, 13, 14, 25], most error in these applications is introduced or propagated by mixed-precision operators rather than by individual instructions, and accumulates in the final results of the programs.

[Figure 3: The maximum absolute error distribution of matrix G under all mixed-precision setting combinations. Density vs. maximum absolute error (0 to 400); a strict threshold (red line) admits few qualified settings, while a flexible threshold (green line) admits many before the intolerable-error region.]

For example, matrix G in Figure 2 is the output of the third mixed-precision M_GEMM, which is implemented based on [28]. Figure 3 shows the maximum absolute error distribution of matrix G under all combinations of precision refinement settings. Density represents the occurrence frequency of an error, which equals the frequency of settings producing this error. Assume that a scientific computing program contains the source code shown in Figure 2 and takes matrix G as its output. When the maximum tolerable absolute error of its output is as marked by the red line, few qualified mixed-precision settings of these three M_GEMMs exist, so it is hard to efficiently control the error of such a program by tuning these parameters. But if the program has a more flexible error threshold (the green line) for its output, as in a neural network, more qualified settings are available, which eases the burden of controlling the error. Because the setting tuning for programs containing mixed-precision operators varies with the error requirements of different applications or inputs, we need an efficient tool to help control error in different scenarios.

3 COMPILER-BASED AUTO-TUNING
We introduce moTuner, a novel auto-tuning approach to support mixed-precision operators while balancing performance and accuracy. It analyzes the dependencies among multiple mixed-precision operators using the compiler, and then applies an optimized tuning strategy to efficiently determine the appropriate setting of each operator under a given error threshold. In the end, moTuner produces a program with mostly optimal performance.


3.1 Design Overview
[Figure 4: An overview of moTuner. Source files are llvm-linked into one IR file, which flows through the Marker, Adjuster and Finalizer passes; the optimized IR then goes through code generation, assembling and linking with the operator library to produce the optimized executable running on the GPU.]

Figure 4 presents the overall architecture of moTuner. The input is a linked IR file of all source code files. The output is a program with improved performance and constrained error.

The flow of moTuner can be divided into three parts: Marker, Adjuster, and Finalizer. As the forefront component, Marker extends the compiler to tag each mixed-precision operator with a unique identifier and inserts helper functions into the IR to dump runtime information during execution. Then Marker turns the processed IR file into a new program, which is executed once to dump the information. Next, Adjuster comes into play to analyze the dependencies among executed operators in the IR and to tune the mixed-precision setting for each operator. Allowing customizable error thresholds to control the error of each operator, it finds the mixed-precision setting with high performance and acceptable error for each operator. In the end, it generates a file describing the recommended mixed-precision settings for the executed operators. The third part is Finalizer, which directs the compiler to replace the original operators by mixed-precision ones with the recommended settings.

3.1.1 Marker. In this part, Marker assigns a unique identifier to every execution of each operator in LLVM IR and generates a new program. The identifier for each operator consists of two parts: the ordered ID of its first execution and the counter of its executions. The new program is then executed once to dump the identifiers and the corresponding outputs under the original precision.

[Figure 5: A loop with three operators and their identifiers for different iteration counts. (a) A loop body where operator 1 reads C and writes B, operator 2 reads B and writes C, and operator 3 reads and writes C. (b) Dumped identifiers 1_1, 2_1, 3_1 when the iteration runs once. (c) Dumped identifiers 1_1, 2_1, 3_1, 1_2, 2_2, 3_2 when the iteration runs twice.]

Here we give an example to examine the identifier. Assuming that the loop body in Figure 5(a) is executed twice, Figure 5(c) shows the dumped identifiers of each execution of every operator. The two dumped outputs of operator 1 will have the identifiers 1_1 and 1_2: the first number is the ordered ID and the second one is the counter. Considering that an operator may accept a different input every time, the error produced in each execution can differ a lot. moTuner uses the identifier to pinpoint the output of each operator's every execution and to obtain the output error of each execution.
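The run-time half of this instrumentation is easy to picture in code. Below is a minimal sketch, where dump_identifier is a hypothetical helper standing in for the logic Marker inserts into the IR (the real pass emits equivalent IR rather than C++ source).

```cpp
#include <cstdio>
#include <map>

// Each static operator site carries an ordered ID assigned at its first
// execution; a per-site counter then distinguishes repeated executions,
// yielding identifiers such as 1_1 and 1_2 for operator 1 run twice.
static std::map<int, int> exec_count;   // ordered ID -> executions so far

void dump_identifier(int ordered_id, const float* output, size_t n) {
    const int counter = ++exec_count[ordered_id];
    char path[64];
    std::snprintf(path, sizeof(path), "op_%d_%d.bin", ordered_id, counter);
    // Persist this execution's FP32 output for later error comparison.
    if (FILE* f = std::fopen(path, "wb")) {
        std::fwrite(output, sizeof(float), n, f);
        std::fclose(f);
    }
}
```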
3.1.2 Adjuster. In this phase, the original linked IR file and the data dumped before are taken as input. Instead of trying all combinations of the operators' settings, we design an optimized tuning strategy using the dumped outputs and identifiers of each operator under the original precision. It assists moTuner to efficiently narrow down the search space and find the appropriate mixed-precision settings for the operators. The tuning strategy consists of two main parts: related operators analysis and optimized adjustment, both detailed in Section 3.2. In the end, Adjuster generates a file describing the recommended mixed-precision settings for all operators. All mixed-precision settings are guaranteed to satisfy the given error threshold and not degrade performance.

3.1.3 Finalizer. In this part, Finalizer takes the setting file generated by Adjuster as input and replaces the executed operators by mixed-precision ones with the corresponding mixed-precision settings in LLVM IR. Then it turns the optimized IR file into an executable file for developers.

3.2 Tuning Optimizations
3.2.1 Related Operators Analysis. The related operators of an operator are those that directly or indirectly propagate error to its outputs (including itself). That is, an operator's related operators (except itself) are executed before it and have a data dependency with its input. Related operators help to identify which operators' mixed-precision settings should be upgraded if one operator raises unacceptable error even in the original precision. To obtain an operator's related operators in Adjuster, we leverage the identifiers dumped in the Marker phase. Once an identifier of operator a occurs before one of operator b in the dumped list and there exists a data dependency between a and b in static code analysis, a is considered a related operator of b. Here we assume that LLVM already constructs the data dependency graph of values for us.


Figure 5(a) shows a loop body consisting of three operators. In static code analysis, there exists a data dependency between any two of them. Assuming that the body of the loop runs only once, we get the identifier list in Figure 5(b) after the Marker phase. As the directly related operator of operator 3, operator 2 produces error in its calculation and propagates it to operator 3 through the value C. As an indirectly related operator, operator 1 first propagates error to operator 2 through the value B; after the calculation of operator 2, the error is brought to operator 3 via the value C again. This helps moTuner determine that operator 1 does not rely on operator 3, even though a data dependency exists between them in static code analysis. But when the body of the loop is executed twice, Figure 5(c) shows the corresponding dumped identifier list: the identifier of operator 3's first run occurs before that of operator 1's second run. This means that error produced by operator 3 in the first run will be propagated to operator 1 in the second run through C, which makes operator 3 related to operator 1.
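In code, the related-operator test can be phrased as below. This is a sketch assuming the dumped identifier list and a static data-dependency predicate are available; the DFS over direct relations mirrors what Section 3.3 describes for collecting indirect relations.

```cpp
#include <set>
#include <vector>

struct Exec { int op; int counter; };          // one dumped identifier

// a is directly related to b if some execution of a precedes an
// execution of b in the dump AND a's output feeds b's input statically.
bool directly_related(int a, int b, const std::vector<Exec>& dump,
                      bool (*has_data_dep)(int, int)) {
    if (!has_data_dep(a, b)) return false;
    for (size_t i = 0; i < dump.size(); ++i)
        if (dump[i].op == a)
            for (size_t j = i + 1; j < dump.size(); ++j)
                if (dump[j].op == b) return true;
    return false;
}

// DFS over direct relations collects indirect ones as well, so that
// repair_setting can upgrade every operator feeding error into `op`.
void collect_related(int op, const std::vector<std::set<int>>& direct,
                     std::set<int>& related) {
    for (int pred : direct[op])
        if (related.insert(pred).second)
            collect_related(pred, direct, related);
}
```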
3.2.2 Optimized Adjustment. We devise the optimized adjustment strategy to track the accumulation of error along its propagation path, which is the execution path of the operators, and to adjust the mixed-precision settings as soon as an error larger than the threshold occurs. This greatly reduces the search space of tuning mixed-precision settings for operators. Table 1 introduces the notations we use and Algorithm 1 details the optimized tuning approach.

Table 1: Variable definition.
O   | Ordered ID set of run operators
T_i | Cost time list of the i-th operator under all settings
E_i | Error list of the i-th operator under all settings
ET  | Error list of each tuned operator in the last run
S   | Candidate mixed-precision settings of the i-th operator
UE  | User-defined error threshold
NID | ID of the next operator to tune
lv  | Number of levels of mixed-precision settings

Algorithm 1: Optimized tuning algorithm.
Data: O, lv, S, UE.
Result: Optimized mixed-precision settings for operators.
 1  E ← {}; T ← {}; NID ← O_0;
 2  for i = 0 to len(O) + lv - 1 do
 3      if i <= len(O) - 1 then
 4          R_i ← DDA(O_i);                /* get related operators of O_i */
 5          t_max ← float_max; k_best ← -1;
 6          for k = 0 to len(S) - 1 do
 7              err ← E_i[k]; t ← T_i[k];
 8              if err <= UE and t <= t_max then
 9                  t_max ← t; k_best ← k;
10          if k_best != -1 then
11              set_operator(S[k_best], i);  /* set O_i with the k_best-th setting */
12          else
13              repair_setting(R_i, O);      /* tune related operators to lower error */
14          if O_i == NID then
15              wrap_operator(O_i);          /* try all mixed-precision settings for O_i */
16      for j = 0 to len(ET) - 1 do
17          R_j ← DDA(O_j);
18          if ET_j >= UE then
19              repair_setting(R_j, O);
20      NID ← O_{i+1};
21      ET, E_i, T_i ← execute();            /* rerun the program */

To achieve this approach, we first classify the mixed-precision settings into lv levels. The higher the level of a setting, the higher the accuracy and the larger the performance penalty it produces. For example, mixed-precision settings consisting of INT8/FP32, FP16/FP32 and FP32/FP32 can be levelized into three levels, where INT8/FP32 is the lowest level and FP32/FP32 the highest. Lines 5-9 of Algorithm 1 then find, for an operator, a mixed-precision setting with the highest performance and acceptable error. If no such setting exists, the error produced by the related operators is too large and this operator cannot produce a qualified result even in FP32 precision. As a result, lines 12-13 upgrade the settings of the related operators to the next higher level. To verify whether the new error is lower than the given threshold, we need extra runs to dump the error of these operators. The number of extra runs equals the number of levels minus one, which guarantees that we are able to generate qualified settings for the related operators.
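Lines 5-13 of Algorithm 1 translate almost directly into code. The sketch below shows that selection-or-repair step in isolation, with set_operator and repair_setting as hypothetical hooks matching Table 1's notation.

```cpp
#include <limits>
#include <vector>

// Pick the fastest setting whose error meets the user threshold UE
// (Algorithm 1, lines 5-9); fall back to repairing related operators
// when no candidate setting qualifies (lines 12-13).
void adjust_operator(int i,
                     const std::vector<double>& err,   // E_i per setting
                     const std::vector<double>& time,  // T_i per setting
                     double UE,
                     void (*set_operator)(int op, int setting),
                     void (*repair_setting)(int op)) {
    double t_best = std::numeric_limits<double>::max();
    int k_best = -1;
    for (size_t k = 0; k < err.size(); ++k) {
        if (err[k] <= UE && time[k] <= t_best) {
            t_best = time[k];
            k_best = static_cast<int>(k);
        }
    }
    if (k_best != -1)
        set_operator(i, k_best);   // best qualified setting found
    else
        repair_setting(i);         // upgrade related operators one level
}
```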
[Figure 6: Optimized adjustment of three operators in the worst case. The fourth adjusting step upgrades all three operators from INT8/FP32 to FP16/FP32, and the fifth upgrades them to FP32/FP32, dumping performance, errors and settings after each run.]

Here we provide an example. We assume that the body of the loop in Figure 5(a) runs only once and that operators 1 and 2 have been tuned to INT8/FP32 after the first three adjusting procedures, as in Figure 6. Also, the performance and error of operator 3 under different settings have been dumped. In the fourth procedure, the dumped information of operator 3 is taken as input. If moTuner detects that operator 3 cannot satisfy the accuracy requirement under any setting, moTuner will turn operators 1, 2 and 3 into FP16/FP32 mode. Then the new program is executed to validate whether the upgraded setting is qualified. If not, their settings are upgraded to FP32/FP32, and the latest program is executed again to verify. Because the original program is in FP32/FP32, the latest program always produces a qualified error, which guarantees the correctness of the program optimized by moTuner. With this approach, Adjuster has a time complexity of O(N + lv - 1), where N is the number of operators.

3.3 Implementation
Our approach is built in HIPCC, an open-source LLVM compiler with full support for HIP. HIP is a heterogeneous programming framework compatible with both AMD and Nvidia GPUs, thus all the code we operate on and analyze is implemented in HIP. We now discuss the main components of our implementation: (1) the GEMM and precision refinement framework, and (2) the optimization passes.


• GEMM and precision refinement framework. We target the FP16/FP32 GEMM in HIPBLAS as the mixed-precision operator and build the precision refinement framework based on QUANTENSOR [28]. (τ1, τ2, τ3) is its parameter triple for trading off performance and accuracy. When (τ1, τ2, τ3) = (0, 0, 0), it runs a GEMM under FP16/FP32 without precision refinements. To take advantage of the optimized adjustment, the refinement parameters are levelized into 3 levels: the lowest with τ1 + τ2 < 2, the higher with (τ1, τ2, τ3) = (1, 1, 0), and the highest in FP32 precision. For the extra memory space needed by the precision refinement framework to store low-precision data, we collect the size needed by each GEMM in Marker and allocate the largest one for use in Adjuster and Finalizer.
• Optimization passes. Marker, Adjuster and Finalizer are implemented as separate optimization passes in LLVM. To record the result, ID and dimensions of each GEMM in Marker, we use a wrapper function to replace the original GEMM function. This wrapper function not only performs the GEMM under FP32 precision with the same input, but also dumps the ID, dimensions and result of the original GEMM to the file system. To find all directly and indirectly related operators of an operator, we maintain a directly-related-operators set for each operator and apply a DFS algorithm to detect its indirectly related operators (a sketch of such a pass appears after this list).
4 EXPERIMENTAL METHODOLOGY 𝑋 −𝑋


4.1 Platform and Workloads 𝐸𝛾 (𝑋, 𝑋 ) = ′
𝐹
(2)
𝑋 𝐹
4.1.1 Environments. Settings of error threshold are categorized into eight kinds, which
We conduct experiments on the computing platform with configura- are used to tune operators in the following experiments and listed in
tions being summarized in Table 2. All programs are compiled using Table 3. These thresholds cover a wide range for usual applications.
with [email protected] with option -O3, which is commonly switched
Table 3: Categories of different error thresholds.
on to achieve high performance.
Table 2: Specifications for the computing platform. Error Kind Value Error Threshold Category

Hardware Software 𝐸𝛾 0.05 E1


EPYC 7302 (Freq.: Operating 𝐸𝛾 0.005 E2
CPU CentOS 7.9 𝐸𝛾 0.0005 E3
3.0-3.3 GHz) System
MI100 (FP32 Perf.: 𝐸𝛾 0.00005 E4
Operator [email protected] 𝐸𝛿 100 E5
GPU 46.1 TFLOPS, FP16
Library [email protected] 𝐸𝛿 10 E6
Perf.: 184.6 TFLOPS)
𝐸𝛿 1 E7
Memory 32 GB Compiler [email protected]
𝐸𝛿 0.1 E8

4.1.2 Benchmarks. As for the accuracy of Micro, we consider the 1 − 𝐸𝛾 of operators


We test three benchmarks to demonstrate that moTuner can au- with the most data dependencies. In CF, 1 − 𝐸𝛾 of the result matrix
tomatically and efficiently tune mixed-precision setting of each is used to measure accuracy. For HPL-AI, because it only validates
GEMM to improve performance while guaranteeing desired accu- whether the scaled residual of final result is less than 16, we mea-
racy. These workloads all use FP32/FP32 GEMM by default and set sure its accuracy by 1−𝐸𝛾 of the scaled residual. For a measurement
FP16/FP32 as the target mixed-precision setting. The first one is the of tuning effectiveness, we denote the execution count instead of
micro-benchmarks (Micro), contains varying number of GEMMs search time as the tuning effort because time spent in hand opti-
(e.g., 3, 6 and 9 1 ). Figure 7 shows the dependency in the tested micro- mization is not objective. And we select the log10 (𝑒𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑐𝑜𝑢𝑛𝑡)
benchmarks. Also to validate the moTuner ’s robustness for data, we as the metric.
generate matrices with following data distributions: 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 0.5),
𝑈 𝑛𝑖 𝑓 𝑜𝑟𝑚(−0.5, 0.5) and 𝑅𝑎𝑛𝑑𝑜𝑚(−1, 1). 5 RESULTS AND ANALYSIS
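Both metrics are straightforward to compute; the following is a reference sketch assuming row-major flattened matrices of equal size.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// E_delta: maximum elementwise absolute difference between the FP32
// reference X and the mixed-precision result Xp (Equation 1).
double max_abs_error(const std::vector<float>& X,
                     const std::vector<float>& Xp) {
    double e = 0.0;
    for (size_t i = 0; i < X.size(); ++i)
        e = std::max(e, std::fabs((double)X[i] - (double)Xp[i]));
    return e;
}

// E_gamma: relative error in the Frobenius norm,
// ||X - Xp||_F / ||X||_F (Equation 2).
double rel_frobenius_error(const std::vector<float>& X,
                           const std::vector<float>& Xp) {
    double diff = 0.0, ref = 0.0;
    for (size_t i = 0; i < X.size(); ++i) {
        const double d = (double)X[i] - (double)Xp[i];
        diff += d * d;
        ref  += (double)X[i] * (double)X[i];
    }
    return std::sqrt(diff) / std::sqrt(ref);
}
```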
5 RESULTS AND ANALYSIS
In this section, we present and analyze the experimental results of the aforementioned workloads. We evaluate moTuner from three perspectives: performance and accuracy of the optimized programs, automation effectiveness, and sensitivity to input settings. We studied and compared the following schemes:


• Baseline. We run the program in FP32/FP32 once.
• Exhaust. This method exhaustively searches all setting combinations for the executed operators in the program and selects the one bringing the highest performance with acceptable error.
• PriorK. With expertise or prior knowledge, the search space of Exhaust can be effectively narrowed. As such, PriorK is aware of the error and performance of each operator under every setting in advance. In tests for Micro, we randomly select among the top 1% of settings which produce acceptable error for each operator. In tests for the real-world applications, we randomly choose M/2 of the fastest 50% of settings with acceptable error for each operator, where M indicates the total setting combinations for all operators. For both, we use the average performance and accuracy of all chosen settings as the result.
• moTuner. We use moTuner to optimize the program once.

5.1 Performance and Accuracy
Figure 8 compares the speedup and accuracy of the optimized Micro whose GEMM count is 9 (i.e., GEMM-1, ..., GEMM-9) and whose input matrices follow the Normal distribution. Figure 8(a) demonstrates that moTuner gains up to 3.13x speedup and 1.72x speedup on average, mostly the same as Exhaust and higher than PriorK. But for E6, the performance gained by moTuner is less than Exhaust, because moTuner sets the (τ1, τ2, τ3) of GEMM-5 to (1, 1, 0) while GEMM-5 can still satisfy the error requirement with (τ1, τ2, τ3) = (0, 0, 0); Exhaust detects such a situation and gives GEMM-5 the higher-performance setting. Figure 8(b) shows the accuracy of GEMM-3 in Baseline and in the tuned programs. When the error category is E1 or E2, moTuner achieves lower accuracy than PriorK, but still higher than 99.4%. For the remaining error categories, moTuner achieves almost 100% accuracy.

[Figure 8: Speedup and accuracy of Micro gained by different schemes under varying error thresholds. (a) Speedup. (b) Accuracy.]

Figure 9 covers the speedup and accuracy of CF optimized by the different schemes. The input setting of CF is (40960, 8192). Figure 9(a) shows that moTuner obtains a mean speedup of 1.153x in all error categories except E4. When the parameter level of the mixed-precision GEMM with ts = 8192 is lower than that of (τ1, τ2, τ3) = (1, 1, 0), the E_γ of its result is above 0.00005, so optimizing under E4 theoretically requires upgrading it to (1, 1, 0). But because the mixed-precision GEMM with (τ1, τ2, τ3) = (1, 1, 0) is slower than the FP32/FP32 GEMM due to type-casting cost, moTuner decides to keep it in FP32 precision for E4 and brings no performance degradation. For the accuracy of CF shown in Figure 9(b), moTuner is capable of achieving over 99.99% accuracy under all given thresholds, whereas PriorK brings slight advantages under some error thresholds compared to moTuner.

[Figure 9: Speedup and accuracy of CF tuned by different schemes under varying error thresholds. (a) Speedup. (b) Accuracy.]

Figure 10 presents the speedup and accuracy of the baseline of HPL-AI's GEMM part together with the versions optimized by the different schemes. The input setting is (24576, 8192). Figure 10(a) shows that while PriorK can only gain 1.97x speedup on average, moTuner is able to acquire a 2.92x mean speedup, the same as Exhaust. Figure 10(b) demonstrates that moTuner achieves over 92% accuracy in all situations while PriorK achieves 98%. But because HPL-AI only validates whether the scaled residual is smaller than 16, and the one from programs tuned by moTuner is 0.003551, moTuner is able to generate qualified HPL-AI programs with faster GEMMs.

[Figure 10: Speedup of the GEMM part in HPL-AI and accuracy of HPL-AI tuned by different schemes under varying error thresholds. (a) Speedup. (b) Accuracy.]

5.2 Automation Efficiency
In this section, we demonstrate how efficient moTuner is when tuning the benchmarks, compared to the two manually assisted methods. The inputs of these benchmarks are kept the same as those used in Section 5.1. As discussed before, the efforts are estimated using execution counts. Figure 11 illustrates the tuning efforts needed by the different schemes: Exhaust demands the most tuning effort and moTuner requires the least. Also, Exhaust and PriorK both require manual code modification, while moTuner provides an end-to-end automatic optimization.

[Figure 11: Average tuning effort needed by the three schemes under all error thresholds.]

5.3 Sensitivity Studies


In this section, we present how moTuner performs on programs with varied inputs. First, we test on Micro with different data distributions of input matrices and different GEMM counts. Figure 12 shows that moTuner obtains 2.67x speedup with over 99.9% accuracy on average for Micro with differing inputs. Figure 13 shows how moTuner performs on CF and HPL-AI with changing inputs: moTuner gains a mean speedup of 1.10x and 1.19x on CF and the GEMM part of HPL-AI respectively, while maintaining over 99% accuracy on average under E3 and E6.

[Figure 12: Speedup and accuracy of different micro-benchmarks gained by moTuner under E3 and E6; the input setting is denoted by the number of GEMMs and the data distribution (N: Normal, U: Uniform, R: Random). (a) Speedup. (b) Accuracy.]

[Figure 13: Speedup and accuracy of real applications gained by moTuner with different inputs under E3 and E6. (a) Speedup. (b) Accuracy.]

6 RELATED WORK
Error analysis of floating-point: to evaluate whether a program can take advantage of mixed-precision floating-point to gain performance, a variety of works have been proposed to predict the error of floating-point arithmetic. They can be divided into two categories: dynamic analysis and static analysis. Dynamic analysis requires running programs to gather the necessary information. A dynamic analysis approach [5] is presented to find potential risks of degrading accuracy. Lam et al. [26] apply dynamic analysis to detect cancellation errors in floating-point calculations. These dynamic analysis methods aim only at instructions and incur significant tuning time. moTuner belongs to this category but targets only the error of operators instead of the error of instructions. Static analysis methods, in contrast, only analyze the characteristics of the code to gather information. Darulova proposes Xfp [10] to gain a sub-optimal solution without performing an exhaustive exploration. Rosa [8, 9] is a source-to-source compiler which provides a precision mix for a given program on real values. It introduces a contract-based programming paradigm based on the Scala functional programming language, which is not suitable for most existing scientific computing applications. AMPT-GA [23] implements a static analysis aimed at identifying strongly connected variables in the dependency graph. These works lack sensitivity to different inputs in varying degrees, while moTuner can generate the appropriate mixed-precision setting for different inputs.

Mixed-precision tuning: to lessen the programming burden, several efforts have focused on automatic generation of mixed-precision programs on CPU and GPU. For CPU programs, Precimonious [36] proposes delta debugging to narrow the search space over single, double and long precision. A follow-up work devises blame analysis [37] to reduce the search space further. While these two focus on individual variables, HiFP-Tuner [14] targets groups of variables constructed using community structure detection, which also reduces the search space. Similarly, FPTuner [7] applies an SMT solver to tune groups of operations and predict an error upper bound. For GPU programs, GPUMixer [25] tunes mixed-precision settings of floating-point operations. Although GPUMixer is performance-driven, it can neither utilize mixed-precision operators nor help precision refinement frameworks to be applied, as moTuner does. GPU-FPtuner [13] takes into account code patterns prone to error propagation, but it only supports 32- and 64-bit floating-point arithmetic, while moTuner supports FP16 and INT8. ADAPT [30] uses algorithmic differentiation to estimate error with a reduced search space. moTuner targets GPU programs and is orthogonal to these works.

7 CONCLUSION
The paper proposes moTuner, a compiler-based auto-tuning approach for mixed-precision operators. moTuner automatically handles both the configuration knobs and the quantization/refinement operations, thus easing the programming burden. Further, an efficient tuning strategy is applied to strike a balance between performance improvement and output quality. We test moTuner on a micro-benchmark with multiple GEMMs, Cholesky factorization and HPL-AI, and the preliminary results demonstrate that moTuner can efficiently obtain speedups of up to 3.13x, 1.15x and 2.92x, respectively. In the future, we plan to extend moTuner to support more complicated operators.

ACKNOWLEDGMENT
We thank the anonymous reviewers for their constructive comments and suggestions. This research was supported by the National Natural Science Foundation of China (#62102465, #U1811461), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, the Major Program of Guangdong Basic and Applied Research (#2019B030302002), the Guangdong Natural Science Foundation (#2018B030312002), and the CCF-Baidu Open Fund (CCF-BAIDU OF2021032).


REFERENCES
[1] AMD. 2021. AMD Instinct™ MI100 Accelerator. Retrieved 2022-01 from https://www.amd.com/en/products/server-accelerators/instinct-mi100.
[2] AMD. 2021. AMD rocBLAS Library. Retrieved 2022-01 from https://github.com/ROCmSoftwarePlatform/rocBLAS.
[3] Barbara Barabasz, Andrew Anderson, et al. 2020. Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks. ACM Trans. Math. Softw. 46, 4, 37:1–37:33.
[4] Chaim Baskin, Natan Liss, et al. 2021. UNIQ: Uniform Noise Injection for Non-Uniform Quantization of Neural Networks. ACM Trans. Comput. Syst. 37, 1-4, 4:1–4:15.
[5] Florian Benz, Andreas Hildebrandt, et al. 2012. A dynamic program analysis to find floating-point accuracy problems. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 453–462.
[6] Cambricon. 2021. Cambricon MLU Accelerator. Retrieved 2022-01 from https://www.cambricon.com/.
[7] Wei-Fan Chiang, Mark Baranowski, et al. 2017. Rigorous floating-point mixed-precision tuning. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages. ACM, 300–315.
[8] Eva Darulova and Viktor Kuncak. 2014. Sound compilation of reals. In The 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 235–248.
[9] Eva Darulova and Viktor Kuncak. 2017. Towards a Compiler for Reals. ACM Trans. Program. Lang. Syst. 39, 2, 8:1–8:28.
[10] Eva Darulova, Viktor Kuncak, et al. 2013. Synthesis of fixed-point programs. In Proceedings of the International Conference on Embedded Software. IEEE, 22:1–22:10.
[11] Veselin A. Dobrev, Tzanio V. Kolev, et al. 2012. High-Order Curvilinear Finite Element Methods for Lagrangian Hydrodynamics. SIAM Journal on Scientific Computing 34, 5, B606–B641.
[12] Amir Gholami, Sehoon Kim, et al. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. CoRR abs/2103.13630.
[13] Ruidong Gu and Michela Becchi. 2020. GPU-FPtuner: Mixed-precision Auto-tuning for Floating-point Applications on GPU. In 27th IEEE International Conference on High Performance Computing. IEEE, 294–304.
[14] Hui Guo and Cindy Rubio-González. 2018. Exploiting community structure for floating-point precision tuning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 333–343.
[15] Kaiming He, Xiangyu Zhang, et al. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 770–778.
[16] Tsuyoshi Ichimura, Kohei Fujita, et al. 2018. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE/ACM, 49:1–49:11.
[17] ICL. 2021. The High Performance LINPACK for Accelerator Introspection (HPL-AI) benchmark. Retrieved 2022-01 from https://bitbucket.org/icl/hpl-ai/src/main/.
[18] Intel. 2019. Introduction to Intel deep learning boost on second generation Intel Xeon scalable processors. Retrieved 2022-01 from https://software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html.
[19] Intel. 2021. Intel MKL. Retrieved 2022-01 from https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html.
[20] Emmanuel Jeannot. 2012. Performance Analysis and Optimization of the Tiled Cholesky Factorization on NUMA Machines. Proceedings - International Symposium on Parallel Architectures, Algorithms and Programming, 210–217.
[21] Weile Jia, Han Wang, et al. 2020. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE/ACM, 1–14.
[22] Norman P. Jouppi, Cliff Young, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. Association for Computing Machinery, 1–12.
[23] Pradeep V Kotipalli, Ranvijay Singh, et al. 2019. AMPT-GA: Automatic Mixed Precision Floating Point Tuning for GPU Applications. In Proceedings of the ACM International Conference on Supercomputing. Association for Computing Machinery, 160–170.
[24] Alex Krizhevsky, Ilya Sutskever, et al. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 60, 6, 84–90.
[25] Ignacio Laguna, Paul C. Wood, et al. 2019. GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications. In High Performance Computing - 34th International Conference Proceedings, Vol. 11501. Springer, 227–246.
[26] Michael O. Lam, Jeffrey K. Hollingsworth, et al. 2013. Dynamic Floating-Point Cancellation Detection. Parallel Comput. 39, 3, 146–155.
[27] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In 2nd IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 75–88.
[28] Guangli Li, Jingling Xue, et al. 2021. Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, 90–102.
[29] Stefano Markidis, Steven Wei Der Chien, et al. 2018. NVIDIA Tensor Core Programmability, Performance, Precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops. IEEE Computer Society, 522–531.
[30] Harshitha Menon, Michael O. Lam, et al. 2018. ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 48:1–48:13.
[31] netlib. 2021. HPL benchmark. Retrieved 2022-01 from https://www.netlib.org/benchmark/hpl/.
[32] Tomás Norton and Da-Wen Sun. 2006. Computational fluid dynamics (CFD) – an effective and efficient design and analysis tool for the food industry: A review. Trends in Food Science & Technology 17, 11, 600–620.
[33] NVIDIA. 2008. cuBLAS Library. Retrieved 2022-01 from https://docs.nvidia.com/cuda/cublas/.
[34] NVIDIA. 2021. NVIDIA A100 Tensor Core GPU. Retrieved 2022-01 from https://www.nvidia.com/en-us/data-center/a100.html.
[35] Riken. 2021. Comprehensive software for ab initio quantum chemistry calculations of large and complicated molecular systems. Retrieved 2022-01 from https://www.r-ccs.riken.jp/software_center/software/ntchem/overview/.
[36] Cindy Rubio-González, Cuong Nguyen, et al. 2013. Precimonious: Tuning Assistant for Floating-Point Precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, 27:1–27:12.
[37] Cindy Rubio-González, Cuong Nguyen, et al. 2016. Floating-Point Precision Tuning Using Blame Analysis. In Proceedings of the 38th International Conference on Software Engineering. Association for Computing Machinery, 1074–1085.
[38] Youcef Saad and Martin H. Schultz. 1986. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Statist. Comput. 7, 3, 856–869.
[39] Zhuoran Song, Bangqi Fu, et al. 2020. DRQ: Dynamic Region-based Quantization for Deep Neural Network Acceleration. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture. IEEE, 1010–1021.
[40] Huanrui Yang, Lin Duan, et al. 2021. BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization. CoRR abs/2102.10462.
[41] Zhaoyang Zhang, Wenqi Shao, et al. 2021. Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 12546–12556.
