moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators

Zewei Mo, Zejia Lin∗, Xianwei Zhang, Yutong Lu
[email protected], [email protected], [email protected], [email protected]
Sun Yat-sen University, Guangzhou, China (Zewei Mo, Xianwei Zhang, Yutong Lu); Northwestern Polytechnical University, Xi'an, China (Zejia Lin)
ABSTRACT
Arithmetic operators are now used in a wide spectrum of domains, including artificial intelligence, data analytics and scientific computing. Meanwhile, specialized hardware components enabling low-precision computing are increasingly deployed in GPUs and accelerators. While promising to boost performance, accelerating the operators on such hardware necessitates manually tuning the mixed-precision knobs to balance performance and accuracy, which can be extremely challenging in real practice.
To address the issue, we present moTuner, an automatic framework for efficiently tuning mixed-precision operators. moTuner works at the compiler level to automatically enable mixed-precision computation, without involving any manual modification of the source code and/or the operator library, thus significantly alleviating the programming burden. Owing to being implemented in the compilation phase, moTuner is more widely applicable with lessened effort on the libraries. Further, moTuner adopts an optimized search strategy in tuning to effectively narrow down the configuration space. Evaluations on GEMM operators and real applications demonstrate that moTuner achieves performance improvements of up to 3.13x and 1.15x, respectively, while guaranteeing considerably high accuracy.

CCS CONCEPTS
• Software and its engineering → Compilers; • Computer systems organization → Heterogeneous (hybrid) systems.

KEYWORDS
mixed-precision operator, auto-tuning, compiler, performance and accuracy, GPUs

ACM Reference Format:
Zewei Mo, Zejia Lin, Xianwei Zhang, and Yutong Lu. 2022. moTuner: A Compiler-based Auto-tuning Approach for Mixed-precision Operators. In 19th ACM International Conference on Computing Frontiers (CF'22), May 17–19, 2022, Torino, Italy. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3528416.3530231

∗Work done while interning at Sun Yat-sen University.

CF'22, May 17–19, 2022, Torino, Italy
© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9338-6/22/05. https://doi.org/10.1145/3528416.3530231

1 INTRODUCTION
The past decade has witnessed the ubiquitous development of compute-intensive machine learning, data processing and scientific computing applications, and the popularity of hardware computing devices including GPUs and domain-specific accelerators. The continuous demand for compute horsepower drives the emergence of dedicated operators and low-precision computing to raise performance levels on resource-constrained systems. Accordingly, dense arithmetic operations like convolution and general matrix multiplication (GEMM) have been heavily invested in, leading to various pre-built libraries, such as cuBLAS [33], rocBLAS [2] and MKL [19]. As the operators are overwhelmingly adopted, different types of hardware devices with specialized units have been introduced to efficiently support the operator computations. Recent GPUs are adding extra units like Tensor Cores [34] and Matrix Cores [1], and accelerators including Google TPUs [22] and Cambricon's MLUs [6] are used in multiple domains. To further improve performance, low-precision computing is being actively explored from both hardware and software perspectives. On hardware, relaxed data types like 16-bit floating point (FP16) and 8-bit integer (INT8) are well supported in almost all recent GPUs and accelerators; on software, quantization and precision refinement [28, 41] were proposed and utilized to support mixed-precision computations.
Naturally, the software and hardware techniques should be combined to unleash the full potential of mixed-precision computing. As such, application programmers are required to scrutinize the source code to refactor the operator calls, which typically involves tuning multiple parameters to reach the optimal mixed-precision settings with acceptable errors and higher performance. Besides, the aforementioned quantization and precision refinement are usually implemented inside the underlying libraries, thus necessitating the application programmers to revise the library and/or cooperate with the library providers. However, the low-level libraries are often highly optimized and pre-defined by the hardware vendors, and can even be closed-source for proprietary reasons. Alternatively, the quantization and precision refinement steps can be migrated to the application code instead, thus leaving all changes to the end programmers, which inevitably increases the programming burden and further impedes mixed-precision usage. For either option, programmers have to tune the knobs to trade off the speedup and output accuracy, which commonly incurs non-trivial manual effort and expertise. Even worse, the prevalent use of heterogeneous computing platforms composed of CPUs, GPUs and accelerators frequently demands portability of the mixed-precision procedures, rendering recurring engineering efforts.
To reduce the programming burden when taking advantage of mixed-precision computation in large programs, and to decouple applications from unnecessary operator libraries, we devise moTuner to efficiently tune mixed-precision settings for operators in programs. It uses an optimized tuning strategy and user-defined error thresholds to tune the mixed-precision setting of each operator. When tuning, it applies the compiler to automatically analyze dependencies among operators and collects profiling information to help narrow down the search space of mixed-precision settings. In the end, it generates a program with higher performance and acceptable error. It also targets GPU operator libraries and allows self-defined approaches for quantization and precision refinement to be embedded into the compiler, which eases the effort of using various libraries when programming on heterogeneous systems.
In summary, the contributions of this paper are:
• We highlight the importance of removing the obstacles to mixed-precision uses, and propose to consolidate the application- and library-level changes into the compiler, which makes the whole procedure transparent to end users.
• Our compiler-based framework moTuner automatically enables mixed-precision by identifying and wrapping up the operator calls, and meanwhile adaptively and effectively selects the appropriate parameters to well support the desired mixed-precision computations.
• Experimental evaluations on multiple GEMMs and real-world applications show that the proposed framework effectively speeds up the execution, while restraining the accuracy loss to low levels.
The rest of the paper is organized as follows. Section 2 introduces the background and motivation. Section 3 elaborates the proposed designs. Section 4 presents the experimental methodology, and Section 5 analyzes our experimental results. Section 6 discusses related work. Section 7 concludes the paper.

2 BACKGROUND AND MOTIVATION
In this section, we first introduce representative operators together with their mixed-precision computations, and then the basic compilation process. Next, we present the motivation for compiler-based automatic tuning to support mixed-precision operators.

2.1 Operators and Mixed-precision
As the requirements of machine learning and scientific computing applications continuously grow, the high computation demands drive GPUs to become the de facto computing platforms. Increasing problem complexity and data volume also motivate research to optimize compute resources from both hardware and software perspectives, with examples of dedicated arithmetic units and operator libraries. In recent years, operators have come to dominate the whole execution of machine learning applications, and are even frequently involved in scientific computing workloads. Among the diverse operators, GEMM, 𝐷 = 𝛼𝐴𝐵 + 𝛽𝐶, is a representative one used in many neural networks like CNN [24] and ResNet [15], and in linear algebra applications like HPL [31], NTChem [35] and Laghos [11].
The operators in these applications usually run in high-precision floating-point arithmetic (e.g., FP64, FP32). However, as quantization methods and the analysis of error control mature, various efforts focus on taking advantage of low-precision (e.g., FP16, FP21 and INT8) arithmetic to speed up programs [4, 16, 21, 39], and many operators are implemented in mixed precision, like ReLU [40], GEMM [28] and winograd [3]. These low-precision representations with smaller sizes occupy fewer GPU resources than high-precision ones, leading to a more limited value range, coarser-grained precision and higher performance. To further boost operator efficiency, different levels of floating-point precision are now supported in recent GPUs (e.g., MI100 [1] and A100 [34]) to operate on data with reduced sizes. The precision of a mixed-precision operator's input and output can be represented as 𝑃𝑙/𝑃ℎ, where the former is the low precision of the input and the latter is the high precision of the output.
To achieve this, developers first need to transform high-precision data to low precision, which is called quantization [12]. Then they utilize hardware and software capable of exploiting low-precision calculation, like Tensor Cores and VNNI [18], to produce results. In the end, de-quantization is performed to turn the produced low-precision data back to high precision. This method has already been widely employed in AI applications due to their insensitivity to precision [29], greatly reducing resource pressure and improving performance. But it is difficult to exploit in scientific computing applications because they are strict about precision loss.

2.2 Compilation Procedure
Figure 1: General compilation procedure (pre-processor → compiler → assembler → linker; inside the compiler, the front end turns pre-processed code (*.i) into IR (*.bc), optimization passes transform the IR, and code generation emits assembly (*.s), with operator libraries linked into the final executable).

Operator usage requires assistance from compilers, which bridge the high-level code to the underlying hardware. Figure 1 shows the compilation procedure of how C++ code using operators is turned into an executable file through LLVM [27]. In the beginning, the source code is handled by the pre-processor to process the included files and macro definitions. Then the pre-processed file is delivered to the compiler. The procedure of the compiler is typically partitioned into three phases: front end, middle end and back end. The front end checks syntax correctness, then turns the pre-processed code into an intermediate representation (IR) file (i.e., a bitcode file). IR is the data structure and code used internally by the compiler to represent source code, which eases the application of numerous general optimizations. The middle end uses multiple passes to analyze or optimize the IR, each performing one specific operation on it. Next, the back end uses the code generator to parse the optimized IR file and generate assembly for the target hardware platform. Later, the assembly file is turned into a relocatable object file by the assembler. In the end, the linker links the object file with shared libraries, including the operator libraries provided by hardware vendors, to produce the executable file. A shared library is produced by compiling with the –shared flag in advance and is linked with the target file by the linker to provide implementations for the invoked operator APIs.
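The quantize–compute–de-quantize flow described in Section 2.1 can be sketched in a few lines. The snippet below is our own minimal illustration (not any vendor library's implementation) of symmetric per-tensor INT8 quantization with a single scale factor, the form commonly fed to low-precision units:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: map FP32 values into INT8 with one
// scale factor, so that value ≈ data[i] * scale. De-quantization inverts it.
struct Quantized {
    std::vector<int8_t> data;
    float scale;
};

Quantized quantize(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = (amax == 0.0f) ? 1.0f : amax / 127.0f;  // full INT8 range
    Quantized q{std::vector<int8_t>(x.size()), scale};
    for (size_t i = 0; i < x.size(); ++i)
        q.data[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    return q;
}

std::vector<float> dequantize(const Quantized& q) {
    std::vector<float> x(q.data.size());
    for (size_t i = 0; i < x.size(); ++i) x[i] = q.data[i] * q.scale;
    return x;
}
```

Each round-tripped value deviates from the original by at most one quantization step (the scale), which is exactly the coarse-grained precision loss that makes this scheme acceptable for error-tolerant AI workloads but delicate for scientific codes.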
Thus, the executable file is able to unleash the computing power of the hardware by using operator libraries.

2.3 Motivation
2.3.1 Automate Mixed-precision. To bring mixed-precision operators into play when optimizing programs, developers need to manually modify several parts in Figure 1. In the source code, developers are required to write the quantization kernel functions, choose which operators are to run in mixed precision, and then replace them with mixed-precision ones. This can be particularly time-consuming when working on large programs with thousands of lines of code, like HPL [31] and CFD [32]. Further, to better trade off performance and precision in each operator without sacrificing much generality and convenience, developers implement their own mixed-precision refinement approach as dynamic shared objects based on existing operator libraries. The precision refinement approach utilizes the mixed-precision operator from vendors to calculate a result and applies other calculations on it to accomplish the precision refinement job. But the shared objects may need to be rebuilt whenever the dependent operator libraries are updated, which costs a lot of time and effort to maintain.

Figure 2: An example use of mixed-precision GEMM operator (the original FP32/FP32 calls GEMM(A,B,C,D); GEMM(D,G,A,F); GEMM(D,F,G,G) are replaced by M_GEMM calls carrying extra tuning parameters p1, ..., p9, with quantization implemented in a shared operator library).

What is more, using such a framework requires manual parameter tuning to gain the best trade-off between performance and correctness. Figure 2 illustrates the procedure of applying mixed-precision operators in an application. Here M_GEMM is a mixed-precision GEMM framework with precision refinement, providing some parameters (e.g., three [28] in the red code) for developers to tune to strike a balance between performance and accuracy. In order to employ it, developers need to replace the original code with the one using M_GEMM and set the parameters for each call. Assuming that each M_GEMM has 𝑀 setting combinations, the tuning space is 𝑂(𝑀^3) in this sample, but it can be as huge as 𝑂(𝑀^𝑁) when the program has 𝑁 GEMMs. So, to gain a mixed-precision program with acceptable error and the highest performance, developers have to pay huge efforts in carefully tuning the setting for each operator. Developers are then required to build their own operator libraries based on the ones provided by vendors and link them with the compiled file of the modified code, which increases the complexity of usage.

2.3.2 Control Error. Due to the more limited range and coarser-grained precision of low-precision data compared with high-precision data, there is an urgent need to re-design existing algorithms to harness high-performance mixed-precision hardware within acceptable error in multiple domains. This requires a lot of domain-specific knowledge and engineering effort from developers. For instance, in linear algebra applications running on GPUs, like Cholesky factorization [20] and HPL-AI [17], GMRES [38] is applied to accomplish the precision refinement. The operators provided by GPU vendors capture most of the computation in them. So, unlike what prior works [7, 13, 14, 25] focus on, most errors in these applications are introduced or propagated by mixed-precision operators instead of instructions, and accumulate in the final results of the programs.

Figure 3: The maximum absolute error distribution of matrix G under all mixed-precision setting combinations (density vs. maximum absolute error, with a strict threshold, a flexible threshold and an intolerable-error region marked).

For example, matrix G in Figure 2 is the output of the third mixed-precision M_GEMM, which is implemented based on [28]. Figure 3 shows the maximum absolute error distribution of matrix G under all combinations of precision refinement settings. Density represents the occurrence frequency of an error, which equals the frequency of the settings producing this error. Assume that a scientific computing program contains the source code shown in Figure 2 and takes matrix G as its output. As the maximum tolerable absolute error of its output is marked by the red line, the qualified mixed-precision settings of these three M_GEMMs are few, so it is hard to efficiently control the error of such a program by tuning these parameters. But if the program has a more flexible error threshold (the green line) for its output, like a neural network, more qualified settings are available, which eases the burden of controlling the error. Because the setting tuning for programs containing mixed-precision operators varies according to the different error requirements of different applications or inputs, we need an efficient tool to help control the error in different scenarios.

3 COMPILER-BASED AUTO-TUNING
We introduce moTuner, a novel auto-tuning approach to well support mixed-precision operators and balance performance and accuracy. It analyzes the dependency among multiple mixed-precision operators using the compiler, and then applies an optimized tuning strategy to efficiently determine the appropriate setting of each operator under a given error threshold. In the end, moTuner produces a program with near-optimal performance.

3.1 Design Overview
Figure 4 presents the overall architecture of moTuner. The input is a linked IR file of all source code files. The output is a program with improved performance and constrained error.
The flow of moTuner can be divided into three parts: Marker, Adjuster, and Finalizer. As the forefront component, Marker extends the compiler to tag each mixed-precision operator with a unique identifier and inserts helper functions into the IR to dump runtime information during execution. Then Marker turns the processed IR file into a new program, which is executed once to dump the information. Next, Adjuster comes into play to analyze the dependency among the executed operators in the IR, as well as to tune the mixed-precision setting for each operator.
Figure 4: Overall architecture of moTuner. Source code files are linked by llvm-link into a single IR file; the Marker, Adjuster and Finalizer passes process it in turn, and the optimized IR file goes through the code generator, assembler and linker, together with the operator library, to produce the optimized executable running on the GPU (Cuda/Matrix/Tensor cores and GPU memory).
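The Marker step just outlined can be pictured as a wrapper around each operator call. The sketch below is hypothetical (the names marked_gemm, gemm_fp32 and g_dump are ours; the real pass rewrites LLVM IR rather than C++ source): each call site receives an ordered ID on its first execution plus a per-site execution counter, and the original-precision result is logged for Adjuster to compare against later.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One dumped record per operator execution: "<orderedID>_<counter>" plus
// the FP32 result that later serves as the accuracy reference.
struct DumpEntry {
    std::string id;
    std::vector<float> result;
};

static std::vector<DumpEntry> g_dump;
static std::map<const void*, std::pair<int, int>> g_sites;  // site -> (ID, count)

// Plain FP32 reference GEMM on n-by-n matrices: C = A * B.
void gemm_fp32(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

// Marker-style wrapper: assign the site an ordered ID on first execution,
// bump its counter, run the original-precision GEMM, and dump the result.
void marked_gemm(const void* site, const float* a, const float* b,
                 float* c, int n) {
    static int next_id = 1;
    auto it = g_sites.find(site);
    if (it == g_sites.end())
        it = g_sites.emplace(site, std::make_pair(next_id++, 0)).first;
    ++it->second.second;
    gemm_fp32(a, b, c, n);
    g_dump.push_back({std::to_string(it->second.first) + "_" +
                      std::to_string(it->second.second),
                      std::vector<float>(c, c + n * n)});
}
```

Running two call sites twice reproduces the identifier stream of Figure 5(c): 1_1, 2_1, 1_2, 2_2, so executions of the same static call with different inputs remain distinguishable.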
Allowing customizable error thresholds to control the error of each operator, it finds the mixed-precision setting with high performance and acceptable error for each operator. In the end, it generates a file describing the recommended mixed-precision settings for the executed operators. The third part is Finalizer, which directs the compiler to replace the original operators with mixed-precision ones using the recommended settings.

3.1.1 Marker. In this part, Marker assigns a unique identifier to every execution of each operator in the LLVM IR and generates a new program. The identifier for each operator includes two parts: the ordered ID of its first execution and the counter of its executions. Then the new program is executed once to dump the identifiers and the corresponding outputs under the original precision.

Figure 5: A loop with three operators and their identifiers for different iteration counts. (a) A loop with 3 operators (operator 1: input C, output B; operator 2: input B, output C; operator 3: input C, output C). (b) Identifiers when the iteration runs once: 1_1, 2_1, 3_1. (c) Identifiers when the iteration runs twice: 1_1, 2_1, 3_1, 1_2, 2_2, 3_2.

Here we give an example to examine the identifier. Assuming that the loop body in Figure 5(a) is executed twice, Figure 5(c) shows the dumped identifiers of each execution of every operator. The two dumped outputs of operator 1 will have the identifiers 1_1 and 1_2. The first number is the ordered ID and the second one is the counter. Considering that an operator may accept a different input every time, the error produced in each execution can differ a lot. moTuner uses the identifier to distinguish the output of each operator's every execution and obtain the output error of each execution.

3.1.2 Adjuster. In this phase, the original linked IR file and the data dumped before are taken as input. Instead of trying all combinations of the operators' settings, we design an optimized tuning strategy using the dumped outputs and identifiers of each operator under the original precision. It assists moTuner to efficiently narrow down the search space and find the appropriate mixed-precision settings for the operators. The tuning strategy consists of two main parts: the first is the related operators analysis and the second is the optimized adjustment. Details of these two will be discussed in Section 3.2. In the end, Adjuster generates a file describing the recommended mixed-precision settings for all operators. All mixed-precision settings are guaranteed to satisfy the given error threshold and not degrade the performance.

3.1.3 Finalizer. In this part, Finalizer takes the setting file generated by Adjuster as input and replaces the executed operators with mixed-precision ones using the corresponding mixed-precision settings in the LLVM IR. Then it turns the optimized IR file into an executable file for developers.

3.2 Tuning Optimizations
3.2.1 Related Operators Analysis. The related operators of an operator are those that directly or indirectly propagate error to its output (including the operator itself). This means that an operator's related operators (except itself) are executed before it and have a data dependency with its input. Related operators help to identify which mixed-precision settings of operators should be upgraded if one operator raises unacceptable error even in the original precision. To obtain an operator's related operators in Adjuster, we leverage the identifiers dumped in the Marker phase. Once an identifier of operator 𝑎 occurs before one of operator 𝑏 in the dumped list and there exists a data dependency between 𝑎 and 𝑏 in static code analysis, 𝑎 is considered a related operator of 𝑏. Here we assume that LLVM already constructs the data-dependency graph of values for us.
Figure 5(a) shows a loop body consisting of three operators. In static code analysis, there exists a data dependency between any two of them. Assuming that the loop body runs only once, we get the identifier list in Figure 5(b) after the Marker phase. As the directly related one of operator 3, operator 2 produces error in calculation …
… based on QUANTENSOR [28]. (𝜏1, 𝜏2, 𝜏3) is the parameter in it … out precision refinements. To take advantage of the optimized adjustment, the refinement parameters are levelized into 3 levels: the lowest one is with 𝜏1 + 𝜏2 < 2, the higher one is with (𝜏1, 𝜏2, 𝜏3) = (1, 1, 0), and the highest one is in FP32 precision. For the extra memory space needed by the precision refinement framework to store low-precision data, we collect the size of the extra memory space needed by each GEMM in Marker and allocate the biggest one for usage in Adjuster and Finalizer.
• Optimization passes. The Marker, Adjuster and Finalizer are implemented as separate optimization passes in LLVM. In order to record the result, ID and dimensions of each GEMM in Marker, we use a wrapper function to replace the original GEMM function. This wrapper function not only performs the GEMM under FP32 precision with the same input, but also dumps the ID, dimensions and result of the original GEMM to the file system. To find all directly and indirectly related operators for an operator, we maintain a directly related operators set for each operator and apply a DFS algorithm to detect its indirectly related operators.

4 EXPERIMENTAL METHODOLOGY
4.1 Platform and Workloads
4.1.1 Environments. We conduct experiments on the computing platform with configurations summarized in Table 2. All programs are compiled with [email protected] using option -O3, which is commonly switched on to achieve high performance.

Table 2: Specifications for the computing platform.

Figure 7: Dependency in micro-benchmarks containing 3, 6 and 9 GEMMs.

… execution counts of the GEMM operator. 𝑁 and 𝑡𝑠 refer to input matrix dimension and tile size, respectively.

4.2 Metrics
For performance measurement, we consider the execution-time speedup of the whole application when running Micro and CF. As for HPL-AI, we consider the speedup of its GEMM part, owing to the high number of non-operator calls. The execution time is the average of five repeated runs.
To measure the accuracy of an operator, we select the maximum absolute error (𝐸𝛿) and mean relative error (𝐸𝛾) of its result. Let 𝑋 be the matrix produced by FP32/FP32 GEMM and 𝑋′ be the matrix computed in FP16/FP32 mixed precision. We define the maximum absolute error 𝐸𝛿 as the maximum difference between matrices 𝑋 and 𝑋′:

𝐸𝛿(𝑋, 𝑋′) = ‖𝑋_flatten − 𝑋′_flatten‖_∞    (1)

And the relative error 𝐸𝛾 is in the Frobenius norm:

𝐸𝛾(𝑋, 𝑋′) = ‖𝑋 − 𝑋′‖_F / ‖𝑋′‖_F    (2)

Settings of the error threshold are categorized into eight kinds, which are used to tune the operators in the following experiments and are listed in Table 3. These thresholds cover a wide range for usual applications.

Table 3: Categories of different error thresholds (columns: Error Kind, Value, Error Threshold Category).
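Equations (1) and (2) translate directly into code. The following sketch (helper names are our own) computes both metrics over flattened matrices: the maximum absolute error is the infinity norm of the element-wise difference, and the relative error is the Frobenius norm of the difference over the Frobenius norm of 𝑋′:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Eq. (1): max absolute error = ||flatten(X) - flatten(X')||_inf.
float max_abs_error(const std::vector<float>& x, const std::vector<float>& xp) {
    float e = 0.0f;
    for (size_t i = 0; i < x.size(); ++i)
        e = std::max(e, std::fabs(x[i] - xp[i]));
    return e;
}

// Eq. (2): relative error = ||X - X'||_F / ||X'||_F, accumulated in double
// to avoid losing precision in the sums of squares.
float rel_error_fro(const std::vector<float>& x, const std::vector<float>& xp) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        double d = static_cast<double>(x[i]) - xp[i];
        num += d * d;
        den += static_cast<double>(xp[i]) * xp[i];
    }
    return static_cast<float>(std::sqrt(num) / std::sqrt(den));
}
```

For example, X = (1, 2, 2) against X' = (1, 2, 1) gives a maximum absolute error of 1 and a Frobenius relative error of 1/√6 ≈ 0.408.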
… error. … combinations for all operators. For both, we use the average …

5.1 Performance and Accuracy
Figure 8 compares the speedup and accuracy of the optimized Micro …. The GEMM count is 9 (i.e., GEMM-1, ..., GEMM-9) and the data ….

Figure 8: Speedup and accuracy of Micro gained by different schemes under varying error thresholds (E1–E8). (a) Speedup. (b) Accuracy.

Figure 9 covers the speedup and accuracy of CF optimized by the different schemes. The input setting of CF is (40960, 8192). Figure 9(a) shows that moTuner can obtain a mean speedup of 1.153x in all error categories except E4. When the parameter level of the mixed-precision GEMM with 𝑡𝑠 = 8192 is lower than (𝜏1, 𝜏2, 𝜏3) = (1, 1, 0), the 𝐸𝛾 of its result is above 0.00005, so optimizing with E4 theoretically requires it to be upgraded to (1, 1, 0). But because the mixed-precision GEMM with (𝜏1, 𝜏2, 𝜏3) = (1, 1, 0) is slower than the FP32/FP32 GEMM due to the type-casting cost, moTuner decides to keep it in FP32 precision for E4 and brings no performance degradation. For the accuracy of CF shown in Figure 9(b), …

… a 2.92x mean speedup, which is the same as Exhaust. Figure 10(b) demonstrates that moTuner can achieve over 92% accuracy in all situations while PriorK achieves 98%. But because HPL-AI only validates whether the scaled residual is smaller than 16, and the residual from the programs tuned by moTuner is 0.003551, moTuner is able to generate qualified HPL-AI programs with faster GEMMs.

5.2 Automation Efficiency
In this section, we demonstrate how efficient moTuner can be when tuning the benchmarks, compared to two manual assisting methods. The input of these benchmarks is kept the same as in Section 5.1. As discussed before, the efforts are estimated using the execution counts. Figure 11 illustrates the tuning efforts needed by the different schemes. Exhaust demands the most tuning effort and moTuner requires the least. Also, Exhaust and PriorK both require manual code modification while moTuner provides an end-to-end automatic optimization.

5.3 Sensitivity Studies
In this section, we present how moTuner performs on programs with varied inputs. First, we test on Micro with different data distributions of the input matrices and different GEMM counts. Figure 12 shows
that moTuner can obtain a 2.67x speedup with over 99.9% accuracy on average for Micro with differing inputs. Figure 13 shows how moTuner performs on CF and HPL-AI with changing inputs. moTuner can gain mean speedups of 1.10x and 1.19x on CF and on the GEMM part of HPL-AI, respectively, while maintaining over 99% accuracy on average under E3 and E6.

Figure 11: Average tuning effort needed by the three schemes under all error thresholds.

Figure 13: Speedup and accuracy of real applications gained by moTuner with different inputs under E3 and E6. (a) Speedup. (b) Accuracy.

6 RELATED WORK
Error analysis of floating-point: to evaluate whether a program can take advantage of mixed-precision floating-point to gain performance, a variety of works have been proposed to predict the error of floating-point arithmetic. They can be divided into two categories: dynamic analysis and static analysis. Dynamic analysis requires running programs to gather the necessary information. A dynamic analysis approach [5] is presented to find potential risks of degrading accuracy. Then Lam et al. [26] apply dynamic analysis to detect cancellation errors in floating-point calculations. These dynamic analyses work at the instruction level, whereas moTuner only targets the error of operators instead of the error of instructions. As for static analysis methods, they only analyze the characteristics of the code to gather information. Darulova proposes Xfp [10] to gain a sub-optimal solution without performing an exhaustive exploration. Rosa [8, 9] is a source-to-source compiler which provides a precision mix for a given program on real values. It introduces a contract-based programming paradigm based on the Scala functional programming language, which is not suitable for most existing scientific computing applications. AMPT-GA [23] implements a static analysis aiming at identifying strongly connected variables in the dependency graph. These works lack sensitivity to different inputs in varying degrees, while moTuner can generate the appropriate mixed-precision setting for different inputs.
Mixed-precision tuning: to lessen the programming burden, several efforts have focused on automatic generation of mixed-precision programs on CPU and GPU. For CPU programs, Precimonious [36] proposes delta debugging to narrow the search space for single, double and long-double precision. Then a follow-up work devises blame analysis [37] to reduce the search space further. While these two focus on …

7 CONCLUSION
In this paper, we propose moTuner, a compiler-based auto-tuning approach for mixed-precision operators. moTuner automatically handles both the configuration knobs and the quantization/refinement operations, thus easing the programming burden. Further, an efficient tuning strategy is applied to strike a balance between performance improvement and output quality. We test moTuner on a micro-benchmark with multiple GEMMs, Cholesky factorization and HPL-AI, and the preliminary results demonstrate that moTuner can efficiently obtain speedups of up to 3.13x, 1.15x and 2.92x, respectively. In the future, we plan to extend moTuner to support more complicated operators.

ACKNOWLEDGMENT
We thank the anonymous reviewers for their constructive comments and suggestions. This research was supported by the National Natural Science Foundation of China (#62102465, #U1811461), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, the Major Program of Guangdong Basic and Applied Research (#2019B030302002), the Guangdong Natural Science Foundation (#2018B030312002), and the CCF-Baidu Open Fund (CCF-BAIDU OF2021032).