This is a repository copy of Optimizing Depthwise Separable Convolution Operations on GPUs.
Article:
Lu, G, Zhang, W and Wang, Z orcid.org/0000-0001-6157-0662 (2021) Optimizing
Depthwise Separable Convolution Operations on GPUs. IEEE Transactions on Parallel
and Distributed Systems. p. 1. ISSN 1045-9219
https://doi.org/10.1109/tpds.2021.3084813
© 2021, IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works.
Reuse
Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless
indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by
national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of
the full text version. This is indicated by the licence information on the White Rose Research Online record
for the item.
Takedown
If you consider content in White Rose Research Online to be in breach of UK law, please notify us by
emailing [email protected] including the URL of the record and the reason for the withdrawal request.
[email protected]
https://eprints.whiterose.ac.uk/
Abstract—The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to
reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable
convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such
approaches are inadequate for small-batch-sized model training and for the typical scenario of model inference, where the model takes in a
few samples at once. This paper aims to bridge the gap of optimizing depthwise separable convolutions by targeting the GPU
architecture. We achieve this by designing two novel algorithms to improve the column and row reuse of the convolution operation to
reduce the number of memory operations performed on the width and the height dimensions of the 2D convolution. Our approach
employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads to improve GPU utilization and
to hide the memory access latency. We apply our approach on two GPU platforms: an NVIDIA RTX 2080Ti GPU and an embedded
NVIDIA Jetson AGX Xavier GPU, and two data types: 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach
against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2×
(up to 3×) performance improvement over cuDNN. We show that, when using a moderate batch size, our approach reduces
the end-to-end training time of MobileNetV2 and EfficientNet-B0 by 9.7% and 7.3% on average respectively, and reduces the end-to-end inference
time of MobileNet and EfficientNet by 12.2% and 13.5% respectively.
Index Terms—Performance Optimization, Convolution, Depthwise, Pointwise, Memory Optimization, GPU Utilization
problem for two real-life scenarios: when running a trained model for inference - where the model typically only takes in one or a few samples (and hence a small batch size) - or when performing on-device distributed training on an embedded device - where the number of training samples is likely to be small due to resource constraints.

Our work addresses the memory latency and work distribution issues identified above. By addressing these two issues together, our approach enables efficient DSC because it accelerates not only depthwise convolution, by reducing the GPU memory access latency, but also pointwise convolution for model inference and small-batch-sized training.

To improve the memory performance of depthwise convolution, we introduce two novel optimization techniques for operations performed on rows and columns. Our approach reduces the number of memory accesses required by reusing data. To improve column data reuse, we use the shuffle instructions (supported by both CUDA and OpenCL, and hence applicable to mainstream GPUs) to exchange elements among threads within a GPU warp (or working group). In this way, we can avoid reloading the same elements shared among different threads. We also apply shuffle instructions to convert dynamic indices to static indices to assist register promotion, an optimization strategy that is not exploited in previous studies [23], [24]. To increase row data reuse, we multiply one input row with multiple rows of a convolutional kernel (or filter) to compute multiple output elements at the same time. This strategy improves the data locality of elements within a row, reducing the number of memory transactions compared with that of the existing convolution processing pipeline. By reducing the number of memory accesses, our approach improves the performance of depthwise convolution.

To overcome the drawback of the fixed tile size work partition of a GEMM kernel for pointwise convolution, we employ a dynamic tile size scheme. Our approach first adjusts the work assigned to each GPU thread so that we have a sufficient number of tiles to be distributed to GPU threads to improve the GPU utilization. A challenge here is how to assign the right amount of work to GPU threads so that the global memory access latency can be adequately hidden through computation. Having too little work per GPU thread will make GPU memory accesses dominate the execution, while having too large a work assignment will lead to low GPU utilization (as only a small number of GPU threads will receive a tile to work on). To this end, our dynamic scheme distributes input or filter channels across threads within a warp to minimize memory latency with improved GPU parallelism. Recent works [18], [25] employ a heuristic method to maximize parallelism for GEMM. They achieve this by trying to combine multiple convolutions that can be computed concurrently into one GEMM kernel. Such a strategy assumes multiple parallel convolutions can be performed within a single GEMM kernel. However, this strong assumption is only valid for some special CNN structures like the inception layer in GoogleNet [26]. As a result, these prior methods are not applicable to the more general case of CNNs where convolution operations must be performed sequentially due to dependence. Our dynamic work distribution strategy does not rely on this assumption and hence is more generally applicable compared to these prior approaches.

We evaluate our approach by applying it to both depthwise and pointwise convolutions with FP32 and INT8 on two GPU platforms: an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU. We compared our approach against cuDNN, an industry-strength DNN library that is heavily optimized for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 3× and 2× faster performance than cuDNN for depthwise and pointwise convolutions, respectively. We show that, when the batch size is ≤ 128, our approach improves the end-to-end training performance of MobileNetV2 [27], [7] and EfficientNet-B0 [8] by 11.5% and 7.3% on average respectively, and improves the end-to-end inference performance of MobileNetV2 and EfficientNet-B0 by 9.7% and 11.6% respectively.

This paper makes the following technical contributions:
• It presents two novel algorithms for column and row reuse (Section 3) for depthwise convolution, improving the data locality and reducing the memory access latency of depthwise convolution.
• It describes a novel method for transforming dynamic indices into static indices to assist register promotion for performance optimization (Section 3.1.3).
• It presents a dynamic tile size scheme for pointwise convolution, increasing GPU utilization while minimizing the global memory access latency (Section 4).

This work extends our prior work [28] by proposing a new dynamic tile size scheme to optimize pointwise convolution, which is a key component of the depthwise separable convolution. We also added new experiments performed on embedded devices and using 8-bit integers for the neural networks. The new experiments demonstrate the robustness of the proposed approach, showing that it consistently outperforms cuDNN by delivering the overall best performance.

2 BACKGROUND

2.1 GPU Architecture
Deep learning models are often trained and executed on the GPU. Modern GPUs employ a complex execution pipeline and memory hierarchy to support the concurrent execution of parallel threads. A typical GPU consists of multiple Streaming Multiprocessors (SMs). Each SM includes multiple Single-Instruction-Multiple-Thread (SIMT) units, each of which has multiple lanes of execution. Threads scheduled in the same SIMT unit are called a warp, which is the smallest scheduling unit on a GPU. Like a modern CPU, a GPU has multiple levels of memory hierarchy. The thread-local registers are the fastest memory component, having the lowest access latency (1-2 cycles). The SM-local L1 caches and shared memory provide a larger storage capacity than the thread-local registers but have a modestly higher access latency of around 30 cycles [29], [30]. All the SMs share a unified L2 cache that provides an access latency of about 200 cycles. The off-chip global memory, similar to the RAM in a CPU system, provides the largest memory storage capacity on the GPU but has the longest access latency of around 500 cycles measured through running micro-benchmarks on
Fig. 3. Illustration of direct and optimized convolutions. We use a 5 × 5 filter and each thread calculates the convolution for one output element. This example shows how a thread processes the first 5 corresponding input elements. (a) Direct convolution: each thread loads 5 input elements from global memory. (b) Optimized convolution: each thread retrieves its third element from the corresponding thread. (c) Our approach: each thread retrieves its second and fourth elements from corresponding threads.
3.1.1 Standard convolution
Fig. 3a shows a standard depthwise convolution operation, operating on a single-channel input for the example shown in Fig. 2. Here, each thread loads the first corresponding input element from the GPU global memory. Given that the indices of these elements are contiguous, i.e., 0, 1, 2, and 3 in this example, concurrent access to these elements will be coalesced to form a single memory transaction. As a result, each step will incur one memory access, five for the five steps (steps 1-5) as shown in Fig. 3a. After completing step 5, each pair of adjacent threads will have four duplicate input elements, corresponding to the duplicate columns in Fig. 2. Specifically, input elements 1, 2 and 3 loaded in step 2 would have already been loaded by threads t1, t2 and t3 in the previous step (Fig. 3a). The repeated load of these elements leads to redundant memory accesses and unnecessary memory access latency. Even if the elements are prefetched to the L1 cache before the next step, access to the L1 cache still takes around 30 cycles on a 2080Ti GPU. To reduce the memory overhead, we would like to avoid such redundant memory accesses.

3.1.2 An optimized implementation
To eliminate the redundant loads, we could use the shuffle instructions supported by both CUDA and OpenCL to exchange input elements among different threads. To this end, we adopt the optimization developed in our prior work [28]. Fig. 3b depicts such an optimization. Specifically, in steps 1 and 2 of Fig. 3b, each thread loads the corresponding first and fifth input elements from the global memory. In step 3, each thread utilizes the shuffle instruction to retrieve the third element from another thread. For example, threads t0 and t1 could retrieve the third element from threads t2 and t3, respectively, and provide the fifth element (dashed squares in step 2) for both threads. Similarly, threads t2 and t3 retrieve the third element from threads t0 and t1, respectively, and provide the first element (dashed squares in step 1) for threads t0 and t1. Using the CUDA shuffle instruction, this exchange process can be implemented as shfl_xor(iTemp[i], 2), where iTemp is a thread-local array used to store the five input elements, and i is the location in the local array. For our working example, threads t0 and t1 will supply the fifth element, hence i = 4. Similarly, threads t2 and t3 will provide the first element, thus i = 0.

While this version reduces the redundant memory accesses compared to a standard convolution implementation, there is still room for improvement. The problem is that the shuffle instruction shfl_xor(iTemp[i], 2) now becomes a bottleneck because iTemp is accessed through dynamic indexing. Since the indices and the access pattern to iTemp are not available at compile time, the compiler cannot decide which of the elements in iTemp will be frequently accessed and has to place iTemp in the local memory, which would still incur an access latency of around 500 cycles. If we can promote register allocation for iTemp, we can then further improve the performance of convolution.

3.1.3 Our approach
Our column reuse scheme (Fig. 3c) converts dynamic indexing to static array accesses to promote register allocation. This strategy is described in Algorithms 1 and 2, where the first algorithm is used for step 3, and the latter is used for steps 4 and 5. Note that these two algorithms can be used for different sized convolution kernels, which we will discuss in Section 3.1.4.

Fig. 4 gives a working example of Algorithm 1. Here, we first load the corresponding first and fifth input elements into iTemp before passing it to Algorithm 1. Then, we pack two 32-bit elements into a 64-bit variable, exchange, where iTemp[4] and iTemp[0] are the high and low 32 bits, respectively (Line 2). As threads t0 and t1 will provide the fifth element of the data they load, which is the high 32 bits of exchange, we right shift exchange for both threads by an offset of 32 to place iTemp[4] in the low 32 bits. Now we turn our attention to threads t2 and t3, which will provide the first element of the data they load. Since these elements are the low 32 bits of exchange, we right shift exchange in both threads by an offset of 0. The number of places to be shifted for each thread is calculated based on the thread ID (Line 3). Next, we unpack exchange into iTemp[2] (high 32 bits) and iTemp[1] (low 32 bits) (Line 5). By doing so, we can retrieve the element a thread needs to supply from a fixed location, iTemp[1]. Finally, we use the shuffle instruction to exchange the elements among threads (Line 6).

Using Algorithm 1, we can replace the dynamic index i in shfl_xor(iTemp[i], 2) (whose value is unknown at compile time) with a static index, 1 in shfl_xor(iTemp[1], 2), in our working example. By doing so, we promote register allocation by allowing the compiler to put all the thread-local variables into the fast GPU registers (that have access
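To make the pack-shift-unpack idea above concrete, the following CUDA device function is a minimal sketch of one possible realisation of the step-3 exchange for the 5 × 5 working example. The function name, the lane-ID test and the choice to store the received value in iTemp[2] are our assumptions rather than the exact code of Algorithm 1.

__device__ void exchange_step3(float iTemp[5]) {
    // Pack the first and fifth input elements into one 64-bit value:
    // iTemp[4] occupies the high 32 bits, iTemp[0] the low 32 bits (Line 2).
    unsigned long long exchange =
        ((unsigned long long)__float_as_uint(iTemp[4]) << 32) |
         (unsigned long long)__float_as_uint(iTemp[0]);

    // Derive the shift amount from the lane ID (Line 3): in the working example,
    // t0/t1 supply their fifth element (high bits, shift by 32) while t2/t3
    // supply their first element (low bits, shift by 0).
    int lane  = threadIdx.x & 31;
    int shift = ((lane & 2) == 0) ? 32 : 0;
    exchange >>= shift;

    // Unpack into fixed positions (Line 5): the element each thread supplies
    // now always sits at the static index iTemp[1].
    iTemp[1] = __uint_as_float((unsigned int)(exchange & 0xffffffffu));
    iTemp[2] = __uint_as_float((unsigned int)(exchange >> 32));

    // Exchange with the partner lane (XOR mask 2) using a static index (Line 6).
    // We assume the received value is the thread's third input element and store
    // it in iTemp[2].
    iTemp[2] = __shfl_xor_sync(0xffffffffu, iTemp[1], 2);
}

Because every index into iTemp is now a compile-time constant, the compiler can keep the array entirely in registers instead of spilling it to local memory.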
default warp size of our evaluation GPU) and the last sub-block in Lines 4-15 and 16-19, respectively. In this way, each GPU thread calculates one column of the output elements. This is done through several steps. First, each thread block loads the filter into shared memory and divides the filter into a combination of 3-column and 5-column sub-filters. Next, each thread calculates the address of the first input element it needs (Line 6). For each output element and sub-filter, each thread loads the corresponding input elements into iTemp and passes iTemp to Algorithms 1 and 2 to fill the row vector iTemp (Line 10). Then, each thread passes the filled vector iTemp to Algorithm 3 to calculate multiple output elements and stores the results in the register array sum (Line 11). Finally, when the calculation of one output element is completed, we write the corresponding result in sum into the result array O (Line 13).

4 OPTIMIZING POINTWISE CONVOLUTION
In this section, we explain the workflow of our dynamic tile size scheme for pointwise convolution. This approach extends the optimization for convolution operations in our prior work [28] to pointwise convolution. Our approach consists of three stages, described as follows.

In the first and second stages, we identify parameters related to the tile size and determine candidate values for each parameter (Section 4.1 and Section 4.2). The first and second stages process input dependent and independent parameters respectively. In the third stage, as detailed in Algorithm 5, we iterate over all combinations of parameters and search for the combination that achieves optimal SM utilization and data reuse (Section 4.3).

We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size and they are not designed for pointwise convolutions with small inputs. Our dynamic tile size scheme avoids these two drawbacks. To improve SM utilization, our approach searches for the optimal tile size for the output based on the input size to generate a proper number of tiles to saturate the GPU and maximize data reuse. To optimize pointwise convolution with small inputs, we distribute channels across threads within a warp to increase the arithmetic intensity for each thread.

4.1 Determine Tiling Parameters
In our design, we use a 2-level tiling scheme, as shown in Fig. 7, to partition the output into block tiles and warp tiles. Each thread block processes one block tile and each warp processes one warp tile. The height dimension of the warp tile is shared among the 32 threads of a warp and the width dimension of the warp tile is distributed across the 32 threads of a warp. Hence, we have two input dependent parameters, namely the height and width of the warp tile, denoted as WarpH and WarpW respectively. We now introduce how to use the 2-level tiling scheme to determine candidate values for WarpH and WarpW.

4.1.1 A two-level tiling scheme
To divide the output into block tiles, we utilize two logical layouts of the output, L1 and L2, as shown in Fig. 7. FN and IN × IH × IW represent the filter and input dimensions of the output respectively. Notice that our 2-level tiling can handle arbitrary input sizes since we do not require IH = IW. Before partitioning the output, we first select the layout of the output based on the size of the filter dimension. The rationale behind choosing the filter dimension instead of the input dimension can be described as follows. The number of filters, FN, is fixed once the structure of a CNN is determined. But the size of the input dimension will be affected by the batch size, IN, during inference and training. Therefore, it is easier to design our approach based on the size of the filter dimension. When FN > 48, we choose layout L1 and distribute filter channels across threads within a warp. Otherwise, we choose layout L2 and distribute input channels. The boundary FN = 48 is determined as follows. Fig. 7 demonstrates that in layout L2, the maximal value of FN is 4 × WarpH and WarpH ≤ 12 (explained later in this section); therefore we have FN ≤ 48 for layout L2.

Since both layouts follow a similar procedure, we take layout L1 as an illustration example and give a brief description of layout L2 at the end of this section. After choosing the layout based on FN, we partition the output along the filter dimension. First, we halve the filter dimension if FN ≥ 512. The reason is that if we let each thread block process a large number of filters, then each thread needs to issue more than 15 global memory load instructions, which may cause MIO (Memory Input Output) instruction queue throttle and lead to performance degradation. Then, we halve both dimensions of each block tile and generate 2 × 2 warp tiles.

4.1.2 Determine candidates for WarpH and WarpW
Based on the partition method, we know that WarpW can be calculated with WarpW = FN/4 or WarpW = FN/2. Thus, we only need to determine candidate values for WarpH based on the size of the input dimension. In our design, when WarpH > 12, we need assembly level optimizations like the work in [16], [39] for some configurations of pointwise convolutions to avoid register spills. But in this work, we focus on higher level rather than assembly level optimizations, and thus set WarpH ≤ 12. If the size of the input dimension is large, we prefer to choose a large WarpH, because using a small WarpH will generate many thread blocks and result in multiple loads of shared filters [40], [41]. If the size of the input dimension is small, we prefer to choose a small WarpH, because using a large WarpH will generate only a few thread blocks and result in SM underutilization. Since each thread loads at most 12 input elements (WarpH ≤ 12), we set the upper limit of large WarpH to 12 and the lower limit to 12/2 = 6. Therefore, the candidates for large WarpH are WarpH = {6, 7, 8, 9, 10, 11, 12}. The candidates for small WarpH are WarpH = {2, 3, 4, 5, 6, 7, 8}. In our experiments, there is no clear boundary between the large and small candidate sets of WarpH; therefore we let both sets overlap in the middle values. The boundary between the large and small size of the input dimension is experimentally determined as IN × IH × IW = 16 × 14 × 14.
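As a concrete illustration of Sections 4.1.1 and 4.1.2, the short host-side sketch below selects the layout and enumerates the (WarpH, WarpW) candidates for layout L1. The function and struct names are ours, and whether the 16 × 14 × 14 boundary itself counts as "large" is an assumption.

#include <vector>

struct WarpTile { int warpH, warpW; };

// Layout selection (Section 4.1.1): distribute filter channels (layout L1) when FN > 48,
// otherwise distribute input channels (layout L2).
inline bool useLayoutL1(int FN) { return FN > 48; }

// Candidate (WarpH, WarpW) pairs for layout L1 (Section 4.1.2).
std::vector<WarpTile> warpTileCandidatesL1(int FN, int IN, int IH, int IW) {
    // WarpW = FN/2, or FN/4 when the filter dimension is halved first (FN >= 512).
    const int warpW = (FN >= 512) ? FN / 4 : FN / 2;
    // WarpH candidates: {6..12} for a "large" input dimension, {2..8} for a "small" one,
    // with the boundary IN x IH x IW = 16 x 14 x 14.
    const bool largeInput = (long long)IN * IH * IW > 16LL * 14 * 14;
    const int lo = largeInput ? 6 : 2;
    const int hi = largeInput ? 12 : 8;
    std::vector<WarpTile> candidates;
    for (int warpH = lo; warpH <= hi; ++warpH)
        candidates.push_back({warpH, warpW});
    return candidates;
}

The candidates produced here are later filtered by the hardware constraints and metrics of Section 4.3.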
Compared to layout L1, layout L2 swaps the input and filter dimensions. Hence, WarpH can be calculated with WarpH = FN/4 or WarpH = FN/2. The candidate values for large WarpW are WarpW = {6, 7, 8, 9, 10, 11, 12} and for small WarpW are WarpW = {2, 3, 4, 5, 6, 7, 8}.

4.2 Determine Candidates for Input Independent Parameters
There are three input independent parameters we need to consider, namely the number of warps in a thread block (Warpnum), the number of thread blocks that can run concurrently on an SM (Blocknum) and the number of channels to be distributed (Cnum).

4.2.1 Determine candidates for Warpnum and Blocknum
When determining candidates for Warpnum, we need to consider that (1) a small warp number will decrease the opportunity to hide the memory access latency at the warp level, and (2) a large warp number will decrease the number of thread blocks and may lead to SM underutilization. We empirically set the warp number to four (Warpnum = 4), which gives good performance in our pilot study using microbenchmarks of hand-written pointwise convolution kernels. For the number of thread blocks, Blocknum, we use two values, 2 and 4, on our evaluation platforms. These choices are justified as follows. For NVIDIA GPUs, each GPU thread can use up to 255 registers, and each SM has 65,536 registers. If we set Blocknum = 1 and Warpnum = 4 (per our discussion above), each SM will have Blocknum × Warpnum = 4 warps. This allows a thread block to use up to just half of the available registers of an SM, because a thread block under this setting can use at most 4 (warps in an SM) × 32 (threads per warp) × 255 (registers per thread) = 32,640 registers. Therefore, to utilize the available hardware registers, one should set Blocknum to be greater than one. We also found that setting Blocknum > 4 during searching offers little benefit, and hence we set Blocknum to be either 2 or 4 (Blocknum = {2, 4}).

4.2.2 Determine candidates for Cnum
When searching for the optimal combination of parameters, a small tile size may be generated, which may lead to low arithmetic intensity and cannot hide the global memory access latency. For example, assume that the warp tile size is WarpH × WarpW = 8 × 64 and has 56 channels, which means that one warp needs to convolve 8 input elements with 64 filter elements and accumulate results 56 times to generate 8 × 64 = 512 elements. Since the height dimension is shared among the 32 threads of the warp, each thread loads 8 input elements, and since the width dimension is distributed across the 32 threads, each thread loads 2 filter elements. Therefore, each thread accumulates 56 channels of 8 × 2 = 16 elements. Now we can estimate the arithmetic intensity of each thread for one iteration as (number of multiplications)/(number of elements) = (8 × 2)/(8 + 2) = 1.6. We can improve the arithmetic intensity by distributing channels across threads, as shown in Fig. 7. We distribute eight channels (Cnum = 8) of each filter element across the 32 threads of the warp. In that case, each warp can process Fnum = 32/Cnum = 32/8 = 4 filter elements and each thread processes Tnum = WarpW/Fnum = 64/4 = 16 filter elements. The arithmetic intensity can then be estimated as (WarpH × Tnum)/(WarpH + Tnum) = (8 × 16)/(8 + 16) = 5.3. Higher arithmetic intensity increases the chance to hide the global memory access latency. To fully utilize a warp, candidate values for Cnum should be a power of 2. Thus, the candidates for Cnum are Cnum = {1, 2, 4, 8, 16, 32}.

4.3 Search For the Optimal Combination
4.3.1 Hardware resource constraints
When searching for the optimal combination of tiling and input independent parameters, we focus on combinations that can meet the hardware resource constraints, including registers and shared memory. In the rest of this section, we take layout L1 as an illustration example. Based on Blocknum, we calculate the number of registers each thread can use (LimitR) and the size of shared memory each thread block can use (LimitS) with the formulas LimitR = TotalR/(Blocknum × Warpnum × 32) and LimitS = TotalS/Blocknum respectively. TotalR and TotalS represent the number of registers and the size of shared memory of an SM, respectively. On the RTX 2080Ti, TotalR = 65536 and TotalS = 64KB, while on the Jetson AGX Xavier, TotalR = 65536 and TotalS = 48KB.

In our approach, each warp processes one warp tile which contains WarpH × WarpW output elements. Each thread calculates WarpH × Tnum elements and thus needs Rresult = WarpH × Tnum and Roperand = WarpH + Tnum registers to store results and operands respectively. The constraints can be formulated as follows:

Rtmp = (Cnum × 2 × WarpW)/128 + (Cnum × 2 × WarpH)/128
Rresult + Roperand + Rtmp + extraR ≤ LimitR    (1)

(2 × WarpH + 2 × WarpW) × Cnum × 4 × 2 ≤ LimitS    (2)

where Rtmp is the number of temporary registers used to store filter and input elements loaded from global memory. 2 × WarpH and 2 × WarpW represent the height and width of the block tile respectively. 128 means each thread block has Warpnum × 32 = 4 × 32 = 128 threads to load data from global memory. In Formula 1, extraR is the number of additional registers allocated by the compiler and its value is determined through an off-line method. In our experiments, we set extraR = 40 because the NVIDIA CUDA compiler, on average, allocates 40 additional registers for each kernel on our evaluation platforms. These additional registers are usually used to store temporary variables for utilizing the GPU arithmetic pipelines. In Formula 2, 4 means each element has 4 bytes and 2 means we use a double buffer method [33], [42].

4.3.2 Searching workflow
To guide the search for the optimal combination of parameters, we use two metrics, named SM utilization (SMutil) and arithmetic intensity (AI). The two metrics can be calculated as follows:

Blockcount = FN/(2 × WarpW) × (IN × IH × IW)/(2 × WarpH)
(Fig. 7 residue: the two logical layouts of the output, L1 and L2. Channels of filters or inputs are distributed across threads; each block tile (Block Tile 0, Block Tile 1) is partitioned into 4 warp tiles (Warp Tile 0-3) of output elements. Full caption not recovered.)
Algorithm 5: Optimized Pointwise Convolution
Input: I, F
Output: O
// the code below is executed on the CPU
1  Determine candidates for relevant parameters;
2  foreach parameter combination do
3    if the constraints of Formula 1 and 2 are not satisfied then
4      continue;
5    Calculate SMutil and AI with Formula 3 and 4;
6  Choose combinations whose SMutil is close to 1;
7  Among chosen combinations, choose the combination with the maximal AI;
8  Choose the kernel based on the chosen combination;
// the code below is executed on the GPU
9  Load Cnum channels of a block tile into shared memory array sharedBuf1;
10 syncthreads();
11 for iter ← 0 to IC by 2 × Cnum do
12   Load the next Cnum channels into Rtmp;
13   Load channels from sharedBuf1 into Roperand;
14   Accumulate output elements into Rresult;
15   Write Rtmp into sharedBuf2;
16   syncthreads();
17   Repeat the above steps but swap sharedBuf1 and sharedBuf2;
18 Use segmented parallel reduction to get the final output elements and write the result to O;

SMutil = Blockcount/(Blocknum × SMnum)    (3)

AI = (WarpH × Tnum)/(WarpH + Tnum)    (4)

where Blockcount is the number of generated thread blocks and SMnum is the number of SMs on a GPU. For the RTX 2080Ti and the Jetson AGX Xavier, SMnum = 68 and SMnum = 8 respectively.

The whole workflow is described in Algorithm 5. We first determine candidates for the relevant parameters, including WarpH, WarpW, Warpnum, Blocknum and Cnum, based on the size of the input and filter (Line 1). Then we iterate over all combinations of parameters (Line 2), and keep the combinations that satisfy the constraints LimitR (Formula 1) and LimitS (Formula 2) (Line 3). Next, we calculate the values of SMutil (Formula 3) and AI (Formula 4) for all satisfying combinations (Line 5) and select the optimal combination with the following steps (Lines 6-7):

Step 1 If SMutil ≥ 1 is true for all combinations, we select the combinations that possess the smallest and close to the smallest SMutil. The reason is that when SMutil ≥ 1, all SMs are utilized, in which case we want to reduce the number of thread blocks to reduce the number of loads of shared filters or inputs between multiple thread blocks.
Step 2 If there exist combinations such that SMutil < 1, we first collect these combinations. Then, among the collected combinations, we select the ones that possess the biggest and close to the biggest SMutil. The reason is that when SMutil < 1, there are idle SMs, in which case we want to increase SMutil to fully utilize the SMs. We do not want SMutil to exceed 1 because that would incur more memory operations.
Step 3 Among the candidate combinations selected in Step 1 and Step 2, we select the combination with the maximum value of AI, because higher arithmetic intensity can hide more global memory access latency.

Last, we choose the pointwise convolution kernel based on the selected combination (Line 8). In this kernel, each thread block first loads Cnum channels of the corresponding block tile into the shared memory array sharedBuf1 (Line 9). Meanwhile, the thread block loads the next Cnum channels of the block tile into temporary registers (Line 12). Then, we load data from sharedBuf1 into registers (Line 13) and accumulate output elements into registers (Line 14). Next, we write the data in the temporary registers into sharedBuf2 (Line 15). The kernel repeats this process until all channels have been accumulated into the output elements. Finally, we use a warp-level segmented parallel reduction to reduce the results of different channels into the final result and write the results to global memory (Line 18).
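To make the selection procedure concrete, the host-side C++ sketch below mirrors the CPU part of Algorithm 5 (Lines 1-8) using Formulas 1-4. The struct and function names, the fixed tolerance used for "close to" the best SMutil, and the relation Tnum = WarpW × Cnum/32 taken from the example in Section 4.2.2 are our assumptions rather than the exact implementation.

#include <vector>
#include <algorithm>

// One candidate combination (WarpH, WarpW, Blocknum, Cnum); Warpnum is fixed to 4.
struct Config { int warpH, warpW, blockNum, cNum; };
// Per-GPU limits, e.g. totalR = 65536, totalSBytes = 64 * 1024, smNum = 68 on the RTX 2080Ti.
struct Hardware { int totalR, totalSBytes, smNum; };

static int tNumOf(const Config& c) { return c.warpW * c.cNum / 32; }   // Tnum = WarpW / (32 / Cnum)

// Formulas 1 and 2: register and shared-memory constraints for layout L1.
static bool satisfiesConstraints(const Config& c, const Hardware& hw, int extraR = 40) {
    const int warpNum = 4;
    double limitR = (double)hw.totalR / (c.blockNum * warpNum * 32);
    double limitS = (double)hw.totalSBytes / c.blockNum;
    int tNum = tNumOf(c);
    double rTmp = (c.cNum * 2.0 * c.warpW) / 128.0 + (c.cNum * 2.0 * c.warpH) / 128.0;
    double regs = (double)c.warpH * tNum + (c.warpH + tNum) + rTmp + extraR;    // Formula 1
    double smem = (2.0 * c.warpH + 2.0 * c.warpW) * c.cNum * 4.0 * 2.0;         // Formula 2
    return regs <= limitR && smem <= limitS;
}

// Formula 3 (with Blockcount) and Formula 4.
static double smUtil(const Config& c, const Hardware& hw, int FN, int IN, int IH, int IW) {
    double blockCount = (double)FN / (2.0 * c.warpW) * ((double)IN * IH * IW / (2.0 * c.warpH));
    return blockCount / (c.blockNum * hw.smNum);
}
static double ai(const Config& c) {
    int tNum = tNumOf(c);
    return (double)c.warpH * tNum / (c.warpH + tNum);
}

// CPU part of Algorithm 5 (Lines 2-8), assuming at least one feasible candidate exists.
Config selectConfig(const std::vector<Config>& candidates, const Hardware& hw,
                    int FN, int IN, int IH, int IW) {
    std::vector<Config> feasible;
    for (const Config& c : candidates)
        if (satisfiesConstraints(c, hw)) feasible.push_back(c);      // Lines 3-4

    bool anyBelowOne = false;
    for (const Config& c : feasible) anyBelowOne |= smUtil(c, hw, FN, IN, IH, IW) < 1.0;

    // Steps 1 and 2: shortlist the combinations whose SMutil is closest to 1
    // (largest below 1 if any exist, otherwise smallest at or above 1).
    double best = anyBelowOne ? 0.0 : 1e30;
    for (const Config& c : feasible) {
        double u = smUtil(c, hw, FN, IN, IH, IW);
        if (anyBelowOne) { if (u < 1.0) best = std::max(best, u); }
        else             { best = std::min(best, u); }
    }
    const double tol = 0.05;                                         // "close to" tolerance (our choice)
    std::vector<Config> shortlist;
    for (const Config& c : feasible) {
        double u = smUtil(c, hw, FN, IN, IH, IW);
        if (anyBelowOne ? (u < 1.0 && u >= best - tol) : (u <= best + tol))
            shortlist.push_back(c);
    }

    // Step 3: among the shortlist, pick the combination with the maximal arithmetic intensity.
    Config chosen = shortlist.front();
    for (const Config& c : shortlist)
        if (ai(c) > ai(chosen)) chosen = c;
    return chosen;
}

The GPU side of Algorithm 5 (Lines 9-18) then instantiates the pointwise convolution kernel that matches the chosen combination.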
Fig. 8. Speedups of IMPLICIT, PRECOMP, our approach and TensorFlow over the baseline implementation (GEMM) for FP32 depthwise convolution with filters of size 3 × 3 and 5 × 5 on two platforms. (c) Speedups on Xavier for the 3 × 3 filter. (d) Speedups on Xavier for the 5 × 5 filter. (Panels CONV1-CONV9; x-axis: Batch Size, 1-128; y-axis: Speedup; legend: cuDNN IMPLICIT, cuDNN PRECOMP, ours, TensorFlow.)
on 2080Ti and Xavier respectively. Since IMPLICIT is closed source, we analyze its performance through CUDA Nsight Compute [46] and present the results in Section 6.1.3. Overall, our approach improves IMPLICIT by 2.0× and 1.4× when using a 3 × 3 filter on 2080Ti and Xavier respectively, and by 3.5× and 2.1× when using a 5 × 5 filter on 2080Ti and Xavier respectively.

INT8 implementation. We found that using FP32 gives a speedup of more than 10× over the INT8 version for depthwise convolution in cuDNN. This is because the INT8 version has the overhead of dequantization (i.e., converting the results from INT8 to FP32 after convolution) and cannot fully utilize the DP4A instruction to accelerate INT8 convolution. We note that TensorFlow does not optimize depthwise convolution on INT8. Nonetheless, our approach gives over 10× speedups when using INT8 over cuDNN and TensorFlow.

6.1.3 Further analysis
Our performance gain is mainly attributed to the reduced number of memory accesses offered by our column and row reuse algorithms.

Fig. 9 reports the measured LDG (load from global memory) instruction counts and SM utilization for the fast IMPLICIT algorithm and our approach when using a 3 × 3 filter and a batch size of 32 on 2080Ti. Other configurations follow a similar performance trend. We can see in Fig. 9a that the IMPLICIT algorithm has an average of 2× higher SM utilization compared to our approach. The reason our approach leads to lower SM utilization is explained as follows. Our row reuse algorithm performs better when a thread operates on more rows of the output. However, the more rows a thread computes on, the fewer warps and thread blocks we can generate. Without enough warps running on the SMs, the SM utilization will degrade. Though IMPLICIT has high SM utilization, it does not result in good performance for depthwise convolution. The reason is that depthwise convolution possesses a low computational requirement and is more sensitive to memory performance; hence the focus of performance optimization should be reducing the memory access latency. If we now look at Fig. 9b, we see that the row and column reuse techniques reduce memory operations, requiring up to 4.5× fewer LDG instructions to be executed when compared to IMPLICIT. By reducing the memory access overhead, which dominates the execution time of depthwise convolution, our approach can thus lead to better overall performance compared to cuDNN, despite the lower SM utilization.

From Fig. 8 we can observe that the speedups of our approach over IMPLICIT fluctuate in a small range as the batch size increases. Neither IMPLICIT nor our approach can benefit from higher GPU utilization because depthwise convolution is memory bound; thus, IMPLICIT and our approach grow at the same rate.
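For reference, the DP4A instruction mentioned above performs a dot product of four packed 8-bit integers with a 32-bit accumulate in a single instruction (available from compute capability 6.1 onwards). The snippet below is a minimal, generic illustration of its use rather than code from our kernels.

__device__ int dot4_int8(int packedA, int packedB, int acc) {
#if __CUDA_ARCH__ >= 610
    // a0*b0 + a1*b1 + a2*b2 + a3*b3 + acc, where a0..a3 and b0..b3 are the four
    // signed 8-bit lanes packed into each 32-bit operand.
    return __dp4a(packedA, packedB, acc);
#else
    // Fallback for older architectures: unpack and accumulate manually.
    int sum = acc;
    for (int i = 0; i < 4; ++i) {
        int a = (signed char)((packedA >> (8 * i)) & 0xff);
        int b = (signed char)((packedB >> (8 * i)) & 0xff);
        sum += a * b;
    }
    return sum;
#endif
}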
(Fig. 9 residue: (a) SM utilizations of IMPLICIT and our approach; y-axis: SM Utilization (%); x-axis: CONV1-CONV9; legend: cuDNN IMPLICIT, ours. (b) y-axis: Ratio; full caption not recovered.)

for FP32 and PRECOMP for INT8. When using data type INT8, PRECOMP performs better than IMPLICIT in 180 out of 180 test cases on 2080Ti and 127 out of 180 test cases on Xavier. For FP32, we normalize the speedup over GEMM. For INT8, because GEMM does not support this data type, we show the speedup over IMPLICIT. Table 3 lists the layer configurations and the parameter values generated for WarpH, WarpW, Blocknum and Cnum (Warpnum = 4). The notations can be found in Section 2.3.

Normally, for a given convolution layer configuration, when the width of the logical layout of the output (Fig. 7) is small, our scheme tends to choose a small Cnum. This allows one to generate more warps to utilize the GPU. On the other hand, our scheme tends to choose a large Cnum to reduce the number of warps when there are already enough warps to maximize the utilization of the GPU. A special parameter tuple (we take the parameter tuples generated for 2080Ti as examples) is the layer configuration CONV9 with IN = 1.

(Figure residue: speedup panels for CONV1-CONV7 and CONV15-CONV20; legend: cuDNN IMPLICIT, cuDNN PRECOMP, ours; x-axis: Batch Size, 1-128; caption not recovered.)
TABLE 3
Layer configurations of pointwise convolutions (FC = IC, IW = IH and FW = FH = 1) and parameter values of WarpH, WarpW, Blocknum and Cnum. We use the tuples (WarpH, WarpW, Blocknum, Cnum) and [WarpH, WarpW, Blocknum, Cnum] to represent the parameter values generated for 2080Ti and Xavier respectively.
LAYER IC IH FN IN = 1 IN = 8 IN = 16 IN = 32 IN = 64 IN = 128
( 4, 12, 2, 8) ( 4, 47, 4, 2) ( 4, 480, 4, 1) ( 4, 480, 4, 1) ( 4, 480, 4, 1) ( 4,1216, 2, 32)
CONV1 16 56 8
[ 4, 50, 4, 2] [ 4, 480, 4, 1] [ 4,1216, 2, 32] [ 4,1216, 2, 32] [ 4,1216, 2, 32] [ 4,1216, 2, 32]
( 8, 12, 2, 8) ( 8, 47, 4, 4) ( 8, 256, 4, 1) ( 8, 256, 4, 1) ( 8, 672, 2, 32) ( 8, 672, 2, 32)
CONV2 8 56 16
[ 8, 50, 4, 4] [ 8, 672, 2, 32] [ 8, 672, 2, 32] [ 8, 672, 2, 32] [ 8, 672, 2, 32] [ 8, 672, 2, 32]
(12, 36, 2, 8) (12, 36, 4, 4) (12, 36, 4, 4) (12, 36, 4, 4) (12, 36, 4, 4) (12, 36, 4, 4)
CONV3 16 56 72
[12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4]
( 6, 6, 2, 32) ( 6, 47, 2, 4) ( 6, 47, 4, 4) ( 6, 320, 4, 1) ( 6, 320, 4, 1) ( 6, 864, 2, 32)
CONV4 72 28 24
[ 6, 50, 2, 4] [ 6, 320, 4, 1] [ 6, 864, 2, 32] [ 6, 864, 2, 32] [ 6, 864, 2, 32] [ 6, 864, 2, 32]
( 3, 48, 2, 2) (12, 48, 4, 2) (12, 48, 4, 2) (12, 48, 4, 2) (12, 48, 4, 2) (12, 48, 4, 2)
CONV5 24 28 96
[12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2]
( 6, 2, 2, 32) ( 6, 12, 2, 16) ( 6, 24, 2, 8) ( 6, 47, 2, 4) ( 6, 47, 4, 4) ( 6, 320, 4, 1)
CONV6 96 14 24
[ 6, 13, 2, 16] [ 6, 50, 4, 4] [ 6, 320, 4, 1] [ 6, 320, 4, 1] [ 6, 864, 2, 32] [ 6, 864, 2, 32]
( 2, 48, 2, 1) ( 6, 48, 2, 4) (12, 48, 2, 8) (12, 48, 4, 2) (12, 48, 4, 2) (12, 48, 4, 2)
CONV7 24 14 96
[ 7, 48, 2, 4] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2] [12, 48, 4, 2]
( 2, 96, 2, 1) ( 6, 96, 2, 2) (12, 96, 2, 4) (12, 96, 4, 1) (12, 96, 4, 1) (12, 96, 4, 1)
CONV8 32 14 192
[ 7, 96, 2, 2] [12, 96, 4, 1] [12, 96, 4, 1] [12, 96, 4, 1] [12, 96, 4, 1] [12, 96, 4, 1]
(12, 2, 2, 32) (12, 12, 2, 32) (12, 24, 2, 16) (12, 47, 2, 8) (12, 47, 4, 2) (12, 160, 4, 1)
CONV9 192 14 48
[12, 13, 2, 32] [12, 50, 4, 2] [12, 160, 4, 1] [12, 480, 2, 32] [12, 480, 2, 32] [12, 480, 2, 32]
(10, 2, 2, 32) (10, 12, 2, 32) (10, 24, 2, 16) (10, 47, 2, 8) (10, 47, 4, 4) (10, 192, 4, 1)
CONV10 96 14 40
[10, 13, 2, 16] [10, 50, 4, 2] [10, 192, 4, 1] [10, 544, 2, 32] [10, 544, 2, 32] [10, 544, 2, 32]
( 2, 60, 2, 1) ( 6, 60, 2, 2) (12, 60, 2, 4) (12, 60, 4, 2) (12, 60, 4, 2) (12, 60, 4, 2)
CONV11 40 14 120
[ 7, 60, 2, 4] [12, 60, 4, 2] [12, 60, 4, 2] [12, 60, 4, 2] [12, 60, 4, 2] [12, 60, 4, 2]
( 8, 2, 2, 32) ( 8, 12, 2, 16) ( 8, 24, 2, 8) ( 8, 47, 2, 4) ( 8, 47, 4, 4) ( 8, 256, 4, 1)
CONV12 120 14 32
[ 8, 13, 2, 16] [ 8, 50, 4, 4] [ 8, 256, 4, 1] [ 8, 256, 4, 1] [ 8, 672, 2, 32] [ 8, 672, 2, 32]
( 2, 120, 2, 1) ( 6, 120, 2, 1) (12, 120, 2, 4) (12, 120, 4, 1) (12, 120, 4, 1) (12, 120, 4, 1)
CONV13 40 14 240
[ 7, 120, 2, 2] [12, 120, 4, 1] [12, 120, 4, 1] [12, 120, 4, 1] [12, 120, 4, 1] [12, 120, 4, 1]
( 2, 32, 2, 2) ( 2, 32, 2, 2) ( 3, 32, 2, 2) ( 6, 32, 2, 4) (12, 32, 2, 8) (12, 32, 4, 4)
CONV14 240 7 64
[ 2, 32, 2, 2] [ 7, 32, 4, 8] [12, 32, 4, 4] [12, 32, 4, 4] [12, 32, 4, 4] [12, 32, 4, 4]
( 2, 120, 2, 1) ( 2, 120, 2, 1) ( 3, 120, 2, 1) ( 6, 120, 2, 1) (12, 120, 2, 4) (12, 120, 4, 1)
CONV15 64 7 240
[ 2, 120, 2, 1] [ 7, 120, 4, 2] [12, 120, 4, 1] [12, 120, 4, 1] [12, 120, 4, 1] [12, 120, 4, 1]
( 2, 216, 2, 1) ( 2, 216, 2, 1) ( 3, 216, 2, 1) ( 6, 216, 2, 1) (12, 216, 2, 2) ( 9, 216, 4, 32)
CONV16 72 7 432
[ 2, 216, 2, 1] [ 7, 216, 4, 1] [ 9, 216, 4, 32] [ 9, 216, 4, 32] [ 9, 216, 4, 32] [ 9, 216, 4, 32]
( 2, 56, 2, 1) ( 2, 56, 2, 1) ( 3, 56, 2, 1) ( 6, 56, 2, 4) (12, 56, 2, 8) (12, 56, 4, 2)
CONV17 432 7 112
[ 2, 56, 2, 1] [ 7, 56, 4, 4] [12, 56, 4, 2] [12, 56, 4, 2] [12, 56, 4, 2] [12, 56, 4, 2]
( 2, 216, 2, 1) ( 2, 216, 2, 1) ( 3, 216, 2, 1) ( 6, 216, 2, 1) (12, 216, 2, 2) ( 9, 216, 4, 32)
CONV18 112 7 432
[ 2, 216, 2, 1] [ 7, 216, 4, 1] [ 9, 216, 4, 32] [ 9, 216, 4, 32] [ 9, 216, 4, 32] [ 9, 216, 4, 32]
( 2, 36, 2, 1) ( 2, 36, 2, 1) ( 3, 36, 2, 2) ( 6, 36, 2, 4) (12, 36, 2, 8) (12, 36, 4, 4)
CONV19 432 7 72
[ 2, 36, 2, 1] [ 7, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4] [12, 36, 4, 4]
( 2, 256, 2, 1) ( 3, 256, 2, 1) ( 6, 256, 2, 1) (12, 256, 2, 1) ( 8, 256, 4, 32) ( 8, 256, 4, 32)
CONV20 432 7 1024
[ 4, 256, 2, 1] [ 8, 256, 4, 32] [ 8, 256, 4, 32] [ 8, 256, 4, 32] [ 8, 256, 4, 32] [ 8, 256, 4, 32]
(Figure residue: SM utilization (%) of cuDNN IMPLICIT and our approach; caption not recovered.)

Fig. 13. The average number of cycles each warp spends on waiting for the global memory access to complete. (Legend: cuDNN IMPLICIT, ours, simple; y-axis: Cycles; x-axis: CONV1-CONV20.)

Inference. For inference, we test standard and quantized MobileNetV2 and EfficientNet-B0 with batch sizes of 1, 8, 16, 32, 64 and 128 on both platforms and report the
TABLE 5
Inference time of MobileNetV2 and EfficientNet-B0 with FP32 and INT8 on 2080Ti and Xavier.

                              MobileNetV2                          EfficientNet-B0
Batch                         1     8     16    32    64    128    1     8     16    32    64    128
2080Ti (FP32)  cuDNN (ms)     7.5   8.8   9.7   14.4  19.1  28.7   10.1  13.7  18.1  25.0  36.4  52.3
               Ours (ms)      6.1   7.1   8.0   12.0  16.9  26.3   7.9   11.3  15.3  21.9  32.6  47.6
               Improved (%)   18.6  19.3  17.5  16.7  11.5  8.4    21.8  17.5  15.5  12.4  10.4  9.0
Xavier (FP32)  cuDNN (ms)     16.6  22.3  32.1  52.6  84.2  140.1  19.3  27.4  38.3  57.2  94.0  157.8
               Ours (ms)      13.2  18.9  27.8  44.7  76.1  130.0  15.5  23.2  32.1  50.7  87.3  151.1
               Improved (%)   20.5  15.2  13.4  15.0  9.6   7.2    19.7  15.3  16.3  11.4  7.1   4.2
2080Ti (INT8)  cuDNN (ms)     6.3   7.4   7.7   11.2  14.6  20.2   8.0   9.5   13.3  18.7  26.8  38.3
               Ours (ms)      5.5   6.6   6.8   10.3  14.0  19.7   6.8   8.2   11.8  16.9  25.3  36.6
               Improved (%)   12.7  10.8  11.7  8.0   4.1   2.5    15.0  13.7  11.3  9.6   5.6   4.4
Xavier (INT8)  cuDNN (ms)     13.3  18.0  27.0  42.6  64.8  103.7  16.1  21.0  33.7  52.8  80.3  127.5
               Ours (ms)      11.7  15.4  22.7  38.8  58.3  94.4   14.2  18.8  30.3  48.2  73.2  117.7
               Improved (%)   12.0  14.4  16.0  8.9   10.0  9.0    11.8  10.5  10.1  8.7   8.8   7.7
Conference on Cluster Computing (CLUSTER). IEEE, 2020, pp. 399–403.
[29] X. Mei and X. Chu, "Dissecting gpu memory hierarchy through microbenchmarking," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 1, pp. 72–86, 2016.
[30] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the nvidia volta gpu architecture via microbenchmarking," arXiv preprint arXiv:1804.06826, 2018.
[31] D. E. Tanner, "Tensile: Auto-tuning gemm gpu assembly for all problem sizes," in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018, pp. 1066–1075.
[32] V. Kelefouras, A. Kritikakou, I. Mporas, and V. Kolonias, "A high-performance matrix–matrix multiplication methodology for cpu and gpu architectures," The Journal of Supercomputing, vol. 72, no. 3, pp. 804–844, 2016.
[33] A. Abdelfattah, S. Tomov, and J. Dongarra, "Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus," in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2019, pp. 111–122.
[34] J. Kurzak, H. Anzt, M. Gates, and J. Dongarra, "Implementation and tuning of batched cholesky factorization and solve for nvidia gpus," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 2036–2048, 2015.
[35] L. Jiang, C. Yang, and W. Ma, "Enabling highly efficient batched matrix multiplications on sw26010 many-core processor," ACM Transactions on Architecture and Code Optimization (TACO), vol. 17, no. 1, pp. 1–23, 2020.
[36] P. Tillet and D. Cox, "Input-aware auto-tuning of compute-bound hpc kernels," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[37] H. Lan, J. Meng, C. Hundt, B. Schmidt, M. Deng, X. Wang, W. Liu, Y. Qiao, and S. Feng, "Feathercnn: Fast inference computation with tensorgemm on arm architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 3, pp. 580–594, 2019.
[38] Y. Zhang and F. Mueller, "Autogeneration and autotuning of 3d stencil codes on homogeneous and heterogeneous gpu clusters," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 3, pp. 417–427, 2012.
[39] D. Yan, W. Wang, and X. Chu, "Demystifying tensor cores to optimize half-precision matrix multiply," in 2020 IEEE International Parallel and Distributed Processing Symposium, IPDPS, 2020, pp. 20–24.
[40] L. Jia, Y. Liang, X. Li, L. Lu, and S. Yan, "Enabling efficient fast convolution algorithms on gpus via megakernels," IEEE Transactions on Computers, 2020.
[41] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, "Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
[42] D. Nichols, N.-S. Tomov, F. Betancourt, S. Tomov, K. Wong, and J. Dongarra, "Magmadnn: towards high-performance data analytics and machine learning for data-driven scientific computing," in International Conference on High Performance Computing. Springer, 2019, pp. 490–503.
[43] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[44] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, "Data-free quantization through weight equalization and bias correction," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1325–1334.
[45] NVIDIA, CUDA Toolkit Programming Guide. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[46] ——, NVIDIA Nsight Compute. [Online]. Available: https://developer.nvidia.com/nsight-compute
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[48] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.
[49] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through ffts," arXiv preprint arXiv:1312.5851, 2013.
[50] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
[51] M. Cho and D. Brand, "Mec: memory-efficient convolution for deep neural network," in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 815–824.
[52] F. N. Iandola, D. Sheffield, M. J. Anderson, P. M. Phothilimthana, and K. Keutzer, "Communication-minimizing 2d convolution in gpu registers," in IEEE International Conference on Image Processing, 2014.
[53] Y. Oyama, T. Ben-Nun, T. Hoefler, and S. Matsuoka, "Accelerating deep learning frameworks with micro-batches," in 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2018, pp. 402–412.
[54] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, "Optimizing memory efficiency for deep convolutional neural networks on gpus," in SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 633–644.
[55] J. Zhang, F. Franchetti, and T. M. Low, "High performance zero-memory overhead direct convolutions," in International Conference on Machine Learning, 2018, pp. 5776–5785.

Gangzhao Lu received the B.S. degree in computer science and engineering from Harbin Institute of Technology, China, in 2014. He is currently working toward the Ph.D. degree in the School of Cyberspace Science, Harbin Institute of Technology. His research interests include performance modeling, parallel optimization, and auto-tuning.

Weizhe Zhang (Senior Member, IEEE) received the B.Eng, M.Eng and Ph.D. degrees of Engineering in computer science and technology in 1999, 2001 and 2006 respectively from Harbin Institute of Technology. He is currently a professor in the School of Computer Science and Technology at Harbin Institute of Technology, China, and director of the Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China. His research interests are primarily in parallel computing, distributed computing, cloud and grid computing, and computer networks. He has published more than 100 academic papers in journals, books, and conference proceedings.

Zheng Wang is an associate professor with the University of Leeds. His research focuses on parallel computing, compilation and systems security.