
This is a repository copy of Optimizing Depthwise Separable Convolution Operations on GPUs.

White Rose Research Online URL for this paper:
https://eprints.whiterose.ac.uk/174797/

Version: Accepted Version

Article:
Lu, G, Zhang, W and Wang, Z orcid.org/0000-0001-6157-0662 (2021) Optimizing
Depthwise Separable Convolution Operations on GPUs. IEEE Transactions on Parallel
and Distributed Systems. p. 1. ISSN 1045-9219

https://doi.org/10.1109/tpds.2021.3084813

© 2021, IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works.

Reuse
Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless
indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by
national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of
the full text version. This is indicated by the licence information on the White Rose Research Online record
for the item.

Takedown
If you consider content in White Rose Research Online to be in breach of UK law, please notify us by
emailing [email protected] including the URL of the record and the reason for the withdrawal request.

[email protected]
https://eprints.whiterose.ac.uk/

Optimizing Depthwise Separable Convolution Operations on GPUs

Gangzhao Lu, Weizhe Zhang, Senior Member, IEEE, and Zheng Wang

• G. Lu and W. Zhang are with the School of Cyberspace Science at Harbin Institute of Technology, Harbin 150000, China. E-mail: {lugangzhao,wzzhang}@hit.edu.cn
• Z. Wang is with the School of Computing at University of Leeds, United Kingdom. E-mail: [email protected]

Abstract—The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-sized model training and the typical scenario of model inference, where the model takes in only a few samples at a time. This paper aims to bridge this gap in optimizing depthwise separable convolutions by targeting the GPU architecture. We achieve this by designing two novel algorithms to improve the column and row reuse of the convolution operation, reducing the number of memory operations performed on the width and the height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads to improve GPU utilization and to hide the memory access latency. We apply our approach to two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2× (up to 3×) performance improvement over cuDNN. We show that, when using a moderate batch size, our approach reduces the end-to-end training time of MobileNetV2 and EfficientNet-B0 by 9.7% and 7.3% on average respectively, and reduces the end-to-end inference time of MobileNetV2 and EfficientNet-B0 by 12.2% and 13.5% respectively.

Index Terms—Performance Optimization, Convolution, Depthwise, Pointwise, Memory Optimization, GPU Utilization

1 INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved astonishing success in solving a wide range of tasks [1], [2], [3], [4], [5], [6]. One of the most successful DNN architectures is the convolutional neural network (CNN), which is widely used in tasks like image classification [1], [2], object detection [3], [4] and semantic segmentation [5], [6]. CNN models are typically trained and run on GPUs due to their computation requirements.

The depthwise separable convolution (DSC) is widely used in modern CNN models for accelerating model computation time [7], [8], [9], [10]. This operation can process both the spatial dimensions (e.g., the width and the height of an image) and the depth dimension (e.g., the RGB channels of an image) of an input. It achieves this by splitting a convolution kernel into two separate kernels that perform two convolutions: a depthwise convolution and a pointwise convolution. The former applies a single convolutional filter to each input channel, and the latter uses a 1×1 kernel to iterate through every single point of the input (i.e., the kernel has a depth of however many channels the input image has). Compared to a classical 2D convolution that operates on an input of channels × height × width, the DSC reduces the number of multiplication operations as well as the number of parameters needed for the convolution filter (and hence the chance of model over-fitting), and it reduces the computation time by having fewer arithmetic operations. For this reason, the DSC is widely used in latency-sensitive scenarios, such as using a trained CNN on embedded devices or performing distributed, on-device learning on resource-constrained systems [11].

A wide range of optimization techniques have been proposed to perform convolutions [12], [13], [14], [15], [16], [17], [18], [19]. Among these techniques, the fast Fourier transform (FFT) [15], Winograd [16] and general matrix multiplication (GEMM) [17], [18], [19] are broadly adopted. However, FFT and Winograd offer little benefit for depthwise convolutions compared to a standard 2D convolution. This is because FFT and Winograd are designed to optimize arithmetic computation [20], [16], but not memory accesses, whereas the memory access latency often dominates the execution time of depthwise convolution [21] due to its lower number of arithmetic operations compared to a standard 2D convolution. Both methods are also ill-suited for pointwise convolutions (which apply a 1 × 1 kernel) because FFT is designed to operate on a large filter and Winograd works best when the filter size is 3 × 3.

While GEMM is a good fit for pointwise convolution (and is also adopted by cuDNN [22]), the current implementation of GEMM for CNNs can lead to poor GPU performance during model deployment. A typical GEMM implementation uses a fixed tile size to distribute work across parallel threads, without taking into consideration the amount of computation required. As we will show later in the paper, such a strategy cannot make effective use of the GPU parallelism when the batch size (i.e., the number of samples to be processed at once) is small (e.g., <= 128). The ineffective use of GPU resources leads to low GPU utilization and sub-optimal performance.

This is a particular problem in two real-life scenarios: when running a trained model for inference, where the model typically only takes in one or a few samples (and hence a small batch size), or when performing on-device distributed training on an embedded device, where the number of training samples is likely to be small due to resource constraints.

Our work addresses the memory latency and work distribution issues identified above. By addressing these two issues together, our approach enables efficient DSC because it accelerates not only depthwise convolution, by reducing the GPU memory access latency, but also pointwise convolution, for model inference and small-batch-sized training.

To improve the memory performance of depthwise convolution, we introduce two novel optimization techniques for operations performed on rows and columns. Our approach reduces the number of memory accesses required by reusing data. To improve column data reuse, we use the shuffle instructions (supported by both CUDA and OpenCL and hence applicable to mainstream GPUs) to exchange elements among threads within a GPU warp (or work-group). In this way, we can avoid reloading the same elements shared among different threads. We also apply shuffle instructions to convert dynamic indices into static indices to assist register promotion, an optimization strategy that is not exploited in previous studies [23], [24]. To increase row data reuse, we multiply one input row with multiple rows of a convolutional kernel (or filter) to compute multiple output elements at the same time. This strategy improves the data locality of elements within a row, reducing the number of memory transactions compared with that of the existing convolution processing pipeline. By reducing the number of memory accesses, our approach improves the performance of depthwise convolution.

To overcome the drawback of the fixed-tile-size work partition of a GEMM kernel for pointwise convolution, we employ a dynamic tile size scheme. Our approach first adjusts the work assigned to each GPU thread so that we have a sufficient number of tiles to distribute to GPU threads to improve the GPU utilization. A challenge here is how to assign the right amount of work to GPU threads so that the global memory access latency can be adequately hidden through computation. Having too little work per GPU thread will make the GPU memory accesses dominate the execution, while having too large a work assignment will lead to low GPU utilization (as only a small number of GPU threads will receive a tile to work on). To this end, our dynamic scheme distributes input or filter channels across threads within a warp to minimize memory latency with improved GPU parallelism. Recent works [18], [25] employ a heuristic method to maximize parallelism for GEMM. They achieve this by trying to combine multiple convolutions that can be computed concurrently into one GEMM kernel. Such a strategy assumes multiple parallel convolutions can be performed within a single GEMM kernel. However, this strong assumption is only valid for some special CNN structures like the inception layer in GoogleNet [26]. As a result, these prior methods are not applicable to the more general case of CNNs where convolution operations must be performed sequentially due to dependence. Our dynamic work distribution strategy does not rely on this assumption and hence is more generally applicable compared to these prior approaches.

We evaluate our approach by applying it to both depthwise and pointwise convolutions with FP32 and INT8 on two GPU platforms: an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU. We compared our approach against cuDNN, an industry-strength DNN library that is heavily optimized for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 3× and 2× faster performance than cuDNN for depthwise and pointwise convolutions, respectively. We show that, when the batch size is <= 128, our approach improves the end-to-end training performance of MobileNetV2 [27], [7] and EfficientNet-B0 [8] by 11.5% and 7.3% on average respectively, and improves the end-to-end inference performance of MobileNetV2 and EfficientNet-B0 by 9.7% and 11.6% respectively.

This paper makes the following technical contributions:

• It presents two novel algorithms for column and row reuse (Section 3) for depthwise convolution, improving the data locality and reducing the memory access latency of depthwise convolution.
• It describes a novel method for transforming dynamic indices into static indices to assist register promotion for performance optimization (Section 3.1.3).
• It presents a dynamic tile size scheme for pointwise convolution, increasing GPU utilization while minimizing the global memory access latency (Section 4).

This work extends our prior work [28] by proposing a new dynamic tile size scheme to optimize pointwise convolution, which is a key component of the depthwise separable convolution. We also add new experiments performed on embedded devices and using 8-bit integers for the neural networks. The new experiments demonstrate the robustness of the proposed approach, showing that it consistently outperforms cuDNN by delivering the overall best performance.

2 BACKGROUND

2.1 GPU Architecture

Deep learning models are often trained and executed on the GPU. Modern GPUs employ a complex execution pipeline and memory hierarchy to support concurrent execution of parallel threads. A typical GPU consists of multiple Streaming Multiprocessors (SMs). Each SM includes multiple Single-Instruction-Multiple-Thread (SIMT) units, each of which has multiple lanes of execution. Threads scheduled on the same SIMT unit are called a warp, which is the smallest scheduling unit on a GPU. Like a modern CPU, a GPU has multiple memory hierarchies. The thread-local registers are the fastest memory component, having the lowest access latency (1-2 cycles). The SM-local L1 caches and shared memory provide a larger storage capacity than the thread-local registers but have a modestly higher access latency of around 30 cycles [29], [30]. All the SMs share a unified L2 cache that provides an access latency of about 200 cycles. The off-chip global memory, similar to the RAM in a CPU system, provides the largest memory storage capacity on the GPU but has the longest access latency of around 500 cycles, measured by running micro-benchmarks on the NVIDIA RTX 2080Ti GPU used in this work. Local memory resides in global memory and is used to hold variables that use dynamic indexing or are too large to fit into registers; it has the same access latency as global memory. The key to optimizing memory performance is to make use of the fast memory sub-systems (i.e., registers and shared memory) and to reduce the number of memory accesses to slower memory. Our work is designed to provide such capabilities for depthwise separable convolution operations.

Fig. 1. Demonstration of depthwise and pointwise convolutions. (a) Depthwise convolution: three 5 × 5 2D filters are used to convolve with one 3-channel 12 × 12 input and generate one 3-channel 8 × 8 output. (b) Pointwise convolution: four 3-channel 1 × 1 filters are used to convolve with one 3-channel 8 × 8 input and generate one 4-channel 8 × 8 output.

2.2 Depthwise Separable Convolution

Our work targets the depthwise separable convolution (DSC), which is widely used by CNN models to reduce the number of multiplication operations needed for performing convolution (a standard operation in CNNs). The DSC splits a standard (e.g., multi-channel) 2D convolution kernel into two individual kernels: a depthwise convolution kernel and a pointwise convolution kernel.

The depthwise convolution kernel processes one input channel at a time, and stacks the outputs of all channels together to form a c × n × n matrix, where n × n is the output of a depthwise convolution kernel and c is the number of channels to be processed. Specifically, it takes as input a feature map and applies a bank of 2D filters (e.g., an N × N kernel) on the width and height directions of the input. We iteratively apply the depthwise convolution kernel to all channels. Fig. 1a shows an example of depthwise convolution, where three 5 × 5 2D filters are used to convolve with the corresponding channels of a 3 × 12 × 12 feature map and generate one 3 × 8 × 8 output.

The output of the depthwise convolution kernel is fed into a pointwise convolution kernel, which uses a 1 × 1 filter to iterate through every single point. This kernel has a depth equal to the number of input channels (i.e., c). The DSC reduces the computation by reducing the number of input transformations needed when compared to a standard depthwise convolution. Fig. 1b shows an example of pointwise convolution, where four 3 × 1 × 1 filters are used to convolve with the 3 × 8 × 8 feature map iteratively and each filter generates one channel of the 4 × 8 × 8 output.

2.3 Roadmap and Notations

We present our approach for optimizing the two convolutional kernels of the DSC in Sections 3 and 4. We start by introducing our methods for improving the data locality of depthwise convolution in Section 3, and then present our approach for using dynamic work distribution to accelerate small-batch-sized pointwise convolution in Section 4.

Notations. Throughout the paper, we use I, F, and O to represent the input, the filter, and the output respectively; we also use N, C, H, and W to denote the batch size, the channel, the height, and the width, respectively.

3 OPTIMIZING DEPTHWISE CONVOLUTION

In this section, we describe our two optimizations, column reuse (Section 3.1) and row reuse (Section 3.2), for reducing the number of GPU memory accesses for depthwise convolution.

3.1 Column Reuse Optimization

Fig. 2. A working example of performing a depthwise convolution using a GPU. Here, the filter size is 5 × 5, the input image size is 6 × 11 and the output size is 2 × 7.

Working example. We use the depthwise convolution with only one channel shown in Fig. 2 as a working example to explain our column reuse method. In practice, we iterate the depthwise convolution kernel on each channel in turn (e.g., the R, G, and B channels of an image). Without loss of generality, we slide a 5 × 5 filter over a 6 × 11 input with stride 1 to produce a 2 × 7 output. Our column reuse method can also be applied to depthwise convolutions with other stride settings. In this example, each thread calculates one column of the output. Two parallel threads 0 and 1 will execute code to slide the filter along the width dimension, where both threads load two overlapped regions from the input image (thereby generating four duplicate columns). Similarly, there will be another thread (thread 6 in this example) that slides the filter along the height dimension, which will load two overlapped regions and generate four duplicate rows.

Fig. 3. Illustration of direct and optimized convolutions. We use a 5 × 5 filter and each thread calculates the convolution for one output element. This example shows how a thread processes the first 5 corresponding input elements. (a) Direct convolution: each thread loads 5 input elements from global memory. (b) Optimized convolution: each thread retrieves its third element from the corresponding thread. (c) Our approach: each thread retrieves its second and fourth elements from the corresponding threads.

3.1.1 Standard convolution

Fig. 3a shows a standard depthwise convolution operation, operating on a single-channel input for the example shown in Fig. 2. Here, each thread loads the first corresponding input element from the GPU global memory. Given that the indices of these elements are contiguous, i.e., 0, 1, 2, and 3 in this example, concurrent accesses to these elements will be coalesced to form a single memory transaction. As a result, each step will incur one memory access, five for the five steps (steps 1-5) shown in Fig. 3a. After completing step 5, each pair of adjacent threads will have four duplicate input elements, corresponding to the duplicate columns in Fig. 2. Specifically, input elements 1, 2 and 3 loaded in step 2 would have already been loaded by threads t1, t2 and t3 in the previous step (Fig. 3a). The repeated loads of these elements lead to redundant memory accesses and unnecessary memory access latency. Even if the elements may be prefetched into the L1 cache before the next step, an access to the L1 cache still takes around 30 cycles on a 2080Ti GPU. To reduce the memory overhead, we would like to avoid such redundant memory accesses.

3.1.2 An optimized implementation

To eliminate the redundant loads, we could use the shuffle instructions supported by both CUDA and OpenCL to exchange input elements among different threads. To this end, we adopt the optimization developed in our prior work [28]. Fig. 3b depicts such an optimization. Specifically, in steps 1 and 2 of Fig. 3b, each thread loads the corresponding first and fifth input elements from the global memory. In step 3, each thread utilizes the shuffle instruction to retrieve the third element from another thread. For example, threads t0 and t1 could retrieve the third element from threads t2 and t3, respectively, and provide the fifth element (dashed squares in step 2) for both threads. Similarly, threads t2 and t3 retrieve the third element from threads t0 and t1, respectively, and provide the first element (dashed squares in step 1) for threads t0 and t1. Using the CUDA shuffle instruction, this exchange process can be implemented as shfl_xor(iTemp[i], 2), where iTemp is a thread-local array used to store the five input elements, and i is the location in the local array. For our working example, threads t0 and t1 will supply the fifth element, hence i = 4. Similarly, threads t2 and t3 will provide the first element, thus i = 0.

While this version reduces the redundant memory accesses compared to a standard convolution implementation, there is still room for improvement. The problem is that the shuffle instruction shfl_xor(iTemp[i], 2) now becomes a bottleneck because iTemp is accessed through dynamic indexing. Since the indices and the access pattern to iTemp are not available at compile time, the compiler cannot decide which of the elements in iTemp will be frequently accessed and has to place iTemp in the local memory, which still incurs an access latency of around 500 cycles. If we can promote register allocation for iTemp, we can further improve the performance of convolution.

3.1.3 Our approach

Our column reuse scheme (Fig. 3c) converts dynamic indexing to static array accesses to promote register allocation. This strategy is described in Algorithms 1 and 2, where the first algorithm is used for step 3, and the latter is used for steps 4 and 5. Note that these two algorithms can be used for convolution kernels of different sizes, which we will discuss in Section 3.1.4.

Fig. 4 gives a working example of Algorithm 1. Here, we first load the corresponding first and fifth input elements into iTemp before passing it to Algorithm 1. Then, we pack the two 32-bit elements into a 64-bit variable, exchange, where iTemp[4] and iTemp[0] are the high and low 32 bits, respectively (Line 2). As threads t0 and t1 will provide the fifth element of the data they load, which is in the high 32 bits of exchange, we right-shift exchange in both threads by an offset of 32 to place iTemp[4] in the low 32 bits. Now turning our attention to threads t2 and t3, which will provide the first element of the data they load: since these elements are in the low 32 bits of exchange, we right-shift exchange in both threads by an offset of 0. The number of places to shift for each thread is calculated based on the thread ID (Line 3). Next, we unpack exchange into iTemp[2] (high 32 bits) and iTemp[1] (low 32 bits) (Line 5). By doing so, we can retrieve the element a thread needs to supply from a fixed location, iTemp[1]. Finally, we use the shuffle instruction to exchange the elements among threads (Line 6).
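To make this transformation concrete, the CUDA sketch below shows one plausible way to write the pack/shift/unpack exchange for the 5-element case. The function name exchange_third_element and the use of a standalone helper (rather than code fully inlined into the kernel) are our own illustrative choices, not the authors' implementation.

// Minimal sketch of the static-index exchange of Fig. 4 / Algorithm 1 (FP32).
// Assumes iTemp[0] and iTemp[4] already hold the first and fifth input
// elements loaded by this thread. In the real kernel this logic would be
// fully inlined and unrolled so that iTemp stays in registers.
__device__ __forceinline__ void exchange_third_element(float iTemp[5]) {
    // Pack two 32-bit elements into one 64-bit value:
    // low 32 bits = iTemp[0], high 32 bits = iTemp[4] (Line 2 of Algorithm 1).
    unsigned long long exchange =
        (static_cast<unsigned long long>(__float_as_uint(iTemp[4])) << 32) |
        __float_as_uint(iTemp[0]);
    // Threads t0/t1 must supply their fifth element (high bits); threads
    // t2/t3 their first element (low bits). The shift amount (32 or 0)
    // is derived from the lane id (Line 3).
    int lane  = threadIdx.x & 31;
    int shift = ((lane + 2) & 2) << 4;   // 32 for t0/t1, 0 for t2/t3
    exchange >>= shift;                  // Line 4
    // Unpack: the element this thread must provide now sits at the
    // static position iTemp[1] (Line 5).
    iTemp[1] = __uint_as_float(static_cast<unsigned int>(exchange & 0xffffffffu));
    // Exchange with the lane two positions away; the array index is a
    // compile-time constant, so the compiler can keep iTemp in registers (Line 6).
    iTemp[2] = __shfl_xor_sync(0xffffffffu, iTemp[1], 2);
}

Algorithm 2 follows the same pattern, with the shift derived from ((lane + 1) & 1) << 5 and a shuffle distance of 1.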

Algorithm 1: RetrieveThirdElement
// iTemp: buffer for storing input elements loaded from memory or generated through shuffle instructions.
Input: iTemp
Output: iTemp
1 tid ← threadIdx.x;
2 mov exchange, {iTemp[0], iTemp[4]};
3 shift ← ((tid + 2) & 2) << 4;
4 exchange ← exchange >> shift;
5 mov {iTemp[1], iTemp[2]}, exchange;
6 iTemp[2] ← shfl_xor(iTemp[1], 2);

Algorithm 2: RetrieveSecondElement
Input: iTemp
Output: iTemp
1 tid ← threadIdx.x;
2 mov exchange, {iTemp[0], iTemp[2]};
3 shift ← ((tid + 1) & 1) << 5;
4 exchange ← exchange >> shift;
5 mov {iTemp[0], iTemp[1]}, exchange;
6 iTemp[1] ← shfl_xor(iTemp[0], 1);

Fig. 4. Converting the dynamic indexing of array iTemp into static indexing, allowing iTemp to be allocated in registers instead of the local memory. (The example shows how the two 32-bit elements are packed into a 64-bit value, right-shifted by a thread-dependent amount, and unpacked, replacing dynamic indexing with different shift amounts.)

Using Algorithm 1, we can replace the dynamic index i in shfl_xor(iTemp[i], 2) (whose value is unknown at compile time) with a static index, 1, in shfl_xor(iTemp[1], 2) for our working example. By doing so, we promote register allocation by allowing the compiler to put all the thread-local variables into the fast GPU registers (which have an access latency of 1 to 2 cycles, as opposed to 500 cycles when the data are stored in the local memory). Note that this approach does not increase the register usage. Since steps 4 and 5 of our implementation (Fig. 3c) adopt a similar procedure to step 3 in Fig. 3b, we can adapt Algorithm 1 to derive Algorithm 2 with minor modifications. The main distinction between the two algorithms comes from how we process steps 3-5 shown in Fig. 3c. Specifically, we use four threads to exchange the elements in step 3 for Algorithm 1, but use only two adjacent threads in steps 4 and 5 for Algorithm 2. To adapt to the change in the number of threads used, we recalculate the shift offset for each thread (Line 3 of both algorithms) and change the arguments of the shuffle instruction (Line 6 of both algorithms).

For our working example, Algorithms 1 and 2 respectively reduce the number of memory accesses from 5 to 2 and from 25 to 10 when 5 and 25 input elements are loaded. This reduction greatly improves the performance of the depthwise convolution.

3.1.4 Generalize to other filter sizes

So far we have described our approach using a concrete working example with a pre-defined filter size, but our algorithms can be generalized to filters of arbitrary size. To apply our approach to a filter of size n × n, we first divide the filter into several n × 5 sub-filters. Next, we divide the remaining columns into several n × 3 sub-filters with some overlapping columns. Each n × 5 and n × 3 sub-filter can then be directly processed by Algorithm 1 and Algorithm 2.

3.2 Row Reuse Optimization

Fig. 5. A 3 × 3 filter is used to slide over the input image along the height dimension, which produces a column of output elements.

Working example. Consider now the standard convolution example shown in Fig. 5 as a working example for our row reuse algorithm. When sliding the filter over the 2D input along the height dimension, it produces a column of elements as the output.

3.2.1 Standard convolution

Assume we use one thread to calculate one column of output elements. For the working example given in Fig. 5, the convolution will be computed as follows:

out0 = row_i0 · row_f0 + row_i1 · row_f1 + row_i2 · row_f2
out1 = row_i1 · row_f0 + row_i2 · row_f1 + row_i3 · row_f2
out2 = row_i2 · row_f0 + row_i3 · row_f1 + row_i4 · row_f2

As can be seen from the above equations, row_i1 and row_i3 are loaded twice, and row_i2 is loaded three times; nine rows are loaded in total. These redundant loads of the same read-only rows thus incur extra memory accesses and additional overhead.

3.2.2 Our optimization

To remove redundant loads of the same row, we redesign the execution flow of the standard depthwise convolution. Specifically, after fetching a row from the input, we compute the number of output elements that depend on the loaded row. With this information in place, we use the loaded row to perform inner products with the corresponding rows of the filter to calculate the output elements whose outcomes depend on the loaded row. Our approach translates the execution flow of the working example presented in Fig. 5 into:

load row_i0 : out0 = row_i0 · row_f0
load row_i1 : out0 = out0 + row_i1 · row_f1
              out1 = row_i1 · row_f0
load row_i2 : out0 = out0 + row_i2 · row_f2
              out1 = out1 + row_i2 · row_f1
              out2 = row_i2 · row_f0
load row_i3 : out1 = out1 + row_i3 · row_f2
              out2 = out2 + row_i3 · row_f1
load row_i4 : out2 = out2 + row_i4 · row_f2

In this new implementation, we only issue loads for five rows to calculate the output elements of our working example. Compared to the nine loads required by the standard convolution, we reduce the number of loads of row elements by nearly half. We note that although the number of accesses to the output column out is increased, the overhead is negligible because out is smaller than the size of multiple rows and can often be stored in registers.
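The following CUDA device-code sketch spells out this row-reuse schedule for one output column. The unit-stride and no-padding assumptions and the helper rowDot are ours; in the full kernel the per-row load would be replaced by the shuffle-based column reuse of Section 3.1.

// Sketch of the row-reuse schedule of Section 3.2.2 for one output column.
// Assumptions (for illustration only): unit stride, no padding, FW <= 5,
// and this thread is responsible for output column `col`.
__device__ __forceinline__ float rowDot(const float* row, const float* frow, int FW) {
    float acc = 0.0f;
    for (int k = 0; k < FW; ++k) acc += row[k] * frow[k];  // inner product of two rows
    return acc;
}

__device__ void columnWithRowReuse(const float* input, const float* filter,
                                   float* out, int IW, int FW, int FH,
                                   int OH, int col) {
    float row[5];                                // current input-row segment (registers)
    for (int o = 0; o < OH; ++o) out[o] = 0.0f;
    for (int r = 0; r < OH + FH - 1; ++r) {      // each input row is loaded exactly once
        for (int k = 0; k < FW; ++k)             // in the real kernel this load is done
            row[k] = input[r * IW + col + k];    // cooperatively via shuffle instructions
        int lo = (r - FH + 1 > 0) ? (r - FH + 1) : 0;  // first output needing row r
        int hi = (r < OH - 1) ? r : (OH - 1);          // last output needing row r
        for (int o = lo; o <= hi; ++o)
            out[o] += rowDot(row, filter + (r - o) * FW, FW);  // filter row (r - o)
    }
}

For the 3 × 3 example above (FH = 3, OH = 3), this loop issues exactly five row loads and reproduces the load-row_i0 … load-row_i4 schedule.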

Algorithm 3: RowReuse
Input: row, index, filter, Out
Output: Out
1 if index < F_H − 1 then
2   for i ← 0 to index + 1 do
3     Out[i] ← Out[i] + row · filter[index − i];
4 else if index ≥ F_H − 1 and index < I_H − F_H + 1 then
5   for i ← 0 to F_H do
6     o_index ← index − F_H + 1 + i;
7     Out[o_index] ← Out[o_index] + row · filter[F_H − 1 − i];
8 else
9   for i ← F_H − 1 to 0 do
10    o_index ← I_H − F_H + 1;
11    Out[o_index] ← Out[o_index] + row · filter[F_H − i];

Algorithm 4: Optimized Depthwise Convolution
Input: I, F, subBlockHeight
Output: O
1 Load the filter into shared memory;
2 Divide the columns of the filter into a combination of 3-column and 5-column sub-filters;
3 __syncthreads();
4 if blockIdx.x < gridDim.x − 1 then
5   Initialize the thread-local register array sum to zero;
6   Calculate the index of the first input element this thread needs, denoted as inputIndex;
7   for i ← 0 to subBlockHeight do
8     foreach sub-filter do
9       Load the corresponding input elements from inputIndex of global memory into iTemp;
10      Call RetrieveThirdElement(iTemp) or RetrieveSecondElement(iTemp);
11      Call RowReuse(iTemp, i, sub-filter, sum);
12    Write each completed element of sum into O;
13 else
14   Divide the columns of the last sub-block into multiple partitions and try to evenly assign those partitions to the threads of a warp. Each thread uses a direct method to calculate elements of O;
15   The same method is adopted when processing the edge elements of O;

We describe a general solution for row reuse in Algorithm 3, where row denotes the row loaded from the input, index denotes the index of row, filter denotes the vector of filter rows, and filter[i] is the i-th row of the filter. The pseudocode at Lines 1-5 processes the first F_H − 1 rows (row_i0 and row_i1 in Fig. 5), which are needed by fewer than F_H output elements. The code at Lines 6-11 processes the rows needed by exactly F_H output elements (e.g., row_i2 in Fig. 5). Finally, the code at Lines 12-17 processes the last F_H − 1 rows, which are needed by fewer than F_H output elements (e.g., row_i3 and row_i4 in Fig. 5).

Algorithm 3 is designed to eliminate the redundant loads of the same row introduced by sliding a filter over the input along the height dimension. By loading each row of the input just once, our approach greatly reduces the number of memory transactions for convolution operations.

3.3 Putting Together

We now take the widely used 2D convolution as an example to illustrate how to apply both reuse algorithms to convolution operations.

To apply our approach to a depthwise convolution that works on a 2D matrix, we first divide the output into sub-blocks. Each sub-block contains exactly n columns (in this work, n = 32, which is the default warp size of our GPU platforms). The only exception is the last sub-block, which may contain fewer than n columns. If a sub-block contains more than k rows (k = 56 in this work), we further break down the sub-block along the height dimension. This blocking method implies that our approach can handle arbitrary input sizes. Each GPU thread block processes one or multiple sub-blocks, and each warp computes one sub-block.

Fig. 6. The output is produced by sliding a 3 × 3 filter over an 8 × 8 input with one pad. Here, we assume that the warp size is 4 and thus lane_id = thread_id % 4. (Shaded squares mark edge elements; dashed squares mark inner elements, which are partitioned into sub-block 0 and sub-block 1 and processed by lanes 0-3 of warps 0 and 1.)

3.3.1 Example

Fig. 6 shows the mapping of GPU threads to output elements. In this example, we slide a 3 × 3 filter over an 8 × 8 input. To apply a square filter at the edge of the image, we need to pad the input. To reduce the memory pressure, we do not allocate GPU memory space for the padded elements. Instead, we use different methods to calculate the edge and inner elements of the output. The edge and inner elements are represented by the shaded and dashed squares in Fig. 6, respectively.

In this example, we assume each GPU warp contains four threads. Therefore, we divide the inner elements into multiple sub-blocks and each sub-block contains four columns, so that a column can be processed by one of the four GPU threads within a warp. In our case, we will have two sub-blocks, where sub-block 0 contains four columns but sub-block 1 only contains two columns. To utilize all the threads within a warp, we divide the elements of the last two columns evenly among the four threads.

3.3.2 Generalization

In Algorithm 4, we describe our generalized solution. Here, we process the sub-blocks with exactly 32 columns (i.e., the default warp size of our evaluation GPUs) and the last sub-block in Lines 4-15 and 16-19, respectively. In this way, each GPU thread calculates one column of the output elements. This is done through several steps. First, each thread block loads the filter into shared memory and divides the filter into a combination of 3-column and 5-column sub-filters. Next, each thread calculates the address of the first input element it needs (Line 6). For each output element and sub-filter, each thread loads the corresponding input elements into iTemp and passes iTemp to Algorithms 1 and 2 to fill the row vector iTemp (Line 10). Then, each thread passes the filled vector iTemp to Algorithm 3 to calculate multiple output elements and stores the results in the register array sum (Line 11). Finally, when the calculation of an output element is completed, we write the corresponding result in sum into the result array O (Line 13).
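As a concrete illustration of the blocking rules above, the host-side sketch below derives a plausible launch configuration from the output size. The 32-column and 56-row constants come from the text; the struct, function name, and the one-warp-per-sub-block comment are our own assumptions, not the authors' launcher.

#include <cstdio>

// Sketch: derive the sub-block grid from the output size, following the
// blocking rules of Section 3.3 (n = 32 columns per sub-block, sub-blocks
// with more than k = 56 rows are split along the height dimension).
struct DepthwiseGrid {
    int subBlocksX;      // sub-blocks along the width (one warp each)
    int subBlocksY;      // splits along the height
    int subBlockHeight;  // rows handled per sub-block
};

DepthwiseGrid planDepthwiseGrid(int OH, int OW) {
    const int n = 32;    // warp size: columns per sub-block
    const int k = 56;    // maximum rows per sub-block
    DepthwiseGrid g;
    g.subBlocksX = (OW + n - 1) / n;                 // last sub-block may have < 32 columns
    g.subBlocksY = (OH + k - 1) / k;                 // split tall outputs along the height
    g.subBlockHeight = (OH + g.subBlocksY - 1) / g.subBlocksY;
    return g;
}

int main() {
    DepthwiseGrid g = planDepthwiseGrid(112, 112);   // e.g., a 112 x 112 output map
    std::printf("sub-blocks: %d x %d, rows per sub-block: %d\n",
                g.subBlocksX, g.subBlocksY, g.subBlockHeight);
    return 0;
}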

4 OPTIMIZING POINTWISE CONVOLUTION

In this section, we explain the workflow of our dynamic tile size scheme for pointwise convolution. This approach extends the optimization for convolution operations in our prior work [28] to pointwise convolution. Our approach consists of three stages, described as follows.

In the first and second stages, we identify the parameters related to the tile size and determine candidate values for each parameter (Section 4.1 and Section 4.2). The first and second stages process input-dependent and input-independent parameters, respectively. In the third stage, as detailed in Algorithm 5, we iterate over all combinations of parameters and search for the combination that achieves the best SM utilization and data reuse (Section 4.3).

We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size, and they are not designed for pointwise convolutions with small inputs. Our dynamic tile size scheme avoids these two drawbacks. To improve SM utilization, our approach searches for the optimal tile size for the output based on the input size to generate a proper number of tiles to saturate the GPU and maximize data reuse. To optimize pointwise convolution with small inputs, we distribute channels across threads within a warp to increase the arithmetic intensity for each thread.

4.1 Determine Tiling Parameters

In our design, we use a 2-level tiling scheme, as shown in Fig. 7, to partition the output into block tiles and warp tiles. Each thread block processes one block tile and each warp processes one warp tile. The height dimension of the warp tile is shared among the 32 threads of a warp, and the width dimension of the warp tile is distributed across the 32 threads of a warp. Hence, we have two input-dependent parameters, namely the height and width of the warp tile, denoted as Warp_H and Warp_W respectively. We now introduce how to use the 2-level tiling scheme to determine candidate values for Warp_H and Warp_W.

4.1.1 A two-level tiling scheme

To divide the output into block tiles, we utilize two logical layouts of the output, L1 and L2, as shown in Fig. 7. F_N and I_N × I_H × I_W represent the filter and input dimensions of the output, respectively. Notice that our 2-level tiling can handle arbitrary input sizes since we do not require I_H = I_W. Before partitioning the output, we first select the layout of the output based on the size of the filter dimension. The rationale for choosing the filter dimension instead of the input dimension is as follows. The number of filters, F_N, is fixed once the structure of a CNN is determined, but the size of the input dimension is affected by the batch size, I_N, during inference and training. Therefore, it is easier to design our approach based on the size of the filter dimension. When F_N > 48, we choose layout L1 and distribute filter channels across the threads within a warp. Otherwise, we choose layout L2 and distribute input channels. The boundary F_N = 48 is determined as follows. Fig. 7 shows that in layout L2, the maximal value of F_N is 4 × Warp_H and Warp_H ≤ 12 (explained later in this section); therefore we have F_N ≤ 48 for layout L2.

Since both layouts follow a similar procedure, we take layout L1 as an illustrative example and give a brief description of layout L2 at the end of this section. After choosing the layout based on F_N, we partition the output along the filter dimension. First, we halve the filter dimension if F_N ≥ 512. The reason is that if we let each thread block process a large number of filters, then each thread needs to issue more than 15 global memory load instructions, which may cause MIO (Memory Input Output) instruction queue throttling and lead to performance degradation. Then, we halve both dimensions of each block tile and generate 2 × 2 warp tiles.

4.1.2 Determine candidates for Warp_H and Warp_W

Based on the partitioning method, we know that Warp_W can be calculated as Warp_W = F_N/4 or Warp_W = F_N/2. Thus, we only need to determine candidate values for Warp_H based on the size of the input dimension. In our design, when Warp_H > 12, we would need assembly-level optimizations like the work in [16], [39] for some configurations of pointwise convolutions to avoid register spills. In this work, however, we focus on higher-level rather than assembly-level optimizations, and thus set Warp_H ≤ 12. If the size of the input dimension is large, we prefer to choose a large Warp_H, because using a small Warp_H will generate many thread blocks and result in multiple loads of shared filters [40], [41]. If the size of the input dimension is small, we prefer to choose a small Warp_H, because using a large Warp_H will generate few thread blocks and result in SM underutilization. Since each thread loads at most 12 input elements (Warp_H ≤ 12), we set the upper limit of large Warp_H to 12 and the lower limit to 12/2 = 6. Therefore, the candidates for large Warp_H are Warp_H = {6, 7, 8, 9, 10, 11, 12}. The candidates for small Warp_H are Warp_H = {2, 3, 4, 5, 6, 7, 8}. In our experiments, there is no clear boundary between the large and small candidate sets of Warp_H, so we let both sets overlap in the middle values. The boundary between the large and small size of the input dimension is experimentally determined as I_N × I_H × I_W = 16 × 14 × 14.

Compared to layout L1, layout L2 swaps the input and filter dimensions. Hence, Warp_H can be calculated as Warp_H = F_N/4 or Warp_H = F_N/2. The candidate values for large Warp_W are Warp_W = {6, 7, 8, 9, 10, 11, 12} and for small Warp_W are Warp_W = {2, 3, 4, 5, 6, 7, 8}.
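The candidate-generation step for layout L1 can be summarized by the small host-side sketch below. The 16 × 14 × 14 input-size boundary and the two Warp_H sets come from the text; the function names, and the reading that Warp_W = F_N/2 normally and F_N/4 after the F_N ≥ 512 halving, are our own assumptions.

#include <vector>

// Sketch of the candidate generation of Section 4.1.2 (layout L1).
std::vector<int> warpHCandidates(int IN, int IH, int IW) {
    // Experimentally determined boundary between "large" and "small" inputs.
    const long long boundary = 16LL * 14 * 14;
    const long long inputDim = 1LL * IN * IH * IW;
    if (inputDim >= boundary)
        return {6, 7, 8, 9, 10, 11, 12};   // large inputs: prefer a large Warp_H
    return {2, 3, 4, 5, 6, 7, 8};          // small inputs: prefer a small Warp_H
}

std::vector<int> warpWCandidates(int FN) {
    // Warp_W = F_N / 2, or F_N / 4 when the filter dimension was halved
    // beforehand (which the paper does when F_N >= 512).
    return {FN / 2, FN / 4};
}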

4.2 Determine Candidates for Input-Independent Parameters

There are three input-independent parameters we need to consider, namely the number of warps in a thread block (Warp_num), the number of thread blocks that can run concurrently on an SM (Block_num) and the number of channels to be distributed (C_num).

4.2.1 Determine candidates for Warp_num and Block_num

When determining candidates for Warp_num, we need to consider that (1) a small warp number will decrease the opportunity to hide the memory access latency at the warp level, and (2) a large warp number will decrease the number of thread blocks and may lead to SM underutilization. We empirically set the warp number to four (Warp_num = 4), which gives good performance in our pilot study using microbenchmarks of hand-written pointwise convolution kernels. For the number of thread blocks, Block_num, we use two values, 2 and 4, on our evaluation platforms. These choices are justified as follows. For NVIDIA GPUs, each GPU thread can use up to 255 registers, and each SM has 65,536 registers. If we set Block_num = 1 and Warp_num = 4 (per our discussion above), each SM will have Block_num × Warp_num = 4 warps. This allows a thread block to use at most half of the available registers of an SM, because a thread block under this setting can use at most 4 (warps in an SM) × 32 (threads per warp) × 255 (registers per thread) = 32,640 registers. Therefore, to utilize the available hardware registers, one should set Block_num to be greater than one. We also found that setting Block_num > 4 during the search offers little benefit, and hence we set Block_num to either 2 or 4 (Block_num = {2, 4}).

4.2.2 Determine candidates for C_num

When searching for the optimal combination of parameters, a small tile size may be generated, which may lead to low arithmetic intensity that cannot hide the global memory access latency. For example, assume that the warp tile size is Warp_H × Warp_W = 8 × 64 with 56 channels, which means that one warp needs to convolve 8 input elements with 64 filter elements and accumulate the results 56 times to generate 8 × 64 = 512 elements. Since the height dimension is shared among the 32 threads of the warp, each thread loads 8 input elements, and since the width dimension is distributed across the 32 threads, each thread loads 2 filter elements. Therefore, each thread accumulates 56 channels of 8 × 2 = 16 elements. We can now estimate the arithmetic intensity of each thread for one iteration as (number of multiplications)/(number of elements) = (8 × 2)/(8 + 2) = 1.6. We can improve the arithmetic intensity by distributing channels across threads, as shown in Fig. 7. We distribute eight channels (C_num = 8) of each filter element across the 32 threads of the warp. In that case, each warp can process F_num = 32/C_num = 32/8 = 4 filter elements, and each thread processes T_num = Warp_W/F_num = 64/4 = 16 filter elements. The arithmetic intensity can then be estimated as (Warp_H × T_num)/(Warp_H + T_num) = (8 × 16)/(8 + 16) = 5.3. Higher arithmetic intensity increases the chance of hiding the global memory access latency. To fully utilize a warp, candidate values for C_num should be a power of 2. Thus, the candidates for C_num are C_num = {1, 2, 4, 8, 16, 32}.

4.3 Search For the Optimal Combination

4.3.1 Hardware resource constraints

When searching for the optimal combination of tiling and input-independent parameters, we focus on combinations that meet the hardware resource constraints, including registers and shared memory. In the rest of this section, we take layout L1 as an illustrative example. Based on Block_num, we calculate the number of registers each thread can use (Limit_R) and the size of shared memory each thread block can use (Limit_S) with the formulas Limit_R = Total_R/(Block_num × Warp_num × 32) and Limit_S = Total_S/Block_num, respectively. Total_R and Total_S represent the number of registers and the size of shared memory of an SM, respectively. On the RTX 2080Ti, Total_R = 65536 and Total_S = 64KB, while on the Jetson AGX Xavier, Total_R = 65536 and Total_S = 48KB.

In our approach, each warp processes one warp tile, which contains Warp_H × Warp_W output elements. Each thread calculates Warp_H × T_num elements and thus needs R_result = Warp_H × T_num and R_operand = Warp_H + T_num registers to store the results and operands, respectively. The constraints can be formulated as follows:

R_tmp = (C_num × 2 × Warp_W)/128 + (C_num × 2 × Warp_H)/128

R_result + R_operand + R_tmp + extra_R ≤ Limit_R    (1)

(2 × Warp_H + 2 × Warp_W) × C_num × 4 × 2 ≤ Limit_S    (2)

where R_tmp is the number of temporary registers used to store filter and input elements loaded from global memory, 2 × Warp_H and 2 × Warp_W represent the height and width of the block tile respectively, and 128 reflects that each thread block has Warp_num × 32 = 4 × 32 = 128 threads to load data from global memory. In Formula 1, extra_R is the number of additional registers allocated by the compiler, and its value is determined through an off-line method. In our experiments, we set extra_R = 40 because the NVIDIA CUDA compiler, on average, allocates 40 additional registers for each kernel on our evaluation platforms. These additional registers are usually used to store temporary variables for utilizing the GPU arithmetic pipelines. In Formula 2, the factor 4 means each element has 4 bytes and the factor 2 means we use a double-buffering method [33], [42].
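A direct transcription of these two constraints into a host-side check could look like the sketch below; the struct and function names are ours, and Total_R, Total_S and extra_R default to the values quoted above for the 2080Ti.

// Sketch: resource-constraint check of Section 4.3.1 (Formulas 1 and 2), FP32, layout L1.
struct TileConfig {
    int warpH, warpW;   // warp tile dimensions (Warp_H, Warp_W)
    int warpNum;        // warps per thread block (4 in the paper)
    int blockNum;       // concurrent thread blocks per SM (2 or 4)
    int cNum;           // channels distributed across a warp (C_num)
};

bool satisfiesResourceLimits(const TileConfig& c,
                             long long totalR = 65536,        // registers per SM
                             long long totalS = 64LL * 1024,  // shared memory per SM (2080Ti)
                             int extraR = 40) {               // compiler-allocated extras
    int tNum = c.warpW / (32 / c.cNum);                       // T_num: filter elements per thread
    long long limitR = totalR / (1LL * c.blockNum * c.warpNum * 32);
    long long limitS = totalS / c.blockNum;

    long long rResult  = 1LL * c.warpH * tNum;                // R_result
    long long rOperand = c.warpH + tNum;                      // R_operand
    long long rTmp     = (1LL * c.cNum * 2 * c.warpW) / 128   // R_tmp: staging registers
                       + (1LL * c.cNum * 2 * c.warpH) / 128;  // for global-memory loads
    bool regOK  = rResult + rOperand + rTmp + extraR <= limitR;               // Formula 1
    bool smemOK = (2LL * c.warpH + 2LL * c.warpW) * c.cNum * 4 * 2 <= limitS; // Formula 2
    return regOK && smemOK;
}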

Fig. 7. Workflow of our 2-level tiling and channel distribution methods. (Recoverable figure labels: the logical layouts L1 and L2 of the output; partitioning the output into block tiles; partitioning a block tile into 4 warp tiles across 32 threads; distributing channels of filters or inputs. In the illustrated example, each warp calculates 8 channels of 4 filter or input elements of the output.)

Algorithm 5: Optimized Pointwise Convolution
Input: I, F
Output: O
// the following steps are executed on the CPU
1 Determine candidates for the relevant parameters;
2 foreach parameter combination do
3   if the constraints of Formulas 1 and 2 are not satisfied then
4     continue;
5   Calculate SM_util and AI with Formulas 3 and 4;
6 Choose the combinations whose SM_util is close to 1;
7 Among the chosen combinations, choose the combination with the maximal AI;
8 Choose the kernel based on the chosen combination;
// the following steps are executed on the GPU
9 Load C_num channels of a block tile into the shared memory array sharedBuf1;
10 __syncthreads();
11 for iter ← 0 to I_C by 2 × C_num do
12   Load the next C_num channels into R_tmp;
13   Load channels from sharedBuf1 into R_operand;
14   Accumulate output elements into R_result;
15   Write R_tmp into sharedBuf2;
16   __syncthreads();
17   Repeat the above steps but swap sharedBuf1 and sharedBuf2;
18 Use a segmented parallel reduction to obtain the final output elements and write the result to O;

4.3.2 Searching workflow

To guide the search for the optimal combination of parameters, we use two metrics, named SM utilization (SM_util) and arithmetic intensity (AI). The two metrics are calculated as follows:

Block_count = F_N/(2 × Warp_W) × (I_N × I_H × I_W)/(2 × Warp_H)

SM_util = Block_count/(Block_num × SM_num)    (3)

AI = (Warp_H × T_num)/(Warp_H + T_num)    (4)

where Block_count is the number of generated thread blocks and SM_num is the number of SMs on a GPU. For the RTX 2080Ti and the Jetson AGX Xavier, SM_num = 68 and SM_num = 8, respectively.

The whole workflow is described in Algorithm 5. We first determine candidates for the relevant parameters, including Warp_H, Warp_W, Warp_num, Block_num and C_num, based on the size of the input and filter (Line 1). Then we iterate over all combinations of parameters (Line 2), and keep the combinations that satisfy the constraints Limit_R (Formula 1) and Limit_S (Formula 2) (Line 3).

Next, we calculate the values of SM_util (Formula 3) and AI (Formula 4) for all satisfying combinations (Line 5) and select the optimal combination with the following steps (Lines 6-7):

Step 1 If SM_util ≥ 1 holds for all combinations, we select the combinations that possess the smallest, or close to the smallest, SM_util. The reason is that when SM_util ≥ 1, all SMs are utilized, in which case we want to reduce the number of thread blocks to reduce the number of loads of shared filters or inputs between multiple thread blocks.
Step 2 If there exist combinations such that SM_util < 1, we first collect these combinations. Then, among the collected combinations, we select the ones that possess the biggest, or close to the biggest, SM_util. The reason is that when SM_util < 1, there are idle SMs, in which case we want to increase SM_util to fully utilize the SMs. We do not want SM_util to exceed 1 because that would incur more memory operations.
Step 3 Among the candidate combinations selected in Step 1 and Step 2, we select the combination with the maximum value of AI, because higher arithmetic intensity can hide more global memory access latency.

Last, we choose the pointwise convolution kernel based on the selected combination (Line 8). In this kernel, each thread block first loads C_num channels of the corresponding block tile into the shared memory array sharedBuf1 (Line 9). Meanwhile, the thread block loads the next C_num channels of the block tile into temporary registers (Line 12). Then, we load data from sharedBuf1 into registers (Line 13) and accumulate output elements into registers (Line 14). Next, we write the data in the temporary registers into sharedBuf2 (Line 15). The kernel repeats this process until all channels have been accumulated into output elements. Finally, we use a warp-level segmented parallel reduction to reduce the results of different channels into the final result and write the results to global memory (Line 18).
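Combining the two metrics, the CPU-side selection of Algorithm 5 (Lines 1-8) can be sketched as follows; the enumeration of parameter combinations is omitted, the 0.05 tolerance used to interpret "close to" is an arbitrary choice of ours, and the helper names are illustrative.

#include <cmath>
#include <vector>

// Sketch of the CPU-side selection of Algorithm 5, using Formulas 3 and 4.
struct Candidate {
    int warpH, warpW, warpNum, blockNum, cNum;
    double smUtil, ai;   // filled in with Formulas 3 and 4
};

double smUtilization(int FN, long long inputDim, int warpH, int warpW,
                     int blockNum, int smNum) {
    double blockCount = (FN / (2.0 * warpW)) * (inputDim / (2.0 * warpH));
    return blockCount / (blockNum * smNum);          // Formula 3
}

double arithmeticIntensity(int warpH, int warpW, int cNum) {
    int tNum = warpW / (32 / cNum);                  // T_num
    return double(warpH * tNum) / (warpH + tNum);    // Formula 4
}

Candidate selectBest(const std::vector<Candidate>& cands) {
    // Steps 1-2: prefer the largest SM_util not exceeding 1; if every
    // combination has SM_util >= 1, fall back to the smallest SM_util.
    double targetUtil = -1.0;
    for (const Candidate& c : cands)
        if (c.smUtil <= 1.0 && c.smUtil > targetUtil) targetUtil = c.smUtil;
    if (targetUtil < 0.0)
        for (const Candidate& c : cands)
            if (targetUtil < 0.0 || c.smUtil < targetUtil) targetUtil = c.smUtil;
    // Step 3: among combinations close to the target, maximise AI.
    Candidate best{}; double bestAI = -1.0;
    for (const Candidate& c : cands)
        if (std::fabs(c.smUtil - targetUtil) < 0.05 && c.ai > bestAI) { best = c; bestAI = c.ai; }
    return best;
}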

5 EXPERIMENTAL SETUP

5.1 Evaluation Platforms

We apply our approach to two GPU platforms. The first platform has an NVIDIA RTX 2080Ti GPU (2080Ti), which integrates 4350 CUDA cores for floating-point computation and 4350 CUDA cores for integer operations. The GPU has 64KB of shared memory. The host machine has a 2.30GHz Intel Xeon E5-2697 CPU with 252GB memory, running Linux kernel v4.15.0. We use CUDA Toolkit 11.0 and cuDNN 7.6.5. The second platform is an embedded GPU platform. It has an NVIDIA Jetson AGX Xavier GPU (Xavier), which integrates 512 Volta cores and 48KB shared memory. The host machine has a 1.2GHz 8-core ARM CPU with 32GB memory, running Linux kernel v4.9.140-tegra. We use CUDA Toolkit 10.0 and cuDNN 7.6.3.

5.2 Competing Methods

We compare our approach against cuDNN [22], which supports a wide range of convolution operations, including depthwise and pointwise convolutions optimized for GPUs. Moreover, cuDNN can execute GEMM-, FFT- and Winograd-based convolutions, allowing us to compare our techniques with mainstream convolution methods. TensorFlow [43] is one of the mainstream machine learning frameworks. We also compare our approach against the TensorFlow implementations of depthwise and pointwise convolutions.

5.3 Performance Report

We apply our approach to the depthwise and pointwise convolutions of DSC. We run each test case ten times with batch sizes of 1, 8, 16, 32, 64 and 128 on an unloaded machine and report the averaged running time. We found little variance across execution runs, less than 2%. We run convolutions with two data types, 32-bit floating point (FP32) for normal CNNs and 8-bit integer (INT8) for quantized CNNs [44]. In our experiments, we utilize the data layouts NCHW and NHWC for FP32 and INT8, respectively, where N, C, H, W respectively denote the batch size, the number of channels, the height and the width. CUDA [45] provides an 8-bit integer 4-element vector dot product (DP4A) instruction that performs the vector dot product between two 4-element vectors and accumulates the result in a 32-bit integer. Utilizing the DP4A instruction, we can group four contiguous channels of the INT8 data type into a 4-element vector to perform convolution. Therefore, we utilize the NHWC data layout for the INT8 data type due to its better performance over NCHW.

In this work, we first test depthwise convolution with two filter sizes, 3 × 3 and 5 × 5, because these are commonly used filter sizes. Then, we report the performance of pointwise convolution. Lastly, we apply our optimized depthwise and pointwise convolutions to the standard and quantized MobileNetV2 and EfficientNet-B0 to report the performance of both inference and training.
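As an illustration of how DP4A is used with the NHWC layout, the minimal kernel fragment below accumulates four packed INT8 channels per instruction for a single 1 × 1 filter; it is a simplified sketch (one filter, no quantization scaling), not the tuned kernel described in Section 4.

#include <cstdint>

// Sketch: INT8 pointwise accumulation with DP4A. `input` and `filter` pack
// four consecutive channels of one spatial position into each 32-bit word
// (NHWC layout); `packedChannels` is C / 4. Dequantization is omitted.
__global__ void pointwiseInt8Sketch(const int32_t* input, const int32_t* filter,
                                    int32_t* output, int numPixels,
                                    int packedChannels) {
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel >= numPixels) return;
    int acc = 0;
    for (int c = 0; c < packedChannels; ++c)
        // acc += dot(4 x int8, 4 x int8), accumulated in a 32-bit integer.
        acc = __dp4a(input[pixel * packedChannels + c], filter[c], acc);
    output[pixel] = acc;   // one output channel per filter; requantization omitted
}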
6 EXPERIMENTAL RESULTS

In this section, we report results for depthwise convolution (Section 6.1) and pointwise convolution (Section 6.2), as well as the inference and training of MobileNetV2 and EfficientNet-B0 (Section 6.3), showing that our approach consistently outperforms the alternative methods by delivering the overall best performance.

6.1 Depthwise Convolution

6.1.1 Setup

In this experiment, we compare our approach against the depthwise convolution implementations of cuDNN and TensorFlow. During the experiments, we compared our approach against seven algorithms in cuDNN, including IMPLICIT GEMM (IMPLICIT), IMPLICIT PRECOMP GEMM (PRECOMP), GEMM, FFT, FFT TILING (TILING), WINOGRAD and WINOGRAD NONFUSED (NONFUSED). We found that IMPLICIT and PRECOMP give the best performance in all our test cases. We therefore report the results by comparing our approach against IMPLICIT, PRECOMP and TensorFlow, and take GEMM as the baseline in this evaluation. Table 1 gives the layer configurations used in this experiment, where the notations were defined earlier in Section 2.3.

TABLE 1
Layer configurations of depthwise convolutions.

LAYER   I_N                I_C   I_H × I_W   F_H × F_W   S
CONV1   1,8,16,32,64,128    16   112×112     3×3, 5×5    2
CONV2   1,8,16,32,64,128    72   56×56       3×3, 5×5    2
CONV3   1,8,16,32,64,128    88   28×28       3×3, 5×5    1
CONV4   1,8,16,32,64,128    96   28×28       3×3, 5×5    2
CONV5   1,8,16,32,64,128    96   14×14       3×3, 5×5    1
CONV6   1,8,16,32,64,128   120   14×14       3×3, 5×5    1
CONV7   1,8,16,32,64,128   192   14×14       3×3, 5×5    1
CONV8   1,8,16,32,64,128   240   14×14       3×3, 5×5    2
CONV9   1,8,16,32,64,128   432   7×7         3×3, 5×5    1

6.1.2 Overall results

FP32 implementation. Fig. 8 shows that our approach gives the best speedup in nearly all test cases. Table 2 presents the average speedups of IMPLICIT, PRECOMP, our approach and TensorFlow over GEMM for the 3 × 3 and 5 × 5 filter sizes on 2080Ti and Xavier.

TABLE 2
Average speedups of four depthwise convolution implementations with FP32 over GEMM.

              3×3, 2080Ti   3×3, Xavier   5×5, 2080Ti   5×5, Xavier
IMPLICIT          1.1           32.8           1.1          20.0
PRECOMP           1.1            1.2           1.0           1.4
ours              2.2           42.8           3.9          39.4
TensorFlow        1.8           34.6           2.2          25.3

The PRECOMP and GEMM algorithms need extra memory operations to compute the output elements. Consequently, both algorithms are not suitable for depthwise convolution. TensorFlow achieves better speedups than IMPLICIT because it employs several specially designed kernels to increase the GPU utilization for different input sizes. However, both TensorFlow and IMPLICIT do not optimize the memory performance of depthwise convolution. Compared to TensorFlow, our approach achieves an average speedup of 1.5× and 1.6× when using a 3 × 3 filter on 2080Ti and Xavier respectively, and 2.2× and 1.7× when using a 5 × 5 filter on 2080Ti and Xavier respectively.

cuDNN IMPLICIT cuDNN PRECOMP ours TensorFlow cuDNN IMPLICIT cuDNN PRECOMP ours TensorFlow
CONV1 CONV2 CONV3 CONV1 CONV2 CONV3
4
7
2 4
1
1
CONV4 CONV5 CONV6 4 CONV4 CONV5 CONV6
Speedup

Speedup
2
2
1
1
3 CONV7 CONV8 CONV9 4 CONV7 CONV8 CONV9
2
2
1 1

1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128


Batch Size Batch Size
(a) Speedups on 2080Ti for the 3 × 3 fitler. (b) Speedups on 2080Ti for the 5 × 5 fitler.
cuDNN IMPLICIT cuDNN PRECOMP ours TensorFlow cuDNN IMPLICIT cuDNN PRECOMP ours TensorFlow
CONV1 CONV2 CONV3 CONV1 CONV2 CONV3
50 60

25 30
10 10
80 CONV4 CONV5 CONV6 80
CONV4 CONV5 CONV6
Speedup

Speedup
40 40
20
10
CONV7 CONV8 CONV9 100 CONV7 CONV8 CONV9
90
50 50
20
10
1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128 1 8 16 32 64 128
Batch Size Batch Size
(c) Speedups on Xavier for the 3 × 3 fitler. (d) Speedups on Xavier for the 5 × 5 fitler.
Fig. 8. Speedups of IMPLICIT, PRECOMP, our approach and TensorFlow over the baseline implementation (GEMM) for FP32 depthwise convolution
with filters of size 3 × 3 and 5 × 5 on two platforms.

PRECOMP and GEMM need extra memory operations to compute the output elements; consequently, both algorithms are not suitable for depthwise convolution. TensorFlow achieves better speedups than IMPLICIT because it employs several specially designed kernels to increase GPU utilization for different input sizes. However, neither TensorFlow nor IMPLICIT optimizes the memory performance of depthwise convolution. Compared to TensorFlow, our approach achieves an average speedup of 1.5× and 1.6× when using a 3 × 3 filter on 2080Ti and Xavier respectively, and 2.2× and 1.7× when using a 5 × 5 filter on 2080Ti and Xavier respectively. Since IMPLICIT is closed source, we analyze its performance through CUDA Nsight Compute [46] and present the results in Section 6.1.3. Overall, our approach improves over IMPLICIT by 2.0× and 1.4× when using a 3 × 3 filter on 2080Ti and Xavier respectively, and by 3.5× and 2.1× when using a 5 × 5 filter on 2080Ti and Xavier respectively.

INT8 implementation. We found that FP32 gives a speedup of more than 10× over the INT8 version for depthwise convolution in cuDNN. This is because the INT8 version has the overhead of dequantization (i.e., converting the results from INT8 to FP32 after convolution) and cannot fully utilize the DP4A instruction to accelerate INT8 convolution. We note that TensorFlow does not optimize depthwise convolution for INT8. Nonetheless, our approach gives over 10× speedups when using INT8 over cuDNN and TensorFlow.

6.1.3 Further analysis

Our performance gain is mainly attributed to the reduced number of memory accesses offered by our column and row reuse algorithms.

Fig. 9 reports the measured LDG (load from global memory) instruction counts and SM utilization for the fast IMPLICIT algorithm and our approach when using a 3 × 3 filter and a batch size of 32 on 2080Ti. Other configurations follow a similar performance trend. We can see in Fig. 9a that the IMPLICIT algorithm has an average of 2× higher SM utilization compared to our approach. The reason our approach leads to lower SM utilization is as follows. Our row reuse algorithm performs better when a thread operates on more rows of the output. However, the more rows a thread computes, the fewer warps and thread blocks we can generate, and without enough warps running on the SMs, the SM utilization degrades. Although IMPLICIT has high SM utilization, this does not translate into good performance for depthwise convolution. The reason is that depthwise convolution has a low computational requirement and is more sensitive to memory performance; hence, the focus of performance optimization should be reducing the memory access latency. If we now look at Fig. 9b, we see that the row and column reuse techniques reduce memory operations, with up to 4.5× fewer LDG instructions executed compared to IMPLICIT. By reducing the memory access overhead, which dominates the execution time of depthwise convolution, our approach thus leads to better overall performance than cuDNN despite its lower SM utilization.
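The fragment below sketches the row reuse idea for a 3 × 3, stride-1 depthwise convolution: a thread accumulates ROWS consecutive output rows in registers, and every input value is loaded from global memory once and immediately added to all output rows whose filter window covers it. This is a simplified illustration under assumed conditions (a padding of 1, with boundary handling and the column reuse via warp shuffles omitted); the function and parameter names are ours and not those of the released kernels.

// Row reuse sketch for a 3x3, stride-1 depthwise convolution. Each call
// accumulates the contribution of one input column to ROWS consecutive
// output rows held in registers; every input element is read exactly once.
#define ROWS 4

__device__ void depthwise_row_reuse(const float* __restrict__ in_col, // one input column
                                    const float flt[3][3],            // 3x3 filter in registers
                                    float acc[ROWS],                  // per-thread output rows
                                    int in_height, int out_y, int kx) {
  for (int dy = 0; dy < ROWS + 2; ++dy) {   // input rows touched by the ROWS outputs
    int in_y = out_y + dy - 1;              // assumes a padding of 1
    if (in_y < 0 || in_y >= in_height) continue;
    float v = in_col[in_y];                 // single global load of this element
    // Reuse v for every output row whose 3x3 window contains input row in_y.
    for (int r = 0; r < ROWS; ++r) {
      int ky = dy - r;                      // filter row index for output row r
      if (ky >= 0 && ky < 3) acc[r] += v * flt[ky][kx];
    }
  }
}

Loading one element and scattering it into all dependent accumulators keeps the register footprint small compared with buffering every required input element (cf. the comparison with [24] in Section 7).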
From Fig. 8 we can observe that the speedups of our approach over IMPLICIT fluctuate within a small range as the batch size increases. Neither IMPLICIT nor our approach benefits from higher GPU utilization because depthwise convolution is memory bound; thus, the performance of both grows at the same rate.

Fig. 9. SM utilizations and ratios of executed LDG instruction counts for depthwise convolutions with a batch size of 32 and a filter size of 3 × 3 on the NVIDIA 2080Ti GPU: (a) SM utilizations of IMPLICIT and our approach; (b) the ratio of executed LDG (load from global memory) instruction counts given by IMPLICIT over our approach (ratio = LDG instruction count of cuDNN / LDG instruction count of ours), measured across CONV1–CONV9.

6.1.4 Summary

By reducing the number of memory accesses, our approach leads to faster memory access and overall quicker computation when performing depthwise convolutions. Compared to the fastest available algorithms in cuDNN, our approach achieves an average speedup of 2.8× and 1.8× when performing depthwise convolutions on 2080Ti and Xavier, respectively.

6.2 Pointwise Convolution

6.2.1 Setup

In this experiment, TensorFlow uses the cuDNN implementations as its backend. Therefore, we only compare our approach against all available pointwise convolution implementations in cuDNN. The reported execution time of our approach includes the code running on both the CPU and the GPU, as described in Algorithm 5. We use the layer configurations from MobileNetV2 and EfficientNet-B0 in this experiment. Across the different layers of the MobileNetV2 and EfficientNet-B0 models, there are 30 different configurations for pointwise convolution. We test all these configurations and report the performance of 20 selected layers. The other 10 layers exhibit similar performance to the selected ones and hence are omitted for clarity. We report the performance when the batch size is set to 1, 8, 16, 32, 64 and 128.

To aid clarity, we compare our approach to the best-performing alternative schemes: IMPLICIT and PRECOMP for FP32, and PRECOMP for INT8. When using the INT8 data type, PRECOMP performs better than IMPLICIT in 180 out of 180 test cases on 2080Ti and in 127 out of 180 test cases on Xavier. For FP32, we normalize the speedup over GEMM. For INT8, because GEMM does not support this data type, we show the speedup over IMPLICIT. Table 3 lists the layer configurations and the parameter values generated for WarpH, WarpW, Blocknum and Cnum (Warpnum = 4). The notations can be found in Section 2.3.

Normally, for a given convolution layer configuration, when the width of the logical layout of the output (Fig. 7) is small, our scheme tends to choose a small Cnum. This allows one to generate more warps to utilize the GPU. On the other hand, when that width is large, our scheme tends to choose a large Cnum to reduce the number of warps, because there are already enough warps to maximize the utilization of the GPU. A special case (taking the parameter tuples generated for 2080Ti as examples) is the layer configuration CONV9 with IN = 1. The width of the logical layout of CONV9 is small, hence we would expect the search to pick a small Cnum. However, in this case Cnum = 32, which is large. The reason is that our scheme finds that different values of Cnum give similar GPU utilization; it then tries to maximize AI (Formula 4) and therefore chooses Cnum = 32 (the relationship between AI and Cnum is detailed in Section 4.2.2).
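The decision logic described above can be summarized by the host-side sketch below. It is not the search procedure of our implementation: Formula 4 and the real utilization model are not reproduced here, and estimate_utilization, estimate_ai and the candidate sets are placeholders introduced only to make the selection order (utilization first, arithmetic intensity as the tie-breaker) explicit.

#include <vector>

// Stand-in models, illustrative only.
double estimate_ai(int c_num) { return static_cast<double>(c_num); }

double estimate_utilization(int logical_width, int batch, int warp_w, int c_num) {
  // Stand-in: larger warp_w and c_num mean fewer warps; utilization saturates
  // once the GPU is full. 2176 = 68 SMs x 32 resident warps is used purely as
  // an illustrative capacity figure for 2080Ti.
  double warps = static_cast<double>(batch) * logical_width / warp_w / c_num;
  double capacity = 2176.0;
  return warps > capacity ? 1.0 : warps / capacity;
}

struct TileChoice { int warp_w; int c_num; };

TileChoice pick_tile(int logical_width, int batch,
                     const std::vector<int>& warp_w_candidates,
                     const std::vector<int>& c_num_candidates) {
  TileChoice best{0, 0};
  double best_util = -1.0, best_ai = -1.0;
  for (int warp_w : warp_w_candidates) {
    for (int c_num : c_num_candidates) {
      double util = estimate_utilization(logical_width, batch, warp_w, c_num);
      double ai   = estimate_ai(c_num);
      // Prefer the candidate with higher estimated SM utilization; when the
      // utilization is (nearly) the same, prefer the higher arithmetic
      // intensity, i.e. the larger c_num (as for CONV9 with IN = 1).
      bool better = util > best_util + 1e-3 ||
                    (util > best_util - 1e-3 && ai > best_ai);
      if (better) { best = {warp_w, c_num}; best_util = util; best_ai = ai; }
    }
  }
  return best;
}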
6.2.2 Overall results

FP32 implementation. Fig. 10 shows the speedups of IMPLICIT, PRECOMP and our approach for pointwise convolutions on the two platforms. The baseline is GEMM. The average speedups of IMPLICIT, PRECOMP and our approach on 2080Ti and Xavier are shown in Table 4. The performance delivered by our approach translates to an improvement of 2× and 1.5× over IMPLICIT on 2080Ti and Xavier respectively.

Fig. 10. Speedups of IMPLICIT, PRECOMP and ours over GEMM for pointwise convolutions with FP32 on 2080Ti (top) and Xavier (bottom), covering CONV1–CONV20 and batch sizes of 1, 8, 16, 32, 64 and 128.

INT8 implementation. Fig. 11 shows the speedups of PRECOMP and our approach over IMPLICIT for pointwise convolution. Table 4 presents the average speedups of PRECOMP and our approach over IMPLICIT on 2080Ti and Xavier. Overall, our approach obtains a 1.3× and 1.5× improvement over PRECOMP on 2080Ti and Xavier respectively.

Fig. 11. Speedups of PRECOMP and ours over IMPLICIT for pointwise convolutions with INT8 on 2080Ti (top) and Xavier (bottom), covering CONV1–CONV20 and batch sizes of 1, 8, 16, 32, 64 and 128.

6.2.3 Further analysis

Fig. 12 reports the measured SM utilizations and the ratios of executed LDG instruction counts of IMPLICIT to our approach when using a batch size of 32 on 2080Ti. Other configurations show a similar performance trend.

For a specific layer configuration, our approach tries to find a suitable number of thread blocks to utilize the GPU. While using a higher number of thread blocks can improve GPU utilization, doing so can also incur frequent reloads of filters or inputs shared between thread blocks. As shown in Fig. 12b, our approach leads to 2× more LDG instructions than IMPLICIT in some cases. Although using more LDG instructions incurs extra memory load overhead, our approach still gives 2× faster execution time than IMPLICIT, due to our schemes for improving SM utilization and hiding the memory access latency, as elaborated below.
TABLE 3
Layer configurations of pointwise convolutions (FC = IC, IW = IH and FW = FH = 1) and parameter values of WarpH, WarpW, Blocknum and Cnum (Warpnum = 4). For each layer, the first row gives the tuples (WarpH, WarpW, Blocknum, Cnum) generated for 2080Ti and the second row, in square brackets, the tuples generated for Xavier; the six tuples per row correspond to IN = 1, 8, 16, 32, 64 and 128.

CONV1 (IC = 16, IH = 56, FN = 8)
  2080Ti: (4, 12, 2, 8)  (4, 47, 4, 2)  (4, 480, 4, 1)  (4, 480, 4, 1)  (4, 480, 4, 1)  (4, 1216, 2, 32)
  Xavier: [4, 50, 4, 2]  [4, 480, 4, 1]  [4, 1216, 2, 32]  [4, 1216, 2, 32]  [4, 1216, 2, 32]  [4, 1216, 2, 32]
CONV2 (IC = 8, IH = 56, FN = 16)
  2080Ti: (8, 12, 2, 8)  (8, 47, 4, 4)  (8, 256, 4, 1)  (8, 256, 4, 1)  (8, 672, 2, 32)  (8, 672, 2, 32)
  Xavier: [8, 50, 4, 4]  [8, 672, 2, 32]  [8, 672, 2, 32]  [8, 672, 2, 32]  [8, 672, 2, 32]  [8, 672, 2, 32]
CONV3 (IC = 16, IH = 56, FN = 72)
  2080Ti: (12, 36, 2, 8)  (12, 36, 4, 4)  (12, 36, 4, 4)  (12, 36, 4, 4)  (12, 36, 4, 4)  (12, 36, 4, 4)
  Xavier: [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]
CONV4 (IC = 72, IH = 28, FN = 24)
  2080Ti: (6, 6, 2, 32)  (6, 47, 2, 4)  (6, 47, 4, 4)  (6, 320, 4, 1)  (6, 320, 4, 1)  (6, 864, 2, 32)
  Xavier: [6, 50, 2, 4]  [6, 320, 4, 1]  [6, 864, 2, 32]  [6, 864, 2, 32]  [6, 864, 2, 32]  [6, 864, 2, 32]
CONV5 (IC = 24, IH = 28, FN = 96)
  2080Ti: (3, 48, 2, 2)  (12, 48, 4, 2)  (12, 48, 4, 2)  (12, 48, 4, 2)  (12, 48, 4, 2)  (12, 48, 4, 2)
  Xavier: [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]
CONV6 (IC = 96, IH = 14, FN = 24)
  2080Ti: (6, 2, 2, 32)  (6, 12, 2, 16)  (6, 24, 2, 8)  (6, 47, 2, 4)  (6, 47, 4, 4)  (6, 320, 4, 1)
  Xavier: [6, 13, 2, 16]  [6, 50, 4, 4]  [6, 320, 4, 1]  [6, 320, 4, 1]  [6, 864, 2, 32]  [6, 864, 2, 32]
CONV7 (IC = 24, IH = 14, FN = 96)
  2080Ti: (2, 48, 2, 1)  (6, 48, 2, 4)  (12, 48, 2, 8)  (12, 48, 4, 2)  (12, 48, 4, 2)  (12, 48, 4, 2)
  Xavier: [7, 48, 2, 4]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]  [12, 48, 4, 2]
CONV8 (IC = 32, IH = 14, FN = 192)
  2080Ti: (2, 96, 2, 1)  (6, 96, 2, 2)  (12, 96, 2, 4)  (12, 96, 4, 1)  (12, 96, 4, 1)  (12, 96, 4, 1)
  Xavier: [7, 96, 2, 2]  [12, 96, 4, 1]  [12, 96, 4, 1]  [12, 96, 4, 1]  [12, 96, 4, 1]  [12, 96, 4, 1]
CONV9 (IC = 192, IH = 14, FN = 48)
  2080Ti: (12, 2, 2, 32)  (12, 12, 2, 32)  (12, 24, 2, 16)  (12, 47, 2, 8)  (12, 47, 4, 2)  (12, 160, 4, 1)
  Xavier: [12, 13, 2, 32]  [12, 50, 4, 2]  [12, 160, 4, 1]  [12, 480, 2, 32]  [12, 480, 2, 32]  [12, 480, 2, 32]
CONV10 (IC = 96, IH = 14, FN = 40)
  2080Ti: (10, 2, 2, 32)  (10, 12, 2, 32)  (10, 24, 2, 16)  (10, 47, 2, 8)  (10, 47, 4, 4)  (10, 192, 4, 1)
  Xavier: [10, 13, 2, 16]  [10, 50, 4, 2]  [10, 192, 4, 1]  [10, 544, 2, 32]  [10, 544, 2, 32]  [10, 544, 2, 32]
CONV11 (IC = 40, IH = 14, FN = 120)
  2080Ti: (2, 60, 2, 1)  (6, 60, 2, 2)  (12, 60, 2, 4)  (12, 60, 4, 2)  (12, 60, 4, 2)  (12, 60, 4, 2)
  Xavier: [7, 60, 2, 4]  [12, 60, 4, 2]  [12, 60, 4, 2]  [12, 60, 4, 2]  [12, 60, 4, 2]  [12, 60, 4, 2]
CONV12 (IC = 120, IH = 14, FN = 32)
  2080Ti: (8, 2, 2, 32)  (8, 12, 2, 16)  (8, 24, 2, 8)  (8, 47, 2, 4)  (8, 47, 4, 4)  (8, 256, 4, 1)
  Xavier: [8, 13, 2, 16]  [8, 50, 4, 4]  [8, 256, 4, 1]  [8, 256, 4, 1]  [8, 672, 2, 32]  [8, 672, 2, 32]
CONV13 (IC = 40, IH = 14, FN = 240)
  2080Ti: (2, 120, 2, 1)  (6, 120, 2, 1)  (12, 120, 2, 4)  (12, 120, 4, 1)  (12, 120, 4, 1)  (12, 120, 4, 1)
  Xavier: [7, 120, 2, 2]  [12, 120, 4, 1]  [12, 120, 4, 1]  [12, 120, 4, 1]  [12, 120, 4, 1]  [12, 120, 4, 1]
CONV14 (IC = 240, IH = 7, FN = 64)
  2080Ti: (2, 32, 2, 2)  (2, 32, 2, 2)  (3, 32, 2, 2)  (6, 32, 2, 4)  (12, 32, 2, 8)  (12, 32, 4, 4)
  Xavier: [2, 32, 2, 2]  [7, 32, 4, 8]  [12, 32, 4, 4]  [12, 32, 4, 4]  [12, 32, 4, 4]  [12, 32, 4, 4]
CONV15 (IC = 64, IH = 7, FN = 240)
  2080Ti: (2, 120, 2, 1)  (2, 120, 2, 1)  (3, 120, 2, 1)  (6, 120, 2, 1)  (12, 120, 2, 4)  (12, 120, 4, 1)
  Xavier: [2, 120, 2, 1]  [7, 120, 4, 2]  [12, 120, 4, 1]  [12, 120, 4, 1]  [12, 120, 4, 1]  [12, 120, 4, 1]
CONV16 (IC = 72, IH = 7, FN = 432)
  2080Ti: (2, 216, 2, 1)  (2, 216, 2, 1)  (3, 216, 2, 1)  (6, 216, 2, 1)  (12, 216, 2, 2)  (9, 216, 4, 32)
  Xavier: [2, 216, 2, 1]  [7, 216, 4, 1]  [9, 216, 4, 32]  [9, 216, 4, 32]  [9, 216, 4, 32]  [9, 216, 4, 32]
CONV17 (IC = 432, IH = 7, FN = 112)
  2080Ti: (2, 56, 2, 1)  (2, 56, 2, 1)  (3, 56, 2, 1)  (6, 56, 2, 4)  (12, 56, 2, 8)  (12, 56, 4, 2)
  Xavier: [2, 56, 2, 1]  [7, 56, 4, 4]  [12, 56, 4, 2]  [12, 56, 4, 2]  [12, 56, 4, 2]  [12, 56, 4, 2]
CONV18 (IC = 112, IH = 7, FN = 432)
  2080Ti: (2, 216, 2, 1)  (2, 216, 2, 1)  (3, 216, 2, 1)  (6, 216, 2, 1)  (12, 216, 2, 2)  (9, 216, 4, 32)
  Xavier: [2, 216, 2, 1]  [7, 216, 4, 1]  [9, 216, 4, 32]  [9, 216, 4, 32]  [9, 216, 4, 32]  [9, 216, 4, 32]
CONV19 (IC = 432, IH = 7, FN = 72)
  2080Ti: (2, 36, 2, 1)  (2, 36, 2, 1)  (3, 36, 2, 2)  (6, 36, 2, 4)  (12, 36, 2, 8)  (12, 36, 4, 4)
  Xavier: [2, 36, 2, 1]  [7, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]  [12, 36, 4, 4]
CONV20 (IC = 432, IH = 7, FN = 1024)
  2080Ti: (2, 256, 2, 1)  (3, 256, 2, 1)  (6, 256, 2, 1)  (12, 256, 2, 1)  (8, 256, 4, 32)  (8, 256, 4, 32)
  Xavier: [4, 256, 2, 1]  [8, 256, 4, 32]  [8, 256, 4, 32]  [8, 256, 4, 32]  [8, 256, 4, 32]  [8, 256, 4, 32]
TABLE 4
Average speedups of three pointwise convolution implementations over GEMM for FP32 and over IMPLICIT for INT8.

              FP32, 2080Ti   FP32, Xavier   INT8, 2080Ti   INT8, Xavier
  IMPLICIT    1.5            1.3            1.0            1.0
  PRECOMP     1.3            1.1            1.3            1.2
  ours        3.0            2.0            1.7            1.6

Our approach exhibits a much higher SM utilization than IMPLICIT. Unlike depthwise convolution, improving SM utilization is key for optimizing pointwise convolution, because utilizing more SMs can significantly accelerate the computation. As can be seen from Fig. 12a, our approach has an average of 1.9× higher SM utilization than IMPLICIT. IMPLICIT is optimized for training and large-batch inference. It uses a fixed tile size work distribution strategy, which fails to utilize the SMs efficiently when using a batch size of 128 or smaller. Our dynamic tile size scheme (Section 4.1) overcomes this limitation by adaptively determining the right tile size to use at runtime, which thus leads to better SM utilization and performance improvement.

To hide the global memory access latency, our approach employs the double buffering and channel distribution techniques described in Section 4.2. To quantify the benefit of our memory optimization strategies, consider now Fig. 13, which shows the average number of cycles each GPU warp spends waiting for global memory access operations to complete. As a baseline, we implemented a simple pointwise convolution without latency hiding, denoted as simple. We can see that our approach significantly reduces the memory access latency compared to the simple implementation. Therefore, although our approach incurs a larger number of LDG instructions, much of the memory access overhead can be hidden by our memory optimization strategy.
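The skeleton below illustrates the double-buffering part of this latency-hiding strategy for a tiled pointwise (1 × 1) convolution: while a thread computes on the channel tile already resident in shared memory, the loads for the next tile are in flight. It is a generic illustration of the technique rather than our actual kernel; TILE, the indexing, and the assumptions that blockDim.x equals TILE and that C is a multiple of TILE are simplifications.

#define TILE 128

// Double-buffered channel loop: the global loads for tile (buf ^ 1) are
// independent of the computation on tile buf and can overlap with it.
__global__ void pointwise_double_buffered(const float* __restrict__ input,
                                          const float* __restrict__ filter,
                                          float* __restrict__ output, int C) {
  __shared__ float in_buf[2][TILE];
  __shared__ float flt_buf[2][TILE];

  float acc = 0.0f;
  int buf = 0;

  // Preload the first channel tile (placeholder indexing).
  in_buf[0][threadIdx.x]  = input[threadIdx.x];
  flt_buf[0][threadIdx.x] = filter[threadIdx.x];
  __syncthreads();

  for (int c = TILE; c < C; c += TILE) {
    int next = buf ^ 1;
    // Issue the loads for the next tile before computing on the current one.
    in_buf[next][threadIdx.x]  = input[c + threadIdx.x];
    flt_buf[next][threadIdx.x] = filter[c + threadIdx.x];

    for (int i = 0; i < TILE; ++i)            // compute on the resident tile
      acc += in_buf[buf][i] * flt_buf[buf][i];

    __syncthreads();                          // prefetched tile is now visible
    buf = next;
  }
  for (int i = 0; i < TILE; ++i)              // drain the last tile
    acc += in_buf[buf][i] * flt_buf[buf][i];

  output[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // placeholder indexing
}

In our approach this is combined with the channel distribution scheme of Section 4.2 so that enough independent work is available to cover the remaining latency.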
Furthermore, we observe some performance degradation for pointwise convolutions with the INT8 data type. When performing pointwise convolutions with INT8, we use the NHWC data format, and four contiguous INT8 channels can be viewed as one INT32 channel. Thus, the size of the channel dimension is reduced to one-fourth of the original size. For small channel sizes (IC ≤ 96), the corresponding reduced channel sizes restrict the choices of the number of channels distributed, which leads to suboptimal performance compared to the original channel sizes. This can be improved by having a better channel size allocation scheme for INT8. We leave this as our future work.

Fig. 12. SM utilizations and ratios of executed LDG instruction counts for pointwise convolutions with a batch size of 32 on 2080Ti: (a) SM utilizations of IMPLICIT and our approach; (b) ratios of executed LDG (load from global memory) instruction counts of IMPLICIT to our approach (ratio = LDG instruction count of cuDNN / LDG instruction count of ours), measured across CONV1–CONV20.

From Figs. 10 and 11, we see that the advantage of our approach is more noticeable for small batch sizes. This is because pointwise convolution is more sensitive to GPU utilization. A larger batch size tends to use more warps, which alone can improve GPU utilization and thus the performance of cuDNN. For example, the speedups of our approach over cuDNN when IN = 128 are much smaller than the speedups when IN < 128. By contrast, when the batch size is smaller, the warps generated are by themselves insufficient to utilize the GPU, which is where our dynamic tile size scheme can help.

6.2.4 Summary

Our approach uses a dynamic tile size method to improve SM utilization, and double buffering and channel distribution to hide the memory access latency. With the help of both methods, we achieve an average speedup of 2× and 1.5× over IMPLICIT on 2080Ti and Xavier, respectively.

6.3 End to End Performance for Inference and Training

6.3.1 Setup

In this experiment, we apply our depthwise and pointwise convolutions to MobileNetV2 and EfficientNet-B0 and report the end-to-end performance of inference and training with the ImageNet dataset [47].

Inference. For inference, we test the standard and quantized MobileNetV2 and EfficientNet-B0 with batch sizes of 1, 8, 16, 32, 64 and 128 on both platforms and report the respective inference time. For quantization, the input and filter are converted from FP32 to INT8, and the results are converted back to FP32 as the model output. As cuDNN performs poorly for depthwise convolutions with INT8, we do not apply quantization to depthwise convolutions for fair comparisons.
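For completeness, the sketch below shows one common way such a conversion can be implemented around a quantized layer: a symmetric, per-tensor scale maps FP32 values to INT8 on the way in, and the INT32 accumulator produced by the DP4A-based convolution is rescaled back to FP32 on the way out. The exact quantization scheme used by the evaluated models is not spelled out here, so the kernels, the scale handling and the names below are illustrative assumptions.

#include <cstdint>

// Symmetric per-tensor quantization: q = clamp(round(x / scale), -128, 127).
__global__ void quantize_fp32_to_int8(const float* __restrict__ x,
                                      int8_t* __restrict__ q, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = rintf(x[i] / scale);
    v = fminf(fmaxf(v, -128.0f), 127.0f);
    q[i] = static_cast<int8_t>(v);
  }
}

// Rescale the INT32 accumulator of the INT8 convolution back to FP32.
__global__ void dequantize_int32_to_fp32(const int32_t* __restrict__ acc,
                                         float* __restrict__ y,
                                         float in_scale, float flt_scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = static_cast<float>(acc[i]) * in_scale * flt_scale;
}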
Training. For training, we test MobileNetV2 and EfficientNet-B0 with batch sizes of 16, 32, 64 and 128 on 2080Ti and report the average training time of one training iteration, including the forward and the back-propagation phases.

Workload and performance report. We use the open-source MobileNetV2 and EfficientNet-B0 implemented using the Caffe framework, but we replace the implementations of the batch normalization and depthwise convolution layers with the heavily optimized cuDNN implementations. The cuDNN implementation is denoted as cuDNN and our implementation is denoted as Ours. We report the percentage of performance improvement of our approach compared to the cuDNN implementations, denoted as Improved.

6.3.2 Overall results

Table 5 reports the measured inference time. For MobileNetV2 with FP32, our approach improves the performance of inference by 12.2% and 13.5% on average compared to IMPLICIT on 2080Ti and Xavier, respectively. For MobileNetV2 with INT8, we obtain 8.5% and 11.7% improvements on average over PRECOMP on 2080Ti and Xavier, respectively. For EfficientNet-B0 with FP32, our approach improves over IMPLICIT by 14.4% and 12.3% on average on 2080Ti and Xavier, respectively. For EfficientNet-B0 with INT8, we obtain 9.9% and 9.6% improvements on average over PRECOMP on 2080Ti and Xavier, respectively. Table 6 shows that our approach reduces the training time of MobileNetV2 and EfficientNet-B0 by 9.7% and 7.3% on average compared to IMPLICIT on 2080Ti, respectively. The results show that our approach can significantly reduce both the model inference and training time by speeding up DSC operations.

Fig. 13. The average number of cycles each warp spends waiting for global memory accesses to complete, for cuDNN IMPLICIT, ours and the simple implementation across CONV1–CONV20.

TABLE 5
Inference time of MobileNetV2 and EfficientNet-B0 with FP32 and INT8 on 2080Ti and Xavier.

                           MobileNetV2                              EfficientNet-B0
  Batch                    1     8     16    32    64    128        1     8     16    32    64    128
  2080Ti   cuDNN (ms)      7.5   8.8   9.7   14.4  19.1  28.7       10.1  13.7  18.1  25.0  36.4  52.3
  (FP32)   Ours (ms)       6.1   7.1   8.0   12.0  16.9  26.3       7.9   11.3  15.3  21.9  32.6  47.6
           Improved (%)    18.6  19.3  17.5  16.7  11.5  8.4        21.8  17.5  15.5  12.4  10.4  9.0
  Xavier   cuDNN (ms)      16.6  22.3  32.1  52.6  84.2  140.1      19.3  27.4  38.3  57.2  94.0  157.8
  (FP32)   Ours (ms)       13.2  18.9  27.8  44.7  76.1  130.0      15.5  23.2  32.1  50.7  87.3  151.1
           Improved (%)    20.5  15.2  13.4  15.0  9.6   7.2        19.7  15.3  16.3  11.4  7.1   4.2
  2080Ti   cuDNN (ms)      6.3   7.4   7.7   11.2  14.6  20.2       8.0   9.5   13.3  18.7  26.8  38.3
  (INT8)   Ours (ms)       5.5   6.6   6.8   10.3  14.0  19.7       6.8   8.2   11.8  16.9  25.3  36.6
           Improved (%)    12.7  10.8  11.7  8.0   4.1   2.5        15.0  13.7  11.3  9.6   5.6   4.4
  Xavier   cuDNN (ms)      13.3  18.0  27.0  42.6  64.8  103.7      16.1  21.0  33.7  52.8  80.3  127.5
  (INT8)   Ours (ms)       11.7  15.4  22.7  38.8  58.3  94.4       14.2  18.8  30.3  48.2  73.2  117.7
           Improved (%)    12.0  14.4  16.0  8.9   10.0  9.0        11.8  10.5  10.1  8.7   8.8   7.7

TABLE 6
Training time of MobileNetV2 and EfficientNet-B0 with FP32 on 2080Ti.

                    MobileNetV2                EfficientNet-B0
  Batch             16    32    64    128      16    32    64    128
  cuDNN (ms)        16.6  27.6  43.4  75.4     33.5  49.3  74.7  116.2
  Ours (ms)         14.5  24.1  39.9  71.3     30.0  45.1  69.6  112.4
  Improved (%)      12.7  12.7  8.1   5.4      10.4  8.5   6.8   3.3

7 RELATED WORK

Numerous efforts have been dedicated to optimizing convolution operations. As previously mentioned, GEMM-, FFT- and Winograd-based convolutions are broadly adopted convolution algorithms.

GEMM-based convolution was the first attempt to optimize convolution. Chellapilla et al. [48] developed an unrolling convolution algorithm called the im2col convolution algorithm. Abdelfattah et al. [33] use a simple pruning strategy to search for the optimal tiling size. However, their method is inadequate for depthwise separable convolution because they ignore SM utilization and arithmetic intensity when searching for the optimal tiling size. Our approach avoids this problem by dynamically adjusting the tiling size.

A wide range of techniques on auto-tuning GEMM kernels has been proposed. Among these, FFT- and Winograd-based convolutions are the dominating methods because they can reduce the computational complexity and improve convolution performance. Mathieu et al. [49] proposed an FFT-based convolution to compute convolutions as pointwise products in the Fourier domain and reuse the transformed input data, which significantly reduces the complexity of the convolution. However, FFT-based convolution is more suitable for large filters than for small ones, because the filters must be padded to the same size as the input data, and small filters (e.g., 3 × 3) therefore incur proportionally more memory overhead than large ones. Lavin et al. [50] used Winograd's minimal filtering algorithm to accelerate convolution on the GPU. This algorithm can reduce the arithmetic complexity of convolution by up to four times compared with direct convolution. However, Winograd's algorithm is only suitable for small filters due to its numerical instability. Zhen et al. [12] extended Winograd's algorithm to support any filter size. However, the traditional and extended Winograd-based algorithms need to transform the input and filter before performing matrix multiplication, and both require more operations than the FFT algorithm.

Transforming the input and filter before performing matrix multiplication incurs a large memory overhead, which can outweigh the performance gains obtained through lowering the computational complexity. Therefore, recent studies have looked into minimizing the memory overhead of the transformation phases. Cho et al. [51] reduced the memory overhead of GEMM-based convolutions using a compact lowering scheme to reduce the redundancy in the lowered matrix and then performed multiple small matrix multiplications in parallel. However, this algorithm still needs to transform the input and filter tensors into lowered matrices to compute the convolution. Iandola et al. [52] reduced the memory communication of 2D convolutions on the GPU. They also prefetched the image regions into registers. While their method uses fewer threads, each thread operates on a larger number of data items. As a result, their method does not reduce the number of global memory transactions. Unlike [52], our approach promotes register use and can significantly reduce the number of memory accesses.

The work presented in [53] splits larger batches into smaller batch sizes to mitigate the computational resource restrictions of large batch sizes. Li et al. [54] explore the impact of data layout on convolution operations. Zhang et al. [55] design a method to map computation to FMA (fused multiply-add) units and focus on maximizing arithmetic intensity. Unlike our approach, none of these methods explicitly considers SM utilization, which is vital for achieving good performance for DSC on GPUs. The work presented in [24] also targets column and row reuse. However, there are two main drawbacks in their approach. First, 32 threads of a warp in their approach generate only 28 output elements for a 5 × 5 filter. Our column reuse method ensures that the 32 threads of a warp generate 32 output elements for any filter size; thus our approach needs fewer warps and is more efficient than theirs. Second, their approach loads all needed input elements into registers to avoid reloading shared input elements, while our approach loads one input element at a time and calculates all output elements that depend on the loaded element. Thus, our approach uses fewer registers and can execute more warps on one SM concurrently.
8 CONCLUSION

We have presented two novel approaches to optimize memory performance and SM utilization for depthwise and pointwise convolutions, respectively. Our approach improves the data locality of convolution operations performed along the row and column directions to reduce the number of memory accesses. Our techniques utilize the common GPU shuffle operations supported by mainstream GPU programming models, including CUDA and OpenCL, and do not require hardware modifications. For pointwise convolution, the main problem is low SM utilization, because cuDNN uses a fixed tile size for all pointwise convolutions. We design a dynamic tile size method and meanwhile hide the memory access latency. We evaluate our approach for FP32 and INT8 on NVIDIA RTX 2080Ti and Jetson AGX Xavier GPUs. We compare our approach against a wide range of heavily optimized convolution algorithms. Experimental results show that our approach consistently outperforms the competing methods by delivering the best overall performance for depthwise and pointwise convolutions.

ACKNOWLEDGMENTS

This work was supported in part by the National Key Research and Development Program of China under grant agreement 2017YFB0202901, the Key-Area Research and Development Program of Guangdong Province under grant agreement 2019B010136001, the National Natural Science Foundation of China (NSFC) under grant agreements 61672186 and 61872294, and the Shenzhen Technology Research and Development Fund under grant agreement JCYJ20190806143418198. Professor Zhang is the corresponding author.

REFERENCES

[1] D. Zoran, M. Chrzanowski, P.-S. Huang, S. Gowal, A. Mott, and P. Kohli, "Towards robust image classification using sequential attention models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9483–9492.
[2] C. Yang, Z. An, H. Zhu, X. Hu, K. Zhang, K. Xu, C. Li, and Y. Xu, "Gated convolutional networks with hybrid connectivity for image classification," in AAAI, 2020, pp. 12581–12588.
[3] S. Wang, Y. Gong, J. Xing, L. Huang, C. Huang, and W. Hu, "RDSNet: A new deep architecture for reciprocal object detection and instance segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12208–12215.
[4] B. Chen, G. Ghiasi, H. Liu, T.-Y. Lin, D. Kalenichenko, H. Adam, and Q. V. Le, "MnasFPN: Learning latency-aware pyramid architecture for object detection on mobile devices," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13607–13616.
[5] Z. Zhong, Z. Q. Lin, R. Bidart, X. Hu, I. B. Daya, Z. Li, W.-S. Zheng, J. Li, and A. Wong, "Squeeze-and-attention networks for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13065–13074.
[6] H. Tokunaga, Y. Teramoto, A. Yoshizawa, and R. Bise, "Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12597–12606.
[7] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., "Searching for MobileNetV3," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
[8] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
[9] D. Haase and M. Amthor, "Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved MobileNets," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14600–14609.
[10] R. Zhang, F. Zhu, J. Liu, and G. Liu, "Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1138–1150, 2019.
[11] L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski, "SEED RL: Scalable and efficient deep-RL with accelerated central inference," 2019.
[12] J. Zhen, A. Zlateski, F. Durand, and L. Kai, "Optimizing n-dimensional, Winograd-based convolution for manycore CPUs," in ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, 2018.
[13] Y. Liu, Y. Wang, R. Yu, M. Li, V. Sharma, and Y. Wang, "Optimizing CNN model inference on CPUs," in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 1025–1040.
[14] M. Winter, D. Mlakar, R. Zayer, H.-P. Seidel, and M. Steinberger, "Adaptive sparse matrix-matrix multiplication on the GPU," in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019, pp. 68–81.
[15] Z. Li, H. Jia, Y. Zhang, T. Chen, L. Yuan, L. Cao, and X. Wang, "AutoFFT: a template-based FFT codes auto-generation framework for ARM and x86 CPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–15.
[16] D. Yan, W. Wang, and X. Chu, "Optimizing batched Winograd convolution on GPUs," in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020, pp. 32–44.
[17] A. Vasudevan, A. Anderson, and D. Gregg, "Parallel multi channel convolution using general matrix multiplication," in IEEE International Conference on Application-specific Systems, 2017.
[18] X. Li, Y. Liang, S. Yan, L. Jia, and Y. Li, "A coordinated tiling and batching framework for efficient GEMM on GPUs," in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019, pp. 229–241.
[19] D. Wu, J. Li, R. Yin, H. Hsiao, Y. Kim, and J. San Miguel, "uGEMM: unary computing architecture for GEMM applications," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 377–390.
[20] A. Zlateski, Z. Jia, K. Li, and F. Durand, "The anatomy of efficient FFT and Winograd convolutions on modern CPUs," in Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 414–424.
[21] NVIDIA, CUDA C++ Best Practices Guide. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
[22] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014.
[23] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A GPU performance evaluation," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[24] P. Chen, M. Wahib, S. Takizawa, R. Takano, and S. Matsuoka, "A versatile software systolic execution model for GPU memory-bound kernels," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–81.
[25] B. Pourghassemi, C. Zhang, J. H. Lee, and A. Chandramowlishwaran, "On the limits of parallelizing convolutional neural networks on GPUs," in Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 2020, pp. 567–569.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[28] G. Lu, W. Zhang, and Z. Wang, "Optimizing GPU memory transactions for convolution operations," in 2020 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2020, pp. 399–403.
[29] X. Mei and X. Chu, "Dissecting GPU memory hierarchy through microbenchmarking," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 1, pp. 72–86, 2016.
[30] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking," arXiv preprint arXiv:1804.06826, 2018.
[31] D. E. Tanner, "Tensile: Auto-tuning GEMM GPU assembly for all problem sizes," in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018, pp. 1066–1075.
[32] V. Kelefouras, A. Kritikakou, I. Mporas, and V. Kolonias, "A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures," The Journal of Supercomputing, vol. 72, no. 3, pp. 804–844, 2016.
[33] A. Abdelfattah, S. Tomov, and J. Dongarra, "Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs," in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2019, pp. 111–122.
[34] J. Kurzak, H. Anzt, M. Gates, and J. Dongarra, "Implementation and tuning of batched Cholesky factorization and solve for NVIDIA GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 2036–2048, 2015.
[35] L. Jiang, C. Yang, and W. Ma, "Enabling highly efficient batched matrix multiplications on SW26010 many-core processor," ACM Transactions on Architecture and Code Optimization (TACO), vol. 17, no. 1, pp. 1–23, 2020.
[36] P. Tillet and D. Cox, "Input-aware auto-tuning of compute-bound HPC kernels," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[37] H. Lan, J. Meng, C. Hundt, B. Schmidt, M. Deng, X. Wang, W. Liu, Y. Qiao, and S. Feng, "FeatherCNN: Fast inference computation with TensorGEMM on ARM architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 3, pp. 580–594, 2019.
[38] Y. Zhang and F. Mueller, "Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 3, pp. 417–427, 2012.
[39] D. Yan, W. Wang, and X. Chu, "Demystifying tensor cores to optimize half-precision matrix multiply," in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 20–24.
[40] L. Jia, Y. Liang, X. Li, L. Lu, and S. Yan, "Enabling efficient fast convolution algorithms on GPUs via megakernels," IEEE Transactions on Computers, 2020.
[41] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, "FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
[42] D. Nichols, N.-S. Tomov, F. Betancourt, S. Tomov, K. Wong, and J. Dongarra, "MagmaDNN: towards high-performance data analytics and machine learning for data-driven scientific computing," in International Conference on High Performance Computing. Springer, 2019, pp. 490–503.
[43] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[44] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, "Data-free quantization through weight equalization and bias correction," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1325–1334.
[45] NVIDIA, CUDA Toolkit Programming Guide. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[46] ——, NVIDIA Nsight Compute. [Online]. Available: https://developer.nvidia.com/nsight-compute
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[48] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.
[49] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," arXiv preprint arXiv:1312.5851, 2013.
[50] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
[51] M. Cho and D. Brand, "MEC: memory-efficient convolution for deep neural network," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 815–824.
[52] F. N. Iandola, D. Sheffield, M. J. Anderson, P. M. Phothilimthana, and K. Keutzer, "Communication-minimizing 2D convolution in GPU registers," in IEEE International Conference on Image Processing, 2014.
[53] Y. Oyama, T. Ben-Nun, T. Hoefler, and S. Matsuoka, "Accelerating deep learning frameworks with micro-batches," in 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2018, pp. 402–412.
[54] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, "Optimizing memory efficiency for deep convolutional neural networks on GPUs," in SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 633–644.
[55] J. Zhang, F. Franchetti, and T. M. Low, "High performance zero-memory overhead direct convolutions," in International Conference on Machine Learning, 2018, pp. 5776–5785.

Gangzhao Lu received the B.S. degree in computer science and engineering from Harbin Institute of Technology, China, in 2014. He is currently working toward the Ph.D. degree in the School of Cyberspace Science, Harbin Institute of Technology. His research interests include performance modeling, parallel optimization, and auto-tuning.

Weizhe Zhang (Senior Member, IEEE) received the B.Eng., M.Eng., and Ph.D. degrees in computer science and technology from Harbin Institute of Technology in 1999, 2001, and 2006, respectively. He is currently a professor in the School of Computer Science and Technology at Harbin Institute of Technology, China, and director of the Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen, China. His research interests are primarily in parallel computing, distributed computing, cloud and grid computing, and computer networks. He has published more than 100 academic papers in journals, books, and conference proceedings.

Zheng Wang is an associate professor with the University of Leeds. His research focuses on parallel computing, compilation and systems security.