GPU in Supercomputer
University of Rochester
[email protected]
Abstract
The demands of the gaming industry have driven advances in GPU architecture. Today's GPU is a low-cost, highly parallel coprocessor with high memory bandwidth that supports single- and double-precision floating-point arithmetic through its instruction set. This extreme parallelism and high performance per unit cost have naturally raised interest in harnessing the power of GPUs for HPC (High Performance Computing) systems, and GPU architectures have therefore become increasingly important in the multi-core era. However, programming thousands of massively parallel threads is a major challenge for software engineers, and understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. In this survey we try to understand the performance bottlenecks of GPUs in High Performance Computing systems by studying the GPU architecture, examining the limits of thread- and memory-level parallelism with the help of an analytical model, comparing a high-end GPU with a single node of a supercomputer to gain insight into the weak points of a GPU, and analyzing the errors caused by GPUs in a supercomputer.
GPU in Supercomputer 8th May 2016
executing in the same SM. Each thread in a block has its own independent copy of the register variables it declares. Variables that are too large are placed in local memory, which resides in device memory. The local memory space is not cached, so accesses to it are as expensive as normal accesses to device memory. Device memory, or global memory, is located on the graphics card and can be accessed by all threads. Constant memory is used for data that will not change over the course of a kernel execution; it is used in place of device memory to reduce memory bandwidth demand.

Texture Cache - texture memory is another variety of read-only memory that can improve performance and reduce memory traffic when reads follow certain access patterns. In short, the texture cache is used to exploit spatial locality.

III. Analytical Model

To analyze the performance bottlenecks of a GPU, we refer to an analytical model that estimates the cost of memory operations and the degree of parallelism in GPU applications. In this model the warp is the execution unit of the GPU, and a warp that is waiting for memory values is called a memory warp. Two terms used in this section are MWP and CWP. The number of memory requests that can be serviced concurrently is called memory warp parallelism (MWP). Computation warp parallelism (CWP) is the amount of computation that can be done by other warps while one warp waits for memory values.

Memory accesses by a warp can be coalesced or uncoalesced. The SM processor executes one warp at a time and schedules warps in a time-sharing fashion. When the SM processor executes a memory instruction, it generates memory requests and switches to another warp until all the memory values for the current warp are ready. Ideally, all memory accesses by a warp are executed as a single memory transaction, but this depends on the access pattern. If the memory addresses accessed by the threads in a warp are sequential, all of the memory requests within the warp can be coalesced into a single memory transaction. If every thread in a warp generates a different memory address, each generates a separate transaction. If the memory requests in a warp are uncoalesced, the warp cannot proceed until all memory transactions from that warp are serviced, which takes significantly longer than waiting for only one memory request.

Figure 3.3: a) coalesced memory access, b) uncoalesced memory access

To explain how executing multiple warps in each SM affects the total execution time, we consider the following scenarios. A computation period is the time a warp spends executing instructions, and a memory waiting period is the time during which its memory requests are being serviced.

Case 1: In Figure 3.1 a, we assume that all the computation periods and memory waiting periods come from different warps and that one computation period is roughly one third of one memory waiting period. The system can service two memory warps simultaneously, so the processor can finish three computation periods during one memory waiting period. As a result, six computation periods are completely overlapped with memory waiting periods. Hence, only two computation periods and four memory waiting periods contribute to the total execution cycles.

Case 2: In Figure 3.1 b, there are four warps, and each warp has two computation periods and two memory waiting periods. The second computation period of a warp can start only after its first memory waiting period has finished, and the system can service two memory warps simultaneously. First, the processor executes the first computation period of each of the four warps one by one. By the time the processor finishes the first computation
Figure 3.2
Case 3: Here the system can service eight memory warps simultaneously, so the total execution cycles equal eight computation periods plus one memory waiting period.

Case 4: Every warp consists of two computation periods and two memory waiting periods, and the second computation period cannot start until the first memory waiting period of the same warp has finished. The total execution cycles are nevertheless the same as in Case 3.

Figure 3.3

Case 5: The system can again service eight memory warps at once, but now consider a computation period that is longer than a memory waiting period. In this case, one memory waiting period completes before even one computation period can finish. The total execution cycles are again eight computation periods plus one memory waiting period.

Figure 3.5

Synchronization Effects: The CUDA programming model supports thread synchronization. Normally, threads execute asynchronously: a warp can issue as soon as all of its source operands are ready. However, if there is a barrier, the processor cannot execute the instructions after the barrier until all the threads reach the barrier, so thread synchronization introduces additional delays. Figure 3.6 illustrates this additional delay.

Figure 3.6

The key idea of the analytical model is to understand the bottlenecks in GPU thread-level parallelism
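The warp scenarios above can be reproduced with a toy timeline model. The sketch below is a deliberate simplification (a single in-order SM, fixed-length periods, and slot-style memory service with MWP concurrent requests), not the full analytical model; the parameters are chosen to match the cases in the text, e.g. eight warps with a computation period one third of a memory period and MWP = 2 yields the Case 1 total of two computation plus four memory periods.

```python
import heapq

def total_cycles(n_warps, comp, mem, mwp, rounds=1):
    """Toy timeline model of one SM.

    Computation periods run serially on the SM; up to `mwp` memory
    requests are serviced concurrently, each taking `mem` cycles.
    Each warp alternates `rounds` (computation, memory) pairs.
    """
    slots = [0] * mwp                  # when each memory "channel" frees up
    ready = [0] * n_warps              # when each warp may compute again
    cpu = 0
    for _ in range(rounds):
        for w in range(n_warps):
            cpu = max(cpu, ready[w]) + comp      # run the computation period
            done = max(cpu, heapq.heappop(slots)) + mem
            heapq.heappush(slots, done)          # memory request in flight
            ready[w] = done
    return max(cpu, max(ready))

# Case 1: comp = mem/3, MWP = 2 -> 2 comp + 4 mem periods = 14
print(total_cycles(8, 1, 3, mwp=2))    # 14
# Case 3: MWP = 8 -> 8 comp + 1 mem = 11
print(total_cycles(8, 1, 3, mwp=8))    # 11
# Case 5: computation longer than memory -> 8 comp + 1 mem = 25
print(total_cycles(8, 3, 1, mwp=8))    # 25
```

Varying `mwp` in this sketch shows the regime change the model captures: with low MWP the memory waiting periods dominate the total, while with ample MWP only one memory period remains exposed.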
supercomputer discussed are as follows: SIMD vector length (32 or more vs. 4), small memory (4 GB vs. 32 GB) relative to the peak single-precision performance, the absence of a fast L2 cache on the GPU, and the register spill penalty (hundreds vs. a few cycles).

V. Understanding Errors in a Supercomputer

Understanding GPU errors and their implications on a large-scale system gives future GPU architects useful insight. This section presents an analysis of GPU errors on the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The data analysis includes single- and double-bit errors for all 18,688 GPUs in the Titan supercomputer.

In the Titan supercomputer, all major storage structures of the GPUs are protected with a Single Error Correction, Double Error Detection (SECDED) ECC for HPC applications, including device memory, L2 cache, instruction cache, register files, shared memory, and the L1 cache region. However, not all resources benefit from ECC protection; for example, logic, queues, the thread block scheduler, the warp scheduler, the instruction dispatch unit, and the interconnect network are not covered. To carry out this analysis, large-scale scientific applications were run on the Titan supercomputer for days, and the failures during these periods were recorded and analyzed.

On studying the errors, it was observed that most of them occurred during the same day or within a span of a couple of days; GPU failures therefore have strong temporal locality. Because of this characteristic, rigorous tests can be performed during the production phase to identify bad cards early. This result is also important for reducing I/O overhead significantly by employing techniques such as "lazy checkpointing", and it is useful for fault-tolerance studies.

On investigating the GPU cards that caused the failures, it was seen that certain GPUs experience DBEs more often than others. Also, most of these GPU cards were located in cages with high temperature. Although this observation suggests that GPUs are sensitive to temperature, it is not conclusive, since not all the double-bit errors were due to temperature.

The investigation was carried one step further to find the GPU cards with SBEs. It was found that 98 percent of all single-bit errors occur in only 10 GPU cards, which are hence more prone to recurring SBEs. Also, the GPU cards that experience most of the SBEs are likely to have all of their SBEs occur in device memory rather than in the L2 cache. This finding can be useful to future architects in deciding which structures need better protection (device memory and L2 cache) and which structures may not need additional costly protection schemes (L1 cache, register file, and texture memory).

Several other observations were made by conducting a radiation experiment to evaluate resilience: the Kepler generation of GPUs is significantly more resilient than the Fermi generation due to its improved cell design, yet Kepler has a significantly higher DBE rate than the Fermi architecture due to its smaller transistor size.

This study of GPU failures on a large-scale system derives insights about GPU error characteristics that can be used to improve the operational efficiency of large-scale HPC facilities, and its implications can help future generations of GPU architectures.

VI. Conclusion

This survey aimed to study the performance of GPUs in HPC systems. To harness the performance of a GPU, it is very important that the programmer is fully aware of the thread- and memory-level parallelism the GPU offers. Although GPUs have high parallelism, implementing HPC algorithms on GPUs still faces many challenges, such as communication latency, memory bandwidth, and the register spill penalty. Analyzing the errors caused by GPUs in a supercomputer gives insight into future architectural improvements to be made on current GPUs.
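As an illustration of the SECDED scheme discussed in Section V, the following Python sketch implements a textbook extended Hamming (13,8) code: single-bit errors are corrected via the syndrome, and double-bit errors are detected with an overall parity bit. This is a minimal construction for illustration only, not the ECC circuitry Titan's GPUs actually use.

```python
from functools import reduce
from operator import xor

CHECK = (1, 2, 4, 8)                       # Hamming check-bit positions

def encode(data):
    """Encode 8 data bits (list of 0/1) into a 13-bit SECDED word.

    code[0] is an overall parity bit; code[1..12] is Hamming(12,8)
    with check bits at the power-of-two positions.
    """
    code = [0] * 13
    data_pos = [i for i in range(1, 13) if i not in CHECK]
    for pos, bit in zip(data_pos, data):
        code[pos] = bit
    for p in CHECK:                        # check bit p covers positions with bit p set
        code[p] = reduce(xor, (code[i] for i in range(1, 13)
                               if i & p and i != p), 0)
    code[0] = reduce(xor, code[1:], 0)     # overall parity over the Hamming word
    return code

def decode(code):
    """Return (status, data): 'ok', 'corrected', or 'double' (uncorrectable)."""
    syndrome = reduce(xor, (i for i in range(1, 13) if code[i]), 0)
    overall = reduce(xor, code, 0)         # 1 if total parity is now odd
    if syndrome and not overall:
        return "double", None              # two flips: detected, not corrected
    code = list(code)
    status = "ok"
    if overall:                            # exactly one flip somewhere
        code[syndrome] ^= 1                # syndrome 0 means the parity bit itself
        status = "corrected"
    data = [code[i] for i in range(1, 13) if i not in CHECK]
    return status, data
```

In Titan this kind of protection covers device memory, the caches, and the register files, while unprotected logic such as the schedulers remains a potential source of undetected errors.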