
GPU in Supercomputer

University of Rochester
[email protected]

8th May 2016

Abstract

The demands placed by the gaming industry have driven advances in GPU architecture. Today's GPU is a low-cost, highly parallel coprocessor with high memory bandwidth that supports single- and double-precision floating-point calculations through its instruction set. This extreme parallelism and high performance per unit cost have naturally raised interest in harnessing the power of GPUs for HPC (High Performance Computing) systems, and GPU architectures have therefore become increasingly important in the multi-core era. However, programming thousands of massively parallel threads is a big challenge for software engineers, and understanding the performance bottlenecks of those parallel programs on GPU architectures in order to improve application performance is even more difficult. In this survey we try to understand the performance bottlenecks of GPUs in high-performance computing systems by studying the architecture of the GPU, analyzing the bottlenecks of thread- and memory-level parallelism with the help of an analytical model, comparing a high-end GPU with a single node of a supercomputer to gain insight into the weak points of a GPU, and examining the errors caused by GPUs in a supercomputer.

I. Introduction

Graphics processing units (GPUs) have become one of the most popular computing platforms for high-throughput computing applications. GPUs have democratized supercomputing, and researchers have discovered that power. GPUs are extremely well suited to data science, computer vision, medical imaging, computational structural mechanics, and several other research fields. For such applications, GPUs achieve high throughput by exploiting thousands of cores and gigabytes of high-bandwidth memory. To satisfy the increasing resource requirements of applications, GPU manufacturers have consistently scaled up GPUs by adding more cores and increasing memory size and bandwidth; NVIDIA's Tesla K40, for example, has a memory bandwidth of 288 GB/sec, a memory size of 12 GB, and 2880 cores. NVIDIA is constantly advancing visualization by developing rendering technologies that leverage the most advanced GPU architectures and compute languages. Compute languages like CUDA have helped harness the extreme compute power and scalability of GPUs.
There have been many new programming models, such as CUDA by NVIDIA, Larrabee by Intel, and Brook+ by AMD. NVIDIA has also developed several libraries, such as cuDNN for deep learning, cuFFT for the Fast Fourier Transform, cuBLAS, FFmpeg, GIE, and others. Despite all this, a programmer still has to spend several hours optimizing code to achieve high performance, and a clear understanding of the underlying architecture and its performance characteristics makes a great deal of difference while programming. Therefore we try to understand the GPU architecture and analyze thread- and memory-level parallelism with the help of an analytical model.
Despite the high degree of parallelism offered by a GPU, there are differences between a GPU and a supercomputer node. On comparing a GPU with a supercomputer node, we find a few weaknesses of the GPU that are important to consider for high-performance computing. To further understand the nature of GPUs in large-scale HPC, analyzing errors in a supercomputer gives insights and recommendations for current and future large-scale GPU-enabled HPC centers. By understanding the above-mentioned areas, we aim to realize the extremity of parallelism offered by GPUs in high-performance computing.


II. Background and Motivation

I. Background on CUDA programming model

In order to harness thread-level parallelism, NVIDIA introduced the CUDA programming language. A CUDA program consists of a host program and data-parallel kernel functions which are executed on the GPU. The kernel functions are invoked by the host program with a specified number of threads, blocks and grids. In general, the CPU is referred to as the host and the GPU as the device.
CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads in CUDA have a three-level hierarchy; Figure 2.1 depicts it. A grid is a set of thread blocks that execute a kernel function. Each grid consists of blocks of threads, and each block is composed of hundreds of threads. Threads within one block can share data using shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture, and a group of 32 threads executing concurrently is called a warp. The programmer is free to specify the number of threads per block and the number of blocks per grid.

Figure 2.1: Thread level hierarchy
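As a concrete illustration of these abstractions, the sketch below is a minimal CUDA example (the kernel name, array size, and launch configuration are illustrative choices, not taken from the survey). It launches a grid of thread blocks, stages data in per-block shared memory, and uses the barrier to make every thread's write visible to the rest of its block.

#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block reverses its own 256-element tile of the input.
__global__ void reverseTile(const float *in, float *out) {
    __shared__ float tile[256];                      // shared within one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    tile[threadIdx.x] = in[i];
    __syncthreads();                                 // barrier: whole block waits here
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read a value another thread wrote
}

int main() {
    const int n = 1 << 20;                           // multiple of the block size
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    dim3 block(256);                                 // threads per block
    dim3 grid(n / 256);                              // blocks per grid
    reverseTile<<<grid, block>>>(in, out);           // host invokes the device kernel
    cudaDeviceSynchronize();
    printf("out[0] = %.1f\n", out[0]);               // expect 255.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}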

II. Background on GPU architecture

The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing streaming processor (SP) cores, special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a read/write shared memory. NVIDIA has developed a series of microarchitectures: Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta. A Fermi SM consists of 32 CUDA processor cores, 16 load/store units, four special function units, a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers with instruction dispatch units. Figure 2.2 depicts the Fermi architecture.

Figure 2.2: Fermi architecture

Load/Store Units: allow source and destination addresses to be calculated for 16 threads per clock, and load and store data from/to cache or DRAM.
Special Function Units (SFUs): execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock, so a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
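Many of the per-device figures mentioned in this section (number of SMs, shared memory per block, registers per block) can be inspected at run time with the standard CUDA runtime call cudaGetDeviceProperties. The short sketch below only prints a few illustrative fields; the values it reports naturally depend on the GPU it runs on.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0
    printf("device              : %s\n", prop.name);
    printf("multiprocessors     : %d\n", prop.multiProcessorCount);
    printf("warp size           : %d\n", prop.warpSize);
    printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block : %d\n", prop.regsPerBlock);
    printf("global memory       : %zu bytes\n", prop.totalGlobalMem);
    return 0;
}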


Register file: each SM has a large, unified register file that is shared across the warps executing on that SM. Each thread in a block has its own independent copy of the register variables it declares. Variables that are too large are placed in local memory, which is located in device memory. The local memory space is not cached, so accesses to it are as expensive as normal accesses to device memory.
Device memory (global memory) is located on the graphics card and can be accessed by all threads. Constant memory is used for data that will not change over the course of a kernel execution; it is used in place of device memory to reduce memory bandwidth.
Texture cache: texture memory is another variety of read-only memory that can improve performance and reduce memory traffic when reads have certain access patterns. In short, the texture cache is used to exploit spatial locality.
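The memory spaces just described map directly onto CUDA qualifiers. The following hedged sketch (the kernel name, the 16-coefficient filter, and the sizes are made up for illustration) touches constant, shared, per-thread (register/local), and global memory in one kernel.

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeff[16];                        // constant memory: read-only in the kernel

__global__ void memorySpaces(const float *gin, float *gout, int n) {
    __shared__ float tile[256];                      // on-chip shared memory, one tile per block
    float partial[16];                               // small per-thread array: usually registers;
                                                     // large arrays end up in local memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? gin[i] : 0.0f;     // global memory -> shared memory
    __syncthreads();

    float acc = 0.0f;
    for (int k = 0; k < 16; ++k) {
        partial[k] = tile[threadIdx.x] * coeff[k];
        acc += partial[k];
    }
    if (i < n) gout[i] = acc;                        // result back to global memory
}

int main() {
    const int n = 1 << 16;
    float h_coeff[16];
    for (int k = 0; k < 16; ++k) h_coeff[k] = 1.0f / 16.0f;
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // fill constant memory from the host

    float *gin, *gout;
    cudaMallocManaged(&gin, n * sizeof(float));
    cudaMallocManaged(&gout, n * sizeof(float));
    for (int i = 0; i < n; ++i) gin[i] = (float)i;

    memorySpaces<<<(n + 255) / 256, 256>>>(gin, gout, n);
    cudaDeviceSynchronize();
    printf("gout[10] = %.1f\n", gout[10]);           // expect 10.0 with this averaging filter
    cudaFree(gin);
    cudaFree(gout);
    return 0;
}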
III. Analytical Model

To analyze the performance bottlenecks of a GPU, an analytical model is used to estimate the cost of memory operations and the parallelism available to GPU applications. In this model the warp is treated as the execution unit of the GPU, and a warp that is waiting for memory values is called a memory warp. Two terms used in this section are MWP and CWP: the number of memory requests that can be serviced concurrently is called memory warp parallelism (MWP), and computation warp parallelism (CWP) is defined as the amount of computation that can be done by other warps while one warp is waiting for memory values.
The memory access pattern of a warp can be coalesced or uncoalesced. The SM executes one warp at a time and schedules warps in a time-sharing fashion. When the SM executes a memory instruction, it generates memory requests and switches to another warp until all the memory values for that warp are ready. Ideally, all memory accesses of a warp are executed as a single memory transaction; however, this depends on the memory access pattern. If the memory addresses accessed by the threads in a warp are sequential, all of the memory requests within the warp can be coalesced into a single memory transaction. If every thread in a warp generates a different memory address, it will generate different transactions. If the memory requests of a warp are uncoalesced, the warp cannot proceed until all memory transactions from that warp are serviced, which takes significantly longer than waiting for only one memory request.

Figure 3.3: a) coalesced memory access, b) uncoalesced memory access
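To make the coalescing condition concrete, the two kernels below differ only in how a thread's index maps to an address (a hedged sketch; the stride of 32 elements is simply one example of a pattern that breaks coalescing on most NVIDIA GPUs). They are launched exactly like the earlier sketches, e.g. copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n).

// Coalesced: consecutive threads of a warp touch consecutive addresses,
// so the warp's 32 loads can be combined into a few memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive threads touch addresses 32 elements (128 bytes)
// apart, so each thread's request tends to fall in a different memory
// segment and the warp waits for many separate transactions.
__global__ void copyStrided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];            // artificial scattering stride
}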
To explain how executing multiple warps on each SM affects the total execution time, we consider the following scenarios. A computation period is the period during which instructions of one warp are executed on the SM, and a memory waiting period is the period during which that warp's memory requests are being serviced.
Case 1: In Figure 3.1 a, we assume that all the computation periods and memory waiting periods belong to different warps and that one computation period is roughly one third of one memory waiting period. The system can service two memory warps simultaneously, so the processor can finish the computation periods of three warps during one memory waiting period. As a result, six of the computation periods are completely overlapped with memory waiting periods of other warps. Hence, only 2 computation periods and 4 memory waiting periods contribute to the total execution cycles.
Case 2: In Figure 3.1 b, there are four warps, and each warp has two computation periods and two memory waiting periods. The second computation period of a warp can start only after the first memory waiting period of that warp has finished, and the system can again service two memory warps simultaneously. First, the processor executes the first computation periods of the four warps one by one. By the time the processor finishes the first computation periods of all warps, two memory waiting periods have already been serviced, so the processor can execute the second computation periods of those two warps. After that, the first memory waiting periods of warps 3 and 4 have not yet completed, so the second computation periods of warps 3 and 4 cannot begin, which leaves a few idle cycles between computation periods. Despite these idle cycles, the total execution cycles are the same as in Case 1.


Figure 3.2

Case 3: Here the system can service 8 memory warps simultaneously, so the total execution cycles are equal to 8 computation periods and 1 memory waiting period.
Case 4: Every warp consists of two computation periods and two memory waiting periods, and the second computation period cannot start until the first memory waiting period of the same warp has finished. The total execution cycles are nevertheless the same as in Case 3.

Figure 3.3

Case 5: The system can again service 8 memory warps at once, but now the computation period is longer than the memory waiting period. In this case a memory waiting period completes before even one computation period is finished, so the memory latency is entirely hidden. The total execution cycles are 8 computation periods and 1 memory waiting period.

Figure 3.4

Cases 6, 7: If there are not enough warps running, the application cannot take advantage of all the available parallelism. Since only one warp is running (Case 6), all the executions are serialized.

Figure 3.5

Synchronization Effects: The CUDA programming model supports thread synchronization. Normally, all threads execute asynchronously whenever all the source operands in a warp are ready. However, if there is a barrier, the processor cannot execute the instructions after the barrier until all the threads reach the barrier, so there will be additional delays due to thread synchronization. Figure 3.6 illustrates this additional delay.

Figure 3.6

The key idea of the analytical model is to understand the bottlenecks in GPU thread-level parallelism.
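Reading the cases above together, a much simplified summary can be written down (this is our own notation and a deliberately rough sketch, not the exact equations of the model being surveyed). Let each of $N$ running warps have one computation period of length $C_p$ and one memory waiting period of length $M_p$, let MWP memory warps be serviceable at once, and let $\mathrm{CWP} \approx (M_p + C_p)/C_p$. Then

\[
T_{\mathrm{exec}} \;\approx\;
\begin{cases}
\dfrac{N}{\mathrm{MWP}}\, M_p + \mathrm{MWP}\cdot C_p, & \mathrm{CWP} > \mathrm{MWP} \quad \text{(memory waiting dominates)},\\[1.5ex]
N\cdot C_p + M_p, & \mathrm{MWP} \ge \mathrm{CWP} \quad \text{(memory latency is hidden)}.
\end{cases}
\]

With the numbers of Case 1 ($N = 8$ computation periods in total, $\mathrm{MWP} = 2$, $M_p = 3\,C_p$, hence $\mathrm{CWP} = 4$), the first line gives $4\,M_p + 2\,C_p$, i.e. four memory waiting periods and two computation periods, as described; with $\mathrm{MWP} = 8$ as in Case 3, the second line gives $8\,C_p + M_p$.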


IV. Differences between a GPU and a Supercomputer node

Comparing a GPU with a supercomputer, we find fairly many similarities, but we also find a few crucial differences between GPUs and conventional parallel supercomputers. Table 4.1 lists the differences between T2K Todai and the C1060 GPU. In this section we compare T2K Todai, then the fastest supercomputer in Japan, with the NVIDIA Tesla C1060.
T2K has a smaller SIMD vector length and is hence easier to program; however, the SIMD vector length is also a factor in the cost of the control hardware. The SIMD vector length is the first big difference between a GPU and a supercomputer: programming a GPU is not easy.
The peak performance of one node of T2K is 294 Gflops for single precision and 147 Gflops for double precision. The single-precision peak of the C1060 (933 Gflops) is much higher than that of a T2K node, but its double-precision peak (78 Gflops) is lower. The ratio of single-precision to double-precision peak performance is 2 for T2K and about 12 for the C1060. On a T2K node, DDR2 memory of 8 GB (minimum) is attached to each CPU; since a node has 4 CPUs, at least 32 GB of main memory is installed per node (the maximum is 128 GB per node). This is a much larger figure than the 4 GB of the C1060. The ratio of main memory size to single-precision performance is 32 (GB) / 294 (Gflops) = 0.109 for T2K and 4 (GB) / 933 (Gflops) = 0.004 for the C1060. These ratios suggest fundamental limitations on granularity, which is a very important concept in parallel processing; the very small ratio of the C1060 suggests limited performance whenever granularity strongly affects parallel performance. The small main memory size is the second big difference between a GPU and a supercomputer.
The L1 cache of the Opteron processor in T2K is 64 KB per core. The shared memory of the C1060 is only 16 KB per multiprocessor, but each multiprocessor of the C1060 has 64 KB of registers, so the sizes of the fastest on-chip memories are similar. However, the penalty of a register spill is quite different. On a CPU, data spilled from a register is stored in the L1 cache and can be reloaded in a few cycles; on a GPU, the spilled data is stored in device memory, and it takes several hundred cycles to reload it. This is the third big difference between a GPU and a supercomputer. Register allocation is therefore very important on a GPU, yet it is surprisingly difficult for a programmer to control.

Figure 4.1: Comparison of a T2K Todai node and the Tesla C1060 (Table 4.1)

The Opteron processor has 512 KB of L2 cache per core and 2 MB of L3 cache per CPU; in many applications those caches reduce the performance degradation caused by long memory access latency. An NVIDIA GPU has two levels of read-only caches, enabled by texture fetch, but their access latencies are a few hundred cycles. Those caches reduce the demands on memory bandwidth rather than the memory access latency. This is the fourth big difference between a GPU and a supercomputer.


The four biggest differences between a GPU and a supercomputer discussed here are: SIMD vector length (32 or more vs 4), small memory (4 GB vs 32 GB) relative to the single-precision peak performance, the absence of a fast L2 cache on the GPU, and the register spill penalty (hundreds of cycles vs a few cycles).
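The register spill penalty can be made visible with a toy kernel. In the hedged sketch below (the kernel and its sizes are purely illustrative), the large, dynamically indexed per-thread array usually cannot be kept in registers, so the compiler places it in local memory, which on the C1060-class GPUs discussed here means uncached device memory. Compiling with nvcc --ptxas-options=-v reports the registers used and the local-memory/spill bytes, which makes the effect easy to check.

// Illustrative only: a 2 KB per-thread array, indexed with a runtime value,
// typically ends up in local memory rather than registers.
__global__ void spillProne(const int *idx, float *out, int n) {
    float buf[512];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < 512; ++k)
        buf[k] = (float)(k + i);
    // The dynamic index defeats register promotion, so this access may pay
    // a device-memory latency of several hundred cycles on such GPUs.
    out[i] = buf[idx[i] & 511];
}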
V. Understanding Errors in a Supercomputer

Understanding GPU errors and their implications for a large-scale system gives insight to future GPU architects. This section presents an analysis of GPU errors on the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF); the data cover single-bit and double-bit errors for all 18,688 GPUs in the Titan supercomputer.
In the Titan supercomputer, all major storage structures of the GPUs used by HPC applications are protected with a Single Error Correction, Double Error Detection (SECDED) ECC, including device memory, L2 cache, instruction cache, register files, shared memory, and the L1 cache region. However, not all resources benefit from ECC protection; for example, logic, queues, the thread block scheduler, the warp scheduler, the instruction dispatch unit, and the interconnect network are not covered. To carry out this analysis, large-scale scientific applications were run on the Titan supercomputer for days, and the failures during these periods were recorded and analyzed.
The study of these errors observed that most of them occurred on the same day or within the span of a couple of days, so GPU failures have a strong temporal locality. Because of this characteristic, rigorous tests can be performed during the production phase to identify bad cards early. This result is also important for reducing I/O overhead significantly by employing techniques such as "lazy checkpointing", and the finding is useful for fault-tolerance studies.
On investigating the GPU cards that caused the failures, it was seen that certain GPUs experience DBEs more often than others, and that most of these cards were located in cages with high temperature. Although this observation suggests that GPUs are sensitive to temperature, it is not conclusive, since not all of the double-bit errors were due to temperature.
The investigation was carried one step further to find the GPU cards with SBEs. It was found that 98 percent of all single-bit errors occur in only 10 GPU cards, which are therefore more prone to recurrence of SBEs. GPU cards that experience most of the SBEs are also likely to have all of their SBEs occur in the device memory instead of the L2 cache. This finding can be useful for future architects in deciding which structures need better protection (device memory and L2 cache) and which structures may not need additional costly protection schemes (L1 cache, register file and texture memory).
Several further observations were made by conducting a radiation experiment: the Kepler generation of GPUs is significantly more resilient than the Fermi generation thanks to an improved cell design, although Kepler shows a significantly higher DBE rate than the Fermi architecture due to its smaller transistor size.
Overall, this study of GPU failures on a large-scale system derives insights about GPU error characteristics that can be used to improve the operational efficiency of large-scale HPC facilities, and its implications can help future generations of GPU architectures.
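On NVIDIA GPUs, the single-bit (corrected) and double-bit (uncorrected) ECC error counts discussed in this section are exposed through the NVML management library. The sketch below is a minimal example, assuming NVML headers and an ECC-enabled GPU are available (link with the NVML library, e.g. -lnvidia-ml); it reads the aggregate counters for device 0.

#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned long long corrected = 0, uncorrected = 0;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        // SBEs appear as corrected errors, DBEs as uncorrected errors.
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                    NVML_AGGREGATE_ECC, &corrected);
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_AGGREGATE_ECC, &uncorrected);
        printf("corrected   (single-bit): %llu\n", corrected);
        printf("uncorrected (double-bit): %llu\n", uncorrected);
    }
    nvmlShutdown();
    return 0;
}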
VI. Conclusion

This survey aimed to study the performance of GPUs in HPC systems. To harness the performance of a GPU, it is very important that the programmer is fully aware of the thread- and memory-level parallelism the GPU offers. Although GPUs provide high parallelism, implementing HPC algorithms on GPUs still faces many challenges, such as communication latency, memory bandwidth, and the register spill penalty. Analyzing the errors caused by GPUs in a supercomputer gives insight into future architectural improvements to be made to current GPUs.


