
GPU in Supercomputer

University of Rochester
[email protected]

8th May 2016

Abstract

The demands placed by the gaming industry have driven advances in GPU architecture. Today's GPU is a low-cost, highly parallel coprocessor with high memory bandwidth that supports single- and double-precision floating-point calculations through its instruction set. This extreme parallelism and high performance per unit cost have naturally raised interest in harnessing the power of GPUs for HPC (High Performance Computing) systems, and GPU architectures have therefore become increasingly important in the multi-core era. However, programming thousands of massively parallel threads is a big challenge for software engineers, and understanding the performance bottlenecks of those parallel programs on GPU architectures in order to improve application performance is even more difficult. In this survey we try to understand the performance bottlenecks of GPUs in high-performance computing systems by studying the architecture of the GPU, analyzing the bottlenecks of thread- and memory-level parallelism with the help of an analytical model, comparing a high-end GPU with a single node of a supercomputer to gain insight into the weak points of a GPU, and examining the errors caused by GPUs in a supercomputer.

I. Introduction

Graphics processing units (GPUs) have become one of the most popular computing platforms for high-throughput computing applications. GPUs have democratized supercomputing, and researchers have discovered that power. GPUs are extremely well suited to data science, computer vision, medical imaging, computational structural mechanics, and several other research fields. For such applications, GPUs achieve high throughput by exploiting thousands of cores and gigabytes of high-bandwidth memory. To satisfy the increasing resource requirements of applications, GPU manufacturers have consistently scaled up GPUs by adding more cores and increasing memory size and bandwidth; NVIDIA's Tesla K40, for example, has a memory bandwidth of 288 GB/sec, a memory size of 12 GB, and 2880 cores. NVIDIA is constantly advancing visualization by developing rendering technologies that leverage the most advanced GPU architectures and compute languages. Compute languages like CUDA have helped harness the extreme compute power and scalability of GPUs.
There have been many new programming models, such as CUDA by NVIDIA, Larrabee by Intel, and Brook+ by AMD. NVIDIA has also developed several libraries, such as cuDNN for deep learning, cuFFT for the Fast Fourier Transform, cuBLAS, FFmpeg, GIE, and others. Despite all this, a programmer still has to spend several hours optimizing code to achieve high performance, and a clear understanding of the underlying architecture and its performance characteristics makes a great deal of difference while programming. Therefore we try to understand the GPU architecture and analyze thread- and memory-level parallelism with the help of an analytical model.
Despite the high degree of parallelism offered by a GPU, there are differences between a GPU and a supercomputer node. On comparing a GPU with a supercomputer node, we find a few weaknesses of the GPU that are important to consider for high-performance computing. To further understand the nature of GPUs in large-scale HPC, analyzing errors in a supercomputer gives insights and recommendations for current and future large-scale GPU-enabled HPC centers. By understanding the above-mentioned areas, we aim to realize the extremity of parallelism offered by GPUs in high-performance computing.


II. Background and Motivation

I. Background on CUDA programming model

In order to harness thread-level parallelism, NVIDIA introduced the CUDA programming language. A CUDA program consists of a host program and data-parallel kernel functions which are executed on the GPU. The kernel functions are invoked by the host program with a specified number of threads, blocks and grids. In general, the CPU is referred to as the host and the GPU as the device.
CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Threads in CUDA have a three-level hierarchy; Figure 2.1 depicts it. A grid is a set of thread blocks that execute a kernel function. Each grid consists of blocks of threads, and each block is composed of hundreds of threads. Threads within one block can share data using shared memory and can be synchronized at a barrier. All threads within a block are executed concurrently on a multithreaded architecture, and a group of 32 threads executing concurrently is called a warp. The programmer is free to specify the number of threads per block and the number of blocks per grid.

Figure 2.1: Thread level hierarchy
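As a concrete illustration of these abstractions, the sketch below is a minimal CUDA example (the kernel name, array size, and launch configuration are illustrative choices, not taken from the survey). It launches a grid of thread blocks, stages data in per-block shared memory, and uses the barrier to make every thread's write visible to the rest of its block.

#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block reverses its own 256-element tile of the input.
__global__ void reverseTile(const float *in, float *out) {
    __shared__ float tile[256];                      // shared within one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    tile[threadIdx.x] = in[i];
    __syncthreads();                                 // barrier: whole block waits here
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read a value another thread wrote
}

int main() {
    const int n = 1 << 20;                           // multiple of the block size
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    dim3 block(256);                                 // threads per block
    dim3 grid(n / 256);                              // blocks per grid
    reverseTile<<<grid, block>>>(in, out);           // host invokes the device kernel
    cudaDeviceSynchronize();
    printf("out[0] = %.1f\n", out[0]);               // expect 255.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}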

II. Background on GPU architecture

The GPU architecture consists of a scalable number of streaming multiprocessors (SMs), each containing streaming processor (SP) cores, special function units (SFUs), a multithreaded instruction fetch and issue unit, a read-only constant cache, and a read/write shared memory. NVIDIA has developed a series of microarchitectures: Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta. A Fermi SM consists of 32 CUDA processor cores, 16 load/store units, four special function units, a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers with instruction dispatch units. Figure 2.2 depicts the Fermi architecture.

Figure 2.2: Fermi architecture

Load/Store Units: allow source and destination addresses to be calculated for 16 threads per clock, and load and store data from/to cache or DRAM.
Special Function Units (SFUs): execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock, so a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
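Many of the per-device figures mentioned in this section (number of SMs, shared memory per block, registers per block) can be inspected at run time with the standard CUDA runtime call cudaGetDeviceProperties. The short sketch below only prints a few illustrative fields; the values it reports naturally depend on the GPU it runs on.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0
    printf("device              : %s\n", prop.name);
    printf("multiprocessors     : %d\n", prop.multiProcessorCount);
    printf("warp size           : %d\n", prop.warpSize);
    printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block : %d\n", prop.regsPerBlock);
    printf("global memory       : %zu bytes\n", prop.totalGlobalMem);
    return 0;
}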


Register file: each SM has a large, unified register file that is shared across the warps executing on that SM. Each thread in a block has its own independent copy of the register variables it declares. Variables that are too large are placed in local memory, which is located in device memory. The local memory space is not cached, so accesses to it are as expensive as normal accesses to device memory.
Device memory (global memory) is located on the graphics card and can be accessed by all threads. Constant memory is used for data that will not change over the course of a kernel execution; it is used in place of device memory to reduce memory bandwidth.
Texture cache: texture memory is another variety of read-only memory that can improve performance and reduce memory traffic when reads have certain access patterns. In short, the texture cache is used to exploit spatial locality.
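The memory spaces just described map directly onto CUDA qualifiers. The following hedged sketch (the kernel name, the 16-coefficient filter, and the sizes are made up for illustration) touches constant, shared, per-thread (register/local), and global memory in one kernel.

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeff[16];                        // constant memory: read-only in the kernel

__global__ void memorySpaces(const float *gin, float *gout, int n) {
    __shared__ float tile[256];                      // on-chip shared memory, one tile per block
    float partial[16];                               // small per-thread array: usually registers;
                                                     // large arrays end up in local memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? gin[i] : 0.0f;     // global memory -> shared memory
    __syncthreads();

    float acc = 0.0f;
    for (int k = 0; k < 16; ++k) {
        partial[k] = tile[threadIdx.x] * coeff[k];
        acc += partial[k];
    }
    if (i < n) gout[i] = acc;                        // result back to global memory
}

int main() {
    const int n = 1 << 16;
    float h_coeff[16];
    for (int k = 0; k < 16; ++k) h_coeff[k] = 1.0f / 16.0f;
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // fill constant memory from the host

    float *gin, *gout;
    cudaMallocManaged(&gin, n * sizeof(float));
    cudaMallocManaged(&gout, n * sizeof(float));
    for (int i = 0; i < n; ++i) gin[i] = (float)i;

    memorySpaces<<<(n + 255) / 256, 256>>>(gin, gout, n);
    cudaDeviceSynchronize();
    printf("gout[10] = %.1f\n", gout[10]);           // expect 10.0 with this averaging filter
    cudaFree(gin);
    cudaFree(gout);
    return 0;
}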
III. Analytical Model

To analyze the performance bottlenecks of a GPU, an analytical model is used to estimate the cost of memory operations and the parallelism available to GPU applications. In this model the warp is treated as the execution unit of the GPU, and a warp that is waiting for memory values is called a memory warp. Two terms used in this section are MWP and CWP: the number of memory requests that can be serviced concurrently is called memory warp parallelism (MWP), and computation warp parallelism (CWP) is defined as the amount of computation that can be done by other warps while one warp is waiting for memory values.
The memory access pattern of a warp can be coalesced or uncoalesced. The SM executes one warp at a time and schedules warps in a time-sharing fashion. When the SM executes a memory instruction, it generates memory requests and switches to another warp until all the memory values for that warp are ready. Ideally, all memory accesses of a warp are executed as a single memory transaction; however, this depends on the memory access pattern. If the memory addresses accessed by the threads in a warp are sequential, all of the memory requests within the warp can be coalesced into a single memory transaction. If every thread in a warp generates a different memory address, it will generate different transactions. If the memory requests of a warp are uncoalesced, the warp cannot proceed until all memory transactions from that warp are serviced, which takes significantly longer than waiting for only one memory request.

Figure 3.3: a) coalesced memory access, b) uncoalesced memory access
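To make the coalescing condition concrete, the two kernels below differ only in how a thread's index maps to an address (a hedged sketch; the stride of 32 elements is simply one example of a pattern that breaks coalescing on most NVIDIA GPUs). They are launched exactly like the earlier sketches, e.g. copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n).

// Coalesced: consecutive threads of a warp touch consecutive addresses,
// so the warp's 32 loads can be combined into a few memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive threads touch addresses 32 elements (128 bytes)
// apart, so each thread's request tends to fall in a different memory
// segment and the warp waits for many separate transactions.
__global__ void copyStrided(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];            // artificial scattering stride
}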
To explain how executing multiple warps on each SM affects the total execution time, we consider the following scenarios. A computation period is the period during which instructions of one warp are executed on the SM, and a memory waiting period is the period during which that warp's memory requests are being serviced.
Case 1: In Figure 3.1 a, we assume that all the computation periods and memory waiting periods belong to different warps and that one computation period is roughly one third of one memory waiting period. The system can service two memory warps simultaneously, so the processor can finish the computation periods of three warps during one memory waiting period. As a result, six of the computation periods are completely overlapped with memory waiting periods of other warps. Hence, only 2 computation periods and 4 memory waiting periods contribute to the total execution cycles.
Case 2: In Figure 3.1 b, there are four warps, and each warp has two computation periods and two memory waiting periods. The second computation period of a warp can start only after the first memory waiting period of that warp has finished, and the system can again service two memory warps simultaneously. First, the processor executes the first computation periods of the four warps one by one. By the time the processor finishes the first computation periods of all warps, two memory waiting periods have already been serviced, so the processor can execute the second computation periods of those two warps. After that, the first memory waiting periods of warps 3 and 4 have not yet completed, so the second computation periods of warps 3 and 4 cannot begin, which leaves a few idle cycles between computation periods. Despite these idle cycles, the total execution cycles are the same as in Case 1.


Figure 3.2

Case 3: Here the system can service 8 memory warps simultaneously, so the total execution cycles are equal to 8 computation periods and 1 memory waiting period.
Case 4: Every warp consists of two computation periods and two memory waiting periods, and the second computation period cannot start until the first memory waiting period of the same warp has finished. The total execution cycles are nevertheless the same as in Case 3.

Figure 3.3

Case 5: The system can again service 8 memory warps at once, but now the computation period is longer than the memory waiting period. In this case a memory waiting period completes before even one computation period is finished, so the memory latency is entirely hidden. The total execution cycles are 8 computation periods and 1 memory waiting period.

Figure 3.4

Cases 6, 7: If there are not enough warps running, the application cannot take advantage of all the available parallelism. Since only one warp is running (Case 6), all the executions are serialized.

Figure 3.5

Synchronization Effects: The CUDA programming model supports thread synchronization. Normally, all threads execute asynchronously whenever all the source operands in a warp are ready. However, if there is a barrier, the processor cannot execute the instructions after the barrier until all the threads reach the barrier, so there will be additional delays due to thread synchronization. Figure 3.6 illustrates this additional delay.

Figure 3.6

The key idea of the analytical model is to understand the bottlenecks in GPU thread-level parallelism.
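Reading the cases above together, a much simplified summary can be written down (this is our own notation and a deliberately rough sketch, not the exact equations of the model being surveyed). Let each of $N$ running warps have one computation period of length $C_p$ and one memory waiting period of length $M_p$, let MWP memory warps be serviceable at once, and let $\mathrm{CWP} \approx (M_p + C_p)/C_p$. Then

\[
T_{\mathrm{exec}} \;\approx\;
\begin{cases}
\dfrac{N}{\mathrm{MWP}}\, M_p + \mathrm{MWP}\cdot C_p, & \mathrm{CWP} > \mathrm{MWP} \quad \text{(memory waiting dominates)},\\[1.5ex]
N\cdot C_p + M_p, & \mathrm{MWP} \ge \mathrm{CWP} \quad \text{(memory latency is hidden)}.
\end{cases}
\]

With the numbers of Case 1 ($N = 8$ computation periods in total, $\mathrm{MWP} = 2$, $M_p = 3\,C_p$, hence $\mathrm{CWP} = 4$), the first line gives $4\,M_p + 2\,C_p$, i.e. four memory waiting periods and two computation periods, as described; with $\mathrm{MWP} = 8$ as in Case 3, the second line gives $8\,C_p + M_p$.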


IV. Differences between a GPU and a Supercomputer node

Comparing a GPU with a supercomputer, we find fairly many similarities, but we also find a few crucial differences between GPUs and conventional parallel supercomputers. Table 4.1 lists the differences between T2K Todai and the C1060 GPU. In this section we compare T2K Todai, then the fastest supercomputer in Japan, with the NVIDIA Tesla C1060.
T2K has a smaller SIMD vector length and is hence easier to program; however, the SIMD vector length is also a factor in the cost of the control hardware. The SIMD vector length is the first big difference between a GPU and a supercomputer: programming a GPU is not easy.
The peak performance of one node of T2K is 294 Gflops for single precision and 147 Gflops for double precision. The single-precision peak of the C1060 (933 Gflops) is much higher than that of a T2K node, but its double-precision peak (78 Gflops) is lower. The ratio of single-precision to double-precision peak performance is 2 for T2K and about 12 for the C1060. On a T2K node, DDR2 memory of 8 GB (minimum) is attached to each CPU; since a node has 4 CPUs, at least 32 GB of main memory is installed per node (the maximum is 128 GB per node). This is a much larger figure than the 4 GB of the C1060. The ratio of main memory size to single-precision performance is 32 (GB) / 294 (Gflops) = 0.109 for T2K and 4 (GB) / 933 (Gflops) = 0.004 for the C1060. These ratios suggest fundamental limitations on granularity, which is a very important concept in parallel processing; the very small ratio of the C1060 suggests limited performance whenever granularity strongly affects parallel performance. The small main memory size is the second big difference between a GPU and a supercomputer.
The L1 cache of the Opteron processor in T2K is 64 KB per core. The shared memory of the C1060 is only 16 KB per multiprocessor, but each multiprocessor of the C1060 has 64 KB of registers, so the sizes of the fastest on-chip memories are similar. However, the penalty of a register spill is quite different. On a CPU, data spilled from a register is stored in the L1 cache and can be reloaded in a few cycles; on a GPU, the spilled data is stored in device memory, and it takes several hundred cycles to reload it. This is the third big difference between a GPU and a supercomputer. Register allocation is therefore very important on a GPU, yet it is surprisingly difficult for a programmer to control.

Figure 4.1: Comparison of a T2K Todai node and the Tesla C1060 (Table 4.1)

The Opteron processor has 512 KB of L2 cache per core and 2 MB of L3 cache per CPU; in many applications those caches reduce the performance degradation caused by long memory access latency. An NVIDIA GPU has two levels of read-only caches, enabled by texture fetch, but their access latencies are a few hundred cycles. Those caches reduce the demands on memory bandwidth rather than the memory access latency. This is the fourth big difference between a GPU and a supercomputer.


The four biggest differences between a GPU and a supercomputer discussed here are: SIMD vector length (32 or more vs 4), small memory (4 GB vs 32 GB) relative to the single-precision peak performance, the absence of a fast L2 cache on the GPU, and the register spill penalty (hundreds of cycles vs a few cycles).
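The register spill penalty can be made visible with a toy kernel. In the hedged sketch below (the kernel and its sizes are purely illustrative), the large, dynamically indexed per-thread array usually cannot be kept in registers, so the compiler places it in local memory, which on the C1060-class GPUs discussed here means uncached device memory. Compiling with nvcc --ptxas-options=-v reports the registers used and the local-memory/spill bytes, which makes the effect easy to check.

// Illustrative only: a 2 KB per-thread array, indexed with a runtime value,
// typically ends up in local memory rather than registers.
__global__ void spillProne(const int *idx, float *out, int n) {
    float buf[512];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < 512; ++k)
        buf[k] = (float)(k + i);
    // The dynamic index defeats register promotion, so this access may pay
    // a device-memory latency of several hundred cycles on such GPUs.
    out[i] = buf[idx[i] & 511];
}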
V. Understanding Errors in a Supercomputer

Understanding GPU errors and their implications for a large-scale system gives insight to future GPU architects. This section presents an analysis of GPU errors on the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF); the data cover single-bit and double-bit errors for all 18,688 GPUs in the Titan supercomputer.
In the Titan supercomputer, all major storage structures of the GPUs used by HPC applications are protected with a Single Error Correction, Double Error Detection (SECDED) ECC, including device memory, L2 cache, instruction cache, register files, shared memory, and the L1 cache region. However, not all resources benefit from ECC protection; for example, logic, queues, the thread block scheduler, the warp scheduler, the instruction dispatch unit, and the interconnect network are not covered. To carry out this analysis, large-scale scientific applications were run on the Titan supercomputer for days, and the failures during these periods were recorded and analyzed.
The study of these errors observed that most of them occurred on the same day or within the span of a couple of days, so GPU failures have a strong temporal locality. Because of this characteristic, rigorous tests can be performed during the production phase to identify bad cards early. This result is also important for reducing I/O overhead significantly by employing techniques such as "lazy checkpointing", and the finding is useful for fault-tolerance studies.
On investigating the GPU cards that caused the failures, it was seen that certain GPUs experience DBEs more often than others, and that most of these cards were located in cages with high temperature. Although this observation suggests that GPUs are sensitive to temperature, it is not conclusive, since not all of the double-bit errors were due to temperature.
The investigation was carried one step further to find the GPU cards with SBEs. It was found that 98 percent of all single-bit errors occur in only 10 GPU cards, which are therefore more prone to recurrence of SBEs. GPU cards that experience most of the SBEs are also likely to have all of their SBEs occur in the device memory instead of the L2 cache. This finding can be useful for future architects in deciding which structures need better protection (device memory and L2 cache) and which structures may not need additional costly protection schemes (L1 cache, register file and texture memory).
Several further observations were made by conducting a radiation experiment: the Kepler generation of GPUs is significantly more resilient than the Fermi generation thanks to an improved cell design, although Kepler shows a significantly higher DBE rate than the Fermi architecture due to its smaller transistor size.
Overall, this study of GPU failures on a large-scale system derives insights about GPU error characteristics that can be used to improve the operational efficiency of large-scale HPC facilities, and its implications can help future generations of GPU architectures.
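On NVIDIA GPUs, the single-bit (corrected) and double-bit (uncorrected) ECC error counts discussed in this section are exposed through the NVML management library. The sketch below is a minimal example, assuming NVML headers and an ECC-enabled GPU are available (link with the NVML library, e.g. -lnvidia-ml); it reads the aggregate counters for device 0.

#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned long long corrected = 0, uncorrected = 0;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        // SBEs appear as corrected errors, DBEs as uncorrected errors.
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED,
                                    NVML_AGGREGATE_ECC, &corrected);
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_AGGREGATE_ECC, &uncorrected);
        printf("corrected   (single-bit): %llu\n", corrected);
        printf("uncorrected (double-bit): %llu\n", uncorrected);
    }
    nvmlShutdown();
    return 0;
}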
VI. Conclusion

This survey aimed to study the performance of GPUs in HPC systems. To harness the performance of a GPU, it is very important that the programmer is fully aware of the thread- and memory-level parallelism the GPU offers. Although GPUs provide high parallelism, implementing HPC algorithms on GPUs still faces many challenges, such as communication latency, memory bandwidth, and the register spill penalty. Analyzing the errors caused by GPUs in a supercomputer gives insight into future architectural improvements to be made to current GPUs.


