Gpu-Arc
December 5, 2019
GPU Architectures and Programming Soumyajit Dey, Assistant Professor, CSE, IIT Kharagpur
Course Organization
Topic                                            Week    Hours
Review of basic COA w.r.t. performance           1       2
Intro to GPU architectures                       2       3
Intro to CUDA programming                        3       2
Multi-dimensional data and synchronization       4       2
Warp Scheduling and Divergence                   5       2
Memory Access Coalescing                         6       2
Optimizing Reduction Kernels                     7       3
Kernel Fusion, Thread and Block Coarsening       8       3
OpenCL - runtime system                          9       3
OpenCL - heterogeneous computing                 10      2
Efficient Neural Network Training/Inferencing    11-12   6
Handling Data Level Parallelism
Data parallel algorithms handle multiple data points in each basic step (single thread of control); see the CUDA sketch after the list below.
- Vector Processors : early style of data-parallel compute
- Single Instruction Multiple Data (SIMD) in x86 : MMX (Multimedia Extensions), AVX (Advanced Vector Extensions)
- GPUs : have their own distinguishing characteristics
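As an illustration of the GPU flavor of this pattern, here is a minimal CUDA sketch (illustrative names, not from the slides): every thread executes the same code, a single thread of control, on its own data point.

// Minimal sketch: data-parallel vector add in CUDA (illustrative names).
// All threads run the same instruction stream; each handles one element.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // ignore out-of-range threads
        c[i] = a[i] + b[i];
}
// Launch with one thread per element, e.g.:
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);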
Vector Processors
- Vector registers : each vector register is a fixed-length bank holding a single vector
- Functional units are also vectorized
- Original scalar registers are also present
- VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide (8 × 64 × 64 bits = 4 KB of vector-register state)
Vector Processors : Consider a simple Y = a × X + Y operation (DAXPY)

(a) MIPS code:

      L.D     F0,a          ;load scalar a
      DADDIU  R4,Rx,#512    ;last address to load
Loop: L.D     F2,0(Rx)      ;load X[i]
      MUL.D   F2,F2,F0      ;a × X[i]
      L.D     F4,0(Ry)      ;load Y[i]
      ADD.D   F4,F4,F2      ;a × X[i] + Y[i]
      S.D     F4,0(Ry)      ;store into Y[i]
      DADDIU  Rx,Rx,#8      ;increment index to X
      DADDIU  Ry,Ry,#8      ;increment index to Y
      DSUBU   R20,R4,Rx     ;compute bound
      BNEZ    R20,Loop      ;check if done

(b) VMIPS code:

      L.D     F0,a          ;load scalar a
      LV      V1,Rx         ;load vector X
      MULVS.D V2,V1,F0      ;vector-scalar multiply
      LV      V3,Ry         ;load vector Y
      ADDVV.D V4,V2,V3      ;add
      SV      V4,Ry         ;store the result

Figure: Assuming the data size < vector storage (Ref: Computer Architecture: A Quantitative Approach (Hennessy & Patterson))

Here is the VMIPS code for DAXPY. The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only 6 instructions versus almost 600 for MIPS. This reduction occurs because the vector operations work on 64 elements at a time, so the overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code. When the compiler produces vector instructions for such a sequence and the resulting code spends much of its time running in vector mode, the code is said to be vectorized or vectorizable. Loops can be vectorized when they do not have dependences between iterations of a loop, which are called loop-carried dependences.

Another important difference between MIPS and VMIPS is the frequency of pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each vector instruction will only stall for the first element in each vector; subsequent elements flow smoothly down the pipeline.
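For comparison with the course's GPU focus, a minimal CUDA sketch of the same DAXPY (illustrative code, not from the textbook): as in VMIPS, the per-element loop and its overhead instructions disappear, with each element handled by its own thread.

// Minimal CUDA sketch of Y = a × X + Y (DAXPY); names are illustrative.
// Each thread plays the role of one vector lane: no loop increment,
// bound computation, or branch appears in the kernel body itself.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the vector tail
        y[i] = a * x[i] + y[i];
}
// Host-side launch, e.g.: daxpy<<<(n + 127) / 128, 128>>>(n, a, d_x, d_y);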
Vector Processors
Figure: two vector operands, elements A[1] ... A[9] and B[1] ... B[9], processed element-wise
GPUs
Ideas from parallel instruction handling in vector architectures, ILP techniques, etc. were borrowed to accelerate graphics processing
Figure: GPU systems (GeForce 8800) - Hennessy, Patterson (reproduced). Host CPU and system memory connect over a bridge to the GPU; a host interface feeds 16 SMs (each a group of SPs with SFUs), texture units with Tex L1 caches, and an interconnection network.
GPU Architecture (Tesla)
- The earlier figure depicts a GPU with an array of 128 streaming/scalar processor (SP) cores, organized as 16 multithreaded streaming multiprocessors (SMs)
- Each SM has 8 SPs (16 SMs × 8 SPs = 128 SPs)
- 2 SMs together are arranged as independent processing units called texture/processor clusters (TPCs)
Early GPUs
Early GPUs accelerated the logical graphics pipeline
Figure: the logical graphics pipeline, ending in the Pixel Shader and Raster Operations / Output Merger stages
Shader Programs
The graphics application sends the GPU a sequence of vertices grouped into geometric primitives: points, lines, triangles, and polygons.
- The input assembler collects vertices and primitives
- Vertex shader programs map the position of vertices onto the screen, altering their position, color, or orientation
- Geometry shader programs operate on geometric primitives (such as lines and triangles) defined by multiple vertices, changing them or generating additional primitives
Shader Programs
Shader programs are usually written in a dataflow style; they model how light interacts with different materials and render complex lighting and shadows.
- The setup and rasterizer unit generates pixel fragments (potential contributions to pixels) that are covered by a geometric primitive
- The pixel shader program fills the interior of primitives, including interpolating per-fragment parameters, texturing, and coloring
- The raster operations processing (or output merger) stage performs depth testing, stencil testing, color blending operations, etc.
Ref : "Computer Organization and Architecture" - Hennessy, Patterson (Appendix A
on GPUs) TECHNO
GPUs : massive multi-threading
Design goals
- Cover the latency of memory loads and texture fetches from DRAM
- Support fine-grained parallel graphics shader (and general parallel compute) programming models
- Virtualize the physical processors as threads and thread blocks to provide transparent scalability
- Simplify the parallel programming model to writing a serial program for one thread
First generation GPUs
Trade-off
Tesla architecture
We come back to the GeForce 8800 GPU with 128 SPs organized as 16 SMs
- External DRAM control and fixed-function raster operation processors (ROPs) perform color and depth frame buffer operations directly on memory
- The interconnection network carries computed pixel-fragment colors and depth values from the SPs to the ROPs
- The network also routes texture memory read requests from the SPs to DRAM, and the data read from DRAM back to the SPs through a level-2 cache
Graphics in Tesla
GPGPU
GPGPU
- Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units
- The SM uses its two SFU units for transcendental functions
- Each SFU also contains four floating-point multipliers
- In total, an SM has eight MAD units and eight (2 × 4) SFU floating-point multipliers
SIMT
SIMT
Warp execution
- In each operation cycle, the SM warp scheduler selects one of the 24 resident warps
- An issued warp executes over four processor cycles (32 threads over 8 SP cores)
- The SP cores and SFU units execute instructions independently
ISA
Register File
Fermi GTX 480 GPU
Has
- 16 SMs, 512 CUDA cores in total
- Each SM has 32 SPs and 32,768 32-bit registers, divided logically across the executing threads
- Each SIMD thread (warp) is limited to no more than 64 registers
- A warp thus has access to 64 × 32 registers, each 32 bits wide
- Alternatively, considering double-precision floating-point operands, a warp has access to 32 vector registers of 32 elements, each of which is 64 bits wide
This register budget limits how many warps can be resident at once; a sketch of capping register usage follows.
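Since the budget bounds warp residency (at the 64-register cap, 32,768 / 64 = 512 threads fit per SM), CUDA lets the programmer cap register usage per kernel. A minimal sketch, with assumed kernel name and bounds:

// Sketch: bounding per-thread register use so more warps stay resident.
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) is a compiler hint.
__global__ void __launch_bounds__(256, 4)
copy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
// A whole file can also be capped at compile time: nvcc --maxrregcount=32 file.cu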
Fermi Streaming Multiprocessor (SM)
Figure: Fermi Streaming Multiprocessor (SM) - an instruction cache, CUDA cores arranged as 16 SIMD lanes, 16 load/store (LD/ST) units issuing addresses to cache or DRAM, 4 SFUs, and an interconnect network; each CUDA core contains a dispatch port, an operand collector, FP and integer units, and a result queue
- Each SM has 4 SFUs; each SP has one FP unit and one integer ALU
Memory Hierarchy
Fermi Memory Hierarchy
- Shared memory enables threads to cooperate, facilitates reuse of on-chip data, and reduces off-chip traffic
- Each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache (the host call sketched below selects this split)
- Source : NVIDIA Whitepaper on Fermi
Figure: Fermi memory hierarchy - Thread, Shared Memory / L1 Cache, L2 Cache, DRAM
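A minimal sketch using the CUDA runtime call for this (kernel name and body are illustrative):

#include <cuda_runtime.h>

// This kernel leans on shared memory, so it prefers the 48-KB shared split.
__global__ void tile_kernel(float *data) {
    __shared__ float tile[1024];   // staging buffer in shared memory
    int i = threadIdx.x;
    tile[i] = data[i];             // stage in shared memory
    __syncthreads();
    data[i] = 2.0f * tile[i];
}

int main() {
    // Request 48 KB shared / 16 KB L1 (cudaFuncCachePreferL1 is the opposite).
    cudaFuncSetCacheConfig(tile_kernel, cudaFuncCachePreferShared);
    // ... allocate d_data, then launch: tile_kernel<<<1, 1024>>>(d_data);
    return 0;
}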
Fermi Memory Hierarchy
Figure: multiple DRAM partitions in the Fermi memory hierarchy
GPU ISA
PTX instructions
- Format : opcode.type d, a, b, c;
- a, b, c are source operands; d is the destination operand
- Source operands are 32-bit or 64-bit registers or a constant value
- All instructions can be predicated by 1-bit predicate registers, which can be set by a set-predicate instruction (setp); see the sketch below
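For instance, a guarded CUDA statement and the kind of setp-plus-predication it can compile to (a sketch; the PTX shown is indicative output, not taken from the slides):

// CUDA source: the guard below can be if-converted into predication.
__global__ void clamp_negatives(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)
        x[i] = 0.0f;   // executed only where the predicate holds
}
// Indicative PTX in the opcode.type d, a, b, c format described above:
//   setp.lt.f32   %p1, %f1, 0f00000000;   // set predicate: x[i] < 0.0f
//   @%p1 st.global.f32 [%rd4], %f2;       // store guarded by predicate %p1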
GPUs becoming ubiquitous
GPUs have started finding wide usage in several domains where workloads have become intensive
- Mobile GPUs : ARM Mali, Adreno GPUs (Qualcomm) - accelerate graphics as well as compute tasks
- NVIDIA in the embedded space : Jetson TX / Nano / AGX Xavier ⇒ multi-core ARM CPU + 128-512 core GPU targeting AI / Deep Learning tasks
- NVIDIA Drive : for implementing autonomous car and ADAS functionality powered by deep learning (Tesla cars!)
GPUs as mobile workload accelerators
Figure: ARM-based mobile SoC
Integrated GPUs in Desktop Systems
With the release of AMD's Fusion and Intel's Ivy Bridge (i3, i5, i7) architectures in 2011-12, the trend of fused CPU-GPU architectures started
- CPU and GPU access the same physical memory, so that zero-copy transfers can be employed
- Zero-copy transfers ensure coherency; they translate pointers to memory buffers for the common CPU and GPU address space but do not actually transfer data (a setup sketch follows this list)
- Bad effect - CPU and GPU compete for the memory bandwidth of the shared physical memory
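A minimal sketch of setting up zero-copy in CUDA on such systems (buffer size and names are illustrative): the device receives a pointer alias into host memory, so no copy is performed.

#include <cuda_runtime.h>

int main() {
    float *h_buf = nullptr, *d_view = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                  // allow mapping host memory
    cudaHostAlloc((void **)&h_buf, 1024 * sizeof(float),
                  cudaHostAllocMapped);                     // pinned + GPU-mappable
    cudaHostGetDevicePointer((void **)&d_view, h_buf, 0);   // GPU-visible alias
    // kernel<<<grid, block>>>(d_view);  // GPU reads/writes h_buf directly
    cudaFreeHost(h_buf);
    return 0;
}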
Integrated GPUs in Desktop Systems
- In more recent architectures, Intel Broadwell and beyond, CPU and GPU were further integrated
- They access the shared last-level cache (LLC)
- This helps the CPU and GPU execute computational kernels on the same data in parallel, collaboratively (the LLC enables cache coherence between CPU and GPU)
Figure: Fused CPU-GPU with shared LLC - CPU cores (each with L1-I$, L1-D$, and L2$) and the GPU (with its own cache and L2$) connected through a shared last-level cache and system bus to main memory (DDR)
Ref : "Co-Scheduling on Fused CPU-GPU ..."
Jetson Series from NVIDIA
- The TK1 SoC incorporates a quad-core 2.32 GHz 32-bit ARM machine and an integrated Kepler GK20a GPU
- The CPUs share a 2-MB L2 cache
- The GPU has 192 cores and a 128-KB L2 cache
- The CPU also has 'little' ARM cores (not shown) - low power, low performance
Figure: Jetson TK1 - CPU 0 ... CPU 3 (each with 32-KB L1-I and 32-KB L1-D caches) sharing a 2-MB L2; a 192-core GPU with its own L2; a memory controller; and DRAM Bank 0 ... Bank 31, 64 MB each
NVIDIA Drive series of systems
PX Platform