Lec 14
A Sahu
Dept of CSE, IIT Guwahati
Outline
• Amdahl’s Law
– Relaxing Assumptions
• GPU
• Application classification for acceleration
Performance of Parallel Program
(Amdahl’s Law)
Example: OpenMP Parallel Program
printf("begin\n");        // Serial
N = 1000;                 // Serial
#pragma omp parallel for
for (i=0; i<N; i++)
  A[i] = B[i] + C[i];     // Parallel
M = 500;                  // Serial
#pragma omp parallel for
for (j=0; j<M; j++)
  p[j] = q[j] - r[j];     // Parallel
printf("done\n");         // Serial
• Notation: T1 = time on a uniprocessor, Tp = time on p processors
– Speedup: Sp = T1/Tp ≤ p
– Efficiency: Ep = T1/(p·Tp) = Sp/p (worked example below)
• Usually Sp < p or Ep < 1 due to overhead
• Sometimes superlinear speedup is reported (Sp > p or Ep > 1)
– Failure to use the best sequential algorithm
– Advantage due to larger memory
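– Worked example (illustrative numbers, not from the slides): if T1 = 100 s and T4 = 30 s on p = 4 processors, then S4 = 100/30 ≈ 3.33 and E4 = 3.33/4 ≈ 0.83.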
[Figure: speedup Sp versus serial fraction s]
Assumptions behind Amdahl’s Law
• All the processors are homogeneous
• All the communication costs are zero
• All the memory accesses take unit time (PRAM)
• All the parallel sections are purely parallel: divisible load
Sp = 1 / ( s + (1 − s)/p )
lim (p → ∞) Sp = 1/s
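For example (illustrative numbers, not from the slides): with serial fraction s = 0.1 and p = 8 processors, Sp = 1/(0.1 + 0.9/8) ≈ 4.7, while the limiting speedup for p → ∞ is 1/s = 10.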
All the memory accesses take unit time (PRAM)
• Memory hierarchy: cache memory
• Suppose application A runs on a 2 GHz Intel Pentium 4 uniprocessor and takes 10 minutes
• The same application A run on a 2 GHz Intel i5 processor (quad core) may run much more than 4X faster: superlinear speedup
– Earlier the cache size was 1 MB in the P4
– Now the cache size is 4 MB, so the whole application A may fit into the cache: no capacity misses…
All the communication costs are zero
• Pthread creation and fork/join take a significant amount of time
[Figure: ideal vs actual execution timelines]
• Ideal scenario: T = Ts1 + Ts2 + Tp
• Actual scenario with shared resources: T = Ts1 + Ts2 + Tp + 4·Tcs
All the processors are homogeneous
• Asymmetric Processing Environment
• One big core and many small or tiny cores
• Intel Xeon : 8 big cores
[Figure: speedup versus grain size: overhead-limited at small grain sizes, load-imbalance- and parallelism-limited at large grain sizes]
[Figure: Host with 8 big cores and 16 GB memory attached to a Card with 10000 tiny cores and 16 GB memory]
GPU
• GPU vs CPU
• GPU Cards
– GTX 4090: 16496 cores, 48 GB DDR6, 384-bit interface, Rs 1.3L
GPU Philosophy
• Small independent functions/code executed a huge number of times
• Number of cores in the thousands; tiny cores
• Cores are organized into clusters
– Kepler SMX: 14 SM, 192 SP CUDA cores/SM, 64 DP units, 32 SFU, 32 LD/ST units
– TU102 RTX 2080 Ti: 72 SM, 4608 CUDA cores, 576 Tensor cores, 72 ray-tracing cores
• Explicit, programmer-controlled memory hierarchy
• Also an implicit memory hierarchy: cache
GPU
• Graphics cards plug into the motherboard PCI slot
– PCI: Peripheral Component Interconnect
• Purpose: to accelerate graphics computation
• Earlier days: it was fixed function
• Nowadays it is programmable and configurable
– Why not use it for general-purpose computation?
– For what kinds of applications?
[Figure: Host with 8 big cores and 16 GB memory connected over the FSB to a Card with 10000 tiny cores and 16 GB memory]
• GPU uses wide SIMD: 8/16/24/... processing elements (PEs)
• CPU uses short SIMD: usually a vector width of 4/8
• SFU: Special Function Unit
Source: Nvidia CUDA Programming Guide
[Figure: Streaming Processor Array of TPCs; each TPC contains SMs and a texture unit; each SM contains SPs and SFUs. A grid of thread blocks is mapped onto the array, and individual threads are mapped onto the SPs.]
TU102 (RTX 2080 Ti) Architecture
• 4608 CUDA cores
• 72 SMX, 64 SP/SMX
• 576 Tensor cores
• 72 ray-tracing cores
• 11 GB DDR6 RAM
• Each SMX has
– 56 KB texture cache
– 65 KB constant memory (scratchpad)
– 49 KB L1/shared memory
– Uniform cache
– Separate shared instruction cache per SM
A Newer GPU Card
• 18,176 CUDA cores
• 142 SMX, 128 SP/SMX
• 24 GB DDR6 RAM
– 96 MB L2 cache
• Each SMX has
– 128 KB L1 cache
– Separate shared instruction cache per SM
Source: CUDA Prog. Guide 4.0
• Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Prog. model is “Single Instruction Multiple Thread”
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management,
not applications or OS
Mapping SW Organization to GPU
• Software: CUDA program and CUDA threads
– Threads are organized into grids, blocks, and threads
– A grid contains blocks; a block contains many threads
• Hardware: GPU
– Cores are organized into clusters (SMs)
– GTX 690, per device: 8 SM, 192 cores/SM
– GTX 980 Ti: 22 SM, 128 cores/SM
• Mapping
– Blocks get mapped to SMs
– Threads get mapped to cores (SPs)
Example thread scheduling
• Suppose we want to create 2000 parallel threads for an application
• We organize the threads into 10 blocks, each containing 200 threads
• We run on a GTX 690 GPU with 8 SMs and 192 SP/SM
• Scheduler:
– Maps 10 blocks to 8 SMs: takes ceil(10/8) = 2 rounds
– Maps 200 threads to the 192 SPs of an SM: also takes ceil(200/192) = 2 rounds
• Rules of thumb (see the sketch below):
– The number of blocks should be a multiple of the number of SMs
– The number of threads per block should be a multiple of the number of SPs per SM
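A minimal CUDA sketch of these rules of thumb (not from the slides): query the SM count of the device and round the block count up to a multiple of it. The kernel name work, the element count n = 2000, and the 192 threads/block figure (matching the assumed 192 SP/SM of the GTX 690 example) are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void work(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
  if (i < n) a[i] = 2.0f * a[i];                   // dummy per-element work
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);               // query device 0
  int sms = prop.multiProcessorCount;              // number of SMs
  int threadsPerBlock = 192;                       // multiple of SP/SM (assumed 192)
  int n = 2000;
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  blocks = ((blocks + sms - 1) / sms) * sms;       // round up to a multiple of the SM count
  float *a;
  cudaMalloc((void **)&a, n * sizeof(float));
  work<<<blocks, threadsPerBlock>>>(a, n);
  cudaDeviceSynchronize();
  cudaFree(a);
  return 0;
}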
//Invoke DAXPY (double a*X + Y)
DAXPY(n, 2.0, x, y);
//DAXPY function in C
void DAXPY(int n, double a, double *x, double *y){
  for(int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}
//Invoke DAXPY with 256 threads
//per thread block (host code)
int nb = (n+255)/256;
DAXPY<<<nb,256>>>(n, 2.0, x, y);
//DAXPY kernel in CUDA (device code)
__global__ void DAXPY(int n, double a,
    double *x, double *y){
  int i = blockIdx.x*blockDim.x
          + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
__global__ void DAXPY(int n, double a, double *x, double *y){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
#define N 1024
int main(){
  double x[N], y[N]; int size = sizeof(double)*N; Initialize(x,y);
  double *x_d, *y_d;                        //Declaration of device pointers
  cudaMalloc( (void **)&x_d, size);         //Memory creation on device
  cudaMalloc( (void **)&y_d, size);
  cudaMemcpy( x_d, x, size, cudaMemcpyHostToDevice );
  cudaMemcpy( y_d, y, size, cudaMemcpyHostToDevice );
  int nb = ceil(N/256.0);                   //Invoke DAXPY with 256 thrds/TB
  DAXPY<<<nb,256>>>(N, 2.0, x_d, y_d);
  cudaMemcpy( y, y_d, size, cudaMemcpyDeviceToHost );
}
kernelF<<<dim3(4,1), dim3(8,1)>>>(A);
__global__ void kernelF(int A[][8]){
  int i = blockIdx.x, j = threadIdx.x;
  A[i][j]++;
}
Both the grid and the thread block can have a two-dimensional index.
kernelF<<<dim3(2,2), dim3(4,2)>>>(A);
__global__ void kernelF(int A[][8]){
  int i = blockIdx.y * gridDim.x
        + blockIdx.x;      //flattened block index
  int j = threadIdx.y * blockDim.x
        + threadIdx.x;     //flattened thread index within the block
  A[i][j]++;
}
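The same launch written with explicit dim3 variables (a small illustrative sketch, not from the slides):

dim3 grid(2, 2);    // gridDim.x = 2, gridDim.y = 2  -> 4 thread blocks
dim3 block(4, 2);   // blockDim.x = 4, blockDim.y = 2 -> 8 threads per block
kernelF<<<grid, block>>>(A);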
Example:
Scheduling 4 thread blocks on 3 SMs.
Executed on a machine with SIMD width of 4.
Note: the number of processing elements (PEs) is transparent to the programmer.
Example 1: Vector Addition
int A[1000], B[1000], C[1000];
for(i=0; i<1000; i++)
  A[i] = B[i] + C[i];   //Parallel, independent work
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n) time, zero data transferred
• Executing on GPU
– 2*1000 data items need to be sent from Host to GPU
– Execution in O(1) time in parallel using 1000 cores
– 1000 data items need to be returned from GPU to Host
– Total: 3000 units of communication overhead
• Funny: this program is not a good candidate to run on the GPU
Example 2: Vector Sum
int A[1000];
for(i=0; i<1000; i++)
  S = S + A[i];   //Reduction: every iteration updates S
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n) time, zero data transferred
• Executing on GPU
– 1000 data items need to be sent from Host to GPU
– Execution in O(lg 1000) time in parallel using 1000 cores via a tree reduction (see the sketch below); GPU cores have very weak support for shared-variable locking
– 1 data item needs to be returned from GPU to Host
– Total: 1000 units of communication overhead
• Funny: this program is not a good candidate to run on the GPU
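A minimal CUDA sketch (not from the slides) of the lg(N)-step idea behind the O(lg 1000) claim: a tree reduction in shared memory with one atomic add per block. The kernel name blockSum and the fixed 1024-entry buffer are illustrative assumptions; blockDim.x is assumed to be a power of two no larger than 1024.

__global__ void blockSum(const int *A, int *S, int n) {
  __shared__ int buf[1024];                        // one slot per thread
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buf[threadIdx.x] = (i < n) ? A[i] : 0;           // load one element per thread
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)                      // halve the active threads each step
      buf[threadIdx.x] += buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(S, buf[0]);      // one atomic add per block
}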
Example 3: Matrix Multiplication
int A[100][100], B[100][100], C[100][100];
for(i=0; i<100; i++)
  for(j=0; j<100; j++){
    C[i][j] = 0;
    for(k=0; k<100; k++)
      C[i][j] = C[i][j] + A[i][k]*B[k][j];   //each C[i][j] can be done on one core
  }
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n^3) time, zero data transferred
– O(n^2.8) with Strassen's method
• Executing on GPU
– 2*n^2 data items sent from Host to GPU + n^2 from GPU to Host
– (n^2/3000)*n time in parallel using 3000 cores (see the kernel sketch below)
– Significant reduction in running time
– This program is a good candidate to run on the GPU
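A minimal CUDA sketch (not from the slides) of this mapping: one thread computes one C[i][j]. The kernel name matmul, the flattened row-major arrays, and the launch configuration are illustrative assumptions.

#define N 100
__global__ void matmul(const int *A, const int *B, int *C) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < N && j < N) {
    int cij = 0;
    for (int k = 0; k < N; k++)
      cij += A[i * N + k] * B[k * N + j];          // dot product of row i and column j
    C[i * N + j] = cij;
  }
}
// Possible launch: matmul<<<dim3(7,7), dim3(16,16)>>>(A_d, B_d, C_d);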
Example 4: N Body Simulation
while(1){
  for(i=0; i<N; i++){     //for each i, the force calculation can be done in parallel
    F[i] = 0;
    for(j=0; j<N; j++)
      F[i] = F[i] + G*M[i]*M[j]/(r[i][j]*r[i][j]);
  }
  for(i=0; i<N; i++){     //for each i, the update can be done in parallel
    A[i] = F[i]/M[i]; P[i] = f(A[i], P[i]);   //update acceleration and position
  }
}
• Executing on Host
– O(n^2) time for each iteration of the while loop, zero data transferred; for N > 100, very slow
• Executing on GPU (embarrassingly parallel)
– 1*n data items sent from Host to GPU (old positions), 1*n from GPU to Host; one-time transfer of the masses
– Execution in O(N) time in parallel using N cores (if N < 3000), as in the force-kernel sketch below
– Total time per iteration: 1*n + O(N) + 1*n = O(n)
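A minimal CUDA sketch (not from the slides) of the parallel force phase: one thread accumulates the total force on one body. The kernel name forces, the 1-D positions x, and the parameter list are illustrative assumptions.

__global__ void forces(const double *M, const double *x, double *F, int n, double G) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per body
  if (i >= n) return;
  double f = 0.0;
  for (int j = 0; j < n; j++) {
    if (j == i) continue;                          // skip self-interaction
    double r = x[j] - x[i];                        // 1-D separation for simplicity
    f += G * M[i] * M[j] / (r * r);                // inverse-square magnitude
  }
  F[i] = f;                                        // O(n) work per thread, all bodies in parallel
}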