
CS528 (HPC)

Amdahl's Law, GPU

A Sahu
Dept of CSE, IIT Guwahati

Outline
• Amdahl's Law
– Relaxing Assumptions
• GPU
• Application classification for acceleration
Performance of Parallel Program
(Amdahl’s Law)
Example: OpenMP Parallel Program
printf("begin\n");              // Serial
N = 1000;                       // Serial
#pragma omp parallel for
for (i = 0; i < N; i++)
    A[i] = B[i] + C[i];         // Parallel

M = 500;                        // Serial
#pragma omp parallel for
for (j = 0; j < M; j++)
    p[j] = q[j] - r[j];         // Parallel

printf("done\n");               // Serial
• Notation: T1 = time on a uniprocessor, Tp = time on p processors
  – Speedup: Sp = T1/Tp ≤ p
  – Efficiency: Ep = Sp/p = T1/(p·Tp) ≤ 1
• Usually Sp < p (Ep < 1) due to overheads
• Sometimes superlinear speedup is reported (Sp > p or Ep > 1)
  – Failure to use the best sequential algorithm as the baseline
  – Advantage of the larger aggregate memory/cache
[Figure: a program with three serial sections of 100 time units each and two parallel
sections of 100 units each (total work 500). On 1, 2, 4, and ∞ processors each parallel
section shrinks to 100, 50, 25, and ≈0 units, so the total execution time is 500, 400,
350, and 300, i.e. Sp = 1X, 1.25X, ≈1.4X, ≈1.7X.]
Serial fraction: $s = \dfrac{T_s}{T_1}$

$T_p = T_s + \dfrac{T_1 - T_s}{p}$

$S_p = \dfrac{T_1}{T_p} = \dfrac{T_1}{T_s + \dfrac{T_1 - T_s}{p}}
     = \dfrac{1}{s\left(1 - \dfrac{1}{p}\right) + \dfrac{1}{p}}
     = \dfrac{p}{s(p-1) + 1}$

$\lim_{p \to \infty} S_p = \dfrac{1}{s}$; for $s = 0$, $S_p = p$; for $s = 1$, $S_p = 1$.

[Plot: $S_p$ versus the serial fraction $s$, falling from $S_p = p$ at $s = 0$ to $S_p = 1$ at $s = 1$.]
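As a quick check, plugging the serial fraction from the figure above (s = 300/500 = 0.6, under that reading of the figure) into the formula reproduces the speedups shown there:

$S_2 = \dfrac{2}{0.6(2-1)+1} = 1.25, \qquad S_4 = \dfrac{4}{0.6(4-1)+1} \approx 1.43, \qquad \lim_{p\to\infty} S_p = \dfrac{1}{0.6} \approx 1.67$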
Assumptions behind Amdahl's Law
• All the processors are homogeneous
• All communication costs are zero
• All memory accesses take unit time (PRAM)
• All parallel sections are purely parallel: divisible load

$S_p = \dfrac{1}{s\left(1 - \dfrac{1}{p}\right) + \dfrac{1}{p}}, \qquad \lim_{p \to \infty} S_p = \dfrac{1}{s}$
All memory accesses take unit time (PRAM)
• In reality there is a memory hierarchy: cache memory
• Suppose application A runs on a 2 GHz Intel Pentium 4 uniprocessor in 10 minutes
• The same application A on a 2 GHz Intel i5 (quad core) may run much more than
  4X faster: superlinear speedup
  – The cache size in the P4 was 1 MB
  – With a 4 MB cache, the whole working set of A may fit in cache: no capacity misses
All communication costs are zero
• In reality, pthread creation and fork/join take a significant amount of time
  (see the sketch below)

[Figure: timelines of the ideal and actual scenarios.
 Ideal:  T = Ts1 + Ts2 + Tp
 Actual: T = Ts1 + Ts2 + Tp + 4(Tf + Tj), where Tf and Tj are the per-thread fork
 and join overheads for the 4 threads.]
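A minimal pthread sketch (my illustration, with a hypothetical worker function) of where the extra 4(Tf + Tj) in the actual scenario comes from: every parallel region pays a creation (fork) and a join cost per thread.

#include <pthread.h>

void *worker(void *arg) {                     /* hypothetical body of the parallel section */
    /* ... do one quarter of the parallel work ... */
    return NULL;
}

void parallel_section(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)               /* 4 x Tf: thread-creation (fork) overhead */
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)               /* 4 x Tj: join overhead */
        pthread_join(t[i], NULL);
}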
All parallel sections are purely parallel: divisible load
• Parallel threads accessing shared resources make execution partly serial
• Using a larger number of processors may require more collaboration and hence
  more communication
• The parallel section of an application may not scale up with the number of
  processors: grain size
All parallel sections are purely parallel: divisible load
• Parallel threads accessing shared resources make execution partly serial
  (see the sketch below)

[Figure: timelines of the ideal scenario and the actual scenario with a shared resource.
 Ideal:  T = Ts1 + Ts2 + Tp
 Actual: T = Ts1 + Ts2 + Tp + 4(Tcs), where Tcs is the time each thread spends
 serialized on the shared resource.]
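A minimal OpenMP sketch (my illustration) of how a shared resource adds the 4(Tcs) term: the update of the shared histogram is a critical section, so the otherwise parallel iterations take turns there.

long hist[64];                                /* shared resource */

void update(const int *data, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical                  /* threads queue here and execute serially (Tcs) */
        hist[data[i] & 63]++;
    }
}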
All the processors are homogeneous
• In reality, asymmetric processing environments are common:
  one big core plus many small or tiny cores
• Intel Xeon: 8 big cores
• GPU system: 4/8 big host cores + 2000 tiny GPU cores
• Intel Xeon Phi: 4/8 big (host) cores + 250 small cores

Grain Size

[Figure: speedup versus grain size. With fine grain the speedup is overhead limited;
with coarse grain it is limited by load imbalance and lack of parallelism; the optimum
grain size lies in between. A sketch of tuning the grain size follows.]
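In OpenMP the grain size can be tuned through the chunk size of the loop schedule. A rough sketch (my illustration; heavy_work and the chunk value 64 are only assumptions):

/* heavy_work() is a hypothetical per-element function */
double heavy_work(double x) { return x * x; }

void process(const double *B, double *A, int N) {
    /* chunk = 1   (fine grain)   -> scheduling overhead dominates
       chunk = N/4 (coarse grain) -> risk of load imbalance
       64 is just an illustrative middle ground                   */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++)
        A[i] = heavy_work(B[i]);
}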


GPU
&
Application Analysis for GPGPU

GPU
• Graphics cards plug into a motherboard PCI slot
• Originally built to accelerate graphics computation
• In the early days it was fixed-function
• Nowadays it is programmable and configurable
  – Why not use it for general-purpose computation?
  – For what kind of applications?

[Diagram: host with 8 big cores and 16 GB RAM connected to a GPU card with
 10000 tiny cores and 16 GB RAM.]
GPU
• GPU vs CPU
• GPU cards
  – RTX 4090: 16496 cores, 48 GB GDDR6, 384-bit memory interface, ≈Rs 1.3L
GPU Philosophy
• Small independent functions/code executed a huge number of times
• Thousands of cores, each a tiny core
• Cores are organized into clusters
  – Kepler SMX: 14 SMs, 192 SP CUDA cores/SM, 64 DP units, 32 SFUs, 32 LD/ST units
  – TU102 (RTX 2080 Ti): 72 SMs, 4608 CUDA cores, 576 Tensor cores, 72 ray-tracing cores
• Explicit, programmer-controlled memory hierarchy (see the sketch below)
• Also an implicit memory hierarchy: caches
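A minimal CUDA sketch of what "programmer controlled" means here: each block explicitly stages its slice of the input into on-chip __shared__ memory before using it. The kernel name, the 256-thread block size, and the scale-by-2 operation are illustrative assumptions, not lecture code.

__global__ void stage_and_use(const float *in, float *out, int n) {
    __shared__ float tile[256];                    // explicit, programmer-managed on-chip memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];          // global memory -> shared memory
    __syncthreads();                               // wait until the whole block has staged its data
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];  // compute from the staged copy
}
// Launched with 256 threads per block, e.g. stage_and_use<<<(n+255)/256, 256>>>(in_d, out_d, n);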
GPU
• Graphics cards plug into a motherboard PCI slot
  – PCI: Peripheral Component Interconnect
• Originally built to accelerate graphics computation
• In the early days it was fixed-function
• Nowadays it is programmable and configurable
  – Why not use it for general-purpose computation?
  – For what kind of applications?

[Diagram: host with 8 big cores and 16 GB RAM connected over the FSB/PCI bus to a
 GPU card with 10000 tiny cores and 16 GB RAM.]
• GPU uses wide SIMD: 8/16/24/... processing elements (PEs) per SIMD group
• CPU uses short SIMD: usually a vector width of 4/8
  – SSE has 4 data lanes; a GPU has 8/16/24/... data lanes (see the sketch below)
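To make the 4-lane point concrete, here is a small illustrative sketch (mine, not the lecture's) that adds two float arrays with SSE intrinsics; it assumes n is a multiple of 4.

#include <xmmintrin.h>

void add4(const float *b, const float *c, float *a, int n) {
    for (int i = 0; i < n; i += 4) {               /* 4 data lanes per SSE instruction */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));  /* a[i..i+3] = b[i..i+3] + c[i..i+3] */
    }
}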


[Figure: the Streaming Multiprocessor (SM) is a lightweight core compared to an IA
 (x86) core; each lightweight PE supports fused multiply-add (FMA), and the SFUs are
 special function units. Source: NVIDIA CUDA Programming Guide.]
[Figure: mapping of the software hierarchy onto the hardware hierarchy. A grid of
 thread blocks runs on the Streaming Processor Array (a set of Texture Processor
 Clusters, TPCs); multiple thread blocks, i.e. many warps of threads, run on a
 Streaming Multiprocessor (SM, containing SPs, SFUs, and a texture unit); individual
 threads run on the streaming processors (SPs).]
GTX 690 Architecture
• 4608 CUDA cores
• 72 SMX, 64 SP/SMX
• 576 Tensor cores
• 72 ray-tracing cores
• 11 GB GDDR6 RAM
• Each SMX has
  – 56 KB texture cache
  – 65 KB constant memory (scratchpad)
  – 49 KB L1/shared memory
  – Uniform cache
  – Separate shared instruction cache per SM

• 18,176 CUDA cores
• 142 SMX, 1024 SP/SMX
• 24 GB GDDR6 RAM
  – 96 MB L2 cache
• Each SMX has
  – 128 KB cache
  – Separate shared instruction cache per SM
Source: CUDA Programming Guide 4.0
• Given the hardware invested to do graphics well, how can we supplement it to
  improve the performance of a wider range of applications?
• Basic idea:
  – Heterogeneous execution model
    • The CPU is the host, the GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA threads
  – The programming model is "Single Instruction Multiple Thread" (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
Mapping Software Organization to GPU
• Software: CUDA program / CUDA threads
  – Threads are organized into a grid, blocks, and threads
  – A grid contains blocks; a block contains many threads
• Hardware: GPU
  – Cores are organized into clusters (SMs)
  – GTX 690 per device: 8 SMs, 192 cores/SM
  – GTX 980 Ti: 22 SMs, 128 cores/SM
• Mapping
  – Blocks get mapped to SMs
  – Threads get mapped to cores (SPs)
Example: Thread Scheduling
• Suppose we want to create 2000 parallel threads for an application
• We organize the threads into 10 blocks, each containing 200 threads
• We run on a GTX 690 GPU with 8 SMs and 192 SP/SM
• Scheduler:
  – Mapping 10 blocks to 8 SMs takes ceil(10/8) = 2 rounds
  – Mapping 200 threads to the 192 SPs of an SM also takes ceil(200/192) = 2 rounds
• Rules of thumb:
  – The number of blocks should be a multiple of the number of SMs
  – The number of threads per block should be a multiple of the SPs per SM
//Invoke DAXPY (double a*X + Y)
DAXPY(n, 2.0, x, y);
//Function in C
void DAXPY(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

//Invoke DAXPY with 256 threads per thread block (host code)
int nb = (n + 255) / 256;
DAXPY<<<nb, 256>>>(n, 2.0, x, y);
//Kernel in CUDA (must be __global__ to be launchable from the host)
__global__ void DAXPY(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
#define N 1024

__global__ void DAXPY(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main() {
    double x[N], y[N]; int size = sizeof(double)*N;
    Initialize(x, y);                        // fill the host arrays (user-provided)
    double *x_d, *y_d;                       // device pointers
    cudaMalloc((void **)&x_d, size);         // allocate on the device
    cudaMalloc((void **)&y_d, size);
    cudaMemcpy(x_d, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y, size, cudaMemcpyHostToDevice);
    int nb = (N + 255)/256;                  // = ceil(N/256); 256 threads per thread block
    DAXPY<<<nb, 256>>>(N, 2.0, x_d, y_d);
    cudaMemcpy(y, y_d, size, cudaMemcpyDeviceToHost);
}
kernelF<<<dim3(4,1), dim3(8,1)>>>(A);

__global__ void kernelF(int A[][8]) {
    int i = blockIdx.x, j = threadIdx.x;
    A[i][j]++;
}

Both the grid and the thread block can have a two-dimensional index:

kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

__global__ void kernelF(int A[][8]) {
    int i = gridDim.x  * blockIdx.y  + blockIdx.x;    // linearized block index
    int j = blockDim.x * threadIdx.y + threadIdx.x;   // linearized thread index
    A[i][j]++;
}
Example:
Scheduling 4 thread blocks on 3 SMs.

kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

__global__ void kernelF(int A[][8]) {
    int i = gridDim.x  * blockIdx.y  + blockIdx.x;
    int j = blockDim.x * threadIdx.y + threadIdx.x;
    A[i][j]++;
}

[Figure: the same kernel executed on a machine with SIMD width 4 and on a machine
 with SIMD width 8. Note: the number of processing elements (PEs) is transparent to
 the programmer.]
Examples: Application Mapping to an Accelerator
• Machine configuration
  – Pentium quad-core host with 4 GB RAM, plus an installed GPU card with
    3000 cores and 4 GB of GPU RAM
  – We need to account for host-to-GPU data transfers and vice versa
• Applications
  – Vector addition, vector sum
  – Matrix multiplication
  – N-body simulation
  – Image adaptive histogram equalization
Example 1: Vector Addition
int A[1000], B[1000], C[1000];
for (i = 0; i < 1000; i++)      // parallel, independent work
    A[i] = B[i] + C[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – O(n) time, zero data transferred
Example 1: Vector Addition
int A[1000], B[1000], C[1000];
for (i = 0; i < 1000; i++)      // parallel, independent work
    A[i] = B[i] + C[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the GPU
  – 2*1000 data items need to be sent from the host to the GPU
  – Execution takes O(1) time in parallel using 1000 cores
  – 1000 data items need to be returned from the GPU to the host
• Total: 3000 units of communication overhead for O(1) of compute; funnily enough,
  this program is not a good candidate to run on the GPU
Example 1: Vector Addition
int A[1000], B[1000], C[1000];
for (i = 0; i < 1000; i++)      // parallel, independent work
    A[i] = B[i] + C[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – Using OpenMP/pthreads (see the sketch below)
  – Each thread handles 250 of the 1000 elements
  – No data transfer needed; memory is shared among all threads
  – Good news: time ≈ N/4
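A minimal OpenMP sketch of this host execution (my illustration; the slide's arrays are assumed to be the globals below):

int A[1000], B[1000], C[1000];

void vec_add(void) {
    #pragma omp parallel for num_threads(4)   /* each of the 4 threads handles ~250 elements */
    for (int i = 0; i < 1000; i++)
        A[i] = B[i] + C[i];                   /* data already in shared memory, no transfer */
}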
Example 2: Vector Sum
int A[1000];
for (i = 0; i < 1000; i++)      // reduction over all elements
    S = S + A[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – O(n) time, zero data transferred
Example 2: Vector Sum
int A[1000];
for (i = 0; i < 1000; i++)      // reduction over all elements
    S = S + A[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the GPU
  – 1000 data items need to be sent from the host to the GPU
  – Execution takes O(lg 1000) time in parallel using 1000 cores, as a tree reduction
    (GPU cores have very weak support for shared-variable locking); see the sketch below
  – 1 data item needs to be returned from the GPU to the host
• Total: about 1000 units of communication overhead; funnily enough, this program
  is not a good candidate to run on the GPU
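The lg(1000)-step execution mentioned above is a tree reduction. A minimal single-block sketch (my own illustration, not the lecture's code); it assumes a power-of-two block size and a launch such as vec_sum<<<1, 1024>>>(A_d, S_d, 1000):

__global__ void vec_sum(const int *A, int *S, int n) {
    __shared__ int buf[1024];
    int i = threadIdx.x;
    buf[i] = (i < n) ? A[i] : 0;                         // load, padding with zeros
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (i < stride) buf[i] += buf[i + stride];       // pairwise adds
        __syncthreads();                                 // lg(blockDim.x) steps in total
    }
    if (i == 0) *S = buf[0];                             // only one value returns to the host
}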
Example 2: Vector Sum
int A[1000];
for (i = 0; i < 1000; i++)      // reduction over all elements
    S = S + A[i];

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – Using OpenMP/pthreads (see the sketch below)
  – Each thread sums 250 of the 1000 elements into a partial sum
  – No data transfer needed; memory is shared among all threads
  – Good news: time ≈ N/4 + 4 (combining the 4 partial sums)
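On the host the same idea is one clause in OpenMP; a minimal sketch (my illustration, assuming the array A from the example above):

int vec_sum(const int *A, int n) {
    int S = 0;
    #pragma omp parallel for reduction(+:S)   /* each thread accumulates a private partial sum */
    for (int i = 0; i < n; i++)
        S += A[i];
    return S;                                 /* the ~4 partial sums are combined at the end */
}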
Example 3: Matrix Multiplication
int A[100][100], B[100][100], C[100][100];
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++) {
        Cij = 0;
        for (k = 0; k < 100; k++)
            Cij = Cij + A[i][k]*B[k][j];
        C[i][j] = Cij;
    }

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – O(n^3) time, zero data transferred
  – O(n^2.8) with Strassen's method
Example 3: Matrix Multiplication
int A[100][100], B[100][100], C[100][100];
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++) {
        Cij = 0;
        for (k = 0; k < 100; k++)        // each Cij can be computed on one core
            Cij = Cij + A[i][k]*B[k][j];
        C[i][j] = Cij;
    }

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the GPU (see the sketch below)
  – 2*n^2 data items sent from the host to the GPU, plus n^2 from the GPU to the host
  – (n^2/3000)*n time in parallel using 3000 cores
  – Significant reduction in running time
  – This program is a good candidate to run on the GPU
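A minimal CUDA sketch of the mapping described above, with one thread computing one Cij (a naive version without shared-memory tiling; storing the matrices row-major in 1-D device arrays is an assumption of this sketch):

#define N 100

__global__ void matmul(const int *A, const int *B, int *C) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    if (i < N && j < N) {
        int cij = 0;
        for (int k = 0; k < N; k++)                  // O(n) inner loop done by one tiny core
            cij += A[i*N + k] * B[k*N + j];
        C[i*N + j] = cij;
    }
}
// Example launch: dim3 blk(16,16), grd((N+15)/16, (N+15)/16); matmul<<<grd, blk>>>(A_d, B_d, C_d);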
Example 3: Matrix Multiplication
int A[100][100], B[100][100], C[100][100];
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++) {
        Cij = 0;
        for (k = 0; k < 100; k++)
            Cij = Cij + A[i][k]*B[k][j];
        C[i][j] = Cij;
    }

[Host: 4 big cores, 4 GB RAM; GPU card: 3000 tiny cores, 4 GB RAM]

• Executing on the host
  – Using OpenMP/pthreads (see the sketch below)
  – Each thread computes N^2/4 of the Cij entries
  – No data transfer needed; memory is shared among all threads
  – OK, but not as large a reduction in time as on the GPU
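The host version with OpenMP simply splits the N^2 output entries across the 4 cores; a minimal sketch (my illustration, using collapse(2) as one possible way to do the split):

int A[100][100], B[100][100], C[100][100];

void matmul_host(void) {
    #pragma omp parallel for collapse(2)      /* the 100*100 Cij computations are split
                                                 across the 4 cores (~N^2/4 each)        */
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++) {
            int cij = 0;
            for (int k = 0; k < 100; k++)
                cij += A[i][k] * B[k][j];
            C[i][j] = cij;
        }
}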
Example 4: N-Body Simulation
typedef struct pos   { float x, y, z;    } pos;
typedef struct force { float fx, fy, fz; } Force;
pos P[N]; float Mass[N]; Force F[N];

while (1) {
    for (i = 0; i < N; i++) {        // calculate the force on each body from all others
        F[i] = 0;
        for (j = 0; j < N; j++) F[i] = F[i] + force(i, j);
    }
    for (i = 0; i < N; i++) {        // update acceleration and position
        A[i] = F[i]/Mass[i]; P[i] = f(A[i], P[i]);
    }
}

• Executing on the host
  – O(n^2) time for each iteration of the while loop, zero data transferred;
    for N > 100 this is very slow
Example 4: N-Body Simulation
while (1) {
    for (i = 0; i < N; i++) {        // for each i, this loop body can run in parallel
        F[i] = 0;
        for (j = 0; j < N; j++)
            F[i] = F[i] + G*M[i]*M[j]/(r[i][j]*r[i][j]);
    }
    for (i = 0; i < N; i++) {        // for each i, this can also run in parallel
        A[i] = F[i]/M[i]; P[i] = f(A[i], P[i]);   // update acceleration and position
    }
}

• Executing on the GPU (embarrassingly parallel; see the sketch below)
  – 1*n data items sent from the host to the GPU (old positions), 1*n returned from
    the GPU to the host; the masses are transferred once
  – Execution takes O(N) time in parallel using N cores (if N < 3000)
  – Total time: 1*n + O(N) + 1*n = O(n) per iteration
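A hedged CUDA sketch of the embarrassingly parallel force phase: thread i accumulates the force on body i from all other bodies. Reducing the problem to one dimension, adding a small softening term, and the names below are my assumptions, not the lecture's code.

__global__ void forces(const float *M, const float *Px, float *Fx, int n, float G) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float f = 0.0f;
    for (int j = 0; j < n; j++) {                    // O(n) work per thread, O(n^2) in total
        if (j == i) continue;
        float d  = Px[j] - Px[i];                    // 1-D separation
        float r2 = d*d + 1e-9f;                      // softening avoids divide-by-zero
        f += G * M[i] * M[j] / r2 * (d > 0 ? 1.0f : -1.0f);   // attraction toward body j
    }
    Fx[i] = f;
}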
