Lec 14
A Sahu
Dept of CSE, IIT Guwahati
Outline
• Amdahl’s Law
– Relaxing Assumptions
• GPU
• Application classification for acceleration
Performance of Parallel Program
(Amdahl’s Law)
Example: OpenMP Parallel Program
printf("begin\n");        // Serial
N = 1000;                 // Serial
#pragma omp parallel for
for (i=0; i<N; i++)
  A[i] = B[i] + C[i];     // Parallel
M = 500;                  // Serial
#pragma omp parallel for
for (j=0; j<M; j++)
  p[j] = q[j] - r[j];     // Parallel
printf("done\n");         // Serial
• Notation: T1 = time on a uniprocessor, Tp = time on p processors
– Speedup: Sp = T1/Tp ≤ p
– Efficiency: Ep = T1/(p·Tp) = Sp/p (worked example below)
• Usually Sp < p or Ep < 1 due to overhead
• Sometimes superlinear speedup is reported (Sp > p or Ep > 1)
– Failure to use the best sequential algorithm
– Advantage due to larger memory
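– Worked example (illustrative numbers, not from the slides): if T1 = 100 s and T4 = 30 s on p = 4 processors, then S4 = 100/30 ≈ 3.33 and E4 = 3.33/4 ≈ 0.83.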
[Figure: speedup Sp versus serial fraction s]
Assumptions behind Amdahl’s Law
• All the processors are homogeneous
• All the communication costs are zero
• All the memory accesses take unit time (PRAM)
• All the parallel sections are purely parallel: divisible load
Sp = 1 / ( s + (1 − s)/p )
lim (p → ∞) Sp = 1/s
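For example (illustrative numbers, not from the slides): with serial fraction s = 0.1 and p = 8 processors, Sp = 1/(0.1 + 0.9/8) ≈ 4.7, while the limiting speedup for p → ∞ is 1/s = 10.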
All the memory accesses take unit time (PRAM)
• Memory hierarchy: cache memory
• Suppose application A runs on a 2 GHz Intel Pentium 4 uniprocessor and takes 10 minutes
• The same application A run on a 2 GHz Intel i5 processor (quad core) may run much more than 4X faster: superlinear speedup
– Earlier the cache size was 1 MB in the P4
– Now the cache size is 4 MB, so the whole application A may fit into the cache: no capacity misses…
All the communication costs are zero
• Pthread creation and fork/join take a significant amount of time
[Figure: ideal vs actual execution timelines]
• Ideal scenario: T = Ts1 + Ts2 + Tp
• Actual scenario with shared resources: T = Ts1 + Ts2 + Tp + 4·Tcs
All the processors are homogeneous
• Asymmetric Processing Environment
• One big core and many small or tiny cores
• Intel Xeon : 8 big cores
[Figure: speedup versus grain size: overhead-limited at small grain sizes, load-imbalance- and parallelism-limited at large grain sizes]
[Figure: Host with 8 big cores and 16 GB memory attached to a Card with 10000 tiny cores and 16 GB memory]
GPU
• GPU vs CPU
• GPU Cards
– GTX 4090: 16496 cores, 48 GB DDR6, 384-bit interface, Rs 1.3L
GPU Philosophy
• Small independent functions/code executed a huge number of times
• Number of cores in the thousands; tiny cores
• Cores are organized into clusters
– Kepler SMX: 14 SM, 192 SP CUDA cores/SM, 64 DP units, 32 SFU, 32 LD/ST units
– TU102 RTX 2080 Ti: 72 SM, 4608 CUDA cores, 576 Tensor cores, 72 ray-tracing cores
• Explicit, programmer-controlled memory hierarchy
• Also an implicit memory hierarchy: cache
GPU
• Graphics cards plug into the motherboard PCI slot
– PCI: Peripheral Component Interconnect
• Purpose: to accelerate graphics computation
• Earlier days: it was fixed function
• Nowadays it is programmable and configurable
– Why not use it for general-purpose computation?
– For what kinds of applications?
[Figure: Host with 8 big cores and 16 GB memory connected over the FSB to a Card with 10000 tiny cores and 16 GB memory]
• GPU uses wide SIMD: 8/16/24/... processing elements (PEs)
• CPU uses short SIMD: usually a vector width of 4/8
• SFU: Special Function Unit
Source: Nvidia CUDA Programming Guide
[Figure: Streaming Processor Array of TPCs; each TPC contains SMs and a texture unit; each SM contains SPs and SFUs. A grid of thread blocks is mapped onto the array, and individual threads are mapped onto the SPs.]
TU102 (RTX 2080 Ti) Architecture
• 4608 CUDA cores
• 72 SMX, 64 SP/SMX
• 576 Tensor cores
• 72 ray-tracing cores
• 11 GB DDR6 RAM
• Each SMX has
– 56 KB texture cache
– 65 KB constant memory (scratchpad)
– 49 KB L1/shared memory
– Uniform cache
– Separate shared instruction cache per SM
A Newer GPU Card
• 18,176 CUDA cores
• 142 SMX, 128 SP/SMX
• 24 GB DDR6 RAM
– 96 MB L2 cache
• Each SMX has
– 128 KB L1 cache
– Separate shared instruction cache per SM
Source: CUDA Prog. Guide 4.0
• Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Prog. model is “Single Instruction Multiple Thread”
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management,
not applications or OS
Mapping SW Organization to GPU
• Software: CUDA program and CUDA threads
– Threads are organized into grids, blocks, and threads
– A grid contains blocks; a block contains many threads
• Hardware: GPU
– Cores are organized into clusters (SMs)
– GTX 690, per device: 8 SM, 192 cores/SM
– GTX 980 Ti: 22 SM, 128 cores/SM
• Mapping
– Blocks get mapped to SMs
– Threads get mapped to cores (SPs)
Example thread scheduling
• Suppose we want to create 2000 parallel threads for an application
• We organize the threads into 10 blocks, each containing 200 threads
• We run on a GTX 690 GPU with 8 SMs and 192 SP/SM
• Scheduler:
– Maps 10 blocks to 8 SMs: takes ceil(10/8) = 2 rounds
– Maps 200 threads to the 192 SPs of an SM: also takes ceil(200/192) = 2 rounds
• Rules of thumb (see the sketch below):
– The number of blocks should be a multiple of the number of SMs
– The number of threads per block should be a multiple of the number of SPs per SM
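A minimal CUDA sketch of these rules of thumb (not from the slides): query the SM count of the device and round the block count up to a multiple of it. The kernel name work, the element count n = 2000, and the 192 threads/block figure (matching the assumed 192 SP/SM of the GTX 690 example) are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void work(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
  if (i < n) a[i] = 2.0f * a[i];                   // dummy per-element work
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);               // query device 0
  int sms = prop.multiProcessorCount;              // number of SMs
  int threadsPerBlock = 192;                       // multiple of SP/SM (assumed 192)
  int n = 2000;
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
  blocks = ((blocks + sms - 1) / sms) * sms;       // round up to a multiple of the SM count
  float *a;
  cudaMalloc((void **)&a, n * sizeof(float));
  work<<<blocks, threadsPerBlock>>>(a, n);
  cudaDeviceSynchronize();
  cudaFree(a);
  return 0;
}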
//Invoke DAXPY (double a*X + Y)
DAXPY(n, 2.0, x, y);
//DAXPY function in C
void DAXPY(int n, double a, double *x, double *y){
  for(int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}
//Invoke DAXPY with 256 threads
//per thread block (host code)
int nb = (n+255)/256;
DAXPY<<<nb,256>>>(n, 2.0, x, y);
//DAXPY kernel in CUDA (device code)
__global__ void DAXPY(int n, double a,
    double *x, double *y){
  int i = blockIdx.x*blockDim.x
          + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
__global__ void DAXPY(int n, double a, double *x, double *y){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
#define N 1024
int main(){
  double x[N], y[N]; int size = sizeof(double)*N; Initialize(x,y);
  double *x_d, *y_d;                        //Declaration of device pointers
  cudaMalloc( (void **)&x_d, size);         //Memory creation on device
  cudaMalloc( (void **)&y_d, size);
  cudaMemcpy( x_d, x, size, cudaMemcpyHostToDevice );
  cudaMemcpy( y_d, y, size, cudaMemcpyHostToDevice );
  int nb = ceil(N/256.0);                   //Invoke DAXPY with 256 thrds/TB
  DAXPY<<<nb,256>>>(N, 2.0, x_d, y_d);
  cudaMemcpy( y, y_d, size, cudaMemcpyDeviceToHost );
}
kernelF<<<dim3(4,1), dim3(8,1)>>>(A);
__global__ void kernelF(int A[][8]){
  int i = blockIdx.x, j = threadIdx.x;
  A[i][j]++;
}
Both the grid and the thread block can have a two-dimensional index.
kernelF<<<dim3(2,2), dim3(4,2)>>>(A);
__global__ void kernelF(int A[][8]){
  int i = blockIdx.y * gridDim.x
        + blockIdx.x;      //flattened block index
  int j = threadIdx.y * blockDim.x
        + threadIdx.x;     //flattened thread index within the block
  A[i][j]++;
}
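The same launch written with explicit dim3 variables (a small illustrative sketch, not from the slides):

dim3 grid(2, 2);    // gridDim.x = 2, gridDim.y = 2  -> 4 thread blocks
dim3 block(4, 2);   // blockDim.x = 4, blockDim.y = 2 -> 8 threads per block
kernelF<<<grid, block>>>(A);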
Example:
Scheduling 4 thread blocks on 3 SMs.
Executed on a machine with SIMD width of 4.
Note: the number of processing elements (PEs) is transparent to the programmer.
Example 1: Vector Addition
int A[1000], B[1000], C[1000];
for(i=0; i<1000; i++)
  A[i] = B[i] + C[i];   //Parallel, independent work
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n) time, zero data transferred
• Executing on GPU
– 2*1000 data items need to be sent from Host to GPU
– Execution in O(1) time in parallel using 1000 cores
– 1000 data items need to be returned from GPU to Host
– Total: 3000 units of communication overhead
• Funny: this program is not a good candidate to run on the GPU
Example 2: Vector Sum
int A[1000];
for(i=0; i<1000; i++)
  S = S + A[i];   //Reduction: every iteration updates S
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n) time, zero data transferred
• Executing on GPU
– 1000 data items need to be sent from Host to GPU
– Execution in O(lg 1000) time in parallel using 1000 cores via a tree reduction (see the sketch below); GPU cores have very weak support for shared-variable locking
– 1 data item needs to be returned from GPU to Host
– Total: 1000 units of communication overhead
• Funny: this program is not a good candidate to run on the GPU
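A minimal CUDA sketch (not from the slides) of the lg(N)-step idea behind the O(lg 1000) claim: a tree reduction in shared memory with one atomic add per block. The kernel name blockSum and the fixed 1024-entry buffer are illustrative assumptions; blockDim.x is assumed to be a power of two no larger than 1024.

__global__ void blockSum(const int *A, int *S, int n) {
  __shared__ int buf[1024];                        // one slot per thread
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buf[threadIdx.x] = (i < n) ? A[i] : 0;           // load one element per thread
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)                      // halve the active threads each step
      buf[threadIdx.x] += buf[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(S, buf[0]);      // one atomic add per block
}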
Example 3: Matrix Multiplication
int A[100][100], B[100][100], C[100][100];
for(i=0; i<100; i++)
  for(j=0; j<100; j++){
    C[i][j] = 0;
    for(k=0; k<100; k++)
      C[i][j] = C[i][j] + A[i][k]*B[k][j];   //each C[i][j] can be done on one core
  }
[Figure: Host with 4 big cores and 4 GB memory connected to a Card with 3000 tiny cores and 4 GB memory]
• Executing on Host
– O(n^3) time, zero data transferred
– O(n^2.8) with Strassen's method
• Executing on GPU
– 2*n^2 data items sent from Host to GPU + n^2 from GPU to Host
– (n^2/3000)*n time in parallel using 3000 cores (see the kernel sketch below)
– Significant reduction in running time
– This program is a good candidate to run on the GPU
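A minimal CUDA sketch (not from the slides) of this mapping: one thread computes one C[i][j]. The kernel name matmul, the flattened row-major arrays, and the launch configuration are illustrative assumptions.

#define N 100
__global__ void matmul(const int *A, const int *B, int *C) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
  int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
  if (i < N && j < N) {
    int cij = 0;
    for (int k = 0; k < N; k++)
      cij += A[i * N + k] * B[k * N + j];          // dot product of row i and column j
    C[i * N + j] = cij;
  }
}
// Possible launch: matmul<<<dim3(7,7), dim3(16,16)>>>(A_d, B_d, C_d);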
Example 4: N Body Simulation
while(1){
  for(i=0; i<N; i++){     //for each i, the force calculation can be done in parallel
    F[i] = 0;
    for(j=0; j<N; j++)
      F[i] = F[i] + G*M[i]*M[j]/(r[i][j]*r[i][j]);
  }
  for(i=0; i<N; i++){     //for each i, the update can be done in parallel
    A[i] = F[i]/M[i]; P[i] = f(A[i], P[i]);   //update acceleration and position
  }
}
• Executing on Host
– O(n^2) time for each iteration of the while loop, zero data transferred; for N > 100, very slow
• Executing on GPU (embarrassingly parallel)
– 1*n data items sent from Host to GPU (old positions), 1*n from GPU to Host; one-time transfer of the masses
– Execution in O(N) time in parallel using N cores (if N < 3000), as in the force-kernel sketch below
– Total time per iteration: 1*n + O(N) + 1*n = O(n)
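A minimal CUDA sketch (not from the slides) of the parallel force phase: one thread accumulates the total force on one body. The kernel name forces, the 1-D positions x, and the parameter list are illustrative assumptions.

__global__ void forces(const double *M, const double *x, double *F, int n, double G) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per body
  if (i >= n) return;
  double f = 0.0;
  for (int j = 0; j < n; j++) {
    if (j == i) continue;                          // skip self-interaction
    double r = x[j] - x[i];                        // 1-D separation for simplicity
    f += G * M[i] * M[j] / (r * r);                // inverse-square magnitude
  }
  F[i] = f;                                        // O(n) work per thread, all bodies in parallel
}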