GPU Computing With CUDA Lecture 3 - Efficient Shared Memory Use
Christopher Cooper
Boston University
August, 2011
UTFSM, Valparaíso, Chile
Outline of lecture
Recap of Lecture 2
Shared memory in detail
Tiling
Bank conflicts
Recap
Thread hierarchy
- Threads are grouped into thread blocks
- Threads of the same block are executed on the same SM at the same time
Threads of a block can communicate through shared memory
An SM can have up to 8 blocks resident at the same time
Recap
Memory hierarchy
Smart use of the memory hierarchy!
Recap
Programming model: Finite Difference case
- One node per thread
- Node indexing automatically groups into thread blocks!
Shared Memory
Small (48kB per SM)
Fast (~4 cycles): On-chip
Private to each block
- Allows thread communication
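As a syntax reminder (a minimal sketch, not from the slides; kernel names are illustrative): a shared array can be declared with a compile-time size, or sized at launch time with extern __shared__.

__global__ void kernelStatic(float *out)
{
    __shared__ float buf[256];               // size fixed at compile time (256 threads per block assumed)
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[255 - threadIdx.x];   // any thread of the block may read any element
}

__global__ void kernelDynamic(float *out)
{
    extern __shared__ float buf[];           // size given by the 3rd launch parameter, e.g.
    buf[threadIdx.x] = (float)threadIdx.x;   //   kernelDynamic<<<grid, 256, 256*sizeof(float)>>>(out);
    __syncthreads();
    out[threadIdx.x] = buf[255 - threadIdx.x];
}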
Example: 1D convection equation, discretized with a first-order upwind scheme:

    ∂u/∂t = -c ∂u/∂x

    u_i^(n+1) = u_i^n - (c·Δt/Δx)·(u_i^n - u_(i-1)^n)
__global__ void update(float *u, float *u_prev, int N, float dx, float dt, float c, int BLOCKSIZE)
{
    // Each thread will load one element
    int i = threadIdx.x + BLOCKSIZE * blockIdx.x;

    if (i >= N) { return; }

    u_prev[i] = u[i];

    if (i > 0)
    {
        u[i] = u_prev[i] - c*dt/dx * (u_prev[i] - u_prev[i-1]);
    }
}
Note the redundant global-memory traffic: thread i loads u_prev[i] and u_prev[i-1], while thread i+1 loads u_prev[i+1] and u_prev[i], so every element ends up being read twice from global memory.
Same update, now staging the data in shared memory (BLOCKSIZE is a compile-time constant here, since it sizes the shared array):

__global__ void update(float *u, float *u_prev, int N, float dx, float dt, float c)
{
    // Each thread will load one element
    int i = threadIdx.x;
    int I = threadIdx.x + BLOCKSIZE * blockIdx.x;

    __shared__ float u_shared[BLOCKSIZE];   // Allocate shared array

    if (I >= N) { return; }

    u_shared[i] = u[I];                     // Load to shared memory
    __syncthreads();

    if (I > 0)
    {
        u[I] = u_shared[i] - c*dt/dx * (u_shared[i] - u_shared[i-1]);
    }
}
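For context, a minimal host-side sketch of how such a kernel might be driven (the function run, the host array u_h, nsteps and the choice BLOCKSIZE = 256 are illustrative, not from the slides):

#include <cuda_runtime.h>

#define BLOCKSIZE 256    // compile-time constant: also sizes the __shared__ array inside the kernel

// Illustrative driver: u_h holds the initial condition on the host, nsteps time steps are taken
void run(float *u_h, int N, float dx, float dt, float c, int nsteps)
{
    size_t bytes = N * sizeof(float);
    float *u_d, *u_prev_d;
    cudaMalloc((void**)&u_d, bytes);
    cudaMalloc((void**)&u_prev_d, bytes);
    cudaMemcpy(u_d, u_h, bytes, cudaMemcpyHostToDevice);

    int blocks = (N + BLOCKSIZE - 1) / BLOCKSIZE;     // enough blocks to cover all N nodes
    for (int step = 0; step < nsteps; step++)
        update<<<blocks, BLOCKSIZE>>>(u_d, u_prev_d, N, dx, dt, c);

    cudaMemcpy(u_h, u_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(u_d);
    cudaFree(u_prev_d);
}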
This version reads u_shared[i-1] out of bounds for the first thread of each block, so threads on the edges of a block fall back to global memory (u_prev):

if (I >= N) { return; }

u_prev[I]   = u[I];
u_shared[i] = u[I];
__syncthreads();

if (i > 0 && i < BLOCKSIZE-1)    // interior of the block: neighbor is in shared memory
{
    u[I] = u_shared[i] - c*dt/dx * (u_shared[i] - u_shared[i-1]);
}
else                             // first/last thread of the block: read the neighbor from global memory
{
    u[I] = u_prev[I] - c*dt/dx * (u_prev[I] - u_prev[I-1]);
}
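Putting the fragments together, a possible form of the full kernel (assembled here for readability, not verbatim from the slides; the I > 0 guard for the physical left boundary is carried over from the global-memory version):

__global__ void update(float *u, float *u_prev, int N, float dx, float dt, float c)
{
    int i = threadIdx.x;                            // index within the block
    int I = threadIdx.x + BLOCKSIZE * blockIdx.x;   // global node index

    __shared__ float u_shared[BLOCKSIZE];

    if (I >= N) { return; }

    u_prev[I]   = u[I];    // global copy, used by threads on the edges of neighboring blocks
    u_shared[i] = u[I];    // block-local copy in shared memory
    __syncthreads();

    if (I > 0)                                      // I = 0 is the physical left boundary: no update
    {
        if (i > 0 && i < BLOCKSIZE-1)               // interior of the block: neighbor is in shared memory
            u[I] = u_shared[i] - c*dt/dx * (u_shared[i] - u_shared[i-1]);
        else                                        // edge of the block: read the neighbor from global memory
            u[I] = u_prev[I] - c*dt/dx * (u_prev[I] - u_prev[I-1]);
    }
}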
2D heat equation:

    ∂u/∂t = α ∇²u

Explicit scheme:

    u_(i,j)^(n+1) = u_(i,j)^n + (α·Δt/h²)·(u_(i,j+1)^n + u_(i,j-1)^n + u_(i+1,j)^n + u_(i-1,j)^n - 4·u_(i,j)^n)
[Figure: square domain with boundary conditions T = 0 on two sides and T = 200 on the other two.]
[Figure: row-wise node numbering with a single index I increasing from 0; the first row ends at Nx-1 and the last row runs from Nx*(Ny-1) to Nx*Ny-1.]
__global__ void update(float *u, float *u_prev, int N, float h, float dt, float alpha, int BSZ)
{
    // Setting up indices
    int i = threadIdx.x;
    int j = threadIdx.y;
    int I = blockIdx.y*BSZ*N + blockIdx.x*BSZ + j*N + i;

    if (I >= N*N) { return; }

    u_prev[I] = u[I];

    // If not on a boundary, do the update
    if ((I > N) && (I < N*N-1-N) && (I%N != 0) && (I%N != N-1))
    {
        u[I] = u_prev[I] + alpha*dt/(h*h) * (u_prev[I+1] + u_prev[I-1] +
                                             u_prev[I+N] + u_prev[I-N] - 4*u_prev[I]);
    }
}
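A minimal sketch of the matching launch configuration (device pointers u_d, u_prev_d are illustrative; the indexing above assumes N is a multiple of BSZ):

dim3 block(BSZ, BSZ);          // BSZ x BSZ threads per block
dim3 grid(N/BSZ, N/BSZ);       // one block per BSZ x BSZ tile of the N x N grid
update<<<grid, block>>>(u_d, u_prev_d, N, h, dt, alpha, BSZ);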
Global memory
Advantage
- Easy to implement
Disadvantages
- Branching statement
- Still some redundant loads
Shared memory
Shared-memory version (fragment): nodes interior to the block read their neighbors from the shared tile; nodes on the block boundary fall back to global memory.

// If not on a block boundary, do the update from shared memory
if (block_check)
{
    u[I] = u_prev_sh[i][j] + alpha*dt/h/h * (u_prev_sh[i+1][j] + u_prev_sh[i-1][j] +
                                             u_prev_sh[i][j+1] + u_prev_sh[i][j-1] - 4*u_prev_sh[i][j]);
}
// If on a block boundary (but not a domain boundary), use global memory
else if (bound_check)
{
    u[I] = u_prev[I] + alpha*dt/(h*h) * (u_prev[I+1] + u_prev[I-1] + u_prev[I+N] +
                                         u_prev[I-N] - 4*u_prev[I]);
}
}
[Figure: a BSZ x BSZ tile; the block operates only on the (BSZ-2) x (BSZ-2) internal nodes.]
A simpler variant updates only the nodes that are interior to both the domain and the block:

if (I >= N || J >= N) { return; }

__shared__ float u_prev_sh[BSZ][BSZ];

u_prev_sh[i][j] = u[Index];
__syncthreads();

bool bound_check = ((I != 0) && (I < N-1) && (J != 0) && (J < N-1));
bool block_check = ((i != 0) && (i < BSZ-1) && (j != 0) && (j < BSZ-1));

if (bound_check && block_check)
{
    u[Index] = u_prev_sh[i][j] + alpha*dt/h/h * (u_prev_sh[i+1][j] + u_prev_sh[i-1][j] +
                                                 u_prev_sh[i][j+1] + u_prev_sh[i][j-1] - 4*u_prev_sh[i][j]);
}
}
[Figure: the shared tile is extended to (BSZ+2) x (BSZ+2) to hold a one-node halo around the BSZ x BSZ block; each of the BSZ x BSZ threads then operates on element [i+1][j+1] of the tile.]
Loading the halo: a block of 8x8 threads must fill a 10x10 shared tile (the tile plus its halo), so the 64 threads cannot load the 100 elements one to one. The thread index is flattened and remapped onto the (BSZ+2) x (BSZ+2) tile. A first pass loads the first BSZ*BSZ (here 64) elements; I_0 denotes the global index at which this block's halo-extended tile starts:

__shared__ float u_prev_sh[BSZ+2][BSZ+2];

int ii = j*BSZ + i,       // Flatten thread indexing
    I  = ii%(BSZ+2),      // x-direction index, including halo
    J  = ii/(BSZ+2);      // y-direction index, including halo

int I_n = I_0 + J*N + I;  // General (global) index
u_prev_sh[I][J] = u[I_n];

A second pass, with the flattened index offset past the elements already loaded (ii2, I2, J2 defined analogously to ii, I, J), loads the remaining elements; the guard keeps threads that map outside the tile or the array from loading:

int I_n2 = I_0 + J2*N + I2;  // General (global) index
if ((I2 < (BSZ+2)) && (J2 < (BSZ+2)) && (I_n2 < N*N))
    u_prev_sh[I2][J2] = u[I_n2];

After __syncthreads(), each thread updates its own node, which sits at [i+1][j+1] of the tile (bx, by denote blockIdx.x, blockIdx.y):

int Index = by*BSZ*N + bx*BSZ + (j+1)*N + i + 1;

u[Index] = u_prev_sh[i+1][j+1] + alpha*dt/h/h * (u_prev_sh[i+2][j+1] + u_prev_sh[i][j+1] +
           u_prev_sh[i+1][j+2] + u_prev_sh[i+1][j] - 4*u_prev_sh[i+1][j+1]);
SM Implementation
The technique described above is called tiling
- Tiling means loading data into shared memory in tiles
- Useful when shared memory is used as a cache
- Also used when the data is too large to fit in shared memory at once and has to be loaded in smaller chunks (see the sketch below)
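A bare-bones sketch of that pattern (illustrative only: the array name is made up and the per-tile computation is elided):

#define TILE 256   // tile size = number of threads per block (compile-time constant)

// Skeleton of the tiling pattern: stage one tile of `in` at a time in shared memory,
// synchronize, work on it, and synchronize again before the next tile overwrites it.
__global__ void tiled(const float *in, int n)
{
    __shared__ float tile[TILE];

    int numTiles = (n + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; t++)
    {
        int idx = t * TILE + threadIdx.x;
        if (idx < n)
            tile[threadIdx.x] = in[idx];   // cooperative load of tile t
        __syncthreads();                   // whole tile is in shared memory before anyone reads it

        // ... compute here: any thread of the block may now read any element of tile[] ...

        __syncthreads();                   // everyone is done with tile[] before it is overwritten
    }
}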
Bank conflicts
On 2.x hardware there is no bank conflict if the memory request is for the same 32-bit word (it is broadcast to all requesting threads). This does not hold on 1.x hardware.
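A classic case is column-wise access to a 2D shared array; padding the leading dimension by one element removes the conflict. A small sketch (hypothetical kernel, assuming a Fermi-class card with 32 banks and a single 32x32 block):

#define TILE 32

// Transpose one TILE x TILE matrix with a single block of TILE x TILE threads
__global__ void transposeTile(const float *in, float *out)
{
    // With a [TILE][TILE] array, the elements of one column are all 32 words apart,
    // i.e. in the same bank, so a column-wise read is a 32-way bank conflict.
    // The +1 padding shifts each row by one bank and makes the column read conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * TILE + x];     // row-wise (coalesced) load from global memory
    __syncthreads();
    out[y * TILE + x] = tile[x][y];    // column-wise read from shared memory: no conflict thanks to padding
}

// Launch (illustrative): transposeTile<<<1, dim3(TILE, TILE)>>>(in_d, out_d);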
Shared Memory
__syncthreads()
- Barrier that waits for all threads of the block before continuing
- Need to make sure all data is loaded to shared before access
- Avoids race conditions
- Serializes the code: don't overuse it!

u_shared[i] = u[I];
__syncthreads();

if (i > 0 && i < BLOCKSIZE-1)
    u[I] = u_shared[i] - c*dt/dx * (u_shared[i] - u_shared[i-1]);
Race condition
When two or more threads access and operate on a memory location without synchronization
Example: we have the value 3 stored in global memory and two threads want to add one to that value.
- Possibility 1:
  Thread 1 reads the value 3, adds 1, and writes 4 back to memory
  Thread 2 reads the value 4, adds 1, and writes 5 back to memory
- Possibility 2:
  Thread 1 reads the value 3
  Thread 2 reads the value 3
  Both threads operate on 3 and write back the value 4 to memory
Solutions:
- __syncthreads() or atomic operations
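As a concrete sketch (hypothetical kernels, not from the slides): with a plain read-modify-write the final value is unpredictable, while atomicAdd makes the increment indivisible.

// Every thread adds 1 to the same counter
__global__ void countRacy(int *counter)
{
    *counter = *counter + 1;      // read-modify-write: several threads may read the same old value
}

__global__ void countAtomic(int *counter)
{
    atomicAdd(counter, 1);        // read, add and write happen as one indivisible operation
}

// Launched with, e.g., 256 threads and *counter initialized to 0:
//   countRacy<<<1, 256>>>(counter_d);    // result is unpredictable (often much less than 256)
//   countAtomic<<<1, 256>>>(counter_d);  // result is exactly 256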
Atomic operations
Atomic operations deal with race conditions
- They guarantee that, while the operation is being executed, that location in memory is not accessed by other threads
- Still, we can't rely on any ordering of thread execution!
- Types
atomicAdd
atomicSub
atomicExch
atomicMin
atomicMax
etc...
Atomic operations
__global__ void update(int *values, int *who)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int I = who[i];
    atomicAdd(&values[I], 1);   // safely increment values[I], even if several threads hit the same I
}
Atomic operations
Useful if you have a sparse access pattern
Atomic operations are slower than their non-atomic counterparts
They can serialize your execution if many threads want to access the same memory location
- Think about parallelizing your data, not only execution
- Use a hierarchy of atomic operations to avoid this (see the sketch below)
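A sketch of that hierarchy (hypothetical histogram kernel, not from the slides; shared-memory atomics need compute capability 1.2 or higher): each block first accumulates into shared-memory bins, then pushes one atomic update per bin to global memory.

// Count how many entries of bin_of[] fall into each of nbins bins
__global__ void histogram(const int *bin_of, int *global_bins, int n, int nbins)
{
    extern __shared__ int local_bins[];                  // nbins ints, sized at launch time

    for (int b = threadIdx.x; b < nbins; b += blockDim.x)
        local_bins[b] = 0;                               // clear the block-local bins
    __syncthreads();

    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        atomicAdd(&local_bins[bin_of[i]], 1);            // contention stays inside the block (fast shared memory)
    __syncthreads();

    for (int b = threadIdx.x; b < nbins; b += blockDim.x)
        if (local_bins[b] > 0)
            atomicAdd(&global_bins[b], local_bins[b]);   // one global atomic per non-empty bin per block
}

It would be launched with the dynamic shared-memory size as the third launch parameter, e.g. histogram<<<blocks, threads, nbins*sizeof(int)>>>(bin_of_d, global_bins_d, n, nbins).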