CSE Lec4 Cuda

The document provides an introduction to CUDA and GPU architecture, detailing the structure and functioning of GPUs, including their cores, memory hierarchy, and the CUDA programming model. It explains the concept of CUDA kernels, thread organization, and memory management, emphasizing the importance of parallelism for efficient GPU computing. Additionally, it outlines the execution flow of CUDA programs and provides examples of vector addition on the device.

Introduction to CUDA

CSE 461
Spring ‘24

Dr. Arpan Jain


SVNIT Surat
E-mail: [email protected]
http://www.cse.ohio-state.edu/~jain.575
Introduction to the GPU
• Comprises many cores (a number that has roughly doubled each year)
– The Tesla V100 has 5,120 CUDA Cores and 640 Tensor Cores
• Each core
– Runs at a clock speed significantly slower than a CPU’s clock
– Heavily multithreaded, in-order, single-instruction issue processor
• SIMD − single instruction, multiple-data
– Shares its control and instruction cache with multiple other cores
• Focus on execution throughput of massively-parallel programs.
• Much better Floating-Point Operations per Second (FLOPS) than CPUs
• Do not have virtual memory, interrupts, or means of addressing devices such as
the keyboard and the mouse

SVNIT, Surat CSE 461 2


Introduction to the GPU (Cont.)
• Terribly inefficient when we do not have SPMD
– Such programs are best handled by CPUs
– e.g., Serial operations
• Designed for data-intensive applications
– Supported by significantly higher memory bandwidths on GPUs
• The Tesla V100 has 900 GB/s of HBM2 bandwidth

• Originally designed for 3D rendering


– Requires holding large amount of texture and polygon data
– Caches cannot hold such large amount of data
– The only design that would have increased rendering performance was to increase the bus width and the memory
clock.
• Intel i7 has a memory bus of width 192b and a memory clock up to 800MHz.
• GTX 285 had a bus width of 512b, and a memory clock of 1242 MHz.

• Recently, CPUs are also adopting the concept of high bandwidth memory and 3D stacked memory to
improve memory access performance.

SVNIT, Surat CSE 461 3


GPU Processor

• Basic unit is called a Streaming Multiprocessor (SM)
• Each SM has some set of Streaming Processors (SP)
– 16 SMs with 8 SPs each = 128 SPs
• Each SP has
– A MAD unit (Multiply-and-Add unit)
– An additional MU (Multiply Unit)
• Each SP is massively threaded and can run thousands of threads
– The G80 card supports 96 threads per SP
– Each SM has 8 SPs, so each SM supports a maximum of 768 threads.
– Total threads that can run: 16 * 8 * 96 = 12,288 -> ‘massively parallel’
[Figure: SMs and SPs in the GPU]
SVNIT, Surat CSE 461 4
The CUDA Kernel
• A CUDA kernel is a function that gets executed on the GPU
• The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only one time like regular C/C++ functions
• A group of threads is called a CUDA block; CUDA blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads
• Each CUDA block is executed by one SM
– Cannot be migrated to other SMs in the GPU
• Except during preemption, debugging, or CUDA dynamic parallelism
• One SM can run several concurrent CUDA blocks depending on the resources needed by the CUDA blocks
• Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time
[Figure: the kernel is a function executed on the GPU; CUDA kernels are subdivided into blocks and executed on the SMs of the GPU]
SVNIT, Surat CSE 461 5
GPU Thread Hierarchy
• Half-Warp/Warp
– Group of consecutive threads and generally executed together in parallel
– A warp is what executes on each SM at any given timestep
– Are aligned
• E.g., Threads 0->15 will be in the same half-warp/warp, 16->31, etc.
• Block
– A block is made up of warps
– Shared memory is shared among all threads in a block
• Threads within the same block can synchronize with each other, and
quickly communicate with each other
– Synchronization occurs at the block level
– So, the block is the ‘scope’ within which sets of threads can communicate
• Grid
– A collection of blocks
– Blocks cannot synchronize with each other
• Threads within one block cannot synchronize with threads in another

Kernel execution on GPU


SVNIT, Surat CSE 461 6
GPU Memory
• GPUs have less memory, but substantially higher
memory bandwidths
– The Tesla V100 has 16 GB HBM2 @ 900 GB/s
• Fully coherent L2 Cache
– Smaller than a CPU's L2 cache, but with much higher bandwidth
– The Tesla V100 has 6 MB of L2 and 10 MB of L1 cache
• Memory Coalescing
– Due to the high degree of parallelism, many threads may want to write to memory at the same time
– The GPU coalesces memory operations to minimize the number of transactions
[Figure: memory structure in an NVIDIA Fermi GPU]

SVNIT, Surat CSE 461 7


GPU Memory (Cont.)
• Registers
– These are private to each thread => registers assigned to a thread are not visible to other threads
– The compiler makes decisions about register utilization.
• L1 Cache/Shared Memory
– Each SM has a small amount of Shared memory
– Threads within the same block can quickly and easily communicate with each other by writing and reading to the
shared memory
– Generally used as a very quick working space for threads within a block
• At least 100 times faster than global memory, very advantageous if used correctly

• L2 cache
– Shared across all SMs => every thread in every CUDA block can access it
• Global Memory
– Global memory can be thought of as the physical memory on your graphics card.
– All threads can read and write to Global memory.
– You can even read and write to Global memory from a thread on the CPU.

SVNIT, Surat CSE 461 8


Processor and Memory in NVIDIA A100 GPU

SVNIT, Surat CSE 461 9


What is CUDA? An Introduction
• CUDA stands for Compute Unified Device Architecture
• It is an extension of the C programming language created by NVIDIA.
• CUDA lets the programmer take advantage of the hundreds of ALUs inside a
graphics processor
– Much more powerful than the handful of ALUs available in any CPU.
– This does put a limit on the types of applications that are well suited to CUDA.
• CUDA is only well suited for highly parallel algorithms
– In order to run efficiently on a GPU, you need to have many hundreds of threads.
– Generally, the more threads you have, the better.
– If you have an algorithm that is mostly serial, then CUDA does not make sense

SVNIT, Surat CSE 461 10


What is CUDA? (Cont.)

• A CUDA program has code intended for both the GPU and the CPU
• The CPU is referred to as the host, and the GPU is referred to as the device
• Device code needs a special compiler to understand the CUDA-specific APIs
• NVCC (NVIDIA C Compiler) separates the host code from the device code (kernels)
• Kernels are executed on the GPU

SVNIT, Surat CSE 461 11


The Execution Flow

CPU Serial Code

GPU Parallel Kernel

CPU Serial Code

GPU Parallel Kernel

SVNIT, Surat CSE 461 12


CUDA Thread Organization
• The programmer has explicit control on the number of threads to launch
– This is a carefully decided-upon number
• These threads collectively form a three-dimensional grid
– Remember: threads are packed into blocks, and blocks are packed into grids
• Each thread is given a unique identifier
• Used to identify which data it is to act upon
• For all threads in a block, the block index is the same
– Can be accessed using the blockIdx variable inside a kernel.
– Each thread also has an associated index, and it can be accessed by using threadIdx variable
inside the kernel.
– Note that blockIdx and threadIdx are built-in CUDA variables that are only accessible from
inside the kernel.

SVNIT, Surat CSE 461 13


CUDA Thread Organization (Cont.)
• In a similar fashion, CUDA also has gridDim and blockDim variables that are also
built-in.
• They return the dimensions of the grid and block along a particular axis
respectively.
• As an example, blockDim.x can be used to find how many threads a particular
block has along the x axis.

SVNIT, Surat CSE 461 14


CUDA Thread Organization (Cont.)
• Consider an image that needs to be processed which is 76 pixels along the x
axis, and 62 pixels along the y axis – total of 4712 pixels
• Assuming at least one thread per pixel, we need a minimum of 4712 threads
• For performance reasons (we will come to this later), let us take number of
threads in each direction to be a multiple of 4.
• One possible layout is 80 threads on x-axis and 64 threads on the y-axis for a
total of 5120 threads. We’ll ensure extra threads do not do any work.
• Our grid has 5120 threads. How can we split it into blocks?
• One possible layout is to have 16x16x1 blocks (X*Y*Z).

SVNIT, Surat CSE 461 15


CUDA Thread Organization (Cont.)

• Each block is 16x16x1 threads (X*Y*Z)
• 5 blocks on the x-axis
• 4 blocks on the y-axis
• We use these dimensions when launching the kernel, as in the sketch below
[Figure: a 16x16 BLOCK of threads and the 5x4 GRID of blocks]
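A minimal sketch (not on the original slide) of how this layout could be expressed; the kernel name processImage, the device pointer d_image, and the per-pixel operation are illustrative assumptions:

__global__ void processImage(unsigned char *img, int width, int height)   /* hypothetical kernel */
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;       /* 0..79 */
    int y = blockIdx.y * blockDim.y + threadIdx.y;       /* 0..63 */
    if (x < width && y < height)                         /* extra threads beyond 76x62 do no work */
        img[y * width + x] = 255 - img[y * width + x];   /* e.g., invert the pixel */
}

/* Launch: */
dim3 block(16, 16, 1);   /* 16x16x1 = 256 threads per block */
dim3 grid(5, 4, 1);      /* 5x4 blocks -> 80x64 = 5120 threads in total */
processImage<<<grid, block>>>(d_image, 76, 62);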

SVNIT, Surat CSE 461 16


CUDA Thread Organization (Cont.)
• A kernel is launched as a grid of blocks of threads
– blockIdx and threadIdx are 3D
– So far we have used only one dimension (x)
• Built-in variables:
– threadIdx.x, threadIdx.y, threadIdx.z
• Return the thread ID in the x-axis, y-axis, and z-axis of the thread that is
being executed by this stream processor in this block.
– blockIdx.x, blockIdx.y, blockIdx.z
• Returns the block ID in the x-axis, y-axis, and z-axis of the block that is
executing the given block of code.
– blockDim.x, blockDim.y, blockDim.z
• Return the “block dimension” (i.e., the number of threads in a block in
the x-axis, y-axis, and z-axis).
– gridDim.x, gridDim.y, gridDim.z
• Return the grid dimensions (i.e., the number of blocks in the grid in the x-axis, y-axis, and z-axis).

SVNIT, Surat CSE 461 17


Simple Processing Flow

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory

SVNIT, Surat CSE 461 18


Simple Processing Flow (Cont.)

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory
• Load GPU code and execute it, caching data on chip for performance

SVNIT, Surat CSE 461 19


Simple Processing Flow (Cont.)

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory
• Load GPU code and execute it, caching data on chip for performance
• Copy results from GPU memory to CPU memory
SVNIT, Surat CSE 461 20
Hello World on Host with NVCC

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code, as in the sketch below
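Since the slide's listing is a screenshot, here is a minimal host-only program of the kind described (the exact text printed is an assumption):

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");   /* plain C: no device code involved */
    return 0;
}

/* Compile and run with the NVIDIA compiler:
     $ nvcc hello.cu -o hello
     $ ./hello
     Hello World!
*/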

SVNIT, Surat CSE 461 21


Hello World! with Device Code
• Two new syntactic elements…
• __global__ void mykernel(void)
– The CUDA C/C++ __global__ keyword indicates a function that:
• Runs on the device
• Is called from host code
• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
• mykernel<<<1,1>>>();
– Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
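Putting the two elements together, a sketch of the complete program this slide describes (the slide's listing is a screenshot, so the surrounding main() and the printed text are assumptions):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void mykernel(void)
{
    /* Runs on the device; empty for now */
}

int main(void)
{
    mykernel<<<1,1>>>();        /* kernel launch from host code: 1 block of 1 thread */
    cudaDeviceSynchronize();    /* wait for the device to finish */
    printf("Hello World!\n");
    return 0;
}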

SVNIT, Surat CSE 461 22


Vector Addition on Device
• Let us write a simple kernel to add two integers
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
• As before __global__ is a CUDA C/C++ keyword
meaning
– add() will execute on the device
– add() will be called from the host
• add() runs on the device, so a, b and c must point to device memory
– We need to allocate memory on the GPU
[Figure: a, b and c in device memory]

SVNIT, Surat CSE 461 23


Memory Management
• Host and device memory are separate entities
• Device pointers point to GPU memory
– May be passed to/from host code
– May NOT be dereferenced in host code
• Host pointers point to CPU memory
– May be passed to/from device code
– May NOT be dereferenced in device code
• Simple CUDA API for handling device memory
• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to the C equivalents malloc(), free(), memcpy()

SVNIT, Surat CSE 461 24


Vector Addition on Device (Cont.)

• Use cudaMalloc and cudaFree to allocate and deallocate memory on the device
• Use cudaMemcpy to copy data to and from the device
– An argument specifies the direction of the copy
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
[Code, compilation, and output shown on the slide; a sketch of the host code follows below]
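A sketch of what that host code could look like for the single-integer add() kernel from the previous slide (variable names and values are assumptions):

int main(void)
{
    int a = 2, b = 7, c;          /* host copies of a, b, c */
    int *d_a, *d_b, *d_c;         /* device copies of a, b, c */
    int size = sizeof(int);

    /* Allocate space for device copies */
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    /* Copy inputs to device */
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    /* Launch add() kernel on the GPU */
    add<<<1,1>>>(d_a, d_b, d_c);

    /* Copy result back to the host */
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    /* Cleanup */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}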

SVNIT, Surat CSE 461 25


Parallel Vector Addition

• So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
….
add<<< N, 1 >>>();
• Instead of executing add() once, execute it N times in parallel
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
– By using blockIdx.x to index into the array, each block handles a different element of the array
[Figure: how the GPU threads look at the kernel code. Block 0: c[0] = a[0] + b[0]; Block 1: c[1] = a[1] + b[1]; Block 2: c[2] = a[2] + b[2]; Block 3: c[3] = a[3] + b[3]]

SVNIT, Surat CSE 461 26


Parallel Vector Addition (Cont.)
• How about the main function that runs on the host?

/* Host copies of a, b, and c */
int *a = NULL, *b = NULL, *c = NULL;
…
/* Read number of elements from command line */
N = atoi(argv[1]);
/* Compute the size */
size = N*sizeof(int);
/* Allocate space for host copies of a, b, and c */
a = (int*) malloc(size);
b = (int*) malloc(size);
c = (int*) malloc(size);
…
/* Dummy input values */
for (i = 0; i < N; ++i) {
    a[i] = b[i] = i;
    c[i] = 0;
}
...
/* Copy input to device */
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
/* Launch kernel for addition */
add<<<N,1>>>(d_a, d_b, d_c);
/* Copy result back to the host */
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

SVNIT, Surat CSE 461 27


Parallel Vector Addition (Cont.)
[Compilation and output; scaling results shown on the slide]

SVNIT, Surat CSE 461 28


CUDA Threads
• Terminology: a block can be split into parallel threads
• Let’s change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c)
{
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
• We use threadIdx.x instead of blockIdx.x
• Need to make one change in main()…
// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

SVNIT, Surat CSE 461 29


CUDA Threads (Cont.)
[Compilation and output; scaling results shown on the slide]

SVNIT, Surat CSE 461 30


Combining Threads and Blocks
• We’ve seen parallel vector addition using:
– Several blocks with one thread each
– One block with several threads
• Let’s adapt vector addition to use both blocks and threads
• Why? We’ll come to that
• First let’s discuss data indexing

SVNIT, Surat CSE 461 31


Indexing Arrays with Blocks and Threads

• No longer as simple as using blockIdx.x and threadIdx.x


• Consider indexing an array with one element per thread (8 threads/block)
• With M threads per block, a unique index for each thread is given by:
• int index = threadIdx.x + blockIdx.x * M;

threadIdx.x threadIdx.x threadIdx.x threadIdx.x

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3

SVNIT, Surat CSE 461 32


Indexing Arrays with Blocks and Threads: Example

• Which thread will operate on the red element?

[Figure: array elements 0 to 31; four blocks of 8 threads each, with threadIdx.x = 0..7 in every block and blockIdx.x = 0..3; the red element is element 24]

int index = threadIdx.x + blockIdx.x * M
          = 0 + 3 * 8
          = 24;
SVNIT, Surat CSE 461 33
Vector Addition with Blocks and Threads
• Use the built-in variable blockDim.x for threads per block
– int index = threadIdx.x + blockIdx.x * blockDim.x;
• Combined version of add() to use parallel threads and parallel blocks
__global__ void add(int *a, int *b, int *c)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}
• What changes need to be made in main()?
– add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

SVNIT, Surat CSE 461 34


Vector Addition with Blocks and Threads (Cont.)
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}
• Update the kernel launch, where M is the number of threads per block (a complete sketch follows below):
add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
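A compact, self-contained sketch of the whole program using this bounds-checked kernel and launch (THREADS_PER_BLOCK = 512 and the default N are assumptions):

#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                  /* avoid accessing beyond the end of the arrays */
        c[index] = a[index] + b[index];
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000000;
    size_t size = n * sizeof(int);

    /* Host copies */
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    for (int i = 0; i < n; ++i) { a[i] = b[i] = i; c[i] = 0; }

    /* Device copies */
    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    /* Round up so the grid covers all n elements */
    int nblocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<nblocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[%d] = %d\n", n - 1, c[n - 1]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}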

SVNIT, Surat CSE 461 35


Why Blocks and Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

SVNIT, Surat CSE 461 36


Why Do You Need Threads?
• Key to understanding:
– Instructions are issued in order
– A thread stalls when one of the operands isn’t ready:
• Memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
• GMEM latency: >100 cycles (varies by architecture/design)
• Arithmetic latency: <100 cycles (varies by architecture/design)

SVNIT, Surat CSE 461 37


Understanding GPU Latency Hiding
• In CUDA C source code:
– int idx = threadIdx.x + blockDim.x * blockIdx.x;
  c[idx] = a[idx] * b[idx];

• In machine code:
– I0: LD R0, a[idx];
– I1: LD R1, b[idx];
– I2: MPY R2,R0,R1

SVNIT, Surat CSE 461 38


GPU Latency Hiding in the SM
I0: LD R0, a[idx];
I1: LD R1, b[idx];
I2: MPY R2, R0, R1

[Figure (built up over slides 39-48): a timeline of clock cycles C0, C1, C2, … against warps W0-W9 resident on the SM. In successive cycles the scheduler issues I0 and then I1 for W0, then for W1, W2, and so on through W8. Because I2 depends on the loads I0 and I1, W0 cannot issue I2 until its memory reads return many cycles later; by then the independent instructions of the other warps have kept the SM busy, hiding the memory latency.]

SVNIT, Surat CSE 461 39-48
Why Do You Need Threads?
• Key to understanding:
– Instructions are issued in order
– A thread stalls when one of the operands isn’t ready:
• Memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
• GMEM latency: >100 cycles (varies by architecture/design)
• Arithmetic latency: <100 cycles (varies by architecture/design)

• What can you do to make this even better?


– Have more WARPS.
– Most SMs limit the number of WARPS to 64
• How many threads/threadblocks to launch?
• Conclusion:
– Need enough threads to hide latency

SVNIT, Surat CSE 461 49


Why Blocks and Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads have mechanisms to efficiently:


– Communicate
– Synchronize

SVNIT, Surat CSE 461 50


How Threads Communicate
• Generally, threads can safely communicate with each other only if they exist within the same thread block.
– There are technically ways where two threads from different
blocks can communicate with each other, but this is much
more difficult, and much more prone to bugs within your
program.
– This is out of scope of this class.
• Threads within the same block have two main ways to
communicate data with each other.
– Shared memory
– Global Memory
[Figure: the GPU memory hierarchy]

SVNIT, Surat CSE 461 51


Shared Memory
• When a block of threads starts executing, it runs on an SM, a multiprocessor unit inside
the GPU.
• Each SM has a small amount of shared memory associated with it, usually 16KB of
memory.
• To make matters more difficult, often, multiple thread blocks can run simultaneously on
the same SM.
• For example, if each SM has 16KB of shared memory and there are 4 thread blocks
running simultaneously on an SM, then the maximum amount of shared memory
available to each thread block would be 16KB/4, or 4KB.
• So, as you can see, if you only need the threads to share a small amount of data at any
given time, using shared memory is by far the fastest and most convenient way to do it.

SVNIT, Surat CSE 461 52


Shared Memory (Example)
• There are multiple ways to declare shared memory inside a kernel
– Depends on whether the amount of memory needed is known at compile/run time
• Static Shared Memory
– If the shared memory array size is known at compile time
– __shared__ int s[64];
• Dynamic Shared Memory
– Used when the amount of shared memory needed is not known at compile time
– The shared memory allocation size per thread block must be specified (in bytes) using an optional third
execution configuration parameter
– dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
– In the kernel, we use an “unsized” extern variable to access the shared memory segment
• extern __shared__ int s[]; (note the empty brackets and use of the extern specifier)
• The size is implicitly determined from the third execution configuration parameter when the kernel is launched; see the sketch below
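For reference, a sketch of what the dynamicReverse kernel used in that launch might look like; the body (reversing an n-element array through dynamically allocated shared memory) is an assumption, not taken from the slide:

__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];     /* size set by the 3rd launch parameter: n*sizeof(int) */
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];                   /* stage the array in shared memory */
    __syncthreads();               /* make sure every element has been written */
    d[t] = s[tr];                  /* write it back reversed */
}

/* Launch (as on the slide): dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n); */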

SVNIT, Surat CSE 461 53


Global Memory
• However, if your program is using too much shared memory to store data, or
your threads simply need to share too much data at once, then it is possible that
the shared memory is not big enough to accommodate all the data that needs
to be shared among the threads.
• In such a situation, threads always have the option of writing to and reading
from global memory.
• Accessing global memory is much slower than accessing shared memory; however, global memory is much larger.

SVNIT, Surat CSE 461 54


Example: 1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of the input elements within a radius
• If the radius is 3, then each output element is the sum of 7 input elements

[Figure: “in” and “out” arrays for the 1D stencil]

SVNIT, Surat CSE 461 55


Implementing Within a Block

• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times

[Figure: “in” and “out” arrays with a left and right halo of radius elements; Threads 0-8 each read a window of the input]

SVNIT, Surat CSE 461 56


Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory


– In contrast to device memory, which is referred to as global memory
– Like a user-managed cache

• Declare using __shared__, allocated per block


– Places a variable into shared memory for each respective thread block
• Data is not visible to threads in other blocks

SVNIT, Surat CSE 461 57


Implementing With Shared Memory

• Cache data in shared memory


– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
• Each block needs a halo of radius elements at each boundary

[Figure: “in” and “out” arrays; each block’s input tile includes a halo of radius elements on each side]

SVNIT, Surat CSE 461 58


Stencil Kernel
• Kernel Code
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
SVNIT, Surat CSE 461 59
Data Race
▪ The stencil example will not work…
▪ Suppose thread 15 reads the halo before thread 0 has fetched it…

...
temp[lindex] = in[gindex];                             // thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];       // skipped by thread 15, since its threadIdx.x is not < RADIUS
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];                   // thread 15: load from temp[19], which thread 0 may not have written yet
...

SVNIT, Surat CSE 461 60


Synchronize Threads
• void __syncthreads();
• Synchronizes all threads within a block
– A thread’s execution can only proceed past a __syncthreads() after all threads in its block
have executed the __syncthreads().
– All threads within a thread block must call __syncthreads() at the same point
• Otherwise, it can lead to deadlock
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block

SVNIT, Surat CSE 461 61


Thread Divergence
• Recall that threads from a block are bundled into fixed-size warps for execution
on a CUDA core, and threads within a warp must follow the same execution
trajectory.
• All threads must execute the same instruction at the same time. In other words,
threads cannot diverge.
• The most common code construct that can cause thread divergence is
branching for conditionals in an if-then-else statement.
• If some threads in a single warp evaluate to 'true' and others to 'false', then the
'true' and 'false' threads will branch to different instructions.
• Some threads will want to proceed to the 'then' instruction, while others to the 'else'.

SVNIT, Surat CSE 461 62


Thread Divergence (Cont.)
• Intuitively, we would think statements in then and else should be executed in parallel.
• However, because of the requirement that threads in a warp cannot diverge, this cannot
happen.
• The CUDA platform has a workaround that fixes the problem, but has negative
performance consequences.
• When executing the if-then-else statement, the CUDA platform will instruct the warp to
execute the then part first, and then proceed to the else part.
• While executing the then part, all threads that evaluated the condition to false (i.e., the 'else' threads) are effectively deactivated.
• When execution proceeds to the else condition, the situation is reversed. As you can see,
the then and else parts are not executed in parallel, but in serial.
• This serialization can result in a significant performance loss.
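A minimal illustration (not from the slides) of a conditional that makes threads within one warp diverge:

__global__ void divergentKernel(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2;    /* 'then' path: odd lanes of the warp are deactivated here */
    else
        data[i] = data[i] + 1;    /* 'else' path: even lanes are deactivated here */
}

Because the two paths are serialized, the warp takes roughly the combined time of both branches.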

SVNIT, Surat CSE 461 63


Deadlock with Thread Divergence
• Thread divergence can also cause a program to deadlock. Consider the following example:
// myFunc_then and myFunc_else are some device functions
if (threadIdx.x < 16) {
    myFunc_then();
    __syncthreads();
} else if (threadIdx.x >= 16) {
    myFunc_else();
    __syncthreads();
}
• The first half of the warp will execute the then part, then wait for the second half of the
warp to reach __syncthreads().
• However, the second half of the warp did not enter the then part; therefore, the first half
of the warp will be waiting for them forever.

SVNIT, Surat CSE 461 64


Fixed Stencil Kernel
▪ Synchronize threads
▪ Basically – put a barrier in between to make them wait

...
temp[lindex] = in[gindex]; Store at temp[18]
if (threadIdx.x < RADIUS) {

temp[lindex – RADIUS] = in[gindex – RADIUS];


Skipped since threadId.x > RADIUS
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset]; Load from temp[19]

SVNIT, Surat CSE 461 65


Dynamic Shared Memory

• Dynamic shared memory can only be declared as a 1D array


• What if you need multiple dynamically sized arrays in a single kernel?
• You must declare a single extern unsized array as before and use pointers into it to
divide it into multiple arrays, as in the following excerpt.
extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF]; // nC chars
• In the kernel launch, specify the total shared memory needed, as in the following.
myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);

SVNIT, Surat CSE 461 66


Defining Grid/Block Structure
• Need to provide each kernel call with values for two key structures:
– Number of blocks in each dimension
– Threads per block in each dimension
• myKernel<<< B, T >>>(arg1, … );
• B – a structure that defines the number of blocks in grid in each dimension (1D
or 2D).
• T – a structure that defines the number of threads in a block in each dimension
(1D, 2D, or 3D).

SVNIT, Surat CSE 461 67


1D Grids and/or 1D Blocks
• If you want a 1D structure, you can use an integer for B and T in:
• myKernel<<< B, T >>>(arg1, … );
• B – An integer would define a 1D grid of that size
• T – An integer would define a 1D block of that size
• Example: myKernel<<< 1, 100 >>>(arg1, ... );

SVNIT, Surat CSE 461 68


Higher Dimensional Grids/Blocks
• 1D grids/blocks are suitable for 1D data, but higher dimensional grids/blocks are
necessary for:
• Higher dimensional data.
• Data set larger than the hardware dimensional limitations of blocks.
• CUDA has built-in variables and structures to define the number of blocks in a
grid in each dimension and the number of threads in a block in each dimension.

SVNIT, Surat CSE 461 69


CUDA Built-In Vector Types and Structures
• uint3 and dim3 are CUDA-defined structures of unsigned integers with fields x, y, and z:
• struct uint3 { unsigned int x, y, z; };
• struct dim3 { unsigned int x, y, z; };
• For dim3, any unspecified components are automatically initialized to 1.
• These vector types are mostly used to define grid of blocks and threads

SVNIT, Surat CSE 461 70


CUDA Built-In Variables for Grid/Block Sizes

• dim3 gridDim -- Grid dimensions, x, y, and z.


• Number of blocks in grid = gridDim.x * gridDim.y * gridDim.z
• dim3 blockDim -- Size of block dimensions x, y, and z.
• Number of threads in a block = blockDim.x * blockDim.y * blockDim.z
• Example
– dim3 grid(16,16); // grid = 16 x 16 blocks
– dim3 block(32,32); // block = 32 x 32 threads
– myKernel<<<grid, block>>>(...);
– which sets:
– grid.x = 16; grid.y = 16; grid.z = 1
– block.x = 32; block.y = 32; block.z = 1;

SVNIT, Surat CSE 461 71


2D Grids and 2D Blocks

• Full global thread ID in x and y dimensions


can be computed by:
– x = blockIdx.x * blockDim.x + threadIdx.x;
– y = blockIdx.y * blockDim.y + threadIdx.y;

SVNIT, Surat CSE 461 72


Flatten Matrices into Linear Memory

• Generally, memory is allocated dynamically on the device (GPU), and we cannot use two-dimensional indices (e.g., A[row][column]) to access matrices.
• We will need to know how the matrix is laid out in memory and then compute
the distance from the beginning of the matrix.
• C uses row-major order --- rows are stored one after the other in memory, i.e.,
row 0 then row 1 etc.

M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3

Row-major linear layout:
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

SVNIT, Surat CSE 461 73
Accessing Matrices in Linear Memory

• Logically
– a[row][column] == a[offset]
– offset = column + row * N

• In CUDA:
– int col = blockIdx.x*blockDim.x+threadIdx.x;
– int row = blockIdx.y*blockDim.y+threadIdx.y;
– int index = col + row * N;
– A[index] = …

SVNIT, Surat CSE 461 74


Matrix Addition: Add two 2D matrices
• Corresponding elements of two input matrices (a, b) are added together to form the elements of a third matrix (c)
• Two 2D matrices are added to form a sum 2D matrix
• We use dim3 variables to set the grid and block dimensions
• We calculate a global thread ID to index the column and row of the matrix
• We calculate the linear index of the matrix, as in the sketch below
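A sketch of the matrix addition kernel and launch along these lines (the kernel name, element type, and 16x16 block size are assumptions):

__global__ void matrixAdd(const float *a, const float *b, float *c, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread ID -> column */
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* global thread ID -> row */
    if (row < N && col < N) {
        int index = col + row * N;                     /* linear index, row-major layout */
        c[index] = a[index] + b[index];
    }
}

/* Launch for an N x N matrix: */
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matrixAdd<<<grid, block>>>(d_a, d_b, d_c, N);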

SVNIT, Surat CSE 461 75


General Rules and Best Practices
• Choosing the number of threads per block is very complicated
• To simplify things, use a constant number of threads per block
• There is a limit to the number of threads per block
– Since all threads of a block are expected to reside on the same processor core and must
share the limited memory resources of that core.
• On current GPUs, a thread block may contain up to 1024 threads.
• Your thread block size should always be a multiple of 32
– Kernels issue instructions in warps (32 threads).
– For example, if you have a block size of 50 threads, the GPU will still issue commands to 64
threads, and you'd just be wasting them.

SVNIT, Surat CSE 461 76


General Rules and Best Practices (Cont.)
• Try to size your blocks based on the maximum numbers of threads and blocks
that correspond to the compute capability of your card.
– For example, a CC 3.0 card each SM can have 16 active blocks and 2048 active threads.
– If you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048
thread limit.
– If you use 256 threads, you can only fit 8, but you're still using all the available threads and
will still have full occupancy.
– However, using 64 threads per block will only use 1024 threads when the 16-block limit is
hit, so only 50% occupancy.
– If shared memory and register usage is not a bottleneck, this should be your main concern
(other than your data dimensions).

SVNIT, Surat CSE 461 77


General Rules and Best Practices (Cont.)
• The blocks in your grid are spread out over the SMs to start
• Then the remaining blocks are placed into a pipeline.
• Blocks are moved into the SMs for processing as soon as there are enough
resources in that SM to take the block.
• In other words, as blocks complete in an SM, new ones are moved in.
• You could make the argument that having smaller blocks (128 instead of 256 in
the previous example) may complete faster since a particularly slow block will
hog fewer resources, but this is very much dependent on the code.

SVNIT, Surat CSE 461 78


CUDA Occupancy Calculator
• The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of
a GPU by a given CUDA kernel.
• The multiprocessor occupancy is the ratio of active warps to the maximum number of
warps supported on a multiprocessor of the GPU.
• Each multiprocessor on the device has a set of N registers available for use by CUDA
program threads.
• These registers are a shared resource that are allocated among the thread blocks
executing on a multiprocessor.
• The CUDA compiler attempts to minimize register usage to maximize the number of
thread blocks that can be active in the machine simultaneously.
• If a program tries to launch a kernel for which the registers used per thread times the
thread block size is greater than N, the launch will fail
• https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
SVNIT, Surat CSE 461 79
Determining Values at Runtime
• Query GPU properties using cudaGetDeviceProperties
– __host__​cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device)
– Returns information about the compute-device.
• Some useful properties
– maxThreadsDim[3] - Maximum size of each dimension of a block
– maxGridSize[3] - Maximum size of each dimension of a grid
– maxThreadsPerBlock - Maximum number of threads per block
– maxThreadsPerMultiProcessor - Maximum resident threads per multiprocessor
– maxBlocksPerMultiProcessor - Maximum number of resident blocks per multiprocessor
– reservedSharedMemPerBlock - Shared memory reserved by CUDA driver per block in bytes
– sharedMemPerBlock - Shared memory available per block in bytes

SVNIT, Surat CSE 461 80


Determining Values at Runtime (Cont.)
• // CUDA device properties variable
cudaDeviceProp prop;
// Query GPU properties
cudaGetDeviceProperties(&prop, device_id);
printf("maxThreadsDim x,y,z = %d,%d,%d\n", prop.maxThreadsDim[0],
prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
printf("maxGridSize x,y,z = %d,%d,%d\n", prop.maxGridSize[0],
prop.maxGridSize[1], prop.maxGridSize[2]);
printf("maxThreadsPerBlock = %d, maxThreadsPerMultiProcessor = %d, maxBlocksPerMultiProcessor = %d\n",
prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor, prop.maxBlocksPerMultiProcessor);
printf("reservedSharedMemPerBlock = %d, sharedMemPerBlock = %d\n",
prop.reservedSharedMemPerBlock, prop.sharedMemPerBlock);
• https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html

SVNIT, Surat CSE 461 81


Matrix Multiplication Review

• To calculate the product of two matrices A


and B, we multiply the rows of A by the
columns of B and add them up.
• Then place the sum in the appropriate
position in the matrix C.

SVNIT, Surat CSE 461 82


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 83


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 84


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 85


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 86


Parallelizing Matrix Multiplication

• To compute a single value of C(i,j), only a single thread is necessary to traverse the ith row of A and the jth column of B.
• Therefore, the number of threads needed to compute a square matrix multiply is O(N^2), as in the sketch below.
[Figure: matrices A (row i), B (column j), and C, with k the index along the row and column being traversed]
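A sketch of a straightforward kernel along these lines, with one thread computing one element of C (names and types are assumptions; no shared-memory tiling):

__global__ void matrixMul(const float *A, const float *B, float *C, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* j */
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* i */
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                    /* traverse row i of A and column j of B */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}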
SVNIT, Surat CSE 461 87
Quick Reference Guide
• Function attributes
– __global__ function called by the host, executes on the device
– __device__ function called by the device, executes on the device
– __host__ function called by the host, executes on the host
– __host__ __device__ generates both host and device code for the function
• Variables attributes
– __device__ variable on device (Global Memory)
– __shared__ variable in Shared Memory
– __restrict__ restricted pointers, assert to the compiler that pointers are not aliased
– No qualifier automatic variable, resides in Register or in Local Memory

SVNIT, Surat CSE 461 88


Quick Reference Guide (Cont.)
• Built-in Variables
– dim3 gridDim size of the grid in number of blocks along the x, y, z axes
– dim3 blockDim size of the block in number of threads along the x, y, z axes
– dim3 blockIdx position (x,y,z) of the block in the grid
– dim3 threadIdx position (x,y,z) of the thread in the block
• Shared memory
– __shared__ int x[10]; statically allocated array in shared memory
– extern __shared__ int x[]; dynamically allocated array in shared memory
• kernel<<<blocks, threadsperblock, dyn shared mem in bytes>>>

SVNIT, Surat CSE 461 89


Quick Reference Guide (Cont.)
• Memory Management
– cudaMalloc(&dptr, size) allocates size memory on the device
– cudaFree(dptr) frees size memory from the device
– cudaMallocHost(&hptr, size) allocates size pinned memory on the host
– cudaFreeHost(hptr) frees size memory from the host
– cudaMemcpy(trgptr, srcptr, size, direction) copies size memory from the
source pointer to the target pointer using the direction specified
• e.g. cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost,
cudaMemcpyDeviceToDevice

SVNIT, Surat CSE 461 90


Thank You!

SVNIT, Surat CSE 461 91
