UNIT 1
UNDERSTANDING PARALLEL
ENVIRONMENT
QUIZ
What are the three traditional ways HW designers make computers run faster?
Faster clocks
More work per clock cycle
More processors
SEYMOUR CRAY (SUPERCOMPUTER DESIGNER)
PARALLEL COMPUTING
Parallel computing was originally intended for supercomputing; now all computers and mobile devices use it.
Modern GPUs:
Hundreds of processors
Thousands of ALUs (~3,000)
Tens of thousands of concurrent threads
CLOCK SPEED (NO LONGER INCREASING)
QUIZ
Are processors today getting faster? Why or why not?
MORE TO UNDERSTAND (CONT.)
[Figure: a complex core structure gives more speed but draws more power; a simple structure gives less speed but draws less power.]
QUIZ
Which techniques are computer designers using today to build more power-efficient chips?
ANOTHER FACTOR FOR POWER EFFICIENCY
GPU DESIGN PRINCIPLES
Lots of simple compute units
Explicitly parallel programming model: we know there are many processors, and we do not depend on the compiler, for example, to parallelize the task for us.
Optimized for throughput, not latency
INTRO TO PARALLEL
PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
Intel 8-core Ivy Bridge
8-wide AVX vector operations per core, i.e. up to 8 × 8 = 64 floating-point operations in flight at once
CUDA PLATFORM
A CUDA program is written in C with extensions.
CPU ("Host") and GPU ("Device", a co-processor), each with its own memory.
The GPU's jobs:
Respond to CPU requests by receiving data sent from the CPU to the GPU
Compute a kernel launched by the CPU
QUIZ
What is the GPU good at?
Sequential solution:
for (int i = 0; i < 64; i++)
    out[i] = in[i] * in[i];
Here we have 1 thread doing 64 multiplications; each takes 2 ns, so the whole loop takes 64 × 2 = 128 ns.
GPU POWER (CONT.)
Example:
In:  [1, 2, 3, …, 64]
Out: [1², 2², 3², …, 64²]
Parallel solution:
CPU code: square_kernel<<<1, 64>>>(out, in)
Here we have 64 threads, each doing 1 multiplication that takes 10 ns, and they all run at once, so the whole job takes about 10 ns. Each single operation is slower than on the CPU (10 ns vs. 2 ns), but the total time drops from 128 ns to 10 ns: the GPU loses on latency and wins on throughput.
EXAMPLE
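The example slide itself was a figure; below is a minimal, self-contained sketch of the square program the surrounding slides describe (names such as h_in, d_in, and square_kernel are illustrative, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: each of the 64 threads squares one element.
__global__ void square_kernel(float* out, const float* in) {
    int i = threadIdx.x;
    out[i] = in[i] * in[i];
}

int main() {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)(i + 1);  // [1, 2, ..., 64]

    // Allocate GPU memory and copy the input over (CPU -> GPU).
    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // The CPU launches the kernel: 1 block of 64 threads.
    square_kernel<<<1, N>>>(d_out, d_in);

    // Copy the result back (GPU -> CPU) and print it.
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%.0f ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}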
THREADS AND BLOCKS
THREADS
BLOCKS
GRID
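The three titles above name the levels of CUDA's launch hierarchy: threads are grouped into blocks, and blocks form a grid. A minimal sketch (the kernel name whereami is illustrative) showing how each thread sees its own position through the built-in index variables:

#include <cstdio>
#include <cuda_runtime.h>

// threadIdx - this thread's index within its block
// blockIdx  - this block's index within the grid
// blockDim  - the number of threads per block
// gridDim   - the number of blocks in the grid
__global__ void whereami() {
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main() {
    whereami<<<2, 4>>>();     // a grid of 2 blocks, 4 threads each
    cudaDeviceSynchronize();  // wait so the device-side printf output appears
    return 0;
}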
MAXIMUMS
You can launch up to 1024 threads per block (or 512 if your card is compute capability 1.3 or lower).
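Rather than hard-coding these limits, you can query them at run time. A minimal sketch using the CUDA runtime's device-properties call:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Compute capability %d.%d, max %d threads per block\n",
           prop.major, prop.minor, prop.maxThreadsPerBlock);
    return 0;
}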
WHY BLOCKS AND THREADS?
You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
Suppose you wrote a program for a GPU that can run 2,000 threads concurrently. Then you want to execute the same code on a higher-end GPU with 6,000 threads. Are you going to change the whole code for each GPU?
Each GPU has a limit on the number of threads per block but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, executing some number of threads simultaneously.
By adding the extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker with absolutely no change to the code (see the sketch below).
NVIDIA has done this to allow automatic performance gains when your code is run on different, higher-performance GPUs.
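One common pattern that makes this concrete: fix the threads per block and derive the block count from the problem size, so the identical launch line runs unchanged on any GPU. A minimal sketch (some_kernel is a placeholder name, not from the slides):

#include <cstdio>

int main() {
    const int N = 67000000;           // e.g. "launch 67 million threads"
    const int threadsPerBlock = 256;  // safely under the 1024-per-block limit
    // Round up so every element gets a thread even when N % 256 != 0.
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    printf("%d blocks of %d threads\n", blocks, threadsPerBlock);
    // The launch line would be:
    // some_kernel<<<blocks, threadsPerBlock>>>(...);
    return 0;
}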
DIM3
DIM3 DATA TYPE
SomeKernel<<<100, 25>>>(...);
Inside the kernel, each thread can calculate a unique id with:
int id = blockIdx.x * blockDim.x + threadIdx.x;
So the thread with threadIdx.x == 5 in the block with blockIdx.x == 4 would calculate:
int id = 4 * 25 + 5 = 105
The thread with threadIdx.x == 14 in the block with blockIdx.x == 76 would calculate:
int id = 76 * 25 + 14 = 1914
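A short sketch of how this global id is typically used inside a kernel (the kernel name scale is illustrative). The bounds check matters whenever the element count is not an exact multiple of the block size:

__global__ void scale(float* data, int n, float factor) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)             // threads past the end of the array do nothing
        data[id] *= factor;
}

// Launch matching the slide's configuration of 100 blocks x 25 threads:
// scale<<<100, 25>>>(d_data, 2500, 2.0f);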
MAPPING
MAP
Set of elements to process [64 floats]
Function to run on each element [square]
Map(elements, function)
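A minimal sketch of the map pattern in CUDA C++ (map_kernel and Square are illustrative names, not from the slides): one thread per element, each applying the same function independently:

// Generic map: out[i] = f(in[i]) for every i.
template <typename F>
__global__ void map_kernel(float* out, const float* in, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(in[i]);
}

// A functor callable on the device, e.g. the slides' square function.
struct Square {
    __device__ float operator()(float x) const { return x * x; }
};

// Launch over 64 floats:
// map_kernel<<<1, 64>>>(d_out, d_in, 64, Square{});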
QUIZ
Which problems can be solved using Map?