
PARALLEL PROCESSING

UNIT 1

UNDERSTANDING PARALLEL ENVIRONMENT
QUIZ
What are 3 traditional ways HW designers make computers run faster?

- Faster clocks
- Longer clock period
- More work per clock cycle
- Larger hard disk
- More processors
- Reduce the amount of memory

Answer: faster clocks, more work per clock cycle, and more processors.
SEYMOUR CRAY (SUPERCOMPUTER DESIGNER)

- If you were plowing a field, which would you rather use?
- Two strong oxen?
- 1,024 chickens?
PARALLEL COMPUTING
- Parallel computing was originally intended for supercomputing.
- Now all computers and mobile devices use parallel computing.
- Modern GPUs:
  - hundreds of processors,
  - thousands of ALUs (around 3,000),
  - tens of thousands of concurrent threads.
- This requires a different way of programming than a single scalar processor.
- It enables general-purpose programmability on the GPU (GPGPU).
TRANSISTORS CONTINUE ON MOORE'S PATH . . . FOR NOW

CLOCK SPEED (NO MORE SPEED)
QUIZ
- Are processors today getting faster because:
  - we are clocking their transistors faster, or
  - we have more transistors available for computation?
- Answer: we have more transistors available for computation.
- Why don't we keep increasing the clock speed of a single processor instead of using multiple processors at a lower clock speed?
- We can't, because of power (heat).
WHAT KIND OF PROCESSORS WILL WE BUILD?

- Assume the major design constraint is power.
- Why are traditional CPU-like processors not the most energy-efficient processors?
  - They have complex control hardware.
  - This increases flexibility and performance,
  - but it also increases power consumption and design complexity.
- How do we increase power efficiency (GPU-like)?
  - Build simpler control structures.
  - Take those transistors and devote them to supporting more computation on the data path.
- The challenge then becomes: how do we program such a machine?
MORE TO UNDERSTAND

[Diagram: a complex, CPU-like structure gives more speed per thread but draws more power; a simple, GPU-like structure gives less speed per thread but draws less power.]
QUIZ
- Which techniques are computer designers using today to build more power-efficient chips?

- Fewer, more complex processors
- More, simpler processors
- Maximizing the speed of the processor clock
- Increasing the complexity of the control hardware

Answer: more, simpler processors.
ANOTHER FACTOR FOR POWER EFFICIENCY

Power efficiency can target either of two goals:
- Decrease latency: the amount of time to complete a single task ("time").
- Increase throughput: the number of tasks completed per unit time ("number").

- The two goals are not always aligned.
  - CPU-like: designed to decrease latency.
  - GPU-like: designed to increase throughput.
- The choice depends on the application (image processing, for example, prefers higher throughput).
SUPER QUIZ
- Why do I say GPU-like instead of multi-core CPU? Is there a difference?

- They are both built for parallel programming. However, multi-core CPUs can also be used for sequential programming (they provide branches and interrupts), whereas the GPU is built for parallel programming from scratch.
GPU DESIGN BELIEFS
- Lots of simple compute units.
- An explicitly parallel programming model:
  - we know there are many processors, and we do not depend on the compiler, for example, to parallelize the task for us.
- Optimized for throughput, not latency.
INTRO TO PARALLEL PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
- Intel 8-core Ivy Bridge:
  - 8-wide AVX vector operations per core,
  - 2 threads per core (hyper-threading).
- This means the processor offers 8 × 8 × 2 = 128-way parallelism.
- Parallel programming is more complex; however, running a sequential C program uses less than 1% of this processor's capability.
CUDA PLATFORM

A CUDA program (C with extensions) targets two processors:
- the CPU, the "Host", and
- the GPU, the "Device" (a co-processor),
each with its own memory.

- The CUDA compiler generates two separate programs: one for the CPU (host) and another for the GPU (device).
- The CPU is in charge and controls the GPU:
  - it moves data between memories (cudaMemcpy),
  - allocates memory on the GPU (cudaMalloc),
  - and invokes programs (kernels) on the GPU: "the host launches kernels on the device".
QUIZ
The GPU can do the following:

- Initiate a data send from GPU to CPU
- Respond to a CPU request to send data from GPU to CPU
- Initiate a data request from CPU to GPU
- Respond to a CPU request to receive data from CPU to GPU
- Compute a kernel launched by the CPU
- Compute a kernel launched by the GPU

Answer: respond to a CPU request to send data from GPU to CPU, respond to a CPU request to receive data from CPU to GPU, and compute a kernel launched by the CPU. In this model the CPU is in charge; the GPU does not initiate transfers.
TYPICAL GPU PROGRAM
- The CPU allocates storage on the GPU.
- The CPU copies input data from CPU to GPU.
- The CPU launches the kernels on the GPU to process the data.
- The CPU copies the results back from the GPU to the CPU.

- If you need to move data back and forth between CPU and GPU many times, CUDA may not be a good fit for your program, because each round trip takes all the steps shown above (a minimal sketch of the flow follows).
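A minimal host-side sketch of these four steps, assuming a hypothetical kernel named process and float data (error checking omitted):

#include <cuda_runtime.h>

__global__ void process(float *d_out, const float *d_in);   // hypothetical kernel, defined elsewhere

void run(const float *h_in, float *h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;

    cudaMalloc(&d_in, bytes);                                 // 1. CPU allocates storage on the GPU
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // 2. CPU copies input data to the GPU

    process<<<(n + 255) / 256, 256>>>(d_out, d_in);           // 3. CPU launches the kernel on the GPU

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // 4. CPU copies the results back
    cudaFree(d_in);
    cudaFree(d_out);
}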
MAIN ISSUE
- Defining the GPU computation:
  - write a kernel like a serial program,
  - and when launching the kernel, tell the GPU how many threads to launch.
QUIZ
What is the GPU good at?

- Launching a small number of threads efficiently
- Launching a large number of threads efficiently
- Running one thread very quickly
- Responding to a CPU request to receive data from CPU to GPU
- Running one thread that does lots of work in parallel
- Running a large number of threads in parallel

Answer: launching a large number of threads efficiently, and running a large number of threads in parallel.
GPU POWER
- Example:
  - In:  [1, 2, 3, ..., 64]
  - Out: [1², 2², 3², ..., 64²]

- Sequential solution:
  for (int i = 0; i < 64; i++)
      out[i] = in[i] * in[i];
- Here we have 1 thread doing 64 multiplications, each taking 2 ns.
GPU POWER (CONT.)
- Example:
  - In:  [1, 2, 3, ..., 64]
  - Out: [1², 2², 3², ..., 64²]

- Division of labor:
  - CPU: allocates memory, copies data to/from the GPU, launches the kernel.
  - GPU: computes out = in * in.

- Parallel solution (a full sketch follows below):
  - CPU code: square kernel <<<64>>>(out, in)
  - Here we have 64 threads, each doing 1 multiplication, which takes 10 ns.
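A complete, minimal sketch of this example. Note that a real CUDA launch takes the form kernel<<<blocks, threadsPerBlock>>>, so the slide's informal <<<64>>> becomes <<<1, 64>>> (one block of 64 threads); variable names are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void square(float *d_out, const float *d_in) {
    int i = threadIdx.x;                  // each of the 64 threads squares one element
    d_out[i] = d_in[i] * d_in[i];
}

int main(void) {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = i + 1.0f;     // [1, 2, ..., 64]

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    square<<<1, N>>>(d_out, d_in);                      // 1 block of 64 threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.0f ... %.0f\n", h_out[0], h_out[N - 1]);  // prints 1 ... 4096
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}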
EXAMPLE

THREADS AND BLOCKS
THREADS

- A thread is a single execution unit that runs a kernel on the GPU. It is similar to a CPU thread, but there are usually many more of them. Threads are sometimes drawn as arrows.
BLOCKS

- Thread blocks are a virtual collection of threads.
- All the threads in any single thread block can communicate (see the sketch below).
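As an illustration (not from the slides), a small sketch of threads within one block communicating through shared memory; the kernel name blockReverse and the block size of 64 are assumptions:

__global__ void blockReverse(float *d_data) {
    __shared__ float tile[64];               // memory visible to every thread in this block
    int t = threadIdx.x;

    tile[t] = d_data[t];                     // each thread writes one element
    __syncthreads();                         // wait until the whole block has written

    d_data[t] = tile[blockDim.x - 1 - t];    // read an element written by another thread
}

// host side: blockReverse<<<1, 64>>>(d_data);   // one block of 64 threads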
GRID

- A kernel is launched as a collection of thread blocks called the grid.
MAXIMUMS
- You can launch up to 1024 threads per block (or 512 if your card is compute capability 1.3 or less).
- You can launch up to 2^31 − 1 blocks in a single launch (or 2^16 − 1 if your card is compute capability 2.0 or less).
- So my relatively inexpensive GeForce GT 440 can launch a rather ridiculous 67,108,864 threads.
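Rather than memorizing these limits, you can query them at run time. A small sketch using the standard CUDA runtime call cudaGetDeviceProperties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // properties of device 0

    printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size in x:    %d\n", prop.maxGridSize[0]);
    return 0;
}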
WHY BLOCKS AND THREADS?
- You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
- Suppose you wrote a program for a GPU that can run 2,000 threads concurrently, and then you want to execute the same code on a higher-end GPU that can run 6,000 threads. Are you going to rewrite the code for each GPU?
- Each GPU has a limit on the number of threads per block, but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, executing some number of threads simultaneously.
- By adding this extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker with absolutely no change to the code (see the sketch below).
- NVIDIA has done this to allow automatic performance gains when your code runs on different, higher-performance GPUs.
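One common way to keep the launch independent of the GPU, sketched here with assumed names someKernel, d_data and n, is to derive the number of blocks from the problem size:

// host side: the block size stays fixed; the block count scales with the problem
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up so every element is covered
someKernel<<<blocks, threadsPerBlock>>>(d_data, n);

A bigger GPU simply schedules more of those blocks at the same time; the source code does not change.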
DIM3

DIM3 DATA TYPE

- dim3 is a 3D structure (vector type) with three integers: x, y and z. You can initialize as many of the three coordinates as you like:

  dim3 threads(256);            // x = 256, y and z default to 1
  dim3 blocks(100, 100);        // x = 100, y = 100, z defaults to 1
  dim3 anotherOne(10, 54, 32);  // x = 10, y = 54, z = 32
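A brief sketch of how dim3 values are typically passed to a kernel launch; the image dimensions and the kernel name blurKernel are assumptions:

dim3 threads(16, 16);                                    // 16 x 16 = 256 threads per block
dim3 blocks((width + 15) / 16, (height + 15) / 16);      // enough blocks to cover a width x height image
blurKernel<<<blocks, threads>>>(d_img, width, height);   // hypothetical kernel and arguments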
THREAD ACCESS PARAMETERS
- Each running thread is individual; it knows the following:
  - threadIdx ← thread index within the block
  - blockIdx ← block index within the grid
  - blockDim ← number of threads in the block
  - gridDim ← number of blocks in the grid
- Each of these is a dim3 structure and can be read in the kernel to assign particular workloads to any thread.
THREAD ACCESS PATTERN
- It is common to have threads calculate a unique id within the kernel to process some specific data. If we launch a kernel with:
  SomeKernel<<<100, 25>>>(...);
- then inside the kernel each thread can calculate a unique id with:
  int id = blockIdx.x * blockDim.x + threadIdx.x;
- So the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
  int id = 4 * 25 + 5 = 105
- and the thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
  int id = 76 * 25 + 14 = 1914
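A minimal kernel sketch using that id; the kernel body (doubling each element) and the bounds check are illustrative additions, not from the slides:

__global__ void SomeKernel(float *d_data, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // unique id across the grid (0..2499 for <<<100, 25>>>)
    if (id < n)                                       // guard in case n is not a multiple of blockDim.x
        d_data[id] = 2.0f * d_data[id];               // each thread processes its own element
}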
MAPPING

MAP
- Set of elements to process [64 floats]
- Function to run on each element [square]

Map(elements, function)
QUIZ
Which programs can be solved using Map?

- Sort an input array
- Add one to each element of an input array
- Sum up all the elements of an input array
- Compute the average of an input array

Answer: add one to each element. Map applies an independent function to each element; sorting, summing, and averaging all need communication across elements.
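A minimal sketch of that map as a CUDA kernel; the name addOne and the launch configuration are illustrative:

__global__ void addOne(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
    if (i < n)
        d_data[i] = d_data[i] + 1.0f;                 // the "function" the map applies to each element
}

// host side: addOne<<<(n + 255) / 256, 256>>>(d_data, n);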
