
PARALLEL PROCESSING

UNIT 1

UNDERSTANDING PARALLEL ENVIRONMENT
QUIZ
What are 3 traditional ways HW designers make computers run faster?

- Faster clocks
- Longer clock period
- More work per clock cycle
- Larger hard disk
- More processors
- Reduce the amount of memory

Answer: faster clocks, more work per clock cycle, and more processors.
SEYMOUR CRAY (SUPERCOMPUTER DESIGNER)

- If you were plowing a field, which would you rather use?
- Two strong oxen?
- 1,024 chickens?
PARALLEL COMPUTING
- Parallel computing was originally intended for supercomputing.
- Now all computers and mobile devices use parallel computing.
- Modern GPUs:
  - hundreds of processors,
  - thousands of ALUs (around 3,000),
  - tens of thousands of concurrent threads.
- This requires a different way of programming than a single scalar processor.
- It enables general-purpose programmability on the GPU (GPGPU).
TRANSISTORS CONTINUE ON MOORE'S PATH . . . FOR NOW

CLOCK SPEED (NO MORE SPEED)
QUIZ
- Are processors today getting faster because:
  - we are clocking their transistors faster, or
  - we have more transistors available for computation?
- Answer: we have more transistors available for computation.
- Why don't we keep increasing the clock speed of a single processor instead of using multiple processors at a lower clock speed?
- We can't, because of power (heat).
WHAT KIND OF PROCESSORS WILL WE BUILD?

- Assume the major design constraint is power.
- Why are traditional CPU-like processors not the most energy-efficient processors?
  - They have complex control hardware.
  - This increases flexibility and performance,
  - but it also increases power consumption and design complexity.
- How do we increase power efficiency (GPU-like)?
  - Build simpler control structures.
  - Take those transistors and devote them to supporting more computation on the data path.
- The challenge then becomes: how do we program such a machine?
MORE TO UNDERSTAND

[Diagram: a complex, CPU-like structure gives more speed per thread but draws more power; a simple, GPU-like structure gives less speed per thread but draws less power.]
QUIZ
- Which techniques are computer designers using today to build more power-efficient chips?

- Fewer, more complex processors
- More, simpler processors
- Maximizing the speed of the processor clock
- Increasing the complexity of the control hardware

Answer: more, simpler processors.
ANOTHER FACTOR FOR POWER EFFICIENCY

Power efficiency can target either of two goals:
- Decrease latency: the amount of time to complete a single task ("time").
- Increase throughput: the number of tasks completed per unit time ("number").

- The two goals are not always aligned.
  - CPU-like: designed to decrease latency.
  - GPU-like: designed to increase throughput.
- The choice depends on the application (image processing, for example, prefers higher throughput).
SUPER QUIZ
- Why do I say GPU-like instead of multi-core CPU? Is there a difference?

- They are both built for parallel programming. However, multi-core CPUs can also be used for sequential programming (they provide branches and interrupts), whereas the GPU is built for parallel programming from scratch.
GPU DESIGN BELIEFS
- Lots of simple compute units.
- An explicitly parallel programming model:
  - we know there are many processors, and we do not depend on the compiler, for example, to parallelize the task for us.
- Optimized for throughput, not latency.
INTRO TO PARALLEL PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
- Intel 8-core Ivy Bridge:
  - 8-wide AVX vector operations per core,
  - 2 threads per core (hyper-threading).
- This means the processor offers 8 × 8 × 2 = 128-way parallelism.
- Parallel programming is more complex; however, running a sequential C program uses less than 1% of this processor's capability.
CUDA PLATFORM

A CUDA program (C with extensions) targets two processors:
- the CPU, the "Host", and
- the GPU, the "Device" (a co-processor),
each with its own memory.

- The CUDA compiler generates two separate programs: one for the CPU (host) and another for the GPU (device).
- The CPU is in charge and controls the GPU:
  - it moves data between memories (cudaMemcpy),
  - allocates memory on the GPU (cudaMalloc),
  - and invokes programs (kernels) on the GPU: "the host launches kernels on the device".
QUIZ
The GPU can do the following:

- Initiate a data send from GPU to CPU
- Respond to a CPU request to send data from GPU to CPU
- Initiate a data request from CPU to GPU
- Respond to a CPU request to receive data from CPU to GPU
- Compute a kernel launched by the CPU
- Compute a kernel launched by the GPU

Answer: respond to a CPU request to send data from GPU to CPU, respond to a CPU request to receive data from CPU to GPU, and compute a kernel launched by the CPU. In this model the CPU is in charge; the GPU does not initiate transfers.
TYPICAL GPU PROGRAM
- The CPU allocates storage on the GPU.
- The CPU copies input data from CPU to GPU.
- The CPU launches the kernels on the GPU to process the data.
- The CPU copies the results back from the GPU to the CPU.

- If you need to move data back and forth between CPU and GPU many times, CUDA may not be a good fit for your program, because each round trip takes all the steps shown above (a minimal sketch of the flow follows).
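A minimal host-side sketch of these four steps, assuming a hypothetical kernel named process and float data (error checking omitted):

#include <cuda_runtime.h>

__global__ void process(float *d_out, const float *d_in);   // hypothetical kernel, defined elsewhere

void run(const float *h_in, float *h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;

    cudaMalloc(&d_in, bytes);                                 // 1. CPU allocates storage on the GPU
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // 2. CPU copies input data to the GPU

    process<<<(n + 255) / 256, 256>>>(d_out, d_in);           // 3. CPU launches the kernel on the GPU

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // 4. CPU copies the results back
    cudaFree(d_in);
    cudaFree(d_out);
}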
MAIN ISSUE
- Defining the GPU computation:
  - write a kernel like a serial program,
  - and when launching the kernel, tell the GPU how many threads to launch.
QUIZ
What is the GPU good at?

- Launching a small number of threads efficiently
- Launching a large number of threads efficiently
- Running one thread very quickly
- Responding to a CPU request to receive data from CPU to GPU
- Running one thread that does lots of work in parallel
- Running a large number of threads in parallel

Answer: launching a large number of threads efficiently, and running a large number of threads in parallel.
GPU POWER
- Example:
  - In:  [1, 2, 3, ..., 64]
  - Out: [1², 2², 3², ..., 64²]

- Sequential solution:
  for (int i = 0; i < 64; i++)
      out[i] = in[i] * in[i];
- Here we have 1 thread doing 64 multiplications, each taking 2 ns.
GPU POWER (CONT.)
- Example:
  - In:  [1, 2, 3, ..., 64]
  - Out: [1², 2², 3², ..., 64²]

- Division of labor:
  - CPU: allocates memory, copies data to/from the GPU, launches the kernel.
  - GPU: computes out = in * in.

- Parallel solution (a full sketch follows below):
  - CPU code: square kernel <<<64>>>(out, in)
  - Here we have 64 threads, each doing 1 multiplication, which takes 10 ns.
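A complete, minimal sketch of this example. Note that a real CUDA launch takes the form kernel<<<blocks, threadsPerBlock>>>, so the slide's informal <<<64>>> becomes <<<1, 64>>> (one block of 64 threads); variable names are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void square(float *d_out, const float *d_in) {
    int i = threadIdx.x;                  // each of the 64 threads squares one element
    d_out[i] = d_in[i] * d_in[i];
}

int main(void) {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = i + 1.0f;     // [1, 2, ..., 64]

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    square<<<1, N>>>(d_out, d_in);                      // 1 block of 64 threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.0f ... %.0f\n", h_out[0], h_out[N - 1]);  // prints 1 ... 4096
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}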
EXAMPLE

THREADS AND BLOCKS
THREADS

- A thread is a single execution unit that runs a kernel on the GPU. It is similar to a CPU thread, but there are usually many more of them. Threads are sometimes drawn as arrows.
BLOCKS

- Thread blocks are a virtual collection of threads.
- All the threads in any single thread block can communicate (see the sketch below).
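As an illustration (not from the slides), a small sketch of threads within one block communicating through shared memory; the kernel name blockReverse and the block size of 64 are assumptions:

__global__ void blockReverse(float *d_data) {
    __shared__ float tile[64];               // memory visible to every thread in this block
    int t = threadIdx.x;

    tile[t] = d_data[t];                     // each thread writes one element
    __syncthreads();                         // wait until the whole block has written

    d_data[t] = tile[blockDim.x - 1 - t];    // read an element written by another thread
}

// host side: blockReverse<<<1, 64>>>(d_data);   // one block of 64 threads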
GRID

- A kernel is launched as a collection of thread blocks called the grid.
MAXIMUMS
- You can launch up to 1024 threads per block (or 512 if your card is compute capability 1.3 or less).
- You can launch up to 2^31 − 1 blocks in a single launch (or 2^16 − 1 if your card is compute capability 2.0 or less).
- So my relatively inexpensive GeForce GT 440 can launch a rather ridiculous 67,108,864 threads.
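Rather than memorizing these limits, you can query them at run time. A small sketch using the standard CUDA runtime call cudaGetDeviceProperties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // properties of device 0

    printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size in x:    %d\n", prop.maxGridSize[0]);
    return 0;
}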
WHY BLOCKS AND THREADS?
- You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
- Suppose you wrote a program for a GPU that can run 2,000 threads concurrently, and then you want to execute the same code on a higher-end GPU that can run 6,000 threads. Are you going to rewrite the code for each GPU?
- Each GPU has a limit on the number of threads per block, but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, executing some number of threads simultaneously.
- By adding this extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker with absolutely no change to the code (see the sketch below).
- NVIDIA has done this to allow automatic performance gains when your code runs on different, higher-performance GPUs.
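One common way to keep the launch independent of the GPU, sketched here with assumed names someKernel, d_data and n, is to derive the number of blocks from the problem size:

// host side: the block size stays fixed; the block count scales with the problem
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up so every element is covered
someKernel<<<blocks, threadsPerBlock>>>(d_data, n);

A bigger GPU simply schedules more of those blocks at the same time; the source code does not change.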
DIM3

DIM3 DATA TYPE

- dim3 is a 3D structure (vector type) with three integers: x, y and z. You can initialize as many of the three coordinates as you like:

  dim3 threads(256);            // x = 256, y and z default to 1
  dim3 blocks(100, 100);        // x = 100, y = 100, z defaults to 1
  dim3 anotherOne(10, 54, 32);  // x = 10, y = 54, z = 32
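A brief sketch of how dim3 values are typically passed to a kernel launch; the image dimensions and the kernel name blurKernel are assumptions:

dim3 threads(16, 16);                                    // 16 x 16 = 256 threads per block
dim3 blocks((width + 15) / 16, (height + 15) / 16);      // enough blocks to cover a width x height image
blurKernel<<<blocks, threads>>>(d_img, width, height);   // hypothetical kernel and arguments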
THREAD ACCESS PARAMETERS
- Each running thread is individual; it knows the following:
  - threadIdx ← thread index within the block
  - blockIdx ← block index within the grid
  - blockDim ← number of threads in the block
  - gridDim ← number of blocks in the grid
- Each of these is a dim3 structure and can be read in the kernel to assign particular workloads to any thread.
THREAD ACCESS PATTERN
- It is common to have threads calculate a unique id within the kernel to process some specific data. If we launch a kernel with:
  SomeKernel<<<100, 25>>>(...);
- then inside the kernel each thread can calculate a unique id with:
  int id = blockIdx.x * blockDim.x + threadIdx.x;
- So the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
  int id = 4 * 25 + 5 = 105
- and the thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
  int id = 76 * 25 + 14 = 1914
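A minimal kernel sketch using that id; the kernel body (doubling each element) and the bounds check are illustrative additions, not from the slides:

__global__ void SomeKernel(float *d_data, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // unique id across the grid (0..2499 for <<<100, 25>>>)
    if (id < n)                                       // guard in case n is not a multiple of blockDim.x
        d_data[id] = 2.0f * d_data[id];               // each thread processes its own element
}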
MAPPING

MAP
- Set of elements to process [64 floats]
- Function to run on each element [square]

Map(elements, function)
QUIZ
Which programs can be solved using Map?

- Sort an input array
- Add one to each element of an input array
- Sum up all the elements of an input array
- Compute the average of an input array

Answer: add one to each element. Map applies an independent function to each element; sorting, summing, and averaging all need communication across elements.
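A minimal sketch of that map as a CUDA kernel; the name addOne and the launch configuration are illustrative:

__global__ void addOne(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per element
    if (i < n)
        d_data[i] = d_data[i] + 1.0f;                 // the "function" the map applies to each element
}

// host side: addOne<<<(n + 255) / 256, 256>>>(d_data, n);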
