Parralel 01

The document discusses parallel processing units and parallel computing. It explains that traditional CPUs are not the most energy-efficient processors because of their complex control hardware, while GPU-like processors are more efficient because they have simpler control structures and devote more transistors to computation. It also discusses how computer designers are building more power-efficient chips today by using more, simpler processors rather than fewer, more complex ones.

Uploaded by

demro channel


PARALLEL PROCESSING
UNIT 1

UNDERSTANDING PARALLEL ENVIRONMENT

QUIZ
What are three traditional ways hardware designers make computers run faster?
 Faster clocks (correct)
 Longer clock period
 More work per clock cycle (correct)
 Larger hard disk
 More processors (correct)
 Reduce the amount of memory

SEYMOUR CRAY (SUPERCOMPUTER DESIGNER)
 If you were plowing a field, which would you rather use?
⚫ Two strong oxen
⚫ 1024 chickens

PARALLEL COMPUTING
 It was originally intended for supercomputing.
 Now all computers and mobile devices use parallel computing.
 Modern GPUs have:
⚫ Hundreds of processors
⚫ Thousands of ALUs (about 3,000)
⚫ Tens of thousands of concurrent threads
 This requires a different way of programming than a single scalar processor.
 General-purpose programmability on the GPU (GPGPU).

TRANSISTORS CONTINUE ON MOORE’S PATH . . . FOR NOW

CLOCK SPEED (NO MORE SPEED)

QUIZ
 Are processors today getting faster because:
 We are clocking their transistors faster
 We have more transistors available for computation (correct)
o Why don’t we keep increasing the clock speed of a single processor instead of using multiple processors with a lower clock speed?
o We can’t, because of power (heat).

WHAT KIND OF PROCESSORS WILL WE BUILD?
 Assume the major design constraint is power.
 Why are traditional CPU-like processors not the most energy-efficient processors?
⚫ They have complex control hardware.
⚫ This increases flexibility and performance.
⚫ It also increases power consumption and design complexity.
 How do we increase power efficiency (GPU-like)?
⚫ Build a simple control structure.
⚫ Take those transistors and devote them to supporting more computation on the data path.
⚫ The challenge becomes how to program it.

MORE TO UNDERSTAND

MORE TO UNDERSTAND (CONT.)
[Figure labels: "Less speed with simple structure", "More speed with complex structure", "Less Power", "Power"]

QUIZ
 Which techniques are computer designers using today to build more power-efficient chips?
 Fewer, more complex processors
 More, simpler processors (correct)
 Maximizing the speed of the processor clock
 Increasing the complexity of the control hardware

ANOTHER FACTOR FOR POWER EFFICIENCY
 Power efficiency can be pursued in two ways:
⚫ Decrease latency (the amount of time to complete a task), measured as "time"
⚫ Increase throughput (tasks completed per unit time), measured as "number"
 The two goals are not aligned.
⚫ CPU-like: designed to decrease latency
⚫ GPU-like: designed to increase throughput
 The choice depends on the application (image processing, for example, prefers higher throughput).

SUPER QUIZ
 Why do I say "GPU-like" and not "multi-core CPU"? Is there a difference?
⚫ Both are built for parallel programming. However, multi-core CPUs can be used for both sequential and parallel programming (they provide branches and interrupts). GPUs, on the other hand, are built for parallel programming from scratch.

GPU DESIGN BELIEFS
 Lots of simple compute units
 Explicitly parallel programming model
⚫ We know there are many processors, and we do not depend on the compiler, for example, to parallelize the task for us.
 Optimized for throughput, not latency

INTRO TO PARALLEL PROGRAMMING

IMPORTANCE OF PARALLEL PROGRAMMING
 Intel 8-core Ivy Bridge
 8-wide AVX vector operations per core
 2 threads per core (Hyper-Threading)
 This means the processor has 128-way parallelism (8 cores × 8 vector lanes × 2 threads).
 Parallel programming is more complex; however, running a sequential C program means using less than 1% of this processor's power.

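The 128-way figure is the product of the three levels of parallelism on the slide. The following C++ sketch only reproduces that arithmetic; the function names are illustrative and do not come from the slides.

```cpp
// Sketch: how much of the chip a single scalar thread can use.
// Function names are hypothetical; the numbers come from the slide.

// Total ways of parallelism = cores x vector lanes x hardware threads per core.
constexpr int parallel_ways(int cores, int vector_lanes, int threads_per_core) {
    return cores * vector_lanes * threads_per_core;
}

// Percentage of the peak parallelism used by one scalar thread.
constexpr double sequential_share_percent(int ways) {
    return 100.0 / ways;
}
```

With the slide's numbers, parallel_ways(8, 8, 2) is 128, so one scalar thread uses 1/128 of the available parallelism, which is about 0.78% and therefore "less than 1%".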
CUDA PLATFORM
[Figure: a CUDA program (C with extensions) runs on the CPU ("Host") and on the GPU co-processor ("Device"); each has its own memory.]
 The CUDA compiler generates two separate programs: one for the CPU (host) and another for the GPU (device).
 The CPU is in charge and controls the GPU:
⚫ Moves data between memories (cudaMemcpy)
⚫ Allocates memory on the GPU (cudaMalloc)
⚫ Invokes programs (kernels) on the GPU: "the host launches kernels on the device"

QUIZ
The GPU can do the following:
 Initiate a data send from GPU to CPU
 Respond to a CPU request to send data from GPU to CPU (correct)
 Initiate a data request from CPU to GPU
 Respond to a CPU request to receive data from CPU to GPU (correct)
 Compute a kernel launched by the CPU (correct)
 Compute a kernel launched by the GPU

TYPICAL GPU PROGRAM
 The CPU allocates storage on the GPU.
 The CPU copies input data from the CPU to the GPU.
 The CPU launches kernels on the GPU to process the data.
 The CPU copies the results back from the GPU to the CPU.
 If you need to move data many times between the CPU and the GPU, CUDA is not a good fit for your program, because every round trip repeats the steps shown above.

MAIN ISSUE
 Defining the GPU computation
⚫ Write the kernel like a serial program.
⚫ When launching the kernel, tell the GPU how many threads to launch.

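The two points above can be illustrated without a GPU. The sketch below is a CPU-side emulation in C++, not CUDA: the kernel body is written as serial code for a single element, and a hypothetical launch function stands in for the kernel launch by calling that body once per "thread". All names here are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// CPU-side emulation only. In real CUDA this body would be a __global__
// function and thread_id would be derived from the built-in thread index.
// The kernel is written like serial code for ONE element.
void square_kernel(std::size_t thread_id, float* out, const float* in) {
    out[thread_id] = in[thread_id] * in[thread_id];
}

// Hypothetical stand-in for a kernel launch: runs the kernel body once per
// "thread". A GPU would run these invocations in parallel; this loop runs
// them one after another, which gives the same result because the
// invocations are independent of each other.
void launch(std::size_t num_threads, float* out, const float* in) {
    for (std::size_t t = 0; t < num_threads; ++t) {
        square_kernel(t, out, in);
    }
}

// Convenience wrapper: squares every element, one "thread" per element.
std::vector<float> square_all(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    launch(in.size(), out.data(), in.data());
    return out;
}
```

The point of the split is that square_kernel never loops over the data; the number of threads passed to launch decides how many elements are processed.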
QUIZ
What is the GPU good at?
 Launching a small number of threads efficiently
 Launching a large number of threads efficiently (correct)
 Running one thread very quickly
 Responding to a CPU request to receive data from CPU to GPU
 Running one thread that does lots of work in parallel
 Running a large number of threads in parallel (correct)

GPU POWER
 Example:
⚫ In: [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]
 Sequential solution:
for (int i = 0; i < 64; i++)
    out[i] = in[i] * in[i];
⚫ Here, one thread does 64 multiplications, each taking 2 ns.

GPU POWER (CONT.)
 Example:
⚫ In: [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]
 Division of work:
⚫ CPU: allocate memory, copy data to/from the GPU, launch the kernel
⚫ GPU: out = in * in
 Parallel solution:
⚫ CPU code: square<<<1, 64>>>(out, in), which launches one block of 64 threads
⚫ Here, 64 threads each do one multiplication, which takes 10 ns.

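The two timing figures can be compared directly. The C++ sketch below works through that arithmetic under simplifying assumptions that are not in the slides: all 64 threads really run at the same time, and launch and memory-copy overhead are ignored. The type and function names are illustrative.

```cpp
// Sketch of the arithmetic behind the example; names are illustrative.
struct Timing {
    double latency_ns;   // time until all results are ready
    double throughput;   // results produced per nanosecond
};

// One thread performs n operations back to back.
Timing sequential(int n, double ns_per_op) {
    double total = n * ns_per_op;
    return {total, n / total};
}

// n threads each perform one operation at the same time. Assumes the
// hardware can run all n threads concurrently and ignores launch and
// memory-copy overhead.
Timing parallel(int n, double ns_per_op) {
    return {ns_per_op, n / ns_per_op};
}
```

With the slide's numbers, sequential(64, 2.0) finishes in 128 ns while parallel(64, 10.0) finishes in 10 ns. Each individual multiplication is slower on the GPU, yet the whole job completes sooner, which is the latency versus throughput trade-off in miniature.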
EXAMPLE

THREADS AND BLOCKS

THREADS
 Single execution units that run kernels on the GPU. They are similar to CPU threads, but there are usually many more of them. They are sometimes drawn as arrows.

BLOCKS
 Thread blocks are a virtual collection of threads.
 All the threads in any single thread block can communicate.

GRID
 A kernel is launched as a collection of thread blocks called the grid.

MAXIMUMS
 You can launch up to 1024 threads per block (or 512 if your card is compute capability 1.3 or lower).
 You can launch up to 2^31 - 1 blocks in the x-dimension of a single launch (or 2^16 - 1 if your card is compute capability 2.x or lower).
 So my relatively inexpensive GeForce GT 440 can launch a rather ridiculous 67,107,840 threads (65,535 blocks × 1,024 threads) in a one-dimensional grid.

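The thread count quoted above is simply blocks multiplied by threads per block. The C++ sketch below reproduces that arithmetic with the limits passed in as parameters, since they depend on the card's compute capability. It assumes a one-dimensional grid, and the function name is illustrative.

```cpp
#include <cstdint>

// Upper bound on the threads in a one-dimensional launch:
// (maximum blocks in x) x (maximum threads per block).
// The limits are parameters rather than constants because they depend on
// the card's compute capability.
constexpr std::uint64_t max_threads_1d(std::uint64_t max_blocks_x,
                                       std::uint64_t max_threads_per_block) {
    return max_blocks_x * max_threads_per_block;
}
```

With a limit of 2^16 - 1 = 65,535 blocks and 1,024 threads per block the result is 67,107,840. Using 2^16 = 65,536 blocks instead would give 67,108,864, so the two figures differ by exactly one block of 1,024 threads.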
WHY BLOCKS AND THREADS?
 You may be wondering why we don't just say "launch 67 million threads" instead of organizing them into blocks.
 Suppose you wrote a program for a GPU that can run 2000 threads concurrently. Then you want to execute the same code on a higher-end GPU that can run 6000 threads. Are you going to change the whole code for each GPU?
 Each GPU has a limit on the number of threads per block but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, executing some number of threads simultaneously.
 By adding this extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker, with absolutely no change to the code.
 NVIDIA has done this to allow automatic performance gains when your code is run on different, higher-performance GPUs.

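A rough way to see why the same launch scales is to count how many rounds of blocks each GPU needs. The C++ sketch below uses a deliberately simplified model that is an assumption, not something from the slides: a GPU keeps a fixed number of blocks running at once and works through the grid in rounds. Real schedulers are more dynamic, and the function names are illustrative.

```cpp
#include <cstdint>

// Number of blocks needed to cover n elements with a fixed block size
// (integer ceiling division).
constexpr std::uint64_t blocks_needed(std::uint64_t n,
                                      std::uint64_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical model: a GPU that can keep `concurrent_blocks` blocks running
// at once works through the grid in this many rounds. This only illustrates
// why an unchanged launch finishes sooner on a bigger GPU.
constexpr std::uint64_t rounds(std::uint64_t total_blocks,
                               std::uint64_t concurrent_blocks) {
    return (total_blocks + concurrent_blocks - 1) / concurrent_blocks;
}
```

For one million elements and an illustrative block size of 250 threads, blocks_needed gives 4,000 blocks. A GPU running 2,000 threads at a time holds 8 such blocks and needs 500 rounds; one running 6,000 threads holds 24 blocks and needs 167 rounds. The launch itself is identical in both cases.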
DIM3

DIM3 DATA TYPE
 dim3 is a 3D structure, or vector type, with three integers: x, y and z. You can initialize as many of the three coordinates as you like:
⚫ dim3 threads(256); // Initializes x as 256; y and z will both be 1
⚫ dim3 blocks(100, 100); // Initializes x and y; z will be 1
⚫ dim3 anotherOne(10, 54, 32); // Initializes all three values: x will be 10, y will be 54 and z will be 32

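The defaulting behaviour described above can be mimicked on the CPU. The C++ sketch below defines a stand-in type, deliberately named Dim3 so it is not mistaken for the real CUDA dim3, whose unspecified components default to 1. The volume helper is a hypothetical addition for illustration.

```cpp
// CPU-side stand-in for CUDA's dim3 (named Dim3 to make clear it is not the
// real type). Unspecified components default to 1, as described above.
struct Dim3 {
    unsigned int x, y, z;
    constexpr Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Total number of elements described by a Dim3: threads in a block, or
// blocks in a grid. Hypothetical helper, not part of CUDA.
constexpr unsigned long long volume(const Dim3& d) {
    return static_cast<unsigned long long>(d.x) * d.y * d.z;
}
```

Under this sketch, Dim3 blocks(100, 100) describes 100 × 100 × 1 = 10,000 blocks and Dim3 anotherOne(10, 54, 32) describes 17,280 elements.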
THREAD ACCESS PARAMETERS
 Each running thread is individual; each one knows the following:
 threadIdx ← Thread index within the block
 blockIdx ← Block index within the grid
 blockDim ← Number of threads in the block
 gridDim ← Number of blocks in the grid
 Each of these has x, y and z components (blockDim and gridDim are dim3 structures; threadIdx and blockIdx are of the closely related type uint3) and can be read in the kernel to assign particular workloads to any thread.

THREAD ACCESS PATTERN
 It's common to have threads calculate a unique id within the kernel to process some specific data. If we launch a kernel with:
 SomeKernel<<<100, 25>>>(...);
 Inside the kernel, each thread can calculate a unique id with:
⚫ int id = blockIdx.x * blockDim.x + threadIdx.x;
 Indices start at 0, so the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
⚫ int id = 4 * 25 + 5 = 105
 The thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
⚫ int id = 76 * 25 + 14 = 1914
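
The id formula can be checked on the CPU by enumerating every (block, thread) pair of a one-dimensional launch. The C++ sketch below is an emulation, not CUDA code: the loop variables play the role of blockIdx.x and threadIdx.x, and the function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// The id formula for a single thread of a one-dimensional launch.
constexpr int global_id(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// CPU-side emulation of SomeKernel<<<grid_dim, block_dim>>>: returns the id
// that each (block, thread) pair would compute, in block-major order, so the
// result can be checked for gaps or duplicates.
std::vector<int> global_ids(int grid_dim, int block_dim) {
    std::vector<int> ids;
    ids.reserve(static_cast<std::size_t>(grid_dim) * block_dim);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            ids.push_back(global_id(block_idx, block_dim, thread_idx));
        }
    }
    return ids;
}
```

For the launch with 100 blocks of 25 threads, global_ids(100, 25) yields 2,500 ids covering 0 through 2,499 exactly once, which is why each thread can safely use its id to pick its own piece of data.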
MAPPING

MAP
 Set of elements to process [64 floats]
 Function to run on each element [square]
 Map(elements, function)

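The map pattern can be written sequentially to show its defining property: the function is applied to each element independently. The C++ sketch below is a CPU version with illustrative names (map_elements and square are assumptions, not taken from the slides); on a GPU each call to the function could be given to its own thread.

```cpp
#include <vector>

// Sequential sketch of the map pattern: apply f to every element
// independently. Because no element depends on another, each call could be
// handed to its own GPU thread.
template <typename T, typename F>
std::vector<T> map_elements(const std::vector<T>& elements, F f) {
    std::vector<T> result;
    result.reserve(elements.size());
    for (const T& e : elements) {
        result.push_back(f(e));
    }
    return result;
}

// The function from the example.
inline float square(float x) { return x * x; }
```

Squaring an array or adding one to each element fits this shape directly. Sorting or summing does not, because the output for one position depends on other elements.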
QUIZ
Which problems can be solved using Map?
 Sort an input array
 Add one to each element of an input array (correct)
 Sum up all elements of an input array
 Compute the average of an input array
