0% found this document useful (0 votes)

113 views60 pages

Part 1 - Lecture 2 - Parallel Hardware

This document discusses parallel hardware architectures. It describes the von Neumann model and how it has been modified with caches and instruction-level parallelism techniques like pipelining and multiple issue. It also covers Flynn's taxonomy of parallel systems - SISD, SIMD, MISD, and MIMD. Shared memory systems like UMA and NUMA multicore CPUs are discussed. Cache coherence in shared memory systems is an important concept to ensure consistency across processor caches.

Uploaded by

Ahmad Abba

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

113 views60 pages

Part 1 - Lecture 2 - Parallel Hardware

Uploaded by

Ahmad Abba

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 60

CSC4305: Parallel Programming

Lecture 2: Parallel Hardware

Sana Abdullahi Mu’az & Ahmad Abba Datti

Bayero University, Kano
Part 1: Foundations

• Introduction

• Parallel Hardware

• Parallel Software
Roadmap

• Some background

• Modifications to the von Neumann model

• Parallel hardware
Some background
Serial hardware and software
programs
input

Computer runs one

program at a time.
output
The von Neumann Architecture
Main memory
• This is a collection of locations, each of which is capable of
storing both instructions and data.

• Every location consists of an address, which is used to

access the location, and the contents of the location.
Central processing unit (CPU)

Divided into two parts.

• Control unit - responsible for deciding which instruction

in a program should be executed. (oga at the top)

• Arithmetic and logic unit (ALU) - responsible for

executing the actual instructions. (the worker)
Key terms
• Register – very fast storage, part of the CPU.
• Program counter – stores address of the next instruction to
be executed.

• Bus – wires and hardware that connects the CPU and

memory.
memory

fetch/read

CPU
memory

write/store

CPU
von Neumann bottleneck
An operating system “process”

• An instance of a computer program that is being executed.

• Components of a process:
• The executable machine language program.
• A block of memory.
• allocated resources, security information, state of the process.
Multitasking

• Gives the illusion that a single processor system is running

multiple programs simultaneously.

• Each process takes turns running. (time slice)

• After its time is up, it waits until it has a turn again. (blocks)
Threading
• Threads are contained within processes.

• They allow programmers to divide their programs into

(more or less) independent tasks.

• The hope is that when one thread blocks because it is

waiting on a resource, another will have work to do and can
run.
A process and two threads
the “master” thread

terminating a thread
starting a thread Is called joining
Is called forking
Modifications to the von neumann
model
Basics of caching

• A collection of memory locations that can be accessed in

less time than some other memory locations.

• A CPU cache is typically located on the same chip, or one

that can be accessed much faster than ordinary memory.
Levels of Cache
smallest & fastest

largest & slowest

Instruction Level Parallelism (ILP)

• Attempts to improve processor performance by having

multiple processor components or functional units
simultaneously executing instructions.
Instruction Level Parallelism

• Pipelining - functional units are arranged in stages.

• Multiple issue - multiple instructions can be

simultaneously initiated.
Pipelining
Pipelining
Suppose that assembling one car requires three tasks that take 20, 10,
and 15 minutes, respectively.

Then, if all three tasks were performed by a single station, the factory
would output one car every 45 minutes.

By using a pipeline of three stations, the factory would output the first
car in 45 minutes, and then a new one every 20 minutes.

As this example shows, pipelining does not decrease the latency, that
is, the total time for one item to go through the whole system. It does
however increase the system's throughput, that is, the rate at which
new items are processed after the first one.
Multiple Issue
• Multiple issue processors replicate functional units and
try to simultaneously execute different instructions in a
program.

for (i = 0; i < 1000; i++)

z[i] = x[i] + y[i];
z[3] z[4]
z[1] z[2]
adder #1 adder #2
So far....
 Classical Von Neumann Architecture
 Serial
 Von Neumann bottleneck
 OS Level
 Process
 Threads
 Multitasking
 Modified Von Neumann Architecture
 Caches
ILP
 Pipelining

 Multiple Issue
A programmer can write code to exploit.

Parallel hardware
Flynn’s Taxonomy

SISD (SIMD)
Single instruction stream Single instruction stream
Single data stream Multiple data stream

MISD (MIMD)
Multiple instruction stream Multiple instruction stream
Single data stream Multiple data stream
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• Single instruction:
• only one instruction stream is being acted on by
the CPU during any one clock cycle.

Single data:
• only one data stream is being used as input during
any one clock cycle

• Deterministic execution.
• This is the oldest and until recently, the
most prevalent form of computer
• Examples: relatively old PCs.
SIMD
• Parallelism achieved by dividing data among the processors.
• Applies the same instruction to multiple data items.

• Goes with data parallelism.

SIMD example

control unit
n data items
n ALUs

x[1] x[2] … x[n]

ALU1 ALU2 ALUn

for (i = 0; i < n; i++)

x[i] += y[i];
SIMD
• What if we don’t have as many ALUs as data items?
• Divide the work and process iteratively.
• Ex. m = 4 ALUs and n = 15 data items.
Round3 ALU1 ALU2 ALU3 ALU4
1 X[0] X[1] X[2] X[3]
2 X[4] X[5] X[6] X[7]
3 X[8] X[9] X[10] X[11]
4 X[12] X[13] X[14]
SIMD drawbacks
• All ALUs are required to execute the same instruction, or remain
idle.

• In classic design, they must also operate synchronously (at the same
time).

• The ALUs have no instruction storage.

• Efficient for large data parallel problems, but not other types of
more complex parallel problems.
Graphics Processing Units (GPU)
• Real time graphics application programming interfaces or
API’s use points, lines, and triangles to internally represent
the surface of an object.
GPUs
•A graphics processing pipeline converts the internal
representation into an array of pixels that can be sent to a
computer screen.

• Several stages of this pipeline (called shader functions) are

programmable.
• Typically just a few lines of C code.
GPUs

• Shader functions are also implicitly parallel, since they can

be applied to multiple elements in the graphics stream.

• GPU’s can often optimize performance by using SIMD

parallelism.

• The current generation of GPU’s use SIMD parallelism.

• Although they are not pure SIMD systems.
MIMD

• Supports multiple simultaneous instruction streams

operating on multiple data streams.

• Typically consist of a collection of fully independent

processing units or cores, each of which has its own control
unit and its own ALU.
MIMD
MIMD Subcategories: Shared Memory System

•A collection of autonomous processors is connected to a

memory system via an interconnection network.

• Each processor can access each memory location.

• The processors usually communicate implicitly by accessing

shared data structures.
Shared Memory System

• Most widely available shared memory systems use one or

more multicore processors.
• (multiple CPU’s or cores on a single chip)
Shared Memory System
UMA multicore system
Time to access all the memory locations will be the same for all the cores.
NUMA multicore system
A memory location a core is directly connected to can be accessed faster than a
memory location that must be accessed through another chip.
Cache coherence

• Programmers have no control over caches

and when they get updated.

A shared memory system with two cores and two

caches
Cache coherence
y0 privately owned by Core 0
y1 and z1 privately owned by Core 1

x = 2; /* shared variable */

y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
Shared Memory : UMA vs. NUMA
• Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor
updates a location in shared memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.

• Non-Uniform Memory Access (NUMA):

• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Shared Memory: Pros and Cons
• Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
• Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared
memory-CPU path, and for cache coherent systems, geometrically increase
traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that insure
"correct" access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of
processors.
Distributed Memory System

• Clusters (Tightly Coupled)

• A collection of commodity systems (nodes).
• Connected by a commodity interconnection network.

• The Grid (Loosely Coupled)

• Nodes of a cluster are individual computations units joined
by a communication network.
Distributed Memory System
Distributed Memory: Pro and Con
• Advantages
• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.

• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and
networking.

• Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• Memory access time is not uniform
Hybrid Distributed-Shared Memory
• The largest and fastest computers in the world today employ both shared and distributed
memory architectures.

• The shared memory component is usually a cache coherent SMP machine. Processors on a
given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs. SMPs know only
about their own memory - not the memory on another SMP. Therefore, network
communications are required to move data from one SMP to another.
• Current trends seem to indicate that this type of memory architecture will continue to prevail
and increase at the high end of computing for the foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and distributed memory
architectures.
Interconnection networks

• Affects performance of both distributed and shared memory

systems.

• Two categories:
• Shared memory interconnects (Buses and Crossbars)
• Distributed memory interconnects (Ethernet etc.)
Shared memory interconnects
Bus interconnect
• A collection of parallel communication wires together with some
hardware that controls access to the bus.

• Communication wires are shared by the devices that are

connected to it.

• As the number of devices connected to the bus increases,

contention for use of the bus increases, and performance
decreases.
More definitions

• Any time data is transmitted, we’re interested in how long

it will take for the data to reach its destination.
• Latency
• The time that elapses between the source’s beginning to
transmit the data and the destination’s starting to receive the
first byte.
• Bandwidth
• The rate at which the destination receives data after it has
started to receive the first byte.
Message transmission time = l + n / b

latency (seconds)

length of message (bytes)

bandwidth (bytes per second)

Parallel Software
…next

CSC580 Quick Notes Lect1and2
100% (1)
CSC580 Quick Notes Lect1and2
18 pages
(并行课件w3) 第2讲 1&2
No ratings yet
(并行课件w3) 第2讲 1&2
143 pages
Pipeliningandvectorprocessing 140612142847 Phpapp01
No ratings yet
Pipeliningandvectorprocessing 140612142847 Phpapp01
53 pages
PAG Unit1
No ratings yet
PAG Unit1
64 pages
Parallel Prrocessor
No ratings yet
Parallel Prrocessor
12 pages
Unit 1 - Part 1
No ratings yet
Unit 1 - Part 1
51 pages
Pda 2
No ratings yet
Pda 2
105 pages
1 Introduction
No ratings yet
1 Introduction
30 pages
02 Lecture Flynn IN
No ratings yet
02 Lecture Flynn IN
78 pages
CC Unit 1.2
No ratings yet
CC Unit 1.2
39 pages
Chapter2 Part 3
No ratings yet
Chapter2 Part 3
27 pages
Seminar
No ratings yet
Seminar
85 pages
Unit Iv Parallelism
No ratings yet
Unit Iv Parallelism
80 pages
U1-Theory of Parallelism
No ratings yet
U1-Theory of Parallelism
43 pages
Organization CH 2
No ratings yet
Organization CH 2
102 pages
Intro To Parallel Computing
No ratings yet
Intro To Parallel Computing
127 pages
RG1 Intro ParallelArch HPCAI Jan2020
No ratings yet
RG1 Intro ParallelArch HPCAI Jan2020
47 pages
Shared Memory. Distributed Memory. Hybrid Distributed-Shared Memory
No ratings yet
Shared Memory. Distributed Memory. Hybrid Distributed-Shared Memory
22 pages
2 Ebook Writing Research Proposal PDF
100% (5)
2 Ebook Writing Research Proposal PDF
113 pages
2 Ebook Writing Research Proposal PDF
100% (5)
2 Ebook Writing Research Proposal PDF
113 pages
Parallelism and Multicores
No ratings yet
Parallelism and Multicores
54 pages
Parallel Computers
No ratings yet
Parallel Computers
39 pages
Comporg6 ch12
No ratings yet
Comporg6 ch12
36 pages
Memory in Multiprocessor System
No ratings yet
Memory in Multiprocessor System
52 pages
Unit - 01 Easid
No ratings yet
Unit - 01 Easid
18 pages
DS1822 - Parallel Computing - Unit 1
No ratings yet
DS1822 - Parallel Computing - Unit 1
23 pages
Module 4 - Architecture
No ratings yet
Module 4 - Architecture
22 pages
Architecture
No ratings yet
Architecture
67 pages
GPU v1.1
No ratings yet
GPU v1.1
25 pages
HP LaserJet Enterprise M725MFP Repair Manual PDF
0% (1)
HP LaserJet Enterprise M725MFP Repair Manual PDF
366 pages
Cloud Computing - Lecture 3
No ratings yet
Cloud Computing - Lecture 3
22 pages
Architecture1 1 (2012)
No ratings yet
Architecture1 1 (2012)
87 pages
Module 2 - Parallel Computing
No ratings yet
Module 2 - Parallel Computing
55 pages
HPC - Unit-1 Insem Notes
No ratings yet
HPC - Unit-1 Insem Notes
76 pages
HPA - Notes
No ratings yet
HPA - Notes
5 pages
Unit 7 - Parallel Processing Paradigm
No ratings yet
Unit 7 - Parallel Processing Paradigm
26 pages
KCS 713 Unit 1 Lecture 5
No ratings yet
KCS 713 Unit 1 Lecture 5
32 pages
Lec1 Introduction To Parallel Computing
No ratings yet
Lec1 Introduction To Parallel Computing
40 pages
Parallel Processing
No ratings yet
Parallel Processing
22 pages
Flynn's Classification
No ratings yet
Flynn's Classification
46 pages
Unit 1 - Part - 2
No ratings yet
Unit 1 - Part - 2
30 pages
Cs8083 MCP Unit I Notes
No ratings yet
Cs8083 MCP Unit I Notes
31 pages
Parallel Computing
No ratings yet
Parallel Computing
32 pages
Chapter - 5 Parallel Processing
No ratings yet
Chapter - 5 Parallel Processing
117 pages
CS802A Lec-2 PDF
No ratings yet
CS802A Lec-2 PDF
28 pages
Lec2 ParallelProgrammingPlatforms
No ratings yet
Lec2 ParallelProgrammingPlatforms
26 pages
Exercise SELECT CASE - Graduate Programming Courses - UCL Wiki
No ratings yet
Exercise SELECT CASE - Graduate Programming Courses - UCL Wiki
1 page
Parallel Processing
No ratings yet
Parallel Processing
31 pages
Parallel Processors From Client To Cloud: Omputer Rganization and Esign
No ratings yet
Parallel Processors From Client To Cloud: Omputer Rganization and Esign
43 pages
2 Parallel Computer Memory Architectures
No ratings yet
2 Parallel Computer Memory Architectures
26 pages
5 4 Parallel
No ratings yet
5 4 Parallel
47 pages
Parallel Processing Report
No ratings yet
Parallel Processing Report
9 pages
Baker CHPT 5 SIMD Good
No ratings yet
Baker CHPT 5 SIMD Good
94 pages
Am37 Motherboard
100% (1)
Am37 Motherboard
54 pages
Data Structures and Algorithms - Lecture 1 - Arrays
100% (5)
Data Structures and Algorithms - Lecture 1 - Arrays
25 pages
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
No ratings yet
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
33 pages
CP4253 Map Unit I
No ratings yet
CP4253 Map Unit I
31 pages
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
No ratings yet
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
47 pages
Parallel Computing: Er. Anupama Singh Department of Computer Science & Engg
No ratings yet
Parallel Computing: Er. Anupama Singh Department of Computer Science & Engg
22 pages
Chapter 1 (Parallel Computer Models)
No ratings yet
Chapter 1 (Parallel Computer Models)
20 pages
Parallel Processors: Session 2
No ratings yet
Parallel Processors: Session 2
32 pages
Slides Taken From: Parallel Computing Platforms
No ratings yet
Slides Taken From: Parallel Computing Platforms
11 pages
CSC4212 Lecture 2 - 3D Viewing
No ratings yet
CSC4212 Lecture 2 - 3D Viewing
19 pages
Computer Book
No ratings yet
Computer Book
164 pages
Win32 Processes and Threads
No ratings yet
Win32 Processes and Threads
168 pages
MIT6 035S10 Lec05
No ratings yet
MIT6 035S10 Lec05
117 pages
FORTRAN Exam Questions
100% (1)
FORTRAN Exam Questions
3 pages
Lecture # 28: Extended Partitions
No ratings yet
Lecture # 28: Extended Partitions
6 pages
ICS 431-Ch5-Processes Scheduling
No ratings yet
ICS 431-Ch5-Processes Scheduling
129 pages
Introduction To Shift-Reduce Parsing
No ratings yet
Introduction To Shift-Reduce Parsing
95 pages
MS-7592 r52 Unlocked
100% (1)
MS-7592 r52 Unlocked
33 pages
ECU Flash Tool User's Guide v1.00
No ratings yet
ECU Flash Tool User's Guide v1.00
5 pages
SISd
No ratings yet
SISd
17 pages
Microprocessor Lab Manual - Final
No ratings yet
Microprocessor Lab Manual - Final
47 pages
ARM2SID Quick Guide Rev2 Web
No ratings yet
ARM2SID Quick Guide Rev2 Web
16 pages
Design of 32-Bit Risc Processor and Efficient Verification
No ratings yet
Design of 32-Bit Risc Processor and Efficient Verification
6 pages
Top-Down Parsing
No ratings yet
Top-Down Parsing
73 pages
(MCU) Lecture 3 - Interrupts
No ratings yet
(MCU) Lecture 3 - Interrupts
17 pages
What Is Parallel Computing
No ratings yet
What Is Parallel Computing
9 pages
Java Programming Language Lecturer: Kamaluddin Behzad
No ratings yet
Java Programming Language Lecturer: Kamaluddin Behzad
21 pages
CSC3201 - Compiler Construction (Part II) - Lecture 5 - Code Generation
No ratings yet
CSC3201 - Compiler Construction (Part II) - Lecture 5 - Code Generation
64 pages
CSC4221 Lecture 1 - Introduction
No ratings yet
CSC4221 Lecture 1 - Introduction
60 pages
Fazer o Download Do Livro Os Ebos Secreto Dos Vodum de Andreia Camargo PDF
No ratings yet
Fazer o Download Do Livro Os Ebos Secreto Dos Vodum de Andreia Camargo PDF
5 pages
Sahara 3810 Manual Eng
No ratings yet
Sahara 3810 Manual Eng
92 pages
Part 1 - Lecture 3 - Parallel Software-1
No ratings yet
Part 1 - Lecture 3 - Parallel Software-1
45 pages
LENOVO
No ratings yet
LENOVO
69 pages
Part 1 - Lecture 1 - Introduction Parallel Computing
No ratings yet
Part 1 - Lecture 1 - Introduction Parallel Computing
33 pages
CSC4212 Lecture 3 - 3D Viewing - Projection Transformation
No ratings yet
CSC4212 Lecture 3 - 3D Viewing - Projection Transformation
31 pages
Question Bank Module
No ratings yet
Question Bank Module
4 pages
Midterm Exam A Solutions, CS 1313 010 Spring 2000, University of Oklahoma, Norman
No ratings yet
Midterm Exam A Solutions, CS 1313 010 Spring 2000, University of Oklahoma, Norman
13 pages
CSC4221 Lecture 2 - Graphics System
No ratings yet
CSC4221 Lecture 2 - Graphics System
21 pages
CSC3201 - Compiler Construction (Part II) - Lecture 1 - Type Checking
No ratings yet
CSC3201 - Compiler Construction (Part II) - Lecture 1 - Type Checking
13 pages
C++ Chap1 Quick Note
No ratings yet
C++ Chap1 Quick Note
14 pages
Cs 0411 Midterm Examination: March 3, 2010 Duration: One and Half Hours
No ratings yet
Cs 0411 Midterm Examination: March 3, 2010 Duration: One and Half Hours
7 pages
Data Structures and Algorithms - Linked Lists
No ratings yet
Data Structures and Algorithms - Linked Lists
16 pages
CS1007 Microprocessor and Interfacing 2014 15
No ratings yet
CS1007 Microprocessor and Interfacing 2014 15
8 pages
Iboard Pro: - Enhanced Ethernet Development Platform
No ratings yet
Iboard Pro: - Enhanced Ethernet Development Platform
6 pages
MC Quiz Questions PDF
No ratings yet
MC Quiz Questions PDF
2 pages
Rekod Transaksi
No ratings yet
Rekod Transaksi
6 pages
T200 Ident Printer - TE
No ratings yet
T200 Ident Printer - TE
3 pages
Computer. Ix Hassan
No ratings yet
Computer. Ix Hassan
10 pages
Unit 1 Ict
No ratings yet
Unit 1 Ict
7 pages
C1 OS Fundamentals
No ratings yet
C1 OS Fundamentals
19 pages
Open Gapps Log
No ratings yet
Open Gapps Log
2 pages
Exercises in Fortran Programming
No ratings yet
Exercises in Fortran Programming
2 pages
FORTRAN90 - Practical Exercises: Session - 1
No ratings yet
FORTRAN90 - Practical Exercises: Session - 1
3 pages
Computer Architecture Assignment 1
No ratings yet
Computer Architecture Assignment 1
8 pages
Practice Exam 1 With Answers
No ratings yet
Practice Exam 1 With Answers
4 pages
A Simple and Efficient FFT Implementation in C++ Part I
No ratings yet
A Simple and Efficient FFT Implementation in C++ Part I
4 pages
Communicating As A Scientists
No ratings yet
Communicating As A Scientists
3 pages
Mobile World Congress 2014
No ratings yet
Mobile World Congress 2014
2 pages
EECS 351-1 - Intro To Computer Graphics - Electrical Engineering & Computer Science - Northwestern Engineering
No ratings yet
EECS 351-1 - Intro To Computer Graphics - Electrical Engineering & Computer Science - Northwestern Engineering
2 pages
A3 Size Paper - Google Search
No ratings yet
A3 Size Paper - Google Search
1 page
Control Structure and Intrinsics
No ratings yet
Control Structure and Intrinsics
1 page
Power Macintosh 8500/150 and 8500/180
100% (1)
Power Macintosh 8500/150 and 8500/180
2 pages

Part 1 - Lecture 2 - Parallel Hardware

Uploaded by

Part 1 - Lecture 2 - Parallel Hardware

Uploaded by

CSC4305: Parallel Programming

Lecture 2: Parallel Hardware

Sana Abdullahi Mu’az & Ahmad Abba Datti

• Modifications to the von Neumann model

Computer runs one

• Every location consists of an address, which is used to

Divided into two parts.

• Control unit - responsible for deciding which instruction

• Arithmetic and logic unit (ALU) - responsible for

• Bus – wires and hardware that connects the CPU and

• An instance of a computer program that is being executed.

• Gives the illusion that a single processor system is running

• Each process takes turns running. (time slice)

• They allow programmers to divide their programs into

• The hope is that when one thread blocks because it is

• A collection of memory locations that can be accessed in

• A CPU cache is typically located on the same chip, or one

largest & slowest

• Attempts to improve processor performance by having

• Pipelining - functional units are arranged in stages.

• Multiple issue - multiple instructions can be

for (i = 0; i < 1000; i++)

• Goes with data parallelism.

x[1] x[2] … x[n]

for (i = 0; i < n; i++)

• The ALUs have no instruction storage.

• Several stages of this pipeline (called shader functions) are

• Shader functions are also implicitly parallel, since they can

• GPU’s can often optimize performance by using SIMD

• The current generation of GPU’s use SIMD parallelism.

• Supports multiple simultaneous instruction streams

• Typically consist of a collection of fully independent

•A collection of autonomous processors is connected to a

• Each processor can access each memory location.

• The processors usually communicate implicitly by accessing

• Most widely available shared memory systems use one or

• Programmers have no control over caches

A shared memory system with two cores and two

• Non-Uniform Memory Access (NUMA):

• Clusters (Tightly Coupled)

• The Grid (Loosely Coupled)

• Affects performance of both distributed and shared memory

• Communication wires are shared by the devices that are

• As the number of devices connected to the bus increases,

• Any time data is transmitted, we’re interested in how long

length of message (bytes)

bandwidth (bytes per second)

You might also like