Part 1 - Lecture 2 - Parallel Hardware
• Introduction
• Parallel Hardware
• Parallel Software
Roadmap
• Some background
• Parallel hardware
Some background
Serial hardware and software
[Figure: the CPU fetches/reads program instructions and input data from memory and writes/stores results back; this single CPU-memory pathway is the von Neumann bottleneck.]
An operating system “process”
• Components of a process:
• The executable machine language program.
• A block of memory.
• Allocated resources, security information, and the state of the process.
Multitasking
• Each process runs for a short time slice; after its time is up, it blocks and waits until it gets another turn.
Threading
• Threads are contained within processes.
• Starting a thread is called forking.
• Terminating a thread is called joining.
Modifications to the von Neumann model
Basics of caching
• Caches are organized in levels: L1 (smallest and fastest), then L2, then L3 (largest and slowest).
Suppose assembling a car involves three tasks, taking (say) 20, 15, and 10 minutes. Then, if all three tasks were performed by a single station, the factory would output one car every 45 minutes.
By using a pipeline of three stations, the factory would output the first
car in 45 minutes, and then a new one every 20 minutes.
As this example shows, pipelining does not decrease the latency, that
is, the total time for one item to go through the whole system. It does
however increase the system's throughput, that is, the rate at which
new items are processed after the first one.
Multiple Issue
• Multiple issue processors replicate functional units and
try to simultaneously execute different instructions in a
program.
• A programmer can write code that exposes independent instructions, e.g. by unrolling loops, for the hardware to exploit.
Parallel hardware
Flynn’s Taxonomy
• SISD: single instruction stream, single data stream
• SIMD: single instruction stream, multiple data streams
• MISD: multiple instruction stream, single data stream
• MIMD: multiple instruction stream, multiple data streams
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• Single instruction:
• only one instruction stream is being acted on by
the CPU during any one clock cycle.
• Single data:
• only one data stream is being used as input during
any one clock cycle
• Deterministic execution.
• This is the oldest and, until recently, the most prevalent form of computer.
• Examples: relatively old PCs.
SIMD
• Parallelism achieved by dividing data among the processors.
• Applies the same instruction to multiple data items.
[Figure: a single control unit broadcasts each instruction to n ALUs, one per data item.]
• In classic design, they must also operate synchronously (at the same
time).
• Efficient for large data parallel problems, but not other types of
more complex parallel problems.
Graphics Processing Units (GPU)
• Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object.
GPUs
•A graphics processing pipeline converts the internal
representation into an array of pixels that can be sent to a
computer screen.
A shared-memory example: two cores share the variable x.

x = 2; /* shared variable */
Core 0 executes: y0 = x; then x = 7;
Core 1 executes: y1 = 3*x; then later z1 = 4*x;

y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ??? (28 if core 1 sees the updated x = 7, but 8 if it still uses a stale cached copy)
Shared Memory : UMA vs. NUMA
• Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor
updates a location in shared memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.
• Non-Uniform Memory Access (NUMA):
• Advantages
• Each processor can rapidly access its own memory without interference and without the overhead incurred in maintaining cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
• Disadvantages
• The programmer is responsible for many of the details associated with data communication between processors.
• Memory access time is not uniform: access to memory attached to another processor is slower than access to local memory.
Hybrid Distributed-Shared Memory
• The largest and fastest computers in the world today employ both shared and distributed
memory architectures.
• The shared memory component is usually a cache coherent SMP machine. Processors on a
given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs. SMPs know only
about their own memory - not the memory on another SMP. Therefore, network
communications are required to move data from one SMP to another.
• Current trends seem to indicate that this type of memory architecture will continue to prevail
and increase at the high end of computing for the foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and distributed memory
architectures.
Interconnection networks
• Two categories:
• Shared memory interconnects (Buses and Crossbars)
• Distributed memory interconnects (Ethernet etc.)
Shared memory interconnects
Bus interconnect
• A collection of parallel communication wires together with some
hardware that controls access to the bus.
Message transmission time = latency + message size / bandwidth
(latency in seconds; bandwidth in bytes per second)