CSCI 8150

Advanced Computer Architecture

Hwang, Chapter 1
Parallel Computer Models
1.2 Multiprocessors and
Multicomputers
Categories of Parallel Computers
Considering their architecture only, there are
two main categories of parallel computers:
systems with shared common memories, and
systems with unshared distributed memories.
Shared-Memory Multiprocessors
Shared-memory multiprocessor models:
Uniform-memory-access (UMA)
Nonuniform-memory-access (NUMA)
Cache-only memory architecture (COMA)
These systems differ in how the memory and
peripheral resources are shared or
distributed.
The UMA Model - 1
Physical memory uniformly shared by all
processors, with equal access time to all
words.
Processors may have local cache memories.
Peripherals also shared in some fashion.
Tightly coupled systems use a common bus,
crossbar, or multistage network to connect
processors, peripherals, and memories.
Many manufacturers have multiprocessor
(MP) extensions of uniprocessor (UP) product
lines.
The UMA Model - 2
Synchronization and communication among
processors achieved through shared
variables in common memory.
Symmetric MP systems – all processors have
access to all peripherals, and any processor
can run the OS and I/O device drivers.
Asymmetric MP systems – not all peripherals
accessible by all processors; kernel runs only
on selected processors (master); others are
called attached processors (AP).
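Example: Shared-Variable Synchronization
A minimal sketch of the shared-variable coordination just described, using Python threads as a stand-in for UMA processors sharing one address space; the counter, lock, and thread count are illustrative assumptions, not part of Hwang's text.

# Minimal sketch: Python threads stand in for processors sharing one
# address space; a lock serializes updates to a variable in "common memory".
import threading

shared_counter = 0                 # shared variable in common memory
lock = threading.Lock()            # synchronization through a shared primitive

def processor(n_increments):
    global shared_counter
    for _ in range(n_increments):
        with lock:                 # critical section: one processor at a time
            shared_counter += 1

threads = [threading.Thread(target=processor, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_counter)              # 40000: every update is visible to all threads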
The UMA Multiprocessor Model
[Figure: processors P1, P2, …, Pn connected through a system interconnect (bus, crossbar, or multistage network) to shared memories SM1 … SMm and shared I/O.]
Example: Performance Calculation
Consider two loops. The first loop adds
corresponding elements of two N-element
vectors to yield a third vector. The second
loop sums elements of the third vector.
Assume each add/assign operation takes 1
cycle, and ignore time spent on other actions
(e.g. loop counter incrementing/testing,
instruction fetch, etc.). Assume
interprocessor communication requires k
cycles.
On a sequential system, each loop will require N cycles, for a total of 2N cycles of execution time.
Example: Performance Calculation
On an M-processor system, we can partition each
loop into M parts, each having L = N / M add/assigns
requiring L cycles. The total time required is thus
2L. This leaves us with M partial sums that must be
totaled.
Computing the final sum from the M partial sums
requires l = log2(M) additions, each requiring k
cycles (to access a non-local term) and 1 cycle (for
the add/assign), for a total of l × (k+1) cycles.
The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles.
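Example: Sketch of the Parallel Sum
A minimal sketch of the algorithm above, assuming M divides N and M is a power of two. It runs sequentially and merely counts cycles under the stated cost model (1 cycle per add/assign, k cycles per non-local access), so the function name and structure are illustrative only.

# Simulates the partitioned vector add plus log2(M) tree reduction and
# counts cycles according to the formula 2N/M + (k + 1) log2(M).
def parallel_sum_cycles(a, b, M, k):
    N = len(a)
    assert N % M == 0 and M & (M - 1) == 0   # M divides N, M is a power of two
    L = N // M                               # add/assigns per processor

    parts = []
    for p in range(M):                       # conceptually, all M run at once
        lo, hi = p * L, (p + 1) * L
        c = [a[i] + b[i] for i in range(lo, hi)]   # first loop: L cycles
        parts.append(sum(c))                       # second loop: L cycles
    cycles = 2 * L                           # the M partitions run in parallel

    while len(parts) > 1:                    # log2(M) combining rounds
        parts = [parts[i] + parts[i + 1] for i in range(0, len(parts), 2)]
        cycles += k + 1                      # k to fetch a remote term, 1 to add

    return parts[0], cycles

total, cycles = parallel_sum_cycles(list(range(1024)), list(range(1024)), M=8, k=200)
print(total, cycles)                         # cycles = 2*128 + 3*201 = 859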
Example: Performance Calculation
Assume N = 2^20.
Sequential execution requires 2N = 2^21 cycles.
If processor synchronization requires k = 200
cycles, and we have M = 256 processors, parallel
execution requires
2N / M + (k + 1) log2(M)
= 2^21 / 2^8 + 201 × 8
= 2^13 + 1608 = 9800 cycles
Comparing results, the parallel solution is 214 times
faster than the sequential, with the best theoretical
speedup being 256 (since there are 256
processors). Thus the efficiency of the parallel
solution is 214 / 256 = 83.6 %.
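Example: Checking the Arithmetic
A quick check of the numbers above (a throwaway sketch; the variable names are mine):

# N = 2**20 elements, M = 256 processors, k = 200 cycles of synchronization.
from math import log2

N, M, k = 2**20, 256, 200
t_seq = 2 * N                                  # 2,097,152 cycles
t_par = 2 * N // M + (k + 1) * int(log2(M))    # 8192 + 1608 = 9800 cycles
speedup = t_seq / t_par                        # ~214
efficiency = speedup / M                       # ~0.836

print(t_par, round(speedup), f"{efficiency:.1%}")   # 9800 214 83.6%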
The NUMA Model - 1
Shared memories, but access time depends on the
location of the data item.
The shared memory is distributed among the
processors as local memories, but each of these is
still accessible by all processors (with varying
access times).
Memory access is fastest from the locally-connected
processor, with the interconnection network adding
delays for other processor accesses.
Additionally, there may be global memory in a
multiprocessor system, with two separate
interconnection networks, one for clusters of
processors and their cluster memories, and another
for the global shared memories.
Shared Local Memories
[Figure: each processor P1 … Pn has an attached local memory LM1 … LMn; every local memory is reachable by all processors through the interconnection network.]
Hierarchical Cluster Model
[Figure: clusters of processors (P), each cluster with its own cluster interconnection network (CIN) and cluster shared memories (CSM); the clusters connect through a global interconnection network to the global shared memories (GSM).]
The COMA Model
In the COMA model, processors only have
cache memories; the caches, taken
together, form a global address space.
Each cache has an associated directory that
aids remote machines in their lookups;
hierarchical directories may exist in
machines based on this model.
Initial data placement is not critical, as cache
blocks will eventually migrate to where they
are needed.
Cache-Only Memory Architecture
[Figure: processors (P), each with a cache (C) and cache directory (D), connected by an interconnection network; the caches together form the global address space.]
Other Models
There can be other models used for
multiprocessor systems, based on a
combination of the models just presented.
For example:
cache-coherent non-uniform memory access
(each processor has a cache directory, and the
system has a distributed shared memory)
cache-coherent cache-only model (processors
have caches, no shared memory, caches must be
kept coherent).
Multicomputer Models
Multicomputers consist of multiple computers, or
nodes, interconnected by a message-passing
network.
Each node is autonomous, with its own processor
and local memory, and sometimes local peripherals.
The message-passing network provides point-to-
point static connections among the nodes.
Local memories are not shared, so traditional
multicomputers are sometimes called no-remote-
memory-access (or NORMA) machines.
Inter-node communication is achieved by passing
messages through the static connection network.
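Example: Message-Passing (NORMA) Programming Model
A minimal sketch of the programming model rather than the hardware: two OS processes with private memories exchange data only by sending messages over a point-to-point channel. The Pipe, ranks, and sample data are illustrative assumptions.

# Two "nodes" (OS processes) share no memory; they cooperate only by
# passing messages over a point-to-point link (a Pipe standing in for a
# static network channel).
from multiprocessing import Process, Pipe

def node(rank, conn, data):
    local = sum(data)              # compute on the node's private local memory
    if rank == 1:
        conn.send(local)           # ship the partial result to node 0
    else:
        remote = conn.recv()       # wait for the other node's message
        print("total =", local + remote)

if __name__ == "__main__":
    end0, end1 = Pipe()            # one point-to-point channel
    p0 = Process(target=node, args=(0, end0, [1, 2, 3, 4]))
    p1 = Process(target=node, args=(1, end1, [5, 6, 7, 8]))
    p0.start(); p1.start()
    p0.join(); p1.join()           # prints: total = 36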
Generic Message-Passing Multicomputer
[Figure: autonomous nodes, each consisting of a processor (P) and local memory (M), connected point-to-point by a message-passing interconnection network.]
Multicomputer Generations
Each multicomputer uses routers and channels in its interconnection network, and heterogeneous systems may involve mixed node types with uniform data representation and communication protocols.
First generation: hypercube architecture, software-
controlled message switching, processor boards.
Second generation: mesh-connected architecture,
hardware message switching, software for medium-
grain distributed computing.
Third generation: fine-grained distributed
computing, with each VLSI chip containing the
processor and communication resources.