Chapter 5 - Multiprocessors and Thread-Level Parallelism: A Taxonomy of Parallel Architectures
(Figure: taxonomy of parallel architectures, from uniprocessors to parallel machines; PU = processing unit.)
MIMD multiprocessors have two key advantages:
1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs
can function as single-user multiprocessors focusing on high performance for one
application, as multiprogrammed multiprocessors running many tasks simultaneously, or
as some combination of these functions.
2. MIMDs can build on the cost/performance advantages of off-the-shelf
microprocessors. In fact, nearly all multiprocessors built today use the same
microprocessors found in workstations and single-processor servers.
With an MIMD, each processor is executing its own instruction stream. In many cases,
each processor executes a different process. Recall from the last chapter that a process is a segment of code that may be run independently, and that the state of the process contains all the information necessary to execute that program on a processor. In a
multiprogrammed environment, where the processors may be running independent tasks,
each process is typically independent of the processes on other processors.
It is also useful to be able to have multiple processors executing a single program and
sharing the code and most of their address space. When multiple processes share code
and data in this way, they are often called threads. Today, the term thread is often used in a casual way to refer to multiple loci of
execution that may run on different processors, even when they do not share an address
space. To take advantage of an MIMD multiprocessor with n processors, we must
usually have at least n threads or processes to execute. The independent threads are
typically identified by the programmer or created by the compiler. Since the parallelism
in this situation is contained in the threads, it is called thread-level parallelism.
Threads may vary from large-scale, independent processes–for example, independent
programs running in a multiprogrammed fashion on different processors– to parallel
iterations of a loop, automatically generated by a compiler and each executing for
perhaps less than a thousand instructions. Although the size of a thread is important in
considering how to exploit thread-level parallelism efficiently, the important qualitative
distinction is that such parallelism is identified at a high level by the software system and that the threads consist of hundreds to millions of instructions that may be executed in parallel. In contrast, instruction-level parallelism is identified primarily by the
hardware, though with software help in some cases, and is found and exploited one
instruction at a time.
Existing MIMD multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictate a memory organization and interconnect
strategy. We refer to the multiprocessors by their memory organization, because what
constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, has at most a few dozen processors (as of 2000).
For multiprocessors with small processor counts, it is possible for the processors to share
a single centralized memory and to interconnect the processors and memory by a bus.
With large caches, the bus and the single memory, possibly with multiple banks, can
satisfy the memory demands of a small number of processors. By replacing a single bus
with multiple buses, or even a switch, a centralized shared memory design can be scaled
to a few dozen processors. Although scaling beyond that is technically conceivable,
sharing a centralized memory, even organized as multiple banks, becomes less attractive
as the number of processors sharing it increases.
Because there is a single main memory that has a symmetric relationship to all processors
and a uniform access time from any processor, these multiprocessors are often called
symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is
sometimes called UMA for uniform memory access. This type of centralized shared-
memory architecture is currently by far the most popular organization.
The second group consists of multiprocessors with physically distributed memory. To
support larger processor counts, memory must be distributed among the processors rather
than centralized; otherwise the memory system would not be able to support the
bandwidth demands of a larger number of processors without incurring excessively long
access latency. With the rapid increase in processor performance and the associated
increase in a processor’s memory bandwidth requirements, the scale of multiprocessor for which distributed memory is preferred over a single, centralized memory continues to shrink (which is another reason not to label these classes simply as small scale and large scale). Of course,
the larger number of processors raises the need for a high bandwidth interconnect.
Distributed-memory multiprocessor
Distributing the memory among the nodes has two major benefits. First, it is a cost-
effective way to scale the memory bandwidth, if most of the accesses are to the local
memory in the node. Second, it reduces the latency for accesses to the local memory.
These two advantages make distributed memory attractive at smaller processor counts as
processors get ever faster and require more memory bandwidth and lower memory
latency. The key disadvantage for a distributed memory architecture is that
communicating data between processors becomes somewhat more complex and has
higher latency, at least when there is no contention, because the processors no longer
share a single centralized memory. As we will see shortly, the use of distributed memory
leads to two different paradigms for interprocessor communication.
Typically, I/O as well as memory is distributed among the nodes of the multiprocessor,
and the nodes may be small SMPs (2–8 processors). The use of multiple processors in a node, together with a memory and a network interface, is quite useful from a cost-efficiency viewpoint.
Challenges for Parallel Processing
• Limited parallelism available in programs
– Need new algorithms that can have better parallel performance
• Suppose you want to achieve a speedup of 80 with 100 processors. What fraction
of the original computation can be sequential?
According to Amdahl's law,

    80 = 1 / (Fraction_parallel / 100 + (1 − Fraction_parallel))

Solving gives Fraction_parallel = 0.9975, i.e., only 0.25% of the original computation can be sequential.
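As a quick cross-check of this arithmetic, the following sketch (an illustrative helper, not part of the original notes) solves Amdahl's law for the required parallel fraction given a target speedup and a processor count:

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / (f/p + (1 - f)), where f is the parallel
 * fraction and p is the processor count.  Solving for f gives:
 *   f = (1 - 1/speedup) / (1 - 1/p)                                  */
static double required_parallel_fraction(double target_speedup, double processors)
{
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors);
}

int main(void)
{
    double f = required_parallel_fraction(80.0, 100.0);
    printf("Parallel fraction needed: %.4f (sequential: %.4f)\n", f, 1.0 - f);
    /* Prints: Parallel fraction needed: 0.9975 (sequential: 0.0025) */
    return 0;
}
```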
Multiprocessor chips have been developed by:
• IBM – one-chip multiprocessor
• AMD and Intel – two-processor chips
• Sun – 8-processor multicore chip
Symmetric shared-memory machines support caching of
• Shared Data
• Private Data
Private data are used by a single processor, while shared data are used by multiple processors (essentially providing communication among the processors).
When a private item is cached, its location is migrated to the cache. Since no other processor uses the data, the program behavior is identical to that in a uniprocessor.
Cache Coherence
Unfortunately, caching shared data introduces a new problem because the view of
memory held by two different processors is through their individual caches, which,
without any additional precautions, could end up seeing two different values.
That is, if two different processors can have two different values for the same location, we have what is generally referred to as the cache coherence problem (a small illustrative sketch follows the list below).
Cache coherence problem for a single memory location
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and then P1 reads it, P’s write will be seen by P1 if the read
and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order (could see older
value after a newer value)
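The sketch below is a minimal toy model (hypothetical code, write-through caches with no invalidation) showing how two cached copies of the same location can diverge:

```c
#include <stdio.h>

/* A toy model of the coherence problem: two private caches hold a copy of
 * the same memory location.  Without a coherence protocol, a write by one
 * processor leaves a stale value in the other cache.                     */
struct cache { int valid; int value; };

static int read_loc(struct cache *c, int *memory)
{
    if (!c->valid) {            /* miss: fetch from memory          */
        c->value = *memory;
        c->valid = 1;
    }
    return c->value;            /* hit: return the cached copy      */
}

static void write_loc(struct cache *c, int *memory, int v)
{
    c->value = v;               /* write-through to memory ...              */
    c->valid = 1;
    *memory = v;                /* ... but no invalidation of other caches  */
}

int main(void)
{
    int X = 1;                  /* the shared memory location */
    struct cache A = {0, 0}, B = {0, 0};

    read_loc(&A, &X);           /* CPU A reads X: caches value 1 */
    read_loc(&B, &X);           /* CPU B reads X: caches value 1 */
    write_loc(&A, &X, 0);       /* CPU A writes 0                */

    /* CPU B still sees the stale value 1 even though memory holds 0. */
    printf("memory = %d, CPU B reads %d\n", X, read_loc(&B, &X));
    return 0;
}
```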
Directory based
• Sharing status of a block of physical memory is kept in one location called the
directory.
• Directory-based coherence has slightly higher implementation overhead than
snooping.
• It can scale to larger processor count.
Snooping
• Every cache that has a copy of data also has a copy of the sharing status of the
block.
• No centralized state is kept.
• Caches are also accessible via some broadcast medium (bus or switch)
Cache controllers monitor, or snoop on, the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access.
Snooping protocols are popular with multiprocessors whose caches are attached to a single shared memory because they can use the existing physical connection (the bus to memory) to interrogate the status of the caches. A snoop-based cache coherence scheme is implemented on a shared bus, or on any communication medium that broadcasts cache misses to all the processors.
Write Invalidate: the writing processor gains exclusive access by invalidating all other cached copies before updating the block.
Write Update: the writing processor broadcasts the new value so that all cached copies are updated.
Example Protocol
• Snooping coherence protocol is usually implemented by incorporating a finite-
state controller in each node
• Logically, think of a separate controller associated with each cache block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to distinct
blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time
Example Write Back Snoopy Protocol
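The original figure for this write-back snoopy protocol is not reproduced here; as a rough, hedged sketch, the per-block finite-state controller of an MSI-style write-invalidate protocol (assumed states Invalid, Shared, Modified) might look like this:

```c
#include <stdio.h>

/* MSI-style write-back invalidation protocol: per-block cache states. */
enum state { INVALID, SHARED, MODIFIED };

/* Transitions caused by the local processor. */
static enum state cpu_event(enum state s, int is_write)
{
    switch (s) {
    case INVALID:   /* read miss -> SHARED, write miss -> MODIFIED        */
        return is_write ? MODIFIED : SHARED;
    case SHARED:    /* a write must gain exclusive ownership (invalidate) */
        return is_write ? MODIFIED : SHARED;
    case MODIFIED:  /* reads and writes hit: stay MODIFIED                */
        return MODIFIED;
    }
    return INVALID;
}

/* Transitions caused by a snooped bus request from another processor. */
static enum state bus_event(enum state s, int other_is_write)
{
    if (s == MODIFIED)          /* must write the dirty block back first  */
        return other_is_write ? INVALID : SHARED;
    if (s == SHARED)
        return other_is_write ? INVALID : SHARED;
    return INVALID;
}

int main(void)
{
    enum state s = INVALID;
    s = cpu_event(s, 0);        /* local read miss   -> SHARED   */
    s = cpu_event(s, 1);        /* local write       -> MODIFIED */
    s = bus_event(s, 0);        /* remote read seen  -> SHARED   */
    printf("final state: %d\n", s);   /* prints 1 (SHARED)       */
    return 0;
}
```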
Performance Measurement
• Overall cache performance is a combination of
– Uniprocessor cache miss traffic
– Traffic caused by communication – invalidation and subsequent cache
misses
• Changing the processor count, cache size, and block size can affect these two
components of miss rate
• Uniprocessor miss rate: compulsory, capacity, conflict
• Communication miss rate: coherence misses
– True sharing misses + false sharing misses
True and False Sharing Miss
• True sharing miss
– The first write by a PE to a shared cache block causes an invalidation to
establish ownership of that block
– When another PE attempts to read a modified word in that cache block, a
miss occurs and the resultant block is transferred
• False sharing miss
– Occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written to
– The block is shared, but no word in the block is actually shared, and this miss would not occur if the block size were a single word
• Assume that words x1 and x2 are in the same cache block, which is in the shared
state in the caches of P1 and P2. Assuming the following sequence of events,
identify each miss as a true sharing miss or a false sharing miss.
Time   P1          P2
1      Write x1
2                  Read x2
3      Write x1
4                  Write x2
5      Read x2
Example Result
• 1: True sharing miss (x1 was read by P2, so P2’s copy must be invalidated)
• 2: False sharing miss
– x2 was invalidated by the write of P1, but that value of x1 is not used in
P2
• 3: False sharing miss
– The block containing x1 is marked shared due to the read in P2, but P2 did
not read x1. A write miss is required to obtain exclusive access to the
block
• 4: False sharing miss
  – For the same reason as 3
• 5: True sharing miss
  – The value of x2 being read was written by P2
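To make false sharing concrete, the following hedged sketch (hypothetical code, not from the notes) runs two threads that each increment their own counter; in the first case the counters share a cache line, in the second they are padded apart. On typical hardware the padded version is noticeably faster because no coherence traffic is generated.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters packed into the same cache line (false sharing)... */
static struct { volatile long a, b; } packed;

/* ...and two counters padded onto separate 64-byte cache lines.   */
static struct { volatile long a; char pad[64]; volatile long b; } padded;

static void *inc_packed_a(void *p) { for (long i = 0; i < ITERS; i++) packed.a++; return p; }
static void *inc_packed_b(void *p) { for (long i = 0; i < ITERS; i++) packed.b++; return p; }
static void *inc_padded_a(void *p) { for (long i = 0; i < ITERS; i++) padded.a++; return p; }
static void *inc_padded_b(void *p) { for (long i = 0; i < ITERS; i++) padded.b++; return p; }

/* Run two increment loops in parallel and return elapsed seconds. */
static double run(void *(*fa)(void *), void *(*fb)(void *))
{
    pthread_t t1, t2;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t1, NULL, fa, NULL);
    pthread_create(&t2, NULL, fb, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)   /* compile with: cc -O2 false_sharing.c -lpthread */
{
    printf("counters on the same cache line : %.2f s\n", run(inc_packed_a, inc_packed_b));
    printf("counters on separate cache lines: %.2f s\n", run(inc_padded_a, inc_padded_b));
    return 0;
}
```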
• In addition to tracking the state of each cache block, we must track the processors
that have copies of the block when it is shared (usually a bit vector for each
memory block: 1 if processor has copy)
• Keep it simple(r):
– Writes to non-exclusive data => write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent
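A minimal sketch of such a directory entry (illustrative field names, assuming up to 32 processors with one presence bit each) could be:

```c
#include <stdint.h>
#include <stdio.h>

#define NPROC 32   /* assumed processor count; one presence bit per processor */

enum dir_state { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY };

/* One directory entry per memory block. */
struct dir_entry {
    enum dir_state state;
    uint32_t sharers;               /* bit i set => processor i has a copy */
};

static void add_sharer(struct dir_entry *e, int p)    { e->sharers |=  (1u << p); }
static void drop_sharer(struct dir_entry *e, int p)   { e->sharers &= ~(1u << p); }
static int  has_copy(const struct dir_entry *e, int p) { return (e->sharers >> p) & 1u; }

int main(void)
{
    struct dir_entry e = { UNCACHED, 0 };
    add_sharer(&e, 3);              /* processor 3 reads the block */
    e.state = SHARED_CLEAN;
    add_sharer(&e, 7);              /* processor 7 reads the block */
    printf("P3 has copy: %d, P7 has copy: %d, P5 has copy: %d\n",
           has_copy(&e, 3), has_copy(&e, 7), has_copy(&e, 5));
    return 0;
}
```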
The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.
  – Address mapping
  – Address checking
    • Hence, the protection scheme for scrutinizing addresses is also part of the memory hierarchy.
Memory Hierarchy
Level           Typical capacity   Access time   Unit of transfer
CPU registers   ~500 bytes         0.25 ns       word
Cache           ~64 KB             1 ns          block
Main memory     ~512 MB            100 ns        page
Disk (I/O)      ~100 GB            5 ms          file
(Capacity grows and speed decreases toward the lower levels.)
(Figure: processor versus memory performance, 1980–2010, log scale from 1 to 100,000; the processor-memory performance gap widens over time.)
• The importance of memory hierarchy has increased with advances in performance
of processors.
• Basic cache operation
  – When a word is not found in the cache:
    • It is fetched from memory and placed in the cache along with its address tag.
    • Multiple words (a block) are fetched and moved together for efficiency reasons.
  – A key design decision is block placement:
    • Set associative
      – A set is a group of blocks in the cache.
      – A block is first mapped onto a set, then the block is located by:
        » Finding the mapping to the set
        » Searching the set
The set is chosen by the address of the data:
(Block address) MOD (Number of sets in cache)
• If there are n blocks in a set, the cache placement is called n-way set associative.
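The (Block address) MOD (Number of sets) rule can be illustrated with a small helper; the cache parameters below (64 KB, 4-way, 64-byte blocks) are assumed for the example only:

```c
#include <stdio.h>

/* Example parameters for a 64 KB, 4-way set-associative cache
 * with 64-byte blocks (assumed values for illustration).        */
#define BLOCK_SIZE    64
#define NUM_BLOCKS    (64 * 1024 / BLOCK_SIZE)     /* 1024 blocks */
#define ASSOCIATIVITY 4
#define NUM_SETS      (NUM_BLOCKS / ASSOCIATIVITY) /* 256 sets    */

int main(void)
{
    unsigned long addr = 0x12345678;               /* a byte address                       */
    unsigned long block_addr = addr / BLOCK_SIZE;  /* block address                        */
    unsigned long set = block_addr % NUM_SETS;     /* (Block address) MOD (Number of sets) */
    unsigned long tag = block_addr / NUM_SETS;     /* remaining bits form the tag          */
    printf("address 0x%lx -> set %lu, tag 0x%lx\n", addr, set, tag);
    return 0;
}
```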
Cache data accesses:
- Cache reads
- Cache writes
Write through: a write updates the cache and is also written through to update main memory.
Both strategies can use a write buffer; this allows the cache to proceed as soon as the data is placed in the buffer rather than waiting the full latency to write the data into memory.
The metric used to measure these benefits is the miss rate:

    Miss rate = (number of accesses that miss) / (total number of accesses)
Write back: a write updates only the copy in the cache; main memory is updated when the block is replaced.
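A simplified sketch of the two write policies (the block structure and write-buffer hook are hypothetical):

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified view of one cache block for illustrating write policies. */
struct block { bool valid, dirty; unsigned long tag; unsigned char data[64]; };

/* Stand-in for queueing a write toward main memory. */
static void enqueue_write_buffer(unsigned long tag, int off, unsigned char v)
{
    printf("write buffer: tag %lu, offset %d, value %u\n", tag, off, v);
}

/* Write through: update the cache and also send the write toward memory
 * (via the write buffer, so the CPU does not wait for the full latency). */
static void write_through(struct block *b, int off, unsigned char v)
{
    b->data[off] = v;
    enqueue_write_buffer(b->tag, off, v);
}

/* Write back: update only the cached copy and mark it dirty; memory is
 * updated later, when the block is replaced.                            */
static void write_back(struct block *b, int off, unsigned char v)
{
    b->data[off] = v;
    b->dirty = true;
}

int main(void)
{
    struct block b = { true, false, 42, {0} };
    write_through(&b, 0, 7);
    write_back(&b, 1, 9);
    printf("dirty after write back: %d\n", b.dirty);
    return 0;
}
```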
• Causes of high miss rates
– Three Cs model sorts all misses into three categories
• Compulsory: the very first access to a block cannot be in the cache
  – Compulsory misses are those that would occur even with an infinite cache
• Capacity: the cache cannot contain all the blocks needed by the program
  – Blocks are discarded and later retrieved
• Conflict: the block placement strategy is not fully associative
  – Misses occur when too many blocks map to the same set
- Larger cache memories increase hit time as well as cost and power.
(Figure: Hit Time – access time (ns) versus cache size, from 16 KB to 1 MB; access time increases with cache size.)
(Figure: effect of allowing 1, 2, or 64 outstanding misses (hit under miss; legend 0->1, 1->2, 2->64, and Base) for the SPEC92 benchmarks doduc, nasa7, espresso, ear, ora, wave5, eqntott, compress, tomcatv, fpppp, su2cor, hydro2d, spice2g6, xlisp, alvinn, swm256, mdljdp2, and mdljsp2.)
(Figure: miss rate versus blocking factor, 0 to 150.)
(Figure: per-benchmark performance improvement factors, roughly 1.16 to 1.97, for selected SPECint2000 and SPECfp2000 programs.)
The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access time equation as well as the complexity of the memory hierarchy.
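For reference, the average memory access time equation mentioned here is AMAT = hit time + miss rate × miss penalty; a trivial illustrative helper (the example values are assumptions):

```c
#include <stdio.h>

/* Average memory access time: hit time + miss rate * miss penalty. */
static double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns)
{
    return hit_time_ns + miss_rate * miss_penalty_ns;
}

int main(void)
{
    /* Assumed example values: 1 ns hit time, 2% miss rate, 100 ns penalty. */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.02, 100.0));   /* prints 3.00 ns */
    return 0;
}
```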