Agenda
Unpipelined Microprocessors
Pipelining
Pipelining Hazards
Control Dependence
Data Dependence
Structural Hazard
Out-of-order Execution
Multiple Issue
Agenda:
Unpipelined microprocessors
Pipelining: simplest form of ILP
Out-of-order execution: more ILP
Multiple issue: drink more ILP
Scaling issues and Moore’s Law
Why multi-core
TLP and de-centralized design
Tiled CMP and shared cache
Implications on software
Research directions
Unpipelined Microprocessors:
Pipelining:
Pipelining Hazards:
Control Dependence:
Options: do nothing (stall), assume fall-through, or learn from past history and predict (today the best predictors achieve about 97% average accuracy on SPEC2000)
Data Dependence:
Allow the ALU to bypass the produced value in time: not always possible
Need a live bypass! (requires some negative time travel: not yet feasible in real world)
Bigger problems: load latency is often high; you may not find the data in cache
Structural Hazard:
Out-of-order Execution:
Multiple Issue:
Moore’s Law
Scaling Issues
Multi-core
Thread-level Parallelism
Communication in Multi-core
Niagara Floor-plan
Implications on Software
Research Directions
References
Moore’s Law:
Scaling Issues:
Hardware for extracting ILP has reached the point of diminishing returns
Need a large number of in-flight instructions
Supporting such a large population inside the chip requires power-hungry delay-
sensitive logic and storage
Verification complexity is getting out of control
How to exploit so many transistors?
Must be a de-centralized design which avoids long wires
Multi-core:
Put a few reasonably complex processors or many simple processors on the chip
Each processor has its own primary cache and pipeline
Often a processor is called a core
Often called a chip-multiprocessor (CMP)
Did we use the transistors properly?
Depends on whether you can keep the cores busy
Introduces the concept of thread-level parallelism (TLP)
Thread-level Parallelism:
Communication in Multi-core:
Niagara Floor-plan:
Implications on Software:
Research Directions:
Hexagon of puzzles
Running single-threaded programs efficiently on this sea of cores
Managing energy envelope efficiently
Allocating shared cache efficiently
Allocating shared off-chip bandwidth and memory banks efficiently
Making parallel programming easy
Transactional memory
Speculative parallelization
Verification of hardware and parallel software, and fault tolerance
References:
Architect’s job
The applications
Parallel architecture
Performance metrics
Throughput metrics
Application trends
Commercial sector
Desktop market
Architect’s job
Design and engineer various parts of a computer system to maximize performance and
programmability within the technology limits and cost budget
Technology limit could mean process/circuit technology in case of microprocessor architecture
For bigger systems technology limit could mean interconnect technology (how one component
talks to another at macro level)
The applications
Parallel architecture
Parallelism helps
There are applications that can be parallelized easily
There are important applications that require enormous amount of computation (10 GFLOPS
to 1 TFLOPS)
NASA taps SGI, Intel for Supercomputers: 20 512p SGI Altix using Itanium 2
(http://zdnet.com.com/2100-1103_2-5286156.html) [27th July, 2004]
There are important applications that need to deliver high throughput
Parallelism is ubiquitous
Need to understand the design trade-offs
Microprocessors are now multiprocessors (more later)
Today a computer architect’s primary job is to find out how to efficiently extract
parallelism
Get involved in interesting research projects
Make an impact
Shape the future development
Have fun
Performance metrics
Throughput metrics
Sometimes metrics like jobs/hour may be more important than just the turn-around time of a
job
This is the case for transaction processing (the biggest commercial application for
servers)
Needs to serve as many transactions as possible in a given time provided time per
transaction is reasonable
Transactions are largely independent; so throw in as many hardware threads as
possible
Known as throughput computing
Application trends
Commercial sector
Desktop market
Technology trends
Architectural trends
Supercomputers
Bus-based MPs
Scaling: DSMs
On-chip TLP
Economics
Summary
Technology trends
Architectural trends
Also technology limits such as wire delay are pushing for a more distributed control
rather than the centralized control in today’s processors
If cannot boost ILP what can be done?
Thread-level parallelism (TLP)
Explicit parallel programs already have TLP (inherent)
Sequential programs that are hard to parallelize or ILP-limited can be speculatively
parallelized in hardware
Thread-level speculation (TLS)
Today’s trend: if cannot do anything to boost single-thread performance invest transistors and
resources to exploit TLP
Simplest solution: take the commodity boxes, connect them over gigabit ethernet and let them
talk via messages
The simplest possible message-passing machine
Also known as Network of Workstations (NOW)
Normally PVM (Parallel Virtual Machine) or MPI (Message Passing Interface) is used
for programming
Each processor sees only local memory
Any remote data access must happen through explicit messages (send/recv calls
trapping into kernel)
Optimizations in the messaging layer are possible (user level messages, active messages)
Supercomputers
Bus-based MPs
Scaling: DSMs
Large-scale shared memory MPs are normally built over a scalable switch-based network
Now each node has its local memory
Access to remote memory happens through load/store, but may take longer
Non-Uniform Memory Access (NUMA)
Distributed Shared Memory (DSM)
The underlying coherence protocol is quite different compared to a bus-based SMP
Need specialized memory controller to handle coherence requests and a router to connect to
the network
On-chip TLP
Current trend:
Tight integration
Minimize communication latency (data communication is the bottleneck)
Since we have transistors
Put multiple cores on chip (Chip multiprocessing)
They can communicate via either a shared bus or switch-based fabric on-chip (can be
custom designed and clocked faster)
Or put support for multiple threads without replicating cores (Simultaneous multi-
threading)
Both choices provide a good cost/performance trade-off
Economics
Summary
Long history
Single-threaded execution
Life of an instruction
Multi-cycle execution
Pipelining
More on pipelining
Control hazard
Branch prediction
Data hazards
More on RAW
Multi-cycle EX stage
WAW hazard
Overall CPI
Multiple issue
Long history
Single-threaded execution
Goal of a microprocessor
Given a sequential set of instructions it should execute them correctly as fast as
possible
Correctness is guaranteed as long as the external world sees the execution in-order
(i.e. sequential)
Within the processor it is okay to re-order the instructions as long as the changes to
states are applied in-order
Performance equation
Execution time = average CPI × number of instructions × cycle time
To reduce the execution time we can try to lower one or more of the three terms
Reducing average CPI (cycles per instruction):
The starting point could be CPI=1
But complex arithmetic operations e.g. multiplication/division take more than a cycle
Memory operations take even longer
So normally average CPI is larger than 1
How to reduce CPI is the core of this lecture
Reducing number of instructions
Better compiler, smart instruction set architecture (ISA)
Reducing cycle time: faster clock
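A tiny numeric sketch of the performance equation above (all values are made up purely for illustration):
#include <stdio.h>

int main(void)
{
    /* hypothetical values, only to illustrate the equation above */
    double avg_cpi    = 1.5;      /* average cycles per instruction */
    double inst_count = 1.0e9;    /* dynamic instruction count */
    double cycle_time = 0.5e-9;   /* 0.5 ns cycle, i.e., a 2 GHz clock */
    double exec_time  = avg_cpi * inst_count * cycle_time;
    printf("Execution time = %.3f s\n", exec_time);   /* prints 0.750 s */
    return 0;
}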
Life of an instruction
Multi-cycle execution
Simplest implementation
Assume each of five stages takes a cycle
Five cycles to execute an instruction
After instruction i finishes you start fetching instruction i+1
Without “long latency” instructions CPI is 5
Alternative implementation
You could have a five times slower clock to accommodate all the logic within one cycle
Then you can say CPI is 1 excluding mult/div, mem op
But overall execution time really doesn’t change
What can you do to lower the CPI?
Pipelining
Simple observation
In the multi-cycle implementation when the ALU is executing, say, an add instruction
the decoder is idle
Exactly one stage is active at any point in time
Wastage of hardware
Solution: pipelining
Process five instructions in parallel
Each instruction is in a different stage of processing
Each stage is called a pipeline stage
Need registers between pipeline stages to hold partially processed instructions (called
pipeline latches): why?
More on pipelining
Control hazard
The PC update hardware (selection between target and next PC) works on the lower edge
Can we utilize the branch delay slot?
Ask the compiler guy
The delay slot is always executed (irrespective of the fate of the branch)
Boost instructions common to fall through and target paths to the delay slot
Not always possible to find
You have to be careful also
Must boost something that does not alter the outcome of fall-through or target basic
blocks
If the BD slot is filled with useful instruction then we don’t lose anything in CPI;
otherwise we pay a branch penalty of one cycle
Branch prediction
We can put a branch target cache in the fetcher
Also called branch target buffer (BTB)
Use the lower bits of the instruction PC to index the BTB
Use the remaining bits to match the tag
In case of a hit the BTB tells you the target of the branch when it executed last time
You can hope that this is correct and start fetching from that predicted target provided
by the BTB
One cycle later you get the real target, compare with the predicted target, and throw
away the fetched instruction in case of misprediction; keep going if predicted correctly
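A minimal C sketch of this BTB lookup (the entry count and the 4-byte instruction alignment are assumptions, not taken from the slides):
#define BTB_ENTRIES 512                       /* assumed; must be a power of two */

struct btb_entry {
    unsigned tag;                             /* upper PC bits */
    unsigned target;                          /* last observed branch target */
    int      valid;
};

struct btb_entry btb[BTB_ENTRIES];

/* Returns the predicted fetch PC: the cached target on a hit,
   the fall-through PC (pc + 4) on a miss. */
unsigned btb_lookup(unsigned pc)
{
    unsigned index = (pc >> 2) & (BTB_ENTRIES - 1);   /* lower PC bits */
    unsigned tag   = pc >> 11;                        /* remaining upper bits */
    if (btb[index].valid && btb[index].tag == tag)
        return btb[index].target;
    return pc + 4;
}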
Branch prediction
Data hazards
More on RAW
This type of dependence is the primary cause of increased CPI and lower ILP
Multi-cycle EX stage
WAW hazard
Overall CPI
Multiple issue
Thus far we have assumed that at most one instruction gets advanced to EX stage every cycle
If we have four ALUs we can issue four independent instructions every cycle
This is called superscalar execution
Ideally CPI should go down by a factor equal to issue width (more parallelism)
Extra hardware needed:
Wider fetch to keep the ALUs fed
More decode bandwidth, more register file ports; decoded instructions are put in an
issue queue
Selection of independent instructions for issue
In-order completion
Instruction selection
In-order multi-issue
Out-of-order issue
WAR hazard
Modified bypass
Register renaming
The pipeline
Alternative: VLIW
Current research in µP
Instruction selection
Cannot issue the last two even though they are independent of the first two: in-order
completion is a must for precise exception support
In-order multi-issue
Out-of-order issue
WAR hazard
Modified bypass
Register renaming
Now it is safe to issue them in parallel: they are really independent (compiler introduced
WAW)
Register renaming maintains a map table that records logical register to physical register map
After an instruction is decoded, its logical register numbers are available
The renamer looks up the map table to find mapping for the logical source regs of this
instruction, assigns a free physical register to the destination logical reg, and records the new
mapping
If the renamer runs out of physical registers, the pipeline stalls until at least one register is
available
When do you free a physical register?
Suppose a physical register P is mapped to a logical register L which is the destination
of instruction I
It is safe to free P only when the next producer of L retires (Why not earlier?)
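A small C sketch of the map table and free list described above (the table sizes and the interface are assumptions for illustration):
#define NUM_LOGICAL  32
#define NUM_PHYSICAL 64                      /* assumed sizes */

int map_table[NUM_LOGICAL];                  /* logical reg -> physical reg */
int free_list[NUM_PHYSICAL];                 /* stack of free physical regs */
int free_count;

/* Renames dst = op(src1, src2). Returns 0 if the pipeline must stall
   because no physical register is free. old_pdst is the previous mapping
   of dst; it can be freed only when the next producer of dst retires. */
int rename(int src1, int src2, int dst,
           int *psrc1, int *psrc2, int *pdst, int *old_pdst)
{
    *psrc1 = map_table[src1];                /* look up source mappings */
    *psrc2 = map_table[src2];
    if (free_count == 0)
        return 0;                            /* out of physical registers */
    *pdst     = free_list[--free_count];     /* allocate a free physical reg */
    *old_pdst = map_table[dst];              /* remember old mapping */
    map_table[dst] = *pdst;                  /* record the new mapping */
    return 1;
}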
The pipeline
Fetch, decode, rename, issue, register file read, ALU, cache, retire
Fetch, decode, rename are in-order stages, each handles multiple instructions every cycle
The ROB entry is allocated in rename stage
Issue, register file, ALU, cache are out-of-order
Retire is again in-order, but multiple instructions may retire each cycle: need to free the
resources and drain the pipeline quickly
Alternative: VLIW
Current research in µP
Virtual memory
Addressing VM
VA to PA translation
Page fault
VA to PA translation
TLB
Caches
Addressing a cache
With a 32-bit address you can access 4 GB of physical memory (you will never get the full
memory though)
Seems enough for most day-to-day applications
But there are important applications that have much bigger memory footprint:
databases, scientific apps operating on large matrices etc.
Even if your application fits entirely in physical memory it seems unfair to load the full
image at startup
Just takes away memory from other processes, but probably doesn’t need the full
image at any point of time during execution: hurts multiprogramming
Need to provide an illusion of bigger memory: Virtual Memory (VM)
Virtual memory
Addressing VM
VA to PA translation
Access memory at this address to get 32 bits of data from the page table entry (PTE)
These 32 bits contain many things: a valid bit, the much needed PPFN (may be 20 bits
for a 4 GB physical memory), access permissions (read, write, execute), a
dirty/modified bit etc.
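A hypothetical sketch of decoding such a PTE in C; the exact bit layout is an assumption (real architectures differ), only the idea of valid bit, permissions, and PPFN comes from the text above:
/* Assumed PTE layout: valid, dirty, permission bits in the high bits,
   20-bit physical page frame number (PPFN) in the low bits, 4 KB pages. */
#define PTE_VALID      (1u << 31)
#define PTE_DIRTY      (1u << 30)
#define PTE_READ       (1u << 29)
#define PTE_WRITE      (1u << 28)
#define PTE_EXEC       (1u << 27)
#define PTE_PPFN_MASK  0xFFFFFu

/* Translate a virtual address given its PTE; returns 0 to signal a page
   fault, which the kernel handles (see the next slide). */
unsigned va_to_pa(unsigned pte, unsigned va)
{
    if (!(pte & PTE_VALID))
        return 0;
    unsigned ppfn = pte & PTE_PPFN_MASK;
    return (ppfn << 12) | (va & 0xFFFu);     /* 12-bit page offset, 4 KB page */
}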
Page fault
The valid bit within the 32 bits tells you if the translation is valid
If this bit is reset that means the page is not resident in memory: results in a page fault
In case of a page fault the kernel needs to bring in the page to memory from disk
The disk address is normally provided by the page table entry (different interpretation of 31
bits)
Also kernel needs to allocate a new physical page frame for this virtual page
If all frames are occupied it invokes a page replacement policy
VA to PA translation
TLB
Caches
Once you have completed the VA to PA translation you have the physical address. What’s
next?
You need to access memory with that PA
Instruction and data caches hold most recently used (temporally close) and nearby (spatially
close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)
Addressing a cache
The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss
An example
PA is 32 bits
Cache line is 64 bytes: block offset is 6 bits
Number of cache lines is 512: index is 9 bits
So tag is the remaining bits: 17 bits
Total size of the cache is 512*64 bytes i.e. 32 KB
Each cache line contains the 64 byte data, 17-bit tag, one valid/invalid bit, and several
state bits (such as shared, dirty etc.)
Since both the tag and the index are derived from the PA this is called a physically
indexed physically tagged cache
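The same split, written as C bit manipulation for this example cache:
/* 32-bit PA, 64-byte lines (6 offset bits), 512 lines (9 index bits),
   so the tag is the remaining 17 bits. */
unsigned block_offset(unsigned pa) { return pa & 0x3Fu; }
unsigned cache_index (unsigned pa) { return (pa >> 6) & 0x1FFu; }
unsigned cache_tag   (unsigned pa) { return pa >> 15; }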
Cache hierarchy
Inclusion policy
TLB access
Memory op latency
MLP
Out-of-order loads
Load/store ordering
When you need to evict a line in a particular set you run a replacement policy
LRU is a good choice: keeps the most recently used lines (favors temporal locality)
Thus you reduce the number of conflict misses
Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single
set)
Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices
or number of sets=32*1024/(2*64)=256 and hence index is 8 bits wide
Example: Same size and line size, but fully associative: number of sets is 1, within the
set there are 32*1024/64 or 512 lines; you need 512 tag comparisons for each access
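The arithmetic in both examples follows one formula; a one-line C helper makes it explicit:
/* number of sets = cache size / (associativity * line size) */
unsigned num_sets(unsigned cache_bytes, unsigned ways, unsigned line_bytes)
{
    return cache_bytes / (ways * line_bytes);
}
/* num_sets(32*1024, 2, 64)   == 256 -> 8-bit index (2-way example above)
   num_sets(32*1024, 512, 64) == 1   -> fully associative example above */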
Cache hierarchy
Inclusion policy
A cache hierarchy implements inclusion if the contents of the level-n cache (excluding the register file) are a subset of the contents of the level-(n+1) cache
Eviction of a line from L2 must ask L1 caches (both instruction and data) to invalidate
that line if present
A store miss fills the L2 cache line in M state, but the store really happens in L1 data
cache; so L2 cache does not have the most up-to-date copy of the line
Eviction of an L1 line in M state writes back the line to L2
Eviction of an L2 line in M state first asks the L1 data cache to send the most up-to-
date copy (if any), then it writes the line back to the next higher level (L3 or main
memory)
Inclusion simplifies the on-chip coherence protocol (more later)
TLB access
For every cache access (instruction or data) you need to access the TLB first
Puts the TLB in the critical path
Want to start indexing into cache and read the tags while TLB lookup takes place
Virtually indexed physically tagged cache
Extract index from the VA, start reading tag while looking up TLB
Once the PA is available do tag comparison
Overlaps TLB reading and tag reading
Memory op latency
L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block
instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, processor runs out of resources such as ROB entries and
physical registers
Ultimately, the fetcher stalls: severely limits ILP
MLP
Out-of-order loads
sw 0(r7), r6
… /* other instructions */
lw r2, 80(r20)
Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before
they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7)==80(r20) or they overlap; now
what?
Load/store ordering
MIPS R10000
Overview
Stage 1: Fetch
Stage 2: Decode/Rename
Branch prediction
Branch predictor
Register renaming
Preparing to issue
Stage 3: Issue
Load-dependents
Functional units
Result writeback
Retirement or commit
Overview
Mid 90s: One of the first dynamic out-of-order superscalar RISC microprocessors
6.8 M transistors on a 298 mm^2 die (0.35 µm CMOS)
Out of 6.8 M transistors 4.4 M are devoted to L1 instruction and data caches
Fetches, decodes, renames 4 instructions every cycle
64-bit registers: the data path width is 64 bits
On-chip 32 KB L1 instruction and data caches, 2-way set associative
Off-chip L2 cache of variable size (512 KB to 16 MB), 2-way set associative, line size 128
bytes
Stage 1: Fetch
The instructions are slightly pre-decoded when the cache line is brought into Icache
Simplifies the decode stage
Processor fetches four sequential instructions every cycle from the Icache
The iTLB has eight entries, fully associative
No BTB
So the fetcher really cannot do anything about branches other than fetching sequentially
Stage 2: Decode/Rename
Branch prediction
Branch predictor
Register renaming
Preparing to issue
Finally, during the second stage every instruction is assigned an active list entry
The active list is a 32-entry FIFO queue which keeps track of all in-flight instructions
(at most 32) in-order
Each entry contains various info about the allocated instruction such as physical dest
reg number etc.
Also, each instruction is assigned to one of the three issue queues depending on its type
Integer queue: holds integer ALU instructions
Floating-point queue: holds FPU instructions
Address queue: holds the memory operations
Therefore, stage 2 may stall if the processor runs out of: active list entries, physical regs,
issue queue entries
Stage 3: Issue
Load-dependents
Functional units
Right after an instruction is issued it reads the source operands (dictated by physical reg
numbers) from the register file (integer or fp depending on instruction type)
From stage 4 onwards the instructions execute
Two ALUs: branch and shift can execute on ALU1, multiply/divide can execute on
ALU2, all other instructions can execute on any of the two ALUs; ALU1 is responsible
for triggering rollback in case of branch misprediction (marks all instructions after the
branch as squashed, restores the register map from correct branch stack entry, sets
fetch PC to the correct target)
Four FPUs: one dedicated for fp multiply, one for fp divide, one for fp square root, most
of the other instructions execute on the remaining FPU
LSU (Load/store unit): Address calc. ALU, dTLB is fully assoc. with 64 entries and
translates 44-bit VA to 40-bit PA, PA is used to match dcache tags (virtually indexed
physically tagged)
Result writeback
As soon as an instruction completes execution the result is written back to the destination
physical register
No need to wait till retirement since the renamer has guaranteed that this physical
destination is associated with a unique instruction in the pipeline
Also the results are launched on the bypass network (from outputs of ALU/FPU/dcache to
inputs of ALU/FPU/address calculation ALUs)
This guarantees that dependents can be issued back-to-back and still they can receive
the correct value
add r3, r4, r5; add r6, r4, r3; (can be issued in consecutive cycles, although the second
add will read a wrong value of r3 from the register file)
Retirement or commit
Immediately after the instructions finish execution, they may not be able to leave the pipe
In-order retirement is necessary for precise exception
When an instruction comes to the head of the active list it can retire
R10k retires 4 instructions every cycle
Retirement involves
Updating the branch predictor and freeing its branch stack entry if it is a branch
instruction
Moving the store value from the speculative store buffer entry to the L1 data cache if it
is a store instruction
Freeing old destination physical register and updating the register free list
And, finally, freeing the active list entry itself
Self-assessment Exercise
2. A set associative cache has longer hit time than an equally sized direct-
mapped cache. Why?
3. The Alpha 21264 has a virtually indexed virtually tagged instruction cache.
Do you see any security/protection issues with this? If yes, explain and offer a
solution. How would you maintain correctness of such a cache in a multi-
programmed environment?
4. Consider the following segment of C code for adding the elements in each
column of an NxN matrix A and putting it in a vector x of size N.
for(j=0;j<N;j++) {
for(i=0;i<N;i++) {
x[j] += A[i][j];
}
}
Assume that the C compiler carries out a row-major layout of matrix A i.e. A[i][j]
and A[i][j+1] are adjacent to each other in memory for all i and j in the legal
range and A[i][N-1] and A[i+1][0] are adjacent to each other for all i in the legal
range. Assume further that each element of A and x is a floating point double
i.e. 8 bytes in size. This code is executed on a modern speculative out-of-order
processor with the following memory hierarchy: page size 4 KB, fully associative
128-entry data TLB, 32 KB 2-way set associative single level data cache with
32 bytes line size, 256 MB DRAM. You may assume that the cache is virtually
indexed and physically tagged, although this information is not needed to
answer this question. For N=8192, compute the following (please show all the
intermediate steps). Assume that every instruction hits in the instruction cache.
Assume LRU replacement policy for physical page frames, TLB entries, and
cache sets.
5. Suppose you are running a program on two machines, both having a single
level of cache hierarchy (i.e. only L1 caches). In one machine the cache is
virtually indexed and physically tagged while in the other it is physically indexed
and physically tagged. Will there be any difference in cache miss rates when the
program is run on these two machines?
Solution: The physical memory is 2^32 bytes, since the physical address is 32
bits. Since the page size is 8 KB, the number of pages is (2^32)/(2^13) i.e.,
2^19. Since each page table entry is four bytes in size and each page must
have one page table entry, the size of the page table is (2^19)*4 bytes or 2 MB.
(B) Clearly show (with the help of a diagram) the addressing scheme if the
cache is virtually indexed and physically tagged. Your diagram should show the
width of TLB and cache tags.
Solution: I will describe the addressing scheme here. You can derive the
diagram from that. The processor generates 40-bit virtual addresses for memory
operations. This address must be translated to a physical address that can be
used to look up the memory (through the caches). The first step in this
translation is TLB lookup. Since the TLB has 32 sets, the index width for TLB
lookup is five bits. The lowest 13 bits of the virtual address constitute the page
offset and are not used for TLB lookup. The next lower five bits are used for
indexing into the TLB. This leaves the upper 22 bits of the virtual address to be
used as the TLB tag. On a TLB hit, the TLB entry provides the necessary page
table entry, which is 32 bits in width. On a TLB miss, the page table entry must
be read from the page table resident in memory or cache. Nonetheless, the net
effect of whichever path is taken is that we have the 32-bit page table entry.
From these 32 bits, the necessary 19-bit physical page frame number is
extracted (recall that the number of physical pages is 2^19). When the 13-bit
page offset is concatenated to this 19-bit frame number, we get the target
physical address. We must first look up the cache to check if the data
corresponding to this address is already resident there before querying the
memory. Since the cache is virtually indexed and physically tagged, the cache
lookup can start at the same time as TLB lookup. The cache has (2^15)/(64*2)
or 256 sets. So eight bits are needed to index the cache. The lower six bits of
the virtual address are the block offset and not used for cache indexing. The
next eight bits are used as cache index. The tags resident at both the ways of
this set are read out. The target tag is computed from the physical address and
must be compared against both the read out tags to test for a cache hit. Let's
try to understand how the target tag is computed from the physical address that
we have formed above with the help of the page table entry. Usually, the tag is
derived by removing the block offset and cache index bits from the physical
address. So, in this case, it is tempting to take the upper 18 bits of the physical
address as the cache tag. Unfortunately, this does not work for virtually indexed
physically tagged cache where the page offset is smaller than the block offset
plus cache index. In this particular example, they differ by one bit. Let's see what
the problem is. Consider two different cache blocks residing at the same
cache index v derived from the virtual address. This means that these blocks
have identical lower 14 bits of the virtual address. This guarantees that these
two blocks will have identical lower 13 bits of physical address because virtual
to physical address translation does not change page offset bits. However,
nothing stops these two blocks from having identical upper 18 bits of the
physical address, but different 14th bit. Now, it is clear why the traditional tag
computation would make mistakes in identifying the correct block. So the cache
tag must also include the 14th bit. In other words, the cache tag needs to be
identical to the physical page frame number. This completes the cache lookup.
On a cache miss, the 32-bit physical address must be sent to memory for
satisfying the cache miss.
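A small C restatement of the final addressing scheme, with the widths taken from the solution above (the VA is held in a 64-bit type only because it is 40 bits wide):
/* 64-byte lines (6 offset bits), 256 sets (8 index bits), 8 KB pages.
   The index comes from the VA; the tag must be the full 19-bit PPFN. */
unsigned cache_index(unsigned long long va) { return (unsigned)((va >> 6) & 0xFFu); }
unsigned cache_tag  (unsigned pa)           { return pa >> 13; }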
(C) If the cache was physically indexed and physically tagged, what part of the
addressing scheme would change?
Solution: Almost everything remains unchanged, except that the cache index
comes from the physical address now. As a result, the cache lookup cannot
start until the TLB lookup completes. The cache tag now can be only upper 18
bits of the physical address.
2. A set associative cache has longer hit time than an equally sized direct-
mapped cache. Why?
3. The Alpha 21264 has a virtually indexed virtually tagged instruction cache.
Do you see any security/protection issues with this? If yes, explain and offer a
solution. How would you maintain correctness of such a cache in a multi-
programmed environment?
Solution: One solution is to flush the instruction cache on every context switch; however, the process switching in will then see the cold start effect. Another solution would be to incorporate the process id in the cache tag. However, this may increase the cache latency depending on the width of the process id. In general, it is very difficult to say which one is going to be better; it depends on the class of applications that will run.
4. Consider the following segment of C code for adding the elements in each
column of an NxN matrix A and putting it in a vector x of size N.
for(j=0;j<N;j++) {
for(i=0;i<N;i++) {
x[j] += A[i][j];
}
}
Assume that the C compiler carries out a row-major layout of matrix A i.e., A[i][j]
and A[i][j+1] are adjacent to each other in memory for all i and j in the legal
range and A[i][N-1] and A[i+1][0] are adjacent to each other for all i in the legal
range. Assume further that each element of A and x is a floating point double
i.e., 8 bytes in size. This code is executed on a modern speculative out-of-order
issue processor with the following memory hierarchy: page size 4 KB, fully
associative 128-entry data TLB, 32 KB 2-way set associative single level data
cache with 32 bytes line size, 256 MB DRAM. You may assume that the cache
is virtually indexed and physically tagged, although this information is not
needed to answer this question. For N=8192, compute the following (please
show all the intermediate steps). Assume that every instruction hits in the
instruction cache. Assume LRU replacement policy for physical page frames,
TLB entries, and cache sets.
Solution: The total size of x is 64 KB and the total size of A is 512 MB. So,
these do not fit in the physical memory, which is of size 256 MB. Also, we note
that one row of A is of size 64 KB. As a result, every row of A starts on a new
page. As the computation starts, the first outer loop iteration suffers from one
page fault due to x and 8192 page faults due to A. Since one page can hold 512
elements of x and A, the next 511 outer loop iterations do not take any page
faults. The j=512 iteration again suffers from one page fault in x and 8192 fresh
page faults in A. This pattern continues until the memory gets filled up. At this
point we need to invoke the replacement policy, which is LRU. As a result, the
old pages of x and A will get replaced to make room for the new ones. Instead
of calculating the exact iteration point where the memory gets exhausted, we
only note that the page fault pattern continues to hold even beyond this point.
Therefore, the total number of page faults is 8193*(8192/512) or 8193*16 or
131088.
Solution: The TLB can hold 128 pages at a time. The TLB gets filled up at j=0,
i=126 with one translation for x[0] and 127 translations for A[0][0] to A[126][0]
(each row of A lives on its own set of pages). At this point, the LRU replacement
policy is invoked and it replaces the translations of A. The translation of x[0]
does not get replaced because it is touched in every inner loop iteration. By the
time the j=0 iteration is finished, only the 127 most recently touched rows of A
have their translations in the TLB, so the next outer loop iteration again misses
on every access of A. Therefore, every access of A suffers a TLB miss, while x
suffers one TLB miss per page i.e., 16 misses in total. The total number of TLB
misses is 8192*8192+16.
(C) Number of data cache misses. Assume that x and A do not conflict with
each other in the cache.
Solution: In this case also, x enjoys maximum reuse, while A suffers from a
cache miss on every access. This is because the number of blocks in the cache
is 1024, which is much smaller than the reuse distance in A. One cache block
can hold four elements of x. As a result, x takes a cache miss on every fourth
element. So, the total number of cache misses is 2048+8192*8192 or 2K+64M.
(D) At most how many memory operations can the processor overlap before
coming to a halt? Assume that the instruction selection logic (associated with
the issue unit) gives priority to older instructions over younger instructions if both
are ready to issue in a cycle.
Solution: Since every access of A suffers from a TLB miss and the TLB
misses are usually implemented as restartable exceptions, there cannot be any
overlap among multiple memory operations. A typical iteration would involve load
of x, TLB miss followed by load of A, addition, and store to x. No two memory
operations can overlap because the middle one always suffers from a TLB miss
leading to a pipe flush.
5. Suppose you are running a program on two machines, both having a single
level of cache hierarchy (i.e. only L1 caches). In one machine the cache is
virtually indexed and physically tagged while in the other it is physically indexed
and physically tagged. Will there be any difference in cache miss rates when the
program is run on these two machines?
Agenda
Communication architecture
Layered architecture
Shared address
Message passing
Convergence
Agenda
Communication architecture
Layered architecture
Shared address
Interconnect could be a crossbar switch so that any processor can talk to any memory
bank in one “hop” (provides latency and bandwidth advantages)
Scaling a crossbar becomes a problem: cost is proportional to square of the size
Instead, could use a scalable switch-based network; latency increases and bandwidth
decreases because now multiple processors contend for switch ports
Communication medium
From mid 80s shared bus became popular leading to the design of SMPs
Pentium Pro Quad was the first commodity SMP
Sun Enterprise server provided a highly pipelined wide shared bus for scalability
reasons; it also distributed the memory to each processor, but there was no local bus
on the boards i.e. the memory was still “symmetric” (must use the shared bus)
NUMA or DSM architectures provide a better solution to the scalability problem; the
symmetric view is replaced by local and remote memory and each node (containing
processor(s) with caches, memory controller and router) gets connected via a scalable
network (mesh, ring etc.); Examples include Cray/SGI T3E, SGI Origin 2000, Alpha
GS320, Alpha/HP GS1280 etc.
Message passing
Convergence
Shared address and message passing are two distinct programming models, but the
architectures look very similar
Both have a communication assist or network interface to initiate messages or
transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the
trend is changing
There are message passing machines where the assist sits on the memory bus or
machines where DMA over network is supported (direct transfer from source memory
to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers
and flags
Possible to emulate a shared virtual mem. on message passing machines through
modified page fault handlers
Dataflow architecture
Systolic arrays
A generic architecture
Design issues
Naming
Operations
Ordering
Replication
Communication cost
Dataflow architecture
Systolic arrays
Each PE may have small instruction and data memory and may carry out a different operation
Data proceeds through the array at regular “heartbeats” (hence the name)
The dataflow may be multi-directional or optimized for specific algorithms
Optimize the interconnect for specific application (not necessarily a linear topology)
Practical implementation in iWARP
Uses general purpose processors as PEs
Dedicated channels between PEs for direct register to register communication
A generic architecture
In all the architectures we have discussed thus far a node essentially contains processor(s) +
caches, memory and a communication assist (CA)
CA = network interface (NI) + communication controller
The nodes are connected over a scalable network
The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there is a lot of
choices in the design of the CA
Most innovations in parallel architecture take place in the communication assist (also
called communication controller or node controller)
Design issues
Naming
Operations
Ordering
Replication
Communication cost
Parallel Programming
Agenda
Galaxy simulation
Ray tracing
Some definitions
Static assignment
Dynamic assignment
Decomposition types
Orchestration
Mapping
An example
Sequential program
Agenda
Galaxy simulation
Ray tracing
Some definitions
Task
Arbitrary piece of sequential work
Concurrency is only across tasks
Fine-grained task vs. coarse-grained task: controls granularity of parallelism
(spectrum of grain: one instruction to the whole sequential program)
Process/thread
Logical entity that performs a task
Communication and synchronization happen between threads
Processors
Physical entity on which one or more processes execute
Static assignment
Dynamic assignment
Static assignment may lead to load imbalance depending on how irregular the application is
Dynamic decomposition/assignment solves this issue by allowing a process to dynamically
choose any available task whenever it is done with its previous task
Normally in this case you decompose the program in such a way that the number of
available tasks is larger than the number of processes
Same example: divide the array into portions each with 10 indices; so you have N/10
tasks
An idle process grabs the next available task
Provides better load balance since longer tasks can execute concurrently with the
smaller ones
Dynamic assignment comes with its own overhead
Now you need to maintain a shared count of the number of available tasks
The update of this variable must be protected by a lock
Need to be careful so that this lock contention does not outweigh the benefits of
dynamic decomposition
More complicated applications where a task may not just operate on an index range, but
could manipulate a subtree or a complex data structure
Normally a dynamic task queue is maintained where each task is probably a pointer to
the data
The task queue gets populated as new tasks are discovered
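A minimal sketch of grabbing the next task index, written with the same LOCK/UNLOCK macros that appear in the code later in these notes; the shared fields (task_lock, next_task, num_tasks) are assumed names, not from the original:
int grab_task(void)
{
    int t = -1;
    LOCK (gm->task_lock);            /* the shared count must be updated under the lock */
    if (gm->next_task < gm->num_tasks)
        t = gm->next_task++;
    UNLOCK (gm->task_lock);
    return t;                        /* -1 means no tasks are left */
}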
Decomposition types
Decomposition by data
The most commonly found decomposition technique
The data set is partitioned into several subsets and each subset is assigned to a
process
The type of computation may or may not be identical on each subset
Very easy to program and manage
Computational decomposition
Not so popular: tricky to program and manage
All processes operate on the same data, but probably carry out different kinds of
computation
More common in systolic arrays, pipelined graphics processor units (GPUs) etc.
Orchestration
Mapping
An example
Sequential program
Parallel Programming
Assignment
Mutual exclusion
LOCK optimization
More synchronization
Message passing
Major changes
Message passing
MPI-like environment
while (!done)
diff = 0.0;
for_all i = 0 to n-1
for_all j = 0 to n-1
temp = A[i, j];
A[i, j] = 0.2*(A[i, j]+A[i, j+1]+A[i, j-1]+A[i-1, j]+A[i+1, j]);
diff += fabs (A[i, j] - temp);
end for_all
end for_all
if (diff/(n*n) < TOL) then done = 1;
end while
Assignment
/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
LOCKDEC (diff_lock);
BARDEC (barrier);
float **A, diff;
} *gm;
int main (int argc, char **argv)
{
int i;
MAIN_INITENV;
gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
LOCKINIT (gm->diff_lock);
BARINIT (gm->barrier);
n = atoi (argv[1]);
P = atoi (argv[2]);
gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
for (i = 0; i < n+2; i++) {
gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
}
Initialize (gm->A);
for (i = 1; i < P; i++) { /* starts at 1 */
CREATE (Solve);
}
Solve ();
WAIT_FOR_END (P-1);
MAIN_END;
}
Mutual exclusion
LOCK optimization
Suppose each processor updates a shared variable holding a global cost value, only if its
local cost is less than the global cost: found frequently in minimization problems
LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
/* May lead to heavy lock contention if everyone tries to update at the same time */
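One common way to reduce this contention (a sketch in the same pseudo-API) is to test without the lock first and re-test after acquiring it, since gm->cost may have changed in between:
if (my_cost < gm->cost) {            /* cheap test without the lock */
    LOCK (gm->cost_lock);
    if (my_cost < gm->cost) {        /* re-test under the lock */
        gm->cost = my_cost;
    }
    UNLOCK (gm->cost_lock);
}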
More synchronization
Global synchronization
Through barriers
Often used to separate computation phases
Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was
waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on uniprocessor through P and V
functions
Normally implemented through flags on shared memory multiprocessors (busy wait or
spin)
P0 : A = 1; flag = 1;
P1 : while (!flag); use (A);
Message passing
Major changes
Message passing
MPI-like environment
MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97
int main(int argc, char **argv)
{
int pid, P, done, i, j, N;
float tempdiff, local_diff, temp, **A;
MAIN_INITENV;
GET_PID(pid);
GET_NUMPROCS(P);
N = atoi(argv[1]);
tempdiff = 0.0;
done = 0;
A = (float **) malloc ((N/P+2) * sizeof(float *));
for (i=0; i < N/P+2; i++) {
A[i] = (float *) malloc (sizeof(float) * (N+2));
}
initialize(A);
while (!done) {
local_diff = 0.0;
/* MPI_CHAR means raw byte format */
Performance Issues
Agenda
Load balancing
Task stealing
Architect’s job
Domain decomposition
Comm-to-comp ratio
Extra work
Data access
Agenda
Load balancing
Task stealing
A processor may choose to steal tasks from another processor’s queue if the former’s queue
is empty
How many tasks to steal? Whom to steal from?
The biggest question: how to detect termination? Really a distributed consensus!
Task stealing, in general, may increase overhead and communication, but a smart
design may lead to excellent load balance (normally hard to design efficiently)
This is a form of a more general technique called Receiver Initiated Diffusion (RID)
where the receiver of the task initiates the task transfer
In Sender Initiated Diffusion (SID) a processor may choose to insert into another
processor’s queue if the former’s task queue is full above a threshold
Architect’s job
Domain decomposition
Comm-to-comp ratio
Surely, there could be many different domain decompositions for a particular problem
For grid solver we may have a square block decomposition, block row decomposition
or cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio
Computation (area): N^2/P; communication (perimeter): proportional to N/sqrt(P), so the comm-to-comp ratio scales as sqrt(P)/N for the square block decomposition
In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp
ratio
But depends on the application structure i.e. picking the lowest comm-to-comp may
have other problems
Normally this ratio gives you a rough estimate about average communication
bandwidth requirement of the application i.e. how frequent is communication
But it does not tell you the nature of communication i.e. bursty or uniform
For grid solver comm. happens only at the start of each iteration; it is not uniformly
distributed over computation
Thus the worst case BW requirement may exceed the average comm-to-comp ratio
Extra work
The memory hierarchy (caches and main memory) plays a significant role in determining
communication cost
May easily dominate the inherent communication of the algorithm
For uniprocessor, the execution time of a program is given by useful work time + data access
time
Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches)
or by cache-aware algorithm design that exploits spatial and temporal locality
Data access
In multiprocessors
Every processor wants to see the memory interface as its own local cache and the
main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other
processors; if the memory is distributed then some part of it is local and some is
remote
For shared memory, data movement from local or remote memory to cache is
transparent while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes
caches of other processors, remote memory modules and the network topology
Artifactual comm.
Capacity problem
Temporal locality
Spatial locality
2D to 4D conversion
Transfer granularity
Communication cost
Contention
Hot-spots
Overlap
Summary
Artifactual comm.
Capacity problem
Temporal locality
Block transpose (figure)
Spatial locality
Consider a square block decomposition of grid solver and a C-like row major layout i.e. A[i][j]
and A[i][j+1] have contiguous memory locations
The same page is local to a processor while remote to others; the same applies to straddling cache lines. Ideally, I want all pages within a partition to be local to a single processor. The standard trick is to convert the 2D array to 4D.
2D to 4D conversion
Transfer granularity
decomposition (you need only one element, but must transfer the whole cache line): no
good solution
Communication cost
Given the total volume of communication (in bytes, say) the goal is to reduce the end-to-end
latency
Simple model:
Contention
Hot-spots
Avoid location hot-spot by either staggering accesses to the same location or by designing the
algorithm to exploit a tree structured communication
Module hot-spot
Normally happens when a particular node saturates handling too many messages
(need not be to same memory location) within a short amount of time
Normal solution again is to design the algorithm in such a way that these messages
are staggered over time
Rule of thumb: design communication pattern such that it is not bursty; want to distribute it
uniformly over time
Overlap
Summary
Exercise : 1
1. [10 points] Suppose you are given a program that does a fixed amount of
work, and some fraction s of that work must be done sequentially. The
remaining portion of the work is perfectly parallelizable on P processors. Derive
a formula for execution time on P processors and establish an upper bound on
the achievable speedup.
2. [40 points] Suppose you want to transfer n bytes from a source node S to a
destination node D and there are H links between S and D. Therefore, notice
that there are H+1 routers in the path (including the ones in S and D). Suppose
W is the node-to-network bandwidth at each router. So at S you require n/W
time to copy the message into the router buffer. Similarly, to copy the message
from the buffer of router in S to the buffer of the next router on the path, you
require another n/W time. Assuming a store-and-forward protocol total time
spent doing these copy operations would be (H+2)n/W and the data will end up
in some memory buffer in D. On top of this, at each router we spend R amount
of time to figure out the exit port. So the total time taken to transfer n bytes from
S to D in a store-and-forward protocol is (H+2)n/W+(H+1)R. On the other hand,
if you assume a cut-through protocol the critical path would just be n/W+(H+1)R.
Here we assume the best possible scenario where the header routing delay at
each node is exposed and only the startup n/W delay at S is exposed. The rest
is pipelined. Now suppose that you are asked to compare the performance of
these two routing protocols on an 8x8 grid. Compute the maximum, minimum,
and average latency to transfer an n byte message in this topology for both the
protocols. Assume the following values: W=3.2 GB/s and R=10 ns. Compute for
n=64 and 256. Note that for each protocol you will have three answers
(maximum, minimum, average) for each value of n. Here GB means 10^9 bytes
and not 2^30 bytes.
Solution of Exercise : 1
1. [10 points] Suppose you are given a program that does a fixed amount of
work, and some fraction s of that work must be done sequentially. The
remaining portion of the work is perfectly parallelizable on P processors. Derive
a formula for execution time on P processors and establish an upper bound on
the achievable speedup.
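A sketch of the expected derivation (assuming the single-processor time is T): the sequential fraction takes sT and the parallelizable fraction takes (1-s)T/P, so Time(P) = sT + (1-s)T/P. Speedup(P) = T/Time(P) = 1/(s + (1-s)/P), which increases with P but is bounded above by 1/s (Amdahl's law).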
2. [40 points] Suppose you want to transfer n bytes from a source node S to a
destination node D and there are H links between S and D. Therefore, notice
that there are H+1 routers in the path (including the ones in S and D). Suppose
W is the node-to-network bandwidth at each router. So at S you require n/W
time to copy the message into the router buffer. Similarly, to copy the message
from the buffer of router in S to the buffer of the next router on the path, you
require another n/W time. Assuming a store-and-forward protocol total time
spent doing these copy operations would be (H+2)n/W and the data will end up
in some memory buffer in D. On top of this, at each router we spend R amount
of time to figure out the exit port. So the total time taken to transfer n bytes from
S to D in a store-and-forward protocol is (H+2)n/W+(H+1)R. On the other hand,
if you assume a cut-through protocol the critical path would just be n/W+(H+1)R.
Here we assume the best possible scenario where the header routing delay at
each node is exposed and only the startup n/W delay at S is exposed. The rest
is pipelined. Now suppose that you are asked to compare the performance of
these two routing protocols on an 8x8 grid. Compute the maximum, minimum,
and average latency to transfer an n byte message in this topology for both the
protocols. Assume the following values: W=3.2 GB/s and R=10 ns. Compute for
n=64 and 256. Note that for each protocol you will have three answers
(maximum, minimum, average) for each value of n. Here GB means 10^9 bytes
and not 2^30 bytes.
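A small C helper that evaluates the two formulas given in the problem statement for each hop count; the range H = 1..14 assumes an 8x8 2D grid whose longest path is 7+7 hops, and the maximum/minimum/average answers still have to be read off or averaged from these values:
#include <stdio.h>

int main(void)
{
    double W = 3.2e9;                     /* node-to-network bandwidth, bytes/s */
    double R = 10e-9;                     /* per-router routing delay, seconds */
    int n_values[2] = {64, 256};

    for (int k = 0; k < 2; k++) {
        int n = n_values[k];
        for (int H = 1; H <= 14; H++) {
            double sf = (H + 2) * n / W + (H + 1) * R;   /* store-and-forward */
            double ct = n / W + (H + 1) * R;             /* cut-through */
            printf("n=%3d H=%2d  SF=%7.2f ns  CT=%7.2f ns\n",
                   n, H, sf * 1e9, ct * 1e9);
        }
    }
    return 0;
}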
A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix
element to one processor (i.e. you have n^2 processors). Compute the total
amount of data communication between processors.
Solution: Each processor requires the four neighbors i.e. 32 bytes. So total
amount of data communicated is 32n^2.
Shared cache
Private cache/Dancehall
Cache coherence
Implementations
Shared cache
Private cache/Dancehall
In shared caches, getting a block from a remote bank takes time proportional to the physical
distance between the requester and the bank
Non-uniform cache architecture (NUCA)
This is same for private caches, if the data resides in a remote cache
Shared cache may have higher average hit latency than the private cache
Hopefully most hits in the latter will be local
Shared caches are most likely to have fewer misses than private caches
Latter wastes space due to replication
Cache coherence
Implementations
Sharing patterns
Migratory hand-off
Stores
MSI protocol
State transition
MSI example
MESI protocol
State transition
MESI example
MOESI protocol
Hybrid inval+update
Sharing patterns
Migratory hand-off
Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O)
Every processor does not support all five states
E state is equivalent to M in the sense that the line has permission to write, but in E
state the line is not yet modified and the copy in memory is the same as in cache; if
someone else requests the line the memory will provide the line
O state is like S state except that the copy in memory is stale; memory is not responsible for
servicing requests to the line, and the owner must supply the line (just as in M state)
Stores
MSI protocol
State transition
MSI example
P1 generates BusUpgr, P0 snoops and invalidates line, memory does not respond, P1
sets state of line to M
P0 generates BusRd, P1 flushes line and goes to S state, P0 puts line in S state,
memory writes back
P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides line, P3 puts
line in cache in M state
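A compact C sketch of the per-line state machine these transitions come from; the state names and bus actions follow the slides, the M-to-S choice on a snooped BusRd matches the example above, and everything else is a simplification:
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

/* Next state on a request from the local processor.
   I + load  -> issue BusRd,   go to S
   I + store -> issue BusRdX,  go to M
   S + store -> issue BusUpgr, go to M */
msi_state_t on_proc_request(msi_state_t st, int is_write)
{
    if (!is_write)
        return (st == INVALID) ? SHARED : st;
    return MODIFIED;
}

/* Next state on a snooped bus transaction from another processor. */
msi_state_t on_snoop(msi_state_t st, int is_busrd)
{
    if (st == MODIFIED)
        return is_busrd ? SHARED : INVALID;  /* flush, then downgrade or invalidate */
    if (st == SHARED && !is_busrd)
        return INVALID;                      /* BusRdX or BusUpgr invalidates sharers */
    return st;
}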
MESI protocol
State transition
MESI example
MOESI protocol
Hybrid inval+update
Four organizations
Hierarchical design
Cache Coherence
Example
Definitions
Ordering memory op
Example
Cache coherence
Bus-based SMP
Snoopy protocols
State transition
Ordering memory op
Four organizations
Shared cache
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Bus bandwidth requirement is lower compared to shared cache approach
Why?
Dancehall
If an access is satisfied in cache, the transaction will not appear on the interconnect
and hence the bandwidth requirement of the interconnect will be less (shared L1 cache
does not have this advantage)
In distributed shared memory (DSM) cache and local memory should be used cleverly
Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all
processors can see all transactions
In DSM this is not the case
Hierarchical design
Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (28 processors) over a scalable interconnect to
form a 112p multiprocessor
IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and
L3 caches (this is called a chip multiprocessor); connect these chips over a network to
form scalable multiprocessors
Next few lectures will focus on bus-based SMPs only
Cache Coherence
Example
Assume a write-through cache i.e. every store updates the value in cache as well as in
memory
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and memory value
P0: reads x from its cache and gets the value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is
completely incoherent)
P2: writes x=10, updates its cached value and memory value
Consider the same example with a writeback cache i.e. values are written back to memory
only when the cache line is evicted from the cache
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since caches are not
write through)
The state of the line in P1 and P2 is M while the line in P0 is clean
Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from
P0 will not issue a writeback (clean lines do not need writeback)
Suppose P2 evicts the line first, and then P1
Final memory value is 7: we lost the store x=10 from P2
Lesson learned: must invalidate all cached copies before allowing a store to proceed
Writeback cache
Problem is even more complicated: stores are no longer visible to memory immediately
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state
Need to formalize the intuitive memory model
In sequential programs the order of read/write is defined by the program order; the
notion of “last write” is well-defined
For multiprocessors how do you define “last write to a memory location” in presence of
independent caches?
Within a processor it is still fine, but how do you order read/write across processors?
Definitions
Ordering memory op
A memory operation is said to complete when it has performed with respect to all processors
in the system
Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding
memory locations
Operations from the same processor complete in program order: this imposes a
partial order among the memory operations
Operations from different processors are interleaved in such a way that the program
order is maintained for each processor: memory imposes some total order (many
are possible)
Example
P0: x = 8; u = y; v = 9;
P1: r = 5; y = 4; t = v;
x = 8; u = y; r = 5; y = 4; t = v; v = 9;
x = 8; r = 5; y = 4; u = y; v = 9; t = v;
Cache coherence
Formal definition
A memory system is coherent if the values returned by reads to a memory location
during an execution of a program are such that all operations to that location can form
a hypothetical total order that is consistent with the serial order and has the following
two properties:
1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in
the total order
Two necessary features that follow from above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: Every processor should see the writes to a location in the same
order (if I see w1 before w2, you should not see w2 before w1)
Bus-based SMP
Snoopy protocols
Cache coherence protocols implemented in bus-based machines are called snoopy protocols
The processors snoop or monitor the bus and take appropriate protocol actions based
on snoop results
Cache controller now receives requests both from processor and bus
Since cache state is maintained on a per line basis that also dictates the coherence
granularity
Cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per cache line
basis
The snoop logic in each processor grabs the address from the bus and decides if any
action should be taken on the cache line containing that address (only if the line is in
cache)
State transition
Ordering memory op
Memory consistency
Consistency model
Sequential consistency
OOO and SC
SC example
Implementing SC
Write atomicity
Summary of SC
Snoopy protocols
Stores
MSI protocol
State transition
MSI protocol
M to S, or M to I?
MSI example
MESI protocol
State transition
MESI protocol
MESI example
Memory consistency
Consistency model
Sequential consistency
becomes visible
For modern microprocessors the program order is really the commit order
Can out-of-order (OOO) execution violate SC?
Yes. Need extra logic to support SC on top of OOO
OOO and SC
Suppose the load that reads w takes a miss and so w is not ready for a long time;
therefore, x=w+1 cannot complete immediately; eventually w returns with value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and
gets the old value of y (possibly from cache); eventually instructions commit in order
with x=4, r=1, y=2, w=3
So we have the following partial orders
SC example
Implementing SC
correct from the viewpoint of cache coherence, but will violate SC)
Write atomicity
P0: A=1;
Summary of SC
Program order from each processor creates a partial order among memory operations
Interleaving of these partial orders defines a total order
Sequential consistency: one of many total orders
A multiprocessor is said to be SC if any execution on this machine is SC compliant
Sufficient but not necessary conditions for SC
Issue memory operation in program order
Every processor waits for write to complete before issuing the next operation
Every processor waits for read to complete and the write that affects the returned
value to complete before issuing the next operation (important for write atomicity)
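As a concrete illustration of the conditions above, consider the classic store-buffering test sketched below (a hypothetical two-thread example): any SC total order must place one of the writes before both reads, so (r1, r2) = (0, 0) is impossible; a machine that lets a processor issue its read before its earlier write has completed can produce exactly that forbidden outcome.

int x = 0, y = 0;      /* shared, initially 0 */
int r1, r2;            /* values observed by the two threads */

void thread0(void) { x = 1; r1 = y; }
void thread1(void) { y = 1; r2 = x; }

/* Under SC the interleaved total order puts one of the writes before both
   reads, so (r1, r2) == (0, 0) can never be observed.  If a processor reads
   before its own earlier write has completed, both reads can return 0. */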
Snoopy protocols
Stores
Update-based protocols generate write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches)
Advantage of update-based protocols: sharers continue to hit in the cache while in
invalidation-based protocols sharers will miss next time they try to access the line
Advantage of invalidation-based protocols: only write misses go on bus (suited for
writeback caches) and subsequent stores to the same line are cache hits
Difficult to answer
Depends on program behavior and hardware cost
When is update-based protocol good?
What sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is invalidation-based protocol good?
Sequence of multiple writes to a cache line
Saves intermediate write transactions
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both or maybe some hybrid based on dynamic sharing pattern
of a cache line
MSI protocol
State transition
MSI protocol
M to S, or M to I?
MSI example
P1 generates BusUpgr, P0 snoops and invalidates line, memory does not respond, P1
sets state of line to M
P0 generates BusRd, P1 flushes line and goes to S state, P0 puts line in S state,
memory writes back
P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides line, P3 puts
line in cache in M state
MESI protocol
State transition
MESI protocol
If a cache line is in M state, the processor holding the line is definitely responsible for flushing it
on the next BusRd or BusRdX transaction
If a line is not in M state who is responsible?
Memory or other caches in S or E state?
Original Illinois MESI protocol assumed cache-to-cache transfer i.e. any processor in E
or S state is responsible for flushing the line
However, it requires some expensive hardware, namely, if multiple processors are
caching the line in S state who flushes it? Also, memory needs to wait to know if it
should source the line
Without cache-to-cache sharing memory always sources the line unless it is in M state
MESI example
MOESI protocol
Dragon protocol
State transition
Dragon example
Design issues
General issues
Evaluating protocols
Protocol optimizations
Cache size
Hybrid inval+update
Update-based protocol
Shared cache
MOESI protocol
Some SMPs implement MOESI today e.g., AMD Athlon MP and the IBM servers
Why is the O state needed?
O state is very similar to E state with four differences: 1. If a cache line is in O state
in some cache, that cache is responsible for sourcing the line to the next requester; 2.
The memory may not have the most up-to-date copy of the line (this implies 1); 3.
Eviction of a line in O state generates a BusWB; 4. Write to a line in O state must
generate a bus transaction
When a line transitions from M to S it is necessary to write the line back to memory
For a migratory sharing pattern (frequent in database workloads) this leads to a series
of writebacks to memory
These writebacks just keep the memory banks busy and consume memory bandwidth
Take the following example
P0 reads x, P0 writes x, P1 reads x, P1 writes x, P2 reads x, P2 writes x, …
Thus at the time of a BusRd response the memory will write the line back: one
writeback per processor handover
O state aims at eliminating all these writebacks by transitioning from M to O instead of
M to S on a BusRd/Flush
Subsequent BusRd requests are replied by the owner holding the line in O state
The line is written back only when the owner evicts it: one single writeback
State transitions pertaining to O state
I to O: not possible (or maybe; see below)
E to O or S to O: not possible
M to O: on a BusRd/Flush (but no memory writeback)
O to I: on CacheEvict/BusWB or {BusRdX,BusUpgr}/Flush
O to S: not possible (or maybe; see below)
O to E: not possible (or maybe if silent eviction not allowed)
O to M: on PrWr/BusUpgr
At most one cache can have a line in O state at any point in time
Two main design choices for MOESI
Consider the example P0 reads x, P0 writes x, P1 reads x, P2 reads x, P3 reads x, …
When P1 launches BusRd, P0 sources the line and now the protocol has two options:
1. The line in P0 goes to O and the line in P1 is filled in state S; 2. The line in P0
goes to S and the line in P1 is filled in state O i.e. P1 inherits ownership from P0
For bus-based SMPs the two choices will yield roughly the same performance
For DSM multiprocessors we will revisit this issue if time permits
According to the second choice, when P2 generates a BusRd request, P1 sources the
line and transitions from O to S; P2 becomes the new owner
Some SMPs do not support the E state
In many cases it is not helpful, only complicates the protocol
MOSI allows a compact state encoding in 2 bits
Sun WildFire uses MOSI protocol
Dragon protocol
Design issues
Dragon example
Design issues
General issues
Thus far we have assumed an atomic bus where transactions are not interleaved
In reality, high performance busses are pipelined and multiple transactions are in
progress at the same time
How do you reason about coherence?
Thus far we have assumed that the processor has only one level of cache
How to extend the coherence protocol to multiple levels of cache?
Normally, the cache coherence protocols we have discussed thus far execute only on
the outermost level of the cache hierarchy
A simpler but different protocol runs within the hierarchy to maintain coherence
We will revisit these questions soon
Evaluating protocols
In message passing machines the design of the message layer plays an important role
Similarly, cache coherence protocols are central to the design of a shared memory
multiprocessor
The protocol performance depends on an array of parameters
Experience and intuition help in determining good design points
Otherwise designers use workload-driven simulations for cost/performance analysis
Goal is to decide where to spend money, time and energy
The simulators model the underlying multiprocessor in enough detail to capture correct
performance trends as one explores the parameter space
Protocol optimizations
Cache size
With increasing problem size normally working set size also increases
More pressure on cache
With increasing number of processors working set per processor goes down
Less pressure on cache
This effect sometimes leads to superlinear speedup i.e. on P processors you get
speedup more than P
Important to design the parallel program so that the critical working sets fit in cache
Otherwise bus bandwidth requirement may increase dramatically
Impact of cache line size on true sharing heavily depends on application characteristics
Blocked matrix computations tend to have good spatial locality with shared data
because they access data in small blocks thereby exploiting temporal as well as spatial
locality
Nearest neighbor computations tend to have little spatial locality when accessing left
and right border elements
The exact proportion of various types of misses in an application normally changes with
cache size, problem size and the number of processors
With small cache, capacity miss may dominate everything else
With large cache, true sharing misses may cause the major traffic
When cache line size is increased it may seem that we bring in more data together and have
better spatial locality and reuse
Should reduce bus traffic per unit computation
However, bus traffic normally increases monotonically with cache line size
Unless we have enough spatial and temporal locality to exploit, bus traffic will increase
In most cases the bus bandwidth requirement attains a minimum at a line size larger than
the smallest possible size; this is because at very small line sizes the fixed overhead of
each bus transaction becomes too high relative to the data transferred
Large cache lines are intended to amortize the DRAM access and bus transfer latency over a
large number of data points
But false sharing becomes a problem
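A small C illustration of false sharing (the thread bodies and the 64-byte line size are assumptions, not taken from the text):

#define LINE 64                                  /* assumed cache line size in bytes */

/* counters[0] and counters[1] fall in the same cache line: two threads that
   each increment their own counter still invalidate each other's copy on
   every write, although no value is logically shared. */
long counters[2];

void worker(int id, long n) {
    for (long i = 0; i < n; i++)
        counters[id]++;                          /* line ping-pongs between the caches */
}

/* Fix in software: pad each counter out to its own cache line. */
struct padded { long value; char pad[LINE - sizeof(long)]; };
struct padded padded_counters[2];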
Hardware solutions
Coherence at subblock level: divide the cache line into smaller blocks and maintain
coherence for each of them; subblock invalidation on a write reduces chances of
coherence misses even in the presence of false sharing
Delay invalidations: send invalidations only after the writer has completed several
writes; but this directly impacts the write propagation model and hence leads to
consistency models weaker than SC
Use update-based protocols instead of invalidation-based: probably not a good idea
Hybrid inval+update
Update-based protocol
Shared cache
Advantages
If there is only one level of cache (shared by all processors) there is no need for a coherence protocol
More example
Types
Synchronization
Waiting algorithms
Implementation
Hardwired locks
Software locks
Hardware support
Atomic exchange
Fetch & op
More example
In the previous example same situation may arise even if P0 misses in the cache; the timing
of P1’s read decides whether the race happens or not
Another example
P0 writes x, P1 writes x
Suppose the race does happen i.e. P1 launches BusRdX before P0’s store commits
(Can P1 launch upgrade?)
Surely the launched cache line will have old value of x as before
Is it safe for the matching loads from P0 to use the new value of x from store buffer?
YES
What happens when P0 ’s store ultimately commits? READ-EXCLUSIVE MISS
Another example
P0 reads x, P0 writes x, P1 writes x
Suppose the race does happen i.e. P1 launches BusRdX before P0’s store commits
Surely the launched cache line will have old value of x as before
What value does P0’s load commit?
Synchronization Types
Mutual exclusion
Synchronize entry into critical sections
Normally done with locks
Point-to-point synchronization
Tell a set of processors (normally set cardinality is one) that they can proceed
Normally done with flags
Global synchronization
Bring every processor to sync
Wait at a point until everyone is there
Normally done with barriers
Synchronization
Normally a two-part process: acquire and release; acquire can be broken into two parts: intent
and wait
Intent: express intent to synchronize (i.e. contend for the lock, arrive at a barrier)
Wait: wait for your turn to synchronize (i.e. wait until you get the lock)
Release: proceed past synchronization and enable other contenders to synchronize
Waiting algorithms do not depend on the type of synchronization
Waiting algorithms
Implementation
Popular trend
Architects offer some simple atomic primitives
Library writers use these primitives to implement synchronization algorithms
Normally hardware primitives for acquire and possibly release are provided
Hard to offer hardware solutions for waiting
Also, hardwired waiting may not offer that much flexibility
Hardwired locks
Software locks
Bakery algorithm
Shared: choosing[P] = FALSE, ticket[P] = 0;
Acquire: choosing[i] = TRUE;
         ticket[i] = max(ticket[0],…,ticket[P-1]) + 1;
         choosing[i] = FALSE;
         for j = 0 to P-1
            while (choosing[j]);
            while (ticket[j] && ((ticket[j], j) < (ticket[i], i)));
         endfor
Release: ticket[i] = 0;
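A minimal C rendering of the acquire/release above, assuming sequential consistency (plain volatile accesses; a real implementation would also need memory fences) and a hypothetical NPROC constant:

#define NPROC 4                     /* hypothetical number of contenders */

volatile int choosing[NPROC];       /* all FALSE (0) initially */
volatile int ticket[NPROC];         /* all 0 initially */

void bakery_acquire(int i) {
    choosing[i] = 1;
    int max = 0;
    for (int j = 0; j < NPROC; j++)               /* ticket[i] = max(ticket[0..P-1]) + 1 */
        if (ticket[j] > max) max = ticket[j];
    ticket[i] = max + 1;
    choosing[i] = 0;
    for (int j = 0; j < NPROC; j++) {
        while (choosing[j]);                      /* wait while j is still choosing its ticket */
        while (ticket[j] &&                       /* lexicographic (ticket, id) comparison */
               (ticket[j] < ticket[i] ||
                (ticket[j] == ticket[i] && j < i)));
    }
}

void bakery_release(int i) {
    ticket[i] = 0;
}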
Does it work for multiprocessors?
Assume sequential consistency
Performance issues related to coherence?
Too much overhead: need faster and simpler lock algorithms
Need some hardware support
Hardware support
Atomic exchange
Test & set loads the current lock value into a register and always sets the location to 1
A general exchange allows swapping in an arbitrary register value
A similar type of instruction is fetch & op
Fetches the memory location into a register and applies op to the memory location
Op can be one of a set of supported operations e.g. add, increment, decrement, store etc.
In test & set, op = set
Fetch & op
Possible to implement a lock with fetch & clear then add (used to be supported in BBN
Butterfly 1)
Ticket lock
Array-based lock
RISC processors
LL/SC
Speculative SC?
Point-to-point synch.
In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported: every such
instruction will generate a transaction (may be good or bad depending on
the support in the memory controller; will discuss later)
Let us assume that the lock location is cacheable and is kept coherent
Every invocation of test & set must generate a bus transaction; Why? What is the
transaction? What are the possible states of the cache line holding lock_addr?
Therefore all lock contenders repeatedly generate bus transactions even if someone is
still in the critical section and is holding the lock
Can we improve this?
Test & set with backoff
delay = k
Lock: ts   register, lock_addr    /* atomic test & set: old value in register, location set to 1 */
      bez  register, Enter_CS     /* old value was 0: lock acquired */
      pause (delay)               /* can be simulated as a timed loop */
      delay = delay*c             /* exponential backoff */
      j    Lock                   /* retry */
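The same loop sketched in C, using GCC's __sync_lock_test_and_set builtin in place of the ts instruction and a crude timed loop as the delay; k and c are tuning parameters, and the names are illustrative:

/* Test & set lock with exponential backoff (sketch).
   __sync_lock_test_and_set atomically stores 1 and returns the old value. */
void ts_lock_backoff(volatile int *lock_addr, int k, int c) {
    int delay = k;
    while (__sync_lock_test_and_set(lock_addr, 1) != 0) {   /* ts register, lock_addr */
        for (volatile int i = 0; i < delay; i++);           /* pause(delay) as a timed loop */
        delay = delay * c;                                  /* exponential backoff */
    }
    /* Enter_CS: lock acquired */
}

void ts_unlock(volatile int *lock_addr) {
    __sync_lock_release(lock_addr);                         /* store 0 to release the lock */
}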
In the worst case everyone will try to enter the CS at the same time
First time P transactions for ts and one succeeds; every other processor suffers a miss
on the load in Test loop; then loops from cache
The lock-holder when unlocking generates an upgrade (why?) and invalidates all
others
All other processors suffer read miss and get value zero now; so they break Test loop
and try ts and the process continues until everyone has visited the CS
For distributed shared memory the situation is worse because each invalidation
becomes a separate message (more later)
Ticket lock
Initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM
depending on implementation of fetch & inc)
But the waiting algorithm still suffers from 0.5P^2 messages asymptotically
Researchers have proposed proportional backoff i.e. in the wait loop put a delay
proportional to the difference between ticket value and last read release_count
Latency and storage-wise better than Bakery
Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery)
Guaranteed fairness: the ticket value induces a FIFO queue
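A ticket lock sketch in C, using GCC's __sync_fetch_and_add as fetch & inc; the proportional backoff mentioned above is left out for brevity, and the field names are illustrative:

/* Ticket lock: fetch & inc hands out tickets, release_count says whose turn it is. */
typedef struct {
    volatile unsigned int next_ticket;     /* incremented atomically by contenders */
    volatile unsigned int release_count;   /* incremented by the holder on unlock */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned int my_ticket = __sync_fetch_and_add(&l->next_ticket, 1);  /* fetch & inc */
    while (l->release_count != my_ticket);   /* spin until it is my turn (FIFO order) */
}

void ticket_release(ticket_lock_t *l) {
    l->release_count++;                      /* only the lock holder writes this */
}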
Array-based lock
In distributed shared memory a contender's assigned spin location may reside in remote
memory: on acquire it must take a remote miss; allocate P pages and let each
processor loop on one bit in a page? Too much wastage; better solution: MCS lock
(Mellor-Crummey & Scott)
Correctness concerns
Make sure to handle corner cases such as determining if someone is waiting on the
next location (this must be an atomic operation) while unlocking
Remember to reset your index location to zero while unlocking
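A sketch of an array-based queue lock in C: each contender obtains a distinct slot with fetch & inc and spins on its own padded location, so a release touches only the next waiter's line. MAXPROC, the 64-byte padding, and the function names are assumptions of this sketch:

#define MAXPROC 16                     /* assumed upper bound on contenders */
#define LINE    64                     /* assumed cache line size */

typedef struct {
    struct { volatile int must_wait; char pad[LINE - sizeof(int)]; } slot[MAXPROC];
    volatile unsigned int next_slot;   /* handed out with fetch & inc */
} array_lock_t;

/* Initialization: slot[0].must_wait = 0, all other slots = 1, next_slot = 0. */

unsigned int array_acquire(array_lock_t *l) {
    unsigned int my = __sync_fetch_and_add(&l->next_slot, 1) % MAXPROC;
    while (l->slot[my].must_wait);     /* spin on my own (padded) location */
    return my;                         /* remember the index for release */
}

void array_release(array_lock_t *l, unsigned int my) {
    l->slot[my].must_wait = 1;                     /* reset my location for later reuse */
    l->slot[(my + 1) % MAXPROC].must_wait = 0;     /* wake up the next waiter */
}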
RISC processors
LL/SC
Load linked behaves just like a normal load with some extra tricks
Puts the loaded value in destination register as usual
Sets a load_linked bit residing in cache controller to 1
Puts the address in a special lock_address register residing in the cache controller
Store conditional is a special store
sc reg, addr stores value in reg to addr only if load_linked bit is set; also it copies the
value in load_linked bit to reg and resets load_linked bit
Any intervening “operation” (e.g., bus transaction or cache replacement) to the cache line
containing the address in lock_address register clears the load_linked bit so that subsequent
sc fails
Compare & swap: Compare with r1, swap r2 and memory location (here we keep on trying
until comparison passes)
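For instance, fetch & inc can be built from compare & swap with exactly this retry loop; the sketch below uses GCC's __sync_val_compare_and_swap, and an LL/SC pair would play the role of the load and the conditional store:

/* fetch & inc synthesized from compare & swap: retry until nobody intervened
   between reading the old value and installing the new one. */
int fetch_and_inc(volatile int *addr) {
    int old;
    do {
        old = *addr;                                               /* like LL: read current value */
    } while (__sync_val_compare_and_swap(addr, old, old + 1) != old);   /* like SC: install if unchanged */
    return old;
}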
Speculative SC?
Point-to-point synch.
P0: A = 1; flag = 1;
P1: while (!flag); print A;
Barrier
Centralized barrier
Sense reversal
Centralized barrier
Tree barrier
Hardware support
Hardware barrier
Speculative synch.
Why is it good?
Why is it correct?
Performance concerns
Barrier
Centralized barrier
struct bar_type {
   int counter;
   struct lock_type lock;
   int flag = 0;
} bar_name;

BARINIT (bar_name) {
   LOCKINIT(bar_name.lock);
   bar_name.counter = 0;
}

BARRIER (bar_name, P) {
   int my_count;

   LOCK (bar_name.lock);
   if (!bar_name.counter) {
      bar_name.flag = 0;   /* first one */
   }
   my_count = ++bar_name.counter;
   UNLOCK (bar_name.lock);
   if (my_count == P) {
      bar_name.counter = 0;
      bar_name.flag = 1;   /* last one */
   }
   else {
      while (!bar_name.flag);
   }
}
Sense reversal
The last implementation fails to work for two consecutive barrier invocations
Need to prevent a process from entering a barrier instance until all have left the
previous instance
Reverse the sense of a barrier i.e. every other barrier will have the same sense:
basically attach parity or sense to a barrier
BARRIER (bar_name, P) {
   local_sense = !local_sense;   /* this is private per processor */
   LOCK (bar_name.lock);
   bar_name.counter++;
   if (bar_name.counter == P) {
      UNLOCK (bar_name.lock);
      bar_name.counter = 0;
      bar_name.flag = local_sense;
   }
   else {
      UNLOCK (bar_name.lock);
      while (bar_name.flag != local_sense);
   }
}
Centralized barrier
Tree barrier
TreeBarrier (pid, P) {
   unsigned int i, mask;
   for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
      while (!flag[pid][i]);
      flag[pid][i] = 0;
   }
   if (pid < (P - 1)) {
      flag[pid + mask][i] = 1;
      while (!flag[pid][MAX - 1]);
      flag[pid][MAX - 1] = 0;
   }
   for (mask >>= 1; mask > 0; mask >>= 1) {
      flag[pid - mask][MAX - 1] = 1;
   }
}
Hardware support
Read broadcast
Possible to reduce the number of bus transactions from P-1 to 1 in the best case
A processor seeing a read miss to flag location (possibly from a fellow processor)
backs off and does not put its read miss on the bus
Every processor picks up the read reply from the bus and the release completes with
one bus transaction
Needs special hardware/compiler support to recognize these flag addresses and resort
to read broadcast
Hardware barrier
Speculative synch.
Speculative synchronization
Basic idea is to introduce speculation in the execution of critical sections
Assume that no other processor will have conflicting data accesses in the critical
section and hence don’t even try to acquire the lock
Just venture into the critical section and start executing
Note the difference between this and speculative execution of the critical section due to
branch speculation on the branch that follows the store conditional (SC) in the lock acquire:
there you still contend for the lock, generating network transactions
Martinez and Torrellas. In ASPLOS 2002.
Rajwar and Goodman. In ASPLOS 2002.
We will discuss Martinez and Torrellas
Why is it good?
For certain input values it may happen that the processes could actually update the
hash table concurrently
Speculative locks
Every processor comes to the critical section and tries to acquire the lock
One of them succeeds and the rest fail
The successful processor becomes the safe thread
The failed ones don’t retry but venture into the critical section speculatively as if they
have the lock; at this point a speculative thread also takes a checkpoint of its register
state in case a rollback is needed
The safe thread executes the critical section as usual
The speculative threads are allowed to consume values produced by the safe thread
but not by the sp. threads
All stores from a speculative thread are kept inside its cache hierarchy in a special
“speculative modified” state; these lines cannot be sent to memory until it is known to
be safe; if such a line is replaced from cache either it can be kept in a small buffer or
the thread can be stalled
Speculative locks (continued)
If a speculative thread receives a request for a cache line that is in speculative M
state, that means there is a data race inside the critical section and by design the
receiver thread is rolled back to the beginning of critical section
Why can’t the requester thread be rolled back?
In summary, the safe thread is never squashed and the speculative threads are not
squashed if there is no cross-thread data race
If a speculative thread finishes executing the critical section without getting squashed,
it still must wait for the safe thread to finish the critical section before committing the
speculative state (i.e. changing speculative M lines to M); why?
Speculative locks (continued)
Upon finishing the critical section, a speculative thread can continue executing beyond
the CS, but still remaining in speculative mode
When the safe thread finishes the CS all speculative threads that have already
completed CS, can commit in some non-deterministic order and revert to normal
execution
The speculative threads that are still inside the critical section remain speculative; a
dedicated hardware unit elects one of them the lock owner and that becomes the safe
non-speculative thread; the process continues
Clearly, under favorable conditions speculative synchronization can reduce lock
contention enormously
Why is it correct?
Performance concerns
What if I pass a hint via the compiler (say, a single bit in each branch instruction) to the
branch predictor asking it to always predict not taken for this branch?
Isn’t it achieving the same effect as speculative flag, but with a much simpler
technique? No.
Agenda
Correctness goals
A simple design
Cache controller
Snoop logic
Writebacks
A simple design
Inherently non-atomic
Write serialization
Fetch deadlock
Livelock
Starvation
More on LL/SC
Multi-level caches
Agenda
Goal is to understand what influences the performance, cost and scalability of SMPs
Details of physical design of SMPs
At least three goals of any design: correctness, performance, low hardware complexity
Performance gains are normally achieved by pipelining memory transactions and
having multiple outstanding requests
These performance optimizations occasionally introduce new protocol races involving
transient states leading to correctness issues in terms of coherence and consistency
Correctness goals
A simple design
Cache controller
Snoop logic
Writebacks
A simple design
Inherently non-atomic
Even though the bus is atomic, a complete protocol transaction involves quite a few steps
which together form a non-atomic transaction
Issuing processor request
Looking up cache tags
Arbitrating for bus
Snoop action in other cache controller
Refill in requesting cache controller at the end
Different requests from different processors may be in a different phase of a transaction
This makes a protocol transition inherently non-atomic
Consider an example
P0 and P1 have cache line C in shared state
Both proceed to write the line
Both cache controllers look up the tags, put a BusUpgr into the bus request queue, and
start arbitrating for the bus
P1 gets the bus first and launches its BusUpgr
P0 observes the BusUpgr and now it must invalidate C in its cache and change the
request type to BusRdX
So every cache controller needs to do an associative lookup of the snoop address
against its pending request queue and depending on the request type take appropriate
actions
One way to reason about the correctness is to introduce transient states
Possible to think of the last problem as the line C being in a transient S→M state
On observing a BusUpgr or BusRdX, this state transitions to I→M which is also
transient
The line C goes to stable M state only after the transaction completes
These transient states are not really encoded in the state bits of a cache line because
at any point in time there will be a small number of outstanding requests from a
particular processor (today the maximum I know of is 16)
These states are really determined by the state of an outstanding line and the state of
the cache controller
Write serialization
Fetch deadlock
Livelock
Starvation
More on LL/SC
We have seen that both LL and SC may suffer from cache misses (a read followed by an
upgrade miss)
Is it possible to save one transaction?
What if I design my cache controller in such a way that it can recognize LL instructions
and launch a BusRdX instead of BusRd?
This is called Read-for-Ownership (RFO); also used by Intel atomic xchg instruction
Nice idea, but you have to be careful
By doing this you have just enormously increased the probability of a livelock: before
the SC executes there is a high probability that another LL will take away the line
Possible solution is to buffer incoming snoop requests until the SC completes (buffer
space is proportional to P); may introduce new deadlock cycles (especially for modern
non-atomic busses)
Multi-level caches
We have talked about multi-level caches and the involved inclusion property
Multiprocessors create new problems related to multi-level caches
A bus snoop result may be relevant to inner levels of cache e.g., bus transactions are
not visible to the first level cache controller
Similarly, modifications made in the first level cache may not be visible to the second
level cache controller which is responsible for handling bus requests
Inclusion property makes it easier to maintain coherence
Since L1 cache is a subset of L2 cache a snoop miss in L2 cache need not be sent to
L1 cache
Recap of inclusion
L2 to L1 interventions
Invalidation acks?
Intervention races
Split-transaction bus
New issues
Snoop results
Recap of inclusion
A processor read
Looks up L1 first and in case of miss goes to L2, and finally may need to launch a
BusRd request if it misses in L2
Finally, the line is in S state in both L1 and L2
A processor write
Looks up L1 first and if it is in I state sends a ReadX request to L2 which may have
the line in M state
In case of L2 hit, the line is filled in M state in L1
In case of L2 miss, if the line is in S state in L2 it launches BusUpgr; otherwise it
launches BusRdX; finally, the line is in state M in both L1 and L2
If the line is in S state in L1, it sends an upgrade request to L2 and either there is an
L2 hit or L2 just conveys the upgrade to bus (Why can’t it get changed to BusRdX?)
L1 cache replacement
Replacement of a line in S state may or may not be conveyed to L2
Replacement of a line in M state must be sent to L2 so that it can hold the most up-to-
date copy
The line is in I state in L1 after replacement, the state of line remains unchanged in L2
L2 cache replacement
Replacement of a line in S state may or may not generate a bus transaction; it must
send a notification to the L1 caches so that they can invalidate the line to maintain
inclusion
Replacement of a line in M state first asks the L1 cache to send all the relevant L1
lines (these are the most up-to-date copies) and then launches a BusWB
The state of line in both L1 and L2 is I after replacement
Replacement of a line in E state from L1?
Replacement of a line in E state from L2?
Replacement of a line in O state from L1?
Replacement of a line in O state from L2?
In summary
A line in S state in L2 may or may not be in L1 in S state
A line in M state in L2 may or may not be in L1 in M state; Why? Can it be in S state?
A line in I state in L2 must not be present in L1
BusRd snoop
Look up L2 cache tag; if in I state no action; if in S state no action; if in M state assert
wired-OR M line, send read intervention to L1 data cache, L1 data cache sends lines
back, L2 controller launches line on bus, both L1 and L2 lines go to S state
BusRdX snoop
Look up L2 cache tag; if in I state no action; if in S state invalidate and also notify L1;
if in M state assert wired-OR M line, send readX intervention to L1 data cache, L1 data
cache sends lines back, L2 controller launches line on bus, both L1 and L2 lines go to
I state
BusUpgr snoop
Similar to BusRdX without the cache line flush
L2 to L1 interventions
Invalidation acks?
Intervention races
AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels
of caches instead of inclusive
Definitely provides you much better utilization of on-chip caches since there is no
duplication
But complicates many issues related to coherence
The uniprocessor protocol is to refill requested lines directly into L1 without placing a
copy in L2; only on an L1 eviction put the line into L2; on an L1 miss look up L2 and in
case of L2 hit replace line from L2 and put it in L1 (may have to replace multiple L1
lines to accommodate the full L2 line; not sure what K8 does: possible to maintain
inclusion bit per L1 line sector in L2 cache)
For multiprocessors one solution could be to have one snoop engine per cache level
and a tournament logic that selects the successful snoop result
Split-transaction bus
New issues
turnaround cycle
Essentially two main buses and various control wires for snoop results, flow control etc.
Address bus: five cycle arbitration, used during request
Data bus: five cycle arbitration, five cycle transfer, used during response
Three different transactions may be in one of these three phases at any point in time
A request table entry is freed when the response is observed on the bus
Snoop results
Conflict resolution
Write serialization
Another example
In-order response
Multi-level caches
Dependence graph
SGI Challenge
Sun Enterprise
Conflict resolution
Write serialization
In a split-transaction bus setting, the request table provides sufficient support for write
serialization
Requests to the same cache line are not allowed to proceed at the same time
A read to a line after a write to the same line can be launched only after the write
response phase has completed; this guarantees that the read will see the new value
A write after a read to the same line can be started only after the read response has
completed; this guarantees that the value of the read cannot be altered by the value
written
Sequential consistency (SC) requires write atomicity i.e. total order of all writes seen by all
processors should be identical
Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to
the caches, you have to be careful
Another example
In-order response
it should source the line for this request after its own write is completed
The performance penalty may be huge
Essentially because of the memory
Consider a situation where three requests are pending to cache lines A, B, C in that
order
A and B map to the same memory bank while C is in a different bank
Although the response for C may be ready long before that of B, it cannot get the bus
Multi-level caches
Split-transaction bus makes the design of multi-level caches a little more difficult
The usual design is to have queues between levels of caches in each direction
How do you size the queues? Between processor and L1 one buffer is sufficient
(assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-to-
L1 needs P buffers (why?), L1 to processor needs one buffer
With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1
have one queue entry each, there is a request in L1-to-L2 queue and there is also an
intervention in L2-to-L1 queue; clearly L1 cannot pick up the intervention because it
does not have space to put the reply in L1-to-L2 queue while L2 cannot pick up the
request because it might need space in L2-to-L1 queue in case of an L2 hit
Formalizing the deadlock with dependence graph
There are four types of transactions in the cache hierarchy: 1. Processor requests
(outbound requests), 2. Responses to processor requests (inbound responses), 3.
Interventions (inbound requests), 4. Intervention responses (outbound responses)
Processor requests need space in L1-to-L2 queue; responses to processors need space
in L2-to-L1 queue; interventions need space in L2-to-L1 queue; intervention responses
need space in L1-to-L2 queue
Thus a message in L1-to-L2 queue may need space in L2-to-L1 queue (e.g. a processor
request generating a response due to L2 hit); also a message in L2-to-L1 queue may
need space in L1-to-L2 queue (e.g. an intervention response)
This creates a cycle in queue space dependence graph
Dependence graph
Multi-level caches
In summary
L2 cache controller refuses to drain L1-to-L2 queue if there is no space in L2-to-L1
queue; this is rather conservative because the message at the head of L1-to-L2 queue
may not need space in L2-to-L1 queue e.g., in case of L2 miss or if it is an intervention
reply; but after popping the head of L1-to-L2 queue it is impossible to backtrack if the
message does need space in L2-to-L1 queue
Similarly, L1 cache controller refuses to drain L2-to-L1 queue if there is no space in L1-
to-L2 queue
How do we break this cycle?
Observe that responses for processor requests are guaranteed not to generate any more
messages and intervention requests do not generate new requests, but can only
generate replies
Solving the queue deadlock
Introduce one more queue in each direction i.e. have a pair of queues in each direction
L1-to-L2 processor request queue and L1-to-L2 intervention response queue
Similarly, L2-to-L1 intervention request queue and L2-to-L1 processor response queue
Now L2 cache controller can serve L1-to-L2 processor request queue as long as there is
space in L2-to-L1 processor response queue, but there is no constraint on L1 cache
controller for draining L2-to-L1 processor response queue
Similarly, L1 cache controller can serve L2-to-L1 intervention request queue as long as
there is space in L1-to-L2 intervention response queue, but L1-to-L2 intervention
response queue will drain as soon as bus is granted
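A sketch in C of the resulting four-queue organization and the conservative drain rules; all names and the ring-buffer representation are illustrative, not part of any particular design:

/* Two queues per direction (sketch).  Requests may need a slot for a response
   in the opposite direction; responses never generate further messages, so
   they can always drain. */
typedef struct { int type; unsigned long addr; } msg_t;
typedef struct { msg_t *buf; int head, tail, size; } queue_t;

queue_t l1_to_l2_request;            /* processor requests: may produce a processor response */
queue_t l1_to_l2_intervention_resp;  /* intervention responses: drain as soon as the bus is granted */
queue_t l2_to_l1_intervention;       /* intervention requests: may produce an intervention response */
queue_t l2_to_l1_response;           /* processor responses: L1 always drains these */

static int full(const queue_t *q) { return (q->tail + 1) % q->size == q->head; }

/* L2 pops a processor request only if a response slot is guaranteed. */
int l2_can_pop_request(void)      { return !full(&l2_to_l1_response); }

/* L1 pops an intervention only if an intervention-response slot is guaranteed. */
int l1_can_pop_intervention(void) { return !full(&l1_to_l2_intervention_resp); }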
Dependence graph
Possible to combine PR and IY into a supernode of the graph and still be cycle-free
Leads to one L1 to L2 queue
Similarly, possible to combine IR and PY into a supernode
Leads to one L2 to L1 queue
Cannot do both
Leads to cycle as already discussed
Bottomline: need at least three queues for two-level cache hierarchy
SGI Challenge
Sun Enterprise
Split-transaction, 256 bits data, 41 bits address, 83.5 MHz (compare to 47.6 MHz of SGI
Powerpath-2)
Supports 16 boards
112 outstanding transactions (up to 7 from each board)
Special Topics
Virtual indexing
TLB coherence
TLB shootdown
Snooping on a ring
Scaling bandwidth
AMD Opteron
Opteron servers
Recall that to have concurrent accesses to TLB and cache, L1 caches are often made
virtually indexed
Can read the physical tag and data while the TLB lookup takes place
Later compare the tag for hit/miss detection
How does it impact the functioning of coherence protocols and snoop logic?
Even for uniprocessors there is the synonym problem
Two different virtual addresses may map to the same physical page frame
One simple solution may be to flush all cache lines mapped to a page frame at the
time of replacement
But this clearly prevents page sharing between two processes
Virtual indexing
Normally the L1 cache is designed to be virtually indexed and other levels are
physically indexed
L2 sends interventions to L1 by communicating the PA
L1 must determine the virtual index from that to access the cache: dual tags are
sufficient for this purpose
TLB coherence
A page table entry (PTE) may be held in multiple processors in shared memory because all of
them access the same shared page
A PTE may get modified when the page is swapped out and/or access permissions are
changed
Must tell all processors having this PTE to invalidate
How to do it efficiently?
No TLB: virtually indexed virtually tagged L1 caches
On L1 miss directly access PTE in memory and bring it to cache; then use normal
cache coherence because the PTEs also reside in the shared memory segment
On page replacement the page fault handler can flush the cache line containing the
replaced PTE
Too impractical: fully virtual caches are rare, still uses a TLB for upper levels (Alpha
21264 instruction cache)
Hardware solution
Extend snoop logic to handle TLB coherence
PowerPC family exercises a tlbie instruction (TLB invalidate entry)
When OS modifies a PTE it puts a tlbie instruction on bus
Snoop logic picks it up and invalidates the TLB entry if present in all processors
This is well suited for bus-based SMPs, but not for DSMs because broadcast in a
large-scale machine is not good
TLB shootdown
Snooping on a ring
Length of the bus limits the frequency at which it can be clocked which in turn limits the
bandwidth offered by the bus leading to a limited number of processors
A ring interconnect provides a better solution
Connect a processor only to its two neighbors
Short wires, much higher switching frequency, better bandwidth, more processors
Each node has private local memory (more like a distributed shared memory
multiprocessor)
Every cache line has a home node i.e. the node where the memory contains this line:
can be determined by upper few bits of the PA
Transactions traverse the ring node by node
Snoop mechanism
When a transaction passes by the ring interface of a node it snoops the transaction,
takes appropriate coherence actions, and forwards the transaction to its neighbor if
necessary
The home node also receives the transaction eventually and let’s assume that it has a
dirty bit associated with every memory line (otherwise you need a two-phase protocol)
A request transaction is removed from the ring when it comes back to the requester
(serves as an acknowledgment that every node has seen the request)
The ring is essentially divided into time slots where a node can insert new request or
response; if there is no free time slot it must wait until one passes by: called a
slotted ring
The snoop logic must be able to finish coherence actions for a transaction before the next
time slot arrives
The main problem of a ring is the end-to-end latency, since the transactions must traverse
hop-by-hop
Serialization and sequential consistency is trickier
The order of two transactions may be differently seen by two processors if the source
of one transaction is between the two processors
The home node can resort to NACKs if it sees conflicting outstanding requests
Introduces many races in the protocol
Scaling bandwidth
Data bandwidth
Make the bus wider: costly hardware
Replace bus by point-to-point crossbar: since only the address portion of a transaction
is needed for coherence, the data transaction can be directly between source and
destination
Add multiple data busses
Snoop or coherence bandwidth
This is determined by the number of snoop actions that can be executed in unit time
Having concurrent non-conflicting snoop actions definitely helps improve the protocol
throughput
Multiple address busses: a separate snoop engine is associated with each bus on each
node
Order the address busses logically to define a partial order among concurrent requests
so that these partial orders can be combined to form a total order
AMD Opteron
Each node contains an x86-64 core, 64 KB L1 data and instruction caches, 1 MB L2 cache,
on-chip integrated memory controller, three fast routing links called HyperTransport, local DDR
memory
Glueless MP: just connect 8 Opteron chips via HT to design a distributed shared memory
multiprocessor
L2 cache supports 10 outstanding misses
Integrated memory controller and north bridge functionality help a lot
Can clock the memory controller at processor frequency (2 GHz)
No need to have a cumbersome motherboard; just buy the Opteron chip and connect it
to a few peripherals (system maintenance is much easier)
Overall, improves performance by 20-25% over Athlon
Snoop throughput and bandwidth is much higher since the snoop logic is clocked at 2
GHz
Integrated HyperTransport provides very high communication bandwidth
Point-to-point links, split-transaction and full duplex (bidirectional links)
On each HT link you can connect a processor or I/O
Opteron servers
Exercise : 2
1. [30 points] For each of the memory reference streams given in the following,
compare the cost of executing it on a bus-based SMP that supports (a) MESI
protocol without cache-to-cache sharing, and (b) Dragon protocol. A read from
processor N is denoted by rN while a write from processor N is denoted by wN.
Assume that all caches are empty to start with and that cache hits take a single
cycle, misses requiring upgrade or update take 60 cycles, and misses requiring
whole block transfer take 90 cycles. Assume that all caches are writeback.
Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3
Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1
Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3
2. [15 points] (a) As cache miss latency increases, does an update protocol
become more or less preferable as compared to an invalidation based protocol?
Explain.
(b) In a multi-level cache hierarchy, would you propagate updates all the way to
the first-level cache? What are the alternative design choices?
4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-
cache sharing). Each processor tries to acquire a test-and-set lock to gain
access to a null critical section. Assume that test-and-set instructions always go
on the bus and they take the same time as the normal read transactions. The
initial condition is such that processor 1 has the lock and processors 2, 3, 4 are
spinning on their caches waiting for the lock to be released. Every processor
gets the lock once, unlocks, and then exits the program. Consider the bus
transactions related to the lock/unlock operations only.
(a) What is the least number of transactions executed to get from the initial to
the final state? [10 points]
(c) Answer the above two questions if the protocol is changed to Dragon. [15
points]
5. [30 points] Answer the above question for a test-and-test-and-set lock for a
16-processor SMP. The initial condition is such that the lock is released and no
one has got the lock yet.
6. [10 points] If the lock variable is not allowed to be cached, how will the traffic
of a test-and-set lock compare against that of a test-and-test-and-set lock?
7. [15 points] You are given a bus-based shared memory machine. Assume that
the processors have a cache block size of 32 bytes and A is an array of
integers (four bytes each). You want to parallelize the following loop.
(b) Under what conditions would it be better to use a statically scheduled loop?
(c) For a dynamically scheduled inner loop, how many iterations should a
processor pick each time?
struct bar_struct {
   LOCKDEC(lock);
   int count;      // Initialized to zero
   int releasing;  // Initialized to zero
} bar;

   else {
      UNLOCK(bar.lock);
      while (!bar.releasing);
      LOCK(bar.lock);
      bar.count--;
      if (bar.count == 0) {
         bar.releasing = 0;
      }
   }
   UNLOCK(bar.lock);
}
Solution of Exercise : 2
1. [30 points] For each of the memory reference streams given in the following,
compare the cost of executing it on a bus-based SMP that supports (a) MESI
protocol without cache-to-cache sharing, and (b) Dragon protocol. A read from
processor N is denoted by rN while a write from processor N is denoted by wN.
Assume that all caches are empty to start with and that cache hits take a single
cycle, misses requiring upgrade or update take 60 cycles, and misses requiring
whole block transfer take 90 cycles. Assume that all caches are writeback.
Solution:
Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3
(a) MESI: read miss, hit, hit, hit, read miss, upgrade, hit, hit, read miss,
upgrade, hit, hit. Total latency = 90+1+1+1+2*(90+60+1+1) = 397 cycles
(b) Dragon: read miss, hit, hit, hit, read miss, update, hit, update, read miss,
update, hit, update. Total latency = 90+1+1+1+2*(90+60+1+60) = 515 cycles
Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1
(a) MESI: read miss, read miss, read miss, upgrade, readX, readX, read miss,
read miss, hit, upgrade, readX. Total latency =
90+90+90+60+90+90+90+90+1+60+90 = 841 cycles
(b) Dragon: read miss, read miss, read miss, update, update, update, hit, hit,
hit, update, update. Total latency = 90+90+90+60+60+60+1+1+1+60+60=573
cycles
Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3
(a) MESI: read miss, read miss, read miss, hit, upgrade, hit, hit, hit, readX,
readX. Total latency = 90+90+90+1+60+1+1+1+90+90 = 514 cycles
(b) Dragon: read miss, read miss, read miss, hit, update, update, update,
update, update, update. Total latency=90+90+90+1+60*6=631 cycles
2. [15 points] (a) As cache miss latency increases, does an update protocol
become more or less preferable as compared to an invalidation based protocol?
Explain.
(b) In a multi-level cache hierarchy, would you propagate updates all the way to
the first-level cache? What are the alternative design choices?
Solution: Observe that if u sees the new value A, v does not see the new
value of B, and w sees that new value of B, then x cannot see the old value of
A. So (u, v, w, x) = (1, 0, 1, 0) is not allowed. Similarly, if w sees the new value
of B, x sees the old value of A, u sees the new value of A, then v cannot see
the old value B. So (1, 0, 1, 0) is not allowed, which is already eliminated in the
above case. All other 15 combinations are possible.
Solution: If v=A happens before A=u+1, then the final (u, v, A) = (0, 0, 1).
If v=A happens after A=u+1, then the final (u, v, A) = (0, 1, 2).
Since u and v are symmetric, we will also observe the outcome (1, 0, 2) in some
cases.
4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-
cache sharing). Each processor tries to acquire a test-and-set lock to gain
access to a null critical section. Assume that test-and-set instructions always go
on the bus and they take the same time as the normal read transactions. The
initial condition is such that processor 1 has the lock and processors 2, 3, 4 are
spinning on their caches waiting for the lock to be released. Every processor
gets the lock once, unlocks, and then exits the program. Consider the bus
transactions related to the lock/unlock operations only.
(a) What is the least number of transactions executed to get from the initial to
the final state? [10 points]
transaction), 4 locks, 4 unlocks (no transaction). Notice that in the best possible
scenario, the timings will be such that when someone is in the critical section no
one will even attempt a test-and-set. So when the lock holder unlocks, the
cache block will still be in its cache in M state.
(c) Answer the above two questions if the protocol is changed to Dragon. [15
points]
5. [30 points] Answer the above question for a test-and-test-and-set lock for a
16-processor SMP. The initial condition is such that the lock is released and no
one has got the lock yet.
Solution: MESI:
Best case analysis: 1 locks, 1 unlocks, 2 locks, 2 unlocks, ... This involves
exactly 16 transactions (unlocks will not generate any transaction in the best
case timing).
Worst case analysis: Done in the class. The first round will have (16 + 15 + 1 +
15) transactions. The second round will have (15 + 14 + 1 + 14) transactions.
The last but one round will have (2 + 1 + 1 + 1) transactions and the last round
will have one transaction (just locking of the last processor). The last unlock will
not generate any transaction. If you add these up, you will get (1.5P+2)(P-1) +
1. For P=16, this is 391.
Dragon:
Best case analysis: Now both unlocks and locks will generate updates. So the
total number of transactions would be 32.
Worst case analysis: The test & set attempts in each round will generate
updates. The unlocks will also generate updates. Everything else will be cache
hits. So the number of transactions is (16+1)+(15+1)+...+(1+1) = 152.
6. [10 points] If the lock variable is not allowed to be cached, how will the traffic
of a test-and-set lock compare against that of a test-and-test-and-set lock?
7. [15 points] You are given a bus-based shared memory machine. Assume that
the processors have a cache block size of 32 bytes and A is an array of
integers (four bytes each). You want to parallelize the following loop.
(b) Under what conditions would it be better to use a statically scheduled loop?
(c) For a dynamically scheduled inner loop, how many iterations should a
processor pick each time?
struct bar_struct {
LOCKDEC(lock);
int count; // Initialized to zero
int releasing; // Initialized to zero
} bar;
}
}
UNLOCK(bar.lock);
}
Solution: There are too many problems with this implementation. I will not list
them here. The correct barrier code is given below which requires addition of
one line of code. Notice that the releasing variable nicely captures the notion of
sense reversal.
Scalable Multiprocessors
Agenda
Basics of scalability
Bandwidth scaling
Latency scaling
Cost scaling
Physical scaling
IBM SP-2
Programming model
Common challenges
Spectrum of designs
Physical DMA
nCUBE/2
User-level ports
User-level handling
Message co-processor
Intel Paragon
Meiko CS-2
Scalable synchronization
Agenda
Basics of scalability
Programming models
Physical DMA
User-level networking
Dedicated message processing
Shared physical address
Cluster of workstations (COWs) and Network of workstations (NOWs)
Scaling parallel software
Scalable synchronization
Basics of scalability
Bandwidth scaling
Latency scaling
Cost scaling
Physical scaling
IBM SP-2
Programming model
Common challenges
Spectrum of designs
Physical DMA
A reserved area in physical memory is used for sending and receiving messages
After setting up the memory region the processor takes a trap to the kernel
The interrupt handler typically copies the data into kernel area so that it can be
manipulated
Finally, kernel instructs the DMA device to push the message into the network via the
physical address of the message (typically called DMA channel address)
At the destination the DMA device will deposit the message in a predefined physical
memory area and generates an interrupt for the processor
The interrupt handler now can copy the message into kernel area and inspect and
parse the message to take appropriate actions (this is called blind deposit)
nCUBE/2
User-level ports
Network ports and status registers are memory-mapped in user address space
User program can initiate a transaction by composing the message and writing to the
status registers
Communication assist does the protection check and pushes the message into the
physical medium
A message at the destination can sit in the input queue until the user program pops it
off
A system message generates an interrupt through the destination assist, but user
messages do not require OS intervention
Problem with context switch: messages are now really part of the process state; need
to save and restore them
Thinking Machines CM-5 has an outbound message latency of 1.5 µs and inbound 1.6 µs
User-level handling
Instead of mapping the network ports to user memory, make them processor registers
Even faster messaging
Communication assist now looks really like a functional unit inside the processor (just
like a FPU)
Send and receive are now register to register transfers
iWARP from CMU and Intel, *T from MIT and Motorola, J machine from MIT
iWARP binds two processor registers to the heads of the network input and output
ports; the processor accesses the message word-by-word as it streams in
*T extended Motorola 88110 RISC core to include a network function unit containing
dedicated sets of input and output registers; a message is spread over a set of such
registers and a special instruction initiates the transfer
Message co-processor
Intel Paragon
Meiko CS-2
The CP is tightly integrated with the NI and has separate concurrent units
The command unit sits directly on the shared bus and is responsible for fielding
processor requests
The processor executes an atomic swap between one register and a fixed memory
location which is mapped to the head of the CP input queue
The command contains a command type and a VA
Depending on the command type the command processor can invoke the DMA
processor (may need assistance from VA to PA unit), an event processor (to wake up
some thread on the main processor), or a thread processor to construct and issue a
message
The input processor fields new messages from the network and may invoke the reply
processor, or any of the above three units
Memory controller on each node accepts PAs from the system bus
The processor initially issues a VA
The TLB provides the translation and the upper few bits of PA represent the home
node for this address (determined when the mapping is established for the first time)
If the address is local i.e. requester is the home node, the memory controller returns
data just as in uniprocessor
If address is remote the memory controller instructs the communication assist
(essentially the NI) to generate a remote memory request
In the remote home the CA issues a request to the memory controller to read memory
and eventually data is returned to the requester
Scalable synchronization
What is needed?
Adv. of MP nodes
Disadvantages
Basics of directory
Directory organization
Is directory useful?
Sharing pattern
Directory organization
Directory overhead
Correctness issues
What is needed?
Adv. of MP nodes
Amortization of node fixed cost over multiple processors; can use commodity SMPs
Much communication may be contained within a node i.e. less “remote” communication
Request combining by some extra hardware in memory controller
Possible to share caches e.g., chip multiprocessor nodes (IBM POWER4 and POWER5) or
hyper-threaded nodes (Intel Xeon MP)
Exact benefit depends on sharing pattern
Widely shared data or nearest neighbor (if properly mapped) may be good
Disadvantages
Example: two 16P systems, one built from four 4-way SMP nodes and one from 16 uniprocessor nodes
Basics of directory
Theoretically speaking each directory entry should have a dirty bit and a bitvector of length P
On a read from processor k, if dirty bit is off read cache line from memory, send it to k,
set bit[k] in vector; if dirty bit is on read owner id from vector (different interpretation of
bitvector), send read intervention to owner, owner replies line directly to k (how?),
sends a copy to home, home updates memory, directory controller sets bit[k] and
bit[owner] in vector
On a write from processor k, if dirty bit is off send invalidations to all sharers marked in
vector, wait for acknowledgments, read cache line from memory, send it to k, zero out
vector and write k in vector, set dirty bit; if dirty bit on same as read, but now
intervention is of readX type and memory does not write the line back, dirty bit is set
and vector=k
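A sketch of such a directory entry and the handling just described, in C; it assumes at most 64 processors so the sharer bitvector fits in a uint64_t, and interventions, invalidations, acknowledgments, and memory accesses are reduced to comments:

#include <stdint.h>

typedef struct {
    int      dirty;      /* set when exactly one cache holds the line in M */
    uint64_t vec;        /* sharer bitvector, or owner id when dirty */
} dir_entry_t;

/* Read request from processor k: directory state update only. */
void dir_read(dir_entry_t *d, int k) {
    if (!d->dirty) {
        /* read the line from memory and send it to k */
        d->vec |= 1ULL << k;                      /* mark k as a sharer */
    } else {
        int owner = (int)d->vec;                  /* bitvector reinterpreted as owner id */
        /* send read intervention to owner; owner replies to k and writes back to home */
        d->dirty = 0;
        d->vec = (1ULL << k) | (1ULL << owner);   /* both k and the old owner now share */
    }
}

/* Write (read-exclusive) request from processor k. */
void dir_write(dir_entry_t *d, int k) {
    if (!d->dirty) {
        /* invalidate all sharers in vec, collect acknowledgments, send line to k */
    } else {
        /* send readX intervention to the owner; memory is not updated */
    }
    d->dirty = 1;
    d->vec = (uint64_t)k;                         /* vector now holds the owner id */
}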
Directory organization
Is directory useful?
Sharing pattern
Directory organization
Directory overhead
Correctness issues
Serialization to a location
Schedule order at home
Use NACKs (extra traffic and livelock) or smarter techniques (back-off, NACK-free)
Flow control deadlock
Avoid buffer dependence cycles
Avoid network queue dependence cycles
Virtual networks multiplexed on physical networks
Coherence protocol dictates the virtual network usage
Virtual networks
Three-lane protocols
Performance issues
Origin directory
Serializing requests
Handling writebacks
Virtual networks
Consider a two-node system with one incoming and one outgoing queue on each node
Three-lane protocols
Performance issues
Latency optimizations
Reduce transactions on critical path: 3-hop vs. 4-hop
Overlap activities: protocol processing and data access, invalidations, invalidation
acknowledgments
Make critical path fast: directory cache, integrated memory controller, smart protocol
Reduce occupancy of protocol engine
Throughput optimizations
Pipeline the protocol processing
Multiple coherence engines
Protocol decisions: where to collect invalidation acknowledgments, existence of clean
replacement hints
Connections to Backplane
Each router has six pairs of 1.56 GB/s unidirectional links; two to nodes (bristled), four to other
routers
41 ns pin to pin latency
Four virtual networks: request, reply, priority, I/O
Any processor can access I/O device either through uncached ops or through coherent DMA
Any I/O device can access any data through router/hub
Origin directory
Directory formats
If exclusive in a cache, entry contains processor number (not node number)
If shared, entry is a bitvector of sharers where each corresponds to a node (not a
processor)
Origin protocol does not assume anything about ordering of messages in the network
At requesting hub
Address is decoded and home is located
Request forwarded to home if home is remote
At home
Directory lookup and data lookup are initiated in parallel
Directory banks are designed to be slightly faster than other banks
The directory entry may reveal several possible states
Actions taken depends on this
Directory state lookup
Unowned: mark directory to point to requester, state becomes M, send cache line
Shared: mark directory bit, send cache line
Busy: send NACK to requester
Modified: if owner is not home, forward to owner
If the owner holds the line clean exclusive, it sends completion messages to requester and home (no data is sent); in all cases the owner's cache state becomes S
Sharing writeback or completion message, on arrival at home, changes directory state to S
If the owner state is E, how does the requester get the data?
The famous speculative reply of Origin 2000
Note how processor design (in this case MIPS R10k) influences protocol decisions
Serializing requests
Handling writebacks