MCU Architecture
Contents
Virtual Memory and Caches (Recap)
Fundamentals of Parallel Computers: ILP vs. TLP
Parallel Programming: Shared Memory and Message Passing
Performance Issues in Shared Memory
Shared Memory Multiprocessors: Consistency and Coherence
Synchronization
Memory consistency models
Case Studies of CMP
Addressing VM
Assume a 32-bit VA
There are primarily three ways to address VM: paging, segmentation, and segmented paging
We will focus on flat paging only
Paged VM
The entire VM is divided into small units called pages
Virtual pages are loaded into physical page frames as and when needed (demand paging)
Thus the physical memory is also divided into equal-sized page frames
The processor generates virtual addresses, but memory is physically addressed: need a VA to PA translation
VA to PA translation
The VA generated by the processor is divided into two parts:
Page offset and virtual page number (VPN)
Assume a 4 KB page: within a 32-bit VA, the lower 12 bits are the page offset (offset within a page) and the remaining 20 bits are the VPN (hence 1 M virtual pages total)
The page offset remains unchanged in the translation
Need to translate the VPN to a physical page frame number (PPFN)
This translation is held in a page table resident in memory: so first we need to access this page table
How to get the address of the page table?
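As a concrete illustration of this split, a minimal C sketch (assuming the 32-bit VA and 4 KB pages of this example; the address value is made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                           /* 4 KB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

int main(void)
{
    uint32_t va  = 0x12345678;                  /* hypothetical virtual address */
    uint32_t vpn = va >> PAGE_SHIFT;            /* upper 20 bits: virtual page number */
    uint32_t off = va & PAGE_OFFSET_MASK;       /* lower 12 bits: unchanged by translation */
    printf("VPN = 0x%05x, offset = 0x%03x\n", vpn, off);
    return 0;
}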
Accessing the page table
The page table base register (PTBR) contains the starting physical address of the page table
The PTBR is normally accessible in kernel mode only
Assume each entry in the page table is 32 bits (4 bytes)
Thus the required page table entry address is PTBR + (VPN << 2)
Access memory at this address to get 32 bits of data from the page table entry (PTE)
These 32 bits contain many things: a valid bit, the much-needed PPFN (may be 20 bits for a 4 GB physical memory), access permissions (read, write, execute), a dirty/modified bit, etc.
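The table walk just described can be sketched in C as follows; the PTE layout (valid bit in bit 31, PPFN in bits 19:0) and the toy physical memory are assumptions for illustration only:

#include <stdint.h>

#define PAGE_SHIFT 12
#define PTE_VALID(pte)  (((pte) >> 31) & 0x1)    /* assumed: bit 31 = valid   */
#define PTE_PPFN(pte)   ((pte) & 0xFFFFFu)       /* assumed: bits 19:0 = PPFN */

static uint8_t phys_mem[1 << 20];                /* toy "physical memory"     */

static uint32_t phys_read32(uint32_t pa)         /* stand-in for a memory access */
{
    return *(uint32_t *)&phys_mem[pa];
}

/* Translate a VA given the page table base (PTBR); each PTE is 4 bytes. */
uint32_t translate(uint32_t ptbr, uint32_t va)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t pte = phys_read32(ptbr + (vpn << 2));   /* first memory access */
    if (!PTE_VALID(pte)) {
        /* page fault: kernel brings the page from disk, updates the PTE, retries */
        return 0;
    }
    return (PTE_PPFN(pte) << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
}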
Page fault
The valid bit within the 32 bits tells you if the translation is valid
If this bit is reset, the page is not resident in memory: this results in a page fault
In case of a page fault the kernel needs to bring the page into memory from disk
The disk address is normally provided by the page table entry (a different interpretation of the remaining 31 bits)
The kernel also needs to allocate a new physical page frame for this virtual page
If all frames are occupied it invokes a page replacement policy
VA to PA translation
Page faults take a long time: on the order of ms
Once the page fault finishes, the page table entry is updated with the new VPN to PPFN mapping
Of course, if the valid bit was set, you get the PPFN right away without taking a page fault
Finally, the PPFN is concatenated with the page offset to get the final PA
The processor can now issue a memory request with this PA to get the necessary data
So really two memory accesses are needed (one for the PTE, one for the data)
Can we improve on this?
TLB
Why can't we cache the most recently used translations?
Translation Look-aside Buffers (TLB)
A small set of registers (normally fully associative)
Each entry has two parts: the tag, which is simply the VPN, and the corresponding PTE
The tag may also contain a process id
On a TLB hit you get the translation in one cycle (may take slightly longer depending on the design)
On a TLB miss you may need to access memory to load the PTE into the TLB (more later)
Normally there are two TLBs: instruction and data
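A minimal software model of a fully associative TLB lookup; the entry format, the 64-entry size and the sequential search (real hardware compares all tags in parallel) are illustrative assumptions:

#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint32_t vpn;      /* tag: virtual page number            */
    uint32_t pte;      /* cached page table entry (PPFN etc.) */
    int      valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 on a hit and writes the cached PTE; on a miss the PTE must be
   fetched from the in-memory page table and installed (not shown). */
int tlb_lookup(uint32_t vpn, uint32_t *pte_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pte_out = tlb[i].pte;
            return 1;               /* TLB hit  */
        }
    }
    return 0;                       /* TLB miss */
}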
Caches
Once you have completed the VA to PA translation you have the physical address. What's next?
You need to access memory with that PA
Instruction and data caches hold the most recently used (temporally close) and nearby (spatially close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)
Addressing a cache
The PA is divided into several parts: tag, index, and block offset
The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss
[Figure: cache lookup — the PA's index selects a line; the stored tag and state bits are compared with the PA's tag to produce hit/miss and the data.]
Addressing a cache
An example
The PA is 32 bits
The cache line is 64 bytes: the block offset is 6 bits
The number of cache lines is 512: the index is 9 bits
So the tag is the remaining 17 bits
The total size of the cache is 512*64 bytes, i.e., 32 KB
Each cache line contains the 64-byte data, a 17-bit tag, one valid/invalid bit, and several state bits (such as shared, dirty, etc.)
Since both the tag and the index are derived from the PA, this is called a physically indexed, physically tagged cache
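The same 32 KB, 64-byte-line, 512-line example expressed as bit manipulation in C (the address value is made up; the field widths follow from the numbers above):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6     /* 64-byte lines */
#define INDEX_BITS        9     /* 512 lines     */

int main(void)
{
    uint32_t pa     = 0xDEADBEEF;                              /* hypothetical PA */
    uint32_t offset = pa & ((1u << BLOCK_OFFSET_BITS) - 1);
    uint32_t index  = (pa >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = pa >> (BLOCK_OFFSET_BITS + INDEX_BITS);  /* remaining 17 bits */
    printf("tag=0x%05x index=%u offset=%u\n", tag, index, offset);
    return 0;
}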
Conflict misses can be reduced by providing multiple lines per index
An access to an index returns a set of cache lines
For an n-way set associative cache there are n lines per set
Carry out the multiple tag comparisons in parallel to see if any one in the set hits
[Figure: set associative cache lookup — the index selects a set; the tags of all lines in the set are compared with the PA's tag in parallel.]
Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single set)
Example: 32 KB cache, 2-way set associative, line size of 64 bytes: the number of indices or sets = 32*1024/(2*64) = 256, and hence the index is 8 bits wide
Example: same size and line size, but fully associative: the number of sets is 1; within the set there are 32*1024/64 = 512 lines; you need 512 tag comparisons for each access
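The arithmetic in these two examples can be packaged in a small helper; it assumes cache size, line size and associativity are powers of two:

#include <stdio.h>

/* Number of sets = cache_size / (associativity * line_size). */
static unsigned num_sets(unsigned cache_bytes, unsigned ways, unsigned line_bytes)
{
    return cache_bytes / (ways * line_bytes);
}

static unsigned log2u(unsigned x)        /* x assumed to be a power of two */
{
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

int main(void)
{
    /* 32 KB, 2-way, 64-byte lines: 256 sets, 8 index bits (as in the example). */
    unsigned sets = num_sets(32 * 1024, 2, 64);
    printf("sets=%u index bits=%u\n", sets, log2u(sets));

    /* Fully associative 32 KB with 64-byte lines: one set holding 512 lines. */
    printf("fully associative: lines per set=%u\n", (32 * 1024) / 64);
    return 0;
}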
Cache hierarchy
Ideally want to hold everything in a fast cache
Never want to go to the memory
But with increasing size the access time increases
A large cache will slow down every access
So, put increasingly bigger and slower caches between the processor and the memory
Keep the most recently used data in the nearest cache: the register file (RF)
Next level of cache: level 1 or L1 (same speed or slightly slower than the RF, but much bigger)
Then L2: way bigger than L1 and much slower
Cache hierarchy
Example: Intel Pentium 4 (Netburst)
128 registers accessible in 2 cycles
L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles
A second example:
128 registers accessible in 1 cycle
L1 instruction and data caches: 16 KB each, 4-way set associative, 64-byte line size, accessible in 1 cycle
Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles
Inclusion policy
A cache hierarchy implements inclusion if the contents of the level n cache (excluding the register file) are a subset of the contents of the level n+1 cache
Eviction of a line from L2 must ask the L1 caches (both instruction and data) to invalidate that line if present
A store miss fills the L2 cache line in M state, but the store really happens in the L1 data cache; so the L2 cache does not have the most up-to-date copy of the line
Eviction of an L1 line in M state writes back the line to L2
Eviction of an L2 line in M state first asks the L1 data cache to send the most up-to-date copy (if any), then it writes the line back to the next higher level (L3 or main memory)
Inclusion simplifies the on-chip coherence protocol (more later)
TLB access
For every cache access (instruction or data) you need to access the TLB first
This puts the TLB on the critical path
Want to start indexing into the cache and reading the tags while the TLB lookup takes place
Virtually indexed, physically tagged cache: extract the index from the VA and start reading the tags while looking up the TLB
Once the PA is available, do the tag comparison
Overlaps the TLB read and the tag read
Memory op latency
L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, the processor runs out of resources such as ROB entries and physical registers
Ultimately, the fetcher stalls: this severely limits ILP
MLP
Simply speaking, need to mutually overlap several memory operations
Allow multiple outstanding cache misses and mutually overlap them
Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache misses)
Issue loads out of program order (the address is not known at the time of issue)
How do you know the load didn't issue before a store to the same address? Issuing stores must check for this memory-order violation
Out-of-order loads
sw 0(r7), r6
/* other instructions */
lw r2, 80(r20)
Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7)==80(r20) or they overlap; now what?
Load/store ordering
Out-of-order load issue relies on speculative memory disambiguation
Assumes that there will be no conflicting store
If the speculation is correct, you have issued the load much earlier and you have allowed its dependents to also execute much earlier
If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value, and re-execute them systematically
It turns out that the speculation is correct most of the time
To further minimize load squashes, microprocessors use simple memory dependence predictors (a predictor decides if a load is going to conflict with a pending store based on that load's or load/store pair's past behavior)
Researchers are working on load value prediction
Even after doing all these, memory latency remains the biggest bottleneck
Today microprocessors are trying to overcome one single wall: the memory wall
Agenda
Convergence of parallel architectures
Fundamental design issues
ILP vs. TLP
Communication architecture
Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture
Defines the basic communication and synchronization operations and provides hw/sw implementation of those
Diverse designs made it impossible to write portable parallel software
But the driving force was the same: the need for fast processing
Layered architecture
Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
Compiler and OS provide the user interface to communicate between and synchronize threads
Shared address
The general communication hw consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices
The architecture of the interconnect may vary depending on projected cost and target performance
[Figure: dance-hall shared address space organization — processors and their caches on one side of the interconnect, memory banks and I/O on the other.]
The interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages)
Scaling a crossbar becomes a problem: cost is proportional to the square of the size
Instead, one could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports
From the mid 80s the shared bus became popular, leading to the design of SMPs
The Pentium Pro Quad was the first commodity SMP
The Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the memory to each processor board, but there was no local bus on the boards, i.e., the memory was still symmetric (must use the shared bus)
NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory, and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring, etc.); examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280, etc.
Message passing
Very popular for large-scale computing
The system architecture looks exactly the same as DSM, but there is no shared memory
The user interface is via send/receive calls to the message layer
The message layer is integrated with the I/O system instead of the memory system
Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
A matching receive at the destination node with the same tag reads in the data from a kernel space buffer to user memory
Effectively, this provides a memory-to-memory copy
Initially it was very topology dependent
A node could talk only to its neighbors through FIFO buffers
These buffers were small in size and therefore while sending a message the send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing)
Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (the DMA happens in the background); the same applies to the receiving end also
The parallel algorithms were designed specifically for certain topologies: a big problem
Message passing
To improve usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors)
Essentially this involved storing a message in intermediate hops and forwarding it to the next node on the route
Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
Further improved to pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth)
Examples include IBM SP2 and Intel Paragon
Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.)
Convergence
Shared address and message passing are two distinct programming models, but the architectures look very similar
Both have a communication assist or network interface to initiate messages or transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the trend is changing
There are message passing machines where the assist sits on the memory bus, or machines where DMA over the network is supported (direct transfer from source memory to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
Possible to emulate a shared virtual memory on message passing machines through modified page fault handlers
A generic architecture
In all the architectures we have discussed thus far a node essentially contains processor(s) + caches, memory and a communication assist (CA)
CA = network interface (NI) + communication controller
The nodes are connected over a scalable network
The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there is a lot of choice in the design of the CA
Most innovation in parallel architecture takes place in the communication assist (also called communication controller or node controller)
[Figure: a generic node — processor and cache connected through a crossbar to local memory and a communication assist (CA), with the nodes joined by a scalable network.]
Design issues
Need to understand architectural components that affect software
Compiler, library, program
User/system interface and hw/sw interface
How do programming models efficiently talk to the communication architecture?
How to implement efficient primitives in the communication layer?
In a nutshell, what issues of a parallel machine will affect the performance of the parallel applications?
Naming
How are the data in a program referenced?
In sequential programs a thread can access any variable in its virtual address space
In shared memory programs a thread can access any private or shared variable (same load/store model as sequential programs)
In message passing programs a thread can access local data directly
Operations
What operations are supported to access data?
For sequential and shared memory models load/store are sufficient
For message passing models send/receive are needed to access remote data
For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote
For message passing, the CA or the message layer needs to copy data from local memory and initiate a send, or copy data from the receive buffer to local memory
Ordering
For the sequential model, it is the program order: true dependence order
For shared memory, within a thread it is the program order; across threads, some valid interleaving of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in a shared memory context)
For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable)
Replication
How is the shared data locally replicated?
This is very important for reducing communication traffic
In microprocessors data is replicated in the cache to reduce memory accesses
In message passing, replication is explicit in the program and happens through receive (a private copy is created)
In shared memory a load brings in the data to the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol)
Communication cost
Three major components of the communication architecture that affect performance
Latency: time to do an operation (e.g., load/store or send/recv.)
Bandwidth: rate of performing an operation
Overhead or occupancy: how long the communication layer is occupied doing an operation
Latency is already a big problem for microprocessors, and an even bigger problem for multiprocessors due to remote operations
Must optimize the application or hardware to hide or lower latency (algorithmic optimizations, prefetching, or overlapping computation with communication)
Communication cost
Bandwidth: how many ops in unit time, e.g., how many bytes transferred per second
Local BW is provided by heavily banked memory or a faster and wider system bus
Communication BW has two components: 1. node-to-network BW (also called network link BW): how fast bytes can be pushed into the router from the CA; 2. within-network bandwidth: affected by the scalability of the network and the architecture of the switch or router
Linear cost model: transfer time = T0 + n/B, where T0 is the start-up overhead, n is the number of bytes transferred and B is the BW
Not sufficient since overlap of computation and communication is not considered; also does not account for how the transfer is done (pipelined or not)
Communication cost
A better model: communication time for n bytes = overhead + CA occupancy + network latency + size/BW + contention, i.e., T(n) = Ov + Oc + L + n/B + Tc
Overhead and occupancy may be functions of n
Contention depends on the queuing delay at various components along the communication path, e.g., waiting time at the communication assist or controller, waiting time at the router, etc.
Overall communication cost = frequency of communication x (communication time - overlap with useful computation)
Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware
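A sketch of this cost model as C functions; the parameter values in the example are made-up placeholders you would measure for a real machine:

#include <stdio.h>

/* T(n) = Ov + Oc + L + n/B + Tc, as defined above.
   ov, oc, l, tc in seconds; bw in bytes/second; n in bytes. */
static double comm_time(double n, double ov, double oc, double l,
                        double bw, double tc)
{
    return ov + oc + l + n / bw + tc;
}

/* Overall cost = frequency x (communication time - overlap with computation). */
static double comm_cost(double frequency, double time, double overlap)
{
    return frequency * (time - overlap);
}

int main(void)
{
    /* Illustrative numbers only: 1 KB transfer, 2 us overhead, 1 GB/s link. */
    double t = comm_time(1024.0, 2e-6, 0.5e-6, 1e-6, 1e9, 0.0);
    printf("T(1KB) = %g us, cost = %g s\n", t * 1e6, comm_cost(1000.0, t, 0.0));
    return 0;
}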
ILP vs. TLP
Microprocessors enhance performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism)
Multiprocessors enhance performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism)
TLP provides parallelism at a much larger granularity compared to ILP
In multiprocessors ILP and TLP work together
Within a thread ILP provides a performance boost
Across threads TLP provides speedup over a sequential version of the parallel program
Parallel Programming
Agenda
Steps in writing a parallel program
Example
Steps in writing a parallel program
Start from a sequential description
Identify work that can be done in parallel
Partition work and/or data among threads or processes: decomposition and assignment
Add necessary communication and synchronization: orchestration
Map threads to processors (mapping)
How good is the parallel program? Measure speedup = sequential execution time / parallel execution time (equal to the number of processors ideally)
Some definitions
Task
Arbitrary piece of sequential work; concurrency is only across tasks
Fine-grained task vs. coarse-grained task: controls the granularity of parallelism (spectrum of grain: from one instruction to the whole sequential program)
Process/thread
Logical entity that performs a task Communication and synchronization happen between threads
Processors
Physical entity on which one or more processes execute
Decomposition
Find concurrent tasks and divide the program into tasks
The level or grain of concurrency needs to be decided here
Too many tasks: may lead to too much overhead in communicating and synchronizing between tasks
Too few tasks: may lead to idle processors
Goal: just enough tasks to keep the processors busy
Static assignment
Given a decomposition it is possible to assign tasks statically
For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., Pk-1 operates on (k-1)N/k to N-1 (see the sketch below)
For regular computations this works great: simple and low-overhead
Is there a problem? Suppose for certain index ranges you do some heavy-weight computation while for others you do something simple
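A minimal sketch of the static block assignment above (k processes over an array of N elements, N assumed divisible by k; the per-element work is a stand-in):

/* Process p (0 <= p < k) works on indices [p*N/k, (p+1)*N/k). */
static void static_block_work(float *a, int N, int k, int p)
{
    int lo = p * (N / k);
    int hi = (p + 1) * (N / k);
    for (int i = lo; i < hi; i++) {
        a[i] = a[i] * 2.0f;     /* stand-in for the per-element computation */
    }
}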
Static assignment may lead to load imbalance depending on how irregular the application is
Dynamic decomposition/assignment solves this issue by allowing a process to dynamically choose any available task whenever it is done with its previous task
Normally in this case you decompose the program in such a way that the number of available tasks is larger than the number of processes
Same example: divide the array into portions each with 10 indices; so you have N/10 tasks
An idle process grabs the next available task
Provides better load balance since longer tasks can execute concurrently with the smaller ones
Dynamic assignment
Dynamic assignment comes with its own overhead
Now you need to maintain a shared count of the number of available tasks
The update of this variable must be protected by a lock
Need to be careful so that this lock contention does not outweigh the benefits of dynamic decomposition
In more complicated applications a task may not just operate on an index range, but could manipulate a subtree or a complex data structure
Normally a dynamic task queue is maintained where each task is probably a pointer to the data
The task queue gets populated as new tasks are discovered (a simple counter-based version is sketched below)
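A sketch of the counter-based dynamic assignment described above, written here with POSIX threads rather than the macro package used elsewhere in these slides; the task granularity and count are illustrative:

#include <pthread.h>

#define NUM_TASKS 100           /* e.g., N/10 row-chunks from the example above */

static int next_task = 0;
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the next available task id, or -1 when all tasks are done.
   The shared counter is the simple dynamic task queue; the lock protects
   the increment, and contention on this lock is the overhead discussed above. */
static int grab_task(void)
{
    int t = -1;
    pthread_mutex_lock(&task_lock);
    if (next_task < NUM_TASKS)
        t = next_task++;
    pthread_mutex_unlock(&task_lock);
    return t;
}

static void *worker(void *arg)
{
    int t;
    (void)arg;
    while ((t = grab_task()) >= 0) {
        /* process task t (e.g., update rows 10*t .. 10*t+9) */
    }
    return (void *)0;
}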
Decomposition types
Decomposition by data
The most commonly found decomposition technique
The data set is partitioned into several subsets and each subset is assigned to a process
The type of computation may or may not be identical on each subset
Very easy to program and manage
Computational decomposition
Not so popular: tricky to program and manage
All processes operate on the same data, but probably carry out different kinds of computation
More common in systolic arrays, pipelined graphics processing units (GPUs), etc.
Orchestration
Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
This step normally depends on the programming model and the underlying architecture
Goal is to
Reduce communication and synchronization costs
Maximize locality of data reference
Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
Reduce the overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task, etc.)
Mapping
At this point you have a parallel program: just need to decide which and how many processes go to each processor of the parallel machine
Could be specified by the program: pin particular processes to a particular processor for the whole life of the program; the processes cannot migrate to other processors
Could be left to the OS: schedule processes on idle processors
Various scheduling algorithms are possible, e.g., round robin: process#k goes to processor#k
A NUMA-aware OS normally takes into account multiprocessor-specific metrics in scheduling
How many processes per processor? Most common is one-to-one
An example
Iterative equation solver
Main kernel in the Ocean simulation
Update each 2-D grid point via Gauss-Seidel iterations: A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+A[i+1,j]+A[i-1,j])
Pad the n by n grid to (n+2) by (n+2) to avoid corner problems
Update only the interior n by n grid
One iteration consists of updating all n^2 points in-place and accumulating the difference from the previous value at each point
If the difference is less than a threshold, the solver is said to have converged to a stable grid equilibrium
Sequential program
int n; float **A, diff;

begin main()
  read (n);        /* size of grid */
  Allocate (A);
  Initialize (A);
  Solve (A);
end main

begin Solve (A)
  int i, j, done = 0;
  float temp;
  while (!done)
    diff = 0.0;
    for i = 0 to n-1
      for j = 0 to n-1
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+A[i-1,j]+A[i+1,j]);
        diff += fabs (A[i,j] - temp);
      endfor
    endfor
    if (diff/(n*n) < TOL) then done = 1;
  endwhile
end Solve
Decomposition
Look for concurrency in loop iterations
In this case iterations are really dependent
Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)
Each anti-diagonal can be computed in parallel
Must synchronize after each anti-diagonal (or use point-to-point synchronization)
Alternative: red-black ordering (different update pattern)
Can update all red points first, synchronize globally with a barrier and then update all black points
May converge faster or slower compared to the sequential program
The converged equilibrium may also be different if there are multiple solutions
The Ocean simulation uses this decomposition
Decomposition
We will ignore the loop-carried dependence and go ahead with a straightforward loop decomposition
Allow updates to all points in parallel
This is yet another different update order and may affect convergence
An update to a point may or may not see the new updates to the nearest neighbors (this parallel algorithm is non-deterministic)
while (!done)
  diff = 0.0;
  for_all i = 0 to n-1
    for_all j = 0 to n-1
      temp = A[i, j];
      A[i, j] = 0.2*(A[i, j]+A[i, j+1]+A[i, j-1]+A[i-1, j]+A[i+1, j]);
      diff += fabs (A[i, j] - temp);
    end for_all
  end for_all
  if (diff/(n*n) < TOL) then done = 1;
end while
Decomposition
Offers concurrency across elements: the degree of concurrency is n^2
Make the j loop sequential to have a row-wise decomposition: degree n concurrency
Assignment
Possible static assignment: block row decomposition — process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1, etc.
Another static assignment: cyclic row decomposition — process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
Dynamic assignment: grab the next available row, work on it, grab a new row, ...
Static block row assignment minimizes nearest-neighbor communication by assigning contiguous rows to the same process
/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
  LOCKDEC (diff_lock);
  BARDEC (barrier);
  float **A, diff;
} *gm;

int main (int argc, char **argv)
{
  int i;
  MAIN_INITENV;
  gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
  LOCKINIT (gm->diff_lock);
  BARINIT (gm->barrier);
  n = atoi (argv[1]);
  P = atoi (argv[2]);
  gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
  for (i = 0; i < n+2; i++) {
    gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
  }
  Initialize (gm->A);
  for (i = 1; i < P; i++) {  /* starts at 1 */
    CREATE (Solve);
  }
  Solve ();
  WAIT_FOR_END (P-1);
  MAIN_END;
}
void Solve (void)
{
  int i, j, pid, done = 0;
  float temp, local_diff;
  GET_PID (pid);
  while (!done) {
    local_diff = 0.0;
    if (!pid) gm->diff = 0.0;
    BARRIER (gm->barrier, P);  /* why? */
    for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
      for (j = 0; j < n; j++) {
        temp = gm->A[i][j];
        gm->A[i][j] = 0.2*(gm->A[i][j] + gm->A[i][j-1] + gm->A[i][j+1] + gm->A[i+1][j] + gm->A[i-1][j]);
        local_diff += fabs (gm->A[i][j] - temp);
      } /* end for */
    } /* end for */
    LOCK (gm->diff_lock);
    gm->diff += local_diff;
    UNLOCK (gm->diff_lock);
    BARRIER (gm->barrier, P);
    if (gm->diff/(n*n) < TOL) done = 1;
    BARRIER (gm->barrier, P);  /* why? */
  } /* end while */
}
Mutual exclusion
Updates to the shared variable diff must be sequential
Heavily contended locks may degrade performance
Try to minimize the use of critical sections: they are sequential anyway and will limit speedup
This is the reason for using a local_diff instead of accessing gm->diff every time
Also, minimize the size of the critical section because the longer you hold the lock, the longer the waiting time for other processors at lock acquire
LOCK optimization
Suppose each processor updates a shared variable holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems
A naive version:
LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
  gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
/* May lead to heavy lock contention if everyone tries to update at the same time */

An optimized version that tests before acquiring the lock:
if (my_cost < gm->cost) {
  LOCK (gm->cost_lock);
  if (my_cost < gm->cost) { /* make sure */
    gm->cost = my_cost;
  }
  UNLOCK (gm->cost_lock);
}
/* This works because gm->cost is monotonically decreasing */
More synchronization
Global synchronization
Through barriers Often used to separate computation phases
Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on a uniprocessor through P and V functions
Normally implemented through flags on shared memory multiprocessors (busy wait or spin):
P0: A = 1; flag = 1;
P1: while (!flag); use (A);
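A sketch of the flag idiom above using C11 atomics; the acquire/release pair stands in for whatever ordering support a particular machine needs (the plain loads/stores written on the slide assume a sufficiently strong memory model):

#include <stdatomic.h>

static int A;                       /* data produced by P0, consumed by P1 */
static atomic_int flag = 0;         /* the synchronization flag            */

void producer(void)                 /* runs on P0 */
{
    A = 1;                                                  /* produce data  */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* then set flag */
}

void consumer(void)                 /* runs on P1 */
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                           /* busy wait (spin) until flag is set   */
    /* use (A): the acquire/release pair guarantees A's new value is seen   */
}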
Message passing
No shared variable: expose communication through send/receive
No lock or barrier primitive: must implement synchronization through send/receive
P0 allocates and initializes matrix A in its local memory
Then it sends the block rows, n, and P to each processor, i.e., P1 waits to receive rows n/P to 2n/P-1, etc. (this is one-time)
Within the while loop the first thing that every processor does is to send its first and last rows to the upper and the lower processors (corner cases need to be handled)
Then each processor waits to receive the neighboring two rows from the upper and the lower processors
Message passing
At the end of the loop each processor sends its local_diff to P0 and P0 sends back the accumulated diff so that each processor can locally compute the done flag
Major changes
[Code slide: the shared memory main function is repeated with annotations — P, n and the matrix A are declared locally in each process and the grid allocation becomes a local allocation ("Local Alloc.") per process.]
Major changes
[Code slide: Solve() is repeated with annotations showing the message passing changes — before the while loop each process other than P0 receives its rows along with n and P; in each iteration it sends its boundary rows up/down and receives the neighboring rows; at the end it sends local_diff to P0 and receives the accumulated diff back.]
Message passing
This algorithm is deterministic
It may converge to a different solution compared to the shared memory version if there are multiple solutions. Why?
There is a fixed specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated
This is not true for shared memory
MPI-like environment
MPI stands for Message Passing Interface: a C library that provides a set of message passing primitives (e.g., send, receive, broadcast, etc.) to the user
PVM (Parallel Virtual Machine) is another well-known platform for message passing programming
Background in MPI is not necessary for understanding this lecture
Only need to know that when you start an MPI program every thread runs the same main function
We will assume that we pin one thread to one processor just as we did in shared memory
Instead of using the exact MPI syntax we will use some macros that call the MPI functions
MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main(int argc, char **argv)
{
  int pid, P, done, i, j, N;
  float tempdiff, local_diff, temp, **A;

  MAIN_INITENV;
  GET_PID(pid);
  GET_NUMPROCS(P);
  N = atoi(argv[1]);
  tempdiff = 0.0;
  done = 0;
  A = (float **) malloc ((N/P+2) * sizeof(float *));
  for (i = 0; i < N/P+2; i++) {
    A[i] = (float *) malloc (sizeof(float) * (N+2));
  }
  initialize(A);

  while (!done) {
    local_diff = 0.0;
    /* MPI_CHAR means raw byte format */
    if (pid) { /* send my first row up */
      SEND(&A[1][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
    }
    if (pid != P-1) { /* recv last row */
      RECV(&A[N/P+1][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
    }
    if (pid != P-1) { /* send last row down */
      SEND(&A[N/P][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
    }
    if (pid) { /* recv first row from above */
      RECV(&A[0][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
    }
    for (i=1; i <= N/P; i++)
      for (j=1; j <= N; j++) {
        temp = A[i][j];
        A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
        local_diff += fabs(A[i][j] - temp);
      }
    /* ... (the reduction of local_diff and the computation of done follow) */
Performance Issues
Agenda
Partitioning for performance
Data access and communication
Summary
Goal is to understand the simple trade-offs involved in writing a parallel program, keeping an eye on parallel performance
Getting good performance out of a multiprocessor is difficult
Programmers need to be careful
A little carelessness may lead to extremely poor performance
Load balancing
A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
Ideally I want all the threads to arrive at a barrier at the same time
Speedup = sequential execution time / maximum time taken by any processor
Thus speedup is maximized when the maximum time and minimum time across all processors are close (want to minimize the variance of parallel execution time)
This directly gets translated to load balancing
Ultimately all processors finish at the same time
But some do useful work all over this period while others may spend a significant time at synchronization points
This may arise from a bad partitioning
There may be other architectural reasons for load imbalance beyond the scope of a programmer, e.g., network congestion, unforeseen cache conflicts, etc. (these slow down a few threads)
Task stealing
A processor may choose to steal tasks from another processor's queue if the former's queue is empty
How many tasks to steal? Whom to steal from?
The biggest question: how to detect termination? Really a distributed consensus!
Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently)
This is a form of a more general technique called Receiver Initiated Diffusion (RID) where the receiver of the task initiates the task transfer
In Sender Initiated Diffusion (SID) a processor may choose to insert into another processor's queue if the former's task queue is full above a threshold
Architect's job
However, an architecture may provide efficient primitives to implement task queues and task stealing
For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller
But this may expose some of the architectural features to the programmer
There are multiprocessors that provide efficient implementations for certain synchronization primitives; this may improve load balance
Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically
Goal is to assign tasks such that accessed data are mostly local to a process
Ideally I do not want any communication
But in life sometimes you need to talk to people to get some work done!
Domain decomposition
Normally applications show a local bias on data usage
Communication is short-range, e.g., nearest neighbor; even if it is long-range it falls off with distance
View the dataset of an application as the domain of the problem, e.g., the 2-D grid in the equation solver
If you consider a point in this domain, in most applications it turns out that this point depends on points that are close by
Partitioning can exploit this property by assigning contiguous pieces of data to each process
The exact shape of the decomposed domain depends on the application and load balancing requirements
Comm-to-comp ratio
Surely, there could be many different domain decompositions for a particular problem
For the grid solver we may have a square block decomposition, a block row decomposition or a cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio
Assume P processors and an NxN grid for the grid solver
[Figure: square block decomposition — the grid is split into a √P x √P arrangement of blocks assigned to processors P0 through P15 for P = 16.]
Size of each block: N/√P by N/√P
Communication (perimeter): 4N/√P
Computation (area): N²/P
Comm-to-comp ratio = 4√P/N
Comm-to-comp ratio
For block row decomposition
Each strip has N/P rows
Communication (boundary rows): 2N
Computation (area): N²/P (same as square block)
Comm-to-comp ratio: 2P/N
For cyclic row decomposition: each processor gets N/P isolated rows
Communication: 2N²/P
Computation: N²/P
Comm-to-comp ratio: 2
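To make the comparison concrete, take illustrative values N = 1024 and P = 16 (numbers chosen for this example, not from the original data): the square block decomposition gives 4√P/N = 16/1024 ≈ 0.016, block row gives 2P/N = 32/1024 ≈ 0.031, and cyclic row gives 2, so the square block decomposition communicates the least per unit of computation.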
Comm-to-comp ratio
In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio
But it depends on the application structure, i.e., picking the lowest comm-to-comp ratio may have other problems
Normally this ratio gives you a rough estimate of the average communication bandwidth requirement of the application, i.e., how frequent communication is
But it does not tell you the nature of the communication, i.e., bursty or uniform
For the grid solver, communication happens only at the start of each iteration; it is not uniformly distributed over the computation
Thus the worst case BW requirement may exceed what the average comm-to-comp ratio suggests
Extra work
Speedup is bounded above by Sequential work / Max (Useful work + Synchronization + Communication cost + Extra work), where the Max is taken over all processors
But this is still incomplete: we have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely
Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality
Data access
In multiprocessors:
Every processor wants to see the memory interface as its own local cache and the main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other processors; if the memory is distributed then some part of it is local and some is remote
For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology
Artifactual comm.
Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred
Capacity problem
Most probable reason for artifactual communication
Due to the finite capacity of the cache, local memory or remote memory
May view a multiprocessor as a three-level memory hierarchy for this purpose: local cache, local memory, remote memory
Communication due to cold or compulsory misses and inherent communication are independent of capacity
Capacity and conflict misses generate communication resulting from finite capacity
The generated traffic may be local or remote depending on the allocation of pages
General technique: exploit spatial and temporal locality to use the cache properly
Temporal locality
Maximize reuse of data
Schedule tasks that access the same data in close succession
Many linear algebra kernels use blocking of matrices to improve temporal (and spatial) locality
Example: the transpose phase in the Fast Fourier Transform (FFT); to improve locality, the algorithm carries out a blocked transpose, i.e., it transposes a block of data at a time
[Figure: block transpose]
Spatial locality
Consider a square block decomposition of grid solver and a C-like row major layout i.e. A[i][j] and A[i][j+1] have contiguous memory locations
[Figure: memory allocation for a square block decomposition — pages and cache lines straddle partition boundaries.]
The same page is local to one processor while remote to others; the same applies to straddling cache lines
Ideally, I want to have all pages within a partition local to a single processor
The standard trick is to convert the 2D array to 4D
2D to 4D conversion
The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
The first two dimensions of the new 4D matrix are the block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
The next two dimensions hold the data elements within that partition
Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14
Now all elements within a partition have contiguous addresses
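A sketch of the 4D layout in C, with N = 1024 and P = 16 assumed purely for illustration; all elements of one partition occupy contiguous addresses:

#define N 1024
#define P 16                        /* processors arranged as a 4 x 4 grid  */
#define SQRT_P 4
#define BLK (N / SQRT_P)            /* 256: side of one square partition    */

/* Equivalent of the slide's float B[sqrt(P)][sqrt(P)][N/sqrt(P)][N/sqrt(P)]. */
static float B4[SQRT_P][SQRT_P][BLK][BLK];

/* Map a global (i, j) index of the N x N grid into the 4D layout. */
static float *elem(int i, int j)
{
    return &B4[i / BLK][j / BLK][i % BLK][j % BLK];
}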
Transfer granularity
How much data do you transfer in one communication?
For message passing it is explicit in the program
For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy)
Contention
Location hot-spot:
Consider accumulating a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it
CA on this node becomes a bottleneck
Hot-spots
Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication
Module hot-spot
Normally happens when a particular node saturates handling too many messages (need not be to same memory location) within a short amount of time Normal solution again is to design the algorithm in such a way that these messages are staggered over time
Rule of thumb: design the communication pattern so that it is not bursty; want to distribute it uniformly over time
Overlap
Increase overlap between communication and computation
Not much to do at the algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive, etc.
Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap)
Summary
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
The goal of a good parallel program is to minimize these three terms
The goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)
Four organizations
Shared cache
[Figure: shared cache organization — processors P0 ... Pn connect through a switch to an interleaved shared cache and interleaved memory.]
The switch is a simple controller for granting access to the cache banks
The interconnect is between the processors and the shared cache
Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache)
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses
Normally small scale due to heavy bandwidth demand on the switch and the shared cache
Four organizations
Bus-based SMP
[Figure: bus-based SMP — processors P0 ... Pn with private cache hierarchies share a bus to memory.]
The interconnect is a shared bus located between the private cache hierarchies and the memory controller
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Bus bandwidth requirement is lower compared to the shared cache approach. Why?
Scalability is limited by the shared bus bandwidth
Four organizations
Dancehall
[Figure: dancehall organization — processors P0 ... Pn with private caches on one side of a scalable interconnect, memory banks on the other.]
Better scalability compared to the previous two designs
The difference from the bus-based SMP is that the interconnect is a scalable point-to-point network (e.g., a crossbar or another topology)
Memory is still symmetric from all processors
Drawback: a cache miss may take a long time since all memory banks are far from the processors (may be several network hops)
Four organizations
Distributed shared memory
[Figure: distributed shared memory — each node contains a processor, cache and local memory banks, and the nodes are joined by a scalable interconnect.]
The most popular scalable organization
Each node now has local memory banks
Shared memory on other nodes must be accessed over the network: remote memory access
Non-uniform memory access (NUMA): the latency to access local memory is much smaller compared to remote memory
Caching is very important to reduce remote memory accesses
Four organizations
In all four organizations caches play an important role in reducing latency and bandwidth requirements
If an access is satisfied in cache, the transaction will not appear on the interconnect and hence the bandwidth requirement of the interconnect will be less (a shared L1 cache does not have this advantage)
In distributed shared memory (DSM) the cache and local memory should be used cleverly
Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions
In DSM this is not the case
Hierarchical design
Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (up to 28 processors each) over a scalable interconnect to form a 112-processor multiprocessor
IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connect these chips over a network to form scalable multiprocessors
Cache Coherence
Intuitive memory model
For sequential programs we expect a memory location to return the latest value written to that location
For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold because all threads see the same cache hierarchy (same as a shared L1 cache)
For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and that they satisfy the programmer's intuitive memory model
Example
Assume a write-through cache i.e. every store updates the value in cache as well as in memory
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and the memory value
P0: reads x from its cache and gets the value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
P2: writes x=10, updates its cached value and the memory value
Example
Consider the same example with a writeback cache i.e. values are written back to memory only when the cache line is evicted from the cache
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since the caches are not write through)
The state of the line in P1 and P2 is M while the line in P0 is clean
Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from P0 will not (clean lines do not need a writeback)
Suppose P2 evicts the line first, and then P1
The final memory value is 7: we lost the store x=10 from P2
Writeback cache
The problem is even more complicated: stores are no longer visible to memory immediately
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state
Definitions
Memory operation: a read (load), a write (store), or a read-modify-write (assumed to take place atomically)
A memory operation is said to issue when it leaves the issue queue and looks up the cache
A memory operation is said to perform with respect to a processor when the processor can tell its effect from other issued memory operations:
A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value
A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value
Ordering memory op
A memory operation is said to complete when it has performed with respect to all processors in the system
Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding memory locations
Operations from the same processor complete in program order: this imposes a partial order among the memory operations
Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are possible)
Example
P0: x=8; u=y; v=9;
P1: r=5; y=4; t=v;
Legal total order: x=8; u=y; r=5; y=4; t=v; v=9;
Another legal total order: x=8; r=5; y=4; u=y; v=9; t=v;
"Last" means the most recent in some legal total order
A system is coherent if
Reads get the last written value in the total order All processors see writes to a location in the same order
A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:
1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in the total order
Two necessary features follow from the above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: every processor should see the writes to a location in the same order (if I see w1 before w2, you should not see w2 before w1)
Bus-based SMP
Three phases: arbitrate for the bus, launch the command (often called the request) and address, transfer data
Every device connected to the bus can observe the transaction
The appropriate device responds to the request
In an SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence
The other device on the bus that will be of interest to us is the memory controller (the north bridge in standard motherboards)
Depending on the bus transaction a cache block executes a finite state machine implementing the coherence protocol
Snoopy protocols
Cache coherence protocols implemented in bus-based machines are called snoopy protocols
The processors snoop or monitor the bus and take appropriate protocol actions based on snoop results
The cache controller now receives requests both from the processor and the bus
Since cache state is maintained on a per-line basis, that also dictates the coherence granularity: cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per cache line basis
The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in cache)
A read access to a cache line in I state generates a BusRd request on the bus
The memory controller responds to the request and after reading from memory launches the line on the bus
The requester matches the address, picks up the line from the bus and fills the cache in V state
A store to a line always generates a BusWr transaction on the bus (since the cache is write through); other sharers either invalidate the line in their caches or update the line with the new value
State transition
The finite state machine for each cache line:
[State diagram: two states, I (invalid) and V (valid). I --PrRd/BusRd--> V; I --PrWr/BusWr--> I; V --PrRd/- --> V; V --PrWr/BusWr--> V; V --BusWr (snoop)--> I]
A/B means: A is generated by the processor, B is the resulting bus transaction (if any)
What changes for a write through, write allocate cache? (A C sketch of this state machine appears below.)
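A sketch of the two-state write-through invalidation protocol above as a per-line state machine in C; the event and bus-operation encodings are illustrative, and write no-allocate is assumed as in the diagram:

/* Per-line states of the write-through invalidation protocol sketched above. */
enum line_state { INVALID, VALID };

enum event { PR_RD, PR_WR, BUS_WR_SNOOP };   /* processor read/write, snooped BusWr */

/* Returns the next state and (via *bus_op) the bus transaction to issue:
   'R' = BusRd, 'W' = BusWr, '-' = none. A processor write in I stays in I
   (write no-allocate), matching the diagram. */
static enum line_state next_state(enum line_state s, enum event e, char *bus_op)
{
    *bus_op = '-';
    switch (s) {
    case INVALID:
        if (e == PR_RD) { *bus_op = 'R'; return VALID; }   /* PrRd/BusRd */
        if (e == PR_WR) { *bus_op = 'W'; return INVALID; } /* PrWr/BusWr */
        return INVALID;
    case VALID:
        if (e == PR_RD) return VALID;                      /* PrRd/-     */
        if (e == PR_WR) { *bus_op = 'W'; return VALID; }   /* PrWr/BusWr */
        if (e == BUS_WR_SNOOP) return INVALID;             /* invalidate */
        return VALID;
    }
    return s;
}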
Ordering memory op
Assume an atomic bus: it takes up the next transaction only after finishing the previous one
Read misses and writes appear on the bus and hence are visible to all processors
What about read hits? They take place transparently in the cache
But they are correct as long as they are correctly ordered with respect to writes
And all writes appear on the bus and hence are visible immediately in the presence of an atomic bus
Writes establish a partial order
In general, in between writes reads can happen in any order without violating coherence
Memory consistency
How to establish the order between reads and writes from different processors?
The clearest way is to use synchronization:
P0: A=1; flag=1;    P1: while (!flag); print A;
Another example (assume A=0, B=0 initially):
P0: A=1; print B;    P1: B=1; print A;
What do you expect?
A memory consistency model is a contract between programmer and hardware regarding memory ordering
Consistency model
A multiprocessor normally advertises the supported memory consistency model
This essentially tells the programmer what the possible correct outcomes of a program could be when run on that machine
Cache coherence deals with memory operations to the same location, but not to different locations
Without a formally defined order across all memory operations it often becomes impossible to argue about what is correct and what is wrong in shared memory
Sequential consistency
Total order achieved by interleaving accesses from different processors; the accesses from the same processor are presented to the memory system in program order
Essentially behaves like a randomly moving switch connecting the processors to memory: it picks the next access from a randomly chosen processor
Lamport's definition of SC:
A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
The program order is the order of instructions from a sequential piece of code where the programmer's intuition is preserved: the order must produce the result a programmer expects
Can out-of-order execution violate program order?
No. All microprocessors commit instructions in order and that is where the state becomes visible; for modern microprocessors the program order is really the commit order
Can out-of-order execution violate SC?
Yes. Extra logic is needed to support SC on top of OOO
P0: x=w+1; r=y+1;    P1: y=2; w=y+1;
Suppose the load that reads w takes a miss, so w is not ready for a long time; therefore x=w+1 cannot complete immediately; eventually w returns with value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache); eventually the instructions commit in order with x=4, r=1, y=2, w=3
So we have the following partial orders: P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1
Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2
Combine these to get a contradictory total order
SC example
Consider the following example:
P0: A=1; print B;    P1: B=1; print A;
Possible outcomes for an SC machine:
(A, B) = (0,1); interleaving: B=1; print A; A=1; print B
(A, B) = (1,0); interleaving: A=1; print B; B=1; print A
(A, B) = (1,1); interleaving: A=1; B=1; print A; print B or A=1; B=1; print B; print A
(A, B) = (0,0) is impossible: the read of A must occur before the write of A and the read of B must occur before the write of B, i.e., print A < A=1 and print B < B=1; but A=1 < print B and B=1 < print A; thus print B < B=1 < print A < A=1 < print B, which implies print B < print B, a contradiction
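The claim is easy to double-check by brute force: enumerate all program-order-preserving interleavings of the two threads and record the outcomes. The C sketch below is written only for this two-thread, two-operation example; the variable names are illustrative.

#include <stdio.h>

int main(void) {
    int seen[2][2] = {{0, 0}, {0, 0}};
    /* m's bits say which of the 4 slots in the total order belong to P1 */
    for (int m = 0; m < 16; m++) {
        int p1_slots = ((m >> 0) & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
        if (p1_slots != 2) continue;          /* each thread contributes exactly 2 ops */
        int A = 0, B = 0, rA = 0, rB = 0;
        int i0 = 0, i1 = 0;                   /* next op index within P0 and P1 */
        for (int slot = 0; slot < 4; slot++) {
            if (!((m >> slot) & 1)) {         /* P0's next op: A=1 then print B */
                if (i0++ == 0) A = 1; else rB = B;
            } else {                          /* P1's next op: B=1 then print A */
                if (i1++ == 0) B = 1; else rA = A;
            }
        }
        seen[rA][rB] = 1;                     /* record the printed (A, B) pair */
    }
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            printf("(A, B) = (%d, %d): %s\n", a, b, seen[a][b] ? "possible" : "impossible");
    return 0;
}

Running it reports (0, 0) as the only impossible outcome, matching the argument above.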
Implementing SC
Two basic requirements
Memory operations issued by a processor must become visible to others in program order
Need to make sure that all processors see the same total order of memory operations: in the previous example, for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B
The tricky part is to make sure that writes become visible in the same order to all processors
Write atomicity: as if each write is an atomic operation Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC)
Write atomicity
Example (A=0, B=0 initially)
P0: A=1; P1: while (!A); B=1; P2: while (!B); print A;
A=0 will be printed only if write to A is not visible to P2, but clearly it is visible to P1 since it came out of the loop Thus A=0 is possible if P1 sees the order A=1 < B=1 and P2 sees the order B=1 < A=1 i.e. from the viewpoint of the whole system the write A=1 was not atomic Without write atomicity P2 may proceed to print 0 with a stale value from its cache
Summary of SC
Program order from each processor creates a partial order among memory operations
Interleaving of these partial orders defines a total order
Sequential consistency: one of many total orders
A multiprocessor is said to be SC if any execution on this machine is SC compliant
Sufficient (but not necessary) conditions for SC:
Issue memory operations in program order
Every processor waits for a write to complete before issuing the next operation
Every processor waits for a read to complete, and for the write that affects the returned value to complete, before issuing the next operation (important for write atomicity)
Writes and reads are all serialized in a total order through the bus transaction ordering
If a read gets the value of a previous write, that write is guaranteed to be complete because that bus transaction is complete
The write order seen by all processors is the same in a write-through system because every write causes a transaction and hence is visible to all in the same order
In a nutshell, every processor sees the same total bus order for all memory operations, and therefore any bus-based SMP with write-through caches is SC
Snoopy protocols
Just extend the cache controller with snoop logic and exploit the bus
Possible states of a cache line: Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O); not every processor supports all five states
E state is equivalent to M in the sense that the line has permission to write, but in E state the line is not yet modified and the copy in memory is the same as in the cache; if someone else requests the line, the memory will provide the line
O state is exactly the same as E state except that memory is not responsible for servicing requests to the line; the owner must supply the line (just as in M state)
Stores really read the memory (as opposed to write)
Stores
Look at stores a little more closely
There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S state, or the line is in the cache in one of the M, E and O states
If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state
If the line is in S or O state, the processor only has read permission for that line; the store generates an upgrade request on the bus and the upgrade acknowledgment gives it the write permission (this is a data-less transaction)
If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses)
Invalidation or update: which is better? Difficult to answer
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both, or maybe some hybrid based on the dynamic sharing pattern of a cache line
MSI protocol
Forms the foundation of invalidation-based writeback protocols
Assumes only three supported cache line states: I, S, and M
There may be multiple processors caching a line in S state
At most one processor can cache a line in M state, and it is then the owner of the line
If none of the caches has the line, memory must have the most up-to-date copy of the line
Processor requests to cache: PrRd, PrWr
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB
State transition
I -- PrRd / BusRd                              --> S
I -- PrWr / BusRdX                             --> M
S -- PrRd / -,  BusRd / -                      --> S
S -- PrWr / BusUpgr                            --> M
S -- {BusRdX, BusUpgr} / -,  CacheEvict / -    --> I
M -- PrRd / -,  PrWr / -                       --> M
M -- BusRd / Flush                             --> S
M -- BusRdX / Flush,  CacheEvict / BusWB       --> I
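The same per-line state machine can be written as a small C transition function. This is a sketch only; the enum and action names are illustrative and do not come from any particular controller design.

typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { E_PR_RD, E_PR_WR,                 /* from the local processor */
               E_BUS_RD, E_BUS_RDX, E_BUS_UPGR,  /* snooped from the bus */
               E_CACHE_EVICT } msi_event_t;
typedef enum { A_NONE, A_BUS_RD, A_BUS_RDX, A_BUS_UPGR,
               A_FLUSH, A_WRITE_BACK } msi_action_t;

/* Returns the next state of the line; *act is the transaction/action generated. */
msi_state_t msi_next(msi_state_t s, msi_event_t e, msi_action_t *act) {
    *act = A_NONE;
    switch (s) {
    case MSI_I:
        if (e == E_PR_RD) { *act = A_BUS_RD;  return MSI_S; }
        if (e == E_PR_WR) { *act = A_BUS_RDX; return MSI_M; }
        return MSI_I;                               /* snooped events: nothing to do */
    case MSI_S:
        if (e == E_PR_WR) { *act = A_BUS_UPGR; return MSI_M; }
        if (e == E_BUS_RDX || e == E_BUS_UPGR || e == E_CACHE_EVICT) return MSI_I;
        return MSI_S;                               /* PrRd hit and BusRd: no action */
    case MSI_M:
        if (e == E_BUS_RD)      { *act = A_FLUSH;      return MSI_S; }
        if (e == E_BUS_RDX)     { *act = A_FLUSH;      return MSI_I; }
        if (e == E_CACHE_EVICT) { *act = A_WRITE_BACK; return MSI_I; }
        return MSI_M;                               /* PrRd and PrWr hit */
    }
    return s;
}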
MSI protocol
Few things to note
The Flush operation essentially launches the line on the bus
The processor with the cache line in M state is responsible for flushing the line on the bus whenever there is a BusRd or BusRdX transaction generated by some other processor
On BusRd the line transitions from M to S, but not M to I. Why?
Also, at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back?
On BusRdX the line transitions from M to I and this time memory does not need to pick up the line from the bus; only the requester picks up the line and puts it in M state in its cache. Why?
M to S, or M to I?
The assumption here is that the processor will read the line soon, so going to S saves a cache miss
This may not be good if the sharing pattern is migratory: P0 reads and writes cache line A, then P1 reads and writes cache line A, then P2, and so on
For migratory patterns it makes sense to go to the I state so that a future invalidation is saved
But for bus-based SMPs it does not matter much, because an upgrade transaction will be launched anyway by the next writer, unless there is special hardware support to avoid that: how?
The big problem is that the sharing pattern for a cache line may change dynamically: adaptive protocols are good and are supported by Sequent Symmetry and MIT Alewife
P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors
P0 generates BusRd, memory provides the line, P0 puts the line in S state
P1 generates BusRd, memory provides the line, P1 puts the line in S state
P1 generates BusUpgr, P0 snoops and invalidates the line, memory does not respond, P1 sets the state of the line to M
P0 generates BusRd, P1 flushes the line and goes to S state, P0 puts the line in S state, memory writes back
P2 generates BusRd, memory provides the line, P2 puts the line in S state
P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides the line, P3 puts the line in its cache in M state
MESI protocol
The most popular invalidation-based protocol; appears, e.g., in Intel Xeon MP
Why is the E state needed?
The MSI protocol requires two transactions to go from I to M even if there is no intervening request for the line: a BusRd followed by a BusUpgr
We can save one transaction by having the memory controller respond to the first BusRd with the E state if there is no other sharer in the system
How to know if there is no other sharer? Needs a dedicated control wire that gets asserted by a sharer (wired OR)
A processor can write to a line in E state silently and take it to M state
State transition
I -- PrRd / BusRd (shared line asserted, S)      --> S
I -- PrRd / BusRd (shared line not asserted, !S) --> E
I -- PrWr / BusRdX                               --> M
S -- PrRd / -,  BusRd / -                        --> S
S -- PrWr / BusUpgr                              --> M
S -- {BusRdX, BusUpgr} / Flush,  CacheEvict / -  --> I
E -- PrRd / -                                    --> E
E -- PrWr / - (silent)                           --> M
E -- BusRd / Flush                               --> S
E -- BusRdX / Flush,  CacheEvict / -             --> I
M -- PrRd / -,  PrWr / -                         --> M
M -- BusRd / Flush                               --> S
M -- BusRdX / Flush,  CacheEvict / BusWB         --> I
MESI protocol
If a cache line is in M state definitely the processor with the line is responsible for flushing it on the next BusRd or BusRdX transaction If a line is not in M state who is responsible?
Memory or other caches in S or E state? Original Illinois MESI protocol assumed cache-to-cache transfer i.e. any processor in E or S state is responsible for flushing the line However, it requires some expensive hardware, namely, if multiple processors are caching the line in S state who flushes it? Also, memory needs to wait to know if it should source the line Without cache-to-cache sharing memory always sources the line unless it is in M state
MESI example
P0 reads x, P0 writes x, P1 reads x, P1 writes x
P0 generates BusRd, memory provides the line, P0 puts the line in its cache in E state
P0 does the write silently, goes to M state
P1 generates BusRd, P0 provides the line, P1 puts the line in its cache in S state, P0 transitions to S state
The rest is identical to MSI
Consider this example: P0 reads x, P1 reads x
P0 generates BusRd, memory provides the line, P0 puts the line in its cache in E state
P1 generates BusRd, memory provides the line, P1 puts the line in its cache in S state, P0 transitions to S state (no cache-to-cache sharing)
The rest is the same as MSI
MOESI protocol
Some SMPs implement MOESI today, e.g., AMD Athlon MP and the IBM servers
Why is the O state needed?
O state is very similar to E state with four differences: 1. If a cache line is in O state in some cache, that cache is responsible for sourcing the line to the next requester; 2. The memory may not have the most up-to-date copy of the line (this implies 1); 3. Eviction of a line in O state generates a BusWB; 4. A write to a line in O state must generate a bus transaction
When a line transitions from M to S it is necessary to write the line back to memory
For a migratory sharing pattern (frequent in database workloads) this leads to a series of writebacks to memory
These writebacks just keep the memory banks busy and consume memory bandwidth
MOESI protocol
Take the following example
P0 reads x, P0 writes x, P1 reads x, P1 writes x, P2 reads x, P2 writes x, ...
Thus at the time of each BusRd response the memory will write the line back: one writeback per processor handover
The O state aims at eliminating all these writebacks by transitioning from M to O instead of M to S on a BusRd/Flush
Subsequent BusRd requests are replied to by the owner holding the line in O state
The line is written back only when the owner evicts it: one single writeback
MOESI protocol
State transitions pertaining to O state
I to O: not possible (or maybe; next slide)
E to O or S to O: not possible
M to O: on a BusRd/Flush (but no memory writeback)
O to I: on CacheEvict/BusWB or {BusRdX, BusUpgr}/Flush
O to S: not possible (or maybe; next slide)
O to E: not possible (or maybe, if silent eviction is not allowed)
O to M: on PrWr/BusUpgr
At most one cache can have a line in O state at any point in time
MOESI protocol
Two main design choices for MOESI
Consider the example: P0 reads x, P0 writes x, P1 reads x, P2 reads x, P3 reads x, ...
When P1 launches BusRd, P0 sources the line and the protocol now has two options: 1. The line in P0 goes to O and the line in P1 is filled in state S; 2. The line in P0 goes to S and the line in P1 is filled in state O, i.e., P1 inherits ownership from P0
For bus-based SMPs the two choices will yield roughly the same performance
For DSM multiprocessors we will revisit this issue if time permits
According to the second choice, when P2 generates a BusRd request, P1 sources the line and transitions from O to S; P2 becomes the new owner
MOSI protocol
Some SMPs do not support the E state
In many cases it is not helpful and only complicates the protocol
MOSI allows a compact state encoding in 2 bits
Sun WildFire uses the MOSI protocol
Synchronization
Types
Mutual exclusion
Synchronize entry into critical sections
Normally done with locks
Point-to-point synchronization
Tell a set of processors (normally the set cardinality is one) that they can proceed
Normally done with flags
Global synchronization
Bring every processor to sync: wait at a point until everyone is there
Normally done with barriers
Synchronization
Normally a two-part process: acquire and release; acquire can be broken into two parts: intent and wait
Intent: express intent to synchronize (i.e., contend for the lock, arrive at a barrier)
Wait: wait for your turn to synchronize (i.e., wait until you get the lock)
Release: proceed past synchronization and enable other contenders to synchronize
Waiting algorithms
Busy wait
Waiting processes repeatedly poll a location (implemented as a load in a loop)
The releasing process sets the location appropriately
May cause network or bus transactions
Block
Waiting processes are de-scheduled, which frees up processor cycles for doing something else
De-scheduling and re-scheduling take longer than busy waiting
Blocking gains little if there is no other active process to run; busy waiting, on the other hand, does not work for a single processor (the releasing process cannot run while the waiter spins)
Hybrid policies: busy wait for some time and then block
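As a rough illustration of such a hybrid policy, here is a hedged sketch using POSIX threads and C11 atomics; SPIN_LIMIT, the flag and the function names are made up for the example and do not correspond to any standard API beyond the calls shown.

#include <pthread.h>
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 1000                 /* illustrative spin budget */

atomic_int ready = 0;
pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void hybrid_wait(void) {
    for (int i = 0; i < SPIN_LIMIT; i++) {     /* busy-wait phase */
        if (atomic_load(&ready)) return;
        sched_yield();
    }
    pthread_mutex_lock(&m);                    /* blocking phase */
    while (!atomic_load(&ready))
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
}

void release_waiters(void) {
    pthread_mutex_lock(&m);
    atomic_store(&ready, 1);
    pthread_cond_broadcast(&cv);               /* wake everyone who blocked */
    pthread_mutex_unlock(&m);
}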
Implementation
Popular trend
Architects offer some simple atomic primitives
Library writers use these primitives to implement synchronization algorithms
Normally, hardware primitives for acquire (and possibly release) are provided
It is hard to offer hardware solutions for waiting; also, hardwired waiting may not offer much flexibility
Hardwired locks
Not popular today
Less flexible Cannot support large number of locks
Possible designs
Dedicated lock line in the bus, so that the lock holder keeps it asserted and waiters snoop the lock line in hardware
Set of lock registers shared among processors; the lock holder gets a lock register (Cray X-MP)
Bakery algorithm
Shared: choosing[P] = FALSE, ticket[P] = 0;
Acquire: choosing[i] = TRUE;
         ticket[i] = max(ticket[0], ..., ticket[P-1]) + 1;
         choosing[i] = FALSE;
         for j = 0 to P-1
           while (choosing[j]);
           while (ticket[j] && ((ticket[j], j) < (ticket[i], i)));
         endfor
Release: ticket[i] = 0;
Software locks
Hardware support
Assembly translation
Lock:    lw    register, lock_addr    /* register is any processor register */
         bnez  register, Lock
         addi  register, register, 0x1
         sw    register, lock_addr
Unlock:  xor   register, register, register
         sw    register, lock_addr
Does it work?
What went wrong? We wanted the read-modify-write sequence to be atomic
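The failure is easy to reproduce in software. The C sketch below re-creates the same non-atomic test-then-set sequence with two threads; it is intentionally racy, so on some runs both threads enter the critical section together (the thread and variable names are illustrative, and the run-to-run behavior is of course not deterministic).

#include <pthread.h>
#include <stdio.h>

volatile int lock_var = 0;     /* the "lock" word, as in the assembly above */
volatile int inside   = 0;     /* how many threads think they hold the lock */
volatile int violation = 0;

void *contender(void *arg) {
    (void)arg;
    while (lock_var != 0) ;    /* "lw + bnez": wait until the lock looks free */
    lock_var = 1;              /* "addi + sw": claim it -- NOT atomic with the test */
    if (++inside > 1)          /* both threads may reach here together */
        violation = 1;
    --inside;
    lock_var = 0;              /* unlock */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, contender, NULL);
    pthread_create(&t2, NULL, contender, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("mutual exclusion %s\n", violation ? "violated" : "held (on this run)");
    return 0;
}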
Atomic exchange
         addi  register, r0, 0x1
Lock:    xchg  register, lock_addr
         bnez  register, Lock
Unlock remains unchanged
Loads the current lock value into a register and always sets the location to 1
Exchange, more generally, allows swapping in any value
Possible to implement a lock with fetch & clear then add (used to be supported in BBN Butterfly 1)
Lock:    addi  reg1, r0, 0x1
         fetch & clr then add  reg1, reg2, lock_addr    /* fetch in reg2, clear, add reg1 */
         bnez  reg2, Lock
Fetch & op
Butterfly 1 also supports fetch & clear then xor
Sequent Symmetry supports fetch & store
More sophisticated: compare & swap
Takes three operands: reg1, reg2, and a memory address
Compares the value in reg1 with the value at the address and, if they are equal, swaps the contents of reg2 and the address
Not in line with the RISC philosophy (same goes for fetch & add)
In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported
Let us assume that the lock location is cacheable and is kept coherent
Every invocation of test & set must generate a bus transaction. Why? What is the transaction? What are the possible states of the cache line holding lock_addr?
Therefore all lock contenders repeatedly generate bus transactions even if someone is still in the critical section and is holding the lock
Test & set with backoff
How long to wait? Waiting for too long may lead to long latency and lost opportunity
Constant and variable backoff
Special kind of variable backoff: exponential backoff (after the i-th attempt the delay is k*c^i, where k and c are constants)
Test & set with exponential backoff works pretty well
         delay = k
Lock:    ts    register, lock_addr
         bez   register, Enter_CS
         pause (delay)             /* can be simulated as a timed loop */
         delay = delay * c
         j     Lock
Enter_CS:
Recall that unlock is always a simple store
In the worst case everyone will try to enter the CS at the same time
The first time, there are P transactions for ts and one succeeds; every other processor suffers a miss on the load in the Test loop and then loops from its cache
The lock-holder, when unlocking, generates an upgrade (why?) and invalidates all others
All other processors suffer a read miss and now get the value zero; so they break out of the Test loop and try ts, and the process continues until everyone has visited the CS
(P + (P-1) + 1 + (P-1)) + ((P-1) + (P-2) + 1 + (P-2)) + ... = (3P-1) + (3P-4) + (3P-7) + ... ~ 1.5P^2 asymptotically
For distributed shared memory the situation is worse because each invalidation becomes a separate message (more later)
Ticket lock
Similar to the Bakery algorithm but simpler
A nice application of fetch & inc
Basic idea is to come and hold a unique ticket and wait until your turn comes
Shared: ticket = 0, release_count = 0;
Lock:    fetch & inc  reg1, ticket_addr
Wait:    lw    reg2, release_count_addr
         sub   reg3, reg2, reg1
         bnez  reg3, Wait
Unlock:  addi  reg2, reg2, 0x1        /* release_count++ */
         sw    reg2, release_count_addr
Ticket lock
The initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM, depending on the implementation of fetch & inc)
But the waiting algorithm still suffers from 0.5P^2 messages asymptotically
Researchers have proposed proportional backoff i.e. in the wait loop put a delay proportional to the difference between ticket value and last read release_count
Latency and storage-wise better than Bakery Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery) Guaranteed fairness: the ticket value induces a FIFO queue
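For reference, the same ticket lock is easy to express with C11 atomics; the sketch below mirrors the assembly above but is not a literal translation (the structure and function names are made up). A proportional backoff, as suggested above, could be added inside the wait loop based on (my_ticket - release_count).

#include <stdatomic.h>

typedef struct {
    atomic_uint ticket;          /* next ticket to hand out */
    atomic_uint release_count;   /* tickets already served  */
} ticket_lock_t;

void ticket_lock_acquire(ticket_lock_t *l) {
    unsigned my_ticket = atomic_fetch_add(&l->ticket, 1);   /* fetch & inc */
    while (atomic_load(&l->release_count) != my_ticket)
        ;                                                   /* wait for my turn */
}

void ticket_lock_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->release_count, 1);                 /* release_count++ */
}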
Array-based lock
Solves the O(P^2) traffic problem
The idea is to have a bit vector (essentially a character array if a boolean type is not supported)
Each processor comes and takes the next free index into the array via fetch & inc
Then each processor loops on its index location until it becomes set
On unlock, a processor is responsible for setting the next index location if someone is waiting
The initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic
Disadvantage: the storage overhead is O(P)
Array-based lock
Performance concerns
Avoid false sharing: allocate each array location on a different cache line
Assume a cache line size of 128 bytes and a character array: allocate an array of size 128P bytes and use every 128th position in the array
For distributed shared memory, the location a processor loops on may not be in its local memory: on acquire it must take a remote miss; allocate P pages and let each processor loop on one bit in a page? Too much wastage; better solution: MCS lock (Mellor-Crummey & Scott)
Correctness concerns
Make sure to handle corner cases such as determining if someone is waiting on the next location (this must be an atomic operation) while unlocking
Remember to reset your index location to zero while unlocking
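The sketch below shows one way the array-based (queue) lock could look in C11 atomics, with each slot padded to its own cache line as suggested above. MAX_PROCS, the 128-byte line size, and the unconditional "pass to the next slot" release (rather than checking whether anyone is waiting) are assumptions of this sketch, not part of the slide.

#include <stdatomic.h>

#define MAX_PROCS 64
#define LINE      128                               /* assumed cache-line size */

typedef struct {
    struct { atomic_int must_wait; char pad[LINE - sizeof(atomic_int)]; }
        slot[MAX_PROCS];                            /* one padded flag per contender */
    atomic_uint next;                               /* next free slot index (fetch & inc) */
} array_lock_t;

void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_store(&l->slot[i].must_wait, i != 0);   /* slot 0 starts as the lock holder */
    atomic_store(&l->next, 0);
}

unsigned array_lock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % MAX_PROCS;
    while (atomic_load(&l->slot[me].must_wait))
        ;                                           /* spin only on my own line */
    return me;                                      /* caller remembers its slot */
}

void array_lock_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[me].must_wait, 1);                      /* reset my slot */
    atomic_store(&l->slot[(me + 1) % MAX_PROCS].must_wait, 0);    /* pass the lock on */
}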
RISC processors
All these atomic instructions deviate from the RISC line
Instruction needs a load as well as a store
Also, it would be great if we can offer a few simple instructions with which we can build most of the atomic primitives
Note that it is impossible to build atomic fetch & inc with xchg instruction
LL/SC
Load linked (LL) behaves just like a normal load, with some extra tricks
Puts the loaded value in the destination register as usual
Sets a load_linked bit residing in the cache controller to 1
Puts the address in a special lock_address register residing in the cache controller
sc reg, addr stores the value in reg to addr only if the load_linked bit is set; it also copies the value of the load_linked bit into reg and resets the load_linked bit
Any intervening operation (e.g., bus transaction or cache replacement) to the cache line containing the address in lock_address register clears the load_linked bit so that subsequent sc fails
Compare & swap: Compare with r1, swap r2 and memory location (here we keep on trying until comparison passes)
Try:     LL    r3, addr
         sub   r4, r3, r1
         bnez  r4, Try
         add   r4, r2, r0
         SC    r4, addr
         beqz  r4, Try
         add   r2, r3, r0
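C has no direct LL/SC, but the same retry structure can be expressed with the C11 compare-and-swap primitive; the sketch below is an analogue of the loop above (keep trying until the comparison passes), not a translation of the MIPS code, and the function name is made up.

#include <stdatomic.h>

/* Swap new_val into *addr once *addr equals expected; return the old value. */
int cas_until_equal(atomic_int *addr, int expected, int new_val) {
    int seen = expected;
    while (!atomic_compare_exchange_strong(addr, &seen, new_val))
        seen = expected;      /* comparison failed: restore the expected value and retry */
    return seen;              /* on success this equals expected */
}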
Point-to-point synch.
Normally done in software with flags
P0: A = 1; flag = 1; P1: while (!flag); print A;
Each memory location is augmented with a full/empty bit
The producer writes the location only if the bit is reset
The consumer reads the location only if the bit is set, and then resets it
A lot less flexible: one producer-one consumer sharing only (one producer-many consumers is very popular); all accesses to a memory location become synchronized (unless the compiler flags some accesses as special)
Allocate the flag and the data structures (if small) guarded by the flag in the same cache line, e.g., flag and A in the above example
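A C11 sketch of the flag idiom above, written with explicit release/acquire ordering so that a consumer which observes flag == 1 is also guaranteed to observe A == 1 (the variable names follow the example; the atomics usage is an assumption of this sketch, not something the slide prescribes):

#include <stdatomic.h>
#include <stdio.h>

int A = 0;                      /* ordinary data */
atomic_int flag = 0;            /* synchronization flag */

void producer(void) {           /* P0 */
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {           /* P1 */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                       /* spin until the flag is set */
    printf("%d\n", A);          /* prints 1 */
}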
Centralized barrier
struct bar_type {
  int counter;
  struct lock_type lock;
  int flag = 0;
} bar_name;

BARINIT (bar_name) {
  LOCKINIT(bar_name.lock);
  bar_name.counter = 0;
}

BARRIER (bar_name, P) {
  int my_count;
  LOCK (bar_name.lock);
  if (!bar_name.counter) {
    bar_name.flag = 0;          /* first one resets the flag */
  }
  my_count = ++bar_name.counter;
  UNLOCK (bar_name.lock);
  if (my_count == P) {
    bar_name.counter = 0;
    bar_name.flag = 1;          /* last one sets the flag */
  } else {
    while (!bar_name.flag);     /* everyone else waits */
  }
}
Sense reversal
The last implementation fails to work for two consecutive barrier invocations
Need to prevent a process from entering a barrier instance until all have left the previous instance
Reverse the sense of the barrier, i.e., every other barrier will have the same sense: basically attach a parity or sense to the barrier

BARRIER (bar_name, P) {
  local_sense = !local_sense;   /* this is private per processor */
  LOCK (bar_name.lock);
  bar_name.counter++;
  if (bar_name.counter == P) {
    UNLOCK (bar_name.lock);
    bar_name.counter = 0;
    bar_name.flag = local_sense;
  } else {
    UNLOCK (bar_name.lock);
    while (bar_name.flag != local_sense);
  }
}
Centralized barrier
Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time Latency is proportional to P due to the critical section (assume that the lock algorithm exhibits at most O(P) latency) The amount of traffic of acquire section (the CS) depends on the lock algorithm; after everyone has settled in the waiting loop the last processor will generate a BusRdX during release (flag write) and others will subsequently generate BusRd before releasing: O(P) Scalability turns out to be low partly due to the critical section and partly due to O(P) traffic of release No fairness in terms of who exits first
Tree barrier
Arrange the processors logically in a binary tree (higher degree is also possible)
Two siblings tell each other of arrival via simple flags (i.e., one waits on a flag while the other sets it on arrival)
One of them moves up the tree to participate in the next level of the barrier
Introduces concurrency in the barrier algorithm, since independent subtrees can proceed in parallel
Takes log(P) steps to complete the acquire
A fixed processor starts a downward pass of release, waking up other processors that in turn set other flags
Shows much better scalability compared to centralized barriers in DSM multiprocessors; the advantage in small bus-based systems is not much, since all transactions are anyway serialized on the bus; in fact the additional log(P) delay may hurt performance in bus-based SMPs
for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
  while (!flag[pid][i]);
  flag[pid][i] = 0;
}
if (pid < (P - 1)) {
  flag[pid + mask][i] = 1;
  while (!flag[pid][MAX - 1]);
  flag[pid][MAX - 1] = 0;
}
for (mask >>= 1; mask > 0; mask >>= 1) {
  flag[pid - mask][MAX - 1] = 1;
}

Convince yourself that this works: take 8 processors and arrange them on the leaves of a tree of depth 3
You will find that only odd nodes move up at every level during acquire (implemented in the first for loop)
The even nodes just set the flags (the first statement in the if condition): they bail out of the first loop with mask=1
The release is initiated by the last processor in the last for loop; only odd nodes execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)
Tree barrier
Each processor will need at most log(P) + 1 flags
Avoid false sharing: allocate each processor's flags on a separate chunk of cache lines
With some memory wastage (possibly worth it), allocate each processor's flags on a separate page and map that page locally in that processor's physical memory
This avoids remote misses in a DSM multiprocessor; it does not matter in bus-based SMPs
Memory consistency
Coherence protocol is not enough to completely specify the output(s) of a parallel program
The coherence protocol only provides the foundation to reason about legal outcomes of accesses to the same memory location
The consistency model tells us the possible outcomes arising from legal orderings of accesses to all memory locations
A shared memory machine advertises the supported consistency model; it is a contract with the writers of parallel software and the writers of parallelizing compilers
Implementing a memory consistency model is really a hardware-software tradeoff: a strict sequential model (SC) offers execution that is intuitive, but may suffer in terms of performance; relaxed models (RC) make program reasoning difficult, but may offer better performance
SC
Recall that an execution is SC if the memory operations form a valid total order i.e. it is an interleaving of the partial program orders
The sufficient conditions require that a new memory operation cannot issue until the previous one has completed
This is too restrictive and essentially disallows compiler as well as hardware re-ordering of instructions
No microprocessor that supports SC implements the sufficient conditions
Instead, all out-of-order execution is allowed, and a proper recovery mechanism is implemented in case of a memory order violation
Let's discuss the MIPS R10000 implementation
SC in MIPS R10000
The problem is with speculatively executed loads: a load may execute and use a value long before it finally commits
In the meantime, some other processor may modify that value through a store, and the store may commit (i.e., become globally visible) before the load commits: this may violate SC (why?)
How do you detect such a violation? How do you recover and guarantee an SC execution?
Any special consideration for prefetches? Binding and non-binding prefetches
SC in MIPS R10000
In the MIPS R10000 a store remains at the head of the active list until it has completed in the cache
Can we just remove it as soon as it issues and let the other instructions commit (the store can complete from the store buffer at a later point)? How far can we go and still guarantee SC?
The Stanford DASH multiprocessor, on receiving a read reply that is already invalidated, forces the processor to retry that load
Why can't it use the value in the cache line and then discard the line?
Does the cache controller need to take any special action when a line is replaced from the cache?
Relaxed models
Is there an example that clearly shows the disaster of not implementing all these? Observe that the cache coherence protocol is orthogonal. But such violations are rare
Does it make sense to invest so much time (for verification) and hardware (associative lookup logic in the load queue)?
Many processors today relax the consistency model to get rid of complex hardware and achieve some extra performance at the cost of making program reasoning complex
P0: A=1; B=1; flag=1;    P1: while (!flag); print A; print B;
SC is too restrictive; relaxing it does not always violate the programmer's intuition
Case Studies
Today microprocessor designers can afford to have a lot of transistors on the die
Ever-shrinking feature size leads to denser packing
What would you do with so many transistors? Some can be invested in caches, but beyond a certain point it doesn't help
A natural choice was to think about a greater level of integration
A few chip designers decided to bring the memory and coherence controllers, along with the router, onto the die
The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect
Why CMP?
Moore's law
Exponential growth in available transistor count
If transistor utilization were constant, this would lead to exponential performance growth; but life is slightly more complicated
Wires don't scale with transistor technology: wire delay becomes the bottleneck
Short wires are good: this dictates localized logic design
But superscalar processors exercise centralized control requiring long wires (or pipelined long wires)
However, to utilize the transistors well, we need to overcome the memory wall problem
To hide memory latency we need to extract more independent instructions, i.e., more ILP
Power consumption?
Hey, didnt I just make my power consumption roughly N-fold by putting N cores on the die?
Yes, if you do not scale down voltage or frequency
Usually CMPs are clocked at a lower frequency: oops! my games run slower!
Voltage scaling happens due to smaller process technology
Overall, roughly cubic dependence of power on voltage or frequency
Need to talk about different metrics: Performance/Watt (same as the reciprocal of energy), or more generally Performance^(k+1)/Watt (k > 0)
Need smarter techniques to further improve these metrics: online voltage/frequency scaling
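The cubic dependence quoted above follows from the usual first-order CMOS dynamic-power model (a standard approximation, not something derived on the slide):

P_{\text{dyn}} \approx \alpha C V^2 f, \qquad f \propto V \;\Rightarrow\; P_{\text{dyn}} \propto f^3

So running a core at roughly half the frequency, with correspondingly lower voltage, cuts its dynamic power by about 8x; several such slower cores can fit in the power budget of one full-speed core, which is why performance/Watt tends to favor lower-clocked CMPs.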
Do not want to access the interconnect too frequently, because these wires are slow
It probably does not make much sense to have the L1 cache shared among the cores: that requires very high bandwidth and may necessitate a redesign of the L1 cache and the surrounding load/store unit, which we do not want to do; so settle for private L1 caches, one per core
It makes more sense to share the L2 or L3 caches
Need a coherence protocol at the L2 interface to keep the private L1 caches coherent: may use a high-speed custom-designed snoopy bus connecting the L1 controllers, or may use a simple directory protocol
An entirely different design choice is not to share the cache hierarchy at all (dual-core AMD and Intel): this rids you of the on-chip coherence protocol, but there is no gain in communication latency
IBM POWER4
Dual-core chip multiprocessor
POWER4 caches
L1 icache: 64 KB / direct mapped / 128-byte lines
L1 dcache: 32 KB / 2-way associative / 128-byte lines / LRU
No M state in the L1 data cache (write through)
L2 cache: 1.5 MB / 8-way associative / 128-byte lines / pseudo-LRU
For on-chip coherence, the L2 tag is augmented with a two-bit sharer vector; used to invalidate L1 on another core's write
Three L2 controllers, and each L2 controller has four local coherence units; each L2 controller handles roughly 512 KB of data divided into four SRAM partitions
For off-chip coherence, each L2 controller has four snoop engines; executes an enhanced MESI with seven states
POWER4 L2 cache
POWER4 L3 cache
IBM POWER5
Each core of the dual-core chip is 2-way SMT: 24% area growth per core
More than two threads would not only add complexity, but may not provide extra performance benefit; in fact, performance may degrade because of resource contention and cache thrashing unless all shared resources are scaled up accordingly (hits a complexity wall)
The L3 cache is moved to the processor side so that the L2 cache can talk to it directly: this reduces bandwidth demand on the interconnect (L3 hits at least do not go on the bus)
This change enabled POWER5 designers to scale to 64-processor systems (i.e., 32 chips with a total of 128 threads)
Bigger L2 and L3 caches: 1.875 MB L2, 36 MB L3
On-chip memory controller
IBM POWER5
Added SMT facility
Like the Pentium 4, fetches from each thread in alternate cycles (8-instruction fetch per cycle, just like POWER4)
Threads share the ITLB and the icache
Increased the size of the register file compared to POWER4 to support two threads: 120 integer and 120 floating-point registers (POWER4 has 80 integer and 72 floating-point registers); this improves single-thread performance compared to POWER4; the smaller technology (0.13 µm) made it possible to access a bigger register file in the same or shorter time, leading to the same pipeline as POWER4
Doubled the associativity of the L1 caches to reduce conflict misses: the icache is 2-way and the dcache is 4-way
With SMT and CMP the average amount of switching per cycle increases, leading to more power consumption
Need to reduce power consumption without losing performance: a simple solution is to clock the chip at a lower frequency, but that hurts performance
POWER5 employs fine-grain clock gating: in every cycle the power management logic decides if a certain latch will be used in the next cycle; if not, it disables or gates the clock for that latch so that it will not unnecessarily switch in the next cycle
The clock-gating and power-management logic themselves should be very simple
If both threads are running at priority level 1, the processor switches to a low-power mode where it dispatches instructions at a much slower pace
Intel Montecito
Overview
Dual-core Itanium 2, each core dual-threaded
1.7 billion transistors, 21.5 mm x 27.7 mm die
27 MB of on-chip cache in three levels, not shared among the cores
Features: Foxton technology for power efficiency
Foxton technology
Blind replication of Itanium 2 cores at 90 nm would lead to roughly 300 W peak power consumption (Itanium 2 consumes 130 W peak at 130 nm)
When power consumption is below the ceiling, the voltage is increased, leading to higher frequency and performance: a 10% boost for enterprise applications
Software or the OS can also dictate a frequency change if power saving is required; 100 ms response time for the feedback loop
Frequency control is achieved by 24 voltage sensors distributed across the chip; the entire chip runs at a single frequency (other than the asynchronous L3)
Clock gating found limited application in Montecito
Foxton technology
An embedded microcontroller runs a real-time scheduler to execute various tasks (figure reproduced from IEEE Micro)
Die photo
Features
Pipeline details
Cache hierarchy
L1 instruction cache
16 KB / 4-way / 32-byte lines / random replacement
Fetches two instructions every cycle
If both instructions are useful, the next cycle is free for icache refill
L1 data cache
8 KB / 4-way / 16-byte lines / write-through, no-allocate
On average, a 10% miss rate for the target benchmarks
The L2 cache extends the tag to maintain a directory for keeping the core L1s coherent